PointTriPE: Triadic Positional Encoding for Point Clouds with Multi-Scale, Local, and Relative Embeddings

Overview

Transformer architectures have been extensively validated for modeling structural relationships in natural language and vision tasks, yet positional encoding in 3D point cloud learning is often overlooked or oversimplified. In this paper, we systematically review and analyze existing point cloud positional encoding strategies, revealing their limitations in scale awareness and in capturing local geometry. Motivated by the observation that no single encoding can simultaneously capture global, intermediate, and local spatial dependencies, we propose a triadic positional encoding framework comprising: (1) Multi-Scale Positional Encoding, which leverages coarse-to-fine hierarchical attention combined with delta-coordinate aggregation to capture global and intermediate contextual cues; (2) Local Geometric Encoding, which extracts fine-level local structural patterns; and (3) Relative Positional Encoding, which generates enriched relative position embeddings by applying expert-level gating and soft assignment to each point pair's relative coordinates and aggregating the corresponding codebook vectors via a weighted combination. We integrate these three encoding modules into a hierarchical point cloud backbone and evaluate our approach on multiple challenging 3D benchmarks, including ModelNet40, ScanObjectNN, ShapeNetPart, S3DIS, and ScanNet v2. Experimental results demonstrate consistent and significant improvements in semantic segmentation, part segmentation, and object classification, validating the critical role of carefully designed positional encodings for 3D perception.

Method Overview

Overview of our proposed network architecture. The left side illustrates a U-Net–style point transformer framework used for semantic segmentation and classification. The right side shows the core attention-based feature extraction module integrating our triadic positional encoding.

Illustration of the proposed triadic positional encoding modules. From left to right: Multi-Scale Positional Encoding, Local Geometric Encoding, and Relative Positional Encoding.

PointTriPE Modules

Below we summarize the three complementary encoding strategies at the core of the proposed PointTriPE framework; illustrative sketches of each module follow the list. The full codebase will be released upon paper acceptance.

  • MSPE (Multi-Scale Positional Encoding) captures global context across hierarchical geometric scales. It applies MLPs and multihead attention over a point cloud pyramid, integrating cross-scale features into a positional prior that enriches long-range dependencies efficiently.
  • LGE (Local Geometric Encoding) enhances sensitivity to fine-grained structures such as edges and curvature. It encodes relative positions to neighboring points using a shared MLP, followed by max pooling, enabling local geometry-aware embeddings.
  • RPE (Relative Positional Encoding) bridges mid-scale spatial relations via a Mixture-of-Experts (MoE) mechanism. It models pairwise offsets with directional codebooks and dynamic gating, enriching attention weights with asymmetric and nonlinear geometric priors.

Each module plays a distinct role in PointTriPE's positional encoding hierarchy—global, local, and relational—working together to enable expressive and efficient 3D representation learning.
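To make the MSPE idea concrete, here is a minimal PyTorch sketch: a two-level coordinate pyramid (built here by random subsampling; farthest point sampling would be used in practice), per-scale MLP embeddings, and multihead cross-attention from the fine points to the coarse scale. Class and argument names (MultiScalePE, coarse_ratio, dim) are illustrative assumptions rather than the released implementation, and the delta-coordinate aggregation is abstracted away.

```python
# Sketch of Multi-Scale Positional Encoding (MSPE); names are illustrative.
import torch
import torch.nn as nn


class MultiScalePE(nn.Module):
    def __init__(self, dim=64, num_heads=4, coarse_ratio=0.25):
        super().__init__()
        self.coarse_ratio = coarse_ratio
        # Per-scale coordinate embeddings.
        self.fine_mlp = nn.Sequential(nn.Linear(3, dim), nn.ReLU(inplace=True), nn.Linear(dim, dim))
        self.coarse_mlp = nn.Sequential(nn.Linear(3, dim), nn.ReLU(inplace=True), nn.Linear(dim, dim))
        # Cross-scale multihead attention: fine points query the coarse scale.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.out = nn.Linear(2 * dim, dim)

    def forward(self, xyz):
        # xyz: (B, N, 3) point coordinates.
        B, N, _ = xyz.shape
        m = max(1, int(N * self.coarse_ratio))
        # Coarse scale via random subsampling (FPS would be used in practice).
        idx = torch.randperm(N, device=xyz.device)[:m]
        coarse_xyz = xyz[:, idx, :]                      # (B, m, 3)
        fine = self.fine_mlp(xyz)                        # (B, N, dim)
        coarse = self.coarse_mlp(coarse_xyz)             # (B, m, dim)
        # Fine-level queries attend to coarse-level keys/values.
        ctx, _ = self.attn(fine, coarse, coarse)         # (B, N, dim)
        # Fuse per-point and cross-scale context into the positional prior.
        return self.out(torch.cat([fine, ctx], dim=-1))  # (B, N, dim)


if __name__ == "__main__":
    mspe = MultiScalePE()
    print(mspe(torch.rand(2, 1024, 3)).shape)  # torch.Size([2, 1024, 64])
```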
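LGE can be approximated by the following sketch, assuming (B, N, 3) coordinates and a k-nearest-neighbor neighborhood: relative offsets to each neighbor pass through a shared MLP and are max-pooled over the neighborhood. Names such as LocalGeometricEncoding, k, and out_dim are placeholders for illustration.

```python
# Sketch of Local Geometric Encoding (LGE); names are illustrative.
import torch
import torch.nn as nn


class LocalGeometricEncoding(nn.Module):
    def __init__(self, k=16, hidden_dim=32, out_dim=64):
        super().__init__()
        self.k = k
        # Shared point-wise MLP applied to each relative offset (dx, dy, dz).
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, xyz):
        # xyz: (B, N, 3) point coordinates.
        B, N, _ = xyz.shape
        # Pairwise distances -> indices of the k nearest neighbors per point.
        dist = torch.cdist(xyz, xyz)                                  # (B, N, N)
        knn_idx = dist.topk(self.k, dim=-1, largest=False).indices   # (B, N, k)
        # Gather neighbor coordinates and form relative offsets.
        idx = knn_idx.unsqueeze(-1).expand(-1, -1, -1, 3)
        neighbors = torch.gather(
            xyz.unsqueeze(1).expand(-1, N, -1, -1), 2, idx)           # (B, N, k, 3)
        rel = neighbors - xyz.unsqueeze(2)                            # (B, N, k, 3)
        # Shared MLP on each offset, then max pooling over the neighborhood.
        feat = self.mlp(rel)                                          # (B, N, k, C)
        return feat.max(dim=2).values                                 # (B, N, C)


if __name__ == "__main__":
    lge = LocalGeometricEncoding()
    print(lge(torch.rand(2, 128, 3)).shape)  # torch.Size([2, 128, 64])
```

Note that the k-NN set above includes the query point itself (offset zero); excluding it, or replacing the brute-force cdist with a spatial index, is a straightforward refinement.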
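The RPE mixture-of-experts idea can be sketched as a small gating network that softly assigns each pairwise offset to a set of learnable codebook (expert) vectors, whose weighted combination forms the relative position embedding. The sketch below uses an assumed interface (RelativePE, num_experts, dim); the released module may add directional structure to the codebooks and differ in how the embedding is injected.

```python
# Sketch of Relative Positional Encoding (RPE) with soft expert gating;
# names are illustrative.
import torch
import torch.nn as nn


class RelativePE(nn.Module):
    def __init__(self, num_experts=8, dim=64):
        super().__init__()
        # One learnable codebook vector per expert.
        self.codebook = nn.Parameter(torch.randn(num_experts, dim) * 0.02)
        # Gating network: maps a 3D offset to soft expert-assignment weights.
        self.gate = nn.Sequential(
            nn.Linear(3, 32),
            nn.ReLU(inplace=True),
            nn.Linear(32, num_experts),
        )

    def forward(self, rel_xyz):
        # rel_xyz: (..., 3) pairwise offsets, e.g. (B, N, k, 3) for k neighbors.
        weights = torch.softmax(self.gate(rel_xyz), dim=-1)  # (..., num_experts)
        # Weighted combination of codebook vectors -> (..., dim).
        return weights @ self.codebook


if __name__ == "__main__":
    rpe = RelativePE()
    offsets = torch.rand(2, 128, 16, 3) - 0.5
    print(rpe(offsets).shape)  # torch.Size([2, 128, 16, 64])
```

In the full model, such pairwise embeddings would typically be added to the attention logits or value features of each point pair; the exact injection point follows the paper's attention module rather than this sketch.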

Experiments

We evaluate our model on five widely used point cloud benchmarks spanning three representative tasks. For shape classification, we use ModelNet40 and ScanObjectNN; for part segmentation, ShapeNetPart; and for indoor scene semantic segmentation, S3DIS and ScanNet v2.

Classification results on ModelNet40

OA: overall accuracy. mAcc: mean per-class accuracy.

Method OA (%) mAcc (%)
PointNet 89.2 86.0
PointNet++ 93.0 90.7
DGCNN 92.9 90.2
PCT 93.2 -
PTv1 93.7 90.6
CurveNet 93.8 -
PointMLP 94.1 91.3
GBNet 93.8 91.0
PointNeXt 94.0 91.1
PTv2 94.2 91.6
DualMLP 93.7 -
PointTriPE (ours) 94.0 91.8

Part segmentation results (mean IoU, %) on ShapeNetPart

“Ins. mIoU” denotes instance-average mIoU; “Cls. mIoU” denotes class-average mIoU.

Method Ins. mIoU Cls. mIoU
PointNet 83.7 80.4
PointNet++ 85.1 81.9
DGCNN 85.2 82.3
PTv1 86.6 83.7
AGCN 86.9 85.1
RSCNN 86.2 84.0
CurveNet 86.8 84.2
OTMae3D 86.8 85.1
PointJEPA 83.9 85.8
AdaCrossNet - 85.1
PointTriPE (ours) 87.0 85.1

Semantic segmentation results on ScanNet v2.
Mean IoU (%) on validation and test sets.

Method Val mIoU Test mIoU
O-CNN 74.0 72.8
PointNet++ 53.5 33.9
PointConv 61.0 55.6
KPConv 69.2 68.0
Minkowski 72.2 73.4
SFormer 74.3 73.7
BPNet 73.9 74.9
PTv2 75.4 75.2
KPConvX-L 76.2 75.6
PointTriPE (ours) 75.0 74.8

Semantic segmentation results (%) on S3DIS Area 5. Class IoUs of ceiling, floor, wall, and beam are omitted.

Method OA mAcc mIoU column window door table chair sofa bookcase board clutter
PTv1 90.8 76.5 70.4 38.0 63.4 74.3 89.1 82.4 74.3 80.2 76.0 59.3
FPT - 77.6 71.0 53.8 71.2 77.3 81.3 89.4 60.1 72.8 80.4 58.9
PointNeXt 91.0 77.2 71.1 37.7 59.3 74.0 83.1 91.6 77.4 77.2 78.8 60.6
PointMixer - 77.4 71.4 43.8 62.1 78.5 90.6 82.2 73.9 79.8 78.5 59.4
StratifiedFormer 91.5 78.1 72.0 46.1 60.0 76.8 92.6 84.5 77.8 75.2 78.1 64.0
DU-Net - - 72.2 40.0 60.7 82.7 90.8 83.1 78.5 83.5 75.9 64.1
PTv2 91.6 78.0 72.0 34.4 64.7 77.9 93.1 84.4 77.3 86.3 84.5 62.2
PointTriPE (ours) 91.3 78.2 71.6 49.2 59.4 74.1 82.5 92.8 85.0 80.1 73.3 64.1

Normalized attention variation on an airplane sample from ModelNet40. Each subfigure corresponds to one of the 9 selected query points (stars), projected onto the XY plane. Neighboring points are shown as circles, where the size reflects their distance to the query, and color denotes the attention change induced by the positional encoding.