PointTriPE: Triadic Positional Encoding for Point Clouds with Multi-Scale, Local, and Relative Embeddings

Overview

Transformer architectures have been extensively validated for modeling structural relationships in natural language and vision tasks, yet positional encoding in 3D point cloud learning is often overlooked or oversimplified. In this paper, we systematically review and analyze existing point cloud positional encoding strategies, revealing their limitations in scale awareness and in capturing local geometry. Motivated by the observation that no single encoding can simultaneously capture global, intermediate, and local spatial dependencies, we propose a triadic positional encoding framework comprising: (1) Multi-Scale Positional Encoding, which leverages coarse-to-fine hierarchical attention combined with delta-coordinate aggregation to capture global and intermediate contextual cues; (2) Local Geometric Encoding, which extracts fine-level local structural patterns; and (3) Relative Positional Encoding, which generates enriched relative position embeddings by applying expert-level gating and soft assignment to each point pair's relative coordinates and aggregating the corresponding codebook vectors via a weighted combination. We integrate these three encoding modules into a hierarchical point cloud backbone and evaluate our approach on multiple challenging 3D benchmarks, including ModelNet40, ScanObjectNN, ShapeNetPart, S3DIS, and ScanNet v2. Experimental results demonstrate consistent and significant improvements in semantic segmentation, part segmentation, and object classification, validating the critical role of carefully designed positional encodings for 3D perception.

Method Overview

Overview of our proposed network architecture. The left side illustrates a U-Net–style point transformer framework used for semantic segmentation and classification. The right side shows the core attention-based feature extraction module integrating our triadic positional encoding.

Illustration of the proposed triadic positional encoding modules. From left to right: Multi-Scale Positional Encoding, Local Geometric Encoding, and Relative Positional Encoding.

PointTriPE Modules

Below we summarize the three complementary encoding strategies at the core of the proposed PointTriPE framework; illustrative sketches of each module follow the list. The full codebase will be released upon paper acceptance.

  • MSPE (Multi-Scale Positional Encoding) captures global context across hierarchical geometric scales. It applies MLPs and multihead attention over a point cloud pyramid, integrating cross-scale features into a positional prior that enriches long-range dependencies efficiently.
  • LGE (Local Geometric Encoding) enhances sensitivity to fine-grained structures such as edges and curvature. It encodes relative positions to neighboring points using a shared MLP, followed by max pooling, enabling local geometry-aware embeddings.
  • RPE (Relative Positional Encoding) bridges mid-scale spatial relations via a Mixture-of-Experts (MoE) mechanism. It models pairwise offsets with directional codebooks and dynamic gating, enriching attention weights with asymmetric and nonlinear geometric priors.

Each module plays a distinct role in PointTriPE's positional encoding hierarchy—global, local, and relational—working together to enable expressive and efficient 3D representation learning.
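To make the MSPE idea concrete, here is a minimal PyTorch sketch: a two-level coordinate pyramid (built here by random subsampling; farthest point sampling would be used in practice), per-scale MLP embeddings, and multihead cross-attention from the fine points to the coarse scale. Class and argument names (MultiScalePE, coarse_ratio, dim) are illustrative assumptions rather than the released implementation, and the delta-coordinate aggregation is abstracted away.

```python
# Sketch of Multi-Scale Positional Encoding (MSPE); names are illustrative.
import torch
import torch.nn as nn


class MultiScalePE(nn.Module):
    def __init__(self, dim=64, num_heads=4, coarse_ratio=0.25):
        super().__init__()
        self.coarse_ratio = coarse_ratio
        # Per-scale coordinate embeddings.
        self.fine_mlp = nn.Sequential(nn.Linear(3, dim), nn.ReLU(inplace=True), nn.Linear(dim, dim))
        self.coarse_mlp = nn.Sequential(nn.Linear(3, dim), nn.ReLU(inplace=True), nn.Linear(dim, dim))
        # Cross-scale multihead attention: fine points query the coarse scale.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.out = nn.Linear(2 * dim, dim)

    def forward(self, xyz):
        # xyz: (B, N, 3) point coordinates.
        B, N, _ = xyz.shape
        m = max(1, int(N * self.coarse_ratio))
        # Coarse scale via random subsampling (FPS would be used in practice).
        idx = torch.randperm(N, device=xyz.device)[:m]
        coarse_xyz = xyz[:, idx, :]                      # (B, m, 3)
        fine = self.fine_mlp(xyz)                        # (B, N, dim)
        coarse = self.coarse_mlp(coarse_xyz)             # (B, m, dim)
        # Fine-level queries attend to coarse-level keys/values.
        ctx, _ = self.attn(fine, coarse, coarse)         # (B, N, dim)
        # Fuse per-point and cross-scale context into the positional prior.
        return self.out(torch.cat([fine, ctx], dim=-1))  # (B, N, dim)


if __name__ == "__main__":
    mspe = MultiScalePE()
    print(mspe(torch.rand(2, 1024, 3)).shape)  # torch.Size([2, 1024, 64])
```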
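LGE can be approximated by the following sketch, assuming (B, N, 3) coordinates and a k-nearest-neighbor neighborhood: relative offsets to each neighbor pass through a shared MLP and are max-pooled over the neighborhood. Names such as LocalGeometricEncoding, k, and out_dim are placeholders for illustration.

```python
# Sketch of Local Geometric Encoding (LGE); names are illustrative.
import torch
import torch.nn as nn


class LocalGeometricEncoding(nn.Module):
    def __init__(self, k=16, hidden_dim=32, out_dim=64):
        super().__init__()
        self.k = k
        # Shared point-wise MLP applied to each relative offset (dx, dy, dz).
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, xyz):
        # xyz: (B, N, 3) point coordinates.
        B, N, _ = xyz.shape
        # Pairwise distances -> indices of the k nearest neighbors per point.
        dist = torch.cdist(xyz, xyz)                                  # (B, N, N)
        knn_idx = dist.topk(self.k, dim=-1, largest=False).indices   # (B, N, k)
        # Gather neighbor coordinates and form relative offsets.
        idx = knn_idx.unsqueeze(-1).expand(-1, -1, -1, 3)
        neighbors = torch.gather(
            xyz.unsqueeze(1).expand(-1, N, -1, -1), 2, idx)           # (B, N, k, 3)
        rel = neighbors - xyz.unsqueeze(2)                            # (B, N, k, 3)
        # Shared MLP on each offset, then max pooling over the neighborhood.
        feat = self.mlp(rel)                                          # (B, N, k, C)
        return feat.max(dim=2).values                                 # (B, N, C)


if __name__ == "__main__":
    lge = LocalGeometricEncoding()
    print(lge(torch.rand(2, 128, 3)).shape)  # torch.Size([2, 128, 64])
```

Note that the k-NN set above includes the query point itself (offset zero); excluding it, or replacing the brute-force cdist with a spatial index, is a straightforward refinement.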
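The RPE mixture-of-experts idea can be sketched as a small gating network that softly assigns each pairwise offset to a set of learnable codebook (expert) vectors, whose weighted combination forms the relative position embedding. The sketch below uses an assumed interface (RelativePE, num_experts, dim); the released module may add directional structure to the codebooks and differ in how the embedding is injected.

```python
# Sketch of Relative Positional Encoding (RPE) with soft expert gating;
# names are illustrative.
import torch
import torch.nn as nn


class RelativePE(nn.Module):
    def __init__(self, num_experts=8, dim=64):
        super().__init__()
        # One learnable codebook vector per expert.
        self.codebook = nn.Parameter(torch.randn(num_experts, dim) * 0.02)
        # Gating network: maps a 3D offset to soft expert-assignment weights.
        self.gate = nn.Sequential(
            nn.Linear(3, 32),
            nn.ReLU(inplace=True),
            nn.Linear(32, num_experts),
        )

    def forward(self, rel_xyz):
        # rel_xyz: (..., 3) pairwise offsets, e.g. (B, N, k, 3) for k neighbors.
        weights = torch.softmax(self.gate(rel_xyz), dim=-1)  # (..., num_experts)
        # Weighted combination of codebook vectors -> (..., dim).
        return weights @ self.codebook


if __name__ == "__main__":
    rpe = RelativePE()
    offsets = torch.rand(2, 128, 16, 3) - 0.5
    print(rpe(offsets).shape)  # torch.Size([2, 128, 16, 64])
```

In the full model, such pairwise embeddings would typically be added to the attention logits or value features of each point pair; the exact injection point follows the paper's attention module rather than this sketch.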

Experiments

We evaluate our model on five widely used point cloud benchmarks spanning three representative tasks. For shape classification, we use ModelNet40 and ScanObjectNN; for part segmentation, ShapeNetPart; and for indoor scene semantic segmentation, S3DIS and ScanNet v2.

Classification results on ModelNet40

OA: overall accuracy. mAcc: mean per-class accuracy.

Method OA (%) mAcc (%)
PointNet 89.2 86.0
PointNet++ 93.0 90.7
DGCNN 92.9 90.2
PCT 93.2 -
PTv1 93.7 90.6
CurveNet 93.8 -
PointMLP 94.1 91.3
GBNet 93.8 91.0
PointNeXt 94.0 91.1
PTv2 94.2 91.6
DualMLP 93.7 -
PointTriPE (ours) 94.0 91.8

Part segmentation results (mean IoU, %) on ShapeNetPart

“Ins. mIoU” denotes instance-average mIoU; “Cls. mIoU” denotes class-average mIoU.

Method Ins. mIoU Cls. mIoU
PointNet 83.7 80.4
PointNet++ 85.1 81.9
DGCNN 85.2 82.3
PTv1 86.6 83.7
AGCN 86.9 85.1
RSCNN 86.2 84.0
CurveNet 86.8 84.2
OTMae3D 86.8 85.1
PointJEPA 83.9 85.8
AdaCrossNet - 85.1
PointTriPE (ours) 87.0 85.1

Semantic segmentation results on ScanNet v2.
Mean IoU (%) on validation and test sets.

Method Val mIoU Test mIoU
O-CNN 74.0 72.8
PointNet++ 53.5 33.9
PointConv 61.0 55.6
KPConv 69.2 68.0
Minkowski 72.2 73.4
SFormer 74.3 73.7
BPNet 73.9 74.9
PTv2 75.4 75.2
KPConvX-L 76.2 75.6
PointTriPE (ours) 75.0 74.8

Semantic segmentation results (%) on S3DIS Area 5. Class IoUs of ceiling, floor, wall, and beam are omitted.

Method OA mAcc mIoU column window door table chair sofa bookcase board clutter
PTv1 90.8 76.5 70.4 38.0 63.4 74.3 89.1 82.4 74.3 80.2 76.0 59.3
FPT - 77.6 71.0 53.8 71.2 77.3 81.3 89.4 60.1 72.8 80.4 58.9
PointNeXt 91.0 77.2 71.1 37.7 59.3 74.0 83.1 91.6 77.4 77.2 78.8 60.6
PointMixer - 77.4 71.4 43.8 62.1 78.5 90.6 82.2 73.9 79.8 78.5 59.4
StratifiedFormer 91.5 78.1 72.0 46.1 60.0 76.8 92.6 84.5 77.8 75.2 78.1 64.0
DU-Net - - 72.2 40.0 60.7 82.7 90.8 83.1 78.5 83.5 75.9 64.1
PTv2 91.6 78.0 72.0 34.4 64.7 77.9 93.1 84.4 77.3 86.3 84.5 62.2
PointTriPE (ours) 91.3 78.2 71.6 49.2 59.4 74.1 82.5 92.8 85.0 80.1 73.3 64.1

Normalized attention variation on an airplane sample from ModelNet40. Each subfigure corresponds to one of the 9 selected query points (stars), projected onto the XY plane. Neighboring points are shown as circles, where the size reflects their distance to the query, and color denotes the attention change induced by the positional encoding.