PatchAlign3D: Local Feature Alignment for Dense 3D Shape Understanding

Published in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

Q: What is the missing piece in current 3D foundation models?

While current 3D foundation models excel at global tasks such as classification, they often fail at dense, local, part-level reasoning.


Q: How does PatchAlign3D solve this?

We introduce a two-stage training process that distills 2D visual features into 3D patches and aligns them with text, creating the first language-aligned local 3D encoder.
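The two-stage recipe can be sketched roughly as follows. This is an illustrative assumption, not the paper's exact formulation: the function names, shapes, and loss choices (cosine distillation for stage 1, an InfoNCE-style contrastive loss for stage 2) are placeholders for whatever PatchAlign3D actually uses.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Normalize rows to unit length (for cosine comparisons)."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def distillation_loss(patch_feats_3d, target_feats_2d):
    """Stage 1 (hypothetical): pull each 3D patch embedding toward the
    2D visual feature of the region it corresponds to, via cosine distance."""
    a = l2_normalize(patch_feats_3d)
    b = l2_normalize(target_feats_2d)
    return float(np.mean(1.0 - np.sum(a * b, axis=-1)))

def text_alignment_loss(patch_feats, text_feats, labels, temperature=0.07):
    """Stage 2 (hypothetical): InfoNCE between patch embeddings and the
    text embedding of each patch's part label."""
    logits = l2_normalize(patch_feats) @ l2_normalize(text_feats).T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(log_probs[np.arange(len(labels)), labels]))
```

Minimizing the stage-1 loss makes patch embeddings imitate the 2D teacher; minimizing the stage-2 loss pulls each patch toward its part's text embedding and away from the others, which is what makes text queries meaningful at inference.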


Q: Does it work on any object category?

Yes. It is fully zero-shot and open-vocabulary. You can segment parts of any object simply by typing a text query (e.g., “handle”, “landing gear”), even if the model has never seen that category before.
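Open-vocabulary querying of this kind typically reduces to a nearest-neighbor lookup in the shared embedding space. A minimal sketch, assuming patch and text embeddings already live in that space (the function name and interface are hypothetical, not the released API):

```python
import numpy as np

def segment_by_text(patch_feats, query_feats, query_names, eps=1e-8):
    """Hypothetical zero-shot assignment: label each 3D patch with the part
    name of its most similar text query under cosine similarity."""
    p = patch_feats / (np.linalg.norm(patch_feats, axis=1, keepdims=True) + eps)
    q = query_feats / (np.linalg.norm(query_feats, axis=1, keepdims=True) + eps)
    sim = p @ q.T                       # (num_patches, num_queries)
    best = sim.argmax(axis=1)           # nearest query per patch
    return [query_names[i] for i in best]
```

Because the lookup is just a matrix product over precomputed embeddings, adding a never-seen category ("landing gear") costs only one extra text embedding, with no retraining.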


Q: Is it efficient at test time?

Yes. PatchAlign3D is a single feed-forward encoder that requires no multi-view rendering at inference, enabling fast zero-shot part segmentation.

Download paper here