PatchAlign3D: Local Feature Alignment for Dense 3D Shape Understanding
Published in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026
Q: What is the missing piece in current 3D foundation models?
While current 3D foundation models excel at global tasks such as classification, they often fail at dense, local, part-level reasoning.
Q: How does PatchAlign3D solve this?
We introduce a two-stage training process that distills 2D visual features into 3D patches and aligns them with text, creating the first language-aligned local 3D encoder.
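The two stages above can be sketched as two losses: a distillation term pulling each 3D patch feature toward its lifted 2D visual feature, and a contrastive term aligning patch features with text embeddings. The paper does not specify the exact loss forms; the MSE and InfoNCE-style formulations below, and all function names, are illustrative assumptions.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Normalize feature rows to unit length before computing cosine similarity.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def distill_loss(patch_feats, feats_2d):
    """Stage 1 (assumed form): mean-squared error pulling each 3D patch
    feature toward the 2D visual feature lifted onto that patch."""
    return np.mean((patch_feats - feats_2d) ** 2)

def text_align_loss(patch_feats, text_feats, temperature=0.07):
    """Stage 2 (assumed form): InfoNCE-style contrastive alignment between
    patch features and their matched text embeddings."""
    p = l2_normalize(patch_feats)
    t = l2_normalize(text_feats)
    logits = p @ t.T / temperature  # (N, N) cosine similarities
    # Row-wise log-softmax: each patch should match its own text embedding.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(p))
    return -np.mean(log_probs[idx, idx])
```

After stage 1 the patch features live in the 2D encoder's feature space; stage 2 then ties that space to language, which is what enables text queries at inference.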
Q: Does it work on any object category?
Yes. It is fully zero-shot and open-vocabulary. You can segment parts of any object simply by typing a text query (e.g., “handle”, “landing gear”), even if the model has never seen that category before.
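Conceptually, open-vocabulary part segmentation reduces to a nearest-text lookup: embed each query string with the aligned text encoder, then label every 3D patch with the most similar query. A minimal sketch, assuming per-patch features and text embeddings already live in the shared space (the function name and shapes are hypothetical):

```python
import numpy as np

def segment_by_text(patch_feats, query_embs):
    """Assign each 3D patch the index of the best-matching text query
    by cosine similarity (hypothetical inference sketch)."""
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    sims = p @ q.T            # (num_patches, num_queries) cosine similarities
    return sims.argmax(axis=1)  # per-patch query index
```

For example, with queries ["handle", "landing gear"], each patch receives label 0 or 1 depending on which phrase its feature is closer to; no category-specific training is involved.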
Q: Is it efficient at test time?
Yes. PatchAlign3D is a single feed-forward encoder that requires no multi-view rendering at inference, enabling fast zero-shot part segmentation.
