Category agnostic prior for non-rigid shape matching

Emery Pierson, Lei Li, Angela Dai, Maks Ovsjanikov

Objective

Hypothesis: We have access to a large-scale dataset of non-rigid, registered shapes.

Numerous registered human shapes

How can we learn to match shapes without any assumption about the category of the shapes we want to match?

Shape matching with functional maps

 

Shape matching with functional maps

Let \(\mathcal{M}\), \(\mathcal{N}\) be two shapes. We aim to find a pointwise map \(T : \mathcal{M} \to \mathcal{N}\).

 

We can also view the pointwise map as a function transfer (here, between Dirac deltas).

Shape matching with functional maps

 

The operator \(C: L_2(\mathcal{M}) \to L_2(\mathcal{N})\) is linear!

Shape matching with functional maps

  • A set of basis functions (Laplace Beltrami eigenfunctions) on \(\mathcal{M}, \mathcal{N}\)

  • \(C\), represented as a matrix (a linear operator) in the basis functions on \(\mathcal{M}\) and \(\mathcal{N}\) (note: a mapping matrix \(C\) does not necessarily correspond to a pointwise map).

  • The pointwise map \(T\) is then extracted from the mapping matrix.
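The three ingredients above can be sketched on a toy example. This is only an illustration under stated assumptions: the bases are random orthonormal matrices standing in for Laplace Beltrami eigenfunctions, the map `T` is a random permutation, and the full basis is kept so that the recovery is exact. With a truncated basis the nearest-neighbor step would only be approximate.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6  # number of vertices on each toy "shape"

# Hypothetical ground-truth pointwise map: vertex i of M goes to vertex T[i] of N.
T = rng.permutation(n)

# Function transfer as a matrix: (Pi @ f)[T[i]] = f[i], i.e. g = f o T^{-1}.
Pi = np.zeros((n, n))
Pi[T, np.arange(n)] = 1.0

# Stand-in orthonormal bases (one basis function per column).
phi_M, _ = np.linalg.qr(rng.normal(size=(n, n)))
phi_N, _ = np.linalg.qr(rng.normal(size=(n, n)))

# Functional map: basis coefficients on M -> basis coefficients on N.
C = phi_N.T @ Pi @ phi_M

# Recover T: transfer the delta at each vertex of M, then pick the closest delta on N.
transferred = phi_M @ C.T  # row i = coefficients on N of the image of delta_i
dist = np.linalg.norm(transferred[:, None, :] - phi_N[None, :, :], axis=-1)
T_recovered = dist.argmin(axis=1)
```

Here `Pi` is a permutation matrix, which is why the transfer is linear, and `C` is simply `Pi` rewritten in the two bases.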

Shape matching with functional maps

 

Shape matching with functional maps

  • We have a set of basis functions of \(\mathcal{M}\), and \(\mathcal{N}\).

  • We have a set of descriptor functions \(f_i\) on \(\mathcal{M}\) and \(g_j\) on \(\mathcal{N}\) such that corresponding descriptors satisfy \(g_i(x) \sim f_i \circ T^{-1} (x)\).

  • We decompose all \(f_i\) in the basis as \(a \in \mathbb{R}^{n \times m}\) and all \(g_j\) as \(b \in \mathbb{R}^{n \times m}\) (\(n\) basis coefficients for each of the \(m\) descriptors). The functional map can be defined as the solution of: \[ C = \underset{C}{\text{argmin}} \, \|Ca - b\|^2 = \underset{C}{\text{argmin}} \, \text{data\_loss}(C) \]

In practice, we compute the pointwise descriptors using a neural network. Since the solution of the previous equation can be obtained in closed form, we supervise the output \(C\) with the ground-truth map \(C_{gt}\) or with axiomatic constraints, which allows us to learn the descriptors.
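A minimal sketch of the closed-form solve, assuming the descriptor coefficients are already available as matrices. The function name `solve_functional_map` and the random test data are hypothetical, and a learning pipeline would use a differentiable solver rather than NumPy.

```python
import numpy as np

def solve_functional_map(a, b):
    """Least-squares solution of min_C ||C a - b||^2 for a, b of shape (n_basis, n_desc)."""
    # C a = b is equivalent to a^T C^T = b^T, which lstsq solves column by column.
    C_transposed, *_ = np.linalg.lstsq(a.T, b.T, rcond=None)
    return C_transposed.T

rng = np.random.default_rng(0)
n_basis, n_desc = 5, 20
C_true = rng.normal(size=(n_basis, n_basis))  # made-up map used to generate b
a = rng.normal(size=(n_basis, n_desc))
b = C_true @ a
C_est = solve_functional_map(a, b)
```

With more descriptors than basis functions and noise-free data, the estimate matches the map used to generate `b`.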

Deep Functional maps regularization terms

  • “Maps should be as isometric as possible” (can be incorporated in the FMReg layer, key for initialization [NCP, NeurIPS 2023]):

\[|| M_{\text{LBO}} * C ||^2\]

  • “Maps should be area preserving” (\(C\) should be orthogonal):

\[\| C C^T - I \|^2 \]

  • and many others (continuity, orientation, bijectivity, ….)

However, those conditions are not always met in practice.
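The two penalties can be sketched as follows. The exact form of \(M_{\text{LBO}}\) is not given here, so this sketch assumes the simplest choice, a mask of absolute eigenvalue differences applied elementwise, which makes the isometry term equal to the Laplacian commutativity error. The area-preservation term is written as the usual orthogonality penalty \(\|CC^T - I\|^2\). All function names are hypothetical.

```python
import numpy as np

def laplacian_mask(evals_M, evals_N):
    # mask[i, j] = |lambda_N[i] - lambda_M[j]|: large far from the "diagonal" in frequency
    return np.abs(evals_N[:, None] - evals_M[None, :])

def isometry_penalty(C, evals_M, evals_N):
    # Elementwise mask penalty; equals ||C Lambda_M - Lambda_N C||^2 for this mask
    return np.sum((laplacian_mask(evals_M, evals_N) * C) ** 2)

def orthogonality_penalty(C):
    # Zero exactly when C is an orthogonal matrix
    return np.sum((C @ C.T - np.eye(C.shape[0])) ** 2)

rng = np.random.default_rng(0)
k = 4
evals_M = np.sort(rng.uniform(0.0, 10.0, size=k))  # made-up eigenvalues
evals_N = np.sort(rng.uniform(0.0, 10.0, size=k))
C = rng.normal(size=(k, k))
commutator = C @ np.diag(evals_M) - np.diag(evals_N) @ C
```

Both penalties vanish for the identity map between shapes with identical spectra, which is the idealized isometric case.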

Deep functional maps

 

Note: the loss looks like \(\text{data\_loss}(C) + \text{reg\_loss}(C)\).

Experiment

Functional maps exhibit similar diagonal structures across humans, animals, and other categories.

 

Can we “learn” this structure?

New objective

Hypothesis: We have access to a large-scale dataset of non-rigid, registered shapes.

How can we learn a prior on functional maps to regularize deep functional maps, without any assumption about the category of the shapes we want to match?

Answer: diffusion models!

Diffusion models

 

Diffusion models

  • Forward SDE (\(t: 0 \to 1\)) (data to noise process):

    \[dx_t = h_t(x_t) dt + g_t dw\]

  • Reverse SDE (\(t: 1 \to 0\)) (generative process):

    \[dx_t = \left(h_t(x_t) - g_t^2 \nabla_{x_t} \log p_t(x_t) \right)dt +g_t d\bar{w}\]

\(s(x_t, t) = \nabla_{x_t} \log p_t(x_t)\) is the score function (what we need to estimate). (We can condition the score function \(s(x_t, t, c)\) on any condition \(c\) e.g. text.)

https://yang-song.net/blog/2021/score/
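A minimal Euler-Maruyama sketch of both SDEs on a 1D toy problem. The assumptions are loud ones: the drift is taken as \(h_t = 0\) and the diffusion as \(g_t = \sqrt{2t}\) (so that \(x_t \sim \mathcal{N}(x_0, t^2)\)), and the data distribution is a Gaussian whose score is known exactly, so no network is involved.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, s0 = 2.0, 0.5                 # toy data distribution N(mu, s0^2)
n_samples, n_steps = 5000, 1000
dt = 1.0 / n_steps

def g(t):
    return np.sqrt(2.0 * t)       # assumed diffusion coefficient (with h_t = 0)

def score(x, t):
    # Exact score of p_t = N(mu, s0^2 + t^2) for this toy problem
    return -(x - mu) / (s0**2 + t**2)

# Forward SDE, t: 0 -> 1 (data to noise)
x = mu + s0 * rng.normal(size=n_samples)
for i in range(n_steps):
    t = i * dt
    x = x + g(t) * np.sqrt(dt) * rng.normal(size=n_samples)
x_noisy = x.copy()

# Reverse SDE, t: 1 -> 0 (noise to data), using the score
for i in range(n_steps, 0, -1):
    t = i * dt
    x = x + g(t)**2 * score(x, t) * dt + g(t) * np.sqrt(dt) * rng.normal(size=n_samples)
x_generated = x
```

The reverse pass should bring the noisy samples back to roughly the mean and spread of the data distribution.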

Score function: Langevin dynamics

The score function is \(s(x) = \nabla_x \log p(x)\).

By iteratively following the score and adding a little noise, we generate samples!
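A minimal sketch of this update, \(x \leftarrow x + \epsilon\, s(x) + \sqrt{2\epsilon}\, z\), assuming a 1D Gaussian target whose score is known in closed form. The step size and the number of iterations are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 0.5              # toy target distribution N(mu, sigma^2)

def score(x):
    # Exact score of the Gaussian target
    return -(x - mu) / sigma**2

step = 0.01
x = rng.uniform(-5.0, 5.0, size=5000)   # arbitrary initialization
for _ in range(1000):
    # Follow the score, then add a little noise
    x = x + step * score(x) + np.sqrt(2.0 * step) * rng.normal(size=x.shape)
samples = x
```

After enough iterations the samples are approximately distributed as the target, up to a small bias from the finite step size.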

 

Score matching

 

Why we need a little more

Outside the data distribution, we do not know the score (there is no data to learn it from). However, that is where the score is the highest!

 

Denoising score matching:

Perturbed noise distributions

By using different noise scales, we can easily estimate the score outside the data distribution.

 

Denoising score matching:

Annealed Langevin dynamics

We can obtain better Langevin dynamics by iteratively denoising.
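A sketch of annealed Langevin dynamics on a toy bimodal distribution, assuming the perturbed scores are known exactly (in practice they come from the learned model). The noise schedule, the step-size rule (proportional to \(\sigma^2\)) and the number of inner steps are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
modes = np.array([-4.0, 4.0])     # toy data: equal mixture of two narrow Gaussians
data_std = 0.2

def perturbed_score(x, sigma):
    """Exact score of the data distribution convolved with N(0, sigma^2)."""
    var = data_std**2 + sigma**2
    diff = x[:, None] - modes[None, :]                    # shape (n, 2)
    log_w = -0.5 * diff**2 / var
    w = np.exp(log_w - log_w.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                     # component responsibilities
    return np.sum(w * (-diff / var), axis=1)

sigmas = np.geomspace(5.0, 0.05, 10)                      # large noise -> small noise
x = sigmas[0] * rng.normal(size=2000)
for sigma in sigmas:
    step = 0.1 * sigma**2                                 # smaller steps at smaller noise
    for _ in range(50):
        noise = rng.normal(size=x.shape)
        x = x + step * perturbed_score(x, sigma) + np.sqrt(2.0 * step) * noise
samples = x
```

The large noise levels let the chains move freely between the two modes, and the small ones sharpen the samples around them.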

 

Denoising score matching:

Practical

 

Diffusion models

In general, it is sufficient to learn a denoiser \(D_\psi(x_t, t)\) of the noisy data \(x_t\), by minimizing

\[ \mathbb{E}_{x \sim p_{\text{data}}} \mathbb{E}_{n_t \sim \mathcal{N}(0, t^2 I)} \| D_\psi(x + n_t, t) - x \|^2,\]

where \(\psi\) are parameters (neural network weights).

Then, the score function is given by:

\[\nabla_{x_t} \log p_t(x_t) = (D_\psi(x_t, t) - x_t)/t^2\]
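The link between denoiser and score can be checked numerically on a case where both are known. This sketch assumes 1D Gaussian data \(\mathcal{N}(\mu, s_0^2)\), for which the optimal denoiser is the posterior mean \(\mathbb{E}[x \mid x_t]\) and the noisy marginal is \(\mathcal{N}(\mu, s_0^2 + t^2)\). It evaluates the score as \((D(x_t, t) - x_t)/t^2\), the denoised point minus the noisy input.

```python
import numpy as np

mu, s0 = 2.0, 0.5                 # toy data distribution N(mu, s0^2)

def optimal_denoiser(x_t, t):
    # Posterior mean E[x | x_t] for x ~ N(mu, s0^2), x_t = x + n, n ~ N(0, t^2)
    return mu + s0**2 / (s0**2 + t**2) * (x_t - mu)

def score_from_denoiser(x_t, t):
    # Score recovered from the denoiser output
    return (optimal_denoiser(x_t, t) - x_t) / t**2

def exact_score(x_t, t):
    # Gradient of log N(x_t; mu, s0^2 + t^2)
    return -(x_t - mu) / (s0**2 + t**2)

x_t = np.linspace(-3.0, 6.0, 7)
t = 0.8
```

The two score functions agree at every test point, which is the identity the slide relies on.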

Diffusion models

Summary

  • Noising process, denoising - generative process
  • Denoising \(\sim\) following the score using annealed Langevin dynamics
  • Learning the score \(\sim\) learning to denoise
  • Learning to denoise \(\sim \sim \sim\) learning the data probability density

Sounds like a good candidate for our task

How can we transfer our knowledge of data probability to downstream tasks?

Score Distillation Sampling

Main idea

 

Score Distillation Sampling

Main idea

 

“A hotdog in tutu skirt”

Text-to-3D generation

Score Distillation Sampling

Details

We have a source domain (with lots of training data) and a target domain (with not so much training data) such that:

  • We have a denoiser \(D_\psi\) on the source domain (easy)
  • We have a differentiable representation \(y_\theta\) on the target domain.
  • We have a differentiable “source domain extractor” \(g(y_\theta)\) that maps the target domain representation to a source domain representation

We want to sample \(y_\theta\) with the learned denoiser

Score Distillation Sampling

On text-to-3D

  • Our source domain is images

  • Our target domain is 3D shapes

  • Our differentiable representation is a NeRF (original paper) or 3DGS

  • Our differentiable extractor is rendering.

Score Distillation Sampling

On deep functional maps

  • Our source domain is functional maps: we train a denoiser \(D_\psi(x_t, t)\) on functional maps computed from registered human data

  • Our target domain is point-to-point maps

  • Our differentiable representation is pointwise features of deep functional maps

  • Our differentiable extractor is the functional map block

We can now transfer functional map knowledge across categories with SDS!! Let’s do it

Score Distillation Sampling

On deep functional maps

 

We can now transfer functional map knowledge across categories with SDS!! Let’s do it

Diffumatch

 

Score Distillation Sampling

“When it sounds too good to be true, very often, it is too good to be true”

Poor results when applying SDS directly

Diffumatch

We just applied SDS as-is, completely ignoring that we are computing functional maps. We have to:

  • Make sure the initialization is correct.
  • Trick: compute a learned mask regularization from the score denoiser -> better initialization.
  • Make sure the functional map corresponds to a point-to-point map

Mask visualization

 

Results

 

Diffumatch final pipeline

 

Some results

 

Generalization

 

Limitations

 

Take-aways

  • Functional maps, by nature, are a good candidate for category-agnostic learning

  • Recent technology (diffusion models, SDS) is important, even for geometry

  • Applying it to new domains requires some domain knowledge

Future work

  • We only trained our diffusion model on humans. Can we improve the generalization with more data?

  • Are functional maps really the best candidate?

  • First potential candidate: Surface general features / distillation from image features.

  • End goal: a foundation model for 3D shape matching/analysis.

PatchAlign3D

Our main limitation is the lack of 3D/surface general features.

Ideal: Dino/CLIP for 3D. Limitation: limited 3D training data?

Can we learn 3D features that are close to 2D features?

DinoV2 features

Solution

Distill DinoV2 features!!

 

Breakdown

  • Predict features on patches as for images
  • Use a transformer à la Vision Transformer
  • Distill 2D features from rendering + backprojection.
  • For open-text segmentation, contrastive loss between distilled features and text features

Results

 

Conclusion

  • First steps towards 3D shape matching models: good features (PatchAlign3D) and a correspondence prior (Diffumatch).
  • Can we improve features with a surface network?
  • Can we improve the prior (more training data, deformation prior, better architecture)?
  • Can we predict shape matching in a feed-forward way?

Score Distillation Sampling

Details

This is done by following the gradient:

\[ \nabla_\theta \mathcal{L}_{\text{SDS}} = \mathbb{E}_{t,\, x_t \sim \mathcal{N}(x, t^2 I)} \left[ \frac{x_t - D_\psi(x_t, t)}{t} \, \frac{\partial g}{\partial \theta} \right],\]

where \(x = g(y_\theta)\).

In the original paper, \(y_\theta\) is the NeRF representation, \(g\) is the differentiable rendering, and \(x\) is an image.
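A toy sketch of SDS, far simpler than text-to-3D or functional maps. It assumes a 2D Gaussian source distribution with an analytic denoiser and a hypothetical linear extractor \(g(\theta) = A\theta\), whose Jacobian is just \(A\). Following the SDS gradient should drive \(g(\theta)\) towards the mode of the source distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])        # mean of the toy source distribution N(mu, s0^2 I)
s0 = 0.5
A = np.array([[1.0, 0.5],
              [0.0, 1.0]])        # hypothetical linear extractor g(theta) = A @ theta

def denoiser(x_t, t):
    # Analytic optimal denoiser for the Gaussian source distribution
    return mu + s0**2 / (s0**2 + t**2) * (x_t - mu)

def sds_gradient(theta, batch=64):
    x = A @ theta                                  # x = g(y_theta)
    t = rng.uniform(0.1, 1.0, size=(batch, 1))     # random noise levels
    x_t = x + t * rng.normal(size=(batch, 2))      # noised source-domain samples
    residual = (x_t - denoiser(x_t, t)) / t        # shape (batch, 2)
    # Chain rule through g: dg/dtheta = A, so each row contributes A^T residual
    return (residual @ A).mean(axis=0)

theta = np.zeros(2)
for _ in range(2000):
    theta = theta - 0.05 * sds_gradient(theta)
x_final = A @ theta
```

Note that the denoiser is never differentiated: only the extractor's Jacobian appears, which is what makes SDS cheap to apply to a pretrained model.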