Hypothesis: We have access to a large-scale dataset of non-rigid, registered shapes.

How can we learn to match shapes, without any hypothesis on the category of shapes we want to match?

Problem: Costly optimization, or a prior imposed on the deformation (when learning).

Problem: Descriptors are category-specific and don’t transfer to new categories.

Let \(\mathcal{M}\), \(\mathcal{N}\) be two shapes. We aim to find a pointwise map \(T : \mathcal{M} \to \mathcal{N}\).
We can also see the pointwise map as a function transfer (here between Diracs).
The corresponding operator \(C: L_2(\mathcal{M}) \to L_2(\mathcal{N})\) is linear!
We choose a set of basis functions (Laplace-Beltrami eigenfunctions) on \(\mathcal{M}\) and \(\mathcal{N}\).
\(C\) is then represented as a matrix (a linear operator) in the bases of \(\mathcal{M}\) and \(\mathcal{N}\) (note: a mapping matrix \(C\) does not necessarily correspond to a pointwise map).
The pointwise map \(T\) is then extracted from the mapping matrix.
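One common way to do this extraction is a nearest-neighbour search in the spectral embedding: the Dirac at a vertex \(x\) of \(\mathcal{M}\) has the row of basis values at \(x\) as coefficients, and \(C\) should send it close to the row of some vertex of \(\mathcal{N}\). The sketch below is only an illustration under simplifying assumptions: the bases are random matrices rather than real Laplace-Beltrami eigenfunctions, area weights are ignored, and the names `pointwise_map_from_fmap`, `phi_M`, `phi_N` are hypothetical.

```python
import numpy as np

def pointwise_map_from_fmap(C, phi_M, phi_N):
    """For each vertex x of M, return the vertex y of N minimizing
    ||phi_N[y] - C @ phi_M[x]|| (nearest neighbour in the spectral embedding)."""
    mapped = phi_M @ C.T  # (n_M, k): row x holds C @ phi_M[x]
    # Squared distances between every mapped row and every row of phi_N.
    d2 = ((mapped[:, None, :] - phi_N[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

# Synthetic check (assumed setup): a known permutation T and an orthogonal C.
rng = np.random.default_rng(0)
n_vertices, k = 50, 8
phi_N = rng.standard_normal((n_vertices, k))          # stand-in basis on N
T_true = rng.permutation(n_vertices)                  # ground-truth map M -> N
C_true, _ = np.linalg.qr(rng.standard_normal((k, k)))  # orthogonal functional map
phi_M = phi_N[T_true] @ C_true                        # so that C @ phi_M[x] = phi_N[T(x)]

T_rec = pointwise_map_from_fmap(C_true, phi_M, phi_N)
```

The dense distance matrix is fine for a few thousand vertices; for larger meshes a k-d tree would be the usual replacement.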
We have a set of basis functions on \(\mathcal{M}\) and on \(\mathcal{N}\).
We have a set of descriptor functions \(f_i\) on \(\mathcal{M}\) and \(g_i\) on \(\mathcal{N}\) such that \(g_i(x) \approx f_i \circ T^{-1} (x)\).
We express all \(f_i\) in the basis of \(\mathcal{M}\) as coefficients \(a \in \mathbb{R}^{n \times m}\) and all \(g_i\) in the basis of \(\mathcal{N}\) as \(b \in \mathbb{R}^{n \times m}\). The functional map can be defined as the solution of: \[ C = \underset{C}{\operatorname{argmin}} \|Ca - b\|^2 = \underset{C}{\operatorname{argmin}}\ \mathrm{data\_loss}(C) \]
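As a small illustration of this least-squares problem, the sketch below recovers \(C\) from synthetic coefficient matrices with NumPy. The sizes and the random data are assumptions made for the example; in a real pipeline `a` and `b` would hold the descriptor coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 10, 40                       # n basis functions, m descriptors (assumed sizes)
C_true = rng.standard_normal((n, n))
a = rng.standard_normal((n, m))     # coefficients of the descriptors on M
b = C_true @ a                      # coefficients of the descriptors on N

# min_C ||C a - b||^2 is the same problem as min_X ||a^T X - b^T||^2 with X = C^T,
# which np.linalg.lstsq solves directly.
C_est = np.linalg.lstsq(a.T, b.T, rcond=None)[0].T
```

With at least as many independent descriptors as basis functions (\(m \ge n\)), the solution is unique.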
In practice, we compute the pointwise descriptors using a neural network. Since the solution of the previous problem is available in closed form, we can supervise the resulting \(C\) with the ground-truth map \(C_{gt}\) or with axiomatic constraints, which allows us to learn the descriptors.
\[|| M_{\text{LBO}} * C ||^2\]
\[\| C C^T - I \|^2 \]
However, those conditions are not always met in practice.
Note: the loss looks like \(\mathrm{data\_loss}(C) + \mathrm{reg\_loss}(C)\).
Functional maps exhibit similar diagonal structures across humans, animals, and other categories.
Can we “learn” this structure?
Hypothesis: We have access to a large-scale dataset of non-rigid, registered shapes.
How can we learn a prior on functional maps, to regularize deep functional maps, without any hypothesis on the category of shapes we want to match?
Answer: diffusion models!
Forward SDE (\(t: 0 \to 1\)) (data to noise process):
\[dx_t = h_t(x_t) dt+ g_tdw\]
Reverse SDE (\(t: 1 \to 0\)) (generative process):
\[dx_t = \left(h_t(x_t) - g_t^2 \nabla_{x_t} \log p_t(x_t) \right)dt +g_t d\bar{w}\]
\(s(x_t, t) = \nabla_{x_t} \log p_t(x_t)\) is the score function (what we need to estimate). (We can condition the score function \(s(x_t, t, c)\) on any condition \(c\) e.g. text.)
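To make the reverse SDE concrete, here is a minimal Euler-Maruyama sketch on a 1D toy problem where the score is known exactly. Everything specific is an assumption for illustration: the data distribution is \(\mathcal{N}(\mu, s_0^2)\), the forward SDE uses \(h_t = 0\) and a constant \(g_t = G\), so that \(p_t = \mathcal{N}(\mu, s_0^2 + G^2 t)\). In a real model the `score` function would be a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, s0, G = 2.0, 0.5, 2.0   # toy data N(mu, s0^2); forward SDE with h_t = 0, g_t = G

def score(x, t):
    """Exact score of p_t = N(mu, s0^2 + G^2 t) for this toy forward SDE."""
    return -(x - mu) / (s0**2 + G**2 * t)

n_samples, n_steps = 20000, 500
dt = 1.0 / n_steps
# Start from the exact marginal at t = 1 (the "noise" end of the process).
x = mu + np.sqrt(s0**2 + G**2) * rng.standard_normal(n_samples)
for i in range(n_steps):
    t = 1.0 - i * dt
    # Time runs backwards, so the drift -G^2 * score is multiplied by (-dt).
    x = x + G**2 * score(x, t) * dt + G * np.sqrt(dt) * rng.standard_normal(n_samples)

reverse_mean, reverse_std = x.mean(), x.std()
```

The samples end up close to the data distribution, up to discretization and Monte Carlo error.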
https://yang-song.net/blog/2021/score/
The score function is \(s(x) = \nabla_x \log p(x)\).
By iteratively following the score and adding a little noise, we generate samples!
Outside the data distribution, we don’t need the score. However, that is where the score is the highest!
By using different noise scales, we can easily estimate the score outside the data distribution.
We can get better Langevin dynamics by iteratively denoising.
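The "follow the score and add a little noise" recipe is Langevin dynamics. Below is a minimal sketch with a single noise scale, assuming a 1D Gaussian target \(\mathcal{N}(\mu, s^2)\) whose score is known in closed form; the step size `eps` and the number of iterations are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, s = 2.0, 0.5            # toy target distribution N(mu, s^2) (assumed)

def score(x):
    """Exact score of N(mu, s^2): gradient of the log-density."""
    return -(x - mu) / s**2

eps = 0.01                                  # step size
x = 3.0 * rng.standard_normal(20000)        # arbitrary initialization, far from the target
for _ in range(1000):
    # Langevin update: step along the score, then inject Gaussian noise.
    x = x + eps * score(x) + np.sqrt(2 * eps) * rng.standard_normal(x.shape)

langevin_mean, langevin_std = x.mean(), x.std()
```

The finite step size introduces a small bias in the variance, which shrinks as `eps` goes to zero.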
In general, it is sufficient to learn to denoise the data \(x_t\) with a denoiser \(D_\psi(x_t, t)\), minimizing
\[ \mathbb{E}_{x \sim p_{\text{data}}} \mathbb{E}_{n_t \sim \mathcal{N}(0, t^2 I)}\| D_\psi(x + n_t, t) - x \|^2,\]
where \(\psi\) are parameters (neural network weights).
Then, the score function is given by:
\[\nabla_{x_t} \log p_t(x_t) = (D_\psi(x_t, t) - x_t)/t^2\]
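The link between denoising and the score can be checked numerically on a toy case. The sketch below assumes 1D Gaussian data \(\mathcal{N}(\mu, s_0^2)\) and a single noise level \(t\), and fits a linear denoiser by least squares (a stand-in for the neural network \(D_\psi\); for Gaussian data the optimal denoiser is linear). The score obtained as (denoised minus noisy input) divided by \(t^2\) is then compared with the exact score of the noisy distribution \(\mathcal{N}(\mu, s_0^2 + t^2)\).

```python
import numpy as np

rng = np.random.default_rng(0)
mu, s0, t = 2.0, 0.5, 1.0                 # toy data N(mu, s0^2), one noise level t (assumed)
N = 200_000
x = mu + s0 * rng.standard_normal(N)      # clean samples
x_t = x + t * rng.standard_normal(N)      # noisy samples x + n_t

# Fit D(x_t) = w * x_t + c by minimizing the empirical denoising loss mean (D(x_t) - x)^2.
design = np.stack([x_t, np.ones(N)], axis=1)
w, c = np.linalg.lstsq(design, x, rcond=None)[0]

# Score from the denoiser versus the exact score of N(mu, s0^2 + t^2).
grid = np.linspace(-1.0, 5.0, 13)
score_from_denoiser = (w * grid + c - grid) / t**2
score_exact = -(grid - mu) / (s0**2 + t**2)
max_gap = np.max(np.abs(score_from_denoiser - score_exact))
```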
Sounds like a good candidate for our task.
How can we transfer our knowledge of data probability to downstream tasks?


Text-to-3D generation
We have a source domain (with lots of training data) and a target domain (with not so much training data) such that:
We want to sample \(y_\theta\) with the learned denoiser
Our source domain is images
Our target domain is 3D shapes
Our differentiable representation is a NeRF (original paper) or 3DGS
Our differentiable extractor is rendering.
Our source domain is functional maps (trained with registered human data) -> we train a denoiser \(D_\psi(x_t, t)\) on functional maps computed from human data
Our target domain is point-to-point maps
Our differentiable representation is the pointwise features of deep functional maps
Our differentiable extractor is the functional map block
We can now transfer functional maps knowledge across categories with SDS!! Let’s do it
“When it sounds too good to be true, very often, it is too good to be true”
Poor results when applying SDS directly
We just applied SDS as-is, but completely ignored that we are computing functional maps. We have to:
Functional maps, by nature, are a good candidate for category-agnostic learning
Recent technology (diffusion models, SDS) is important, even for geometry
Applying it to new domains requires some domain knowledge
We only trained our diffusion model on humans. Can we improve the generalization with more data?
Are functional maps really the best candidate?
First potential candidate: Surface general features / distillation from image features.
Endgoal: foundation model for 3D shape matching/analysis.
Our main limitation is the lack of 3D/surface general features.
Ideal: DINO/CLIP for 3D. Limitation: limited 3D training data?
Can we learn 3D features that are close to 2D features?

Distill DINOv2 features!!
Score distillation sampling (SDS) is done by minimizing a loss whose gradient is:
\[ \nabla_\theta \mathcal{L}_{\text{SDS}} = \mathbb{E}_{t,\, x_t \sim \mathcal{N}(x, t^2 I)} [(x_t - D_\psi(x_t, t))/t] \frac{\partial g}{\partial \theta},\]
where \(x = g(y_\theta)\).
In the original paper, \(y_\theta\) is the NeRF representation, \(g\) is the differentiable rendering, \(x\) is an image.
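The SDS update can be sketched on a toy problem. All specifics below are assumptions for illustration and not the setup of the original paper: the "source" prior is a Gaussian \(\mathcal{N}(\mu, s_0^2 I)\) whose ideal denoiser is known in closed form (standing in for a trained \(D_\psi\)), the representation \(y_\theta\) is just the vector `theta`, and the extractor \(g\) is a linear map `A`. As in SDS, the denoiser is treated as a fixed function and only the Jacobian of \(g\) is used.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
mu, s0 = rng.standard_normal(d), 0.5                 # toy source prior N(mu, s0^2 I)
A = np.eye(d) + 0.1 * rng.standard_normal((d, d))    # toy linear extractor: g(theta) = A @ theta

def denoiser(x_t, t):
    """Closed-form posterior mean E[x | x_t] for the Gaussian prior (stand-in for D_psi)."""
    w = s0**2 / (s0**2 + t**2)
    return mu + w * (x_t - mu)

theta = np.zeros(d)
lr, batch = 0.05, 256
for _ in range(2000):
    x = A @ theta                                    # x = g(y_theta)
    t = rng.uniform(0.1, 1.0, size=(batch, 1))       # random noise levels
    x_t = x + t * rng.standard_normal((batch, d))    # x_t ~ N(x, t^2 I)
    residual = (x_t - denoiser(x_t, t)) / t          # (x_t - D(x_t, t)) / t
    grad = residual.mean(axis=0) @ A                 # Monte Carlo E[residual] times dg/dtheta
    theta -= lr * grad                               # gradient step on theta only

sds_error = np.linalg.norm(A @ theta - mu)
```

In this toy case the update pulls \(g(y_\theta)\) towards the mode of the prior, which is the mode-seeking behaviour usually attributed to SDS.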