Something remarkable happened in computer vision in 2025: the fields of generative modeling and representation learning, which had developed largely independently, suddenly converged. Diffusion models started leveraging pretrained vision encoders like DINOv2 to dramatically accelerate training. Researchers discovered that aligning generative models to pretrained representations doesn’t just speed things up—it often produces better results.
As someone who works on generative models for (among other things) molecules and proteins, I’ve been watching this unfold with great interest. Could we do the same thing for molecular ML? We now have foundation models like MACE that learn powerful atomic representations. Could aligning molecular generative models to these representations provide similar benefits?
In this post, I’ll summarize what happened in vision (organized into four “phases”), and then discuss what I think are the key lessons for molecular machine learning. The punchline: many of these ideas are already starting to appear in our field, but we’re still in the early stages compared to vision.
For a more detailed treatment of the vision developments with full references and figures, see the extended blog post on my website.
The Vision Story: Four Phases of Representation-Guided Generation
Background: Latent Diffusion Models

Two-stage training: first, a VAE compresses images into latent space; second, diffusion operates in this latent space.
Modern image generation is dominated by Latent Diffusion Models (LDMs)—a two-stage approach where a VAE compresses images into a lower-dimensional latent space, and diffusion operates in that space. This modular design, pioneered by Stable Diffusion, enabled the scaling that led to SDXL, FLUX, and Sora.
Meanwhile, Vision Foundation Models (VFMs) like DINOv2, CLIP, and MAE learn general-purpose visual representations that transfer remarkably well across tasks. The Platonic Representation Hypothesis suggests that as these models scale, they converge toward a shared structure—a kind of “optimal” visual representation.
This set the stage for a natural question: if VFMs already capture useful visual structure, can we leverage them to accelerate diffusion training?
Phase 1: Align Diffusion Features to Pretrained Encoders

REPA adds an alignment loss from intermediate diffusion features to frozen DINOv2 representations.
REPA introduced a simple idea: extract features from intermediate diffusion layers and maximize their similarity to frozen DINOv2 features. This auxiliary loss accelerates training dramatically—achieving in hours what previously took days.
Key insight for molecules: A surprising finding from iREPA is that spatial structure matters more than global semantics. Encoders with strong local self-similarity patterns work better than encoders with high classification accuracy. For molecules, this suggests that local geometric structure (what MACE excels at) might be more valuable for generative alignment than global molecular properties.
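The REPA-style alignment objective is easy to sketch. Below is a minimal NumPy illustration of the core idea (the function name is mine, and I assume the diffusion features have already been projected to the encoder's dimension, as REPA does with a small MLP):

```python
import numpy as np

def repa_alignment_loss(diffusion_feats, encoder_feats):
    """Negative mean cosine similarity between per-token feature pairs.

    diffusion_feats: (n_tokens, d) intermediate diffusion features,
        assumed already projected to the encoder dimension.
    encoder_feats:   (n_tokens, d) frozen encoder (e.g. DINOv2) features.
    """
    a = diffusion_feats / np.linalg.norm(diffusion_feats, axis=-1, keepdims=True)
    b = encoder_feats / np.linalg.norm(encoder_feats, axis=-1, keepdims=True)
    # Minimizing this pulls each diffusion token toward its encoder token.
    return -np.mean(np.sum(a * b, axis=-1))

feats = np.random.default_rng(0).normal(size=(4, 8))
loss = repa_alignment_loss(feats, feats)  # perfectly aligned features
```

In practice this term is added, with a weighting coefficient, to the standard denoising objective; the frozen encoder only provides targets and receives no gradients.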
Phase 2: Bake Semantic Structure into the Latent Space

VA-VAE aligns the VAE’s latent space with pretrained VFM features during tokenizer training.
Phase 1 aligned the diffusion model, but left the VAE latent space unchanged. VA-VAE and REPA-E went deeper: they aligned the VAE’s latent space itself with VFM features, creating latents that are both reconstructive and semantically meaningful.
Key insight for molecules: This is analogous to training a molecular autoencoder where the latent space is aligned to MACE embeddings—ensuring the latent representation captures physically meaningful structure, not just reconstruction capability.
Phase 3: Diffuse Directly in Representation Space

RAE uses frozen pretrained encoders (DINO, SigLIP, MAE) directly as the latent space for diffusion.
RAE asked: do we need VAE compression at all? It uses frozen VFM encoders directly as the latent space, paired with trained decoders. This required some architectural modifications, but demonstrated that the “latent space” can simply be a pretrained representation space. And some of the architectural headaches seemed to go away with sufficient scale.
Key insight for molecules: Could we generate molecules directly in MACE embedding space? The decoder would need to map from embeddings back to atomic coordinates—a challenging inverse problem, but potentially powerful.
Phase 4: Train from Scratch with Better Objectives
Finally, Phase 4 asked whether pretrained representations are even necessary. Methods like USP train jointly for generative and discriminative objectives, and the pixel-space diffusion renaissance showed that architectural innovation can substitute for latent compression.
Key insight for molecules: Perhaps the best approach isn’t to borrow representations, but to borrow training objectives. Physics-grounded losses (energies, forces) might play the role that reconstruction + discrimination plays in vision.
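To make that concrete, here is a hedged sketch of what a physics-grounded training objective might look like, in the style of NNP losses. The function name and weighting are illustrative, not taken from any specific paper:

```python
import numpy as np

def physics_loss(pred_energy, pred_forces, ref_energy, ref_forces, w_f=10.0):
    """MSE on total energy plus weighted MSE on per-atom forces.

    A hypothetical NNP-style objective: the energy term anchors the
    global scale, while forces (3 components per atom) supervise the
    local geometry around every atom.
    """
    e_term = (pred_energy - ref_energy) ** 2
    f_term = np.mean((pred_forces - ref_forces) ** 2)
    return e_term + w_f * f_term

# Forces are typically weighted up: there are 3N force labels per
# structure but only one energy, and forces carry the local signal.
```

Replace the reference energies and forces with DFT labels (e.g. from a dataset like OMol25) and this plays the role that reconstruction plus discrimination plays in vision.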
The Molecular Parallel: MACE as Our DINO?
Neural network potentials like MACE have emerged as foundation models for atomistic chemistry—and they exhibit striking parallels to vision foundation models.

The embedding paradigm across domains: foundation models produce local embeddings that are aggregated and fed to task-specific prediction heads.
Parallel 1: Embeddings Transfer Across Tasks
MACE’s internal representations, learned for predicting quantum-mechanical energies and forces, generalize remarkably well to diverse downstream tasks. REM3DI aggregates MACE atomic descriptors into molecular representations that achieve state-of-the-art performance on property prediction. Similar approaches work for proteins, pooling MACE descriptors to predict per-residue properties like NMR chemical shifts.
This mirrors exactly how DINO embeddings are used in vision: the [CLS] token or pooled patch embeddings are fed to downstream prediction heads.

Left: REM3DI aggregates MACE atomic descriptors into molecular descriptors. Right: MACE descriptors pooled to residue-level representations for protein property prediction.
Parallel 2: Platonic Convergence with Scale
Just as vision models trained with different objectives converge toward similar representations, independently trained molecular models exhibit the same phenomenon. Different NNPs can be mapped into a common latent space with minimal performance loss—a molecular analogue of the Platonic Representation Hypothesis.
This suggests that MACE-like representations aren’t arbitrary—they’re discovering something fundamental about molecular structure.
Parallel 3: Representation Alignment Benefits Generative Models
MACE-REPA directly applies the Phase 1 alignment paradigm to molecular force fields, aligning generative model representations to frozen MACE features. Early results are promising, but this work is still in its infancy compared to the sophisticated alignment methods in vision.
Three Strategies for Molecular ML
Based on the vision lessons, I see three complementary strategies for molecular generative modeling:

Three approaches: (1) using pretrained embeddings for downstream tasks, (2) aligning generative models to foundation model features, (3) drawing architectural inspiration from what makes foundation models work.
Strategy 1: Use Pretrained Embeddings Directly
The simplest approach: take MACE embeddings and use them for downstream prediction. This works well for property prediction, but naively incorporating them doesn’t always help—as we explored in the RF3 paper, simply conditioning on MACE embeddings did not improve structure prediction accuracy.
The vision lesson: Embeddings need to be incorporated thoughtfully. In vision, spatial alignment matters more than global semantics. For molecules, perhaps local environment descriptors matter more than global molecular embeddings.
Strategy 2: Align Generative Models to Foundation Model Features
MACE-REPA is the beginning of this, but we’re far behind vision in sophistication. The Phase 2 insight—aligning the latent space of autoencoders, not just the generative model—hasn’t been explored for molecules yet.
Concrete opportunity: Train a molecular VAE where the latent space is regularized to align with MACE atomic descriptors. This would create a latent space that’s both reconstructive (can decode back to coordinates) and physically meaningful (aligned to a foundation model).
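As a sketch of what that combined objective could look like, here is a minimal NumPy version. Everything here is an assumption for illustration: the function name, the weights, and the premise that per-atom latents have been projected to the MACE descriptor dimension:

```python
import numpy as np

def aligned_vae_loss(x, x_rec, z, mace_desc, kl, w_align=0.5, w_kl=1e-4):
    """Reconstruction + KL + alignment of latents to frozen MACE features.

    x, x_rec:  (n_atoms, 3) true and reconstructed coordinates.
    z:         (n_atoms, d) per-atom latents (assumed projected to
               the MACE descriptor dimension d).
    mace_desc: (n_atoms, d) frozen MACE atomic descriptors.
    kl:        scalar KL term from the VAE posterior.
    """
    recon = np.mean((x - x_rec) ** 2)
    zn = z / np.linalg.norm(z, axis=-1, keepdims=True)
    mn = mace_desc / np.linalg.norm(mace_desc, axis=-1, keepdims=True)
    align = -np.mean(np.sum(zn * mn, axis=-1))  # cosine alignment term
    return recon + w_align * align + w_kl * kl
```

The reconstruction term keeps the latent space decodable to coordinates; the alignment term pulls it toward the foundation model's physically meaningful structure, in the spirit of VA-VAE and REPA-E.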
Strategy 3: Architectural Inspiration—Why Does MACE Work?
Perhaps the deepest lesson isn’t about borrowing representations, but understanding why MACE works so well:
- Large-scale physics data: Training on diverse DFT calculations like OMol25
- Physics-grounded objectives: Predicting energies and forces, not just structures
- Strong locality bias: Operating on strictly local atomic environments
SLAE takes exactly this approach for proteins. It adopts physics-grounded objectives by predicting Rosetta energy terms (hydrogen bonding, solvation, electrostatics) and embraces strict locality by encoding all-atom environments rather than full protein graphs. This pays off: simple linear interpolations in the SLAE latent space produce physically meaningful conformational changes that in some cases even align with MD simulations.
The vision analogy: SLAE is conceptually close to aligned VAEs in vision—reconstruction ensures geometric fidelity while auxiliary physics heads encourage the latent space to align with physically meaningful axes.
Key Takeaways for Molecular ML
- Alignment accelerates training: In vision, representation alignment achieves in hours what previously took days (for an extreme case of this see the Speedrunning ImageNet Diffusion project). We should expect similar speedups for molecular generative models aligned to MACE or similar foundations.
- Local structure > global properties: The vision community learned that spatial structure matters more than classification accuracy. For molecules, local geometric environments (MACE’s specialty) may be more valuable than global molecular descriptors.
- Align the latent space, not just the model: Phase 2 methods that bake VFM structure into the VAE latent space work better than surface-level alignment. We haven’t done this for molecules yet.
- Physics objectives may be our secret weapon: Vision models need complex self-supervised objectives (contrastive, masked prediction). We have something better: physics. Energies and forces are natural, well-defined objectives that capture meaningful structure.
- The Platonic Representation exists for molecules too: Different NNPs converge toward similar representations with scale. This suggests aligning to any good foundation model should help.
The vision community’s representation revolution happened remarkably fast—most of this work appeared in 2025. We’re just beginning to see similar ideas in molecular ML. I’m excited to see how far we can push these parallels!
