Current volumetric biomedical foundation models struggle to generalize as public 3D datasets are small and do not cover the broad diversity of medical procedures, conditions, anatomical regions, and imaging protocols. We address this by creating a representation learning method that instead anticipates strong domain shifts at training time itself. We first propose a data engine that synthesizes highly variable training samples that would enable generalization to new biomedical contexts. To then train a single 3D network for any voxel-level task, we develop a contrastive learning method that pretrains the network to be stable against nuisance imaging variation simulated by the data engine, a key inductive bias for generalization. This network's features can be used as robust representations of input images for downstream tasks and its weights provide a strong, dataset-agnostic initialization for finetuning on new datasets. As a result, we set new standards across both multimodality registration and few-shot segmentation, a first for any 3D biomedical vision model, all without (pre-)training on any existing dataset of real images.
Medical image registration is traditionally formulated as the optimization problem:
\[\mathcal{L}(\varphi) = d(I_{\text{input}}, I_{\text{target}} \circ \varphi) + \lambda \, \text{Reg}(\varphi)\]
where \(d(\cdot)\) is a similarity measure between images and \(\text{Reg}(\cdot)\) is a regularization term on the deformation field \(\varphi\).
However, in cross-modality registration tasks, images no longer have comparable intensities, making traditional similarity metrics like mean squared error inapplicable. To align images across modalities, we instead use modality-agnostic anatomix features. By aligning the extracted features instead of raw intensities, we simply reformulate the registration loss as:
\[\mathcal{L}(\varphi) = d(F(I_{\text{input}}), F(I_{\text{target}}) \circ \varphi) + \lambda \, \text{Reg}(\varphi)\]
where \(F(\cdot)\) is the anatomix feature extractor. One can use any existing registration solver (like ANTs) with any standard similarity measure applied to the features.
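For concreteness, here is a minimal PyTorch sketch of the feature-space loss above, optimizing a dense displacement field directly with Adam. The `feature_extractor` below stands in for a pretrained anatomix-style 3D U-Net; its name and interface are placeholders rather than the official API, and in practice the feature maps can instead be handed to an existing solver such as ANTs as multichannel inputs.

```python
# Minimal sketch of feature-based deformable registration (equation above),
# assuming `feature_extractor` maps a [1, 1, D, H, W] volume to [1, C, D, H, W]
# features. The extractor name/API is a placeholder, not the official interface.
import torch
import torch.nn.functional as F


def warp(vol, disp):
    """Warp `vol` [1, C, D, H, W] by a dense displacement field `disp` [1, 3, D, H, W]."""
    _, _, D, H, W = vol.shape
    # Identity sampling grid in normalized [-1, 1] coordinates.
    zz, yy, xx = torch.meshgrid(
        torch.linspace(-1, 1, D, device=vol.device),
        torch.linspace(-1, 1, H, device=vol.device),
        torch.linspace(-1, 1, W, device=vol.device),
        indexing="ij",
    )
    identity = torch.stack((xx, yy, zz), dim=-1).unsqueeze(0)  # [1, D, H, W, 3]
    # Displacement channels are read as (dx, dy, dz) offsets in normalized coords.
    offset = disp.permute(0, 2, 3, 4, 1)
    return F.grid_sample(vol, identity + offset, align_corners=True)


def smoothness(disp):
    """Diffusion regularizer: mean squared spatial gradient of the displacement."""
    dz = disp[:, :, 1:] - disp[:, :, :-1]
    dy = disp[:, :, :, 1:] - disp[:, :, :, :-1]
    dx = disp[:, :, :, :, 1:] - disp[:, :, :, :, :-1]
    return (dz ** 2).mean() + (dy ** 2).mean() + (dx ** 2).mean()


def register(feature_extractor, moving, fixed, n_iters=200, lam=1.0, lr=1e-2):
    """Optimize a dense displacement field that aligns feature maps."""
    with torch.no_grad():
        f_moving = feature_extractor(moving)   # F(I_input)
        f_fixed = feature_extractor(fixed)     # F(I_target)
    disp = torch.zeros(1, 3, *moving.shape[2:], requires_grad=True, device=moving.device)
    opt = torch.optim.Adam([disp], lr=lr)
    for _ in range(n_iters):
        opt.zero_grad()
        # d(F(I_input), F(I_target) ∘ φ) + λ Reg(φ), with d = MSE on features.
        loss = F.mse_loss(f_moving, warp(f_fixed, disp)) + lam * smoothness(disp)
        loss.backward()
        opt.step()
    return disp.detach()
```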
With off-the-shelf solvers, this approach yields state-of-the-art unsupervised 3D multimodal image registration across multiple datasets, as shown below:
Self-supervised pretraining in medical imaging is in a weird place. You collect hundreds to thousands of volumes, annotate a few of them, and spend months to a year developing a pretraining strategy for your data to get a 2-10% boost. You could have just spent that time annotating more data for better results (and many projects do not have vast unlabeled pools to begin with).
anatomix changes this calculus: through extensive pretraining on diverse synthetic data, it serves as a pretrained U-Net for any biomedical dataset, and you can finetune its weights efficiently on just a few annotated volumes from any new domain.
Compared to existing 3D foundation models, an anatomix-pretrained U-Net is a consistently strong initialization for arbitrarily chosen datasets, both qualitatively (top) and quantitatively (bottom).
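As a rough illustration of the finetuning recipe, the sketch below attaches a 1x1x1 convolutional segmentation head to the pretrained U-Net and finetunes both on a handful of labeled volumes with cross-entropy. The data loader, feature-channel count, and model-loading details are assumptions for illustration, not the official anatomix training code.

```python
# Minimal few-shot finetuning sketch, assuming `pretrained_unet` is an
# anatomix-style 3D U-Net returning [B, C, D, H, W] features. Names and
# hyperparameters below are illustrative placeholders.
import torch
import torch.nn as nn


def finetune(pretrained_unet, train_loader, n_classes, feat_channels=16,
             n_iters=1000, lr=1e-4, device="cuda"):
    # Lightweight segmentation head on top of the pretrained features.
    head = nn.Conv3d(feat_channels, n_classes, kernel_size=1).to(device)
    pretrained_unet.to(device).train()
    params = list(pretrained_unet.parameters()) + list(head.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    ce = nn.CrossEntropyLoss()
    it = 0
    while it < n_iters:
        for image, label in train_loader:  # image: [B,1,D,H,W], label: [B,D,H,W]
            image, label = image.to(device), label.to(device).long()
            logits = head(pretrained_unet(image))
            loss = ce(logits, label)
            opt.zero_grad()
            loss.backward()
            opt.step()
            it += 1
            if it >= n_iters:
                break
    return pretrained_unet, head
```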
Annotated 3D medical image datasets are scarce. GANs and DDPMs can synthesize new training samples, but they are by construction limited to reproducing their training distribution and cannot extrapolate to new domains. We instead generate samples that are not necessarily realistic, but that serve as diverse and useful training data for learning general tasks in arbitrary biomedical contexts.
Our engine begins by generating spatial 3D ensembles of biomedical shape templates sampled randomly from a whole-body segmentation dataset. We then use a stochastic appearance model to synthesize intensity volumes from the layouts, as illustrated below:
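The appearance model can be pictured with a toy sketch like the one below: paint each label with random intensity statistics, then corrupt the result with a smooth bias field, blurring, and noise. The specific distributions and corruptions here are illustrative assumptions; the actual data engine applies a richer set of randomized transforms.

```python
# Toy sketch of a stochastic appearance model: given a synthetic label map,
# assign each label a random mean intensity, then add a smooth multiplicative
# bias field, blur, and noise. Distributions/ops are illustrative assumptions.
import numpy as np
from scipy.ndimage import gaussian_filter


def synthesize_intensity(label_map, rng=None):
    """label_map: int array [D, H, W] -> float intensity volume in [0, 1]."""
    rng = rng or np.random.default_rng()
    vol = np.zeros(label_map.shape, dtype=np.float32)
    for lab in np.unique(label_map):
        # Random mean intensity per label, with within-label jitter.
        mean, std = rng.uniform(0.0, 1.0), rng.uniform(0.0, 0.1)
        mask = label_map == lab
        vol[mask] = rng.normal(mean, std, size=mask.sum())
    # Smooth multiplicative bias field (low-frequency intensity nonuniformity).
    bias = gaussian_filter(rng.normal(0.0, 1.0, label_map.shape), sigma=20)
    vol *= np.exp(0.3 * bias / (np.abs(bias).max() + 1e-8))
    # Light blur and additive noise to mimic acquisition effects.
    vol = gaussian_filter(vol, sigma=rng.uniform(0.0, 1.0))
    vol += rng.normal(0.0, 0.02, vol.shape)
    # Rescale to [0, 1].
    vol -= vol.min()
    vol /= vol.max() + 1e-8
    return vol
```

Calling such a function twice on the same label map with different random seeds yields the appearance-varying, layout-matched pairs used for pretraining below.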
It is not entirely clear how we should train on these weird images. A semantic segmentation loss won't work because the labels have no consistent spatial or semantic structure across the dataset. A denoising loss would ignore the semantic supervision available from the label maps we already have.
Instead, we exploit our control over the synthetic data. By generating paired volumes that share spatial layouts but differ in appearance, we can pretrain a U-Net using a label-supervised patch contrastive loss. This encourages features from patches within the same label to be similar regardless of appearance, and distinct from features of other labels.
More concretely, our training procedure works as follows. We first sample a synthetic label map, from which we synthesize two intensity volumes. Each is processed by a shared U-Net, and at every iteration we sample random spatial indices at which to compute the contrastive loss, applied to several multi-scale decoder layers:
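In code, a simplified version of this loss looks like the sketch below, in the spirit of a supervised InfoNCE over sampled voxel features. It is a schematic of the idea rather than the exact training objective; in practice the loss is summed over features from several decoder scales, with the label map resized to match each scale.

```python
# Simplified sketch of the label-supervised contrastive step: two volumes that
# share a label map but differ in appearance are passed through a shared U-Net,
# features are sampled at random voxel indices, and a SupCon/InfoNCE-style loss
# pulls together same-label features across the two views.
import torch
import torch.nn.functional as F


def patch_contrastive_loss(feat1, feat2, label_map, n_samples=512, tau=0.1):
    """feat1, feat2: [C, D, H, W] decoder features of the two views;
    label_map: [D, H, W] shared synthetic labels."""
    C = feat1.shape[0]
    flat1 = feat1.reshape(C, -1)          # [C, V]
    flat2 = feat2.reshape(C, -1)
    labels = label_map.reshape(-1)        # [V]
    idx = torch.randint(0, labels.numel(), (n_samples,), device=labels.device)
    z1 = F.normalize(flat1[:, idx].t(), dim=1)   # [N, C] sampled, unit-norm features
    z2 = F.normalize(flat2[:, idx].t(), dim=1)
    lab = labels[idx]
    # Cross-view similarity matrix and same-label positive mask.
    sim = z1 @ z2.t() / tau                       # [N, N]
    pos = (lab[:, None] == lab[None, :]).float()
    # Supervised-contrastive style loss: put log-probability mass on same-label pairs.
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    loss = -(pos * log_prob).sum(dim=1) / pos.sum(dim=1).clamp(min=1)
    return loss.mean()
```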
@inproceedings{
dey2025learning,
title={Learning General-purpose Biomedical Volume Representations using Randomized Synthesis},
author={Neel Dey and Benjamin Billot and Hallee E. Wong and Clinton Wang and Mengwei Ren and Ellen Grant and Adrian V Dalca and Polina Golland},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=xOmC5LiVuN}
}