anatomix

Learning General-Purpose Biomedical Volume Representations using Randomized Synthesis

ICLR 2025

Neel Dey
MIT CSAIL
Benjamin Billot
MIT CSAIL
Hallee E. Wong
MIT CSAIL
Clinton J. Wang
MIT CSAIL
Mengwei Ren
New York University
P. Ellen Grant
Boston Children's Hospital and Harvard Medical School
Adrian V. Dalca
MIT CSAIL, MGH, and Harvard Medical School
Polina Golland
MIT CSAIL

anatomix is a general-purpose stable feature extractor for 3D volumes, trained entirely on synthetic data. It is highly shape-biased and roughly invariant to nuisance imaging variation. Given volumes from different modalities and poses (left), the output anatomix features (right) co-activate in similar ways.

TL;DR

  • Extract modality-agnostic 3D features for any biomedical imaging task.
  • A pretrained 3D U-Net that can be efficiently finetuned on any biomedical task.
  • No need for any dataset-specific pretraining.
  • SOTA 3D multi-modality registration & few-shot segmentation.

Abstract

Current volumetric biomedical foundation models struggle to generalize as public 3D datasets are small and do not cover the broad diversity of medical procedures, conditions, anatomical regions, and imaging protocols. We address this by creating a representation learning method that instead anticipates strong domain shifts at training time itself. We first propose a data engine that synthesizes highly variable training samples that would enable generalization to new biomedical contexts. To then train a single 3D network for any voxel-level task, we develop a contrastive learning method that pretrains the network to be stable against nuisance imaging variation simulated by the data engine, a key inductive bias for generalization. This network's features can be used as robust representations of input images for downstream tasks and its weights provide a strong, dataset-agnostic initialization for finetuning on new datasets. As a result, we set new standards across both multimodality registration and few-shot segmentation, a first for any 3D biomedical vision model, all without (pre-)training on any existing dataset of real images.

Results

(R1) Register any medical image across modalities using anatomix

Medical image registration is traditionally formulated as the optimization problem:

\[\mathcal{L}(\varphi) = d(I_{\text{input}}, I_{\text{target}} \circ \varphi) + \lambda \, \text{Reg}(\varphi)\]

where \(d(\cdot)\) is a similarity measure between images and \(\text{Reg}(\cdot)\) is a regularization term on the deformation field \(\varphi\).

However, in cross-modality registration tasks, images no longer have comparable intensities, making traditional similarity metrics like mean squared error inapplicable. To align images across modalities, we instead use modality-agnostic anatomix features. By aligning the extracted features instead of raw intensities, we simply reformulate the registration loss as:

\[\mathcal{L}(\varphi) = d(F(I_{\text{input}}), F(I_{\text{target}}) \circ \varphi) + \lambda \, \text{Reg}(\varphi)\]

where \(F(\cdot)\) is the anatomix feature extractor. One can use any existing registration solver (like ANTs) with any standard similarity measure, now applied to the features rather than the raw intensities.
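As a concrete sketch, the feature-space loss above can be written in a few lines of PyTorch. The `feature_extractor` argument stands in for the anatomix network; the MSE similarity, the displacement-field warp, and the diffusion regularizer are illustrative choices rather than the exact ones used in the paper.

```python
import torch
import torch.nn.functional as F


def warp(vol, disp):
    """Warp a (B, C, D, H, W) volume with a dense displacement field (B, 3, D, H, W),
    where displacements are expressed in the normalized [-1, 1] units used by grid_sample."""
    _, _, D, H, W = vol.shape
    zz, yy, xx = torch.meshgrid(
        torch.linspace(-1, 1, D), torch.linspace(-1, 1, H), torch.linspace(-1, 1, W),
        indexing="ij",
    )
    grid = torch.stack((xx, yy, zz), dim=-1).unsqueeze(0).to(vol)  # identity grid, (1, D, H, W, 3)
    return F.grid_sample(vol, grid + disp.permute(0, 2, 3, 4, 1), align_corners=True)


def feature_registration_loss(feature_extractor, img_input, img_target, disp, lam=1.0):
    """L(phi) = d(F(I_input), F(I_target) o phi) + lam * Reg(phi),
    with MSE as d and a displacement-gradient (diffusion) penalty as Reg."""
    with torch.no_grad():  # features do not depend on phi, so no gradients are needed here
        f_in = feature_extractor(img_input)
        f_tgt = feature_extractor(img_target)
    similarity = F.mse_loss(f_in, warp(f_tgt, disp))
    smoothness = sum(g.pow(2).mean() for g in torch.gradient(disp, dim=[2, 3, 4]))
    return similarity + lam * smoothness
```

In practice, one can also simply hand the extracted multi-channel feature volumes to an off-the-shelf solver in place of the raw images.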

Using any off-the-shelf solver, this approach yields SOTA unsupervised 3D multimodal image registration across multiple datasets, as below:

Quantitative registration results
(a) Dice boxplots for each method for Learn2Reg-AbdomenMRCT (left group) and MM-WHS (right group), with corresponding medians reported on top of each box and the mean percentages of voxels with folds produced by each method reported at the bottom; (b) Using anatomix features leads to consistent registration improvements at the subject level.

(R2) Efficiently finetune anatomix on just 1–3 labeled volumes

Qualitative segmentation results
Few-shot 3D segmentation qualitative results. All methods (columns 2–7) were finetuned on 3, 3, and 1 multi-label annotated volume(s) for each respective dataset (rows 1–3).

Self-supervised pretraining in medical imaging is in a weird place. You collect 100s–1000s of volumes, annotate a few of them, and spend months to a year developing a pretraining strategy for your data to get a 2–10% boost. You could have just spent that time annotating more data for better results (and many projects do not have vast unlabeled pools, either).

However, thanks to extensive pretraining on diverse synthetic data, anatomix provides a pretrained U-Net that works for any biomedical dataset. You can finetune its weights efficiently on just a few annotated volumes from any new domain, as sketched below.
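The following is a rough sketch of what such a finetuning loop can look like, assuming a generic pretrained 3D backbone, a data loader over the few annotated volumes, and illustrative hyperparameters (none of these names come from the anatomix codebase).

```python
import torch
import torch.nn as nn


def finetune_few_shot(backbone, loader, num_classes, feat_channels=16, lr=1e-4, epochs=200):
    """Finetune a pretrained 3D backbone on a handful of labeled volumes.

    `backbone` is any 3D network mapping (B, 1, D, H, W) -> (B, feat_channels, D, H, W),
    initialized from pretrained weights; `loader` yields (volume, label) pairs cropped
    from the 1-3 annotated training volumes. All hyperparameters are illustrative.
    """
    head = nn.Conv3d(feat_channels, num_classes, kernel_size=1)  # task-specific segmentation head
    model = nn.Sequential(backbone, head)
    opt = torch.optim.Adam(model.parameters(), lr=lr)            # finetune all weights end to end
    ce = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for vol, seg in loader:  # vol: (B, 1, D, H, W) float, seg: (B, D, H, W) long
            opt.zero_grad()
            loss = ce(model(vol), seg)
            loss.backward()
            opt.step()
    return model
```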

Compared to existing 3D foundation models, an anatomix-pretrained U-Net is a consistently strong initialization for arbitrarily chosen datasets, both qualitatively (top) and quantitatively (bottom).

Quantitative segmentation results
Few-shot 3D segmentation Dice means and their bootstrapped std. deviations. Bolding and underlining represent the best and second-best Dice, respectively.

Methods

  • Existing representation learning methods for medical images fail to generalize to new domains.
  • We propose a new approach that instead anticipates domain shifts at training time and exploits them to pretrain a stable feature extractor.
  • To enable this, we develop a data engine that synthesizes highly variable training samples, intentionally detached from any existing static biomedical context.
  • We then contrastively pretrain a single network on samples from this data engine to learn invariance to nuisance imaging variation across arbitrary domains.

(M1) A synthetic data engine for wildly variable but useful data

Samples from data engine

Annotated 3D medical image datasets are scarce. While GANs or DDPMs can synthesize new training samples, they are definitionally limited to reproducing their training distribution and cannot extrapolate to new domains. We instead generate samples that are not intended to be realistic, but rather to serve as diverse and useful training data for learning general tasks in arbitrary biomedical contexts.

Our engine begins by generating spatial 3D ensembles of biomedical shape templates sampled randomly from a whole-body segmentation dataset. We then use a stochastic appearance model to synthesize intensity volumes from the layouts, as illustrated below:



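To make the appearance step described above concrete, here is a deliberately simplified sketch: given a synthetic integer label map from the shape-sampling step, each label receives a random intensity, followed by voxel noise, a smooth bias field, and a random gamma. The actual data engine is considerably richer; the function name and parameter ranges here are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter


def synth_appearance(labels, rng=None):
    """Turn a synthetic 3D integer label map into one random intensity volume."""
    rng = rng or np.random.default_rng()

    # Random mean intensity per label, plus voxel-wise noise.
    means = rng.uniform(0.0, 1.0, size=int(labels.max()) + 1)
    vol = means[labels] + rng.normal(0.0, 0.05, size=labels.shape)

    # Smooth multiplicative bias field.
    bias = gaussian_filter(rng.normal(0.0, 1.0, size=labels.shape), sigma=16)
    vol = vol * np.exp(0.3 * bias / (np.abs(bias).max() + 1e-8))

    # Random gamma / contrast change, then normalize to [0, 1].
    vol = np.clip(vol, 0, None) ** rng.uniform(0.5, 2.0)
    return (vol - vol.min()) / (vol.max() - vol.min() + 1e-8)


# Two draws from the same label map share geometry but differ in appearance,
# which is exactly the paired structure used for contrastive pretraining (M2):
# view_a, view_b = synth_appearance(labels), synth_appearance(labels)
```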
(M2) Large-scale contrastive pretraining

It is not entirely clear how we should train on these weird images. A semantic segmentation loss won't work because the labels have no consistent spatial or semantic structure across the dataset. A denoising loss won't take advantage of the semantic supervision that the label maps already provide.

Instead, we exploit our control over the synthetic data. By generating paired volumes that share spatial layouts but differ in appearance, we can pretrain a U-Net using a label-supervised patch contrastive loss. This encourages features from patches within the same label to be similar regardless of appearance, and distinct from features of other labels.

More concretely, our training procedure is shown below. We first sample a synthetic label map, from which we synthesize two intensity volumes. Each volume is then processed by a shared U-Net, and at each iteration we sample random spatial indices at which to compute the contrastive loss. We apply this loss to several multi-scale decoder layers.
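A simplified, single-scale sketch of such a label-supervised patch contrastive loss is below, assuming decoder features from the two appearance views and their shared synthetic label map; the paper applies this idea at multiple decoder scales, and its exact formulation may differ.

```python
import torch
import torch.nn.functional as F


def patch_contrastive_loss(feats_a, feats_b, labels, n_samples=512, tau=0.1):
    """Label-supervised contrastive loss over randomly sampled voxels.

    feats_a, feats_b: (C, D, H, W) features from the two appearance views of the same
    synthetic label map; labels: (D, H, W) integer label map. Voxels sharing a label are
    pulled together across views, voxels from other labels are pushed apart.
    """
    C = feats_a.shape[0]
    idx = torch.randint(0, labels.numel(), (n_samples,))   # random spatial indices
    fa = feats_a.reshape(C, -1)[:, idx].t()                 # (n_samples, C)
    fb = feats_b.reshape(C, -1)[:, idx].t()
    lab = labels.reshape(-1)[idx]

    fa, fb = F.normalize(fa, dim=1), F.normalize(fb, dim=1)
    logits = fa @ fb.t() / tau                              # cross-view cosine similarities
    pos = (lab[:, None] == lab[None, :]).float()            # same-label pairs are positives

    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    return -(pos * log_prob).sum(1).div(pos.sum(1).clamp(min=1)).mean()
```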

BibTeX

@inproceedings{dey2025learning,
  title={Learning General-purpose Biomedical Volume Representations using Randomized Synthesis},
  author={Neel Dey and Benjamin Billot and Hallee E. Wong and Clinton Wang and Mengwei Ren and Ellen Grant and Adrian V Dalca and Polina Golland},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=xOmC5LiVuN}
}