MobileFetalCLIP: Selective Repulsive KD for Mobile Fetal Ultrasound Analysis
Bringing expert-level fetal ultrasound AI to handheld devices. 26× smaller, 24× faster, and more accurate than the teacher.
Designed around handheld workflows where latency, memory footprint, and clarity matter as much as benchmark performance.
Bringing Fetal Ultrasound AI to the Point of Care
Fetal ultrasound AI could transform prenatal care in low-resource settings, yet current foundation models exceed 300M visual parameters, precluding deployment on point-of-care devices. Standard knowledge distillation fails under such extreme capacity gaps (~26×), as compact students waste capacity mimicking architectural artifacts of oversized teachers.
We introduce Selective Repulsive Knowledge Distillation, which decomposes contrastive KD into diagonal and off-diagonal components: matched pair alignment is preserved while the off-diagonal weight decays into negative values, repelling the student from the teacher's inter-class confusions and forcing discovery of architecturally native features.
Our 11.4M parameter student surpasses the 304M-parameter FetalCLIP teacher on zero-shot HC18 biometry validity (88.6% vs. 83.5%) and brain sub-plane F1 (0.784 vs. 0.702), while running at 1.6 ms on iPhone 16 Pro, enabling real-time assistive AI on handheld ultrasound devices.
Key Contributions
Selective Repulsive KD
A novel, architecture-agnostic method that decomposes contrastive KD into diagonal (matched-pair) and off-diagonal (non-target) components. Repulsion is applied only to the off-diagonal component, while matched-pair alignment is preserved.
MobileFetalCLIP Model
A mobile-scale vision-language model (75M total, 11.4M visual parameters) that surpasses the FetalCLIP teacher (427M total, 304M visual parameters) on HC18 validity (+5.1pp) and brain sub-plane F1 (+8.2pp), while retaining 97–98% of linear probing performance.
Mechanistic Analysis
Comprehensive analysis via embedding geometry, logit distributions, and controlled ablations, demonstrating that Selective Repulsive KD produces structured decorrelation: the silhouette score rises 40% over static KD.
Selective Repulsive Knowledge Distillation
Overcoming the 26× capacity gap by learning what the teacher doesn't know
The Problem
Standard KD forces a small student to strictly mimic a massive teacher (304M params). At a 26× capacity gap, the student wastes parameters learning the teacher's architectural artifacts (ViT-specific self-attention confusions) instead of discriminative medical features.
Our Solution
During the Repulsive Phase, the off-diagonal loss weight β(t) becomes negative. Instead of copying the teacher's mistakes, the student is actively repelled from the teacher's confusion patterns, forcing discovery of native local-texture features.
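The sign change of β(t) can be illustrated with a small schedule function. This is only a sketch: the page does not state the schedule's shape or endpoints, so the linear interpolation, the function name `beta_schedule`, and the values `beta_start=1.0` and `beta_end=-0.5` are placeholder assumptions, not the authors' settings.

```python
def beta_schedule(step, total_steps, beta_start=1.0, beta_end=-0.5):
    """Hypothetical linear schedule for the off-diagonal weight beta(t).

    Starts attractive (positive) and ends repulsive (negative).
    The endpoints and the linear shape are illustrative assumptions.
    """
    if total_steps <= 1:
        return float(beta_end)
    # Fraction of training completed, clamped to [0, 1].
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return beta_start + frac * (beta_end - beta_start)


total_steps = 10
for step in range(total_steps):
    beta = beta_schedule(step, total_steps)
    phase = "attractive" if beta > 0 else "repulsive"
    print(f"step {step}: beta = {beta:+.3f} ({phase})")
```

Under these placeholder values the weight crosses zero two thirds of the way through training; any monotone decay that ends below zero would show the same attractive-then-repulsive behaviour.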
Method Overview: Selective Repulsive Knowledge Distillation
Overview of the training signal used in MobileFetalCLIP. The frozen FetalCLIP teacher and trainable FastViT student produce teacher and student similarity matrices, which are compared through a diagonal-protected decomposition: matched pairs remain fixed while the off-diagonal term is scheduled from attractive to repulsive to encourage architecturally native fetal ultrasound representations.
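One plausible way to write such a diagonal-protected decomposition is sketched below in NumPy. The exact loss is not given on this page, so everything here is an assumption for illustration: the row-wise KL(teacher || student) form, the names `selective_kd_loss`, `alpha`, and `beta`, and the random logits standing in for real image-text similarity matrices.

```python
import numpy as np


def log_softmax(logits):
    """Row-wise log-softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def selective_kd_loss(student_logits, teacher_logits, beta, alpha=1.0):
    """Weighted KL(teacher || student) over a (B, B) similarity matrix.

    alpha weights the diagonal (matched-pair) entries, beta the off-diagonal
    ones. alpha = beta = 1 recovers the ordinary row-wise KL distillation
    loss; a negative beta turns the off-diagonal part into a repulsive term.
    """
    log_p_t = log_softmax(teacher_logits)
    log_p_s = log_softmax(student_logits)
    # Per-entry contributions to the KL divergence, shape (B, B).
    kl_terms = np.exp(log_p_t) * (log_p_t - log_p_s)
    batch = kl_terms.shape[0]
    diag_mask = np.eye(batch, dtype=bool)
    diag_term = kl_terms[diag_mask].sum() / batch
    off_term = kl_terms[~diag_mask].sum() / batch
    return alpha * diag_term + beta * off_term, diag_term, off_term


# Random stand-ins for teacher and student similarity logits (assumption).
rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(4, 4))
student_logits = rng.normal(size=(4, 4))

attract, diag_term, off_term = selective_kd_loss(student_logits, teacher_logits, beta=1.0)
repel, _, _ = selective_kd_loss(student_logits, teacher_logits, beta=-0.5)
print(f"beta=+1.0: {attract:.4f}   beta=-0.5: {repel:.4f}")
```

In this sketch the diagonal term never depends on `beta`, which is the "selective" part: only the off-diagonal contribution changes sign when the weight goes negative.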
Surpassing the Teacher at 26× Fewer Parameters
Zero-shot evaluation on fetal ultrasound benchmarks
HC18 Biometry Validity (%), Zero-Shot
Full Zero-Shot Comparison on Fetal Ultrasound Benchmarks
| Model | Params (total) | HC18 validity (%) | F1-5Plane | F1-3Brain | F1-all |
|---|---|---|---|---|---|
| Teacher | | | | | |
| FetalCLIP (ViT-L/14) | 427M | 83.5 | 0.973 | 0.702 | 0.871 |
| General VLMs (not fetal-specific) | | | | | |
| CLIP (ViT-L/14) | 427M | 11.0 | 0.308 | 0.206 | 0.270 |
| BiomedCLIP (ViT-B/16) | 150M | 24.0 | 0.603 | 0.236 | 0.466 |
| UniMed-CLIP (ViT-B/16) | 150M | 9.0 | 0.679 | 0.187 | 0.495 |
| MobileFetalCLIP variants (FastViT, 75M total) | | | | | |
| No KD (CLIP only) | 75M | 71.3 | 0.889 | 0.712 | 0.823 |
| Static Logit KD (CLIP-KD baseline) | 75M | 79.4 | 0.946 | 0.715 | 0.859 |
| Coupled Repulsive KD (r=−0.8) | 75M | 84.4 | 0.933 | 0.763 | 0.869 |
| Selective Repulsive KD (Ours; β=2, r=−0.8) | 75M | 88.6 | 0.946 | 0.784 | 0.886 |
Feature Space Analysis: t-SNE Projections of Brain Sub-plane Embeddings
(a) No KD: Overlapping transthalamic/transventricular clusters. (b) Static KD: Marginal improvement. (c) Selective Repulsive KD: Well-separated, compact clusters consistent with the +8.2pp F1-3Brain gain over the teacher (silhouette score +40% over static KD).
Embedding Geometry on Planes DB (5-plane, 8,187 images)
| Method | d_eff | Silhouette | Intra | Inter | Uniformity |
|---|---|---|---|---|---|
| Static KD (λ=1.0) | 8.0 | 0.375 | 0.712 | 0.445 | −1.662 |
| Confidence Penalty | 9.0 | 0.406 | 0.693 | 0.389 | −1.811 |
| Coupled r=−0.8 | 6.4 | 0.509 | 0.645 | 0.010 | −2.231 |
| Selective β=2 (Ours) | 10.0 | 0.525 | 0.623 | 0.076 | −2.308 |
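For readers who want to compute comparable geometry statistics on their own embeddings, the sketch below shows two common definitions. The page does not say which definitions produced the table, so both are assumptions: uniformity as the log of the mean Gaussian potential over pairs of L2-normalised embeddings (with t = 2), and effective dimension as the participation ratio of the covariance eigenvalues. The function names and the random embeddings are illustrative only.

```python
import numpy as np


def uniformity(z, t=2.0):
    """Log of the mean exp(-t * squared distance) over distinct pairs.

    Embeddings are L2-normalised first. The value is at most 0, and more
    negative means the points are spread more evenly on the hypersphere.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sq_dists = ((z[:, None, :] - z[None, :, :]) ** 2).sum(-1)
    upper = np.triu_indices(len(z), k=1)  # each unordered pair once
    return float(np.log(np.exp(-t * sq_dists[upper]).mean()))


def effective_dim(z):
    """Participation ratio of covariance eigenvalues: (sum l)^2 / sum l^2."""
    centered = z - z.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (len(z) - 1)
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)  # drop tiny negatives
    return float(eig.sum() ** 2 / (eig ** 2).sum())


# Random stand-in embeddings (assumption): 200 samples, 16 dimensions.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 16))
print("uniformity:", uniformity(embeddings))
print("effective dimension:", effective_dim(embeddings))
```

A silhouette score could be computed on the same embeddings and class labels with scikit-learn's `silhouette_score`; whether the table used that implementation is not stated here.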
Real-Time AI at the Point of Care
32× fewer GMACs · 26× fewer parameters · 24× lower latency on iPhone 16 Pro
Inference Efficiency Comparison
| Model | Params | GMACs | iPhone 14 | iPhone 16 Pro |
|---|---|---|---|---|
| FetalCLIP (Teacher) | 304M | 49.4G | OOM | 37.6 ms |
| Static KD (Baseline) | 11.4M | 1.5G | 3.8 ms | 1.6 ms |
| MobileFetalCLIP (Ours) | 11.4M | 1.5G | 3.8 ms | 1.6 ms |
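The headline ratios can be reproduced from the table with a few lines of arithmetic. The snippet below only divides the teacher row by the MobileFetalCLIP row; the dictionary keys are names chosen for this example.

```python
# Values copied from the efficiency table (iPhone 16 Pro latency column).
teacher = {"params_m": 304.0, "gmacs": 49.4, "latency_ms": 37.6}
student = {"params_m": 11.4, "gmacs": 1.5, "latency_ms": 1.6}

# Teacher-to-student ratio for each quantity.
ratios = {key: teacher[key] / student[key] for key in teacher}
for key, ratio in ratios.items():
    print(f"{key}: {ratio:.1f}x")
```

The unrounded ratios come out to about 26.7 for parameters, 32.9 for GMACs, and 23.5 for latency, so the quoted 26×, 32×, and 24× are rounded versions of these.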
Linear Probing: Frozen Feature Quality
MobileFetalCLIP retains 97–98% of the FetalCLIP teacher's linear probing performance with 26× fewer visual parameters. Protocol: frozen encoder plus a single linear layer, 5-fold × 5 seeds, 95% CI.
| Model | 6-View F1 | Brain F1 | CHD AUROC |
|---|---|---|---|
| CLIP (ViT-L/14) | .867 | .634 | .679 |
| BiomedCLIP (ViT-B/16) | .856 | .582 | .643 |
| UniMed-CLIP (ViT-B/16) | .860 | .607 | .718 |
| FetalCLIP (ViT-L/14) | .947 | .820 | .787 |
| MobileFetalCLIP (FastViT) | .930 (98.2%) | .799 (97.4%) | .769 (97.7%) |
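The percentages in the MobileFetalCLIP row express each score as a share of the FetalCLIP teacher's score. A short check, using only the two relevant table rows (the variable names are chosen for this example):

```python
# Linear probing scores copied from the table above.
teacher = {"6-View F1": 0.947, "Brain F1": 0.820, "CHD AUROC": 0.787}
student = {"6-View F1": 0.930, "Brain F1": 0.799, "CHD AUROC": 0.769}

# Student score as a percentage of the teacher score, per task.
retention = {task: 100.0 * student[task] / teacher[task] for task in teacher}
for task, pct in retention.items():
    print(f"{task}: {pct:.1f}% of teacher")
```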
Cite This Work
@article{saeed2026mobilefetalclip,
title = {MobileFetalCLIP: Selective Repulsive Knowledge Distillation
for Mobile Fetal Ultrasound Analysis},
author = {Saeed, Numan and Maani, Fadillah Adamsyah and Yaqub, Mohammad},
journal = {arXiv preprint arXiv:2603.05421},
year = {2026}
}