Point of Care

MobileFetalCLIP: Selective Repulsive KD for Mobile Fetal Ultrasound Analysis

Numan Saeed¹* · Fadillah Adamsyah Maani¹ · Mohammad Yaqub¹

¹Computer Vision Department, Mohamed bin Zayed University of AI (MBZUAI), Abu Dhabi, UAE

* Corresponding author

Bringing expert-level fetal ultrasound AI to handheld devices. 26× smaller, 24× faster, and more accurate than the teacher.

Prototype Demo SonoSight

Watch a 25-second in-app walkthrough of SonoSight, our mobile experience built around MobileFetalCLIP for on-device fetal ultrasound AI.

Clinical Focus Biometry validity and brain sub-plane assistance

Deployment Target Real-time inference on handheld and phone-class hardware

Validation Zero-shot gains over the FetalCLIP teacher on fetal benchmarks

Deployment View

Designed around handheld workflows where latency, memory footprint, and clarity matter as much as benchmark performance.

HC18 Validity

+5.1pp over teacher

Brain Sub-plane F1

+8.2pp over teacher

1.6 ms

iPhone 16 Pro Latency

24× faster than teacher

Visual Parameters

26× fewer than teacher

SonoSight Demo

See the Mobile App Running the Model

25-Second Walkthrough

SonoSight on iPhone

SonoSight is the mobile experience built on top of MobileFetalCLIP for real-time fetal ultrasound assistance. This short prototype demo shows the app flow, model-driven guidance, and the kind of handheld interaction the project is targeting.

Runs the released MobileFetalCLIP model stack in a handheld workflow
Shows the product direction beyond benchmark tables and paper figures
Matches the point-of-care deployment story presented in the paper

Abstract

Bringing Fetal Ultrasound AI to the Point of Care

Prenatal Care

Low-Resource Settings

Mobile Deployment

SOTA Results

Fetal ultrasound AI could transform prenatal care in low-resource settings, yet current foundation models exceed 300M visual parameters, precluding deployment on point-of-care devices. Standard knowledge distillation fails under such extreme capacity gaps (~26×), as compact students waste capacity mimicking architectural artifacts of oversized teachers.

We introduce Selective Repulsive Knowledge Distillation, which decomposes contrastive KD into diagonal and off-diagonal components: matched pair alignment is preserved while the off-diagonal weight decays into negative values, repelling the student from the teacher's inter-class confusions and forcing discovery of architecturally native features.

Our 11.4M parameter student surpasses the 304M-parameter FetalCLIP teacher on zero-shot HC18 biometry validity (88.6% vs. 83.5%) and brain sub-plane F1 (0.784 vs. 0.702), while running at 1.6 ms on iPhone 16 Pro, enabling real-time assistive AI on handheld ultrasound devices.

Contributions

Key Contributions

Selective Repulsive KD

A novel architecture-agnostic method that decomposes contrastive KD into diagonal (matched-pair) and off-diagonal (non-target) components. Repulsion is applied only to the off-diagonal component, while matched-pair alignment is preserved.

MobileFetalCLIP Model

A mobile-scale vision-language model (75M total, 11.4M visual parameters) that surpasses the 427M FetalCLIP teacher on HC18 validity (+5.1pp) and brain sub-plane F1 (+8.2pp), while retaining 97–98% of linear probing performance.

Mechanistic Analysis

Comprehensive analysis via embedding geometry, logit distributions, and controlled ablations demonstrating that Selective Repulsive KD produces structured decorrelation, with a silhouette score 40% higher than static KD (0.525 vs. 0.375).

Method

Selective Repulsive Knowledge Distillation

Overcoming the 26× capacity gap by learning what the teacher doesn't know

The Problem

Standard KD forces a small student to closely mimic a massive teacher (304M visual parameters). At a 26× capacity gap, the student wastes parameters learning the teacher's architectural artifacts (ViT-specific self-attention confusions) instead of discriminative medical features.

Our Solution

During the Repulsive Phase, the off-diagonal loss weight β(t) becomes negative. Instead of copying the teacher's mistakes, the student is actively repelled from the teacher's confusion patterns, forcing discovery of native local-texture features.

Phase 1: Attractive Phase
β(t) > 0: Student absorbs domain knowledge from teacher's similarity structure

Phase 2: Transition
β(t) ≈ 0: KD term contributes negligibly; student driven by L_CLIP objective

Phase 3: Repulsive Phase
β(t) < 0: Gradient inverts. Student learns to separate classes differently from the teacher
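
The three phases can be illustrated with a small schedule sketch. The page does not give the functional form of β(t), so the linear decay, the end value `beta_end`, the tolerance `eps`, and the function names `beta_schedule` and `phase` are all assumptions made for illustration. Only the starting weight β₀ = 2 comes from the results table.

```python
def beta_schedule(step, total_steps, beta0=2.0, beta_end=-1.0):
    """Hypothetical linear decay of the off-diagonal KD weight.

    beta0=2.0 matches the beta_0 reported in the results table. The linear
    shape and beta_end are illustrative assumptions, not the authors' schedule.
    """
    if total_steps <= 1:
        return beta0
    # Fraction of training completed, clamped to [0, 1].
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return beta0 + frac * (beta_end - beta0)


def phase(beta, eps=0.1):
    """Map a weight to the phase names used above (eps is an assumed tolerance)."""
    if beta > eps:
        return "attractive"
    if beta < -eps:
        return "repulsive"
    return "transition"


if __name__ == "__main__":
    total = 10
    for step in range(total):
        b = beta_schedule(step, total)
        print(f"step {step}: beta={b:+.2f} -> {phase(b)}")
```

Under these assumptions the weight starts in the attractive phase, passes through a short transition band around zero, and ends in the repulsive phase.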

Method Overview: Selective Repulsive Knowledge Distillation

MobileFetalCLIP method overview showing paired ultrasound inputs, frozen teacher and trainable student encoders, teacher and student similarity matrices, selective repulsive knowledge distillation, and the attractive-to-repulsive phase schedule

Overview of the training signal used in MobileFetalCLIP. The frozen FetalCLIP teacher and trainable FastViT student produce teacher and student similarity matrices, which are compared through a diagonal-protected decomposition: the matched-pair (diagonal) term is preserved while the off-diagonal term is scheduled from attractive to repulsive, encouraging architecturally native fetal ultrasound representations.
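
The diagonal-protected decomposition can be sketched on toy similarity matrices. This is a minimal illustration under stated assumptions, not the authors' loss: squared error stands in for whatever divergence the paper uses, and `selective_kd_loss` and `alpha` (a fixed matched-pair weight) are hypothetical names.

```python
def selective_kd_loss(student_sim, teacher_sim, beta, alpha=1.0):
    """Toy diagonal/off-diagonal split of a similarity-matching KD loss.

    Assumptions: squared error replaces the paper's (unstated here) divergence,
    and alpha is a hypothetical fixed weight on the matched-pair term.
    Returns (total, diagonal_term, off_diagonal_term).
    """
    n = len(student_sim)
    diag = 0.0
    off = 0.0
    for i in range(n):
        for j in range(n):
            err = (student_sim[i][j] - teacher_sim[i][j]) ** 2
            if i == j:
                diag += err   # matched image-text pair
            else:
                off += err    # non-target pair
    diag /= n
    off /= max(n * (n - 1), 1)
    # The diagonal term always attracts; the sign of beta decides whether
    # the off-diagonal term attracts (beta > 0) or repels (beta < 0).
    return alpha * diag + beta * off, diag, off


if __name__ == "__main__":
    teacher = [[0.9, 0.4], [0.5, 0.8]]
    student = [[0.7, 0.1], [0.1, 0.6]]
    for beta in (2.0, 0.0, -1.0):
        total, diag, off = selective_kd_loss(student, teacher, beta)
        print(f"beta={beta:+.1f}: total={total:.3f} (diag={diag:.3f}, off={off:.3f})")
```

With beta < 0, a larger off-diagonal mismatch lowers the total, which corresponds to the repulsion described above, while the diagonal term still pulls matched pairs toward the teacher.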

Results

Surpassing the Teacher at 26× Fewer Parameters

Zero-shot evaluation on fetal ultrasound benchmarks

HC18 Biometry Validity (%) — Zero-Shot

MobileFetalCLIP (Ours) · 75M · FastViT: 88.6%
FetalCLIP Teacher · 427M · ViT-L/14: 83.5%
Static KD Baseline · 75M: 79.4%
BiomedCLIP · 150M · ViT-B/16: 24.0%
CLIP · 427M · ViT-L/14: 11.0%

Full Zero-Shot Comparison on Fetal Ultrasound Benchmarks

Model	Params	HC18 (%)	F1-5Plane	F1-3Brain	F1-all
Teacher
FetalCLIP (ViT-L/14)	427M	83.5	0.973	0.702	0.871
General VLMs (not fetal-specific)
CLIP (ViT-L/14)	427M	11.0	0.308	0.206	0.270
BiomedCLIP (ViT-B/16)	150M	24.0	0.603	0.236	0.466
UniMed-CLIP (ViT-B/16)	150M	9.0	0.679	0.187	0.495
MobileFetalCLIP variants (FastViT, 75M total)
No KD (CLIP only)	75M	71.3	0.889	0.712	0.823
Static Logit KD (CLIP-KD baseline)	75M	79.4	0.946	0.715	0.859
Coupled Repulsive KD (r=−0.8)	75M	84.4	0.933	0.763	0.869
Selective Repulsive KD (Ours; β₀=2, r=−0.8)	75M	88.6	0.946	0.784	0.886

Feature Space Analysis: t-SNE Projections of Brain Sub-plane Embeddings

t-SNE visualizations comparing No KD, Static KD, and Selective Repulsive KD brain sub-plane cluster separation

(a) No KD: Overlapping transthalamic/transventricular clusters. (b) Static KD: Marginal improvement. (c) Selective Repulsive KD: Well-separated, compact clusters consistent with the +8.2pp F1-3Brain gain over the teacher (silhouette score +40% over static KD).

Embedding Geometry on Planes DB (5-plane, 8,187 images)

Ablation Study Results for FetalCLIP Knowledge Distillation
Method	d_eff ↑	Silhouette ↑	Intra ↑	Inter ↓	Uniformity ↓
Static KD (λ=1.0)	8.0	0.375	0.712	0.445	−1.662
Confidence Penalty	9.0	0.406	0.693	0.389	−1.811
Coupled r=−0.8	6.4	0.509	0.645	0.010	−2.231
Selective β₀=2 (Ours)	10.0	0.525	0.623	0.076	−2.308
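
The page does not define the Uniformity column. Assuming it follows the definition common in the contrastive-learning literature (the log of the mean Gaussian potential over pairs of L2-normalized embeddings, with t = 2), a minimal sketch looks like the following. The function name `uniformity` and the toy inputs are illustrative, and more negative values mean embeddings are spread more evenly on the unit sphere.

```python
import math


def uniformity(embeddings, t=2.0):
    """log mean_{i<j} exp(-t * ||x_i - x_j||^2) over unit-normalized vectors.

    Assumes the common definition of the metric; the page itself does not say
    which formula was used. Inputs must be non-zero vectors.
    """
    unit = []
    for v in embeddings:
        norm = math.sqrt(sum(x * x for x in v))
        unit.append([x / norm for x in v])
    total, pairs = 0.0, 0
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            sq_dist = sum((a - b) ** 2 for a, b in zip(unit[i], unit[j]))
            total += math.exp(-t * sq_dist)
            pairs += 1
    return math.log(total / pairs)


if __name__ == "__main__":
    collapsed = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]   # same direction
    spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
    print("collapsed:", uniformity(collapsed))
    print("spread:   ", uniformity(spread))
```

Collapsed embeddings score near zero and well-spread embeddings score lower, which matches the direction of the ↓ arrow in the table.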

On-Device Performance

Real-Time AI at the Point of Care

32× fewer GMACs · 26× fewer parameters · 24× lower latency on iPhone 16 Pro

iPhone 16 Pro

1.6 ms

MobileFetalCLIP

37.6 ms

FetalCLIP Teacher

24× speedup

>600 FPS, 10–20× the 30–60 FPS frame rate of diagnostic ultrasound
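
The frame-rate claim follows directly from the reported 1.6 ms latency. The short calculation below is only a worked check; the helper name `fps_from_latency_ms` is illustrative.

```python
def fps_from_latency_ms(latency_ms):
    """Convert a per-frame latency in milliseconds to frames per second."""
    return 1000.0 / latency_ms


if __name__ == "__main__":
    fps = fps_from_latency_ms(1.6)  # iPhone 16 Pro latency reported above
    print(f"throughput: {fps:.0f} FPS")
    # Headroom relative to the 30-60 FPS range quoted for diagnostic ultrasound.
    for stream_fps in (60, 30):
        print(f"vs {stream_fps} FPS stream: {fps / stream_fps:.1f}x")
```

At 1.6 ms per frame the model could process about 625 frames per second, roughly 10× a 60 FPS stream and 21× a 30 FPS stream.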

iPhone 14

3.8 ms

MobileFetalCLIP

OUT OF MEMORY

FetalCLIP Teacher

The teacher runs out of memory on iPhone 14, while MobileFetalCLIP runs in 3.8 ms

Inference Efficiency Comparison

Mobile Device Efficiency and Speed Metrics
Model	Visual Params	GMACs	iPhone 14 Latency	iPhone 16 Pro Latency
FetalCLIP (Teacher)	304M	49.4G	OOM	37.6 ms
Static KD (Baseline)	11.4M	1.5G	3.8 ms	1.6 ms
MobileFetalCLIP (Ours)	11.4M	1.5G	3.8 ms	1.6 ms

32× fewer GMACs

26× fewer params

24× faster on device
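
The three headline factors can be rechecked from the efficiency table. The dictionary layout and the name `reduction_factors` are illustrative; the numbers are the table's teacher and MobileFetalCLIP rows.

```python
# Values copied from the inference efficiency table above.
teacher = {"visual_params_m": 304.0, "gmacs": 49.4, "iphone16pro_ms": 37.6}
student = {"visual_params_m": 11.4, "gmacs": 1.5, "iphone16pro_ms": 1.6}


def reduction_factors(big, small):
    """Ratio of teacher to student for every shared metric."""
    return {key: big[key] / small[key] for key in big}


if __name__ == "__main__":
    for key, value in reduction_factors(teacher, student).items():
        print(f"{key}: {value:.1f}x")
```

This gives roughly 26.7×, 32.9×, and 23.5×, which the page reports as 26×, 32×, and 24×.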

Linear Probing: Frozen Feature Quality

MobileFetalCLIP retains 97–98% of the FetalCLIP teacher's linear probing performance at 26× fewer visual parameters. Protocol: frozen encoder plus a single linear layer, 5-fold × 5 seeds, 95% CI.

Model	6-View F1	Brain F1	CHD AUROC
CLIP (ViT-L/14)	.867	.634	.679
BiomedCLIP (ViT-B/16)	.856	.582	.643
UniMed-CLIP (ViT-B/16)	.860	.607	.718
FetalCLIP (ViT-L/14)	.947	.820	.787
MobileFetalCLIP (FastViT)	.930 (98.2% of teacher)	.799 (97.4% of teacher)	.769 (97.7% of teacher)
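
The retention percentages in the last row are the student score divided by the teacher score for each task. A small check, with `retention_percent` as an illustrative helper name:

```python
# Linear probing scores copied from the table above.
teacher = {"6-View F1": 0.947, "Brain F1": 0.820, "CHD AUROC": 0.787}
student = {"6-View F1": 0.930, "Brain F1": 0.799, "CHD AUROC": 0.769}


def retention_percent(student_scores, teacher_scores):
    """Student score as a percentage of the teacher score, to one decimal."""
    return {
        task: round(100.0 * student_scores[task] / teacher_scores[task], 1)
        for task in teacher_scores
    }


if __name__ == "__main__":
    for task, pct in retention_percent(student, teacher).items():
        print(f"{task}: {pct}% of teacher")
```

The results, 98.2%, 97.4%, and 97.7%, agree with the 97–98% range quoted above.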

Citation

Cite This Work

BibTeX

@article{saeed2026mobilefetalclip,
  title     = {MobileFetalCLIP: Selective Repulsive Knowledge Distillation
               for Mobile Fetal Ultrasound Analysis},
  author    = {Saeed, Numan and Maani, Fadillah Adamsyah and Yaqub, Mohammad},
  journal   = {arXiv preprint arXiv:2603.05421},
  year      = {2026}
}