GENEB: Why Genomic Foundation Models Are So Hard to Compare

Quick answer

GENEB evaluates frozen representations from 40 genomic foundation models on 100 downstream tasks spanning 13 functional categories. The headline finding is that there is no stable winner: a model that tops one category can sit mid-pack in another, so a single aggregate leaderboard hides more than it shows. Scaling up parameters yields only modest and inconsistent gains, and matching architecture and pretraining data to the task type frequently outweighs raw model size.

What GENEB actually measures

GENEB is a diagnostic benchmark, not a single accuracy number. Every model is kept frozen and probed with a unified protocol: extract embeddings from the pretrained encoder, then fit a lightweight probe on top for each task. Because the probe is the only component that trains, score differences reflect what the pretrained representation already encodes, not how well someone tuned a downstream head. The 100 tasks are grouped into 13 functional categories (regulatory elements, splicing, chromatin, variant effects, and others), so you can read a model’s profile across biology rather than collapsing it to one mean.
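To make the frozen-probing idea concrete, here is a minimal sketch of the two steps described above. Everything in it is an assumption for illustration: `embed` is a hypothetical stand-in for a frozen encoder (plain 3-mer frequencies, not a real genomic model), the task is a synthetic GC-rich versus AT-rich toy problem, and the closed-form ridge probe is just one possible lightweight probe, not necessarily the one GENEB uses.

```python
import itertools
import numpy as np

KMERS = ["".join(p) for p in itertools.product("ACGT", repeat=3)]
KMER_INDEX = {kmer: i for i, kmer in enumerate(KMERS)}

def embed(sequences):
    """Hypothetical frozen encoder: maps each DNA string to 3-mer frequencies.

    Nothing in this function is trained, which is the point of frozen probing.
    """
    out = np.zeros((len(sequences), len(KMERS)))
    for row, seq in enumerate(sequences):
        for i in range(len(seq) - 2):
            out[row, KMER_INDEX[seq[i:i + 3]]] += 1
        out[row] /= max(len(seq) - 2, 1)
    return out

def fit_probe(X, y, l2=1e-2):
    """Closed-form ridge probe on +/-1 targets; the only part that trains."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    targets = 2.0 * y - 1.0
    A = Xb.T @ Xb + l2 * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ targets)

def predict(weights, X):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return (Xb @ weights > 0).astype(int)

def make_toy_task(n=200, length=60, seed=0):
    """Synthetic task (assumption): class 1 is GC-rich, class 0 is AT-rich."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    probs = {0: [0.4, 0.1, 0.1, 0.4], 1: [0.1, 0.4, 0.4, 0.1]}  # order: A, C, G, T
    seqs = ["".join(rng.choice(list("ACGT"), size=length, p=probs[int(c)]))
            for c in labels]
    return seqs, labels

seqs, y = make_toy_task()
X = embed(seqs)                        # embeddings computed once, encoder untouched
w = fit_probe(X[:150], y[:150])        # train the probe on the first 150 examples
accuracy = float((predict(w, X[150:]) == y[150:]).mean())
print(f"probe accuracy: {accuracy:.2f}")
```

Swapping in a different encoder only changes `embed`; the probe and the data stay fixed, which is what makes scores comparable across models.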

The suite also includes few-shot regimes. That matters because most genomics labs do not have thousands of labeled examples per task; the realistic question is how good a frozen representation is when you can only afford a handful of labels.
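A few-shot regime can be sketched as sampling exactly k labeled examples per class, fitting a probe on those, and repeating over several draws so the variance is visible. The code below is an illustrative sketch under stated assumptions: the embeddings are synthetic Gaussian blobs rather than real model outputs, the nearest-centroid probe is a stand-in chosen for brevity, and the shot counts (1, 5, 20) are arbitrary, not GENEB's settings.

```python
import numpy as np

def sample_k_shot(labels, k, rng):
    """Return indices containing exactly k examples of each class."""
    picked = []
    for cls in np.unique(labels):
        pool = np.flatnonzero(labels == cls)
        picked.append(rng.choice(pool, size=k, replace=False))
    return np.concatenate(picked)

def centroid_probe_accuracy(X_train, y_train, X_test, y_test):
    """Nearest-centroid probe (a stand-in for any lightweight probe)."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    dists = ((X_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float((classes[dists.argmin(axis=1)] == y_test).mean())

def few_shot_curve(X, y, X_test, y_test, shots=(1, 5, 20), repeats=10, seed=0):
    """Mean and standard deviation of probe accuracy for each shot count."""
    rng = np.random.default_rng(seed)
    curve = {}
    for k in shots:
        accs = []
        for _ in range(repeats):
            idx = sample_k_shot(y, k, rng)
            accs.append(centroid_probe_accuracy(X[idx], y[idx], X_test, y_test))
        curve[k] = (float(np.mean(accs)), float(np.std(accs)))
    return curve

# Synthetic "frozen embeddings" (assumption): two overlapping Gaussian classes.
rng = np.random.default_rng(1)
y_all = np.repeat([0, 1], 150)
X_all = rng.normal(size=(300, 16)) + 0.8 * y_all[:, None]
perm = rng.permutation(300)
train, test = perm[:200], perm[200:]

curve = few_shot_curve(X_all[train], y_all[train], X_all[test], y_all[test])
for k, (mean_acc, std_acc) in curve.items():
    print(f"{k:>2}-shot: {mean_acc:.2f} +/- {std_acc:.2f}")
```

Reporting the spread across draws matters here, because with a handful of labels a single lucky sample can make a representation look better than it is.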

Why a unified probing protocol is the real contribution

The honest problem in genomic ML is that almost every foundation model reports its own benchmark, with its own splits, its own fine-tuning budget, and its own preprocessing. Two papers claiming “state of the art” are often not measuring the same thing, so the field cannot answer a basic question: which pretrained DNA model should I start from? GENEB’s value is the controlled comparison — same frozen-probing protocol, same tasks, same splits across all 40 models — which lets it isolate the effect of scale, architecture, tokenization, and pretraining data instead of confounding them.

This is the part worth taking seriously even if you never run the benchmark: it reframes “which model is best” as “best for what,” and shows the first framing is close to meaningless in this domain.
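One practical ingredient of "same splits across all models" is making the train/test assignment a pure function of the example, so no model can accidentally see a different partition. The sketch below shows one way to do that by hashing example IDs. It is an assumption about how such a protocol could be implemented, not GENEB's actual code; `split_of`, `evaluate_all`, the salt string, and the model names are all hypothetical.

```python
import hashlib

def split_of(example_id, test_fraction=0.2, salt="geneb-demo"):
    """Deterministically assign an example to 'train' or 'test' by hashing its ID."""
    digest = hashlib.sha256(f"{salt}:{example_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "test" if bucket < test_fraction else "train"

def evaluate_all(models, example_ids, score_fn):
    """Score every model on the identical train/test partition."""
    train_ids = [i for i in example_ids if split_of(i) == "train"]
    test_ids = [i for i in example_ids if split_of(i) == "test"]
    return {name: score_fn(name, train_ids, test_ids) for name in models}

# Usage with a dummy scorer that only records which IDs each model received.
example_ids = [f"seq_{i}" for i in range(1000)]
seen = {}

def record_split(name, train_ids, test_ids):
    seen[name] = (tuple(train_ids), tuple(test_ids))
    return len(test_ids)

results = evaluate_all(["model_a", "model_b"], example_ids, record_split)
print(results)
```

Because the split depends only on the ID and the salt, adding a new model later reuses exactly the same partition without storing index files.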

Key results

40 models, 100 tasks, 13 categories: the largest controlled, probing-based comparison of genomic foundation models to date, with every model evaluated frozen under one protocol.
Rankings are unstable. Model orderings vary sharply across the 13 task categories, so an aggregate leaderboard is misleading; a top-3 model in regulatory tasks can drop well down the list on others.
Scaling is weak. More parameters deliver only modest and inconsistent improvements — there is no clean scaling curve like the one seen in language models.
Architecture and pretraining alignment beat size. Whether the model’s architecture and pretraining data match the downstream task type frequently matters more than parameter count.
Few-shot exposes fragility. Under limited-label regimes the gaps between models shift again, so apparent leaders in the full-data setting are not always the right choice when labels are scarce.
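The ranking-instability result is easy to check on any score table: rank models within each category, rank them on the aggregate mean, and compare. The sketch below does this on a small made-up table. The model names, categories, and scores are invented for illustration and are not GENEB results; the Spearman correlation is computed as the Pearson correlation of ranks, which is valid here because the toy table has no ties.

```python
import numpy as np

# Hypothetical scores (rows: models, columns: task categories). Not real data.
models = ["model_a", "model_b", "model_c", "model_d"]
categories = ["regulatory", "splicing", "variant_effect"]
scores = np.array([
    [0.91, 0.70, 0.62],
    [0.85, 0.82, 0.60],
    [0.80, 0.78, 0.75],
    [0.78, 0.85, 0.66],
])

def ranks_desc(values):
    """Rank along axis 0 with 1 = best (highest score); assumes no ties."""
    return np.argsort(np.argsort(-values, axis=0), axis=0) + 1

per_category = ranks_desc(scores)               # shape: (models, categories)
aggregate = ranks_desc(scores.mean(axis=1))     # rank on the mean score

# Spearman correlation between category rankings (Pearson on ranks).
rho = np.corrcoef(per_category.T)

for i, name in enumerate(models):
    by_cat = ", ".join(f"{c}={per_category[i, j]}" for j, c in enumerate(categories))
    print(f"{name}: aggregate rank {aggregate[i]} | {by_cat}")
print("rank correlation, regulatory vs splicing:", round(float(rho[0, 1]), 2))
```

In this toy table the aggregate winner is only third on regulatory tasks, and the regulatory and splicing rankings are negatively correlated, which is the kind of pattern a single leaderboard number would hide.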

Why it matters now

Genomic foundation models are multiplying faster than anyone can fairly compare them, and grants and clinical pipelines are starting to pick one as a default encoder. GENEB gives practitioners a way to choose based on the task category they actually care about, and gives model builders a warning: reporting one aggregate score, or pointing at parameter count as evidence of progress, does not survive a controlled probe. The blunt takeaway is that much of the “our genomic LLM is bigger and therefore better” narrative is not supported once you hold the evaluation fixed.

Limits and open questions

Frozen probing is a deliberate choice with a cost: it measures what a representation already encodes, not the ceiling a model can reach with full fine-tuning, so a model that probes poorly might still win after end-to-end tuning. The benchmark is also a snapshot — 40 models on a fixed task set — and genomic models, tokenizers, and context lengths are moving fast, so the leaderboard will age. GENEB diagnoses instability without fully explaining its mechanism: it shows architecture and pretraining alignment matter, but does not hand you a predictive rule for which model will win a new task category. And like any benchmark, its 100 tasks encode choices about what “genomic understanding” means; tasks it omits could reorder the conclusions.

FAQ

What is the GENEB benchmark?

GENEB is a large-scale diagnostic benchmark that evaluates frozen representations from 40 genomic foundation models across 100 tasks in 13 functional categories, using a unified probing protocol so the comparison is controlled rather than self-reported.

What does GENEB find about scaling genomic models?

GENEB finds scaling is weak: adding parameters produces only modest and inconsistent gains, with no clean scaling law, and architecture plus pretraining-data alignment to the task often matters more than model size.

Why are genomic foundation models hard to compare?

Each model typically reports its own splits, fine-tuning budget, and preprocessing, so two state-of-the-art claims rarely measure the same thing. GENEB fixes the protocol across all 40 models to make the comparison fair, and shows that rankings flip across task categories.

Should I pick a genomic model from GENEB’s aggregate leaderboard?

No. GENEB’s central warning is that aggregate rankings are unstable. Choose based on the functional category and label budget your task falls into, since the best model for regulatory tasks may rank far lower elsewhere.

One line: hold the evaluation fixed and the “bigger genomic model wins” story falls apart — pick by task category, not parameter count. Read the original paper on arXiv.