Opinion · 20 Apr 2026 · 5 min read

Why evaluation-first is the only honest way to ship AI

Benchmark theater is easy. Evidence you can trust is hard. Here's how Born thinks about measurement, and why we publish evals before we publish claims.

The problem with claims-first

Most AI model releases follow the same pattern: pick the benchmarks you win on, show the chart, skip the ones you don't. The result is a cloud of performance claims that nobody can reproduce and nobody trusts.

The pressure to publish impressive numbers creates a structural incentive to cherry-pick. Run enough benchmarks, and you'll find one where your model looks good. The question is whether that benchmark tells you anything about the task your users actually care about.

What evaluation-first means in practice

At Born, evaluation-first means three things:

1. Baseline before training

We run the base model on the exact same evaluation harness we will use after fine-tuning. The baseline is the anchor. If the fine-tune doesn't beat the baseline on that harness, we say so.

2. Task-specific harnesses

We don't use generic leaderboard benchmarks as the primary evaluation. We build harnesses that test the actual work the model is supposed to do. If the model is for code repair, the harness tests code repair, not MMLU.

3. Caveats are part of the release

Every Born model ships with a "What this model is not" section. If there are task categories where it underperforms, those are documented. Shipping with honest caveats is a feature.
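The three practices above can be sketched as a small harness. This is a minimal illustration under stated assumptions, not Born's actual tooling: `Case`, `run_harness`, and the two toy models are hypothetical names, and a real harness would call a real model and run real checks (for code repair, that might mean executing a test suite).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Case:
    """One task-specific test: a prompt plus a checker for the model's output."""
    prompt: str
    check: Callable[[str], bool]


def run_harness(model: Callable[[str], str], cases: List[Case]) -> List[bool]:
    """Run every case through the model and record pass/fail per case."""
    return [case.check(model(case.prompt)) for case in cases]


# Hypothetical toy cases standing in for a real code-repair probe.
PREFIX = "fix: "
CASES = [
    Case(PREFIX + "retrun x", lambda out: out == "return x"),
    Case(PREFIX + "pritn(x)", lambda out: out == "print(x)"),
]


def base_model(prompt: str) -> str:
    # Stand-in for the untuned model: returns the broken code unchanged.
    return prompt[len(PREFIX):]


def tuned_model(prompt: str) -> str:
    # Stand-in for the fine-tune: repairs one of the two typos.
    return prompt[len(PREFIX):].replace("retrun", "return")


# The baseline is measured first, on the same frozen cases used after training.
baseline = run_harness(base_model, CASES)
tuned = run_harness(tuned_model, CASES)
print(f"baseline: {sum(baseline)}/{len(CASES)}, fine-tune: {sum(tuned)}/{len(CASES)}")
```

Keeping per-case results, and not only totals, also makes the "What this model is not" section easier to write, because the failing cases can be listed directly.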

Born-9B as an example

Born-9B v0 tied the base model at 10/20 on our coding probe. That is not a good marketing result. But it is the real result, and publishing it meant the v1 data work was grounded in evidence instead of hope.

We could have picked a different evaluation, or a smaller subset where the LoRA happened to score well. Instead, we published the probe we had already committed to, ran the comparison, and reported the tie. The v1 data expansion, which grew the dataset from 225 to more than 2,400 unique rows, was directly motivated by that honest result.

Why this matters for applied AI

If you are deploying a fine-tuned model in production, the only number that matters is the delta between your baseline and your fine-tune on your task. Not HumanEval. Not MMLU. Not whatever benchmark was trending when the model launched.
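As a rough sketch of that delta, assume the harness yields a pass count for each model over the same set of cases. The function name `report_delta` and the wording of its output are illustrative choices, not Born's reporting format.

```python
def report_delta(baseline_passed: int, tuned_passed: int, total: int) -> str:
    """Summarize fine-tune vs. baseline on one harness, in passes and pass-rate points."""
    if total <= 0:
        raise ValueError("total must be positive")
    delta = tuned_passed - baseline_passed
    delta_points = 100.0 * delta / total  # difference in pass rate, in percentage points
    if delta > 0:
        verdict = "fine-tune beats baseline"
    elif delta < 0:
        verdict = "fine-tune underperforms baseline"
    else:
        verdict = "tie"
    return (
        f"baseline {baseline_passed}/{total}, fine-tune {tuned_passed}/{total}, "
        f"delta {delta:+d} ({delta_points:+.1f} points): {verdict}"
    )


# The Born-9B v0 coding-probe result: both models at 10/20.
print(report_delta(10, 10, 20))
```

The verdict is computed from the numbers, not chosen by hand, so a tie or a regression gets reported in the same format as a win.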

Evaluation-first protects the team that deploys the model. It forces the builder to prove value on the actual task. And when the fine-tune doesn't work, it gives the team a clear signal to iterate on data, not hope for magic.

That's the discipline Born was founded on. Every artifact we ship will follow this pattern: measure first, train, measure again, report the delta, and document the caveats. Anything else is benchmark theater.
