Research, method, and evaluation
One page for how Born actually works.
This page replaces the split between “company”, “method”, and “research”. It holds the operating stance, the method, the benchmark frame, and the current Born-9B release arc in one place.
v2: public preview anchor
0.9244: local weighted score
0.8559: SWE proxy score
precision: current release thesis
Applied, not abstract
Born exists to produce deployable model behavior and release evidence, not to pad out a research identity with generic theory.
Claims stay behind evals
A run can finish cleanly and still fail promotion. That has already happened several times in the Born-9B history and is now part of the operating posture.
Public notes are part of the artifact
The run reports, release pages, benchmark packs, and postmortems are not marketing extras. They are part of the product surface.
The Born method
Systematic, not improvised.
These six steps are the real operating pattern behind model work, not a retrospective framework invented after the fact.
01. Frame the work
Define the exact behavior, operational job, failure conditions, and what a successful release would actually mean.
02. Establish baseline
Measure the base model before changing anything. That makes later gains, ties, and regressions legible.
03. Build the data
Use teacher lanes, synthetic generation, curation, provenance, and strict validation before training ever starts.
04. Adapt the model
Train narrowly around the target behavior, keep the budget visible, and preserve artifacts from every significant branch.
05. Evaluate honestly
Run the same held-out gates, track where the model gets stronger, and keep failed attempts visible instead of burying them.
06. Operationalize
Package the checkpoint, document the run, expose the caveats, and route the model into real product surfaces like Born Chat.
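As an illustration of the "strict validation before training ever starts" part of step 03, here is a minimal sketch of a row filter. The `Row` fields, the rejection reasons, and the exact-match duplicate check are assumptions made for the example, not Born's actual schema or pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    prompt: str
    response: str
    source: str  # hypothetical provenance field: which teacher lane or generator produced the row


def validate_rows(rows):
    """Split rows into (kept, rejected); each rejected entry is (row, reason)."""
    kept, rejected, seen = [], [], set()
    for row in rows:
        key = (row.prompt.strip(), row.response.strip())
        if not key[0] or not key[1]:
            rejected.append((row, "empty field"))
        elif not row.source.strip():
            rejected.append((row, "missing provenance"))
        elif key in seen:
            # Exact-match duplicate of a row that was already kept.
            rejected.append((row, "duplicate"))
        else:
            seen.add(key)
            kept.append(row)
    return kept, rejected


rows = [
    Row("Fix the off-by-one bug", "patch text", "teacher-a"),
    Row("Fix the off-by-one bug", "patch text", "teacher-b"),
    Row("Explain the error", "", "synthetic"),
    Row("Add a test", "test text", ""),
]
kept, rejected = validate_rows(rows)
for row, reason in rejected:
    print(f"rejected ({reason}): {row.prompt}")
```

Keeping the rejected rows together with a reason, rather than silently dropping them, matches the page's stance that failed attempts should stay visible.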
Benchmarking and evaluation
Research here means measurement infrastructure.
BornBench, the fixed local gate, the SWE proxy, and promotion rules are all part of the same discipline: a page does not get to call a run “better” just because the run completed.
Local held-out gate (25 tasks)
The fixed internal coding-agent gate is still the primary promotion rule for Born-9B. The current public result is Born-9B Preview v2 at a local weighted score of 0.9244 and 23/25, versus base Qwen at 0.8511 and 22/25.
SWE-bench proxy (22 / 25)
Born reports a local issue-resolution proxy derived from SWE-bench Verified. It is useful, but it is explicitly not the official Docker harness score.
BornBench (96 items)
BornBench measures broader coding-agent behavior across execution, repo context, tool protocol, security, long context, and related lanes. It is evaluation infrastructure, not a Born-9B vanity board.
Promotion rule (same gate)
A later adapter does not get promoted unless it beats the current best checkpoint on the same held-out gate. That rule is why v2 remains the public preview.
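The promotion rule can be sketched in a few lines. This is a minimal illustration, not Born's scorer: the `GateResult` type, the `gate_id` label, the per-task weights in `weighted_score`, and the choice of a strict comparison (so a tie does not promote) are all assumptions. Only the two score pairs in the usage lines come from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    name: str
    gate_id: str           # hypothetical label identifying the fixed held-out gate
    weighted_score: float
    passed: int
    total: int


def weighted_score(task_scores, weights):
    """Weighted mean of per-task scores; the weighting scheme is an assumption."""
    total_weight = sum(weights)
    if len(task_scores) != len(weights) or total_weight <= 0:
        raise ValueError("need one positive-sum weight per task score")
    return sum(s * w for s, w in zip(task_scores, weights)) / total_weight


def should_promote(candidate, current_best):
    """Promote only if the candidate beats the current best on the same gate."""
    if candidate.gate_id != current_best.gate_id or candidate.total != current_best.total:
        raise ValueError("results must come from the same held-out gate")
    # Strict comparison: a tie with the current best is not a promotion.
    return candidate.weighted_score > current_best.weighted_score


base = GateResult("base Qwen", "local-25", 0.8511, 22, 25)
v2 = GateResult("Born-9B Preview v2", "local-25", 0.9244, 23, 25)
print(should_promote(v2, base))
```

Under these assumptions, a later adapter that merely matched v2 at 0.9244 would not replace it, which is consistent with v2 remaining the public preview.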
BornBench v2 practical snapshot
Qwen 3.6-35B A3B: 65.22 BornScore
Mimo v2.5: 63.76 BornScore
GPT-5.4 nano: 62.04 BornScore
Qwen 3.5-9B base: 30.75 BornScore
The point is not that Born-9B should top this table today. The point is that Born now owns an evaluation surface that can expose where models are strong, brittle, or format-compliant across more than one narrow release gate.
Current model arc
The repo’s actual story is four stages long.
v0 proof artifact
A cheap first run, 225 rows, a loadable LoRA, and an honest tie with base Qwen. Useful because the full loop existed, not because the score was impressive.
v1 bigger but weaker
The dataset jumped to the multi-million-token range, but the first large continuation still lost to base on the 25-task gate. Bigger did not mean better.
v2 promoted
Generated-expanded v2 became the turning point once the scorer was corrected. It beat base Qwen on the fixed local gate and on the local SWE proxy.
Post-v2 runs stayed transparent
v3, v4, v2-recovery, exact hotfixes, and preview recovery all remain visible in the repo because they were informative, even when they were not upgrades.
General-release thesis
Precision, not scale.
Scale executable exact-code rows around the remaining narrow failures.
Scale tool-use and provider-closure rows instead of broad generic reasoning data.
Preserve v2 behavior with rehearsal so recovery work does not erase the win.
Cap visible-thinking and generic reasoning imports unless they prove useful on the same held-out gates.
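One way to read the four points above is as constraints on the training mix. The sketch below shows only the capping step, under stated assumptions: the lane names, the row counts, and the 20 percent ceiling are hypothetical and are not taken from any Born run.

```python
def cap_lanes(counts, capped_lanes, max_percent):
    """Trim capped lanes so that together they make up at most max_percent of the final mix."""
    if not 0 <= max_percent < 100:
        raise ValueError("max_percent must be in [0, 100)")
    uncapped_total = sum(n for lane, n in counts.items() if lane not in capped_lanes)
    capped_total = sum(n for lane, n in counts.items() if lane in capped_lanes)
    # Largest capped row count C such that C / (uncapped_total + C) <= max_percent / 100.
    limit = uncapped_total * max_percent // (100 - max_percent)
    if capped_total <= limit:
        return dict(counts)
    # Scale every capped lane down proportionally; uncapped lanes are untouched.
    return {
        lane: (n * limit // capped_total if lane in capped_lanes else n)
        for lane, n in counts.items()
    }


# Hypothetical row counts per lane.
mix = {
    "exact_code": 600,
    "tool_use": 300,
    "v2_rehearsal": 100,
    "visible_thinking": 400,
    "generic_reasoning": 100,
}
capped = cap_lanes(mix, {"visible_thinking", "generic_reasoning"}, 20)
print(capped)
```

Integer arithmetic keeps the result deterministic. Whether a capped lane later earns a larger share would still be decided on the same held-out gates, not inside this function.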
Related notes
Read the research trail, not just the headline.
Born Chat is open: PromptKit UI, live model lanes, and the public training loop
What shipped in Born Chat, how the Born and comparison lanes are organized, why the training notice is explicit, and what Born Chat Pro is meant to become.
How Born-9B Learned to Breathe
The origin story of Born-9B told from the real notes: the cheap first run, the honest tie, the many teachers, and the million-token second inhale.
Born-9B v1: Distillation at the edge of 9 billion parameters
How we built a competitive coding-agent model in public: the data, the teachers, and the honest eval results. What worked, what did not, and every number we tracked.
Why evaluation-first is the only honest way to ship AI
Benchmark theater is easy. Evidence you can trust is hard. Here is how Born thinks about measurement, and why we publish evals before we publish claims.