Born-9B Preview is finally here.
Born-9B Preview is a Qwen3.5-9B LoRA built for coding-agent behavior: short plans, concrete patches, exact checks, and a final user-facing result. The run was operated by Repath Ray Khan, and the repo work was managed end to end by GPT-5.5 Codex.
Release card
Operator
Repath Ray Khan
Repath set the ambition, pushed the pace, chose the release pressure, supplied infrastructure, and made the final call that the public artifact had to ship with evidence rather than hype.
Build manager
GPT-5.5 Codex
I managed the repo work: dataset manifests, validation notes, training configs, RunPod checkpoints, model-card writing, release packaging, and the refusal to promote weaker later checkpoints.
Artifact
Born-9B Preview
Born-9B Preview is the adapter that survived the local gates. It is not declared finished. It is the first public checkpoint in a measured model-building loop.
Why this exists
We wanted a smaller model that closes the loop.
The goal was never to make a 9B model sound larger by copying long hidden chain-of-thought. That fails in practice: smaller models learn the costume of thinking before they learn the calibration. Born-9B was trained toward a more operational target.
The target is visible work product. State the plan briefly. Produce the code, patch, command, or tool action. Name the checks. Finish with a result a user can act on. That shape is not decoration. It is a way to reduce the failure mode where a model analyzes the problem forever and never lands the change.
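As a rough illustration of that shape, the sketch below checks whether a response contains all four parts. The `Label:` heading format and the names `REQUIRED_SECTIONS` and `missing_sections` are assumptions made for this example; the source names the four parts but not their exact markup.

```python
import re

# Hypothetical section labels: the four parts come from the text above,
# but the "Label:" heading syntax is an assumption for this sketch.
REQUIRED_SECTIONS = ("plan", "patch", "checks", "result")


def missing_sections(response: str) -> list[str]:
    """Return the required labels that never appear as a line-leading 'Label:' heading."""
    found = {
        match.group(1).lower()
        for match in re.finditer(r"^\s*(\w+)\s*:", response, flags=re.MULTILINE)
    }
    return [name for name in REQUIRED_SECTIONS if name not in found]


example = """Plan: guard the empty-list case in mean().
Patch:
    if not values:
        return 0.0
Checks: pytest tests/test_stats.py -k mean
Result: mean([]) now returns 0.0 instead of raising ZeroDivisionError.
"""

print(missing_sections(example))
print(missing_sections("Plan: think about it.\nLet me analyze further..."))
```

A response that plans and then keeps analyzing fails this check because it never reaches a patch, a check, or a result, which is the failure mode described above.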
Repath operated the run with a simple standard: a checkpoint that did not beat both the base and the current best on the same held-out gate was not the preview. GPT-5.5 Codex managed that standard across the repo, training notes, RunPod state, reports, and release copy.
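The promotion standard can be written down as a small rule. This is a minimal sketch, not the project's actual gate code: `should_promote` is a hypothetical helper, the strict-inequality choice is an assumption, and the 0.90 score for a later checkpoint is invented purely for illustration. The 0.8511 and 0.9244 values are the Born self gate scores reported in the benchmark snapshot.

```python
from typing import Optional


def should_promote(candidate: float, base: float, current_best: Optional[float] = None) -> bool:
    """Promote only if the candidate strictly beats base and any existing best on the same gate."""
    bar = base if current_best is None else max(base, current_best)
    return candidate > bar


BASE_GATE = 0.8511  # base score on the held-out gate (from the report)
V2_GATE = 0.9244    # promoted preview score on the same gate (from the report)

# First candidate: only the base to beat.
print(should_promote(V2_GATE, BASE_GATE))

# A later run scoring 0.90 (illustrative number) beats base but not the current best.
print(should_promote(0.90, BASE_GATE, current_best=V2_GATE))
```

Comparing against the higher of the two bars is what keeps a newer but weaker checkpoint from replacing the current preview.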
Benchmark snapshot
The preview claim is local and bounded.
These are project reports, not public leaderboard submissions. The SWE row is a patch-plan proxy built on SWE-bench Verified issue text and hints; it stays a proxy until the official Docker harness is run. The exact-code rows are executable slices, and they show a real regression against base on HumanEval.
Benchmark (tasks)          Base (score / tasks)   Born (score / tasks)
Born self (25)             0.8511 / 22            0.9244 / 23
SWE Verified proxy (25)    0.7561 / 19            0.9117 / 23
HumanEval (25)             0.84 / 21              0.68 / 17
MBPP (25)                  0.68 / 17              0.76 / 19
Exact-code combined (50)   0.76 / 38              0.72 / 36

Born self: Same held-out coding-agent gate used to decide the promoted preview.
SWE Verified proxy: Fresh A40 rerun for Born on issue text and hints from SWE-bench Verified. Still a patch-plan proxy, not the official Docker harness.
HumanEval: Executable exact-code slice. Base Qwen is stronger here, so this is a preview limitation, not a win.
MBPP: Executable exact-code slice. Born improves this slice, but not enough to beat base on the combined exact-code sample.
Exact-code combined: HumanEval plus MBPP. Born wins MBPP, loses HumanEval, and trails base by two tasks overall.
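The combined exact-code row follows directly from the two slices. The short check below recomputes it from the reported pass counts; the dictionary layout and the `combined` helper are just a convenient way to hold the numbers for this sketch.

```python
# Pass counts copied from the HumanEval and MBPP rows (25 tasks each).
SLICES = {
    "HumanEval": {"tasks": 25, "base": 21, "born": 17},
    "MBPP": {"tasks": 25, "base": 17, "born": 19},
}


def combined(model: str) -> tuple[int, int, float]:
    """Sum passes and task counts across slices and return (passed, total, rate)."""
    passed = sum(row[model] for row in SLICES.values())
    total = sum(row["tasks"] for row in SLICES.values())
    return passed, total, passed / total


base_passed, total, base_rate = combined("base")
born_passed, _, born_rate = combined("born")

print(f"base {base_passed}/{total} = {base_rate:.2f}")  # base 38/50 = 0.76
print(f"born {born_passed}/{total} = {born_rate:.2f}")  # born 36/50 = 0.72
print(f"gap  {base_passed - born_passed} tasks")        # gap  2 tasks
```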
Build chronology
v0 proved the pipe
The first adapter was intentionally small: 225 rows, about 48K tokens, and a tiny exact-code sanity check. It proved the QLoRA path, not the model.
v1 proved volume is not strategy
The first multimillion-token run improved shape but exposed dilution. It could sound more like a coding agent while still missing important closure behavior.
v2 used failure as data
The promoted preview came from turning known failures into targeted curriculum while preserving the broad v1 foundation. That is why v2 is the release checkpoint.
Later runs were not promoted
Recovery and hotfix runs were preserved because they taught us something, but they did not beat v2 on the same weighted gate. The preview release keeps the strongest evidence, not the newest artifact.
Dataset shape
Not just more rows. Better pressure.
The final preview mix kept 7,097 rows after deduplication and validation. It removed hidden reasoning markers and biased the answer format toward plan, patch/code, checks, and result.
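A cleaning pass of this kind might look like the sketch below. It is an illustration under stated assumptions, not the project's pipeline: the `<think>...</think>` span format, the `prompt`/`answer` row schema, and the "drop rows whose answer is empty after stripping" validation rule are all hypothetical stand-ins for details the release does not specify.

```python
import re

# Assumed marker format for hidden reasoning; the release does not name the exact markers.
THINK_SPAN = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def clean_rows(rows: list[dict]) -> list[dict]:
    """Strip hidden-reasoning spans, drop empty answers, and remove exact duplicates."""
    seen = set()
    kept = []
    for row in rows:
        prompt = row["prompt"].strip()
        answer = THINK_SPAN.sub("", row["answer"]).strip()
        key = (prompt, answer)
        # Minimal validation for this sketch: the visible answer must be non-empty.
        if not answer or key in seen:
            continue
        seen.add(key)
        kept.append({"prompt": prompt, "answer": answer})
    return kept


raw = [
    {"prompt": "Fix mean([])", "answer": "<think>long musing</think>Plan: guard empty input."},
    {"prompt": "Fix mean([])", "answer": "Plan: guard empty input."},  # duplicate after stripping
    {"prompt": "Rename foo", "answer": "<think>only hidden text</think>"},  # nothing visible left
]

cleaned = clean_rows(raw)
print(len(cleaned), cleaned[0]["answer"])
```

Deduplicating after the markers are stripped matters here: two rows that differ only in hidden reasoning collapse into one visible example.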
Findings
What the run taught us
A small model does not need more public self-dialogue. It needs compact visible decision rules, concrete code, exact checks, and a result.
Broad distillation can make outputs prettier while making a model less reliable. v2 improved because it mixed rehearsal with failure-targeted data.
Exact-code tasks and tool-use tasks pull in different directions. The next release needs separate gates so one improvement does not hide another regression.
The model card matters. A preview release should tell users what the adapter is, what it is not, how it was scored, and where the claims stop.
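The finding about separate gates can be made concrete with the reported scores. In the sketch below, `gate_regressions` is a hypothetical helper and the unweighted mean is an assumption chosen only to show the masking effect; it is not the project's weighted gate.

```python
# Scores copied from the benchmark snapshot (base vs. Born, same slices).
BASE = {"born_self": 0.8511, "swe_proxy": 0.7561, "humaneval": 0.84, "mbpp": 0.68}
BORN = {"born_self": 0.9244, "swe_proxy": 0.9117, "humaneval": 0.68, "mbpp": 0.76}


def gate_regressions(base: dict, candidate: dict, tolerance: float = 0.0) -> dict:
    """Return every gate where the candidate scores below base by more than the tolerance."""
    return {
        name: round(candidate[name] - base[name], 4)
        for name in base
        if candidate[name] < base[name] - tolerance
    }


def mean(scores: dict) -> float:
    """Unweighted average across gates (an illustrative aggregate, not the project's)."""
    return sum(scores.values()) / len(scores)


# A single averaged number goes up, which hides the exact-code drop.
print(f"mean base {mean(BASE):.4f}, mean born {mean(BORN):.4f}")

# Checking each gate on its own surfaces the HumanEval regression.
print(gate_regressions(BASE, BORN))
```

Under this toy aggregate the candidate looks better overall, while the per-gate check still flags HumanEval, which is the reason for gating the two task families separately.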
Release note
Preview means usable, not finished.
Born-9B Preview is public because it finally clears the local evidence bar for a small coding-agent adapter. It is not a claim that the model beats every larger Qwen variant or that it has an official SWE-bench score. Those claims need separate harnesses.
The next work is clear: run official patch-based SWE tasks, improve tool-use without weakening exact-code, and keep publishing the actual reports. The release is a checkpoint in public, not the end of the system.