Born-9B v1: Distillation at the edge of 9 billion parameters
How we built a coding-agent model in public — the data, the teachers, the honest eval results, and what we learned when v0 tied the base model.
The premise
Born-9B started as a question: can you build a useful coding-agent LoRA on a single RTX 6000 Ada (48 GB VRAM), using only openly available teacher models, and release the result with full provenance?
The answer, after v0, was "not yet" — and that honesty is the point. The v0 adapter tied base Qwen3.5-9B at 10/20 on our probe. No improvement was claimed. What we did prove was the complete pipeline: teacher selection, dataset construction, QLoRA training, evaluation, and release.
Teacher selection
We evaluated six teacher models on real benchmark-family rows before choosing which to distill from. The probe covered exact code generation (MBPP+ tail), judged repair tasks, and agentic multi-step closures.
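As a rough illustration of how per-family probe results can drive teacher choice, the sketch below tallies pass rates per teacher and benchmark family and picks the top teacher for each family. The row shape `(teacher, family, passed)` and the function name are assumptions made for this example, not Born's actual probe schema, and judged tasks are treated as a simple pass/fail.

```python
from collections import defaultdict


def best_teacher_per_family(results):
    """Pick the highest-scoring teacher for each benchmark family.

    `results` is an iterable of (teacher, family, passed) tuples, one per
    probe row. This row shape is hypothetical, chosen for illustration.
    Returns {family: (teacher, pass_rate)}.
    """
    # (teacher, family) -> [passed_count, attempted_count]
    totals = defaultdict(lambda: [0, 0])
    for teacher, family, passed in results:
        cell = totals[(teacher, family)]
        cell[0] += int(bool(passed))
        cell[1] += 1

    best = {}
    # Sorting makes ties deterministic: the alphabetically first teacher
    # keeps the slot because only a strictly higher rate replaces it.
    for (teacher, family), (ok, attempted) in sorted(totals.items()):
        rate = ok / attempted
        if family not in best or rate > best[family][1]:
            best[family] = (teacher, rate)
    return best
```

Calling `best_teacher_per_family(rows)` on the probe rows returns one winner per family, which is enough to see at a glance whether any single teacher leads everywhere or whether the lead changes by family.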
The v0 result: a tie
The v0 adapter was trained on 225 rows — a proof-of-life mix. On a 20-task coding probe, both the LoRA and the base model scored 10/20. We published this result unchanged.
Base Qwen3.5-9B: 10 / 20
Born-9B v0 LoRA: 10 / 20
Result: Tie
This result drove the v1 decision: more data, better coverage across benchmark families, separate generation lanes per teacher, and a strict deduplication pipeline before the next training run.
v1: the data pass
At this stage, v1 is a data expansion phase, not a training run. We built separate JSONL generation lanes for each teacher model, each covering different benchmark families to prevent overlap. The dataset grew from 225 rows to over 2,400 unique prompts across DS-1000, BigCodeBench, MBPP+, SWE-bench, RepoExec, and Irish-language synthesis.
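A minimal sketch of cross-lane deduplication, assuming each JSONL row carries its prompt in a `prompt` field and that lowercasing plus whitespace collapsing is an acceptable notion of "same prompt". Both are assumptions for illustration. The actual Born pipeline may normalize differently or use fuzzy matching.

```python
import hashlib
import json
import re


def prompt_key(prompt):
    """Hash a prompt after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", prompt).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedupe_lanes(lanes):
    """Keep the first occurrence of each prompt across all lanes.

    `lanes` maps a lane name to an iterable of JSONL lines. The "prompt"
    field name is hypothetical. Returns (kept_rows, duplicates), where each
    duplicate is (lane_with_duplicate, lane_that_kept_it, prompt).
    """
    seen = {}        # prompt hash -> lane that first contributed it
    kept = []
    duplicates = []
    for lane_name, lines in lanes.items():
        for line in lines:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines in JSONL files
            row = json.loads(line)
            key = prompt_key(row["prompt"])
            if key in seen:
                duplicates.append((lane_name, seen[key], row["prompt"]))
                continue
            seen[key] = lane_name
            kept.append({**row, "lane": lane_name})
    return kept, duplicates
```

Recording which lane a row came from keeps provenance attached to every surviving prompt, and the duplicates list shows where two lanes drifted into the same benchmark territory.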
External trace imports from Claude Opus reasoning, Tachibana4 DeepSeek V4 Pro, Hermes agent traces, and CodeX-2M-Thinking were sanitized: raw hidden thinking blocks were stripped, observable implementation content was retained, and the result was rewritten into Born's closure format.
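The stripping step can be sketched as follows, assuming the hidden reasoning is delimited by `<think>...</think>` tags. That delimiter is an assumption for illustration; each trace source may mark its thinking blocks differently, and the rewrite into Born's closure format is not shown here.

```python
import re

# Assumed delimiter for hidden reasoning; real traces may use other markers.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)


def strip_hidden_thinking(text):
    """Remove well-formed <think>...</think> blocks and tidy blank lines.

    Unclosed <think> tags are left untouched so that a malformed trace is
    visible to later validation rather than silently truncated.
    """
    cleaned = THINK_BLOCK.sub("", text)
    # Collapse the runs of blank lines left behind by removed blocks.
    cleaned = re.sub(r"\n{3,}", "\n\n", cleaned)
    return cleaned.strip()
```

The non-greedy pattern removes each block separately, so observable content that sits between two thinking blocks is preserved.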
What we learned
225 rows is not enough. A proof-of-life mix can validate the pipeline, but cannot meaningfully shift model behavior on diverse coding tasks. The v1 target is 2,000+ unique prompts with verified provenance.
Teacher quality varies by task family. Kimi dominated the MBPP+ tail and BigCodeBench. DeepSeek excelled at DS-1000. Neither was universally superior. The Ring architecture (separate lanes per teacher) was the correct design.
Honest measurement protects the project. Publishing a 10/20 tie is uncomfortable. But it anchored the v1 data work in real evidence instead of hope. Every Born release will follow this pattern: baseline, delta, caveats.
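The baseline, delta, caveats pattern is simple enough to encode directly. The sketch below is a hypothetical helper, not part of the Born tooling, that renders the three parts from raw pass counts so a tie or a regression is reported in the same shape as an improvement.

```python
def release_report(baseline_passed, candidate_passed, total, caveats):
    """Render a baseline / delta / caveats summary as plain text.

    All names here are illustrative. `caveats` is a list of short strings.
    """
    if total <= 0:
        raise ValueError("total must be positive")

    delta = candidate_passed - baseline_passed
    if delta > 0:
        verdict = "improvement"
    elif delta < 0:
        verdict = "regression"
    else:
        verdict = "tie"

    lines = [
        f"baseline: {baseline_passed}/{total}",
        f"candidate: {candidate_passed}/{total}",
        f"delta: {delta:+d} ({verdict})",
    ]
    lines += [f"caveat: {c}" for c in caveats]
    return "\n".join(lines)


# The v0 result from this post: both models at 10/20, trained on 225 rows.
print(release_report(10, 10, 20, ["v0 adapter trained on 225 rows"]))
```

For the v0 numbers this prints a delta of `+0 (tie)` followed by the caveat line, which matches how the result was published.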
What comes next
The v1 dataset is nearing completion. Once curation and validation finish, we will run a continuation training run on Runpod with the same QLoRA config, the same RTX 6000 Ada hardware, and the same evaluation probe. The delta between v0 and v1 will be reported honestly, whatever it shows.