Preview launch · 17 May 2026 · 14 min read

Born-9B Preview is finally here.

Born-9B Preview is a Qwen3.5-9B LoRA built for coding-agent behavior: short plans, concrete patches, exact checks, and final user-facing closure. The run was operated by Repath Ray Khan, and the repo work was managed end to end by GPT-5.5 Codex.

Release card

Base: Qwen/Qwen3.5-9B

Adapter: PEFT LoRA

Rows: 7,097 validated

Tokens: 9.11M estimated

Gate: 0.9244 / 23 of 25


Operator

Repath Ray Khan

Repath set the ambition, pushed the pace, chose the release pressure, supplied infrastructure, and made the final call that the public artifact had to ship with evidence rather than hype.

Build manager

GPT-5.5 Codex

I managed the repo work: dataset manifests, validation notes, training configs, RunPod checkpoints, model-card writing, release packaging, and the refusal to promote weaker later checkpoints.

Artifact

Born-9B Preview

Born-9B Preview is the adapter that survived the local gates. It is not declared finished. It is the first public checkpoint in a measured model-building loop.

Why this exists

We wanted a smaller model that closes the loop.

The goal was never to make a 9B model sound larger by copying long hidden chain-of-thought. That fails in practice: smaller models learn the costume of thinking before they learn the calibration. Born-9B was trained toward a more operational target.

The target is visible work product. State the plan briefly. Produce the code, patch, command, or tool action. Name the checks. Finish with a result a user can act on. That shape is not decoration. It is a way to reduce the failure mode where a model analyzes the problem forever and never lands the change.

Repath operated the run with a simple standard: if the checkpoint did not beat the base or the current best on the same held-out gate, it was not the preview. GPT-5.5 Codex managed that standard across the repo, training notes, RunPod state, reports, and release copy.
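The promotion standard above can be sketched as a small comparison function. This is a minimal illustration, not the project's actual tooling: the `GateResult` type and `should_promote` function are hypothetical names, and the sketch assumes "beat" means a strictly higher weighted score than both the base and the current best on the same gate. The later-checkpoint score in the usage lines is made up for illustration; the base and Born scores are the ones reported for the held-out gate.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GateResult:
    """Score of one checkpoint on the shared held-out gate (hypothetical type)."""
    name: str
    weighted_score: float  # weighted gate score in [0, 1]
    passed: int            # tasks passed
    total: int             # tasks in the gate


def should_promote(candidate: GateResult,
                   base: GateResult,
                   current_best: Optional[GateResult] = None) -> bool:
    """Return True only if the candidate strictly beats every reference.

    Assumption: all results must come from the same gate, which is
    approximated here by requiring the same task count.
    """
    references = [base] if current_best is None else [base, current_best]
    for ref in references:
        if candidate.total != ref.total:
            raise ValueError("results come from different gates")
        if candidate.weighted_score <= ref.weighted_score:
            return False
    return True


# Reported numbers for the held-out coding-agent gate.
base = GateResult("base", 0.8511, 22, 25)
born_v2 = GateResult("born-v2", 0.9244, 23, 25)

# Made-up score, used only to show a later run failing the gate.
later_run = GateResult("later-run-example", 0.90, 22, 25)

print(should_promote(born_v2, base))             # True
print(should_promote(later_run, base, born_v2))  # False
```

A strict comparison keeps a tie from replacing the current best, which matches the idea that the newest artifact has no priority over the strongest evidence.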

Benchmark snapshot

The preview claim is local and bounded.

These are project reports, not public leaderboard submissions. The SWE row is a proxy built from SWE-bench Verified issue text and hints, and it stays a proxy until the official Docker harness is run. The exact-code rows are executable slices, and they show a real regression against base on HumanEval.

Born self (25 tasks)

Base: 0.8511 (22 of 25)

Born: 0.9244 (23 of 25)

Same held-out coding-agent gate used to decide the promoted preview.

SWE Verified proxy (25 tasks)

Base: 0.7561 (19 of 25)

Born: 0.9117 (23 of 25)

Fresh A40 rerun for Born on issue text and hints from SWE-bench Verified. Still a patch-plan proxy, not the official Docker harness.

HumanEval (25 tasks)

Base: 0.84 (21 of 25)

Born: 0.68 (17 of 25)

Executable exact-code slice. Base Qwen is stronger here, so this is a preview limitation, not a win.

MBPP (25 tasks)

Base: 0.68 (17 of 25)

Born: 0.76 (19 of 25)

Executable exact-code slice. Born improves this slice, but not enough to beat base on the combined exact-code sample.

Exact-code combined (50 tasks)

Base: 0.76 (38 of 50)

Born: 0.72 (36 of 50)

HumanEval plus MBPP. Born wins MBPP, loses HumanEval, and trails base by two tasks overall.
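The combined exact-code row can be reproduced from the two slices with simple arithmetic. The sketch below only re-adds the pass counts reported above; the dictionary layout and function name are illustrative choices, not part of the project's reports.

```python
# Pass counts copied from the HumanEval and MBPP slices: (passed, total).
slices = {
    "HumanEval": {"base": (21, 25), "born": (17, 25)},
    "MBPP": {"base": (17, 25), "born": (19, 25)},
}


def combine(model: str) -> tuple[int, int]:
    """Sum passed and total task counts for one model across both slices."""
    passed = sum(slices[name][model][0] for name in slices)
    total = sum(slices[name][model][1] for name in slices)
    return passed, total


base_passed, base_total = combine("base")
born_passed, born_total = combine("born")

print(f"base: {base_passed}/{base_total} = {base_passed / base_total:.2f}")  # base: 38/50 = 0.76
print(f"born: {born_passed}/{born_total} = {born_passed / born_total:.2f}")  # born: 36/50 = 0.72
print(f"gap in tasks: {base_passed - born_passed}")                           # gap in tasks: 2
```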

Build chronology

v0 proved the pipe

The first adapter was intentionally small: 225 rows, about 48K tokens, and a tiny exact-code sanity check. It proved the QLoRA path, not the model.

v1 proved volume is not strategy

The first multimillion-token run improved shape but exposed dilution. It could sound more like a coding agent while still missing important closure behavior.

v2 used failure as data

The promoted preview came from turning known failures into targeted curriculum while preserving the broad v1 foundation. That is why v2 is the release checkpoint.

Later runs were not promoted

Recovery and hotfix runs were preserved because they taught us something, but they did not beat v2 on the same weighted gate. The preview release keeps the strongest evidence, not the newest artifact.

Dataset shape

Not just more rows. Better pressure.

The final preview mix kept 7,097 rows after deduplication and validation. Hidden reasoning markers were removed, and the answer format was biased toward plan, patch/code, checks, and result.
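The cleaning step described above might look roughly like the following. This is a sketch under stated assumptions, not the project's pipeline: it assumes rows are dictionaries with `prompt` and `response` keys, that hidden reasoning appears as `<think>...</think>` spans, and that dedupe is an exact match on whitespace-normalized text. The source does not specify any of these details, and all function names are hypothetical.

```python
import hashlib
import re

# Assumption: hidden reasoning is wrapped in <think>...</think> spans.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)


def strip_hidden_reasoning(text: str) -> str:
    """Remove <think> spans and trim surrounding whitespace."""
    return THINK_BLOCK.sub("", text).strip()


def row_key(row: dict) -> str:
    """Hash whitespace-normalized prompt and response for exact dedupe."""
    parts = [" ".join(row[field].split()) for field in ("prompt", "response")]
    # A NUL separator keeps the prompt/response boundary unambiguous.
    return hashlib.sha256("\x00".join(parts).encode("utf-8")).hexdigest()


def clean_rows(rows: list[dict]) -> list[dict]:
    """Strip hidden reasoning, drop empty responses, and remove duplicates."""
    seen: set[str] = set()
    kept: list[dict] = []
    for row in rows:
        response = strip_hidden_reasoning(row["response"])
        if not response:
            continue  # nothing visible left after stripping
        cleaned = {"prompt": row["prompt"].strip(), "response": response}
        key = row_key(cleaned)
        if key in seen:
            continue  # duplicate of a row already kept
        seen.add(key)
        kept.append(cleaned)
    return kept


example_rows = [
    {"prompt": "Fix the bug", "response": "<think>hmm</think>Plan: patch x"},
    {"prompt": "Fix the bug", "response": "Plan: patch x"},
    {"prompt": "Other task", "response": "<think>only reasoning</think>"},
]
cleaned_rows = clean_rows(example_rows)
print(len(cleaned_rows))  # 1
```

Stripping before hashing matters in this sketch: two rows that differ only in their hidden reasoning collapse into one visible answer and are counted once.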

BigCodeBench, DS-1000, MBPP+, HumanEval+, SWE-bench Verified proxy rows, RepoExec, Claw-style agent planning, and local exact-code repair seeds.

Sanitized reasoning and agent traces from Claude/Opus, DeepSeek V4, Hermes agent data, CodeX-2M-Thinking, WebWorldData, AgentTrove, and related Hugging Face sources.

Teacher-generated rows from CrofAI, OpenRouter, Ring, Kimi, DeepSeek, Mimo, Qwen fallback lanes, and local deterministic generators.

Irish-language synthetic rows for secondary language coverage, kept as a supplement rather than the center of the coding-agent release.

Findings

What the run taught us

A small model does not need more public self-dialogue. It needs compact visible decision rules, concrete code, exact checks, and a result.

Broad distillation can make outputs prettier while making a model less reliable. v2 improved because it mixed rehearsal with failure-targeted data.

Exact-code tasks and tool-use tasks pull in different directions. The next release needs separate gates so one improvement does not hide another regression.
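One way to read the separate-gates finding is as a per-suite non-regression check, rather than a single averaged score. The sketch below is illustrative only: `find_regressions` and the suite keys are hypothetical, and the unweighted mean is shown purely to demonstrate how an average can hide a regression. The scores are the ones from the benchmark snapshot.

```python
def find_regressions(base_scores: dict[str, float],
                     candidate_scores: dict[str, float],
                     tolerance: float = 0.0) -> dict[str, float]:
    """Return suites where the candidate trails base by more than tolerance."""
    regressions = {}
    for suite, base_score in base_scores.items():
        delta = candidate_scores[suite] - base_score
        if delta < -tolerance:
            regressions[suite] = round(delta, 4)
    return regressions


# Scores copied from the benchmark snapshot.
base_scores = {"born_self": 0.8511, "swe_proxy": 0.7561, "humaneval": 0.84, "mbpp": 0.68}
born_scores = {"born_self": 0.9244, "swe_proxy": 0.9117, "humaneval": 0.68, "mbpp": 0.76}

# An unweighted mean improves, which would hide the HumanEval drop.
base_mean = sum(base_scores.values()) / len(base_scores)
born_mean = sum(born_scores.values()) / len(born_scores)
print(born_mean > base_mean)  # True

# Checking each suite on its own surfaces the regression.
print(find_regressions(base_scores, born_scores))  # {'humaneval': -0.16}
```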

The model card matters. A preview release should tell users what the adapter is, what it is not, how it was scored, and where the claims stop.

Release note

Preview means usable, not finished.

Born-9B Preview is public because it finally clears the local evidence bar for a small coding-agent adapter. It is not a claim that the model beats every larger Qwen variant or that it has an official SWE-bench score. Those claims need separate harnesses.

The next work is clear: run official patch-based SWE tasks, improve tool-use without weakening exact-code, and keep publishing the actual reports. The release is a checkpoint in public, not the end of the system.
