Born
Applied Intelligence

Research, method, and evaluation

One page for how Born actually works.

This page replaces the split between “company”, “method”, and “research”. It holds the operating stance, the method, the benchmark frame, and the current Born-9B release arc in one place.

v2: public preview anchor

0.9244: local weighted score

0.8559: SWE proxy score

precision: current release thesis

Applied, not abstract

Born exists to produce deployable model behavior and release evidence, not to pad out a research identity with generic theory.

Claims stay behind evals

A run can finish cleanly and still fail promotion. That has already happened several times in the Born-9B history and is now part of the operating posture.

Public notes are part of the artifact

The run reports, release pages, benchmark packs, and postmortems are not marketing extras. They are part of the product surface.

The Born method

Systematic, not improvised.

These six steps are the real operating pattern behind model work, not a retrospective framework invented after the fact.

01

Frame the work

Define the exact behavior, operational job, failure conditions, and what a successful release would actually mean.

02

Establish baseline

Measure the base model before changing anything. That makes later gains, ties, and regressions legible.

03

Build the data

Use teacher lanes, synthetic generation, curation, provenance, and strict validation before training ever starts.

04

Adapt the model

Train narrowly around the target behavior, keep the budget visible, and preserve artifacts from every significant branch.

05

Evaluate honestly

Run the same held-out gates, track where the model gets stronger, and keep failed attempts visible instead of burying them.

06

Operationalize

Package the checkpoint, document the run, expose the caveats, and route the model into real product surfaces like Born Chat.
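
As an illustration of step 03, the "strict validation before training" idea could be sketched roughly as below. This is a minimal sketch, not Born's actual pipeline: the row schema (`prompt`, `response`, `source`), the `validate_rows` helper, and the duplicate check are all assumptions made for the example.

```python
import hashlib
import json

# Hypothetical row schema: "source" stands in for a provenance field.
REQUIRED_FIELDS = ("prompt", "response", "source")


def validate_rows(rows):
    """Split rows into (accepted, rejected) before any training starts.

    A row is rejected if a required field is missing or blank,
    or if its prompt/response pair duplicates an earlier accepted row.
    """
    accepted, rejected, seen = [], [], set()
    for row in rows:
        missing = [
            field for field in REQUIRED_FIELDS
            if not isinstance(row.get(field), str) or not row[field].strip()
        ]
        if missing:
            rejected.append((row, "missing: " + ", ".join(missing)))
            continue
        # Hash the content pair so duplicates are caught across sources.
        key = hashlib.sha256(
            json.dumps([row["prompt"], row["response"]]).encode("utf-8")
        ).hexdigest()
        if key in seen:
            rejected.append((row, "duplicate"))
            continue
        seen.add(key)
        accepted.append(row)
    return accepted, rejected


rows = [
    {"prompt": "Fix the off-by-one", "response": "use range(n)", "source": "teacher-a"},
    {"prompt": "Fix the off-by-one", "response": "use range(n)", "source": "teacher-b"},
    {"prompt": "Explain the diff", "response": "", "source": "synthetic"},
]
accepted, rejected = validate_rows(rows)
print(len(accepted), [reason for _, reason in rejected])
# 1 ['duplicate', 'missing: response']
```

Keeping the rejection reason next to each row keeps curation decisions visible, in the same spirit as keeping failed runs visible.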

Benchmarking and evaluation

Research here means measurement infrastructure.

BornBench, the fixed local gate, the SWE proxy, and promotion rules are all part of the same discipline: a page does not get to call a run “better” just because the run completed.

25 tasks

Local held-out gate

The fixed internal coding-agent gate is still the primary promotion rule for Born-9B. The current public result is Born-9B Preview v2 at 0.9244 (23/25 tasks) versus base Qwen at 0.8511 (22/25 tasks).

22 / 25

SWE-bench proxy

Born reports a local issue-resolution proxy derived from SWE-bench Verified. It is useful, but it is explicitly not the official Docker harness score.

96 items

BornBench

BornBench measures broader coding-agent behavior across execution, repo context, tool protocol, security, long context, and related lanes. It is evaluation infrastructure, not a Born-9B vanity board.

same gate

Promotion rule

A later adapter does not get promoted unless it beats the current best checkpoint on the same held-out gate. That rule is why v2 remains the public preview.
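
The promotion rule can be sketched as a single strict comparison. This is a hedged sketch under stated assumptions: the `GateResult` shape, the `gate_id` label, and the choice to compare on weighted score rather than pass count are illustrative, and the "later adapter" score is invented for the example. Only the v2 and base Qwen numbers come from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    """One checkpoint's outcome on a held-out gate (hypothetical shape)."""
    checkpoint: str
    gate_id: str
    weighted_score: float
    passed: int
    total: int


def should_promote(candidate: GateResult, current_best: GateResult) -> bool:
    """Promote only on a strict win over the current best on the same gate."""
    if candidate.gate_id != current_best.gate_id:
        raise ValueError("candidate and current best were scored on different gates")
    # Assumption: the weighted score is the deciding metric. A tie does not promote.
    return candidate.weighted_score > current_best.weighted_score


base = GateResult("base Qwen", "local-25", 0.8511, 22, 25)
v2 = GateResult("Born-9B Preview v2", "local-25", 0.9244, 23, 25)
# Invented score, used only to show a clean run that still fails promotion.
later = GateResult("hypothetical later adapter", "local-25", 0.9100, 23, 25)

print(should_promote(v2, base))   # True
print(should_promote(later, v2))  # False
```

Raising on a gate mismatch, rather than comparing anyway, is one way to enforce the "same gate" part of the rule in code.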

BornBench v2 practical snapshot

Model               BornScore
Qwen 3.6-35B A3B    65.22
Mimo v2.5           63.76
GPT-5.4 nano        62.04
Qwen 3.5-9B base    30.75

The point is not that Born-9B should top this table today. The point is that Born now owns an evaluation surface that can expose where models are strong, brittle, or format-compliant across more than one narrow release gate.

Current model arc

The repo’s actual story is four stages long.

v0 proof artifact

A cheap first run, 225 rows, a loadable LoRA, and an honest tie with base Qwen. Useful because the full loop existed, not because the score was impressive.

v1 bigger but weaker

The dataset jumped to the multi-million-token range, but the first large continuation still lost to base on the 25-task gate. Bigger did not mean better.

v2 promoted

Generated-expanded v2 became the turning point once the scorer was corrected. It beat base Qwen on the fixed local gate and on the local SWE proxy.

Post-v2 runs stayed transparent

v3, v4, v2-recovery, exact hotfixes, and preview recovery all remain visible in the repo because they were informative, even when they were not upgrades.

General-release thesis

Precision, not scale.

Scale executable exact-code rows around the remaining narrow failures.

Scale tool-use and provider-closure rows instead of broad generic reasoning data.

Preserve v2 behavior with rehearsal so recovery work does not erase the win.

Cap visible-thinking and generic reasoning imports unless they prove useful on the same held-out gates.
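
The rehearsal and cap points above can be sketched as a data-mix builder. This is an assumption-heavy illustration, not the release recipe: `build_mix`, the 20% rehearsal fraction, and the 5% generic cap are hypothetical values chosen only to show the mechanism.

```python
import random


def build_mix(new_rows, rehearsal_rows, generic_rows,
              rehearsal_fraction=0.2, generic_cap=0.05, seed=0):
    """Combine targeted rows with rehearsal rows and a capped generic slice.

    Both fractions are relative to the number of new targeted rows.
    The default values are placeholders, not tuned settings.
    """
    rng = random.Random(seed)
    n_rehearsal = min(len(rehearsal_rows), round(len(new_rows) * rehearsal_fraction))
    n_generic = min(len(generic_rows), round(len(new_rows) * generic_cap))
    mix = (
        list(new_rows)
        + rng.sample(rehearsal_rows, n_rehearsal)  # replay rows from the v2 data
        + rng.sample(generic_rows, n_generic)      # generic reasoning, capped
    )
    rng.shuffle(mix)
    return mix


new = [f"exact-{i}" for i in range(100)]
old = [f"v2-{i}" for i in range(50)]
generic = [f"reasoning-{i}" for i in range(50)]
mix = build_mix(new, old, generic)
print(len(mix))  # 125
```

Fixing the seed keeps the mix reproducible, so a later run on the same held-out gates can be attributed to the data change rather than to sampling noise.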

Related notes

Read the research trail, not just the headline.
