Technical · 28 Apr 2026 · 12 min read

Building the Ring teacher mix: six models, one student

The data pipeline, teacher selection criteria, and quality filters that went into Born-9B's training corpus. Why we chose six teachers and how their outputs were weighted.

Why multiple teachers?

A single teacher model biases the student toward that teacher's failure modes. If, for example, a DeepSeek teacher struggles with multi-file repository context, every row it generates for that task type inherits the weakness. The Ring architecture addresses this by assigning different benchmark families to different teachers, so that each teacher covers the part of the task surface where it is strongest.

The name "Ring" comes from the circular nature of the evaluation: each teacher is probed on every benchmark family first, then assigned only the families where it scored above baseline. The assignments form a ring of complementary coverage.

Teacher probe results

Before generating any training data, we ran all six candidate teachers on a shared probe of 40 coding tasks across four categories: exact code (MBPP+), data science (DS-1000), repository repair (SWE-bench), and agentic closure (CLAW). The probe results determined lane assignments:

Kimi K2.6: MBPP+ tail, BigCodeBench, CLAW multiturn

Strongest on multi-step agentic tasks and exact function signatures.

Ring 2.6 1T: DS-1000, SWE-bench, BigCode tail

Reliable on data science and repair-oriented tasks.

DeepSeek v4 Pro: DS-1000 tail, SWE-bench Verified

Highest quality on numpy/pandas/scipy correctness.

DeepSeek v4 Flash: Supplementary fast passes

Cost-effective for large batch generation where quality bar is lower.

Owl Alpha: BigCode tail

Solid coverage on less common library functions.

Trinity: Agent closure and tool-use format

Best at tool-call structuring and action/result supervision.
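
The assignment rule described earlier (probe every teacher on every family, keep only the families where it beats baseline) can be sketched in a few lines. This is an illustration under assumptions, not the code we ran: the function names, the score scale, and the example numbers are all hypothetical, and the real lane assignments also reflect judgment calls that a threshold does not capture.

```python
from collections import defaultdict


def assign_lanes(probe_scores, baseline):
    """Map each teacher to the benchmark families where it beat the baseline.

    probe_scores: {teacher: {family: score}}
    baseline:     {family: score}
    """
    lanes = defaultdict(list)
    for teacher, scores in probe_scores.items():
        for family, score in scores.items():
            if score > baseline[family]:
                lanes[teacher].append(family)
    return dict(lanes)


def uncovered_families(lanes, families):
    """Return families that no teacher was assigned, i.e. gaps in the ring."""
    covered = {family for assigned in lanes.values() for family in assigned}
    return sorted(set(families) - covered)


# Hypothetical scores for illustration only; these are NOT the probe results.
probe_scores = {
    "teacher_a": {"MBPP+": 0.9, "DS-1000": 0.4},
    "teacher_b": {"MBPP+": 0.5, "DS-1000": 0.8},
}
baseline = {"MBPP+": 0.6, "DS-1000": 0.6}

lanes = assign_lanes(probe_scores, baseline)
gaps = uncovered_families(lanes, baseline.keys())
print(lanes, gaps)
```

Checking for uncovered families is the useful part of the sketch: a ring only has complementary coverage if every family ends up in at least one lane.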

The generation pipeline

Each teacher writes to its own JSONL output file in a dedicated lane directory. A seed manifest (YAML) defines the benchmark rows assigned to that teacher. The generation system reads the manifest, prompts the teacher via OpenRouter or direct API, validates the response format, and writes the result.
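
A minimal sketch of one lane's generation loop, under several assumptions: the manifest is shown as an already-parsed dictionary rather than a YAML file, its field names (`lane`, `teacher`, `family`, `rows`) are hypothetical, the format check is a placeholder, and `call_teacher` stands in for whatever client talks to OpenRouter or a direct API.

```python
import json
from pathlib import Path


def validate_response(text):
    """Placeholder format check (assumption): accept any non-empty string."""
    return isinstance(text, str) and bool(text.strip())


def run_lane(manifest, call_teacher, out_dir):
    """Prompt one teacher for every row in its manifest and write JSONL.

    manifest:     parsed seed manifest (hypothetical schema, see lead-in)
    call_teacher: callable (teacher_name, prompt) -> response text
    out_dir:      root directory; each lane gets its own subdirectory
    """
    lane_dir = Path(out_dir) / manifest["lane"]
    lane_dir.mkdir(parents=True, exist_ok=True)
    out_path = lane_dir / "outputs.jsonl"

    written = 0
    with out_path.open("w", encoding="utf-8") as fh:
        for row in manifest["rows"]:
            response = call_teacher(manifest["teacher"], row["prompt"])
            if not validate_response(response):
                continue  # skip malformed responses instead of writing them
            record = {
                "row_id": row["id"],
                "teacher": manifest["teacher"],
                "family": manifest["family"],
                "prompt": row["prompt"],
                "response": response,
            }
            fh.write(json.dumps(record) + "\n")
            written += 1
    return out_path, written
```

Passing the teacher client in as a callable keeps the loop testable with a stub and keeps each lane's output isolated in its own directory.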

After generation, a merge script collects all lane outputs, deduplicates on prompt hash (SHA-256), and produces the combined raw dataset. Duplicate prompts are logged but excluded from the training split. The merge script also generates a provenance report linking every surviving row to its source file, teacher model, and benchmark family.
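
The merge step can be sketched as follows. This is an illustration rather than the actual script: the row fields (`prompt`, `teacher`, `family`) are assumed, and the sketch assumes the first occurrence of a prompt is kept while later copies are logged as duplicates.

```python
import hashlib
import json


def prompt_hash(prompt):
    """SHA-256 hex digest of the prompt text, used as the dedup key."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def merge_lanes(lane_files):
    """Merge lane JSONL files, dropping repeated prompts.

    Returns (merged_rows, duplicate_log, provenance_report).
    Assumption: the first occurrence of a prompt wins.
    """
    first_seen = {}  # digest -> path of the file where it first appeared
    merged, duplicates, provenance = [], [], []

    for path in lane_files:
        with open(path, encoding="utf-8") as fh:
            for line_no, line in enumerate(fh, start=1):
                if not line.strip():
                    continue  # tolerate blank lines
                row = json.loads(line)
                digest = prompt_hash(row["prompt"])
                if digest in first_seen:
                    duplicates.append({
                        "prompt_sha256": digest,
                        "source_file": str(path),
                        "line": line_no,
                        "first_seen_in": first_seen[digest],
                    })
                    continue
                first_seen[digest] = str(path)
                merged.append(row)
                provenance.append({
                    "prompt_sha256": digest,
                    "source_file": str(path),
                    "teacher": row.get("teacher"),
                    "family": row.get("family"),
                })
    return merged, duplicates, provenance
```

Because the provenance entry is created at the moment a row survives deduplication, every row in the merged set has exactly one provenance record.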

Quality filtering

Not every teacher response is usable. The curation step scores each row on structural completeness (does the response contain a Plan, Code/Patch, Checks, and Result?), format compliance (are tool calls properly closed?), and length sanity (is the response between 200 and 8,000 tokens?). Rows that fail any gate are logged in the curation report and excluded.
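
The three gates can be sketched like this. Everything about the concrete checks is an assumption for illustration: section names are matched at the start of a line, tool calls are assumed to use hypothetical `<tool_call>` and `</tool_call>` tags, and tokens are approximated by whitespace splitting rather than a real tokenizer.

```python
import re

# Each entry is a tuple of acceptable names for one required section.
REQUIRED_SECTIONS = (("Plan",), ("Code", "Patch"), ("Checks",), ("Result",))


def has_required_sections(text):
    """Structural completeness: every required section heading is present."""
    for names in REQUIRED_SECTIONS:
        if not any(re.search(rf"^\s*{name}\b", text, re.MULTILINE) for name in names):
            return False
    return True


def tool_calls_closed(text, open_tag="<tool_call>", close_tag="</tool_call>"):
    """Format compliance: every opening tag has a matching close, in order."""
    depth = 0
    pattern = re.escape(open_tag) + "|" + re.escape(close_tag)
    for match in re.finditer(pattern, text):
        if match.group() == open_tag:
            depth += 1
        else:
            depth -= 1
            if depth < 0:
                return False  # a close appeared before any open
    return depth == 0


def curate(text, count_tokens=lambda t: len(t.split()), min_tokens=200, max_tokens=8000):
    """Return the list of failed gates; an empty list means the row passes."""
    failures = []
    if not has_required_sections(text):
        failures.append("structure")
    if not tool_calls_closed(text):
        failures.append("format")
    if not (min_tokens <= count_tokens(text) <= max_tokens):
        failures.append("length")
    return failures
```

Returning the names of the failed gates, rather than a bare boolean, is what lets a curation report say why each row was excluded.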

For the v1 dataset, 8 low-score groups were rejected by curation. The remaining rows were split into train (92%) and validation (8%) sets, stratified by benchmark family to ensure each family is represented in both splits.
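
A stratified 92/8 split can be sketched as below. The `family` field name, the fixed seed, and the rounding rule are assumptions; in particular, this sketch sends at least one row per family to validation whenever the family has more than one row, which may differ from how the actual split handled very small families.

```python
import random
from collections import defaultdict


def stratified_split(rows, val_fraction=0.08, seed=0):
    """Split rows into (train, val), sampling each benchmark family separately."""
    by_family = defaultdict(list)
    for row in rows:
        by_family[row["family"]].append(row)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, val = [], []
    for family in sorted(by_family):  # sorted for a deterministic order
        group = list(by_family[family])
        rng.shuffle(group)
        if len(group) > 1:
            n_val = max(1, round(len(group) * val_fraction))
        else:
            n_val = 0  # a single-row family stays in train
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val


# Hypothetical rows for illustration only.
rows = [{"id": i, "family": "MBPP+" if i % 2 else "DS-1000"} for i in range(100)]
train, val = stratified_split(rows)
print(len(train), len(val))
```

Splitting per family rather than over the pooled rows is what guarantees that a small family does not vanish from the validation set by chance.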

Provenance as a first-class artifact

Every row in the final dataset can be traced back to: which teacher model generated it, which benchmark family it came from, which seed manifest assigned it, and which generation system ran the job. This provenance chain is packaged alongside the model weights in every Born release.

We believe this level of documentation is not optional for applied AI work. If you cannot explain where every training row came from, you cannot defend your model's behavior in production. The Ring architecture makes this tractable by keeping lanes isolated and manifests explicit.
