Research · 5 Apr 2026 · 7 min read

Dataset provenance: why every row needs a paper trail

How Born tracks every training row from source benchmark to teacher model to final dataset, and why this level of documentation is non-negotiable for trustworthy AI.

The provenance problem

Most fine-tuned models ship with little or no documentation of their training data. You get the weights, maybe a model card that names the training dataset, and no way to trace any individual prediction back to its training source. If something goes wrong in production, you have no forensic path.

For enterprises deploying fine-tuned models, this is a compliance risk. For researchers trying to reproduce results, it's a dead end. For the teams building the models, it's a missed opportunity to learn from their own data.

Born's provenance chain

Every row in a Born training dataset carries the following metadata:

source_benchmark

The benchmark family the task came from (e.g., MBPP+, DS-1000, SWE-bench, BigCodeBench).

source_id

The original task identifier within that benchmark.

teacher_model

The teacher model that generated the response (e.g., kimi-k2.6, deepseek-v4-pro).

generation_lane

The manifest lane that assigned this task to this teacher.

generation_timestamp

ISO 8601 timestamp recording when the response was generated.

prompt_hash

SHA-256 hash of the prompt, used for deduplication.

curation_score

Score assigned by the curation pipeline (structural, format, length gates).

source_file

The output JSONL file this row was collected from.
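
As a rough illustration, the fields listed above could be carried as a small record type serialized into each JSONL row. This is a minimal sketch, not Born's actual schema: the class name ProvenanceRecord, the helpers hash_prompt and dedupe, the numeric type of curation_score, and the example identifier, score, and file path are all assumptions made for illustration. Only the field names, the use of SHA-256, and the JSONL output format come from the description above.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceRecord:
    """One training row's paper trail; field names follow the list above."""
    source_benchmark: str
    source_id: str
    teacher_model: str
    generation_lane: str
    generation_timestamp: str  # ISO 8601 string
    prompt_hash: str           # SHA-256 hex digest of the prompt
    curation_score: float      # assumed numeric; the real scale is not specified
    source_file: str


def hash_prompt(prompt: str) -> str:
    """SHA-256 over the prompt's UTF-8 bytes (no text normalization assumed)."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def dedupe(rows):
    """Keep the first row seen for each prompt_hash, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row.prompt_hash not in seen:
            seen.add(row.prompt_hash)
            unique.append(row)
    return unique


prompt = "Write a function that reverses a string."
record = ProvenanceRecord(
    source_benchmark="MBPP+",
    source_id="task-0001",                   # hypothetical identifier
    teacher_model="kimi-k2.6",
    generation_lane="mbpp-kimi",
    generation_timestamp=datetime(2026, 4, 1, tzinfo=timezone.utc).isoformat(),
    prompt_hash=hash_prompt(prompt),
    curation_score=0.92,                     # hypothetical value
    source_file="outputs/mbpp-kimi.jsonl",   # hypothetical path
)

# One JSON object per line keeps the metadata next to the training example.
line = json.dumps(asdict(record))
```

Hashing the raw prompt means two prompts that differ only in whitespace count as distinct rows; whether to normalize before hashing is a design choice the description above does not settle.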

External trace imports

Born's v1 dataset includes external trace imports from several public datasets: Claude Opus reasoning traces, Tachibana4 DeepSeek V4 Pro conversations, Hermes agent traces, and CodeX-2M-Thinking entries. Each import goes through a sanitization step:

Raw hidden thinking blocks are stripped. Any internal chain-of-thought markers from the original model are removed.
Observable implementation content is retained. The code, tool calls, and structured outputs that a user would see are kept.
Content is rewritten into Born's closure format. The response is restructured to match Born's Plan → Code/Patch → Checks → Result format.
Provenance is preserved. The original dataset and source ID are recorded, so the import can be traced back to its origin.
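
The stripping and provenance steps above might look roughly like the following. This is a sketch under stated assumptions: it assumes hidden reasoning is delimited by <think>...</think> tags, which may not match the markers used in each source dataset, and the names ImportedTrace, sanitize_trace, and the example origin ID are hypothetical. The rewrite into the Plan → Code/Patch → Checks → Result format is not covered here.

```python
import re
from dataclasses import dataclass

# Assumption: hidden reasoning is wrapped in <think>...</think> tags.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)


@dataclass
class ImportedTrace:
    text: str            # observable content only
    origin_dataset: str  # preserved so the import can be traced back
    origin_id: str


def sanitize_trace(raw_text: str, origin_dataset: str, origin_id: str) -> ImportedTrace:
    """Strip hidden thinking blocks and keep the visible content plus its origin."""
    visible = THINK_BLOCK.sub("", raw_text)
    # Collapse the runs of blank lines left behind by removed blocks.
    visible = re.sub(r"\n{3,}", "\n\n", visible).strip()
    return ImportedTrace(visible, origin_dataset, origin_id)


raw = (
    "<think>secret reasoning</think>\n"
    "Here is the patch:\n\n"
    "def add(a, b):\n"
    "    return a + b\n"
)
trace = sanitize_trace(raw, "CodeX-2M-Thinking", "row-42")  # hypothetical ID
```

Keeping the origin fields on the same object as the cleaned text makes it hard to lose the link to the source dataset during later rewriting.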

The dataset transparency table

Born publishes a manifest-level summary for every dataset release. The Born-9B v1 transparency table currently lists 7 active manifests across 5 benchmark families, with row counts, teacher assignments, and lane identifiers for each. This table is versioned and updated as new generation lanes are added.

Manifest              Source lane        Rows
DS-1000 DeepSeek      ds1000-deepseek    380+
BigCode Ring          bigcode-ring       320+
MBPP+ Kimi K2         mbpp-kimi          280+
SWE-bench Verified    swe-deepseek       250+
CLAW Multiturn        claw-kimi          180+
RepoExec Owl          repoexec-owl       150+
Irish Synthetic       irish-synth        200+
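
Because every row records its generation lane, a summary like the table above can in principle be recomputed from the rows themselves rather than maintained by hand. The sketch below assumes rows are available as dictionaries with a generation_lane key; the function names summarize_by_lane and format_table and the toy rows are hypothetical, and the counts are not Born's real figures.

```python
from collections import Counter

# Toy rows for illustration; only the lane field matters here.
rows = [
    {"generation_lane": "ds1000-deepseek"},
    {"generation_lane": "mbpp-kimi"},
    {"generation_lane": "ds1000-deepseek"},
    {"generation_lane": "swe-deepseek"},
    {"generation_lane": "ds1000-deepseek"},
    {"generation_lane": "mbpp-kimi"},
]


def summarize_by_lane(rows):
    """Return (lane, row_count) pairs, largest lane first."""
    counts = Counter(row["generation_lane"] for row in rows)
    return counts.most_common()


def format_table(summary):
    """Render the summary as fixed-width plain text."""
    lines = [f"{'Source lane':<20}{'Rows':>6}"]
    for lane, count in summary:
        lines.append(f"{lane:<20}{count:>6}")
    return "\n".join(lines)


summary = summarize_by_lane(rows)
table = format_table(summary)
```

Grouping on source_benchmark instead of generation_lane would give the per-family view, which is the quickest way to spot an underrepresented benchmark family.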

Why this matters

Dataset provenance is not just a compliance exercise. It is a prerequisite for:

Debugging model behavior. When a model fails on a task, provenance lets you trace back to the training rows most similar to that task and inspect their quality.
Iterating on data. If a benchmark family is underrepresented, the provenance table shows it immediately. New generation lanes can be targeted precisely.
Trust. When you hand a model to a client, provenance is the receipt. It says: here is exactly what this model learned from, and here is how we verified it.
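
To make the debugging case concrete, here is one way a lookup from a failing task to its most similar training rows could work. It is a deliberately simple sketch that uses Jaccard overlap of word tokens as the similarity measure; the source does not say how Born measures similarity, and a production system would more likely use embeddings. The function names and the toy rows, including their prompts and source IDs, are hypothetical.

```python
import re


def tokens(text):
    """Lowercased word tokens as a set."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def nearest_rows(failing_prompt, rows, k=3):
    """Return the k (score, row) pairs most similar to the failing prompt."""
    query = tokens(failing_prompt)
    scored = [(jaccard(query, tokens(row["prompt"])), row) for row in rows]
    # Sort on the score only, so rows (dicts) are never compared directly.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy training rows carrying a subset of the provenance fields.
training_rows = [
    {"prompt": "Reverse a linked list in place",
     "source_benchmark": "MBPP+", "source_id": "a1", "teacher_model": "kimi-k2.6"},
    {"prompt": "Group a pandas DataFrame by column and sum",
     "source_benchmark": "DS-1000", "source_id": "b2", "teacher_model": "deepseek-v4-pro"},
    {"prompt": "Reverse a string without slicing",
     "source_benchmark": "MBPP+", "source_id": "c3", "teacher_model": "kimi-k2.6"},
]

matches = nearest_rows("Reverse a doubly linked list", training_rows, k=2)
for score, row in matches:
    print(f"{score:.2f}  {row['source_benchmark']}  {row['source_id']}  {row['teacher_model']}")
```

Each match comes back with its benchmark, source ID, and teacher model, so the next step is to open that row and judge its quality directly.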
