Dataset provenance: why every row needs a paper trail
How Born tracks every training row from source benchmark to teacher model to final dataset, and why this level of documentation is non-negotiable for trustworthy AI.
The provenance problem
Most fine-tuned models ship without any documentation about their training data. You get the weights, maybe a model card that names the training dataset, and no way to trace any individual prediction back to its training source. If something goes wrong in production, you have no forensic path.
For enterprises deploying fine-tuned models, this is a compliance risk. For researchers trying to reproduce results, it's a dead end. For the teams building the models, it's a missed opportunity to learn from their own data.
Born's provenance chain
Every row in a Born training dataset carries the following metadata:
- The benchmark family the task came from (e.g., MBPP+, DS-1000, SWE-bench, BigCodeBench).
- The original task identifier within that benchmark.
- The teacher model that generated the response (e.g., kimi-k2.6, deepseek-v4-pro).
- The manifest lane that assigned this task to this teacher.
- The ISO timestamp of when the response was generated.
- The SHA-256 hash of the prompt, used for deduplication.
- The score assigned by the curation pipeline (structural, format, and length gates).
- The output JSONL file this row was collected from.
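To make the metadata list concrete, here is a minimal Python sketch of what one provenance record and the hash-based deduplication might look like. The field names, the `ProvenanceRecord` class, and the helper functions are illustrative assumptions; Born's actual schema is not shown in this post.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceRecord:
    # All field names are hypothetical; they mirror the list above.
    benchmark_family: str   # e.g. "MBPP+"
    task_id: str            # original task identifier in the benchmark
    teacher_model: str      # e.g. "kimi-k2.6"
    manifest_lane: str      # lane that assigned the task to the teacher
    generated_at: str       # ISO 8601 timestamp
    prompt_sha256: str      # hex digest used for deduplication
    curation_score: float   # score from the curation pipeline
    source_file: str        # output JSONL file the row came from


def prompt_hash(prompt: str) -> str:
    """Return the SHA-256 hex digest of the UTF-8 encoded prompt."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def dedupe_by_prompt(records):
    """Keep the first record seen for each distinct prompt hash."""
    seen = set()
    kept = []
    for record in records:
        if record.prompt_sha256 not in seen:
            seen.add(record.prompt_sha256)
            kept.append(record)
    return kept


def make_record(prompt, family, task_id, teacher, lane, score, source_file):
    """Build a record, stamping it with the current UTC time."""
    return ProvenanceRecord(
        benchmark_family=family,
        task_id=task_id,
        teacher_model=teacher,
        manifest_lane=lane,
        generated_at=datetime.now(timezone.utc).isoformat(),
        prompt_sha256=prompt_hash(prompt),
        curation_score=score,
        source_file=source_file,
    )
```

Hashing the prompt rather than the whole row means two teachers answering the same prompt collapse to one entry under this sketch. Whether that is the desired behavior depends on the pipeline, so treat the "keep first" policy as a placeholder.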
External trace imports
Born's v1 dataset includes external trace imports from several public datasets: Claude Opus reasoning traces, Tachibana4 DeepSeek V4 Pro conversations, Hermes agent traces, and CodeX-2M-Thinking entries. Each import goes through a sanitization step:
- Raw hidden thinking blocks are stripped. Any internal chain-of-thought markers from the original model are removed.
- Observable implementation content is retained. The code, tool calls, and structured outputs that a user would see are kept.
- Content is rewritten into Born's closure format. The response is restructured to match Born's Plan → Code/Patch → Checks → Result format.
- Provenance is preserved. The original dataset and source ID are recorded, so the import can be traced back to its origin.
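The first and last sanitization steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Born's importer: it assumes hidden reasoning is wrapped in `<think>...</think>` tags (real datasets may use other markers), and the function and key names are hypothetical. The rewrite into the closure format is left out because the post does not specify how it is done.

```python
import re

# Assumed marker for hidden reasoning; adjust to the source dataset's format.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def sanitize_import(raw_response: str, source_dataset: str, source_id: str) -> dict:
    """Strip hidden thinking blocks and attach origin metadata.

    The visible content (code, tool calls, structured output) is left
    untouched; only spans matching THINK_BLOCK are removed.
    """
    visible = THINK_BLOCK.sub("", raw_response).strip()
    return {
        "response": visible,
        # Hypothetical key names for the preserved provenance.
        "source_dataset": source_dataset,
        "source_id": source_id,
    }
```

The non-greedy `.*?` with `re.DOTALL` removes each block separately, even when it spans several lines, so visible text between two thinking blocks is kept.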
The dataset transparency table
Born publishes a manifest-level summary for every dataset release. The Born-9B v1 transparency table currently lists 7 active manifests across 5 benchmark families, with row counts, teacher assignments, and lane identifiers for each. This table is versioned and updated as new generation lanes are added.
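A manifest-level summary of this kind can be derived directly from per-row metadata. The sketch below assumes each row is a dict with hypothetical `manifest_lane`, `benchmark_family`, and `teacher_model` keys; it shows one way to aggregate row counts, not the format of Born's published table.

```python
from collections import Counter


def summarize_manifests(rows):
    """Count rows per (lane, benchmark family, teacher) combination."""
    counts = Counter(
        (row["manifest_lane"], row["benchmark_family"], row["teacher_model"])
        for row in rows
    )
    # Sort by the key tuple so the table is stable between releases.
    return [
        {
            "manifest_lane": lane,
            "benchmark_family": family,
            "teacher_model": teacher,
            "rows": n,
        }
        for (lane, family, teacher), n in sorted(counts.items())
    ]
```

Because the summary is computed from the rows themselves, it can be regenerated and diffed whenever a new generation lane is added.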
Why this matters
Dataset provenance is not just a compliance exercise. It is a prerequisite for:
- Debugging model behavior. When a model fails on a task, provenance lets you trace back to the training rows most similar to that task and inspect their quality.
- Iterating on data. If a benchmark family is underrepresented, the provenance table shows it immediately. New generation lanes can be targeted precisely.
- Trust. When you hand a model to a client, provenance is the receipt. It says: here is exactly what this model learned from, and here is how we verified it.
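As an illustration of the "iterating on data" point, the following sketch flags benchmark families whose share of rows falls below a threshold. The `benchmark_family` key and the 10% default are assumptions chosen for the example, not values from Born's pipeline.

```python
from collections import Counter


def underrepresented_families(rows, min_share=0.10):
    """Return {family: share} for families below min_share of all rows."""
    counts = Counter(row["benchmark_family"] for row in rows)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        family: n / total
        for family, n in counts.items()
        if n / total < min_share
    }
```

A family that appears in this result is a candidate for a new, targeted generation lane. Note that a family with zero rows never appears in the counts, so checking against the full list of expected families would need a separate step.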