# BornBench v4 Method

BornBench v4 is a 150-item deterministic benchmark for coding and agentic LLMs. It is intentionally not a clone of SWE-bench, LeetCode, BigCodeBench, LiveCodeBench, BrowseComp, GAIA, AppWorld, tau-bench, Terminal-Bench, or any other public benchmark. The public suites were used only to study task families and failure modes; the v4 prompts and answer keys are original BornBench content.

## Why v4 Exists

BornBench v2 expanded the original suite from a narrow exact-answer smoke test into a 96-item coding-agent benchmark. V4 widens that surface again to 150 items and publishes the answer key as an explicit, separate answer sheet alongside the main benchmark file:

- `bornbench-v4.json`: public benchmark prompts, metadata, canonical answers, aliases, lanes, and scoring contract.
- `bornbench-v4-answer-sheet.json`: machine-readable exact answer sheet for deterministic marking.

The goal is still not to replace full executable harnesses. V4 is a cheap first-pass filter that catches the same kinds of errors those harnesses expose, before a model is sent to expensive repo, terminal, browser, or app-world evaluations.

## Research Basis

The v4 taxonomy was informed by public descriptions of:

- SWE-bench Verified: real GitHub issue repair, hidden tests, and contamination-aware evaluation. https://www.swebench.com/verified.html
- LiveCodeBench: fresh coding, self-repair, code execution, and test-output prediction. https://github.com/LiveCodeBench/LiveCodeBench
- BigCodeBench: practical programming tasks involving library/API composition. https://github.com/bigcode-project/bigcodebench
- CRUXEval: code reasoning and execution over input-output behavior. https://github.com/facebookresearch/cruxeval
- Terminal-Bench: command-line and DevOps-style agent work. https://www.tbench.ai/
- AppWorld: autonomous API workflows over simulated apps. https://appworld.dev/
- tau-bench: tool-agent-user policy following in real-world domains. https://arxiv.org/abs/2406.12045
- BrowseComp: hard-to-find short-answer web evidence tasks. https://openai.com/index/browsecomp/
- GAIA: general assistant tasks with reasoning and tool use. https://ai.meta.com/research/publications/gaia-a-benchmark-for-general-ai-assistants/
- WebArena: web-agent grounding and long-horizon interaction. https://webarena.dev/og/
- RepoBench and Aider Polyglot: repository context and multi-language coding/editing pressure. https://arxiv.org/abs/2306.03091 and https://aider.chat/2024/12/21/polyglot.html

## What Changed From v2

- v2: 96 items across 12 lanes.
- v4: 150 items across 15 lanes.
- v4 adds explicit `leetcode_hard_style`, `database_distributed`, and `performance_concurrency` lanes.
- v4 keeps deterministic exact-answer scoring and aliases.
- v4 adds a separate answer-sheet JSON for exact marking and downstream report generation.

## Lanes

1. `code_execution`: mental execution across Python, Node, SQL, Go, Java, JavaScript, CSS, Ruby, TypeScript.
2. `algorithmic_edge`: LRU, Dijkstra, KMP, union-find, medians, topological uniqueness, rollback state.
3. `leetcode_hard_style`: original hard-style algorithm problems with exact outputs.
4. `patch_reasoning`: SWE-style smallest safe fix and regression reasoning.
5. `repo_context`: package managers, public asset paths, generated files, migrations, dirty worktrees.
6. `tool_protocol`: schema discipline, retries, stale SHAs, untrusted tool output, confirmation gates.
7. `terminal_ops`: shell semantics, permissions, git, pipefail, Docker cache, tar safety.
8. `security`: JWT, SSRF, HMAC, XSS, IDOR, dependency confusion, CSRF, prompt injection.
9. `data_reasoning`: accuracy, weighted scoring, costs, latency, calibration, leakage.
10. `database_distributed`: SQL, isolation, quorums, vector clocks, CRDTs, Kafka, Redis.
11. `web_evidence`: self-contained source freshness, disambiguation, unsupported claims, timestamp reasoning.
12. `long_context`: instruction hierarchy, stale docs, version caveats, refusal boundaries.
13. `agent_policy`: customer-service and app-agent policy decisions.
14. `performance_concurrency`: complexity, deadlock, HTTP cache, N+1 queries, backpressure, stampedes.
15. `abstract_reasoning`: ARC-style symbolic and grid transformations.

## Exact Marking

Every item has one canonical `answer` and optional `aliases`. A response is marked correct only if the answer the scorer extracts from it, after normalization, exactly matches the normalized canonical answer or any normalized alias.

Normalization is unchanged from BornBench v2 practical scoring: trim, strip wrapping quotes/backticks, strip one trailing period, collapse whitespace, and lowercase.
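
The normalization and matching rules above can be sketched roughly as follows. This is a minimal illustration, not the BornBench scorer: the function names `normalize` and `is_correct` are hypothetical, the steps are assumed to run in the order listed, and "wrapping quotes/backticks" is assumed to mean a single matched pair of `"`, `'`, or `` ` ``.

```python
from typing import Iterable

# Assumption: these are the characters treated as "wrapping quotes/backticks".
WRAPPERS = ('"', "'", "`")


def normalize(text: str) -> str:
    """Apply the five normalization steps in the order the text lists them."""
    s = text.strip()                                  # 1. trim
    if len(s) >= 2 and s[0] == s[-1] and s[0] in WRAPPERS:
        s = s[1:-1].strip()                           # 2. strip one wrapping pair
    if s.endswith("."):
        s = s[:-1]                                    # 3. strip one trailing period
    s = " ".join(s.split())                           # 4. collapse whitespace
    return s.lower()                                  # 5. lowercase


def is_correct(extracted: str, answer: str, aliases: Iterable[str] = ()) -> bool:
    """Exact match against the normalized canonical answer or any alias."""
    accepted = {normalize(answer)}
    accepted.update(normalize(alias) for alias in aliases)
    return normalize(extracted) in accepted
```

Under these assumptions, `is_correct("`B.`", "b")` is true, while `is_correct("C", "B", ["b."])` is false. Because step order matters, a real scorer that orders the steps differently could treat inputs such as a quoted answer followed by a period differently.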

The recommended model instruction is:

```text
Solve the task carefully. You may use natural language or code reasoning.
End with exactly one final-answer line, for example: Final answer: B
```
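
One way a scorer might pull the answer out of a response that follows this instruction is sketched below. The source does not specify the extraction rule, so everything here is an assumption: the helper name `extract_final_answer` is hypothetical, the marker is matched case-insensitively, and if several marker lines appear the last one wins.

```python
import re
from typing import Optional

# Assumption: the marker is the literal text "Final answer:" at the start of a
# line, matched case-insensitively, with the answer on the same line.
_FINAL_ANSWER = re.compile(r"\s*final answer\s*:\s*(.*)", re.IGNORECASE)


def extract_final_answer(output: str) -> Optional[str]:
    """Return the text after the last 'Final answer:' line, or None if absent."""
    for line in reversed(output.splitlines()):
        match = _FINAL_ANSWER.match(line)
        if match:
            return match.group(1).strip()
    return None
```

The extracted string is still raw at this point; normalization and comparison against the canonical answer and aliases would happen in a later step.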

The answer sheet intentionally duplicates only the marking data:

- `id`
- `lane`
- `difficulty`
- `answer_type`
- `answer`
- `aliases`
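
As an illustration of how a marker could consume these fields, the sketch below indexes answer-sheet entries by `id`. Only the six field names come from the list above. The top-level layout (a JSON array of objects), the value types, the sample ids, and the `difficulty` and `answer_type` values are all hypothetical, and a missing `aliases` field is assumed to mean an empty list.

```python
import json

# Fields every entry must carry; `aliases` is treated as optional.
REQUIRED_FIELDS = ("id", "lane", "difficulty", "answer_type", "answer")

# Hypothetical sample entries, not real BornBench items.
SAMPLE_SHEET = """
[
  {"id": "example-001", "lane": "security", "difficulty": "hard",
   "answer_type": "choice", "answer": "B", "aliases": ["option b"]},
  {"id": "example-002", "lane": "code_execution", "difficulty": "medium",
   "answer_type": "integer", "answer": "42"}
]
"""


def index_answer_sheet(entries):
    """Validate entries and return a dict keyed by item id."""
    index = {}
    for entry in entries:
        missing = [name for name in REQUIRED_FIELDS if name not in entry]
        if missing:
            raise ValueError(f"entry {entry.get('id')!r} is missing {missing}")
        if entry["id"] in index:
            raise ValueError(f"duplicate id {entry['id']!r}")
        index[entry["id"]] = {
            "lane": entry["lane"],
            "difficulty": entry["difficulty"],
            "answer_type": entry["answer_type"],
            "answer": entry["answer"],
            "aliases": list(entry.get("aliases", [])),
        }
    return index


sheet = index_answer_sheet(json.loads(SAMPLE_SHEET))
```

Rejecting duplicate ids and missing fields up front keeps marking deterministic: every response is compared against exactly one well-formed entry.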

## Copyright And Contamination Position

Do not copy proprietary or closed benchmark questions into BornBench. Do not paste LeetCode problem statements or hidden benchmark answers. BornBench v4 is MIT-licensed because its item text and answer keys are original.

The benchmark uses public benchmark names only as taxonomy labels in `inspired_by` and `research_basis`. Those labels describe what capability family a task belongs to; they are not provenance claims for copied prompts.

## Known Limits

BornBench v4 is static and exact-scored. It cannot fully measure autonomous repository editing, GUI control, browser navigation, terminal execution, or long-horizon API state the way full harnesses can. It should be used as a broad regression suite and preflight screen, not as the final claim about model quality.
