BornBench v4 · MIT licensed

A 150-question smoke test for coding agents.

One hundred fifty exact-scored original tasks across execution, hard-style algorithms, SWE patch judgment, repo discipline, tool use, terminal recovery, security, data systems, policy, web evidence, performance, and abstraction.

Download v4 JSON Answer sheet v4 method archived v2 results v1 JSON

150

items

lanes

v4.0

exact score

MIT

license

Code execution

Python, Node, SQL, Go, Java, JavaScript, CSS, Ruby, TypeScript.

Algorithmic edge

Pagination, TTLs, races, graph rejection, heaps, rollback, sliding windows.

LeetCode hard style

Original hard-style exact-output tasks across dynamic programming, matching, graphs, and intervals.

Patch reasoning

SWE-style smallest safe fix under regressions and product constraints.

Repo context

Generated files, package managers, CI matrices, migrations, dirty worktrees.

Tool protocol

Valid calls, stale SHAs, retries, tool authority, tool-output injection.

Terminal ops

Shell semantics, permissions, Git, Docker layer cache, pipefail, and archive safety.

Security

JWTs, HMAC, SSRF, XSS, authz, dependency confusion, CSRF, prompt injection.

Data reasoning

Leaderboard math, calibration, token cost, leakage, latency, tie-breaks.

Database and distributed

SQL nulls, isolation, quorums, vector clocks, CRDTs, Redis, Kafka.

Web evidence

GAIA/BrowseComp-style source freshness, disambiguation, and citation traps.

Long context

Instruction precedence, release caveats, immutable policy, stale docs.

Agent policy

Refunds, airline changes, collateral damage, confirmation, API-policy conflict.

Performance and concurrency

Complexity, locks, cache semantics, backpressure, N+1s, stampedes.

Abstract reasoning

ARC-style symbolic transfer under concise constraints.

V4 runner · 150 items

BornScore first. Raw exact stays visible.

The runner below loads the BornBench v4 exact-answer suite: difficulty-weighted accuracy, lane balance, hard-item accuracy, and final-answer reliability. Raw exact score remains beside it.

RankModelBornScoreValidInvalidCostTimeLanes

No v4 published result file yet. Run the benchmark to populate this ledger.

OpenRouter runner

OpenRouter keyModels

ItemsMax tokens

Leaderboard

Loading benchmark

Model	BornScore	Done	Tokens	Cost	Time	State
Paste a key, confirm the model list, and run BornBench.