Verification layer for AI-agent PRs

Define Done for AI agents.

Evidence that agent PRs actually work and will not regress.

Define Done verifies the task, not just the diff.

For engineering teams shipping with coding agents: run real agent PRs through executable acceptance gates, separate-context review, and sealed regression replay before they merge.

Join design partner program · View proof report

proof-report · v1

50 cases · June 3, 2026

CI only

0/35

known-bad PRs caught

Timaeus full

35/35

known-bad PRs caught

timaeus-full recall: 100% (90%, 100%)
timaeus-full false-block: 7% (1%, 30%)
corpus: 35 known-bad · 15 known-good

Internal proof-corpus result, not a public SOTA claim. Wilson 95% intervals shown because N is small.

View full proof report

Built from a working TypeScript reference implementation (Timaeus).

Repository: Private beta · public release in progress
Latest proof run: June 3, 2026
Proof corpus: 50 cases · 35 bad · 15 good
Conformance: VLS self-conformance passing

The problem

Agent PRs need a stronger definition of done.

Agents can finish fast and still miss requested behavior.
CI can pass while the user-visible workflow is broken.
Generic AI review can miss subtle task failures or over-block good changes.
Regression risk compounds when many coding agents ship in parallel.

What it does

Verify the task, then seal the evidence.

Define Done verifies agent-authored PRs with executable evidence: calibrated review, acceptance gates, sealed regression replay, and a reproducible proof report — on every change.

01
Agent claims done
Capture the task and the completion claim.
02
CI runs
Build, typecheck, and existing tests.
03
Timaeus verifies
Separate-context review + executable acceptance gates.
04
Regressions sealed
Behavior captured for deterministic replay.
05
Merge / block / warn
A calibrated verdict with attached evidence.
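
The last step above can be sketched as a small decision function. This is a minimal illustration, not the Timaeus implementation: the type names, the `decide` function, the `minConfidence` threshold, and the rule ordering are all assumptions made for this example. Only the gate names and the 0.94 confidence come from the promo-code case shown on this page.

```typescript
type Verdict = "merge" | "block" | "warn";

interface GateResult {
  name: string;
  passed: boolean;
}

// Hypothetical evidence shape for one case; not the VLS schema.
interface CaseEvidence {
  gates: GateResult[];
  reviewFoundIssue: boolean; // did the separate-context review flag the change?
  reviewConfidence: number;  // 0..1
}

// Assumed policy: any failed gate blocks; a flagged review blocks when
// confident and warns otherwise; everything else merges.
function decide(
  evidence: CaseEvidence,
  minConfidence = 0.8,
): { verdict: Verdict; reasons: string[] } {
  const failed = evidence.gates.filter((g) => !g.passed).map((g) => g.name);
  if (failed.length > 0) {
    return {
      verdict: "block",
      reasons: failed.map((name) => `acceptance gate ${name} failed`),
    };
  }
  if (evidence.reviewFoundIssue) {
    const verdict: Verdict =
      evidence.reviewConfidence >= minConfidence ? "block" : "warn";
    return { verdict, reasons: ["separate-context review flagged the change"] };
  }
  return { verdict: "merge", reasons: [] };
}

// The checkout-promo-code case: 3 of 4 gates passed.
const promoCase: CaseEvidence = {
  gates: [
    { name: "apply-valid-code", passed: true },
    { name: "reject-expired-code", passed: false },
    { name: "stack-with-sale-price", passed: true },
    { name: "ui-shows-discount-line", passed: true },
  ],
  reviewFoundIssue: true, // assumed for illustration
  reviewConfidence: 0.94,
};

console.log(decide(promoCase));
```

Under these assumed rules a single failed gate is enough to block, which matches the promo-code case: CI was green, but one acceptance gate failed.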

From “done” to a verdict you can trust.

One command replays the whole bench: CI, generic review, separate-context review, and the full Timaeus pipeline — scored against deterministic oracles, with Wilson intervals because N is small.

See the quickstart

timaeus — proof

The evidence every run preserves

taskscore.json · BLOCK

case: checkout-promo-code

claim: “Promo codes now apply at checkout.”

confidence: 0.94

acceptance gates: 3/4 passed

apply-valid-code
reject-expired-code
stack-with-sale-price
ui-shows-discount-line

A calibrated per-case verdict with evidence references.

TaskScore →

pr #482 · promo codes

  function applyPromo(code, cart) {
-   if (codes.has(code)) {
+   const promo = codes.get(code);
+   if (promo) {
      cart.discount = promo.amount;
      return cart;
    }

block — expired codes are still accepted; acceptance gate reject-expired-code failed. CI was green.

The merge / block / warn decision, attached to the change.

Diff verdict →
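
To make the failed gate concrete, here is a rough sketch of what an executable check for reject-expired-code might look like. It is not the contents of `reject-expired-code.spec.ts`. The `expiresAt` field, the `SPRING10` code, the `now` parameter, and both function names are hypothetical, introduced only so the example runs on its own.

```typescript
// Hypothetical shapes: the diff only shows promo.amount and cart.discount.
type Promo = { amount: number; expiresAt: Date };
type Cart = { discount: number };
type ApplyPromo = (code: string, cart: Cart, now: Date) => Cart;

// A single promo code that expired at the start of 2026 (made-up fixture).
const codes = new Map<string, Promo>([
  ["SPRING10", { amount: 10, expiresAt: new Date("2026-01-01T00:00:00Z") }],
]);

// Mirrors the logic in the diff: any known code is applied, expired or not.
const asInDiff: ApplyPromo = (code, cart) => {
  const promo = codes.get(code);
  if (promo) {
    cart.discount = promo.amount;
    return cart;
  }
  return cart;
};

// One possible fix: only apply codes that have not expired yet.
const withExpiryCheck: ApplyPromo = (code, cart, now) => {
  const promo = codes.get(code);
  if (promo && promo.expiresAt.getTime() > now.getTime()) {
    cart.discount = promo.amount;
  }
  return cart;
};

// The gate: an expired code must leave the cart without a discount.
function rejectExpiredCodeGate(applyPromo: ApplyPromo): boolean {
  const now = new Date("2026-06-03T00:00:00Z");
  const cart = applyPromo("SPRING10", { discount: 0 }, now);
  return cart.discount === 0;
}

console.log(rejectExpiredCodeGate(asInDiff));        // false
console.log(rejectExpiredCodeGate(withExpiryCheck)); // true
```

Both versions would build and typecheck, which is why a gate tied to the task catches what CI does not.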

case evidence

checkout-promo-code/
task.md
acceptance/
apply-valid-code.spec.ts
reject-expired-code.spec.ts
review/separate-context.md
replay/checkout-promo.json

Per-case inputs, gate output, and review trace.

Case evidence tree →

proof-report.md

# proof report — v1 corpus

run: 2026-06-03 · cases: 50

workflow	recall	F1
ci-only	0%	0.00
timaeus-full	100%	0.99

false blocks: 1/15 (Wilson 7%, 1–30%)

A reproducible run summary with the confusion matrix.

Proof report →

Proof, not vibes

CI caught 0/35. The full Timaeus pipeline caught 35/35.

In the internal v1 Agent PR Verification Bench (50 cases across demo-saas and Epic Stack), CI caught 0 of 35 known-bad agent PRs. The full Timaeus pipeline caught all 35, with one false block across 15 known-good PRs.

Workflow	Bad caught	Recall (Wilson 95%)	False blocks	F1	Interpretation
CI only	0/35	0% (0%, 10%)	0%	0.00	Build/typecheck-style checks. Blind to task fulfillment.
Generic AI review	14/35	40% (26%, 56%)	7%	0.56	Comments on the diff. Misses subtle task failures.
Timaeus reviewer	30/35	86% (71%, 94%)	7%	0.91	Separate-context review of the diff against the task.
Timaeus full	35/35	100% (90%, 100%)	7%	0.99	Review + executable acceptance gates + regression replay.
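
The intervals in the table can be reproduced with the standard Wilson score formula. The sketch below is a generic implementation of that formula with z = 1.96, not code taken from Timaeus; the function and helper names are chosen for this example.

```typescript
// Wilson score interval for a binomial proportion (z = 1.96 gives ~95%).
function wilsonInterval(
  successes: number,
  n: number,
  z = 1.96,
): { lower: number; upper: number } {
  if (n <= 0) throw new Error("n must be positive");
  const p = successes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  // Clamp to [0, 1] to absorb floating-point drift at the extremes.
  return {
    lower: Math.max(0, center - half),
    upper: Math.min(1, center + half),
  };
}

// Round a proportion to a whole percentage, as the table does.
const pct = (x: number): number => Math.round(x * 100);

const recallFull = wilsonInterval(35, 35);  // Timaeus full recall
const falseBlock = wilsonInterval(1, 15);   // false blocks on known-good PRs

console.log(pct(recallFull.lower), pct(recallFull.upper)); // 90 100
console.log(pct(falseBlock.lower), pct(falseBlock.upper)); // 1 30
```

With only 35 known-bad cases, a perfect 35/35 still leaves a lower bound near 90%, which is why the intervals are reported alongside the point estimates.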



Internal corpus

These are internal proof-corpus results, not a public SOTA claim. Ground truth comes from deterministic oracles, not LLM judgment. The corpus spans 50 cases across two products (demo-saas and Epic Stack) and is expanding further.
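
Scoring against oracle labels reduces to a confusion matrix. The following is a minimal sketch, assuming each case carries a ground-truth label and a blocked-or-not outcome. The `ScoredCase` shape and the synthetic case ids are placeholders; only the aggregate counts (35 known-bad all blocked, 1 of 15 known-good blocked) come from the report.

```typescript
// Hypothetical per-case record: oracle label plus the workflow's outcome.
type ScoredCase = { id: string; knownBad: boolean; blocked: boolean };

function score(cases: ScoredCase[]) {
  let tp = 0, fp = 0, fn = 0, tn = 0;
  for (const c of cases) {
    if (c.knownBad && c.blocked) tp++;        // bad PR caught
    else if (c.knownBad && !c.blocked) fn++;  // bad PR missed
    else if (!c.knownBad && c.blocked) fp++;  // good PR falsely blocked
    else tn++;                                // good PR allowed
  }
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
  const falseBlockRate = fp + tn === 0 ? 0 : fp / (fp + tn);
  const f1Denom = 2 * tp + fp + fn;
  const f1 = f1Denom === 0 ? 0 : (2 * tp) / f1Denom;
  return { tp, fp, fn, tn, recall, falseBlockRate, f1 };
}

// Synthetic corpus matching the reported Timaeus full counts.
const timaeusFull: ScoredCase[] = [
  ...Array.from({ length: 35 }, (_, i) => ({
    id: `bad-${i}`, knownBad: true, blocked: true,
  })),
  ...Array.from({ length: 15 }, (_, i) => ({
    id: `good-${i}`, knownBad: false, blocked: i === 0,
  })),
];

const result = score(timaeusFull);
console.log(result.recall);                           // 1
console.log(Math.round(result.falseBlockRate * 100)); // 7
console.log(result.f1.toFixed(2));                    // "0.99"
```

The same function applied to a workflow that blocks nothing yields recall 0 and F1 0.00, matching the CI-only row.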

Read the full proof report

How it differs

Define Done verifies the task and keeps the evidence.

It does not just comment on the diff. The adjacent tools each cover a slice; none verify task fulfillment and preserve a replayable record.

Approach	Task verified	Executable gates	Evidence preserved	Calibrated verdict
CI only
Generic AI review
Test generation
Benchmark-only evals
Define Done / Timaeus

covered partial not covered

Open standard

Built on VLS.

VLS is a vendor-neutral wire format for AI-coding-agent verification artifacts. Timaeus is the TypeScript reference implementation.

Read about VLS · VLS spec & schemas · Timaeus repo

Private beta repository. Public VLS/Timaeus release in progress.

FAQ

Questions before you run it on real PRs.

What problem does Define Done solve?

AI coding agents finish fast and often claim done while the requested behavior is still broken. CI passes on builds and types but is blind to whether the task was actually fulfilled. Define Done verifies the task and preserves the evidence before the change merges.

How is it different from CI and generic AI review?

CI checks the build, not the task. Generic AI review comments on the diff and misses subtle task failures or over-blocks good changes. Define Done runs executable acceptance gates tied to the task, reviews in a separate context, seals regressions for replay, and emits a calibrated merge / block / warn verdict with attached evidence.

What evidence exists today?

An internal v1 corpus of 50 cases across two products (demo-saas and Epic Stack). The full Timaeus pipeline caught 35/35 known-bad PRs with one false block across 15 known-good PRs. Every rate is reported with a Wilson 95% interval. It is an internal proof corpus, not a public SOTA claim.

What are the caveats?

Small N, so intervals are wide. Results come from an internal corpus, not a leaderboard. Ground truth comes from deterministic oracles, not LLM judgment. Define Done surfaces evidence and calibrated verdicts — it does not guarantee correctness or zero regressions.

How do I try it on real agent PRs?

Join the design-partner program. We run Define Done in shadow mode on 25 real agent PRs — it reports without blocking your pipeline — and hand back a private report on what CI missed, what generic AI review missed, and what Timaeus caught.

What does the design-partner program cost?

Shadow-mode validation is free for the first 3 months. After that, pricing depends on repository count, PR volume, and whether you need hosted evidence retention. Commercial pricing is set after the private validation cohort.

Design partners

Run it in shadow mode on real agent PRs.

We are looking for engineering teams shipping with AI coding agents. Define Done runs in shadow mode on 25 real agent PRs and measures what CI missed, what generic AI review missed, what Timaeus caught, and whether the catch mattered.

Free in shadow mode for the first 3 months. Commercial pricing is set after the private validation cohort.

Join design partner program