Define Done for AI agents.
Evidence that agent PRs actually work and will not regress.
Define Done verifies the task, not just the diff.
For engineering teams shipping with coding agents: run real agent PRs through executable acceptance gates, separate-context review, and sealed regression replay before they merge.
- timaeus-full recall: 100% (90%, 100%)
- timaeus-full false-block: 7% (1%, 30%)
- corpus: 35 known-bad · 15 known-good
Internal proof-corpus result, not a public SOTA claim. Wilson 95% intervals shown because N is small.
View full proof report
Built from a working TypeScript reference implementation (Timaeus).
- Repository: Private beta · public release in progress
- Latest proof run: June 3, 2026
- Proof corpus: 50 cases · 35 bad · 15 good
- Conformance: VLS self-conformance passing
Agent PRs need a stronger definition of done.
- Agents can finish fast and still miss requested behavior.
- CI can pass while the user-visible workflow is broken.
- Generic AI review can miss subtle task failures or over-block good changes.
- Regression risk compounds when many coding agents ship in parallel.
Verify the task, then seal the evidence.
Define Done verifies agent-authored PRs with executable evidence: calibrated review, acceptance gates, sealed regression replay, and a reproducible proof report — on every change.
- 01 Agent claims done: Capture the task and the completion claim.
- 02 CI runs: Build, typecheck, and existing tests.
- 03 Timaeus verifies: Separate-context review + executable acceptance gates.
- 04 Regressions sealed: Behavior captured for deterministic replay.
- 05 Merge / block / warn: A calibrated verdict with attached evidence.
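The five steps above can be sketched as a small decision function over the evidence each stage produces. This is an illustration only, not the Timaeus implementation: the `StageEvidence` shape, its field names, and the block/warn policy are all assumptions made for the example.

```typescript
// Hypothetical verdict type mirroring step 05 (merge / block / warn).
type Verdict = "merge" | "block" | "warn";

// Hypothetical evidence record; field names are assumptions, not a VLS schema.
interface StageEvidence {
  ciPassed: boolean;         // step 02: build, typecheck, existing tests
  gatesPassed: number;       // step 03: acceptance gates that passed
  gatesTotal: number;        // step 03: acceptance gates that ran
  reviewConcerns: number;    // step 03: issues raised by separate-context review
  replayRegressions: number; // step 04: sealed behaviors that changed on replay
}

// Assumed policy: any hard failure blocks, review-only concerns warn,
// and a clean record merges.
function decide(e: StageEvidence): Verdict {
  const hardFailure =
    !e.ciPassed || e.gatesPassed < e.gatesTotal || e.replayRegressions > 0;
  if (hardFailure) return "block";
  if (e.reviewConcerns > 0) return "warn";
  return "merge";
}

const greenCiButFailingGate: StageEvidence = {
  ciPassed: true,
  gatesPassed: 3,
  gatesTotal: 4,
  reviewConcerns: 0,
  replayRegressions: 0,
};

console.log(decide(greenCiButFailingGate)); // prints "block"
```

The sample record shows the case the page is concerned with: CI is green, yet one acceptance gate fails, so the change is blocked rather than merged.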
From “done” to a verdict you can trust.
One command replays the whole bench: CI, generic review, separate-context review, and the full Timaeus pipeline — scored against deterministic oracles, with Wilson intervals because N is small.
See the quickstart
The evidence every run preserves
TaskScore: a calibrated per-case verdict with evidence references.
- apply-valid-code
- reject-expired-code
- stack-with-sale-price
- ui-shows-discount-line
Diff verdict: the merge / block / warn decision, attached to the change.
Case evidence tree: per-case inputs, gate output, and review trace.
- checkout-promo-code/
  - task.md
  - acceptance/
    - apply-valid-code.spec.ts
    - reject-expired-code.spec.ts
  - review/separate-context.md
  - replay/checkout-promo.json
Proof report: a reproducible run summary with the confusion matrix.
CI caught 0/35. The full Timaeus pipeline caught 35/35.
In the internal v1 Agent PR Verification Bench (50 cases across demo-saas and Epic Stack), CI caught 0 of 35 known-bad agent PRs. The full Timaeus pipeline caught all 35, with one false block across 15 known-good PRs.
| Workflow | Bad caught | Recall (Wilson 95%) | False blocks | F1 | Interpretation |
|---|---|---|---|---|---|
| CI only | 0/35 | 0% (0%, 10%) | 0% | 0.00 | Build/typecheck-style checks. Blind to task fulfillment. |
| Generic AI review | 14/35 | 40% (26%, 56%) | 7% | 0.56 | Comments on the diff. Misses subtle task failures. |
| Timaeus reviewer | 30/35 | 86% (71%, 94%) | 7% | 0.91 | Separate-context review of the diff against the task. |
| Timaeus full | 35/35 | 100% (90%, 100%) | 7% | 0.99 | Review + executable acceptance gates + regression replay. |
Internal proof-corpus result, not a public SOTA claim. Wilson 95% intervals shown because N is small.
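The reported rates can be recomputed from the raw counts. The sketch below uses the standard Wilson score interval with z = 1.96 and the usual F1 formula, treating a blocked known-bad PR as a true positive and a blocked known-good PR as a false positive. The function names are illustrative, and the counts of 1 false block out of 15 for the non-CI workflows are inferred from the 7% column.

```typescript
interface Interval {
  point: number;
  lower: number;
  upper: number;
}

// Wilson score interval for a binomial proportion (z = 1.96 gives ~95%).
function wilson(successes: number, n: number, z = 1.96): Interval {
  if (n <= 0) throw new Error("n must be positive");
  const p = successes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return {
    point: p,
    lower: Math.max(0, center - half),
    upper: Math.min(1, center + half),
  };
}

// F1 from confusion counts: tp = bad PRs blocked, fp = good PRs blocked,
// fn = bad PRs missed.
function f1(tp: number, fp: number, fn: number): number {
  const denom = 2 * tp + fp + fn;
  return denom === 0 ? 0 : (2 * tp) / denom;
}

const pct = (x: number): string => `${Math.round(x * 100)}%`;

function formatRate(successes: number, n: number): string {
  const { point, lower, upper } = wilson(successes, n);
  return `${pct(point)} (${pct(lower)}, ${pct(upper)})`;
}

console.log(formatRate(0, 35));  // 0% (0%, 10%)      CI only recall
console.log(formatRate(14, 35)); // 40% (26%, 56%)    generic AI review recall
console.log(formatRate(30, 35)); // 86% (71%, 94%)    Timaeus reviewer recall
console.log(formatRate(35, 35)); // 100% (90%, 100%)  Timaeus full recall
console.log(formatRate(1, 15));  // 7% (1%, 30%)      false-block rate
console.log(f1(35, 1, 0).toFixed(2)); // 0.99         Timaeus full F1
```

Even a perfect 35/35 has a lower bound near 90%, which is why the page reports intervals rather than bare percentages.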
Define Done verifies the task and keeps the evidence.
It does not just comment on the diff. The adjacent tools each cover a slice; none verify task fulfillment and preserve a replayable record.
| Approach | Task verified | Executable gates | Evidence preserved | Calibrated verdict |
|---|---|---|---|---|
| CI only | | | | |
| Generic AI review | | | | |
| Test generation | | | | |
| Benchmark-only evals | | | | |
| Define Done / Timaeus | | | | |

Legend: covered, partial, not covered.
Built on VLS.
VLS is a vendor-neutral wire format for AI-coding-agent verification artifacts. Timaeus is the TypeScript reference implementation.
Private beta repository. Public VLS/Timaeus release in progress.
Questions before you run it on real PRs.
What problem does Define Done solve?
AI coding agents finish fast and often claim done while the requested behavior is still broken. CI passes on builds and types but is blind to whether the task was actually fulfilled. Define Done verifies the task and preserves the evidence before the change merges.
How is it different from CI and generic AI review?
CI checks the build, not the task. Generic AI review comments on the diff and misses subtle task failures or over-blocks good changes. Define Done runs executable acceptance gates tied to the task, reviews in a separate context, seals regressions for replay, and emits a calibrated merge / block / warn verdict with attached evidence.
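To make "executable acceptance gates tied to the task" concrete, here is a minimal sketch of two gates named after the `apply-valid-code` and `reject-expired-code` cases. Everything in it is hypothetical: `applyPromo`, the `PromoCode` shape, and `runGate` are invented for illustration and are not taken from Timaeus or the demo apps.

```typescript
// Hypothetical domain type; the real application code is not shown on this page.
interface PromoCode {
  code: string;
  percentOff: number;
  expiresAt: Date;
}

// Hypothetical record of one gate run, kept as evidence.
interface GateResult {
  gate: string;
  passed: boolean;
  detail: string;
}

// Hypothetical function under test: applies a promo code to a subtotal in cents.
function applyPromo(subtotalCents: number, promo: PromoCode, now: Date): number {
  if (now.getTime() > promo.expiresAt.getTime()) {
    throw new Error("promo code expired");
  }
  return Math.round(subtotalCents * (1 - promo.percentOff / 100));
}

// Small assertion helper so the example needs no test framework.
function check(condition: boolean, message: string): void {
  if (!condition) throw new Error(message);
}

// A gate is a named, executable check; its outcome is recorded, not just logged.
function runGate(gate: string, body: () => void): GateResult {
  try {
    body();
    return { gate, passed: true, detail: "ok" };
  } catch (err) {
    const detail = err instanceof Error ? err.message : String(err);
    return { gate, passed: false, detail };
  }
}

const now = new Date("2026-06-03T00:00:00Z");
const valid: PromoCode = { code: "SAVE10", percentOff: 10, expiresAt: new Date("2026-12-31T00:00:00Z") };
const expired: PromoCode = { code: "OLD10", percentOff: 10, expiresAt: new Date("2026-01-01T00:00:00Z") };

const results: GateResult[] = [
  runGate("apply-valid-code", () => {
    check(applyPromo(5000, valid, now) === 4500, "expected 10% off 5000 to be 4500");
  }),
  runGate("reject-expired-code", () => {
    let rejected = false;
    try {
      applyPromo(5000, expired, now);
    } catch {
      rejected = true;
    }
    check(rejected, "expired code was accepted");
  }),
];

console.log(results);
```

An agent PR that compiled cleanly but dropped the expiry check would pass CI and still fail the second gate, which is the gap between the build and the task that the answer describes.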
What evidence exists today?
An internal v1 corpus of 50 cases across two products (demo-saas and Epic Stack). The full Timaeus pipeline caught 35/35 known-bad PRs with one false block across 15 known-good PRs. Every rate is reported with a Wilson 95% interval. It is an internal proof corpus, not a public SOTA claim.
What are the caveats?
Small N, so intervals are wide. Results come from an internal corpus, not a leaderboard. Ground truth comes from deterministic oracles, not LLM judgment. Define Done surfaces evidence and calibrated verdicts — it does not guarantee correctness or zero regressions.
How do I try it on real agent PRs?
Join the design-partner program. We run Define Done in shadow mode on 25 real agent PRs — it reports without blocking your pipeline — and hand back a private report on what CI missed, what generic AI review missed, and what Timaeus caught.
What does the design-partner program cost?
Shadow-mode validation is free for the first 3 months. After that, pricing depends on repository count, PR volume, and whether you need hosted evidence retention. Commercial pricing is set after the private validation cohort.
Run it in shadow mode on real agent PRs.
We are looking for engineering teams shipping with AI coding agents. Define Done runs in shadow mode on 25 real agent PRs and measures what CI missed, what generic AI review missed, what Timaeus caught, and whether the catch mattered.
Free in shadow mode for the first 3 months. Commercial pricing is set after the private validation cohort.