Verification layer for AI-agent PRs

Define Done for AI agents.

Evidence that agent PRs actually work and will not regress.

Define Done verifies the task, not just the diff.

For engineering teams shipping with coding agents: run real agent PRs through executable acceptance gates, separate-context review, and sealed regression replay before they merge.

proof-report · v1 · 50 cases · June 3, 2026
CI only: 0/35 known-bad PRs caught
Timaeus full: 35/35 known-bad PRs caught
timaeus-full recall: 100% (90%, 100%)
timaeus-full false-block: 7% (1%, 30%)
corpus: 35 known-bad · 15 known-good

Internal proof-corpus result, not a public SOTA claim. Wilson 95% intervals shown because N is small.

Built from a working TypeScript reference implementation (Timaeus).

Repository: Private beta · public release in progress
Latest proof run: June 3, 2026
Proof corpus: 50 cases · 35 bad · 15 good
Conformance: VLS self-conformance passing

The problem

Agent PRs need a stronger definition of done.

  • Agents can finish fast and still miss requested behavior.
  • CI can pass while the user-visible workflow is broken.
  • Generic AI review can miss subtle task failures or over-block good changes.
  • Regression risk compounds when many coding agents ship in parallel.
What it does

Verify the task, then seal the evidence.

Define Done verifies agent-authored PRs with executable evidence: calibrated review, acceptance gates, sealed regression replay, and a reproducible proof report — on every change.

  1. Agent claims done
     Capture the task and the completion claim.

  2. CI runs
     Build, typecheck, and existing tests.

  3. Timaeus verifies
     Separate-context review + executable acceptance gates.

  4. Regressions sealed
     Behavior captured for deterministic replay.

  5. Merge / block / warn
     A calibrated verdict with attached evidence.
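
The "regressions sealed" step can be pictured as recording input/output pairs when a change is verified and re-running them against later changes. The following is a minimal sketch under that assumption, for a pure function under test. `SealedCase`, `seal`, and `replay` are hypothetical names chosen for illustration; they are not Timaeus or VLS APIs.

```typescript
// Hypothetical sketch of sealing behavior for deterministic replay.
// None of these names come from Timaeus or VLS.

interface SealedCase<I, O> {
  input: I;
  expected: O; // output observed at the moment the behavior was sealed
}

// Record the current outputs of `fn` for a fixed list of inputs.
function seal<I, O>(fn: (input: I) => O, inputs: I[]): SealedCase<I, O>[] {
  return inputs.map((input) => ({ input, expected: fn(input) }));
}

// Re-run `fn` on every sealed input and return the cases whose output changed.
// Outputs are compared via JSON, so this assumes JSON-serializable results.
function replay<I, O>(
  fn: (input: I) => O,
  sealed: SealedCase<I, O>[],
): SealedCase<I, O>[] {
  return sealed.filter(
    (c) => JSON.stringify(fn(c.input)) !== JSON.stringify(c.expected),
  );
}

// Example: a price calculation sealed once, then checked against a later change.
const priceV1 = (qty: number): number => qty * 10;
const priceV2 = (qty: number): number => (qty > 2 ? qty * 9 : qty * 10);

const sealedPrices = seal(priceV1, [1, 2, 3]);
const regressions = replay(priceV2, sealedPrices);
console.log(regressions.map((c) => c.input)); // [ 3 ]
```

Because the sealed cases are plain data, the same replay can run on every later change without calling a model, which is what makes the check deterministic in this sketch.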

From “done” to a verdict you can trust.

One command replays the whole bench: CI, generic review, separate-context review, and the full Timaeus pipeline — scored against deterministic oracles, with Wilson intervals because N is small.

timaeus — proof

The evidence every run preserves

taskscore.json · BLOCK
case: checkout-promo-code
claim: “Promo codes now apply at checkout.”
confidence: 0.94
acceptance gates: 3/4 passed
  • apply-valid-code
  • reject-expired-code
  • stack-with-sale-price
  • ui-shows-discount-line

A calibrated per-case verdict with evidence references.

TaskScore
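
To make the card above concrete, here is one way such a record could be typed. This is a sketch only: the field names and the `gateSummary` helper are assumptions made for illustration, and the actual VLS wire format may look different. The values are the ones shown on this page.

```typescript
// Hypothetical shape for a per-case verdict record.
// Field names are illustrative; the real VLS format may differ.

type Verdict = "merge" | "block" | "warn";

interface GateResult {
  name: string;
  passed: boolean;
}

interface TaskScore {
  case: string;
  claim: string;
  verdict: Verdict;
  confidence: number; // between 0 and 1
  gates: GateResult[];
}

const exampleScore: TaskScore = {
  case: "checkout-promo-code",
  claim: "Promo codes now apply at checkout.",
  verdict: "block",
  confidence: 0.94,
  gates: [
    { name: "apply-valid-code", passed: true },
    { name: "reject-expired-code", passed: false }, // the one failed gate
    { name: "stack-with-sale-price", passed: true },
    { name: "ui-shows-discount-line", passed: true },
  ],
};

// Summarize gates the way the card does, e.g. "3/4 passed".
function gateSummary(score: TaskScore): string {
  const passed = score.gates.filter((g) => g.passed).length;
  return `${passed}/${score.gates.length} passed`;
}

console.log(gateSummary(exampleScore)); // "3/4 passed"
```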
pr #482 · promo codes
  function applyPromo(code, cart) {
-   if (codes.has(code)) {
+   const promo = codes.get(code);
+   if (promo) {
      cart.discount = promo.amount;
      return cart;
    }
  }
block — expired codes are still accepted; acceptance gate reject-expired-code failed. CI was green.

The merge / block / warn decision, attached to the change.

Diff verdict
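
The verdict above turns on a single executable gate. As a rough sketch of what a `reject-expired-code` check might assert, the example below runs the same gate against the PR's logic and against one possible fix. The `Promo` shape, the `expiresAt` field, the sample codes, and both implementations are assumptions for illustration; the page does not show the real spec file.

```typescript
// Hypothetical sketch of a reject-expired-code acceptance gate.
// The data model and both implementations are assumed, not taken from Timaeus.

interface Promo {
  amount: number;
  expiresAt: Date;
}

interface Cart {
  total: number;
  discount: number;
}

const codes = new Map<string, Promo>([
  ["SPRING10", { amount: 10, expiresAt: new Date("2020-01-01T00:00:00Z") }], // expired
  ["WELCOME5", { amount: 5, expiresAt: new Date("2999-01-01T00:00:00Z") }], // still valid
]);

type ApplyPromo = (code: string, cart: Cart, now: Date) => Cart;

// Follows the logic of the PR diff: looks the code up but never checks expiry.
const applyPromoFromPr: ApplyPromo = (code, cart) => {
  const promo = codes.get(code);
  if (promo) {
    return { ...cart, discount: promo.amount };
  }
  return cart;
};

// One possible fix: ignore promos whose expiry has passed.
const applyPromoFixed: ApplyPromo = (code, cart, now) => {
  const promo = codes.get(code);
  if (promo && promo.expiresAt.getTime() > now.getTime()) {
    return { ...cart, discount: promo.amount };
  }
  return cart;
};

// The gate: applying an expired code must leave the discount at zero.
function rejectExpiredCodeGate(apply: ApplyPromo): boolean {
  const now = new Date("2026-06-03T00:00:00Z");
  const cart = apply("SPRING10", { total: 100, discount: 0 }, now);
  return cart.discount === 0;
}

console.log(rejectExpiredCodeGate(applyPromoFromPr)); // false: the gate fails
console.log(rejectExpiredCodeGate(applyPromoFixed)); // true: the gate passes
```

Both versions would pass a build and typecheck, which is why a gate tied to the task is needed to tell them apart.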
case evidence
  • checkout-promo-code/
    • task.md
    • acceptance/
      • apply-valid-code.spec.ts
      • reject-expired-code.spec.ts
    • review/separate-context.md
    • replay/checkout-promo.json

Per-case inputs, gate output, and review trace.

Case evidence tree
proof-report.md
# proof report — v1 corpus
run: 2026-06-03 · cases: 50
workflow       recall   F1
ci-only        0%       0.00
timaeus-full   100%     0.99
false blocks: 1/15 (Wilson 7%, 1–30%)

A reproducible run summary with the confusion matrix.

Proof report
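
The recall, F1, and false-block figures in the report can be recomputed from raw counts. The sketch below assumes that a "positive" is a known-bad PR the workflow blocked, an assumption that reproduces the values published on this page. The `Counts` type and `metrics` function are hypothetical names.

```typescript
// Sketch: recompute headline metrics from confusion-matrix counts.
// "Positive" here means a known-bad PR that the workflow blocked.

interface Counts {
  badCaught: number; // known-bad PRs blocked (true positives)
  badTotal: number; // all known-bad PRs
  falseBlocks: number; // known-good PRs blocked (false positives)
  goodTotal: number; // all known-good PRs
}

interface Metrics {
  recall: number;
  precision: number;
  f1: number;
  falseBlockRate: number;
}

function metrics(c: Counts): Metrics {
  const recall = c.badCaught / c.badTotal;
  const flagged = c.badCaught + c.falseBlocks;
  // Define precision and F1 as 0 when nothing was flagged, to avoid dividing by zero.
  const precision = flagged === 0 ? 0 : c.badCaught / flagged;
  const f1 =
    precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  const falseBlockRate = c.falseBlocks / c.goodTotal;
  return { recall, precision, f1, falseBlockRate };
}

const timaeusFull = metrics({ badCaught: 35, badTotal: 35, falseBlocks: 1, goodTotal: 15 });
console.log(timaeusFull.f1.toFixed(2)); // "0.99"
console.log((timaeusFull.falseBlockRate * 100).toFixed(0)); // "7"

const ciOnly = metrics({ badCaught: 0, badTotal: 35, falseBlocks: 0, goodTotal: 15 });
console.log(ciOnly.f1.toFixed(2)); // "0.00"
```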
Proof, not vibes

CI caught 0/35. The full Timaeus pipeline caught 35/35.

In the internal v1 Agent PR Verification Bench (50 cases across demo-saas and Epic Stack), CI caught 0 of 35 known-bad agent PRs. The full Timaeus pipeline caught all 35, with one false block across 15 known-good PRs.

  • CI only
    Bad caught: 0/35
    F1: 0.00
    Recall (Wilson 95%): 0% (0%, 10%)
    False blocks: 0%
    Build/typecheck-style checks. Blind to task fulfillment.

  • Generic AI review
    Bad caught: 14/35
    F1: 0.56
    Recall (Wilson 95%): 40% (26%, 56%)
    False blocks: 7%
    Comments on the diff. Misses subtle task failures.

  • Timaeus reviewer
    Bad caught: 30/35
    F1: 0.91
    Recall (Wilson 95%): 86% (71%, 94%)
    False blocks: 7%
    Separate-context review of the diff against the task.

  • Timaeus full
    Bad caught: 35/35
    F1: 0.99
    Recall (Wilson 95%): 100% (90%, 100%)
    False blocks: 7%
    Review + executable acceptance gates + regression replay.

Internal corpus
These are internal proof-corpus results, not a public SOTA claim. Ground truth comes from deterministic oracles, not LLM judgment. The corpus spans 50 cases across two products (demo-saas and Epic Stack) and is expanding further.
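
For readers who want to check the intervals quoted on this page, the Wilson score interval can be computed directly from the counts. The sketch below uses the standard formula with z = 1.96 for a 95% interval. The function names are illustrative, and rounding to whole percentages is an assumption about how the page displays values.

```typescript
// Wilson score interval for a binomial proportion (95% when z = 1.96).
function wilson(successes: number, n: number, z = 1.96): [number, number] {
  const p = successes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  // Clamp to [0, 1] to absorb floating-point noise at the boundaries.
  return [Math.max(0, center - half), Math.min(1, center + half)];
}

// Format a proportion as a whole-number percentage.
const pct = (x: number): string => `${Math.round(x * 100)}%`;

const [recallLo, recallHi] = wilson(35, 35); // timaeus-full recall, 35 of 35
console.log(pct(recallLo), pct(recallHi)); // 90% 100%

const [blockLo, blockHi] = wilson(1, 15); // false blocks, 1 of 15
console.log(pct(blockLo), pct(blockHi)); // 1% 30%
```

With only 15 known-good cases, a single false block is compatible with a true rate anywhere from roughly 1% to 30%, which is why the intervals are reported alongside the point estimates.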
How it differs

Define Done verifies the task and keeps the evidence.

It does not just comment on the diff. Each adjacent tool covers a slice; none of them both verifies task fulfillment and preserves a replayable record.

Approach · Task verified · Executable gates · Evidence preserved · Calibrated verdict
  • CI only
  • Generic AI review
  • Test generation
  • Benchmark-only evals
  • Define Done / Timaeus

Legend: covered · partial · not covered

Open standard

Built on VLS.

VLS is a vendor-neutral wire format for AI-coding-agent verification artifacts. Timaeus is the TypeScript reference implementation.

Private beta repository. Public VLS/Timaeus release in progress.

FAQ

Questions before you run it on real PRs.

What problem does Define Done solve?

AI coding agents finish fast and often claim done while the requested behavior is still broken. CI passes on builds and types but is blind to whether the task was actually fulfilled. Define Done verifies the task and preserves the evidence before the change merges.

How is it different from CI and generic AI review?

CI checks the build, not the task. Generic AI review comments on the diff and misses subtle task failures or over-blocks good changes. Define Done runs executable acceptance gates tied to the task, reviews in a separate context, seals regressions for replay, and emits a calibrated merge / block / warn verdict with attached evidence.

What evidence exists today?

An internal v1 corpus of 50 cases across two products (demo-saas and Epic Stack). The full Timaeus pipeline caught 35/35 known-bad PRs with one false block across 15 known-good PRs. Every rate is reported with a Wilson 95% interval. It is an internal proof corpus, not a public SOTA claim.

What are the caveats?

Small N, so intervals are wide. Results come from an internal corpus, not a leaderboard. Ground truth comes from deterministic oracles, not LLM judgment. Define Done surfaces evidence and calibrated verdicts — it does not guarantee correctness or zero regressions.

How do I try it on real agent PRs?

Join the design-partner program. We run Define Done in shadow mode on 25 real agent PRs — it reports without blocking your pipeline — and hand back a private report on what CI missed, what generic AI review missed, and what Timaeus caught.

What does the design-partner program cost?

Shadow-mode validation is free for the first 3 months. After that, pricing depends on repository count, PR volume, and whether you need hosted evidence retention. Commercial pricing is set after the private validation cohort.

Design partners

Run it in shadow mode on real agent PRs.

We are looking for engineering teams shipping with AI coding agents. Define Done runs in shadow mode on 25 real agent PRs and measures what CI missed, what generic AI review missed, what Timaeus caught, and whether the catch mattered.

Free in shadow mode for the first 3 months. Commercial pricing is set after the private validation cohort.