Verification Workflow
Automated quality assurance with FLIP scoring — how workgraph closes the loop between task completion and proven correctness.
The problem
When AI agents complete tasks, how do you know the work is actually correct? Manual review doesn't scale. Trusting the agent's self-report is unreliable. You need an independent verification step that runs automatically and feeds results back into the system.
Workgraph solves this with a three-stage verification pipeline: a machine-checkable gate, an independent quality evaluation, and a feedback loop that evolves agent assignments based on measured performance.
The pipeline
Here's the full verification flow, from task creation to agent evolution:
wg add --verify "cargo test"      Task created with verify gate
        │
        ▼
Agent works on task               Dispatched by coordinator
        │
        ▼
wg done ────────────────────┐
        │                   │
        ▼                   ▼
Run verify command          Fail? → task stays open
cargo test passes           Agent must fix and retry
        │
        ▼
pending-validation          Task enters review queue
        │
        ▼
.flip-* auto-created        FLIP evaluation task
        │
        ├─ Phase 1: Inference     LLM reconstructs prompt
        │    "What was this agent asked to do?"
        │
        ├─ Phase 2: Comparison    LLM scores alignment
        │    "Does the output match the intent?"
        │
        ▼
FLIP score 0.0 – 1.0        Intent fidelity measurement
        │
        ├─ ≥ threshold → done         Work verified, move on
        │
        └─ < threshold → .verify-*    Repair task created
                │
                ▼
Scores feed evolution       Low scores → re-evaluate agents
wg evolve run               Recombine top performers
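The pipeline above is essentially a small state machine. Here is a minimal sketch of those transitions in Rust; the type and function names are illustrative, not workgraph's actual internals:

```rust
// Illustrative sketch of the task lifecycle from the pipeline diagram.
// Status and outcome names mirror the diagram; they are not real workgraph types.
#[derive(Debug, PartialEq)]
enum Status {
    InProgress,
    PendingValidation,
    Done,
}

#[derive(Debug, PartialEq)]
enum FlipOutcome {
    Passed,       // score >= threshold -> done
    RepairNeeded, // score < threshold -> .verify-* repair task
}

// wg done: the verify gate decides whether the task advances.
fn after_verify(verify_passed: bool) -> Status {
    if verify_passed {
        Status::PendingValidation // enters the review queue
    } else {
        Status::InProgress // agent must fix and retry
    }
}

// FLIP evaluation: compare the score to the configured threshold.
fn after_flip(score: f64, threshold: f64) -> FlipOutcome {
    if score >= threshold {
        FlipOutcome::Passed
    } else {
        FlipOutcome::RepairNeeded
    }
}

fn main() {
    assert_eq!(after_verify(false), Status::InProgress);
    assert_eq!(after_verify(true), Status::PendingValidation);
    assert_eq!(after_flip(0.91, 0.70), FlipOutcome::Passed);
    assert_eq!(after_flip(0.58, 0.70), FlipOutcome::RepairNeeded);
    println!("lifecycle transitions ok");
}
```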
Stage 1: The --verify gate
When you create a task with --verify, you attach a shell command
that must pass before the task can be marked done. This is a hard gate — agents
cannot bypass it.
$ wg add "Implement user auth" \
    --verify "cargo test test_auth_" \
    -d "Add JWT-based authentication endpoint"

When an agent runs wg done, the verify command executes automatically:
- Pass: Task moves to pending-validation status
- Fail: Task stays in-progress — the agent must fix the issue and retry
- Timeout: Treated as failure (configurable timeout, default 60s)
Agents cannot use --skip-verify. Only humans can override the gate.
This ensures that every agent-completed task passes its acceptance criteria before
entering the review pipeline.
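The gate's behavior can be sketched in a few lines: run the command through a shell, treat a non-zero exit as failure, and treat a timeout the same way. This is an assumption-level approximation of the mechanism described above, not workgraph's actual implementation:

```rust
use std::process::Command;
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Sketch of the verify gate: run a shell command, and count both
// non-zero exit and timeout as failure (matching the default-60s rule).
fn run_verify(cmd: &str, timeout: Duration) -> bool {
    let cmd = cmd.to_string();
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let status = Command::new("sh").arg("-c").arg(&cmd).status();
        let _ = tx.send(status.map(|s| s.success()).unwrap_or(false));
    });
    // If the command has not finished by the deadline, the gate fails.
    rx.recv_timeout(timeout).unwrap_or(false)
}

fn main() {
    assert!(run_verify("true", Duration::from_secs(60)));
    assert!(!run_verify("false", Duration::from_secs(60)));
    assert!(!run_verify("sleep 5", Duration::from_millis(100))); // timeout -> fail
    println!("verify gate ok");
}
```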
Stage 2: FLIP scoring
FLIP — Fidelity via Latent Intent Probing — is a two-phase evaluation that measures whether the agent's output actually matches what was asked for. It works by round-tripping the intent:
Phase 1: Inference
An LLM receives only the task's output — artifacts, logs, diffs — and reconstructs what the original prompt must have been. It has no access to the actual task description.
# What the inference LLM sees:
Artifacts: src/auth.rs, src/auth_test.rs
Log: "Added JWT validation middleware"
Diff: +147 lines across 2 files
# What it produces:
"Implement JWT-based authentication with
token validation and test coverage" Phase 2: Comparison
A second LLM compares the inferred prompt to the actual task description and scores the alignment on a 0.0–1.0 scale. High fidelity means the output clearly reflects the original intent; low fidelity means something was lost in translation.
# Actual prompt:
"Add JWT-based authentication endpoint"
# Inferred prompt:
"Implement JWT-based authentication with
token validation and test coverage"
# Score: 0.94 — high fidelity

Using two separate LLM calls — one blind to the input, one blind to the output — prevents the evaluation from rubber-stamping its own reasoning. The round-trip creates a meaningful signal about whether the work actually addressed the request.
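In the real pipeline, Phase 2 is an LLM call. As a deterministic stand-in, a token-overlap (Jaccard) score shows the shape of the signal the comparison produces; this is purely illustrative and not how FLIP actually scores alignment:

```rust
use std::collections::HashSet;

// Stand-in for Phase 2: real FLIP comparison uses an LLM, but a
// token-overlap (Jaccard) score illustrates what "alignment" measures.
fn alignment_score(actual: &str, inferred: &str) -> f64 {
    let toks = |s: &str| -> HashSet<String> {
        s.to_lowercase()
            .split(|c: char| !c.is_alphanumeric())
            .filter(|t| !t.is_empty())
            .map(String::from)
            .collect()
    };
    let (a, b) = (toks(actual), toks(inferred));
    let inter = a.intersection(&b).count() as f64;
    let union = a.union(&b).count() as f64;
    if union == 0.0 { 0.0 } else { inter / union }
}

fn main() {
    let actual = "Add JWT-based authentication endpoint";
    let inferred = "Implement JWT-based authentication with token validation";
    let unrelated = "Refactor the logging subsystem";
    // An inferred prompt close to the original intent scores higher
    // than one describing unrelated work.
    assert!(alignment_score(actual, inferred) > alignment_score(actual, unrelated));
    println!("aligned: {:.2}", alignment_score(actual, inferred));
}
```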
Stage 3: Automated repair
When a FLIP score falls below the configured threshold, the coordinator
automatically creates a .verify-* repair task. This task receives:
- The original task description
- The FLIP score and why it was flagged
- All artifacts and logs from the original task
- Full authority to edit files, run tests, and commit fixes
The repair agent acts as a senior engineer reviewing a junior's work. The instruction is "fix first, fail last" — repair the work if possible, only reject it if the approach is fundamentally wrong.
auth-module (done)                       FLIP: 0.62
└→ .verify-auth-module (in-progress)
     "FLIP score 0.62 below threshold 0.70 — verify and fix this task's work"
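The dispatch rule itself is simple: a below-threshold score yields a .verify-* task carrying the flagged score. A hedged sketch, with an illustrative struct rather than workgraph's real task schema:

```rust
// Sketch of the repair dispatch rule. The struct fields are
// illustrative; workgraph's actual task record carries more context
// (original description, artifacts, logs).
#[derive(Debug, PartialEq)]
struct RepairTask {
    id: String,
    note: String,
}

fn maybe_repair(task_id: &str, score: f64, threshold: f64) -> Option<RepairTask> {
    if score >= threshold {
        return None; // work verified, no repair needed
    }
    Some(RepairTask {
        id: format!(".verify-{}", task_id),
        note: format!(
            "FLIP score {:.2} below threshold {:.2} -- verify and fix this task's work",
            score, threshold
        ),
    })
}

fn main() {
    assert!(maybe_repair("auth-module", 0.94, 0.70).is_none());
    let r = maybe_repair("auth-module", 0.62, 0.70).unwrap();
    assert_eq!(r.id, ".verify-auth-module");
    println!("{}", r.note);
}
```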
The feedback loop
FLIP scores don't just gate individual tasks — they feed back into the agency system to improve future assignments:
- Evaluation records: Every FLIP score is stored as an evaluation in .workgraph/agency/evaluations/, tied to the agent that performed the work.
- Four-dimensional scoring: FLIP's intent fidelity score joins three other evaluation dimensions (correctness, completeness, quality) to build a comprehensive performance profile.
- Evolution: wg evolve run uses accumulated evaluation data to recombine the best-performing agent configurations (role + tradeoff pairs) and retire underperformers.
- Assignment: Future tasks are assigned to agents whose role and tradeoff profile best match the task requirements, informed by historical FLIP scores.
Evaluation data accumulates:

Agent      Role        Tradeoff   Avg FLIP   Tasks
─────────────────────────────────────────────────────
agent-4a2  Programmer  Thorough   0.93       63
agent-8b1  Writer      Balanced   0.88       19
agent-c03  Programmer  Fast       0.71       12
agent-f19  Analyst     Cautious   0.54       8   ← candidate for retirement

Evolution recombines top performers:

Programmer + Thorough  → kept (high scores)
Programmer + Fast      → mutated (borderline)
Analyst + Cautious     → retired (low scores)
This creates a closed loop: tasks are verified, scores accumulate, agents evolve, and the system gets better at matching the right agent to each task. No human needs to review every task — the system self-corrects.
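The keep/mutate/retire decision above can be sketched as a selection pass over average FLIP scores. The band cutoffs (0.85 and 0.65) are invented for this example; workgraph's actual evolution criteria are not specified here:

```rust
// Illustrative selection pass: keep high scorers, mutate borderline
// configurations, retire the rest. Band cutoffs are made up for the
// example, not workgraph's real thresholds.
#[derive(Debug, PartialEq)]
enum Fate {
    Keep,
    Mutate,
    Retire,
}

fn fate(avg_flip: f64) -> Fate {
    match avg_flip {
        s if s >= 0.85 => Fate::Keep,
        s if s >= 0.65 => Fate::Mutate,
        _ => Fate::Retire,
    }
}

fn main() {
    // Mirrors the table above: (agent, average FLIP score)
    let agents = [("agent-4a2", 0.93), ("agent-c03", 0.71), ("agent-f19", 0.54)];
    assert_eq!(fate(agents[0].1), Fate::Keep);   // kept
    assert_eq!(fate(agents[1].1), Fate::Mutate); // mutated
    assert_eq!(fate(agents[2].1), Fate::Retire); // retired
    println!("evolution selection ok");
}
```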
Concrete example
Here's a full walkthrough of the verification workflow in action:
# 1. Create a task with a verify gate
$ wg add "Add rate limiting to API" \
--verify "cargo test test_rate_limit" \
-d "Implement token-bucket rate limiter on /api/* endpoints.
## Validation
- [ ] Failing test: test_rate_limit_rejects_over_threshold
- [ ] Implementation passes the test
- [ ] cargo test passes with no regressions"
# 2. Coordinator dispatches an agent
$ wg service start
[coordinator] Dispatched agent-4a2 to 'add-rate-limiting-to-api'
# 3. Agent completes and runs wg done
Running verify command: cargo test test_rate_limit
Verify command passed
Task 'add-rate-limiting-to-api' → pending-validation
# 4. Coordinator creates FLIP evaluation task
[coordinator] Created '.flip-add-rate-limiting-to-api'
# 5. FLIP agent evaluates
FLIP Phase 1: Inferring prompt from output...
FLIP Phase 2: Comparing prompts...
Score: 0.91 (threshold: 0.70) — passed
# 6. Task marked done, score recorded for evolution
Task 'add-rate-limiting-to-api' → done

If the FLIP score had been below threshold:
Score: 0.58 (threshold: 0.70) — below threshold
[coordinator] Created '.verify-add-rate-limiting-to-api'
Repair agent dispatched to fix the work...

Configuration
The verification pipeline is configured in .workgraph/config.toml:
[agency]
# Enable FLIP evaluation globally
flip_enabled = true
# FLIP score below this triggers a .verify-* repair task
flip_verification_threshold = 0.70
# Auto-evaluate completed tasks
auto_evaluate = true
You can also enable FLIP for individual tasks by adding the flip-eval
tag, without enabling it globally.
Why this matters
Most agent frameworks trust the agent's self-report: if it says "done," it's done. Workgraph doesn't. Every completed task goes through an independent verification pipeline that measures whether the output actually matches the intent.
This gives you three things traditional approaches can't:
- Quality assurance at scale. You can run dozens of agents in parallel without reviewing every task yourself. The verification pipeline catches drift, misinterpretation, and incomplete work automatically.
- Measurable fidelity. FLIP scores are numbers, not vibes. You can track quality trends over time, compare agent configurations, and make data-driven decisions about which agents to assign to which tasks.
- Self-improving coordination. The feedback loop from FLIP scores into agent evolution means the system gets better at assigning the right agent to each task. Poor performers get retired. Good configurations get amplified.
The result is a system that you can trust to run autonomously — not because you've given up control, but because verification is built into the workflow itself.