Verification Workflow
Automated quality assurance with FLIP scoring — how workgraph closes the loop between task completion and proven correctness.
The problem
When AI agents complete tasks, how do you know the work is actually correct? Manual review doesn't scale. Trusting the agent's self-report is unreliable. You need an independent verification step that runs automatically and feeds results back into the system.
Workgraph solves this with a three-stage verification pipeline: a machine-checkable gate, an independent quality evaluation, and a feedback loop that evolves agent assignments based on measured performance.
The pipeline
Here's the full verification flow, from task creation to agent evolution:
wg add --verify "cargo test"      Task created with verify gate
        │
        ▼
Agent works on task               Dispatched by coordinator
        │
        ▼
wg done ────────────────────┐
        │                   │
        ▼                   ▼
Run verify command          Fail? → task stays open
cargo test passes           Agent must fix and retry
        │
        ▼
pending-validation          Task enters review queue
        │
        ▼
.flip-* auto-created        FLIP evaluation task
        │
        ├─ Phase 1: Inference     LLM reconstructs prompt
        │    "What was this agent asked to do?"
        │
        ├─ Phase 2: Comparison    LLM scores alignment
        │    "Does the output match the intent?"
        │
        ▼
FLIP score 0.0 – 1.0        Intent fidelity measurement
        │
        ├─ ≥ threshold → done         Work verified, move on
        │
        └─ < threshold → .verify-*    Repair task created
                │
                ▼
Scores feed evolution       Low scores → re-evaluate agents
wg evolve run               Recombine top performers
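The pipeline above is essentially a small state machine. Here is a minimal sketch of those transitions in Rust; the type and function names are illustrative, not workgraph's actual internals:

```rust
// Illustrative sketch of the task lifecycle from the pipeline diagram.
// Status and outcome names mirror the diagram; they are not real workgraph types.
#[derive(Debug, PartialEq)]
enum Status {
    InProgress,
    PendingValidation,
    Done,
}

#[derive(Debug, PartialEq)]
enum FlipOutcome {
    Passed,       // score >= threshold -> done
    RepairNeeded, // score < threshold -> .verify-* repair task
}

// wg done: the verify gate decides whether the task advances.
fn after_verify(verify_passed: bool) -> Status {
    if verify_passed {
        Status::PendingValidation // enters the review queue
    } else {
        Status::InProgress // agent must fix and retry
    }
}

// FLIP evaluation: compare the score to the configured threshold.
fn after_flip(score: f64, threshold: f64) -> FlipOutcome {
    if score >= threshold {
        FlipOutcome::Passed
    } else {
        FlipOutcome::RepairNeeded
    }
}

fn main() {
    assert_eq!(after_verify(false), Status::InProgress);
    assert_eq!(after_verify(true), Status::PendingValidation);
    assert_eq!(after_flip(0.91, 0.70), FlipOutcome::Passed);
    assert_eq!(after_flip(0.58, 0.70), FlipOutcome::RepairNeeded);
    println!("lifecycle transitions ok");
}
```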
Stage 1: The --verify gate
When you create a task with --verify, you attach a shell command
that must pass before the task can be marked done. This is a hard gate — agents
cannot bypass it.
$ wg add "Implement user auth" \
    --verify "cargo test test_auth_" \
    -d "Add JWT-based authentication endpoint"

When an agent runs wg done, the verify command executes automatically:
- Pass: Task moves to pending-validation status
- Fail: Task stays in-progress — the agent must fix the issue and retry
- Timeout: Treated as failure (configurable timeout, default 60s)
Agents cannot use --skip-verify. Only humans can override the gate.
This ensures that every agent-completed task passes its acceptance criteria before
entering the review pipeline.
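The gate's behavior can be sketched in a few lines: run the command through a shell, treat a non-zero exit as failure, and treat a timeout the same way. This is an assumption-level approximation of the mechanism described above, not workgraph's actual implementation:

```rust
use std::process::Command;
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Sketch of the verify gate: run a shell command, and count both
// non-zero exit and timeout as failure (matching the default-60s rule).
fn run_verify(cmd: &str, timeout: Duration) -> bool {
    let cmd = cmd.to_string();
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let status = Command::new("sh").arg("-c").arg(&cmd).status();
        let _ = tx.send(status.map(|s| s.success()).unwrap_or(false));
    });
    // If the command has not finished by the deadline, the gate fails.
    rx.recv_timeout(timeout).unwrap_or(false)
}

fn main() {
    assert!(run_verify("true", Duration::from_secs(60)));
    assert!(!run_verify("false", Duration::from_secs(60)));
    assert!(!run_verify("sleep 5", Duration::from_millis(100))); // timeout -> fail
    println!("verify gate ok");
}
```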
Stage 2: FLIP scoring
FLIP — Fidelity via Latent Intent Probing — is a two-phase evaluation that measures whether the agent's output actually matches what was asked for. It works by round-tripping the intent:
Phase 1: Inference
An LLM receives only the task's output — artifacts, logs, diffs — and reconstructs what the original prompt must have been. It has no access to the actual task description.
# What the inference LLM sees:
Artifacts: src/auth.rs, src/auth_test.rs
Log: "Added JWT validation middleware"
Diff: +147 lines across 2 files
# What it produces:
"Implement JWT-based authentication with
token validation and test coverage" Phase 2: Comparison
A second LLM compares the inferred prompt to the actual task description and scores the alignment on a 0.0–1.0 scale. High fidelity means the output clearly reflects the original intent; low fidelity means something was lost in translation.
# Actual prompt:
"Add JWT-based authentication endpoint"
# Inferred prompt:
"Implement JWT-based authentication with
token validation and test coverage"
# Score: 0.94 — high fidelity

Using two separate LLM calls — one blind to the input, one blind to the output — prevents the evaluation from rubber-stamping its own reasoning. The round-trip creates a meaningful signal about whether the work actually addressed the request.
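In the real pipeline, Phase 2 is an LLM call. As a deterministic stand-in, a token-overlap (Jaccard) score shows the shape of the signal the comparison produces; this is purely illustrative and not how FLIP actually scores alignment:

```rust
use std::collections::HashSet;

// Stand-in for Phase 2: real FLIP comparison uses an LLM, but a
// token-overlap (Jaccard) score illustrates what "alignment" measures.
fn alignment_score(actual: &str, inferred: &str) -> f64 {
    let toks = |s: &str| -> HashSet<String> {
        s.to_lowercase()
            .split(|c: char| !c.is_alphanumeric())
            .filter(|t| !t.is_empty())
            .map(String::from)
            .collect()
    };
    let (a, b) = (toks(actual), toks(inferred));
    let inter = a.intersection(&b).count() as f64;
    let union = a.union(&b).count() as f64;
    if union == 0.0 { 0.0 } else { inter / union }
}

fn main() {
    let actual = "Add JWT-based authentication endpoint";
    let inferred = "Implement JWT-based authentication with token validation";
    let unrelated = "Refactor the logging subsystem";
    // An inferred prompt close to the original intent scores higher
    // than one describing unrelated work.
    assert!(alignment_score(actual, inferred) > alignment_score(actual, unrelated));
    println!("aligned: {:.2}", alignment_score(actual, inferred));
}
```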
Stage 3: Automated repair
When a FLIP score falls below the configured threshold, the coordinator
automatically creates a .verify-* repair task. This task receives:
- The original task description
- The FLIP score and why it was flagged
- All artifacts and logs from the original task
- Full authority to edit files, run tests, and commit fixes
The repair agent acts as a senior engineer reviewing a junior's work. The instruction is "fix first, fail last" — repair the work if possible, only reject it if the approach is fundamentally wrong.
auth-module (done)                       FLIP: 0.62
└→ .verify-auth-module (in-progress)
     "FLIP score 0.62 below threshold 0.70 — verify and fix this task's work"
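The dispatch rule itself is simple: a below-threshold score yields a .verify-* task carrying the flagged score. A hedged sketch, with an illustrative struct rather than workgraph's real task schema:

```rust
// Sketch of the repair dispatch rule. The struct fields are
// illustrative; workgraph's actual task record carries more context
// (original description, artifacts, logs).
#[derive(Debug, PartialEq)]
struct RepairTask {
    id: String,
    note: String,
}

fn maybe_repair(task_id: &str, score: f64, threshold: f64) -> Option<RepairTask> {
    if score >= threshold {
        return None; // work verified, no repair needed
    }
    Some(RepairTask {
        id: format!(".verify-{}", task_id),
        note: format!(
            "FLIP score {:.2} below threshold {:.2} -- verify and fix this task's work",
            score, threshold
        ),
    })
}

fn main() {
    assert!(maybe_repair("auth-module", 0.94, 0.70).is_none());
    let r = maybe_repair("auth-module", 0.62, 0.70).unwrap();
    assert_eq!(r.id, ".verify-auth-module");
    println!("{}", r.note);
}
```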
The feedback loop
FLIP scores don't just gate individual tasks — they feed back into the agency system to improve future assignments:
- Evaluation records: Every FLIP score is stored as an evaluation in .workgraph/agency/evaluations/, tied to the agent that performed the work.
- Four-dimensional scoring: FLIP's intent fidelity score joins three other evaluation dimensions (correctness, completeness, quality) to build a comprehensive performance profile.
- Evolution: wg evolve run uses accumulated evaluation data to recombine the best-performing agent configurations (role + tradeoff pairs) and retire underperformers.
- Assignment: Future tasks are assigned to agents whose role and tradeoff profile best match the task requirements, informed by historical FLIP scores.
Evaluation data accumulates:

Agent      Role        Tradeoff   Avg FLIP   Tasks
─────────────────────────────────────────────────────
agent-4a2  Programmer  Thorough   0.93       63
agent-8b1  Writer      Balanced   0.88       19
agent-c03  Programmer  Fast       0.71       12
agent-f19  Analyst     Cautious   0.54       8   ← candidate for retirement

Evolution recombines top performers:

Programmer + Thorough  → kept (high scores)
Programmer + Fast      → mutated (borderline)
Analyst + Cautious     → retired (low scores)
This creates a closed loop: tasks are verified, scores accumulate, agents evolve, and the system gets better at matching the right agent to each task. No human needs to review every task — the system self-corrects.
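The keep/mutate/retire decision above can be sketched as a selection pass over average FLIP scores. The band cutoffs (0.85 and 0.65) are invented for this example; workgraph's actual evolution criteria are not specified here:

```rust
// Illustrative selection pass: keep high scorers, mutate borderline
// configurations, retire the rest. Band cutoffs are made up for the
// example, not workgraph's real thresholds.
#[derive(Debug, PartialEq)]
enum Fate {
    Keep,
    Mutate,
    Retire,
}

fn fate(avg_flip: f64) -> Fate {
    match avg_flip {
        s if s >= 0.85 => Fate::Keep,
        s if s >= 0.65 => Fate::Mutate,
        _ => Fate::Retire,
    }
}

fn main() {
    // Mirrors the table above: (agent, average FLIP score)
    let agents = [("agent-4a2", 0.93), ("agent-c03", 0.71), ("agent-f19", 0.54)];
    assert_eq!(fate(agents[0].1), Fate::Keep);   // kept
    assert_eq!(fate(agents[1].1), Fate::Mutate); // mutated
    assert_eq!(fate(agents[2].1), Fate::Retire); // retired
    println!("evolution selection ok");
}
```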
Concrete example
Here's a full walkthrough of the verification workflow in action:
# 1. Create a task with a verify gate
$ wg add "Add rate limiting to API" \
--verify "cargo test test_rate_limit" \
-d "Implement token-bucket rate limiter on /api/* endpoints.
## Validation
- [ ] Failing test: test_rate_limit_rejects_over_threshold
- [ ] Implementation passes the test
- [ ] cargo test passes with no regressions"
# 2. Coordinator dispatches an agent
$ wg service start
[coordinator] Dispatched agent-4a2 to 'add-rate-limiting-to-api'
# 3. Agent completes and runs wg done
Running verify command: cargo test test_rate_limit
Verify command passed
Task 'add-rate-limiting-to-api' → pending-validation
# 4. Coordinator creates FLIP evaluation task
[coordinator] Created '.flip-add-rate-limiting-to-api'
# 5. FLIP agent evaluates
FLIP Phase 1: Inferring prompt from output...
FLIP Phase 2: Comparing prompts...
Score: 0.91 (threshold: 0.70) — passed
# 6. Task marked done, score recorded for evolution
Task 'add-rate-limiting-to-api' → done

If the FLIP score had been below threshold:
Score: 0.58 (threshold: 0.70) — below threshold
[coordinator] Created '.verify-add-rate-limiting-to-api'
Repair agent dispatched to fix the work...

Configuration
The verification pipeline is configured in .workgraph/config.toml:
[agency]
# Enable FLIP evaluation globally
flip_enabled = true
# FLIP score below this triggers a .verify-* repair task
flip_verification_threshold = 0.70
# Auto-evaluate completed tasks
auto_evaluate = true
You can also enable FLIP for individual tasks by adding the flip-eval
tag, without enabling it globally.
Why this matters
Most agent frameworks trust the agent's self-report: if it says "done," it's done. Workgraph doesn't. Every completed task goes through an independent verification pipeline that measures whether the output actually matches the intent.
This gives you three things traditional approaches can't:
- Quality assurance at scale. You can run dozens of agents in parallel without reviewing every task yourself. The verification pipeline catches drift, misinterpretation, and incomplete work automatically.
- Measurable fidelity. FLIP scores are numbers, not vibes. You can track quality trends over time, compare agent configurations, and make data-driven decisions about which agents to assign to which tasks.
- Self-improving coordination. The feedback loop from FLIP scores into agent evolution means the system gets better at assigning the right agent to each task. Poor performers get retired. Good configurations get amplified.
The result is a system that you can trust to run autonomously — not because you've given up control, but because verification is built into the workflow itself.