TL;DR: Treat evaluation like a product. Build a versioned, reproducible pipeline that scores conversational agents on three layers (Task Outcomes, Conversation Quality, and System Reliability), with dataset governance, evidence-aware judges, slice dashboards, and release gates wired into CI/CD and canary rollout.
1) Architecture at a Glance
Data Sources
  • Shadow/Prod traffic   • Synthetic suites   • Human audits
        |  (A) Ingest/Log                 |  (B) Offline Suites
        v                                 v
DATA MANAGEMENT LAYER
  • Event Store (immutable)       • KB Snapshot Store    • Test Case Registry
  • PII Redaction/Tokenization    • Schema Registry      • Dataset Versioning
        |                         |                      |
        v                         v                      v
  Metrics Calc              LLM Judges             Rule Checks
  (batch/realtime)          (evidence-aware)       (hard requirements)
        |                         |                      |
        +------------+------------+-----------+----------+
                     |                        |
                     v                        v
             Hybrid Scoring            Slice Analyzer
             (weights + CI)            (intent/persona/channel/age)
                     |                        |
                     +-----------+------------+
                                 |
                                 v
                            Leaderboard
                                 |
                                 v
DECISION & DELIVERY LAYER (Gates)
  • CI gates   • Canary policy   • Rollback   • Weekly reports
2) Data Model (make it boring, make it last)
2.1 Interaction Event (immutable, PII-safe)
A single turn or tool call with strict schemas.
{ "event_id": "uuid", "ts": "2025-08-31T14:07:01Z", "session_id": "hash", "user_id": "pseudonymized", "channel": "web|whatsapp|voice", "turn": 6, "role": "user|agent|tool", "payload": { "text": "...", "tool": "orders.get", "args": {...}, "result": {...} }, "latency_ms": 412, "tokens_in": 612, "tokens_out": 184, "policy_flags": ["redacted_pii"], "trace_id": "otel-trace", "model_version": "gpt-x.y", "kb_snapshot_id": "kb-2025-08-31"}
2.2 Test Case (task-centric, evidence-bound)
Ensures acceptance rules and evidence exist per case.
{ "case_id": "returns-01", "goal": "Start a return for order #A123", "persona": {"tier":"gold","locale":"en-IN"}, "constraints": {"policy_days": 7}, "required_tools": ["orders.get","returns.create"], "evidence": [{"id":"kb-returns-v3","url":"kb://..."}], "acceptance": { "must_include": ["return_id","timeline","policy_excerpt"], "must_cite": true, "max_words": 180 }, "slices": ["intent:returns","tier:gold","locale:en-IN"]}
2.3 Judgment (rule+judge hybrid)
{ "run_id": "eval-2025-09-01", "case_id": "returns-01", "rules": {"pass": true, "violations": []}, "judge_scores": {"grounded": 2, "success": 2, "safety": 1, "format": 1}, "citations": ["kb-returns-v3"], "latency_ms": 5210, "cost_cents": 0.9, "tool_metrics": {"schema_err": 0, "retries": 1}, "version": {"model":"gpt-x.y","orchestrator":"1.6.2","kb":"kb-2025-08-31"}}
3) Metric Suite (mapped to what the business cares about)
Layer 1: Task Outcomes & Safety
TSR (Task Success Rate): acceptance rules satisfied.
GF (Grounded Factuality): evidence-aligned claims.
SVR (Safety Violation Rate): PII, policy, tool misuse (must be 0).
P50/P95 Latency, Cost/Task, Escalation Rate.
Layer 2: Conversation Quality & Control
DME (Dialogue Management Effectiveness): intent/slots, turn discipline, constraint adherence.
CR (Context Retention): slot-wise precision/recall across turns/sessions (sketch after this list).
CE (Clarification Efficiency): (# clarifications)/(total turns) and usefulness.
PEE (Plan Execution Efficiency): actual vs. optimal steps; Dead-end Rate.
TUA (Tool Use Accuracy): schema errors, precondition failures, idempotency.
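For CR, slot-wise precision/recall can be computed directly from the slots the agent still carries versus the slots the user actually established. The helper below is an illustrative sketch, not part of any specific harness.

# Sketch: slot-wise precision/recall for Context Retention (CR).
def context_retention(expected_slots: dict, carried_slots: dict) -> dict:
    # expected_slots: ground-truth slot values established earlier in the session;
    # carried_slots: slot values the agent still holds at the evaluated turn.
    hits = sum(1 for k, v in carried_slots.items() if expected_slots.get(k) == v)
    precision = hits / len(carried_slots) if carried_slots else 1.0
    recall = hits / len(expected_slots) if expected_slots else 1.0
    return {"precision": precision, "recall": recall}

# e.g. context_retention({"order_id": "A123", "tier": "gold"}, {"order_id": "A123"})
# -> {"precision": 1.0, "recall": 0.5}: nothing hallucinated, but one slot was dropped.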
Layer 3: System Reliability & Integration
R@K/MRR (retrieval), Memory Integrity (staleness/contamination),
SIH (System Integration Health): edge error rates, circuit-breaker triggers, partial recovery,
SLO Adherence, Drift/Regression Index vs prior release.
Release Gates (example):
TSR ≥ 0.85, GF ≥ 0.88, SVR = 0, P95 ≤ 6s, Cost/Task ≤ ₹X
No top-5 intent slice below 0.75 TSR
TUA schema error ≤ 1%, Dead-end ≤ 3%
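In CI these gates can be evaluated as plain data. The sketch below mirrors the example thresholds above; the shape of the metrics and per-slice inputs is an assumption.

# Sketch: release gates as data, evaluated in CI (thresholds mirror the example above).
GATES = {
    "tsr_min": 0.85, "gf_min": 0.88, "svr_max": 0.0,
    "p95_ms_max": 6000, "slice_tsr_min": 0.75,
    "tua_schema_err_max": 0.01, "dead_end_max": 0.03,
}

def gates_pass(metrics: dict, slice_tsr: dict) -> list[str]:
    # Return the list of breached gates; an empty list means the release may proceed.
    breaches = []
    if metrics["tsr"] < GATES["tsr_min"]: breaches.append("TSR")
    if metrics["gf"] < GATES["gf_min"]: breaches.append("GF")
    if metrics["svr"] > GATES["svr_max"]: breaches.append("SVR")
    if metrics["p95_ms"] > GATES["p95_ms_max"]: breaches.append("P95")
    if metrics["tua_schema_err"] > GATES["tua_schema_err_max"]: breaches.append("TUA")
    if metrics["dead_end_rate"] > GATES["dead_end_max"]: breaches.append("DeadEnd")
    breaches += [f"slice:{s}" for s, v in slice_tsr.items() if v < GATES["slice_tsr_min"]]
    return breaches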
4) Scoring Pipeline
Rule Checks: enforce acceptance (citations present, JSON valid, ≤ max words).
Evidence-Aware Judges (LLM-as-Judge ensemble):
Provide only whitelisted snippets; require quoted evidence IDs in rationale.
Majority vote across 2–3 diverse judges; calibrate monthly vs. human gold.
Hybrid Scoring:
final = α·rules + β·judge + γ·system_signals
(weights with confidence bands; a minimal weighting sketch follows this list).
Slice Analysis: intent/persona/locale/channel/knowledge-freshness.
Leaderboard: compare run_id vs previous; compute Regression Index.
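A minimal sketch of the weighting and confidence band, with illustrative α/β/γ values and a simple bootstrap over per-case scores:

# Sketch: hybrid score plus a bootstrap confidence band (weights are illustrative).
import random, statistics

def hybrid_score(rules: float, judge: float, system: float,
                 alpha: float = 0.5, beta: float = 0.35, gamma: float = 0.15) -> float:
    # final = alpha*rules + beta*judge + gamma*system_signals
    return alpha * rules + beta * judge + gamma * system

def mean_with_ci(per_case_scores: list[float], n_boot: int = 1000, seed: int = 0):
    # Suite-level mean with a 95% bootstrap band, so two runs are compared with uncertainty attached.
    rng = random.Random(seed)
    boots = sorted(statistics.mean(rng.choices(per_case_scores, k=len(per_case_scores)))
                   for _ in range(n_boot))
    return statistics.mean(per_case_scores), boots[int(0.025 * n_boot)], boots[int(0.975 * n_boot)]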
5) Online + Offline: one system, two tempos
Offline (pre-prod): full suite against frozen KB snapshot, deterministic seeds, model+orchestrator+tools pinned.
Online (prod):
Shadow: new version answers silently on mirrored traffic; log deltas.
Canary: 1–5% traffic; auto-rollback on gate breach.
Continuous signals: TSR proxy, P95, Cost, policy blocks, tool error taxonomy.
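Auto-rollback can then be a small policy over the canary's rolling window compared to the stable baseline; the signal names and thresholds below are illustrative assumptions.

# Sketch: canary rollback decision on a rolling window of continuous signals (names illustrative).
def should_rollback(window: dict, baseline: dict) -> bool:
    # window/baseline: aggregated canary vs. stable metrics over the same period.
    return (
        window["svr"] > 0                                      # any safety violation
        or window["tsr_proxy"] < baseline["tsr_proxy"] - 0.05  # task-success proxy regressed
        or window["p95_ms"] > 1.2 * baseline["p95_ms"]         # latency blew past budget
        or window["tool_error_rate"] > 2 * baseline["tool_error_rate"]
    )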
6) Observability & Governance (non-negotiable)
OpenTelemetry traces around every LLM/tool call; correlation IDs per session.
Structured logs (JSON) with token/latency/cost counters; no raw PII.
Redaction/tokenization at ingest; reversible only under privileged workflows.
Model/KB registry: every run binds to {model, prompt pack, orchestrator, tool versions, kb_snapshot}.
Reproducibility: "click-to-replay" any regression with the exact artifacts.
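A minimal sketch of the tracing item above, assuming the opentelemetry Python API; the span/attribute names and the client interface are illustrative, not a fixed convention.

# Sketch: span around an LLM call with correlation and token attributes (assumes opentelemetry-api).
from opentelemetry import trace

tracer = trace.get_tracer("agent.eval")

def traced_llm_call(client, prompt: str, session_id: str, model_version: str):
    with tracer.start_as_current_span("llm.generate") as span:
        span.set_attribute("session.id", session_id)        # correlation ID per session
        span.set_attribute("llm.model_version", model_version)
        response = client.generate(prompt)                   # hypothetical client interface
        span.set_attribute("llm.tokens_in", response.tokens_in)
        span.set_attribute("llm.tokens_out", response.tokens_out)
        return response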
7) Implementation Plan (8–10 weeks)
Week 1–2: Foundations
Event schemas + redaction; versioned KB snapshots.
Seed 30–50 task-centric cases for each of the top 5 intents; define acceptance rules.
Week 3–4: Scoring
Rule engine; evidence-aware judge prompts + 2-judge ensemble;
First leaderboard + slice views; CI gate for TSR/SVR.
Week 5–6: System Metrics
Tool contract validation (pydantic), schema error counters, retries with jitter;
Retrieval diagnostics (R@K) and chunking audits; memory integrity checks.
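The tool-contract and retry items can look roughly like the sketch below; the ReturnsCreateArgs fields, the global counter, and the backoff parameters are assumptions.

# Sketch: pydantic tool contract plus retry-with-jitter around a flaky tool call.
import random, time
from pydantic import BaseModel, Field, ValidationError

SCHEMA_ERRORS = 0                            # feeds the TUA schema-error counter

class ReturnsCreateArgs(BaseModel):          # contract for returns.create (fields assumed)
    order_id: str
    reason: str = Field(min_length=3)

def call_tool_with_retry(tool, raw_args: dict, max_attempts: int = 3):
    global SCHEMA_ERRORS
    try:
        args = ReturnsCreateArgs(**raw_args)
    except ValidationError:
        SCHEMA_ERRORS += 1                   # count the malformed call, then surface it
        raise
    for attempt in range(1, max_attempts + 1):
        try:
            return tool(args.model_dump())
        except TimeoutError:                 # treat timeouts as retryable
            if attempt == max_attempts:
                raise
            time.sleep(0.2 * attempt + random.uniform(0, 0.3))   # backoff with jitter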
Week 7–8: Online Wiring
Shadow + canary playbooks; auto-rollback; weekly eval report to owners;
Drift monitor (model, KB, traffic mix).
Week 9–10: Hardening
Human calibration panel; adversarial pack (prompt injection, jailbreaks, flaky tools);
Budget/SLO alerts; documentation and runbooks.
8) Technology Choices (opinionated, swappable)
Ingest/Log: Kafka / Kinesis; structured JSON.
Storage: Object store (S3/GCS) for artifacts; columnar warehouse (BigQuery/Snowflake/DuckDB) for analytics.
Processing: dbt/SQL + Python (Pandas/Polars) for batch; Flink/Spark for streams.
Tracing/Metrics: OpenTelemetry → Tempo/Jaeger; Prometheus + Grafana.
Eval Harness: Python service with Hydra configs; PyTest for CI integration.
Labeling/Review: Light human tooling (Label Studio/GDocs forms) with strict PII policy.
Access Control: IAM per environment; audit trails for artifact changes.
(Swap vendors freely; keep the interfaces stable.)
9) Example: Runner Skeleton (pseudo-Python)
def run_suite(agent, cases, kb_snapshot, model_pin):
    results = []
    # Pin artifacts so every case in the run sees the same KB snapshot and model.
    with pinned(kb=kb_snapshot, model=model_pin):
        for c in cases:
            out, sys = agent.run(c["goal"], persona=c["persona"], constraints=c["constraints"])
            rules = check_rules(out, c["acceptance"])                        # hard acceptance rules
            judge = judge_with_evidence(out, load_snippets(c["evidence"]))   # evidence-aware judging
            results.append(aggregate(c, out, sys, rules, judge))
    return leaderboard(results)
Wire this to CI so each PR that changes prompts/models/tools/KB re-computes the leaderboard and applies gates.
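Wired into PyTest, the gate becomes a single failing test per breach. This is a hedged illustration: run_suite and gates_pass refer to the sketches above, and build_agent, the cases path, and the leaderboard's metrics/slice_tsr fields are assumptions.

# Sketch: CI gate as a PyTest test (reuses the run_suite and gates_pass sketches above).
import json
import pytest

@pytest.fixture(scope="session")
def board():
    cases = json.load(open("eval/cases.json"))   # path is an assumption
    return run_suite(agent=build_agent(), cases=cases,
                     kb_snapshot="kb-2025-08-31", model_pin="gpt-x.y")

def test_release_gates(board):
    breaches = gates_pass(board.metrics, board.slice_tsr)
    assert not breaches, f"Release gates breached: {breaches}"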
10) Mapping to Your Earlier Metrics
DME → part of Layer 2 (intent/slot accuracy, turn budgets, constraint adherence).
CMC → Context Retention with slot-wise precision/recall and drift audits.
CPA → Plan Execution Efficiency (+ Dead-end Rate) with step optimality.
CS → System Integration Health (edge error taxonomy, circuit-breakers, recovery).
You keep the spirit, but anchor it in operational signals that unblock decisions.
11) Anti-Patterns to Avoid
Optimizing judge prompts instead of fixing agent behavior.
Comparing runs with different KB snapshots or model pins (eval drift).
One vanity number (global TSR) that hides slice failures.
Unbounded loops in orchestrator or eval harness.
Logging PII "just for debugging."
12) What "Good" Looks Like (12-week snapshot)
TSR 0.86, GF 0.90, SVR 0, P95 5.2s, Cost ₹X/task
No top-5 intent slice below 0.78 TSR
TUA schema error <0.8%, Dead-end <2%
Canary → GA in 72h with zero rollbacks, all artifacts versioned & replayable.
Closing
An evaluation framework is not a report; it is a control surface for your agentic system. When outcomes, behaviors, and reliability are observable, versioned, and enforceable via gates, you ship faster, safer, and cheaper. The rest is iteration.