High-Level Design for a Conversational AI Evaluation Framework (Production Implementation)

TL;DR: Treat evaluation like a product. Build a versioned, reproducible pipeline that scores conversational agents on three layers (Task Outcomes, Conversation Quality, and System Reliability), with dataset governance, evidence-aware judges, slice dashboards, and release gates wired into CI/CD and canary rollout.


1) Architecture at a Glance

┌──────────────────────────────────────────────────────────────────────┐
│ Data Sources                                                         │
│ • Shadow/Prod traffic   • Synthetic suites   • Human audits          │
└──────────────────────────────────────────────────────────────────────┘
          │ (A) Ingest/Log                     │ (B) Offline Suites
          ▼                                    ▼
┌──────────────────────────────────────────────────────────────────────┐
│ DATA MANAGEMENT LAYER                                                │
│ Event Store (immutable) • KB Snapshot Store • Test Case Registry     │
│ PII Redaction/Tokenization • Schema Registry • Dataset Versioning    │
└──────────────────────────────────────────────────────────────────────┘
          │                        │                        │
          ▼                        ▼                        ▼
    Metrics Calc              LLM Judges               Rule Check
     (batch/rt)               (evidence)               (hard req)
          │                        │                        │
          └───────────┬────────────┴───────────┬────────────┘
                      ▼                        ▼
               Hybrid Scoring           Slice Analyzer
               (weights + CI)           (intent/persona/channel/age)
                      │                        │
                      ▼                        │
                 Leaderboard                   │
                      │                        │
                      ▼                        ▼
┌──────────────────────────────────────────────────────────────────────┐
│ DECISION & DELIVERY LAYER (Gates)                                    │
│ • CI gates   • Canary policy   • Rollback   • Weekly reports         │
└──────────────────────────────────────────────────────────────────────┘

2) Data Model (make it boring, make it last)

2.1 Interaction Event (immutable, PII-safe)
A single turn or tool call with strict schemas.

{
  "event_id": "uuid",
  "ts": "2025-08-31T14:07:01Z",
  "session_id": "hash",
  "user_id": "pseudonymized",
  "channel": "web|whatsapp|voice",
  "turn": 6,
  "role": "user|agent|tool",
  "payload": { "text": "...", "tool": "orders.get", "args": {...}, "result": {...} },
  "latency_ms": 412,
  "tokens_in": 612,
  "tokens_out": 184,
  "policy_flags": ["redacted_pii"],
  "trace_id": "otel-trace",
  "model_version": "gpt-x.y",
  "kb_snapshot_id": "kb-2025-08-31"
}
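
A minimal pydantic sketch of this event schema, validated at ingest so downstream metrics code can assume the shape. Field names mirror the JSON above; the enum values, optionality, and pydantic v2 syntax are assumptions rather than a fixed spec.

# Sketch only: mirrors the example event above; optionality and enums are assumptions.
from datetime import datetime
from enum import Enum
from typing import Any, Optional
from pydantic import BaseModel, Field

class Role(str, Enum):
    user = "user"
    agent = "agent"
    tool = "tool"

class Payload(BaseModel):
    text: Optional[str] = None
    tool: Optional[str] = None                     # e.g. "orders.get"
    args: Optional[dict[str, Any]] = None
    result: Optional[dict[str, Any]] = None

class InteractionEvent(BaseModel):
    event_id: str
    ts: datetime
    session_id: str                                # hashed
    user_id: str                                   # pseudonymized
    channel: str                                   # "web" | "whatsapp" | "voice"
    turn: int
    role: Role
    payload: Payload
    latency_ms: int
    tokens_in: int
    tokens_out: int
    policy_flags: list[str] = Field(default_factory=list)
    trace_id: str
    model_version: str
    kb_snapshot_id: str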

2.2 Test Case (task-centric, evidence-bound)
Ensures acceptance rules and evidence exist per case.

{
  "case_id": "returns-01",
  "goal": "Start a return for order #A123",
  "persona": {"tier": "gold", "locale": "en-IN"},
  "constraints": {"policy_days": 7},
  "required_tools": ["orders.get", "returns.create"],
  "evidence": [{"id": "kb-returns-v3", "url": "kb://..."}],
  "acceptance": {
    "must_include": ["return_id", "timeline", "policy_excerpt"],
    "must_cite": true,
    "max_words": 180
  },
  "slices": ["intent:returns", "tier:gold", "locale:en-IN"]
}

2.3 Judgment (rule+judge hybrid)

{
  "run_id": "eval-2025-09-01",
  "case_id": "returns-01",
  "rules": {"pass": true, "violations": []},
  "judge_scores": {"grounded": 2, "success": 2, "safety": 1, "format": 1},
  "citations": ["kb-returns-v3"],
  "latency_ms": 5210,
  "cost_cents": 0.9,
  "tool_metrics": {"schema_err": 0, "retries": 1},
  "version": {"model": "gpt-x.y", "orchestrator": "1.6.2", "kb": "kb-2025-08-31"}
}

3) Metric Suite (mapped to what the business cares about)

Layer 1: Task Outcomes & Safety

  • TSR (Task Success Rate): acceptance rules satisfied.

  • GF (Grounded Factuality): evidence-aligned claims.

  • SVR (Safety Violation Rate): PII, policy, tool misuse (must be 0).

  • P50/P95 Latency, Cost/Task, Escalation Rate.

Layer 2: Conversation Quality & Control

  • DME (Dialogue Management Effectiveness): intent/slots, turn discipline, constraint adherence.

  • CR (Context Retention): slot-wise precision/recall across turns/sessions.

  • CE (Clarification Efficiency): (# clarifications)/(total turns) and usefulness.

  • PEE (Plan Execution Efficiency): actual vs optimal steps; Dead-end Rate.

  • TUA (Tool Use Accuracy): schema errors, precondition failures, idempotency.

Layer 3: System Reliability & Integration

  • R@K/MRR (retrieval), Memory Integrity (staleness/contamination).

  • SIH (System Integration Health): edge error rates, circuit-breaker triggers, partial recovery.

  • SLO Adherence, Drift/Regression Index vs prior release.

Release Gates (example; a gate-check sketch follows the list):

  • TSR ≥ 0.85, GF ≥ 0.88, SVR = 0, P95 ≤ 6s, Cost/Task ≤ ₹X

  • No top-5 intent slice below 0.75 TSR

  • TUA schema error ≤ 1%, Dead-end ≤ 3%
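
One way these gates could be enforced in code. The thresholds mirror the examples above; the metric dictionary shape and helper names are hypothetical, and the rupee budget is left out because it is project-specific.

# Sketch: apply release gates to aggregated run metrics; thresholds mirror the examples above.
GATES = {
    "tsr_min": 0.85,
    "gf_min": 0.88,
    "svr_max": 0.0,
    "p95_latency_ms_max": 6000,
    "slice_tsr_min": 0.75,          # no top-5 intent slice below this
    "tua_schema_err_max": 0.01,
    "dead_end_max": 0.03,
}

def check_gates(run_metrics: dict, slice_metrics: dict) -> list[str]:
    """Return a list of gate violations; an empty list means the release may proceed."""
    violations = []
    if run_metrics["tsr"] < GATES["tsr_min"]:
        violations.append(f"TSR {run_metrics['tsr']:.2f} < {GATES['tsr_min']}")
    if run_metrics["gf"] < GATES["gf_min"]:
        violations.append(f"GF {run_metrics['gf']:.2f} < {GATES['gf_min']}")
    if run_metrics["svr"] > GATES["svr_max"]:
        violations.append("SVR must be 0")
    if run_metrics["p95_latency_ms"] > GATES["p95_latency_ms_max"]:
        violations.append(f"P95 {run_metrics['p95_latency_ms']} ms over budget")
    if run_metrics["tua_schema_err"] > GATES["tua_schema_err_max"]:
        violations.append("TUA schema error rate above 1%")
    if run_metrics["dead_end_rate"] > GATES["dead_end_max"]:
        violations.append("Dead-end rate above 3%")
    for name, m in slice_metrics.items():          # e.g. the top-5 intent slices
        if m["tsr"] < GATES["slice_tsr_min"]:
            violations.append(f"slice {name}: TSR {m['tsr']:.2f} < {GATES['slice_tsr_min']}")
    return violations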


4) Scoring Pipeline

  1. Rule Checks: enforce acceptance (citations present, JSON valid, ≤ max words).

  2. Evidence-Aware Judges (LLM-as-Judge ensemble):

    • Provide only whitelisted snippets; require quoted evidence IDs in rationale.

    • Majority vote across 2–3 diverse judges; calibrate monthly vs human gold.

  3. Hybrid Scoring: final = α·rules + β·judge + γ·system_signals (weights with confidence bands; see the sketch after this list).

  4. Slice Analysis: intent/persona/locale/channel/knowledge-freshness.

  5. Leaderboard: compare run_id vs previous; compute Regression Index.
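
A minimal sketch of steps 2-3 under stated assumptions: judges score 0-2 as in the Judgment example, the weights (α = 0.4, β = 0.4, γ = 0.2) are placeholders, and a percentile bootstrap is one way to attach the confidence band.

import random
import statistics

def judge_majority(judge_scores: list[dict], key: str = "success") -> float:
    """Aggregate 2-3 judges on one rubric dimension (median as the majority proxy), normalized to [0, 1]."""
    votes = [s[key] for s in judge_scores]
    return statistics.median(votes) / 2            # assumes the 0-2 rubric from the Judgment example

def hybrid_score(rules_pass: bool, judge: float, system_ok: float,
                 alpha: float = 0.4, beta: float = 0.4, gamma: float = 0.2) -> float:
    """final = alpha*rules + beta*judge + gamma*system_signals; weights are illustrative, not tuned."""
    return alpha * float(rules_pass) + beta * judge + gamma * system_ok

def bootstrap_ci(scores: list[float], n: int = 1000, level: float = 0.95) -> tuple[float, float]:
    """Percentile bootstrap over per-case scores to put a confidence band on the run mean."""
    means = sorted(statistics.fmean(random.choices(scores, k=len(scores))) for _ in range(n))
    lo = means[int((1 - level) / 2 * n)]
    hi = means[int((1 + level) / 2 * n)]
    return lo, hi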


5) Online + Offline: one system, two tempos

  • Offline (pre-prod): full suite against frozen KB snapshot, deterministic seeds, model+orchestrator+tools pinned.

  • Online (prod):

    • Shadow: new version answers silently on mirrored traffic; log deltas.

    • Canary: 1–5% traffic; auto-rollback on gate breach (see the policy sketch after this list).

    • Continuous signals: TSR proxy, P95, Cost, policy blocks, tool error taxonomy.
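
One way to express the canary policy as code. The signal names (tsr_proxy, policy_blocks, cost_per_task) and the tolerances are assumptions; the actual rollback hook depends on your deployment tooling.

# Sketch: compare canary signals against baseline and decide promote / hold / rollback.
def canary_decision(canary: dict, baseline: dict, max_tsr_drop: float = 0.03) -> str:
    """Return 'promote', 'hold', or 'rollback' from continuous canary signals (names assumed)."""
    if canary["svr"] > 0 or canary["policy_blocks"] > baseline["policy_blocks"] * 1.5:
        return "rollback"                      # safety or policy breaches roll back immediately
    if canary["tsr_proxy"] < baseline["tsr_proxy"] - max_tsr_drop:
        return "rollback"                      # task-success proxy regressed beyond tolerance
    if canary["p95_latency_ms"] > 6000 or canary["cost_per_task"] > baseline["cost_per_task"] * 1.2:
        return "hold"                          # keep traffic at 1-5% and investigate
    return "promote"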


6) Observability & Governance (non-negotiable)

  • OpenTelemetry traces around every LLM/tool call; correlation IDs per session (see the sketch after this list).

  • Structured logs (JSON) with token/latency/cost counters; no raw PII.

  • Redaction/tokenization at ingest; reversible only under privileged workflows.

  • Model/KB registry: every run binds to {model, prompt pack, orchestrator, tool versions, kb_snapshot}.

  • Reproducibility: "click-to-replay" any regression with the exact artifacts.
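
A minimal sketch of the span wrapper around an LLM call using the OpenTelemetry Python API; the attribute names and the client interface are our own placeholders, not an OTel standard.

from opentelemetry import trace

tracer = trace.get_tracer("eval.harness")

def traced_llm_call(client, prompt: str, session_id: str, model_version: str):
    """Wrap a single LLM call in a span carrying token/latency/cost attributes (no raw PII)."""
    with tracer.start_as_current_span("llm.call") as span:
        span.set_attribute("session.id", session_id)       # hashed/pseudonymized upstream
        span.set_attribute("model.version", model_version)
        response = client.complete(prompt)                  # hypothetical client interface
        span.set_attribute("tokens.in", response.tokens_in)
        span.set_attribute("tokens.out", response.tokens_out)
        span.set_attribute("latency.ms", response.latency_ms)
        return response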


7) Implementation Plan (8–10 weeks)

Week 1–2: Foundations

  • Event schemas + redaction; versioned KB snapshots.

  • Seed 30–50 task-centric cases per top 5 intents; define acceptance rules.

Week 3–4: Scoring

  • Rule engine; evidence-aware judge prompts + 2-judge ensemble;

  • First leaderboard + slice views; CI gate for TSR/SVR.

Week 5–6: System Metrics

  • Tool contract validation (pydantic), schema error counters, retries with jitter (see the sketch below);

  • Retrieval diagnostics (R@K) and chunking audits; memory integrity checks.
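
A sketch of the tool-contract validation and retry pattern from the Week 5–6 items; the tool, its argument model, and the counter names are illustrative, and the pydantic v2 call model_dump() is an assumption about your pydantic version.

import random
import time
from pydantic import BaseModel, ValidationError

class ReturnsCreateArgs(BaseModel):            # contract for a hypothetical returns.create tool
    order_id: str
    reason: str
    policy_days: int

def call_tool_with_retries(tool, raw_args: dict, counters: dict, max_attempts: int = 3):
    """Validate args against the tool contract, then retry transient failures with jittered backoff."""
    try:
        args = ReturnsCreateArgs(**raw_args)               # schema errors are counted, not retried
    except ValidationError:
        counters["schema_err"] = counters.get("schema_err", 0) + 1
        raise
    for attempt in range(max_attempts):
        try:
            return tool(**args.model_dump())
        except TimeoutError:
            counters["retries"] = counters.get("retries", 0) + 1
            time.sleep((2 ** attempt) * 0.5 + random.uniform(0, 0.25))   # exponential backoff + jitter
    raise RuntimeError("tool call failed after retries")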

Week 7–8: Online Wiring

  • Shadow + canary playbooks; auto-rollback; weekly eval report to owners;

  • Drift monitor (model, KB, traffic mix).

Week 9–10: Hardening

  • Human calibration panel; adversarial pack (prompt injection, jailbreaks, flaky tools);

  • Budget/SLO alerts; documentation and runbooks.


8) Technology Choices (opinionated, swappable)

  • Ingest/Log: Kafka / Kinesis; structured JSON.

  • Storage: Object store (S3/GCS) for artifacts; columnar warehouse (BigQuery/Snowflake/DuckDB) for analytics.

  • Processing: dbt/SQL + Python (Pandas/Polars) for batch; Flink/Spark for streams.

  • Tracing/Metrics: OpenTelemetry → Tempo/Jaeger; Prometheus + Grafana.

  • Eval Harness: Python service with Hydra configs; PyTest for CI integration.

  • Labeling/Review: Light human tooling (Label Studio/GDocs forms) with strict PII policy.

  • Access Control: IAM per environment; audit trails for artifact changes.

(Swap vendors freely; keep the interfaces stable.)


9) Example: Runner Skeleton (pseudo-Python)

def run_suite(agent, cases, kb_snapshot, model_pin):
    """Run every test case against pinned artifacts and return a leaderboard."""
    results = []
    with pinned(kb=kb_snapshot, model=model_pin):  # freeze KB snapshot + model for reproducibility
        for c in cases:
            out, sys_signals = agent.run(c["goal"], persona=c["persona"], constraints=c["constraints"])
            rules = check_rules(out, c["acceptance"])                       # hard acceptance rules
            judge = judge_with_evidence(out, load_snippets(c["evidence"]))  # evidence-aware LLM judges
            results.append(aggregate(c, out, sys_signals, rules, judge))
    return leaderboard(results)

Wire this to CI so each PR that changes prompts/models/tools/KB re-computes the leaderboard and applies gates.
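
As a hedged example, the CI gate can be a PyTest check that fails the PR build; the fixtures and the assumption that the leaderboard object exposes run- and slice-level metrics are ours, and check_gates is the sketch from Section 3.

# test_release_gates.py: fails CI if any gate is breached on the pinned offline suite.
def test_release_gates(agent, cases, kb_snapshot, model_pin):
    # agent/cases/kb_snapshot/model_pin would come from fixtures pinned to this PR's artifacts.
    board = run_suite(agent, cases, kb_snapshot, model_pin)
    violations = check_gates(board.run_metrics, board.slice_metrics)   # assumed leaderboard attributes
    assert not violations, "Release gates failed:\n" + "\n".join(violations)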


10) Mapping to Your Earlier Metrics

  • DME → part of Layer-2 (intent/slot accuracy, turn budgets, constraint adherence).

  • CMC → Context Retention with slot-wise precision/recall and drift audits.

  • CPA → Plan Execution Efficiency (+ Dead-end Rate) with step optimality.

  • CS → System Integration Health (edge error taxonomy, circuit-breakers, recovery).

You keep the spirit, but anchor it in operational signals that unblock decisions.


11) Anti-Patterns to Avoid

  • Optimizing judge prompts instead of fixing agent behavior.

  • Comparing runs with different KB snapshots or model pins (eval drift).

  • One vanity number (global TSR) that hides slice failures.

  • Unbounded loops in orchestrator or eval harness.

  • Logging PII "just for debugging."


12) What "Good" Looks Like (12-week snapshot)

  • TSR 0.86, GF 0.90, SVR 0, P95 5.2s, Cost ₹X/task

  • No top-5 intent slice below 0.78 TSR

  • TUA schema error <0.8%, Dead-end <2%

  • Canary → GA in 72h with zero rollbacks, all artifacts versioned & replayable.


Closing

An evaluation framework is not a report; it is a control surface for your agentic system. When outcomes, behaviors, and reliability are observable, versioned, and enforceable via gates, you ship faster, safer, and cheaper. The rest is iteration.