TL;DR: Don’t send every query to a giant general model. Put a semantic router in front: it maps an input to the right path (deterministic tool, RAG answerer, specialist model, or full agent) using embeddings, calibrated thresholds, and fallback rules. You’ll cut latency/cost, reduce hallucinations, and gain control. This post gives you the mental model, production patterns, metrics, and a copy-ready implementation sketch.
1) What semantic routing actually is (and isn’t)
Definition: A lightweight decision layer that transforms an input into an embedding and routes it to the best handler among a set of capabilities (tools, services, models, or agents), with an explicit abstain/fallback path when confidence is low.
Why it works: Most traffic isn’t open-ended reasoning. It clusters into recurring intents (“track order”, “cancel booking”, “price match”), each best served by a deterministic system or a specialist—not a monolithic LLM.
What it’s not: It’s not a rules spaghetti or a black-box chat router. It’s data-driven, measurable, and calibratable.
2) The routing fabric in an agentic stack
```
User → Normalizer → Semantic Router → { Deterministic Tool
                                      | RAG Answerer
                                      | Specialist SLM
                                      | Full Agent }
                          │
                          └─ Fallback (low confidence)
```
Normalizer: spell/locale handling, PII redaction, unit normalization.
Router: encoder → vector similarity → calibrated decision + margin test → path.
Handlers:
Deterministic tools (APIs/CRUD/calculators)
RAG answerer (citations, one-shot)
Specialist SLM (classification, extraction, reranking)
Full agent (bounded loop; planning + tools)
Fallback: clarification question or human/legacy handoff.
Key design rule: Deterministic before generative; SLM before LLM; route early, route once.
3) Core patterns that lift quality and slash cost
**Coarse→fine routing.** Coarse "domain" head (support/sales/tech) → fine intent head (refund/returns/rebooking).

**Margin-based abstain.** Decide only when `s1 ≥ τ_route` and `(s1 - s2) ≥ δ` (top-1 and margin thresholds). Otherwise clarify or hit a safe default.

**Union-of-experts (multi-label).** Some inputs trigger multiple paths (e.g., cancel + refund). Route to a fixed ordered set of handlers with budgets.

**Context-aware routing.** Condition by channel (voice vs chat), persona (VIP vs guest), and state (authenticated? existing order?).

**Confidence-gated RAG.** If retrieval confidence ≥ τ, answer without the LLM (templated read-back + citations). Else call the LLM once.
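The coarse→fine pattern above can be sketched in a few lines. This is a minimal illustration with made-up route names and default thresholds (`tau`, `delta` are assumptions to calibrate on your own data), not a production implementation:

```python
import numpy as np

def top2(scores: dict) -> tuple:
    """Return (best_label, s1, s2) from a {label: score} dict."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    label, s1 = ranked[0]
    s2 = ranked[1][1] if len(ranked) > 1 else 0.0
    return label, s1, s2

def route_coarse_fine(x, domain_centroids, intent_centroids, tau=0.6, delta=0.05):
    """Stage 1 picks a coarse domain; stage 2 picks a fine intent
    within that domain only. Abstain if either stage fails τ/δ."""
    cos = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    dom_scores = {d: cos(x, c) for d, c in domain_centroids.items()}
    dom, s1, s2 = top2(dom_scores)
    if s1 < tau or (s1 - s2) < delta:
        return "abstain"
    int_scores = {i: cos(x, c) for i, c in intent_centroids[dom].items()}
    intent, s1, s2 = top2(int_scores)
    if s1 < tau or (s1 - s2) < delta:
        return f"{dom}/clarify"
    return f"{dom}/{intent}"
```

The fine head only ever competes intents within the winning domain, which keeps near-neighbor intents from different domains from flipping the decision.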
4) Encoders & retrieval: picking the right tool
Encoders: pick for your constraints, not hype.
Lightweight/open: MiniLM/E5-small/bge-small for ms-level latency.
Stronger/general: E5-base/bge-base/cohere/openai embeddings for tougher semantics.
Index: HNSW/IVF-PQ for speed; store route prototypes (centroids) + hard negatives.
Hybrid signals: lexical cues (regex/SKU/PNR) short-circuit to tools before embeddings.
Tip: keep embedding dimensionality and version pinned; version your route sets like you would a model.
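Whatever encoder you pin, the route prototype itself is just a normalized mean of the route's utterance embeddings. A minimal sketch (assumes the embeddings are already computed by your chosen encoder):

```python
import numpy as np

def build_centroid(embs: np.ndarray) -> np.ndarray:
    """L2-normalize each utterance embedding, average, then re-normalize,
    so cosine against the centroid behaves consistently across routes."""
    unit = embs / (np.linalg.norm(embs, axis=1, keepdims=True) + 1e-8)
    c = unit.mean(axis=0)
    return c / (np.linalg.norm(c) + 1e-8)
```

Normalizing before averaging prevents a few long vectors from dominating the prototype; re-normalizing after keeps all centroids on the unit sphere so per-route thresholds stay comparable.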
5) Decision math you can operate
Similarity: `s = cos(x, c_k)` against route centroids `c_k`.
Route if `s1 ≥ τ_route` and `(s1 - s2) ≥ δ_margin`. Abstain otherwise → ask a single high-information clarification (not a chatty loop).
Latency model:
Total `T ≈ T_embed + T_search + T_handler` vs baseline `T_LLM`.
You win when `T_embed + T_search ≪ T_LLM` for the majority slice.
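A back-of-envelope check makes the win condition concrete. All numbers below are illustrative assumptions, not benchmarks; plug in your own measurements:

```python
# Illustrative latencies in seconds (assumptions, not benchmarks):
t_embed, t_search = 0.005, 0.002        # 5 ms encode + 2 ms ANN lookup
t_tool, t_llm = 0.050, 1.500            # 50 ms API call vs 1.5 s LLM generation

p_routed = 0.70                          # share of traffic the router resolves
t_routed = t_embed + t_search + t_tool
t_fallback = t_embed + t_search + t_llm  # router overhead is still paid on misses

expected = p_routed * t_routed + (1 - p_routed) * t_fallback
print(f"expected latency: {expected*1000:.0f} ms vs always-LLM {t_llm*1000:.0f} ms")
# → expected latency: 492 ms vs always-LLM 1500 ms
```

Note that the router tax (`t_embed + t_search`) is paid on every request, including fallbacks, which is why it must stay in the single-digit-millisecond range.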
6) Evaluation: prove it before rollout
Offline (pre-prod)
Routing accuracy (top-1) & abstain precision (abstain when you should).
Confusion matrix by route; find near neighbors that cause flips.
Budget impact: P95 latency, cost/task vs baseline.
Online (prod)
CLAB: Clarification-Before-Answer Rate (should drop).
TSR: Task Success Rate (must not drop).
Safety: zero policy violations from misroutes.
Economics: cost/contact, tokens/task.
Drift monitors: embedding distribution shift and intent mix changes.
Gates: block the deploy if TSR drops, any safety violation occurs, or P95 latency rises beyond the SLO.
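The two offline headline metrics are easy to compute from a labeled dev set. A minimal sketch, assuming predictions use the literal label `"abstain"` and the gold labels mark examples where the router *should* abstain the same way:

```python
def routing_metrics(y_true, y_pred):
    """Routing accuracy over decided examples, plus abstain precision:
    of the times we abstained, how often was abstaining correct?"""
    decided = [(t, p) for t, p in zip(y_true, y_pred) if p != "abstain"]
    abstained = [t for t, p in zip(y_true, y_pred) if p == "abstain"]
    acc = sum(t == p for t, p in decided) / max(len(decided), 1)
    abstain_prec = sum(t == "abstain" for t in abstained) / max(len(abstained), 1)
    return {"routing_accuracy": acc, "abstain_precision": abstain_prec}
```

Reporting accuracy only over decided examples matters: a router that abstains on everything would otherwise look perfect.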
7) Implementation (copy-ready sketches)
7.1 Python: minimal router with calibrated thresholds
```python
from dataclasses import dataclass
from typing import Callable, Optional

import numpy as np
from numpy.linalg import norm


@dataclass
class Route:
    name: str
    centroid: np.ndarray   # mean embedding of the route's utterances
    handler: Callable      # function to call if routed
    threshold: float       # τ_route, calibrated per route


def cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (norm(a) * norm(b) + 1e-8))


class SemanticRouter:
    def __init__(self, routes, margin=0.05, default_handler=None):
        self.routes = routes
        self.margin = margin              # δ_margin: required top-1 vs top-2 gap
        self.default_handler = default_handler

    def route(self, x_emb) -> "tuple[Optional[Route], float]":
        scores = [(r, cos(x_emb, r.centroid)) for r in self.routes]
        (r1, s1), (_, s2) = sorted(scores, key=lambda z: z[1], reverse=True)[:2]
        if s1 >= r1.threshold and (s1 - s2) >= self.margin:
            return r1, s1
        return None, s1  # abstain

    def handle(self, x_emb, payload):
        r, s1 = self.route(x_emb)
        if r is not None:
            return r.handler(payload, confidence=s1)  # report the actual score, not τ
        return self.default_handler(payload)


# Calibrate per-route thresholds offline on a labeled dev set for target precision.
```
7.2 TypeScript: margin-based route decision with fallback
```typescript
type Path = "order_status" | "booking_change" | "policy_rag" | "agent_full";

interface Decision {
  path: Path | "abstain";
  s1: number;
  s2: number;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-8);
}

export function decideRoute(
  embed: number[],
  centroids: Record<Path, number[]>,
  tau: Record<Path, number>,
  delta = 0.05
): Decision {
  const scores = Object.entries(centroids).map(
    ([k, c]) => [k, cosine(embed, c as number[])] as [Path, number]
  );
  scores.sort((a, b) => b[1] - a[1]);
  const [p1, s1] = scores[0];
  const [, s2] = scores[1];
  const pass = s1 >= tau[p1] && s1 - s2 >= delta;
  return { path: pass ? p1 : "abstain", s1, s2 };
}
```
7.3 Clarify once, maximally
If abstaining, ask one question that collapses the ambiguity (“Do you want to track an existing booking or change it?”). Don’t start a meandering chat.
8) Training the router (simple and robust)
Seed utterances per route from real logs (redacted).
Generate hard negatives (near neighbors from other routes).
Compute centroids; store route prototypes and per-route τ.
Calibrate τ/δ via grid search to hit target precision (over recall).
Periodically retrain on fresh samples; alarm on drift.
Advanced: train a tiny classifier on embeddings (logistic regression) for better calibration; still keep abstain logic.
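The τ/δ calibration in step 4 is a plain grid search over dev-set scores. A minimal sketch, where `scores` holds the top-1 and top-2 similarities per example and `labels` marks whether the top-1 route was correct (the target-precision value is an assumption to set per your risk tolerance):

```python
from itertools import product

import numpy as np

def calibrate(scores, labels, taus, deltas, target_precision=0.95):
    """scores: (N, 2) array of [s1, s2] per dev example; labels: bool array,
    True if top-1 was the correct route. Returns the (τ, δ) pair with the
    highest coverage among settings whose decided-precision meets the target."""
    best = None
    for tau, delta in product(taus, deltas):
        decide = (scores[:, 0] >= tau) & ((scores[:, 0] - scores[:, 1]) >= delta)
        n = int(decide.sum())
        if n == 0:
            continue
        precision = float(labels[decide].mean())
        coverage = n / len(labels)
        if precision >= target_precision and (best is None or coverage > best[2]):
            best = (tau, delta, coverage, precision)
    return best  # None if no setting meets the precision target
```

Optimizing coverage subject to a precision floor encodes the "precision over recall" choice: abstains are cheap (one clarification), misroutes are not.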
9) Integrating with tools, RAG, and agents
Tool short-circuiting: regex/ID detectors (PNR, order ID) → direct tool.
RAG confidence: retrieval score ≥ τ? → templated answer + citations; else LLM.
Agent entry: only when intent is open-ended or multi-step; enforce budgets (steps/time/tokens) at the orchestrator.
Safety: deny state-changing tools unless preconditions pass (auth, policy). Router never overrides policy.
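Tool short-circuiting is the cheapest win in this list: a hard identifier in the text resolves the route before any embedding call. A minimal sketch with hypothetical ID patterns (adapt the regexes and tool names to your own identifiers):

```python
import re

# Hypothetical ID patterns — adapt to your own identifier formats.
SHORT_CIRCUITS = [
    (re.compile(r"\b[A-Z]{6}\b"), "pnr_lookup"),               # airline PNR-style code
    (re.compile(r"\border\s*#?\d{6,}\b", re.I), "order_status"),  # order number
]

def short_circuit(text: str):
    """Return a tool route if a hard identifier is present, else None
    (fall through to the embedding router)."""
    for pattern, tool in SHORT_CIRCUITS:
        if pattern.search(text):
            return tool
    return None
```

Run this before the encoder: when it fires, you skip embedding and search entirely, and the match itself carries the argument the tool needs.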
10) Ops: observability, drift, and updates
Trace every decision: input hash, embedding version, top-k scores, chosen path, outcome.
Dashboards: route share, abstain rate, flip rate post-update.
Drift: monitor cosine to centroids over time; alert on distribution shift.
Rollouts: shadow → canary → GA with auto-rollback on gates breach.
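The drift monitor above can start as a simple window comparison on cosine-to-nearest-centroid. A sketch under that assumption; the threshold is illustrative, and production systems often use a KS test or PSI instead of a mean delta:

```python
import numpy as np

def drift_alert(baseline_sims, recent_sims, drop_threshold=0.05):
    """Compare mean cosine-to-nearest-centroid between a baseline window
    and a recent window; alert when similarity degrades past the threshold.
    Degrading similarity means traffic is moving away from known routes."""
    drop = float(np.mean(baseline_sims) - np.mean(recent_sims))
    return drop > drop_threshold, drop
```

Pair the alert with the intent-mix dashboard: a similarity drop with a stable mix suggests encoder or phrasing drift; a drop with a shifting mix suggests genuinely new intents that need new routes.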
11) Privacy & ethics
Redact PII before embedding; avoid logging raw prompts.
Avoid sensitive-attribute routing unless explicitly consented and necessary.
Provide a “Why this path?” explainer in UI—list the detected intent and alternatives.
12) Anti-patterns to avoid
Sending every message to the LLM and calling it “routing.”
No abstain path (forces wrong decisions).
Over-fitting with synthetic utterances only (fails in the wild).
Re-asking clarifications serially (frustrates users; raise one decisive question).
Mixing personalization into the retrieval stage such that filters break—keep it for rerank.
13) 30/60/90 plan to production
0–30 days
Define 6–12 high-volume routes; collect 100–200 utterances each.
Build centroid router with τ/δ calibration for precision.
Add deterministic short-circuits (IDs, exact intents).
Instrument decisions; baseline latency/cost/TSR.
31–60 days
Add coarse→fine heads; integrate RAG confidence early-exit.
Introduce single clarification step on abstain.
A/B compare against “always LLM”; gate on TSR and P95.
61–90 days
Add bandit routing for path selection where multiple handlers compete.
Drift monitors + weekly refresh; auto-rollback on gate breach.
Expand to voice/WhatsApp; locale-specific encoders if needed.
14) Minimal library stack (pick and swap)
Embeddings: open-source (E5/MiniLM/bge) or hosted embeddings.
ANN: FAISS / Qdrant / Milvus / Pinecone for centroids & neighbor search.
Router lib (optional): semantic-router-style abstractions if you prefer DSLs.
Data: pandas/Polars for curation; sklearn/logreg for calibrated heads.
Serving: FastAPI/Express with edge-friendly SSE for streamed clarifications.
Closing
Semantic routing makes LLM agents faster, cheaper, and safer by steering work to the right handler—and confidently saying “not sure” when needed. Start with centroids + thresholds, wire in a single clarification, keep evaluation honest, and iterate. Once routing is reliable, your agent stops being a universal hammer and becomes a competent dispatcher—exactly what production systems need.