TL;DR: To cut latency and cost without hurting quality, push work out of the LLM loop (precompute, cache, specialize), make each LLM call multi-purpose and bounded, and adopt an adaptive compute policy (SLM-first cascades, early-exit, confidence gating). Measure LLM Call Rate (LCR), Cost/Task, P95 latency, Cache Hit Rate, and Task Success—optimize them in that order.
1) Start with a cost/latency model (so you can optimize it)
Let each LLM call $i$ cost $\text{tokens}_i \cdot c$ and take $t_i$ ms.
Sequential makespan $T_\text{seq} = \sum_i t_i$; parallel makespan $T_\text{par} = \max_i t_i$.
Objective: minimize LCR = #calls/task, Cost/Task = $\sum_i \text{tokens}_i \cdot c$, and P95($T$), under a fixed Task Success Rate (TSR).
If you don’t track LCR and per-edge latency, you’re optimizing blind.
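A minimal sketch of this model in Python, assuming you log per-call token counts and latencies per task; `PRICE_PER_TOKEN` and the log format are placeholders:

```python
import statistics

PRICE_PER_TOKEN = 0.000002  # placeholder; plug in your provider's rate

def task_metrics(calls):
    """calls: list of dicts like {"tokens": 850, "latency_ms": 420} for one task."""
    lcr = len(calls)                                          # LLM Call Rate for this task
    cost = sum(c["tokens"] for c in calls) * PRICE_PER_TOKEN  # Cost/Task
    t_seq = sum(c["latency_ms"] for c in calls)               # sequential makespan
    t_par = max((c["latency_ms"] for c in calls), default=0)  # ideal parallel makespan
    return {"lcr": lcr, "cost": cost, "t_seq_ms": t_seq, "t_par_ms": t_par}

def p95(latencies_ms):
    """P95 over per-task end-to-end latencies."""
    return statistics.quantiles(latencies_ms, n=20)[18]  # 95th percentile cut point
```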
2) Architecture patterns that reduce calls
```
User ─▶ Router (SLM) ─▶ {Static Templates | Deterministic Tools | RAG Answerer | LLM Fallback}
          │
          └─▶ Confidence Gate + Early-Exit Policy
```
SLM-first router, LLM-second fallback
Use a small specialized model (entity/intent/FAQ) to decide: Can I answer or fetch deterministically?
Only escalate to the LLM when confidence < τ and value justifies cost.
Single-shot, multi-head prompts
In one call, ask for: intent, entities, safety flags, and final answer as JSON. Avoid N calls for N subtasks.
Deterministic before generative
If the answer is static policy, templated computation, or a DB lookup → no LLM.
Reserve LLM for synthesis/abstraction only.
RAG with high-precision retrieval
Get the right passages once; generate once. Avoid “search → read → re-search” loops by using rerankers and query rewriting (done by an SLM).
Early-exit + bounded loops
Hard cap steps/tokens; if acceptance criteria are met, stop. No “just one more reflection.”
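A sketch of a bounded loop under these rules; `step` and `acceptance_met` are hypothetical hooks, and the caps are illustrative budgets:

```python
import time

MAX_STEPS, MAX_TOKENS, MAX_WALL_MS = 4, 2000, 8000  # illustrative budgets

def bounded_loop(task, step, acceptance_met):
    """Run at most MAX_STEPS steps; stop as soon as acceptance criteria pass."""
    start, spent_tokens, best = time.monotonic(), 0, None
    for _ in range(MAX_STEPS):
        best, used = step(task, best)            # one bounded LLM/tool step
        spent_tokens += used
        if acceptance_met(best):                 # early-exit: criteria satisfied
            return best
        if spent_tokens > MAX_TOKENS or (time.monotonic() - start) * 1000 > MAX_WALL_MS:
            break                                # hard budget caps
    return best                                  # best-effort result, never loop forever
```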
3) Planning to do less work: adaptive compute policies
Hierarchical planning:
1) SLM estimates complexity; 2) if “simple”, route to template/RAG; 3) else call the LLM once with multi-head JSON.
Clarify once, maximally:
Choose the clarifying question that maximizes expected information gain. Never ask serial micro-clarifications.
Confidence gating:
If RAG confidence ≥ τ and the answer conforms to acceptance rules, skip the LLM; return a templated, cited snippet.
Bandit routing:
Use UCB/ε-greedy to learn which path (template vs SLM vs LLM) minimizes cost while preserving TSR for each intent slice.
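A minimal ε-greedy routing sketch (a full UCB variant would add an exploration bonus); the arms, reward weighting, and intent keys are illustrative:

```python
import random
from collections import defaultdict

ARMS = ["template", "slm", "llm"]
EPSILON = 0.1  # exploration rate; tune per traffic volume

stats = defaultdict(lambda: {arm: {"n": 0, "reward": 0.0} for arm in ARMS})

def choose_arm(intent):
    """ε-greedy: mostly exploit the best-known path, occasionally explore."""
    arm_stats = stats[intent]
    if random.random() < EPSILON or all(s["n"] == 0 for s in arm_stats.values()):
        return random.choice(ARMS)
    return max(ARMS, key=lambda a: arm_stats[a]["reward"] / max(arm_stats[a]["n"], 1))

def record(intent, arm, success, cost, lam=0.5):
    """Reward trades off task success against cost; lam weights the cost penalty."""
    s = stats[intent][arm]
    s["n"] += 1
    s["reward"] += (1.0 if success else 0.0) - lam * cost
```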
4) Caching that actually moves the needle
Semantic response cache (query → answer) with locality-aware keys (normalize numerals, dates, product codes).
Partial output cache (intent/entities/tool args) to skip re-extraction.
Retrieval cache (query embeddings → doc IDs) + snippet cache (docID, span → text).
Prompt template cache: pre-bake system+tools instruction blocks; don’t resend boilerplate.
Cross-session memory summaries (short, signed, TTL’d) to avoid re-establishing context.
Track Cache Hit Rate separately for responses, partials, and retrieval; prioritize the one with highest time saved per hit.
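A sketch of a locality-aware response-cache key; the regex normalizations are illustrative, and a real semantic cache would add embedding similarity on top:

```python
import hashlib
import re

def cache_key(query: str) -> str:
    """Normalize volatile tokens so near-duplicate queries share one cache entry."""
    q = query.lower().strip()
    q = re.sub(r"\b\d{4}-\d{2}-\d{2}\b", "<date>", q)  # ISO dates
    q = re.sub(r"\b[a-z]{2,3}\d{3,6}\b", "<code>", q)  # product / booking codes
    q = re.sub(r"\b\d+(\.\d+)?\b", "<num>", q)         # bare numerals
    q = re.sub(r"\s+", " ", q)
    return hashlib.sha256(q.encode()).hexdigest()

# Usage: check cache.get(cache_key(user_query)) before any retrieval or LLM call.
```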
5) Replace general LLM calls with specialized components
SLMs / classical NLP: regex+parsers for IDs, dates, currencies; CRF/biLSTM for slots; tiny T5 for summarization.
Program synthesis (rare, reusable): let the LLM write a tiny function once, then call the code deterministically many times (guarded and versioned).
Validation as code: strict pydantic schemas + runtime validators to avoid “fix-up” calls.
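A sketch of validation-as-code, assuming pydantic v2; the intent set and fallback policy are illustrative:

```python
from pydantic import BaseModel, ValidationError, field_validator

class Answer(BaseModel):
    text: str
    citations: list[str]

class LLMResult(BaseModel):
    intent: str
    needs_clarification: bool
    answer: Answer

    @field_validator("intent")
    @classmethod
    def known_intent(cls, v: str) -> str:
        if v not in {"refund", "booking", "faq", "other"}:  # illustrative intent set
            raise ValueError(f"unknown intent: {v}")
        return v

def parse_or_fallback(raw_json: str):
    """Validate once in code; never burn another LLM call on a 'fix-up' pass."""
    try:
        return LLMResult.model_validate_json(raw_json)
    except ValidationError:
        return None  # route to a deterministic fallback instead of re-asking the model
```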
6) Prompt & context minimization
Multi-head JSON schema for one-shot extraction + answer:
```json
{
  "intent": "...",
  "entities": {"pnr": "", "date": ""},
  "needs_clarification": false,
  "answer": {"text": "...", "citations": ["doc://..."]},
  "safety": {"pii": false}
}
```
Context windowing: keep only task-critical facts; summarize long history into fixed slots.
Dynamic context: assemble per-turn context from a slot store, not raw transcripts.
Tight system prompts: no essays—just rules, budget, and acceptance criteria.
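A sketch of per-turn context assembly from a slot store; the slot names and budget are assumptions:

```python
SLOT_BUDGET = ["goal", "entities", "last_tool_result", "history_summary"]  # fixed slots

def assemble_context(slot_store: dict, query: str) -> str:
    """Build the turn's context from fixed slots instead of replaying raw transcripts."""
    parts = []
    for slot in SLOT_BUDGET:                 # fixed, ordered slots only
        value = slot_store.get(slot)
        if value:
            parts.append(f"{slot}: {value}")
    parts.append(f"user: {query}")
    return "\n".join(parts)                  # small, bounded prompt context
```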
7) Orchestration: do fewer calls, or do them in parallel
Policy: Parallelize independent work; consolidate dependent work.
```python
# Pseudocode for an LCR-aware orchestrator
def handle(query):
    plan = slm_router(query)                      # cheap SLM pass
    if plan.template:
        return render_template(plan)              # no LLM
    if plan.rag_only and plan.confidence >= TAU:
        return cite_snippets(plan.snippets)       # no LLM
    # Consolidate: one multi-head call for extraction + answer
    retrieved = retrieve_once(plan)               # cached + reranked
    result = llm_chat(
        messages=[system, user(query, retrieved)],
        tools=[...],
        max_tokens=300,
        temperature=0.2,
        response_format="json",
    )
    if violates_schema(result):
        return fallback_safe(result)
    return finalize(result)
```
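Where sub-calls are independent, a sketch of running them concurrently so makespan approaches $\max_i t_i$; the async client functions are placeholders:

```python
import asyncio

async def gather_evidence(plan):
    """Independent lookups run concurrently: makespan ≈ max(t_i), not sum(t_i)."""
    docs_task = asyncio.create_task(retrieve_async(plan.query))            # vector search
    account_task = asyncio.create_task(fetch_account_async(plan.user_id))  # DB lookup
    policy_task = asyncio.create_task(fetch_policy_async(plan.intent))     # static policy
    docs, account, policy = await asyncio.gather(docs_task, account_task, policy_task)
    return docs, account, policy  # feed all evidence into one consolidated LLM call
```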
8) Observability: metrics that force good trade-offs
LCR (LLM Call Rate) = calls/task
Tokens In/Out per task
P50/P95 latency (end-to-end, and per edge)
Cache Hit Rates (response/partial/retrieval)
SLM Offload Rate (% tasks answered without LLM)
Clarification Ratio (turns for clarifications / total turns)
TSR & GF (task success, grounded factuality)
Safety Violations (must be 0)
Cost/Task (₹)
Make a leaderboard by intent/persona/channel; block deploys if TSR drops or LCR rises beyond budget.
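A sketch of such a deploy gate; the thresholds and regression margins are illustrative:

```python
BUDGET = {"lcr_max": 1.4, "tsr_min": 0.92, "p95_ms_max": 2500}  # illustrative budgets

def deploy_gate(candidate: dict, baseline: dict) -> bool:
    """Block the deploy if TSR regresses or LCR/P95 exceed budget vs. baseline."""
    if candidate["tsr"] < max(BUDGET["tsr_min"], baseline["tsr"] - 0.01):
        return False
    if candidate["lcr"] > min(BUDGET["lcr_max"], baseline["lcr"] * 1.05):
        return False
    if candidate["p95_ms"] > BUDGET["p95_ms_max"]:
        return False
    return True  # safe to promote the canary
```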
9) Concrete tactics (field-tested)
Pre-compute heavy stuff: embeddings, doc summaries, policy extracts, entity dictionaries.
Chunk for retrieval, not for reading: fewer, denser chunks → fewer passages → one generation.
Stop-conditions: max steps, max cost, max wall-clock; return best-effort with evidence.
Idempotent tools + retries with jitter: fewer “fix-me” LLM calls after tool failures.
Guarded generation: JSON mode + strict schema to avoid post-hoc “re-ask.”
Region pinning & connection reuse: keep-alive, gzip, colocate with vector DB.
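A sketch of retries with exponential backoff and full jitter for idempotent tool calls; attempt count and base delay are illustrative:

```python
import random
import time

def retry_with_jitter(fn, attempts=3, base_delay=0.2):
    """Retry an idempotent tool call with exponential backoff + full jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # surface the failure; don't ask the LLM to 'fix' it
            time.sleep(random.uniform(0, base_delay * (2 ** attempt)))  # jittered backoff
```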
10) Anti-patterns (they look helpful, they add calls)
Multi-turn “self-reflect” loops without stop rules.
Serial micro-tasks that could be one consolidated call.
Re-sending full chat history each turn.
Re-searching because retrieval confidence is unknown—measure it!
Using a general LLM for trivial parsing/formatting.
11) 30-60-90 day plan to cut calls by ~35–60%
30 days
Instrument LCR, Cache Hit Rates, SLM Offload Rate; add step/cost/time caps.
Introduce response + partial caches; consolidate prompts to multi-head JSON.
Add SLM router for top 5 intents; build one clarify-once template.
60 days
RAG revamp (reranker + query rewrite) with confidence scores; enable rag-only early-exit.
Program-synthesize 3 frequent formatters/computations and freeze them as tools.
Parallelize independent retrieval/tool calls.
90 days
Bandit routing (learned thresholds per intent); memory summaries with TTL.
Replace 2 LLM subtasks with SLMs/fine-tunes; tighten JSON schemas.
Canary cost/latency gates; auto-rollback on LCR/TSR breach.
Closing
Reducing LLM calls isn’t about “asking the model less”; it’s about designing the system so the model is used only where it adds unique value—and when called, it achieves more in one bounded pass. Combine adaptive routing, deterministic tools, precise retrieval, and ruthless caching. Measure relentlessly. The payoff is dramatic: faster UX, lower bills, and a platform that scales.