TL;DR: Don’t start with a chat demo. Start with one valuable, verifiable job; wrap a deterministic orchestrator around an LLM; add typed tools, grounding, budgets/safety, observability, and a tiny evaluation harness. Ship something you can trust, measure, and improve.
0) Define the job → success criteria → guardrails
Pick one narrow, high-value workflow.
Example job: “Given a user question, search internal docs and return a cited answer under 200 words.”
Success: exact answer present in source, ≤ 200 words, cites ≥ 1 doc link.
Non-goals: chit-chat, multi-turn memory, reasoning about PDFs.
Guardrails: max 6 steps, ≤ 8s P95 latency, ≤ ₹X per 100 tasks, no PII in logs.
This forces you to design for outcomes, not vibes.
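One lightweight way to keep those criteria honest is to encode them as versioned config next to the code. A minimal sketch, assuming a dataclass-based spec (the JobSpec name and its fields are illustrative, not part of the scaffold below):

# job_spec.py (illustrative; keep success criteria and budgets under version control)
from dataclasses import dataclass

@dataclass(frozen=True)
class JobSpec:
    name: str = "cited_doc_answer"
    max_words: int = 200          # answer length cap
    min_citations: int = 1        # must cite at least one doc link
    max_steps: int = 6            # orchestrator step budget
    p95_latency_s: float = 8.0    # latency guardrail
    cost_cap_per_100: float = 0.0 # fill in your own currency/amount
    pii_in_logs: bool = False     # never log raw PII

Anything the orchestrator or eval harness checks should read from this spec rather than hard-coded literals, so changing a guardrail is a one-line diff.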
1) Minimal architecture (one page)
Client/UI → sends goal + context.
Orchestrator (deterministic) → state machine/graph that caps steps/time.
LLM Adapter → provider-agnostic wrapper.
Tools/Skills (typed) → search, retrieval, calculators, formatters.
Grounding Layer → retriever + re-ranker + cache.
Budgets & Safety → token/time/₹ caps, PII redaction, allow/deny tool lists.
Observability → traces, metrics, structured logs.
Eval Harness → 30–50 cases + leaderboard.
2) Project scaffold
llm_app/
  app.py             # FastAPI/CLI entry
  orchestrator.py    # loop/graph + guards
  llm.py             # provider-agnostic client
  tools.py           # typed tool contracts
  grounding.py       # retriever/reranker/cache
  safety.py          # PII redaction, policy checks
  observability.py   # traces/metrics/logs
  eval/
    cases.jsonl      # test set
    judge.py         # LLM-as-judge (optional)
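The scaffold's app.py is never shown later in the post; a minimal FastAPI sketch of what it could look like, assuming the run_task and LLMClient signatures defined in the sections below (the /ask route and AskRequest model are illustrative):

# app.py (illustrative entry point; wire-up only, no business logic)
from fastapi import FastAPI
from pydantic import BaseModel
from llm import LLMClient
from orchestrator import run_task

app = FastAPI()
llm = LLMClient()

class AskRequest(BaseModel):
    question: str

@app.post("/ask")
async def ask(req: AskRequest):
    # All budgets/guardrails live in the orchestrator, not the route handler.
    answer = await run_task(req.question, llm, budgets={"max_tokens": 4000})
    return {"answer": answer}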
3) Provider-agnostic LLM adapter
Keep the rest of your code oblivious to vendors.
# llm.py
import os, httpx
from typing import Dict, Any

class LLMClient:
    def __init__(self,
                 base_url: str = os.getenv("LLM_BASE_URL"),
                 api_key: str = os.getenv("LLM_API_KEY")):
        self.base_url, self.api_key = base_url, api_key

    async def chat(self, messages: list[Dict[str, str]],
                   tools: list[Dict[str, Any]] | None = None, **params):
        headers = {"Authorization": f"Bearer {self.api_key}",
                   "Content-Type": "application/json"}
        payload = {"messages": messages, "tools": tools or [], **params}
        async with httpx.AsyncClient(timeout=15) as client:
            r = await client.post(f"{self.base_url}/chat", headers=headers, json=payload)
            r.raise_for_status()
            return r.json()  # Expect { "output": "...", "tool_call": {...}? }
4) Typed tools with validation
Your tools are the system’s muscles. Type them.
# tools.py
from pydantic import BaseModel, Field

class SearchInput(BaseModel):
    query: str = Field(min_length=3, max_length=256)
    top_k: int = Field(default=5, ge=1, le=10)

async def search_docs(args: SearchInput) -> dict:
    # plug your own vector/keyword search;
    # return {"hits": [{"id": ..., "text": ..., "url": ...}, ...]}
    ...

TOOL_SPEC = [{
    "name": "search_docs",
    "description": "Search internal knowledge base",
    "input_schema": SearchInput.model_json_schema(),
}]
5) Grounding micro-stack (retriever → reranker → cache)
Start simple: hybrid BM25 + vectors, then rerank.
# grounding.py
from functools import lru_cache

@lru_cache(maxsize=2048)
def cache_key(q: str, k: int) -> tuple:
    return (q.strip().lower(), k)

async def retrieve(query: str, top_k: int = 5) -> list[dict]:
    # 1) lexical + vector search
    hits = await search_index(query=query, top_k=top_k * 3)
    # 2) rerank (cross-encoder or fast heuristic)
    reranked = await rerank(query, hits)
    return reranked[:top_k]
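search_index and rerank above are left for you to supply. One simple way to implement the hybrid step is reciprocal rank fusion over lexical and vector results; a sketch, assuming you already have bm25_search and vector_search helpers that return ranked hit dicts (both names are placeholders):

# Hypothetical hybrid search via reciprocal rank fusion (RRF).
async def search_index(query: str, top_k: int) -> list[dict]:
    lexical = await bm25_search(query, top_k)     # assumed helper
    semantic = await vector_search(query, top_k)  # assumed helper
    scores: dict[str, float] = {}
    docs: dict[str, dict] = {}
    for ranking in (lexical, semantic):
        for rank, hit in enumerate(ranking):
            docs[hit["id"]] = hit
            # 60 is a commonly used RRF constant; tune on your eval set.
            scores[hit["id"]] = scores.get(hit["id"], 0.0) + 1.0 / (60 + rank)
    fused = sorted(docs.values(), key=lambda h: scores[h["id"]], reverse=True)
    return fused[:top_k]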
Tip: Log retrieval diagnostics (recall@k on eval set) from day one.
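A minimal recall@k check you can run against the eval set, assuming each case is extended with a gold_doc_ids field (the cases.jsonl format shown later does not include it by default):

# Hypothetical retrieval diagnostic: fraction of cases whose gold doc appears in the top-k hits.
import json, asyncio
from grounding import retrieve

async def recall_at_k(path: str = "eval/cases.jsonl", k: int = 5) -> float:
    hit_count, total = 0, 0
    for line in open(path):
        case = json.loads(line)
        gold_ids = set(case.get("gold_doc_ids", []))
        if not gold_ids:
            continue  # skip cases without retrieval labels
        total += 1
        hits = await retrieve(case["question"], top_k=k)
        if gold_ids & {h["id"] for h in hits}:
            hit_count += 1
    return hit_count / max(total, 1)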
6) Deterministic orchestrator (bounded loop)
Wrap the LLM in a predictable skeleton. No unbounded “think loops”.
# orchestrator.py
from typing import Any
from tools import TOOL_SPEC, SearchInput, search_docs
from safety import redact_pii, check_policy
from grounding import retrieve
from observability import trace_span, counter

MAX_STEPS = 6
TOOL_REGISTRY = {"search_docs": (SearchInput, search_docs)}

@trace_span("run_task")
async def run_task(goal: str, llm, budgets: dict[str, Any]):
    history = [{"role": "system",
                "content": "You answer only with grounded, cited text ≤ 200 words."}]
    history += [{"role": "user", "content": redact_pii(goal)}]
    cost = 0
    steps = 0
    while steps < MAX_STEPS and cost < budgets["max_tokens"]:
        resp = await llm.chat(messages=history, tools=TOOL_SPEC,
                              temperature=0.2, max_tokens=400)
        cost += resp.get("usage", {}).get("total_tokens", 0)
        if tool := resp.get("tool_call"):
            name, args = tool["name"], tool["arguments"]
            if name not in TOOL_REGISTRY:
                break
            schema, fn = TOOL_REGISTRY[name]
            inputs = schema(**args)  # validate
            if name == "search_docs":
                hits = await retrieve(inputs.query, inputs.top_k)
                tool_obs = {"tool": name, "result": hits}
            else:
                tool_obs = await fn(inputs)
            history += [{"role": "tool", "name": name, "content": tool_obs}]
        else:
            answer = resp["output"]
            if not check_policy(answer):  # safety gate
                history += [{"role": "system",
                             "content": "Revise to meet policy: no PII, cite sources."}]
                steps += 1
                continue
            counter("task.success").inc()
            return answer
        steps += 1
    counter("task.timeouts").inc()
    return "Unable to produce a grounded answer within limits. Escalating."
7) Safety, budgets, and failure-mode design
Budgets: cap MAX_STEPS, tokens, wall-clock time, and requests.
PII: redact before send; never log raw inputs; hash user IDs.
Allow/Deny tools: deny state-mutating tools unless preconditions pass.
Fallbacks: graceful “can’t answer” with escalation payload (question + retrieved hits + logs).
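The orchestrator imports redact_pii and check_policy, but safety.py is never shown; a deliberately simple sketch of what it could contain (the regexes and policy rules are placeholders, not a complete PII solution; real deployments should use a vetted PII/DLP library):

# safety.py (illustrative only)
import re

_EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
_PHONE = re.compile(r"\+?\d[\d\s-]{8,}\d")

def redact_pii(text: str) -> str:
    # Replace obvious PII patterns before anything is sent or logged.
    text = _EMAIL.sub("[EMAIL]", text)
    text = _PHONE.sub("[PHONE]", text)
    return text

def check_policy(answer: str) -> bool:
    # Example gates: the answer contains no raw PII and cites at least one source link.
    if _EMAIL.search(answer) or _PHONE.search(answer):
        return False
    return "http" in answer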
8) Observability (from day one)
Instrument everything.
Traces: one span per LLM/tool call with input/output sizes and durations.
Metrics: task_success, time_ms_p50/p95, tokens, tool_calls, timeouts, escalations.
Logs: structured (JSON) per step with correlation IDs.
# observability.py
import time, json, functools
from collections import Counter

_metrics = Counter()

def counter(name):
    class C:
        def inc(self, v=1):
            _metrics[name] += v
    return C()

def trace_span(name):
    def deco(fn):
        @functools.wraps(fn)
        async def wrap(*a, **k):
            t = time.time()
            try:
                return await fn(*a, **k)
            finally:
                print(json.dumps({"span": name, "ms": int((time.time() - t) * 1000)}))
        return wrap
    return deco
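The sketch above emits span timings but not correlation IDs; one way to thread an ID through async code is contextvars. This is an assumed addition, not part of the original observability.py:

# Hypothetical structured-log helper with a per-task correlation ID.
import json, uuid, contextvars

_correlation_id = contextvars.ContextVar("correlation_id", default=None)

def new_correlation_id() -> str:
    # Call once at the start of each run_task invocation.
    cid = uuid.uuid4().hex
    _correlation_id.set(cid)
    return cid

def log_step(event: str, **fields):
    # Every log line from one task shares the same correlation_id.
    print(json.dumps({"event": event, "correlation_id": _correlation_id.get(), **fields}))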
9) Tiny evaluation harness (make it real)
Create 30–50 real tasks with gold answers + evidence links.
eval/cases.jsonl
{"id":"q1","question":"How do we rotate API keys?","gold_contains":["rotate every 90 days"],"must_cite":true}
Runner:
# eval/judge.py
import json, asyncio
from orchestrator import run_task
from llm import LLMClient

async def evaluate():
    llm = LLMClient()
    ok = 0
    total = 0
    for line in open("eval/cases.jsonl"):
        c = json.loads(line)
        total += 1
        out = await run_task(c["question"], llm, budgets={"max_tokens": 4000})
        pass_contains = all(x.lower() in out.lower() for x in c["gold_contains"])
        pass_citation = ("http" in out) if c.get("must_cite") else True
        ok += (pass_contains and pass_citation)
    print(f"TSR: {ok}/{total} = {ok/total:.2%}")

if __name__ == "__main__":
    asyncio.run(evaluate())
Track TSR (task success rate) and latency before you add features.
10) Rollout checklist
Success criteria + 30–50 eval cases.
Deterministic orchestrator + budgets.
Typed tools with validation + idempotency.
Retrieval quality checked on eval set.
Safety gates (redaction, policy).
Traces/metrics/logs + alerting.
P95 latency and cost within budget.
Escalation path defined.
11) Sensible next steps
Memory: add short-term scratchpad summarization; add long-term only if it improves TSR or latency.
Caching: response + retrieval caches with TTL and invalidation hooks.
Reranking: upgrade to learned rerankers when drift appears.
Multi-agent: introduce a router/specialists only when specialization yields measurable gains.
Guarded generation: structured outputs (JSON schemas) to reduce post-processing errors.
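For that last item, a small example of guarded generation with a Pydantic schema, assuming your provider or adapter can return JSON you validate before use (the CitedAnswer model is illustrative):

# Hypothetical structured output: validate the model's JSON before returning it.
from pydantic import BaseModel, Field, ValidationError

class CitedAnswer(BaseModel):
    answer: str = Field(max_length=1500)        # rough proxy for the 200-word cap
    citations: list[str] = Field(min_length=1)  # at least one doc link

def parse_answer(raw_json: str) -> CitedAnswer | None:
    try:
        return CitedAnswer.model_validate_json(raw_json)
    except ValidationError:
        return None  # the orchestrator can ask the model to revise, as in the policy gate above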
Closing
“Building an LLM app” isn’t about calling a model; it’s about owning the loop—planning, tool use, grounding, safety, and measurement. Ship the smallest version that’s reliable under constraints, then iterate with evidence.