Monitoring the Context Window in LLM Applications

TL;DR: Treat your context window like a budget. Measure it (tokens-in/out, occupancy, truncation, drift), manage it (slot-based memory + summaries + RAG), and gate it (hard caps with graceful fallbacks). Context sizes vary by model—e.g., OpenAI GPT-4o (128k) (OpenAI Platform), Claude Sonnet 4 up to 1M tokens in beta (Anthropic), and Gemini models offering 1M–2M tokens depending on the version (Google AI for Developers)—so your monitoring must be provider-aware.


Why the context window matters

The context window is the max token budget an LLM can attend to at once. Exceed it and providers truncate—often lopping off the oldest parts—risking lost instructions, broken references, and degraded answers. Tokenization also differs by provider (e.g., Google notes ~4 characters per token as a rule of thumb), so cross-vendor counts won’t match exactly. (Google AI for Developers)


What to monitor (make it a dashboard)

  • Tokens In / Tokens Out per turn and per session.

  • Occupancy %: (prompt_tokens / model_context_limit) * 100 (see the sketch after this list).

  • Truncation Rate: % of requests where you had to drop content to fit.

  • TTFT / P95 Latency: especially as occupancy rises.

  • Summary Drift: semantic delta between original text and its summary (spot check or LLM-judge).

  • Retrieval Recall@K: if you use RAG to offload history.

  • Budget Breaches: hard stops where generation was curtailed.
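
A minimal sketch of computing the occupancy and truncation numbers above, assuming you already have per-request token counts from your provider's usage fields and know the model's context limit (all names here are illustrative):

from dataclasses import dataclass

@dataclass
class TurnMetrics:
    prompt_tokens: int
    completion_tokens: int
    context_limit: int
    truncated: bool

def occupancy_pct(m: TurnMetrics) -> float:
    # Share of the model's window consumed by the prompt
    return 100.0 * m.prompt_tokens / m.context_limit

def truncation_rate(turns: list[TurnMetrics]) -> float:
    # Fraction of requests where content had to be dropped to fit
    return sum(t.truncated for t in turns) / max(len(turns), 1)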

Observability tools: Langfuse and Helicone both track tokens, latency, and costs out of the box—handy for correlating occupancy with quality and spend. (Langfuse, Helicone.ai)


Model context snapshots (Aug 31, 2025)

  • OpenAI GPT-4o: 128,000-token context window (model card). (OpenAI Platform)

  • Anthropic Claude Sonnet 4: up to 1,000,000 tokens (beta; availability and pricing caveats apply). (Anthropic)

  • Google Gemini: long-context documentation covers 1M+ token windows; current model pages list 1M for Gemini 2.0 Flash and up to 2M for Gemini 1.5 Pro. (Google AI for Developers)

Always check the latest provider docs for your exact model; limits can differ by snapshot, tier, or beta flag. (OpenAI Platform, Anthropic, Google AI for Developers)
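
If you hard-code limits for budgeting, keep them in one provider-aware table and treat it as configuration that can go stale. A minimal sketch, mirroring the snapshot above (the model identifiers are illustrative; verify against current provider docs):

# Context limits in tokens, per the Aug 31, 2025 snapshot above.
# Defaults to be overridden by config, not ground truth.
CONTEXT_LIMITS = {
    "gpt-4o": 128_000,
    "claude-sonnet-4": 1_000_000,    # 1M is beta/availability-gated
    "gemini-2.0-flash": 1_000_000,
    "gemini-1.5-pro": 2_000_000,
}

def context_limit(model: str, default: int = 128_000) -> int:
    # Fall back conservatively when a model isn't in the table
    return CONTEXT_LIMITS.get(model, default)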


A reliable management pattern (that scales)

1) Slot-based memory, not raw transcripts
Keep conversation state in named slots (user goal, constraints, decisions, artifacts). Only materialize a concise “working memory” into the prompt. Older turns get rolled up into slot summaries.
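
A minimal sketch of the slot idea, assuming a summarize(text) -> str helper you supply; only the materialized working-memory string goes into the prompt:

SLOT_NAMES = ["user_goal", "constraints", "decisions", "artifacts"]

class SlotMemory:
    def __init__(self):
        self.slots = {name: "" for name in SLOT_NAMES}

    def update(self, name: str, text: str, summarize) -> None:
        # Roll new information into the slot's running summary
        combined = (self.slots[name] + "\n" + text).strip()
        self.slots[name] = summarize(combined)

    def working_memory(self) -> str:
        # The concise block that gets materialized into the prompt
        return "\n".join(f"{k}: {v}" for k, v in self.slots.items() if v)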

2) Retrieval-Augmented Memory (RAM)
Store detail in a vector store or document DB; at turn time, retrieve only what matters (top-k). This keeps prompts tight and grounded.
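
A bare-bones sketch of retrieve-then-prompt, assuming an embed(text) -> list[float] callable you provide; in production this would be a vector-store query rather than an in-memory scan:

import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query: str, store: list[dict], embed, k: int = 5) -> list[str]:
    # store: list of {"text": str, "vec": list[float]} built offline
    qv = embed(query)
    ranked = sorted(store, key=lambda d: cosine(qv, d["vec"]), reverse=True)
    return [d["text"] for d in ranked[:k]]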

3) Context budgets & guards

  • Soft cap: warn at 70% occupancy; prefer summarization.

  • Hard cap: at ~85–90% occupancy, summarize or drop least-valuable chunks (see LLM-scored importance).

  • Absolute cap: refuse/clarify before you exceed provider limits.
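
A minimal Python sketch of the three caps above (the thresholds are the suggestions from this list, not universal constants; the TypeScript guard later in this post is the server-side equivalent):

def check_budget(prompt_tokens: int, context_limit: int,
                 soft: float = 0.70, hard: float = 0.88) -> str:
    occupancy = prompt_tokens / context_limit
    if occupancy >= 1.0:
        return "refuse_or_clarify"      # absolute cap: never send past the provider limit
    if occupancy >= hard:
        return "summarize_or_drop"      # hard cap: shed least-valuable chunks
    if occupancy >= soft:
        return "prefer_summarization"   # soft cap: warn and summarize proactively
    return "pass"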

4) One-shot, structured prompts
Use JSON-style sections for instructions, slots, retrieved facts, and tool outputs. The model reads less fluff; you spend fewer tokens.
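
One way to keep that structure explicit is to build the prompt from a dict and serialize it, so every token has a named job (the section names are illustrative):

import json

def build_prompt(instructions: str, slots: dict, facts: list[str], tool_outputs: list[str]) -> str:
    # Each section is labeled so the model (and your logs) can tell them apart
    return json.dumps({
        "instructions": instructions,
        "slots": slots,               # e.g. working memory from a slot store
        "retrieved_facts": facts,     # e.g. top-k results from retrieval
        "tool_outputs": tool_outputs,
    }, ensure_ascii=False, indent=2)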


Practical code patterns

Token counting (Python; provider-aware)

def count_tokens_openai(messages, enc):
    # enc: a tiktoken-like encoder for your OpenAI model
    return sum(len(enc.encode(m["content"])) for m in messages)

def count_tokens_generic(text: str, approx_chars_per_token=4):
    return (len(text) // approx_chars_per_token) + 1  # rough fallback
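
For OpenAI models, the encoder can come from tiktoken; the snippet below is a sketch that assumes a recent tiktoken release that recognizes your model name, with a named encoding as fallback:

import tiktoken

try:
    enc = tiktoken.encoding_for_model("gpt-4o")
except KeyError:
    enc = tiktoken.get_encoding("o200k_base")  # fallback encoding for newer OpenAI models

messages = [{"role": "user", "content": "Summarize our last three decisions."}]
prompt_tokens = count_tokens_openai(messages, enc)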

A tiny “context budgeter”

def budget_prompt(messages, limit, summarize, score):
    """
    messages:  list of {role, content, kind, tokens}
               kind: 'instruction' | 'slot' | 'retrieved' | 'chatter'; tokens pre-counted
    limit:     max prompt tokens allowed for the model
    summarize: fn(list[msg]) -> str   # produces a short summary of the dropped messages
    score:     fn(msg) -> float       # importance score (lower drops first)
    """
    tokens = lambda msgs: sum(m["tokens"] for m in msgs)
    priority = {"instruction": 0, "slot": 1, "retrieved": 2, "chatter": 3}
    msgs = sorted(messages, key=lambda m: priority[m["kind"]])
    if tokens(msgs) <= limit:
        return msgs
    # Summarize low-value buckets first (chatter, then retrieved)
    for kind in ["chatter", "retrieved"]:
        bucket = [m for m in msgs if m["kind"] == kind]
        if len(bucket) < 2:
            continue
        # Replace the bottom 50% by importance with a single summary message
        bucket.sort(key=score)
        dropped, keep = bucket[: len(bucket) // 2], bucket[len(bucket) // 2:]
        summary_text = summarize(dropped)
        summary = {"role": "system", "kind": kind, "content": summary_text,
                   "tokens": count_tokens_generic(summary_text)}  # counter defined above
        msgs = [m for m in msgs if m["kind"] != kind] + keep + [summary]
        if tokens(msgs) <= limit:
            break
    return sorted(msgs, key=lambda m: priority[m["kind"]])
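
A toy call, assuming tokens were pre-counted with one of the counters above and using trivial summarize/score stand-ins:

history = [
    {"role": "system", "kind": "instruction", "content": "Answer concisely.", "tokens": 4},
    {"role": "system", "kind": "slot", "content": "user_goal: migrate the DB to Postgres", "tokens": 9},
    {"role": "user", "kind": "chatter", "content": "thanks!", "tokens": 2},
    {"role": "assistant", "kind": "chatter", "content": "You're welcome.", "tokens": 4},
]

fitted = budget_prompt(
    history,
    limit=3000,  # comfortably under budget here; shrink it to exercise the summarization path
    summarize=lambda msgs: " / ".join(m["content"] for m in msgs)[:200],
    score=lambda m: float(m["tokens"]),  # cheap proxy: shorter chatter is summarized away first
)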

Server-side guard (TypeScript)

function enforceBudget(count: number, max: number) {
  if (count > max) {
    return { ok: false, action: "summarize", message: "Context too large; summarizing history." };
  }
  if (count > 0.85 * max) {
    return { ok: true, action: "trim", message: "High occupancy; trimming low-value turns." };
  }
  return { ok: true, action: "pass" };
}

Observability wiring (what to log per request)

  • model, model_snapshot, context_limit

  • prompt_tokens, completion_tokens, occupancy_pct

  • truncated: bool; dropped_items: ids; summary_bytes

  • ttft_ms, latency_ms

  • rag_hits, rag_confidence

  • finish_reason (stop, length, content_filter)
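
As a concrete shape, a per-request record could look like the sketch below (field names mirror the list above); emit it as structured JSON so whatever backend you use can segment on any field:

import json, time

def log_request(m: dict) -> None:
    record = {
        "ts": time.time(),
        "model": m["model"],
        "model_snapshot": m.get("model_snapshot"),
        "context_limit": m["context_limit"],
        "prompt_tokens": m["prompt_tokens"],
        "completion_tokens": m["completion_tokens"],
        "occupancy_pct": 100.0 * m["prompt_tokens"] / m["context_limit"],
        "truncated": m.get("truncated", False),
        "dropped_items": m.get("dropped_items", []),
        "ttft_ms": m.get("ttft_ms"),
        "latency_ms": m.get("latency_ms"),
        "rag_hits": m.get("rag_hits"),
        "finish_reason": m.get("finish_reason"),
    }
    print(json.dumps(record))  # or ship it to your logging pipeline / Langfuse / Helicone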

Langfuse and Helicone both provide token/cost tracking and latency metrics you can segment by route, feature, or user cohort. (Langfuse, Helicone.ai)


Provider features worth knowing

  • OpenAI: model cards list context limits; requests that exceed the window are rejected or truncated depending on the API and its settings. Don't rely on implicit handling: guard before you send. (OpenAI Platform)

  • Anthropic: explicit docs for token counting and guidance on long-context usage and pricing tiers (1M is beta/availability-gated). (Anthropic)

  • Gemini: long-context docs and token guidance; some models provide 1M–2M tokens context. (Google AI for Developers)


A simple rollout plan

  • Week 1: Add token counters + occupancy telemetry; define caps & fallbacks.

  • Week 2: Implement slot-based memory and RAM (RAG for older turns).

  • Week 3: Add summarization with drift checks; A/B occupancy vs quality.

  • Week 4: Set SLOs and alerts (P95 latency, truncation rate, occupancy >90%); document runbooks.


Bottom line

Bigger windows are helpful, but discipline beats size. Most wins come from what you choose to include: instructions → slots → just-in-time retrieval. Measure relentlessly, prune aggressively, and keep answers grounded. Your users—and your latency and cost graphs—will thank you.