TL;DR: For most LLM apps, SSE is the simplest and most reliable way to stream tokens from your server to the browser. Use WebSockets when you need true bi-directional realtime (tool progress, multi-party collab, cursor sharing). Keep fetch streaming over HTTP/2/3 for controlled environments or server-component frameworks. Design a structured event protocol, add abort/interrupt, reconnect with resume, and instrument TTFT (time-to-first-token) and tokens/sec. Ship with edge-friendly routes, short-lived auth, and ARIA live updates.
1) What “good streaming UX” actually means
Fast first paint: TTFT < 300–700 ms feels snappy.
Steady cadence: avoid bursty token dumps; render in small chunks.
Interruptible: user can “Stop” instantly (AbortController) and edit prompt.
Recoverable: reconnect on network blips without duplicating output.
Structured: the stream carries typed events (token, tool, usage, error), not raw text.
Accessible: screen readers get incremental updates (ARIA live regions).
Traceable: every stream has correlation IDs and usage counters.
2) Transport options (browser-first view)
| Capability | SSE (EventSource / fetch + `text/event-stream`) | WebSocket | Fetch Streaming (HTTP/2/3) | WebTransport |
| --- | --- | --- | --- | --- |
| Direction | Server → Client | Bi-directional | Server → Client | Bi-/uni-directional over QUIC |
| Complexity | Low | Medium | Medium | High / emerging |
| Infra friction | Minimal (plain HTTP) | Proxies/load balancers sometimes tricky | Depends on platform | Limited support |
| Auto-reconnect | Built-in (EventSource) | Custom | Custom | Custom |
| Best for | Token streams, logs, progress | Collaborative apps, tool telemetry | SSR/edge functions, RSC | Low-latency labs, custom stacks |
Rule of thumb
Start with SSE for token streaming to the browser.
Use WebSockets when you truly need client→server push while streaming (tool UI, multi-user cursors).
Prefer fetch streaming in Next.js/Remix/SvelteKit SSR routes or edge functions where it fits naturally.
WebTransport is powerful but not yet mainstream for LLM apps.
3) A robust streaming event protocol
Avoid “just send text.” Send framed JSON events with minimal overhead.
Event types
`token`: partial text chunk
`delta`: structured JSON delta (for tool calls/JSON mode)
`tool_start` / `tool_result`
`message_refusal` / `safety_violation`
`usage`: token counts, model, timings
`final`: end-of-stream with metadata
`error`: typed error
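A minimal sketch of these events as a TypeScript discriminated union; the field names mirror the examples below, but the exact shapes are illustrative, not a fixed spec:

```ts
// Illustrative event shapes for the stream protocol described above.
type StreamEvent =
  | { type: "token"; id: string; content: string }
  | { type: "delta"; path: string; value: unknown }
  | { type: "tool_start"; name: string; args: Record<string, unknown> }
  | { type: "tool_result"; name: string; result: unknown }
  | { type: "message_refusal" | "safety_violation"; reason: string }
  | { type: "usage"; prompt_tokens: number; completion_tokens: number }
  | { type: "final"; id: string; finish_reason: "stop" | "length" | "error" }
  | { type: "error"; message: string; code?: string };
```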
NDJSON framing (works over SSE/WebSocket/fetch)
{"type":"token","id":"m1","content":"Hello"}{"type":"token","id":"m1","content":" world"}{"type":"usage","prompt_tokens":342,"completion_tokens":89}{"type":"final","id":"m1","finish_reason":"stop"}
SSE framing
```
event: token
data: {"id":"m1","content":"Hello"}

event: token
data: {"id":"m1","content":" world"}

event: final
data: {"id":"m1","finish_reason":"stop"}
```
Keep payload small, gzip enabled, and include a cursor/offset for resume.
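SSE gives you resume support nearly for free: tag each event with an `id:` line, and on reconnect the browser's EventSource automatically sends a `Last-Event-ID` request header you can resume from. A minimal sketch against an Express route (the `replayFrom` buffer of already-emitted events is hypothetical):

```ts
app.get("/api/stream", (req, res) => {
  res.setHeader("Content-Type", "text/event-stream");
  // EventSource sends this header automatically when it reconnects.
  let cursor = Number(req.headers["last-event-id"] ?? -1) + 1;
  // replayFrom is a hypothetical per-request buffer of already-emitted events.
  for (const ev of replayFrom(String(req.query.requestId), cursor)) {
    res.write(`id: ${cursor++}\nevent: ${ev.type}\ndata: ${JSON.stringify(ev)}\n\n`);
  }
});
```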
4) Front-end patterns (React examples)
4.1 Streaming with SSE (React + EventSource)
```tsx
import { useEffect, useRef, useState } from "react";

export function StreamedAnswer({ requestId, body }: { requestId: string; body: any }) {
  const [text, setText] = useState("");
  const esRef = useRef<EventSource | null>(null);

  useEffect(() => {
    const params = new URLSearchParams({ requestId });
    const es = new EventSource(`/api/stream?${params}`, { withCredentials: true });
    esRef.current = es;

    es.addEventListener("token", (ev) => {
      const msg = JSON.parse((ev as MessageEvent).data);
      setText((t) => t + (msg.content ?? ""));
    });
    es.addEventListener("final", () => es.close());
    es.addEventListener("error", () => es.close());

    return () => es.close();
  }, [requestId]);

  return (
    <div aria-live="polite" aria-busy={text.length === 0} className="prose">
      {text || "Thinking…"}
    </div>
  );
}
```
Notes
`aria-live="polite"` announces incremental text for assistive tech.
Use a "Stop" button calling a `/abort` endpoint (or keep a client token to cancel; see below).
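If you stop via the server, the button might look like this (the `/api/abort` route is hypothetical; it would cancel the upstream model call for that `requestId`):

```tsx
function StopButton({ requestId }: { requestId: string }) {
  return (
    <button
      onClick={() => {
        // Hypothetical endpoint: the server aborts the upstream model stream.
        fetch(`/api/abort?requestId=${encodeURIComponent(requestId)}`, { method: "POST" });
      }}
    >
      Stop generating
    </button>
  );
}
```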
4.2 Streaming with fetch + ReadableStream (works with HTTP/2/3)
```ts
async function* streamLines(res: Response) {
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let buf = "";
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    buf += decoder.decode(value, { stream: true });
    let idx;
    while ((idx = buf.indexOf("\n")) >= 0) {
      yield buf.slice(0, idx); // one NDJSON line
      buf = buf.slice(idx + 1);
    }
  }
  if (buf.trim()) yield buf; // flush a final line with no trailing newline
}

export async function streamAnswer(
  signal: AbortSignal,
  payload: any,
  onToken: (t: string) => void
) {
  const res = await fetch("/api/stream", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
    signal,
  });
  for await (const line of streamLines(res)) {
    const msg = JSON.parse(line);
    if (msg.type === "token") onToken(msg.content);
    if (msg.type === "final") break;
  }
}
```
4.3 Immediate Cancel/Interrupt
```ts
const acRef = useRef<AbortController | null>(null);

function onSend(payload: any) {
  acRef.current?.abort(); // cancel any in-flight stream
  const ac = new AbortController();
  acRef.current = ac;
  streamAnswer(ac.signal, payload, (t) => setText((x) => x + t));
}

function onStop() {
  acRef.current?.abort();
}
```
5) Backpressure, buffering, and rendering
Batch small tokens: render every 30–60 ms or per ~20–60 characters to avoid reflow storms (see the sketch after this list).
Use requestIdleCallback (with a fallback) to schedule heavy DOM updates.
Virtualize long transcripts (e.g., react-virtuoso) to keep DOM light.
For code blocks, buffer until fenced block closes to avoid flicker.
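A minimal sketch of the batching idea from the first bullet, as a small React hook (names are illustrative):

```tsx
import { useRef, useState } from "react";

// Coalesce streamed tokens and flush them to state at most every `flushMs`.
function useBatchedText(flushMs = 50) {
  const [text, setText] = useState("");
  const buf = useRef("");
  const timer = useRef<ReturnType<typeof setTimeout> | null>(null);

  const push = (chunk: string) => {
    buf.current += chunk;
    if (timer.current) return; // a flush is already scheduled
    timer.current = setTimeout(() => {
      setText((t) => t + buf.current);
      buf.current = "";
      timer.current = null;
    }, flushMs);
  };

  return { text, push };
}
```

Wire `push` into the `onToken` callback from 4.2 and render `text`; React then re-renders once per flush interval instead of once per token.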
6) Front-end stack guidance
React + Next.js: great DX, easy edge routes, supports streaming responses and Server Components; co-locate auth and proxies.
Remix: nice streaming via loaders and `defer`, good for progressively rendering pages.
SvelteKit: minimal overhead, great perf; readable streams supported.
State: local state for the live answer; server cache (SWR/TanStack Query) for settled results.
Routing: keep streaming endpoints separate from JSON APIs to tune timeouts and headers.
When to use WebSockets
Live tool progress (multi-step tool execution).
Agentic multi-actor UIs (planner/worker logs).
Multi-user sessions (co-browsing, handoff to human agents).
7) Server routes (Edge/Node examples)
7.1 SSE in a Node/Express route
app.get("/api/stream", async (req, res) => { res.setHeader("Content-Type", "text/event-stream"); res.setHeader("Cache-Control", "no-cache, no-transform"); res.setHeader("Connection", "keep-alive"); const send = (event: string, data: any) => { res.write(`event: ${event}\n`); res.write(`data: ${JSON.stringify(data)}\n\n`); }; // Example: forward tokens from your model SDK try { for await (const token of model.generateStream({ prompt: req.query.q })) { send("token", { content: token }); } send("final", { finish_reason: "stop" }); res.end(); } catch (e) { send("error", { message: "Stream failed" }); res.end(); }});
7.2 WebSocket (ws) with typed events
```ts
wss.on("connection", (socket) => {
  socket.on("message", async (raw) => {
    const msg = JSON.parse(raw.toString());
    if (msg.type === "new_request") {
      for await (const ev of agent.runStream(msg.payload)) {
        socket.send(JSON.stringify(ev)); // token / tool_result / usage / final
      }
    }
  });
});
```
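And a client-side counterpart consuming those typed events (a sketch; `appendToken` and `showToolChip` stand in for your own UI handlers):

```ts
const ws = new WebSocket("wss://app.example.com/agent");

ws.onopen = () =>
  ws.send(JSON.stringify({ type: "new_request", payload: { prompt: "…" } }));

ws.onmessage = (ev) => {
  const msg = JSON.parse(ev.data as string);
  switch (msg.type) {
    case "token":       appendToken(msg.content); break; // grow the live answer
    case "tool_result": showToolChip(msg);        break; // unobtrusive status chip
    case "final":       ws.close();               break;
  }
};
```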
Infra tips
NGINX/ALB timeouts: raise idle timeouts for long streams (e.g., 60–120s).
Enable gzip/brotli, but flush per event so the compressor doesn't hold tokens back.
Add `X-Accel-Buffering: no` (NGINX) to disable proxy buffering for SSE.
Prefer keep-alive and HTTP/2 to reduce connection overhead.
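Also worth knowing: SSE lines beginning with `:` are protocol comments that EventSource discards, so a periodic heartbeat keeps idle proxies from cutting quiet streams. A sketch to drop into the Express route above:

```ts
// Comment frames (": ping") are ignored by EventSource but reset proxy idle timers.
const heartbeat = setInterval(() => res.write(": ping\n\n"), 15_000);
res.on("close", () => clearInterval(heartbeat)); // stop when the client disconnects
```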
8) Product & UX details that matter
Skeleton + type-on effect: show “Thinking…” immediately; replace with tokens.
Inline citations: render superscripts as they arrive; clicking opens a side panel.
Tool feedback: unobtrusive “Calling calendar API…” status chips; collapse into final result.
Edit & resend: keep the input enabled; allow editing mid-stream by aborting current stream.
Copy & export: stream into a buffer; final message enables copy/download.
A11y: `role="status"` or `aria-live="polite"`; avoid spamming screen readers (batch updates).
9) Security & privacy
Never expose model API keys to the browser; front-end talks to your server only.
Use short-lived session tokens for streaming endpoints.
CORS: lock origins; prefer same-origin routes behind auth.
PII controls: redact on server before sending to the model; avoid logging raw prompts.
Content-Security-Policy: forbid mixed content; pin websocket endpoints.
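For the CSP item, `connect-src` is the directive that governs where fetch/SSE/WebSocket connections may go; a sketch as Express middleware, with placeholder origins:

```ts
app.use((_req, res, next) => {
  res.setHeader(
    "Content-Security-Policy",
    // 'self' covers same-origin fetch/SSE; pin your one WSS endpoint explicitly.
    "default-src 'self'; connect-src 'self' wss://ws.example.com"
  );
  next();
});
```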
10) Observability & SLOs
Metrics per stream: TTFT, tokens/sec, stream duration, bytes sent, abort count, error rate.
App signals: clarification count, stop ratio, retries.
Traces: correlate user action → retrieval → model call → stream events.
Budgets: cap wall-clock (e.g., 30–45s) and token usage per request; surface “truncated” state in UI.
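A minimal client-side sketch for the two headline numbers, TTFT and tokens/sec (you supply `report` as your telemetry sink):

```ts
function instrumentStream(report: (m: Record<string, number>) => void) {
  const start = performance.now();
  let firstTokenAt: number | null = null;
  let tokens = 0;
  return {
    onToken() {
      if (firstTokenAt === null) firstTokenAt = performance.now(); // TTFT mark
      tokens++;
    },
    onFinal() {
      const end = performance.now();
      report({
        ttft_ms: (firstTokenAt ?? end) - start,
        tokens_per_sec:
          tokens / Math.max((end - (firstTokenAt ?? end)) / 1000, 1e-3),
      });
    },
  };
}
```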
11) Advanced: structured JSON mode & tool results
For tool-augmented agents, stream JSON deltas not just text:
{"type":"delta","path":"$.thought","value":"Checking your calendar…"}{"type":"tool_start","name":"calendar.find","args":{"date":"2025-09-01"}}{"type":"tool_result","name":"calendar.find","result":[{"slot":"10:30–11:00"}]}{"type":"delta","path":"$.plan[0]","value":"Propose 10:30–11:00"}{"type":"final","value":{"answer":"Booked for 10:30","citations":[]}}
In React, apply `path` updates to a local JSON tree, and render partial structure alongside the growing text.
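A sketch of applying those deltas, using a deliberately tiny path walker that handles `$.a.b[0]`-style paths (real JSONPath is richer):

```ts
function applyDelta(root: any, path: string, value: unknown) {
  // Split "$.plan[0].title" into ["plan", 0, "title"].
  const keys = path
    .replace(/^\$\.?/, "")
    .split(/\.|\[|\]/)
    .filter(Boolean)
    .map((k) => (/^\d+$/.test(k) ? Number(k) : k));
  let node = root;
  for (let i = 0; i < keys.length - 1; i++) {
    const k = keys[i];
    // Create intermediate containers as needed: array if the next key is numeric.
    if (node[k] == null) node[k] = typeof keys[i + 1] === "number" ? [] : {};
    node = node[k];
  }
  node[keys[keys.length - 1]] = value;
  return root;
}

// e.g. applyDelta(tree, "$.plan[0]", "Propose 10:30–11:00")
```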
12) Anti-patterns (cost you latency & reliability)
Sending entire conversation transcript on every turn (use summaries/slots).
Rendering every single token (coalesce into small batches).
Building custom binary framing when SSE/NDJSON works fine.
No abort path—users get stuck waiting.
Proxy buffering enabled—SSE appears to “not stream.”
13) Quick selection guide
Simple chat, browser clients, minimal ops → SSE (+ NDJSON events)
Realtime collab, tool telemetry, agent logs → WebSocket
SSR/Edge render with progressive responses → fetch streaming (HTTP/2/3)
Experimental ultra-low latency → WebTransport (only if you know why)
14) Checklist to ship
Choose transport (SSE/WebSocket/fetch) and define event schema
Implement abort + “Stop generating”
Batch tokens; update UI ≤ every 60 ms
Add TTFT and tokens/sec metrics
A11y: `aria-live` + keyboard shortcuts
Proxy tuned (no buffering, keep-alive, sane timeouts)
Security: server-side keys, short-lived auth, CORS locked
Error states + reconnect with offset resume
Edge deployment (optional): colocate with model/RAG infra
Closing
Great LLM UX isn’t just what the model says—it’s how it arrives. Start with SSE, design a typed event protocol, make it interruptible, and measure the right things. When you need richer realtime semantics, graduate to WebSockets. Keep rendering efficient, accessible, and resilient—and your app will feel fast even when the model is thinking hard.