Best Choices for Streaming Responses in LLM Applications: A Front-End Perspective

TL;DR: For most LLM apps, SSE is the simplest and most reliable way to stream tokens from your server to the browser. Use WebSockets when you need true bi-directional realtime (tool progress, multi-party collab, cursor sharing). Keep fetch streaming over HTTP/2/3 for controlled environments or server-component frameworks. Design a structured event protocol, add abort/interrupt, reconnect with resume, and instrument TTFT (time-to-first-token) and tokens/sec. Ship with edge-friendly routes, short-lived auth, and ARIA live updates.


1) What “good streaming UX” actually means

  • Fast first paint: a TTFT of roughly 300–700 ms or less feels snappy.

  • Steady cadence: avoid bursty token dumps; render in small chunks.

  • Interruptible: user can “Stop” instantly (AbortController) and edit prompt.

  • Recoverable: reconnect on network blips without duplicating output.

  • Structured: the stream carries typed events (token, tool, usage, error), not raw text.

  • Accessible: screen readers get incremental updates (ARIA live regions).

  • Traceable: every stream has correlation IDs and usage counters.


2) Transport options (browser-first view)

Capability | SSE (EventSource / fetch + text/event-stream) | WebSocket | Fetch streaming (HTTP/2/3) | WebTransport
Direction | Server → client | Bi-directional | Server → client | Bi-/uni over QUIC
Complexity | Low | Medium | Medium | High / emerging
Infra friction | Minimal (HTTP) | Proxies/load balancers sometimes tricky | Depends on platform | Limited support
Auto-reconnect | Built-in (EventSource) | Custom | Custom | Custom
Best for | Token streams, logs, progress | Collaborative apps, tool telemetry | SSR/edge functions, RSC | Low-latency labs, custom stacks

Rule of thumb

  • Start with SSE for token streaming to the browser.

  • Use WebSockets when you truly need client→server push while streaming (tool UI, multi-user cursors).

  • Prefer fetch streaming in Next.js/Remix/SvelteKit SSR routes or edge functions where it fits naturally.

  • WebTransport is powerful but not yet mainstream for LLM apps.


3) A robust streaming event protocol

Avoid “just send text.” Send framed JSON events with minimal overhead; a typed sketch follows the event list below.

Event types

  • token: partial text chunk

  • delta: structured JSON delta (for tool calls/JSON mode)

  • tool_start / tool_result

  • message_refusal / safety_violation

  • usage: token counts, model, timings

  • final: end-of-stream with metadata

  • error: typed error
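
Since the payload is JSON either way, a discriminated union keeps both ends honest. A minimal TypeScript sketch (field names beyond those shown in the framing examples below are assumptions):

type StreamEvent =
  | { type: "token"; id: string; content: string }
  | { type: "delta"; path: string; value: unknown }
  | { type: "tool_start"; name: string; args: Record<string, unknown> }
  | { type: "tool_result"; name: string; result: unknown }
  | { type: "message_refusal" | "safety_violation"; reason?: string }
  | { type: "usage"; prompt_tokens: number; completion_tokens: number }
  | { type: "final"; id: string; finish_reason: string }
  | { type: "error"; message: string };

// Narrowing on `type` (e.g., in a switch) gives exhaustive, type-safe handling per event.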

NDJSON framing (works over SSE/WebSocket/fetch)

{"type":"token","id":"m1","content":"Hello"}
{"type":"token","id":"m1","content":" world"}
{"type":"usage","prompt_tokens":342,"completion_tokens":89}
{"type":"final","id":"m1","finish_reason":"stop"}

SSE framing (note: each event is terminated by a blank line)

event: token
data: {"id":"m1","content":"Hello"}

event: token
data: {"id":"m1","content":" world"}

event: final
data: {"id":"m1","finish_reason":"stop"}

Keep payloads small, enable compression only if it is flushed per event (buffering middleware will stall the stream), and include a cursor/offset so a reconnect can resume without duplicating output, as sketched below.
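
The server half of resume is cheap with SSE, because EventSource echoes the last seen id back on reconnect as a Last-Event-ID header. A sketch, assuming an Express-style route like the one in section 7.1 (headers and replay buffering omitted or assumed):

// On reconnect, EventSource automatically sends Last-Event-ID; start there.
let offset = Number(req.headers["last-event-id"] ?? 0);
const send = (event: string, data: any) => {
  res.write(`id: ${++offset}\n`);        // the SSE "id:" field is the cursor
  res.write(`event: ${event}\n`);
  res.write(`data: ${JSON.stringify(data)}\n\n`);
};
// ...replay any buffered events with id > offset, then stream new ones...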


4) Front-end patterns (React examples)

4.1 Streaming with SSE (React + EventSource)

import { useEffect, useRef, useState } from "react";

// Note: EventSource is GET-only, so request data travels via query params;
// use the fetch variant (4.2) if you need a POST body.
export function StreamedAnswer({ requestId, body }: { requestId: string; body: any }) {
  const [text, setText] = useState("");
  const esRef = useRef<EventSource | null>(null);

  useEffect(() => {
    const params = new URLSearchParams({ requestId });
    const es = new EventSource(`/api/stream?${params}`, { withCredentials: true });
    esRef.current = es;

    // Append each token chunk as it arrives.
    es.addEventListener("token", (ev) => {
      const msg = JSON.parse((ev as MessageEvent).data);
      setText((t) => t + (msg.content ?? ""));
    });
    es.addEventListener("final", () => es.close());
    es.addEventListener("error", () => es.close());

    return () => es.close(); // clean up on unmount or requestId change
  }, [requestId]);

  return (
    <div aria-live="polite" aria-busy={text.length === 0} className="prose">
      {text || "Thinking…"}
    </div>
  );
}

Notes

  • aria-live="polite" announces incremental text for assistive tech.

  • Use a “Stop” button: close the EventSource and notify an abort endpoint, as sketched below (with the fetch variant, abort client-side via AbortController; see 4.3).
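
A sketch of that Stop handler, assuming the component above and a hypothetical /api/abort endpoint correlated by requestId (EventSource is receive-only, so the server must be told to cancel out of band):

function onStop() {
  esRef.current?.close(); // stop receiving immediately
  // The server keeps generating unless told otherwise; /api/abort is a
  // hypothetical endpoint that cancels the model call by correlation ID.
  void fetch("/api/abort", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ requestId }),
  });
}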

4.2 Streaming with fetch + ReadableStream (works with HTTP/2/3)

async function* streamLines(res: Response) {
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let buf = "";
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    buf += decoder.decode(value, { stream: true });
    let idx;
    while ((idx = buf.indexOf("\n")) >= 0) {
      yield buf.slice(0, idx); // one NDJSON line
      buf = buf.slice(idx + 1);
    }
  }
  if (buf.trim()) yield buf; // flush a final line that lacked a trailing newline
}

export async function streamAnswer(
  signal: AbortSignal,
  payload: any,
  onToken: (t: string) => void
) {
  const res = await fetch("/api/stream", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
    signal,
  });
  if (!res.ok || !res.body) throw new Error(`Stream failed: ${res.status}`);
  for await (const line of streamLines(res)) {
    const msg = JSON.parse(line);
    if (msg.type === "token") onToken(msg.content);
    if (msg.type === "final") break;
  }
}

4.3 Immediate Cancel/Interrupt

const acRef = useRef<AbortController | null>(null);

function onSend(payload: unknown) {
  acRef.current?.abort(); // cancel any in-flight stream before starting a new one
  const ac = new AbortController();
  acRef.current = ac;
  streamAnswer(ac.signal, payload, (t) => setText((x) => x + t));
}

function onStop() {
  acRef.current?.abort(); // "Stop generating"
}

5) Backpressure, buffering, and rendering

  • Batch small tokens: render every 30–60 ms or per ~20–60 characters to avoid reflow storms (see the sketch after this list).

  • Use requestIdleCallback (with a fallback) to schedule heavy DOM updates.

  • Virtualize long transcripts (e.g., react-virtuoso) to keep DOM light.

  • For code blocks, buffer until fenced block closes to avoid flicker.
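
One way to implement the batching above: push chunks into a ref and flush to React state on a timer. A sketch (the 50 ms default is a tunable assumption):

import { useEffect, useRef, useState } from "react";

// Buffers incoming tokens and commits them to state at most every `flushMs`.
export function useBatchedText(flushMs = 50) {
  const [text, setText] = useState("");
  const buf = useRef("");
  useEffect(() => {
    const id = setInterval(() => {
      if (!buf.current) return;
      const chunk = buf.current; // capture before clearing; the updater runs later
      buf.current = "";
      setText((t) => t + chunk);
    }, flushMs);
    return () => clearInterval(id);
  }, [flushMs]);
  const push = (chunk: string) => { buf.current += chunk; };
  return { text, push };
}

Pass push as the onToken callback from 4.2 and render text; the DOM then updates at most ~20 times per second regardless of token rate.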


6) Front-end stack guidance

  • React + Next.js: great DX, easy edge routes, supports streaming responses and Server Components; co-locate auth and proxies.

  • Remix: nice streaming via loaders and defer, good for progressively rendering pages.

  • SvelteKit: minimal overhead, great perf; readable streams supported.

  • State: local state for the live answer; server cache (SWR/TanStack Query) for settled results.

  • Routing: keep streaming endpoints separate from JSON APIs to tune timeouts and headers.

When to use WebSockets

  • Live tool progress (multi-step tool execution).

  • Agentic multi-actor UIs (planner/worker logs).

  • Multi-user sessions (co-browsing, handoff to human agents).


7) Server routes (Edge/Node examples)

7.1 SSE in a Node/Express route

app.get("/api/stream", async (req, res) => {
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache, no-transform");
  res.setHeader("Connection", "keep-alive");
  res.flushHeaders(); // start the stream immediately

  const send = (event: string, data: any) => {
    res.write(`event: ${event}\n`);
    res.write(`data: ${JSON.stringify(data)}\n\n`);
  };

  // Example: forward tokens from your model SDK
  try {
    for await (const token of model.generateStream({ prompt: req.query.q })) {
      send("token", { content: token });
    }
    send("final", { finish_reason: "stop" });
  } catch (e) {
    send("error", { message: "Stream failed" });
  } finally {
    res.end();
  }
});

7.2 WebSocket (ws) with typed events

wss.on("connection", (socket) => {
  socket.on("message", async (raw) => {
    const msg = JSON.parse(raw.toString());
    if (msg.type === "new_request") {
      // Stream typed events back: token / tool_result / usage / final
      for await (const ev of agent.runStream(msg.payload)) {
        socket.send(JSON.stringify(ev));
      }
    }
  });
});
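
The matching browser side sends a typed new_request and dispatches on incoming event types. A minimal sketch (the endpoint URL and the appendToken/showToolChip UI hooks are assumptions):

// Hypothetical UI hooks; wire these to your own rendering.
declare function appendToken(t: string): void;
declare function showToolChip(name: string): void;

const ws = new WebSocket("wss://example.com/agent"); // assumed endpoint
ws.onopen = () => {
  ws.send(JSON.stringify({ type: "new_request", payload: { prompt: "Plan my day" } }));
};
ws.onmessage = (ev) => {
  const msg = JSON.parse(ev.data);
  switch (msg.type) {
    case "token": appendToken(msg.content); break;
    case "tool_start": showToolChip(msg.name); break;
    case "final": ws.close(); break;
  }
};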

Infra tips

  • NGINX/ALB timeouts: raise idle timeouts for long streams (e.g., 60–120 s); a heartbeat sketch after this list helps with stubborn proxies.

  • Enable gzip/brotli only if compression is flushed per event (buffering middleware stalls SSE); add X-Accel-Buffering: no (NGINX) to disable proxy buffering for SSE.

  • Prefer keep-alive and HTTP/2 to reduce connection overhead.
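
Even with generous timeouts, long tool calls can leave the stream idle long enough for a proxy to cut it. A common mitigation, sketched for the Express route in 7.1 (the 15 s interval is an assumption):

// Lines starting with ":" are SSE comments; EventSource silently ignores them.
const heartbeat = setInterval(() => res.write(": ping\n\n"), 15_000);
req.on("close", () => clearInterval(heartbeat)); // stop when the client disconnects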


8) Product & UX details that matter

  • Skeleton + type-on effect: show “Thinking…” immediately; replace with tokens.

  • Inline citations: render superscripts as they arrive; clicking opens a side panel.

  • Tool feedback: unobtrusive “Calling calendar API…” status chips; collapse into final result.

  • Edit & resend: keep the input enabled; allow editing mid-stream by aborting current stream.

  • Copy & export: stream into a buffer; final message enables copy/download.

  • A11y: role="status" or aria-live="polite"; avoid spamming screen readers (batch updates).


9) Security & privacy

  • Never expose model API keys to the browser; front-end talks to your server only.

  • Use short-lived session tokens for streaming endpoints.

  • CORS: lock origins; prefer same-origin routes behind auth.

  • PII controls: redact on server before sending to the model; avoid logging raw prompts.

  • Content-Security-Policy: forbid mixed content; pin WebSocket endpoints via connect-src.


10) Observability & SLOs

  • Metrics per stream: TTFT, tokens/sec, stream duration, bytes sent, abort count, error rate (see the client-side sketch after this list).

  • App signals: clarification count, stop ratio, retries.

  • Traces: correlate user action → retrieval → model call → stream events.

  • Budgets: cap wall-clock (e.g., 30–45s) and token usage per request; surface “truncated” state in UI.
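
TTFT and tokens/sec can be measured exactly where tokens arrive. A client-side sketch that wraps the onToken callback from 4.2 (reportMetrics is a hypothetical telemetry sink):

// Hypothetical sink; wire it to your telemetry pipeline.
declare function reportMetrics(m: { ttftMs: number; tokensPerSec: number }): void;

function instrument(onToken: (t: string) => void) {
  const start = performance.now();
  let firstAt: number | null = null;
  let tokens = 0;
  return {
    onToken(t: string) {
      if (firstAt === null) firstAt = performance.now(); // time-to-first-token
      tokens++;
      onToken(t);
    },
    finish() { // call on the `final` event (or on abort)
      const end = performance.now();
      reportMetrics({
        ttftMs: (firstAt ?? end) - start,
        tokensPerSec: tokens / Math.max((end - start) / 1000, 1e-3),
      });
    },
  };
}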


11) Advanced: structured JSON mode & tool results

For tool-augmented agents, stream JSON deltas, not just text:

{"type":"delta","path":"$.thought","value":"Checking your calendar…"}
{"type":"tool_start","name":"calendar.find","args":{"date":"2025-09-01"}}
{"type":"tool_result","name":"calendar.find","result":[{"slot":"10:30–11:00"}]}
{"type":"delta","path":"$.plan[0]","value":"Propose 10:30–11:00"}
{"type":"final","value":{"answer":"Booked for 10:30","citations":[]}}

In React, apply path updates to a local JSON tree, and render partial structure alongside the growing text.
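
A minimal way to apply those deltas, assuming only the simple $.key / $.arr[0] path subset used above (full JSONPath is out of scope):

// Applies a delta like {path: "$.plan[0]", value: "…"} to a plain JSON tree,
// returning a new tree so React re-renders.
function applyDelta(tree: any, path: string, value: unknown): any {
  const keys = path
    .replace(/^\$\.?/, "")   // drop the "$." root
    .split(/[.\[\]]/)        // split on dots and brackets
    .filter(Boolean)
    .map((k) => (/^\d+$/.test(k) ? Number(k) : k));
  const next = structuredClone(tree);
  let node = next;
  for (let i = 0; i < keys.length - 1; i++) {
    node[keys[i]] ??= typeof keys[i + 1] === "number" ? [] : {};
    node = node[keys[i]];
  }
  node[keys[keys.length - 1]] = value;
  return next;
}
// Usage in the stream handler: setDoc((d) => applyDelta(d, msg.path, msg.value));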


12) Anti-patterns (cost you latency & reliability)

  • Sending entire conversation transcript on every turn (use summaries/slots).

  • Rendering every single token (coalesce into small batches).

  • Building custom binary framing when SSE/NDJSON works fine.

  • No abort path—users get stuck waiting.

  • Proxy buffering enabled—SSE appears to “not stream.”


13) Quick selection guide

  • Simple chat, browser clients, minimal ops → SSE (+ NDJSON events)

  • Realtime collab, tool telemetry, agent logs → WebSocket

  • SSR/Edge render with progressive responses → fetch streaming (HTTP/2/3)

  • Experimental ultra-low latency → WebTransport (only if you know why)


14) Checklist to ship

  • Choose transport (SSE/WebSocket/fetch) and define event schema

  • Implement abort + “Stop generating”

  • Batch tokens; update UI ≤ every 60 ms

  • Add TTFT and tokens/sec metrics

  • A11y: aria-live + keyboard shortcuts

  • Proxy tuned (no buffering, keep-alive, sane timeouts)

  • Security: server-side keys, short-lived auth, CORS locked

  • Error states + reconnect with offset resume

  • Edge deployment (optional): colocate with model/RAG infra


Closing

Great LLM UX isn’t just what the model says—it’s how it arrives. Start with SSE, design a typed event protocol, make it interruptible, and measure the right things. When you need richer realtime semantics, graduate to WebSockets. Keep rendering efficient, accessible, and resilient—and your app will feel fast even when the model is thinking hard.