Core Concepts
Reasoning
Memory & Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools
Safety & Security
Evaluation
Meta
Streaming agent responses to users using server-sent events, WebSocket patterns, token-by-token streaming, and streaming tool calls. Covers how major frameworks handle real-time agent output.
Human conversation is streamed – we process the first word before the speaker finishes the paragraph. Early LLM applications waited for the full generation to complete before sending a response, creating a poor user experience. Streaming transforms the agent UX from “wait and hope” to progressive disclosure.
The impact is dramatic: traditional request-response shows perceived latency of 5-10 seconds, while streaming shows the first token in ~200ms. Users see immediate feedback, can start reading right away, and can interrupt generation mid-response.
For AI agents, streaming is harder than for simple chatbots because agents interleave internal reasoning steps (thoughts, tool calls) with the final answer that users should see. The core engineering challenge is leakage control: building a state machine that streams final-answer tokens while hiding raw tool calls and intermediate reasoning.
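The leakage-control idea can be sketched as a small filter over the agent's event stream. The event names and shapes here are illustrative, not from any particular framework: answer tokens pass through, tool calls are collapsed into a status line, and everything else is suppressed.

```python
from dataclasses import dataclass


@dataclass
class Event:
    type: str          # e.g. "text_delta", "tool_call", "tool_result", "reasoning"
    content: str = ""


def filter_stream(events):
    """Yield only what the user should see.

    Answer tokens pass through verbatim; tool calls are collapsed into a
    status message; reasoning and raw tool results are dropped entirely.
    """
    for ev in events:
        if ev.type == "text_delta":
            yield ("token", ev.content)                  # stream the answer
        elif ev.type == "tool_call":
            yield ("status", f"Using {ev.content}...")   # hide args, show intent
        # "reasoning" and "tool_result" events never reach the client


events = [
    Event("reasoning", "plan the search..."),
    Event("tool_call", "search"),
    Event("tool_result", "{raw json}"),
    Event("text_delta", "Paris"),
]
print(list(filter_stream(events)))
# → [('status', 'Using search...'), ('token', 'Paris')]
```

In a real agent this filter sits between the framework's event stream and the SSE response generator, so leakage control lives in one place rather than being scattered across handlers.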
**Server-Sent Events (SSE).** The standard protocol for LLM streaming. Simple, unidirectional, and friendly to firewalls and load balancers. The server pushes `data: …` chunks over a long-lived HTTP response.

**WebSocket.** Bidirectional communication protocol. More complex but enables richer interaction.
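The SSE wire format is plain text: each message is one or more `data:` lines terminated by a blank line. A minimal helper that frames a JSON payload as an SSE message:

```python
import json


def sse_frame(payload: dict) -> str:
    """Format one event as an SSE frame: a `data:` line plus the blank
    line that terminates the message (hence the double newline)."""
    return f"data: {json.dumps(payload)}\n\n"


print(repr(sse_frame({"type": "token", "content": "Hi"})))
# → 'data: {"type": "token", "content": "Hi"}\n\n'
```

Forgetting the trailing blank line is a classic SSE bug: the browser's `EventSource` will silently hold the partial message until the next blank line arrives.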
| Feature | SSE | WebSocket |
|---|---|---|
| Direction | Server → Client | Bidirectional |
| Complexity | Low | Medium-High |
| Reconnection | Built-in | Manual |
| Load Balancer | Standard HTTP | Requires upgrade support |
| Best For | Token streaming | Interactive agents |
Frameworks emit distinct event types during streaming:
```python
import json
from collections.abc import AsyncGenerator

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

# `agent` is assumed to be an already-initialized agent object
# exposing an async `.stream(query)` event iterator.


async def run_agent_stream(query: str) -> AsyncGenerator[str, None]:
    """Stream agent events as SSE."""
    # Emit status for tool calls
    yield f"data: {json.dumps({'type': 'status', 'content': 'Thinking...'})}\n\n"
    async for event in agent.stream(query):
        if event.type == "text_delta":
            yield f"data: {json.dumps({'type': 'token', 'content': event.text})}\n\n"
        elif event.type == "tool_call":
            yield f"data: {json.dumps({'type': 'status', 'content': f'Using {event.tool_name}...'})}\n\n"
        elif event.type == "tool_result":
            yield f"data: {json.dumps({'type': 'tool_result', 'tool': event.tool_name, 'content': event.result})}\n\n"
    yield f"data: {json.dumps({'type': 'done'})}\n\n"


@app.get("/stream")
async def stream_endpoint(query: str):
    return StreamingResponse(
        run_agent_stream(query),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",  # disable nginx buffering
        },
    )
```
```javascript
const source = new EventSource(`/stream?query=${encodeURIComponent(query)}`);

source.onmessage = (event) => {
  const data = JSON.parse(event.data);
  switch (data.type) {
    case 'token':
      appendToResponse(data.content);
      break;
    case 'status':
      showStatus(data.content);
      break;
    case 'done':
      source.close();
      finalizeResponse();
      break;
  }
};
```
The Vercel AI SDK wraps streaming in a high-level API: the `useChat` hook on the client pairs with `streamText` on the server.
Key features:
`onToolCall` callbacks for client-side tool result display
LangChain's `astream_events` API emits a discrete event for each step of a run. To stream tokens, filter for `on_chat_model_stream` events and read the token delta from each event's chunk.
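A minimal sketch of this filtering, with the event stream stubbed out so it runs standalone. The dict shape mirrors `astream_events` output; in a real application you would iterate `runnable.astream_events(...)` instead of the fake generator:

```python
import asyncio
from types import SimpleNamespace


async def fake_astream_events():
    """Stand-in for LangChain's astream_events: yields dicts in the same
    {"event": ..., "data": {...}} shape a real runnable would emit."""
    for ev in [
        {"event": "on_chain_start", "data": {}},
        {"event": "on_chat_model_stream", "data": {"chunk": SimpleNamespace(content="Hel")}},
        {"event": "on_chat_model_stream", "data": {"chunk": SimpleNamespace(content="lo")}},
        {"event": "on_chain_end", "data": {}},
    ]:
        yield ev


async def collect_tokens(events):
    """Keep only token-level events; chain/tool lifecycle events are skipped."""
    tokens = []
    async for ev in events:
        if ev["event"] == "on_chat_model_stream":
            tokens.append(ev["data"]["chunk"].content)
    return tokens


print(asyncio.run(collect_tokens(fake_astream_events())))
# → ['Hel', 'lo']
```

In production the same `if ev["event"] == "on_chat_model_stream"` filter feeds an SSE generator instead of a list, yielding one frame per token.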
LangChain's conversation memory classes (such as `ConversationBufferMemory`) can maintain chat state across turns while streaming, and this event-driven processing pairs naturally with vector databases for context retrieval.
Reverse proxies buffer responses by default, which defeats streaming: behind nginx, send the `X-Accel-Buffering: no` response header or set `proxy_buffering off` so each chunk is flushed to the client as soon as it is produced.
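For nginx specifically, the relevant directives look roughly like this (a sketch, not a complete site config; the upstream name is illustrative):

```nginx
location /stream {
    proxy_pass http://agent_backend;   # hypothetical upstream
    proxy_http_version 1.1;
    proxy_set_header Connection "";    # allow upstream keep-alive
    proxy_buffering off;               # flush SSE chunks immediately
    proxy_read_timeout 3600s;          # tolerate long-lived streams
}
```

Setting the `X-Accel-Buffering: no` header from the application achieves the same per-response effect without touching the nginx config, which is why many frameworks emit it by default for SSE endpoints.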