Agent Streaming

Streaming agent responses to users using server-sent events, WebSocket patterns, token-by-token streaming, and streaming tool calls. Covers how major frameworks handle real-time agent output.

Overview

Human conversation is streamed – we process the first word before the speaker finishes the paragraph. Early LLM applications waited for the full generation to complete before sending a response, creating a poor user experience. Streaming transforms the agent UX from “wait and hope” to progressive disclosure.

The impact is dramatic: traditional request-response shows perceived latency of 5-10 seconds, while streaming shows the first token in ~200ms. Users see immediate feedback, can start reading right away, and can interrupt generation mid-response.

For AI agents, streaming is harder than for simple chatbots because agents interleave internal reasoning steps (thoughts, tool calls) with the final answer the user should see. The core engineering challenge is leakage control: building a state machine that streams final-answer tokens while hiding raw tool calls and intermediate reasoning.

Streaming Protocols

Server-Sent Events (SSE)

The standard protocol for LLM streaming. Simple, unidirectional, and friendly to firewalls and load balancers.

WebSockets

Bidirectional communication protocol. More complex but enables richer interaction.

Comparison

| Feature       | SSE             | WebSocket                |
| ------------- | --------------- | ------------------------ |
| Direction     | Server → Client | Bidirectional            |
| Complexity    | Low             | Medium-High              |
| Reconnection  | Built-in        | Manual                   |
| Load balancer | Standard HTTP   | Requires upgrade support |
| Best for      | Token streaming | Interactive agents       |
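SSE's "built-in" reconnection works because the browser automatically resends the last seen `id:` value in a `Last-Event-ID` header, letting the server replay missed events. A minimal server-side replay buffer might look like this (a sketch under the assumption that events are assigned monotonically increasing integer ids):

```python
from collections import deque


class ReplayBuffer:
    """Keep the last N events so a reconnecting SSE client can catch up.

    On reconnect, the client's Last-Event-ID header tells the server
    which events it already received; everything after that is replayed.
    """

    def __init__(self, maxlen: int = 256):
        self._events: deque[tuple[int, str]] = deque(maxlen=maxlen)
        self._next_id = 0

    def append(self, payload: str) -> int:
        """Store a payload and return the id assigned to it."""
        event_id = self._next_id
        self._next_id += 1
        self._events.append((event_id, payload))
        return event_id

    def since(self, last_event_id: int) -> list[tuple[int, str]]:
        """Events the client has not yet seen."""
        return [(i, p) for i, p in self._events if i > last_event_id]
```

With WebSockets, the equivalent catch-up logic (and the heartbeat that detects dead connections) must be built by hand, which is the main source of the "Manual" reconnection cost in the table.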

Streaming Architecture

graph LR
    A[User Input] --> B[Agent Server]
    B --> C{LLM Call}
    C -->|Token Stream| D[Stream Transformer]
    D -->|Tool Call Event| E[Tool Executor]
    E -->|Tool Result| C
    D -->|Status Update| F[Client: Status Bar]
    D -->|Answer Token| G[Client: Response Area]
    subgraph Stream Transformer
        D --> H{Event Type?}
        H -->|thought| I[Filter / Hide]
        H -->|tool_call| J[Emit Status]
        H -->|text_delta| K[Forward Token]
    end

Event Types

Frameworks emit distinct event types during streaming. The examples below use these:

- text_delta – a token (or small chunk) of the final answer text
- tool_call – the model requested a tool; surfaced to the user as a status update
- tool_result – a tool finished executing; optionally shown to the user
- status – a progress message such as "Thinking..."
- done – the stream is complete and the client can finalize the response

Framework Implementations

SSE with FastAPI (Python)

import asyncio
import json
from collections.abc import AsyncGenerator
 
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
 
app = FastAPI()
 
 
async def run_agent_stream(query: str) -> AsyncGenerator[str, None]:
    """Stream agent events as SSE frames (each terminated by a blank line)."""
    # Emit an initial status so the client shows feedback immediately
    yield f"data: {json.dumps({'type': 'status', 'content': 'Thinking...'})}\n\n"

    # `agent` is a framework-specific object assumed to expose an async
    # event stream; substitute your framework's streaming API here.
    async for event in agent.stream(query):
        if event.type == "text_delta":
            yield f"data: {json.dumps({'type': 'token', 'content': event.text})}\n\n"
        elif event.type == "tool_call":
            yield f"data: {json.dumps({'type': 'status', 'content': f'Using {event.tool_name}...'})}\n\n"
        elif event.type == "tool_result":
            yield f"data: {json.dumps({'type': 'tool_result', 'tool': event.tool_name, 'content': event.result})}\n\n"

    yield f"data: {json.dumps({'type': 'done'})}\n\n"
 
 
@app.get("/stream")
async def stream_endpoint(query: str):
    return StreamingResponse(
        run_agent_stream(query),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",  # disable nginx buffering
        },
    )

Client-Side SSE (JavaScript)

const source = new EventSource(`/stream?query=${encodeURIComponent(query)}`);
 
source.onmessage = (event) => {
  const data = JSON.parse(event.data);
  switch (data.type) {
    case 'token':
      appendToResponse(data.content);
      break;
    case 'status':
      showStatus(data.content);
      break;
    case 'done':
      source.close();
      finalizeResponse();
      break;
  }
};
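For non-browser consumers (tests, CLI tools, server-to-server calls), the same stream can be parsed by splitting on blank lines and stripping the `data:` prefix. This is a stdlib-only sketch over an iterable of text lines; production clients usually rely on an SSE library instead.

```python
import json
from collections.abc import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded JSON payloads from a stream of SSE text lines.

    A frame is one or more "data:" lines followed by a blank line;
    multiple data lines in one frame are joined with newlines, per the
    SSE specification.
    """
    data_parts: list[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("data:"):
            data_parts.append(line[5:].lstrip(" "))
        elif line == "" and data_parts:
            yield json.loads("\n".join(data_parts))
            data_parts = []
```

Note that `EventSource` in the browser only supports GET; POST-based streaming endpoints need a fetch-and-parse approach equivalent to this on the client side.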

Vercel AI SDK

The Vercel AI SDK wraps streaming in a high-level API: streamText on the server produces a streaming response that the useChat hook consumes on the client, handling parsing, message state, and UI updates automatically.

LangChain

LangChain exposes streaming through astream_events, which emits a discrete event for each token. Filter for on_chat_model_stream events and read the content of each chunk. (ResponseTextDeltaEvent belongs to the OpenAI Agents SDK, not LangChain.)

Conversation state and retrieval are orthogonal to streaming: the same event stream works whether or not the chain carries chat history (ConversationBufferMemory is a legacy API for this) or pulls context from a vector store.
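The filtering step can be shown without importing LangChain itself. This sketch assumes the v2 astream_events dict shape (`{"event": ..., "data": {"chunk": ...}}` where the chunk carries a `.content` attribute); `extract_tokens` is a hypothetical helper, and in real code you would iterate the events asynchronously.

```python
from collections.abc import Iterable


def extract_tokens(events: Iterable[dict]) -> list[str]:
    """Pull text deltas out of astream_events-style event dicts.

    Only on_chat_model_stream events carry token chunks; everything
    else (chain starts/ends, tool events) is ignored here.
    """
    tokens = []
    for event in events:
        if event.get("event") == "on_chat_model_stream":
            chunk = event["data"]["chunk"]
            # Empty-content chunks (e.g. pure tool-call deltas) are skipped
            if getattr(chunk, "content", None):
                tokens.append(chunk.content)
    return tokens
```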

Production Considerations

References

See Also