Agent Streaming

Streaming agent responses to users using server-sent events, WebSocket patterns, token-by-token streaming, and streaming tool calls. Covers how major frameworks handle real-time agent output.

Overview

Human conversation is streamed – we process the first word before the speaker finishes the paragraph. Early LLM applications waited for the full generation to complete before sending a response, creating a poor user experience. Streaming transforms the agent UX from “wait and hope” to progressive disclosure.

The impact is dramatic: traditional request-response shows perceived latency of 5-10 seconds, while streaming shows the first token in ~200ms. Users see immediate feedback, can start reading right away, and can interrupt generation mid-response.

For AI agents, streaming is harder than for simple chatbots because agents interleave internal reasoning steps (thoughts, tool calls) with the final answer the user should see. The core engineering challenge is leakage control: building a state machine that streams final-answer tokens while hiding raw tool calls and intermediate reasoning.

Streaming Protocols

Server-Sent Events (SSE)

The standard protocol for LLM streaming. Simple, unidirectional, and friendly to firewalls and load balancers.

WebSockets

Bidirectional communication protocol. More complex but enables richer interaction.

Comparison

| Feature       | SSE             | WebSocket                |
| ------------- | --------------- | ------------------------ |
| Direction     | Server → Client | Bidirectional            |
| Complexity    | Low             | Medium-High              |
| Reconnection  | Built-in        | Manual                   |
| Load balancer | Standard HTTP   | Requires upgrade support |
| Best for      | Token streaming | Interactive agents       |
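SSE's "built-in" reconnection works because the browser automatically resends the last seen `id:` value in a `Last-Event-ID` header, letting the server replay missed events. A minimal server-side replay buffer might look like this (a sketch under the assumption that events are assigned monotonically increasing integer ids):

```python
from collections import deque


class ReplayBuffer:
    """Keep the last N events so a reconnecting SSE client can catch up.

    On reconnect, the client's Last-Event-ID header tells the server
    which events it already received; everything after that is replayed.
    """

    def __init__(self, maxlen: int = 256):
        self._events: deque[tuple[int, str]] = deque(maxlen=maxlen)
        self._next_id = 0

    def append(self, payload: str) -> int:
        """Store a payload and return the id assigned to it."""
        event_id = self._next_id
        self._next_id += 1
        self._events.append((event_id, payload))
        return event_id

    def since(self, last_event_id: int) -> list[tuple[int, str]]:
        """Events the client has not yet seen."""
        return [(i, p) for i, p in self._events if i > last_event_id]
```

With WebSockets, the equivalent catch-up logic (and the heartbeat that detects dead connections) must be built by hand, which is the main source of the "Manual" reconnection cost in the table.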

Streaming Architecture

graph LR
    A[User Input] --> B[Agent Server]
    B --> C{LLM Call}
    C -->|Token Stream| D[Stream Transformer]
    D -->|Tool Call Event| E[Tool Executor]
    E -->|Tool Result| C
    D -->|Status Update| F[Client: Status Bar]
    D -->|Answer Token| G[Client: Response Area]
    subgraph Stream Transformer
        D --> H{Event Type?}
        H -->|thought| I[Filter / Hide]
        H -->|tool_call| J[Emit Status]
        H -->|text_delta| K[Forward Token]
    end

Event Types

Frameworks emit distinct event types during streaming. The examples below use these:

- text_delta – a token (or small chunk) of the final answer text
- tool_call – the model requested a tool; surfaced to the user as a status update
- tool_result – a tool finished executing; optionally shown to the user
- status – a progress message such as "Thinking..."
- done – the stream is complete and the client can finalize the response

Framework Implementations

SSE with FastAPI (Python)

import asyncio
import json
from collections.abc import AsyncGenerator
 
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
 
app = FastAPI()
 
 
async def run_agent_stream(query: str) -> AsyncGenerator[str, None]:
    """Stream agent events as SSE frames (each terminated by a blank line)."""
    # Emit an initial status so the client shows feedback immediately
    yield f"data: {json.dumps({'type': 'status', 'content': 'Thinking...'})}\n\n"

    # `agent` is a framework-specific object assumed to expose an async
    # event stream; substitute your framework's streaming API here.
    async for event in agent.stream(query):
        if event.type == "text_delta":
            yield f"data: {json.dumps({'type': 'token', 'content': event.text})}\n\n"
        elif event.type == "tool_call":
            yield f"data: {json.dumps({'type': 'status', 'content': f'Using {event.tool_name}...'})}\n\n"
        elif event.type == "tool_result":
            yield f"data: {json.dumps({'type': 'tool_result', 'tool': event.tool_name, 'content': event.result})}\n\n"

    yield f"data: {json.dumps({'type': 'done'})}\n\n"
 
 
@app.get("/stream")
async def stream_endpoint(query: str):
    return StreamingResponse(
        run_agent_stream(query),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",  # disable nginx buffering
        },
    )

Client-Side SSE (JavaScript)

const source = new EventSource(`/stream?query=${encodeURIComponent(query)}`);
 
source.onmessage = (event) => {
  const data = JSON.parse(event.data);
  switch (data.type) {
    case 'token':
      appendToResponse(data.content);
      break;
    case 'status':
      showStatus(data.content);
      break;
    case 'done':
      source.close();
      finalizeResponse();
      break;
  }
};
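For non-browser consumers (tests, CLI tools, server-to-server calls), the same stream can be parsed by splitting on blank lines and stripping the `data:` prefix. This is a stdlib-only sketch over an iterable of text lines; production clients usually rely on an SSE library instead.

```python
import json
from collections.abc import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded JSON payloads from a stream of SSE text lines.

    A frame is one or more "data:" lines followed by a blank line;
    multiple data lines in one frame are joined with newlines, per the
    SSE specification.
    """
    data_parts: list[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("data:"):
            data_parts.append(line[5:].lstrip(" "))
        elif line == "" and data_parts:
            yield json.loads("\n".join(data_parts))
            data_parts = []
```

Note that `EventSource` in the browser only supports GET; POST-based streaming endpoints need a fetch-and-parse approach equivalent to this on the client side.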

Vercel AI SDK

The Vercel AI SDK wraps streaming in a high-level API: streamText on the server produces a streaming response that the useChat hook consumes on the client, handling parsing, message state, and UI updates automatically.

LangChain

LangChain exposes streaming through astream_events, which emits a discrete event for each token. Filter for on_chat_model_stream events and read the content of each chunk. (ResponseTextDeltaEvent belongs to the OpenAI Agents SDK, not LangChain.)

Conversation state and retrieval are orthogonal to streaming: the same event stream works whether or not the chain carries chat history (ConversationBufferMemory is a legacy API for this) or pulls context from a vector store.
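The filtering step can be shown without importing LangChain itself. This sketch assumes the v2 astream_events dict shape (`{"event": ..., "data": {"chunk": ...}}` where the chunk carries a `.content` attribute); `extract_tokens` is a hypothetical helper, and in real code you would iterate the events asynchronously.

```python
from collections.abc import Iterable


def extract_tokens(events: Iterable[dict]) -> list[str]:
    """Pull text deltas out of astream_events-style event dicts.

    Only on_chat_model_stream events carry token chunks; everything
    else (chain starts/ends, tool events) is ignored here.
    """
    tokens = []
    for event in events:
        if event.get("event") == "on_chat_model_stream":
            chunk = event["data"]["chunk"]
            # Empty-content chunks (e.g. pure tool-call deltas) are skipped
            if getattr(chunk, "content", None):
                tokens.append(chunk.content)
    return tokens
```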

Production Considerations

References

See Also