====== Agent Streaming ======
How to stream agent responses to users with server-sent events, WebSockets, token-by-token output, and streaming tool calls, and how major frameworks handle real-time agent output.
===== Overview =====
Human conversation is streamed -- we process the first word before the speaker finishes the paragraph. Early LLM applications waited for the full generation to complete before sending a response, creating a poor user experience. Streaming transforms the agent UX from "wait and hope" to progressive disclosure.
The impact is dramatic: traditional request-response shows perceived latency of 5-10 seconds, while streaming shows the first token in ~200ms. Users see immediate feedback, can start reading right away, and can interrupt generation mid-response.
For AI agents, streaming is harder than for simple chatbots because agents have internal reasoning steps (thoughts, tool calls) interspersed with final answers that users should see. The core engineering challenge is //leakage control//: building a state machine that streams final answer tokens while hiding raw tool calls and intermediate reasoning.
===== Streaming Protocols =====
==== Server-Sent Events (SSE) ====
The standard protocol for LLM streaming. Simple, unidirectional, and friendly to firewalls and load balancers.
* Server keeps an HTTP connection open and pushes ''data: ...'' chunks
* One-way communication (server to client)
* Built-in reconnection handling
* Works with standard load balancers and CDNs
* Best for: most agent streaming use cases
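On the wire, an SSE frame is simply a ''data:'' line followed by a blank line. A minimal Python helper (the event shape here is illustrative) makes the framing explicit:

<code python>
import json

def sse_format(event: dict) -> str:
    """Serialize one event as an SSE frame: a 'data:' line plus a blank line."""
    return f"data: {json.dumps(event)}\n\n"

print(sse_format({"type": "token", "content": "Hel"}), end="")
# data: {"type": "token", "content": "Hel"}
</code>

The trailing blank line is what delimits events; forgetting it makes the browser's ''EventSource'' buffer everything into one message.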
==== WebSockets ====
Bidirectional communication protocol. More complex but enables richer interaction.
* Full-duplex communication
* Client can send messages while receiving stream
* Better for: showing real-time tool call progress, allowing user interrupts, multi-turn within a stream
* Higher infrastructure complexity (sticky sessions, special load balancer config)
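The interrupt capability can be sketched without any WebSocket library: here an ''asyncio.Queue'' stands in for the client-to-server half of the duplex channel, and the server polls it between outgoing tokens. The names and the ''stop'' message are assumptions for illustration, not a real protocol:

<code python>
import asyncio

async def stream_with_interrupt(tokens, client_msgs: asyncio.Queue) -> list[str]:
    """Stream tokens, but stop early if the client sends 'stop' on the
    incoming channel -- the full-duplex behavior WebSockets enable."""
    sent = []
    for tok in tokens:
        # Poll the incoming channel without blocking the outgoing stream.
        try:
            if client_msgs.get_nowait() == "stop":
                break
        except asyncio.QueueEmpty:
            pass
        sent.append(tok)        # stand-in for websocket.send(tok)
        await asyncio.sleep(0)  # yield control, as a real send would
    return sent

async def demo():
    inbox = asyncio.Queue()
    full = await stream_with_interrupt(["a", "b", "c"], inbox)  # no interrupt
    inbox.put_nowait("stop")
    cut = await stream_with_interrupt(["a", "b", "c"], inbox)   # interrupted
    return full, cut

print(asyncio.run(demo()))  # (['a', 'b', 'c'], [])
</code>

With SSE the same cancel would require a separate HTTP request to a cancel endpoint; with WebSockets it rides the existing connection.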
==== Comparison ====
^ Feature ^ SSE ^ WebSocket ^
| Direction | Server -> Client | Bidirectional |
| Complexity | Low | Medium-High |
| Reconnection | Built-in | Manual |
| Load Balancer | Standard HTTP | Requires upgrade support |
| Best For | Token streaming | Interactive agents |
===== Streaming Architecture =====
<code>
graph LR
    A[User Input] --> B[Agent Server]
    B --> C{LLM Call}
    C -->|Token Stream| D[Stream Transformer]
    D -->|Tool Call Event| E[Tool Executor]
    E -->|Tool Result| C
    D -->|Status Update| F[Client: Status Bar]
    D -->|Answer Token| G[Client: Response Area]

    subgraph Stream Transformer
        D --> H{Event Type?}
        H -->|thought| I[Filter / Hide]
        H -->|tool_call| J[Emit Status]
        H -->|text_delta| K[Forward Token]
    end
</code>
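The transformer's filtering logic can be sketched as a generator over raw agent events. The dict field names here are assumptions; the point is the mapping: forward answer tokens, surface tool calls as status updates, and drop internal reasoning entirely:

<code python>
def transform(raw_events):
    """Leakage-control filter: forward answer tokens, surface tool calls as
    status updates, and hide internal reasoning from the client."""
    for ev in raw_events:
        if ev["type"] == "text_delta":
            yield {"type": "token", "content": ev["text"]}
        elif ev["type"] == "tool_call":
            yield {"type": "status", "content": f"Using {ev['tool_name']}..."}
        elif ev["type"] == "thought":
            continue  # never sent to the client
    yield {"type": "done"}

events = [
    {"type": "thought", "text": "user wants weather"},
    {"type": "tool_call", "tool_name": "search"},
    {"type": "text_delta", "text": "Sunny"},
]
print(list(transform(events)))
</code>

The ''thought'' branch is the whole point: without an explicit drop, reasoning text interleaves with answer tokens and leaks to the user.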
===== Event Types =====
Frameworks emit distinct event types during streaming:
* **text_delta** -- Individual LLM-generated tokens for progressive text display
* **tool_call** -- Agent invoked a tool; UI can display "Searching..." indicators
* **tool_result** -- Tool returned data; agent continues reasoning
* **thought** -- Internal reasoning (typically hidden from users)
* **done** -- Stream complete; finalize UI
===== Framework Implementations =====
==== SSE with FastAPI (Python) ====
<code python>
import json
from collections.abc import AsyncGenerator

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

def sse(payload: dict) -> str:
    """Serialize one event as an SSE frame: a 'data:' line plus a blank line."""
    return f"data: {json.dumps(payload)}\n\n"

async def run_agent_stream(query: str) -> AsyncGenerator[str, None]:
    """Stream agent events as SSE."""
    # Emit an initial status so the user sees feedback before the first token
    yield sse({"type": "status", "content": "Thinking..."})
    # `agent` is your framework's agent object exposing an async event stream
    async for event in agent.stream(query):
        if event.type == "text_delta":
            yield sse({"type": "token", "content": event.text})
        elif event.type == "tool_call":
            yield sse({"type": "status", "content": f"Using {event.tool_name}..."})
        elif event.type == "tool_result":
            yield sse({"type": "tool_result", "tool": event.tool_name,
                       "content": event.result})
    yield sse({"type": "done"})

@app.get("/stream")
async def stream_endpoint(query: str):
    return StreamingResponse(
        run_agent_stream(query),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",  # disable nginx buffering
        },
    )
</code>
==== Client-Side SSE (JavaScript) ====
<code javascript>
const source = new EventSource(`/stream?query=${encodeURIComponent(query)}`);

source.onmessage = (event) => {
  const data = JSON.parse(event.data);
  switch (data.type) {
    case 'token':
      appendToResponse(data.content);
      break;
    case 'status':
      showStatus(data.content);
      break;
    case 'done':
      source.close();
      finalizeResponse();
      break;
  }
};
</code>
==== Vercel AI SDK ====
The Vercel AI SDK wraps streaming in a high-level API, pairing the ''useChat'' React hook on the client with ''streamText'' on the server.
Key features:
* Automatic SSE handling with built-in React hooks
* ''onToolCall'' callbacks for client-side tool result display
* Progressive enhancement (show citations after answer completes)
* Built-in abort/cancel support
==== LangChain ====
LangChain's ''astream_events'' API emits a discrete event for each step of a run. Filter for ''on_chat_model_stream'' events and read the token text from the chunk in ''event["data"]''; other event types surface chain, tool, and retriever start/end, so tool progress can be shown alongside tokens.
Conversation history is managed separately from the stream (for example with LangChain's chat-history utilities), so the same event-driven loop works for retrieval-augmented agents that pull context from a vector store.
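The filtering pattern can be sketched with stand-in event dicts shaped like LangChain's v2 ''astream_events'' output (''event["event"]'' name, chunk under ''event["data"]''); a real run would iterate ''model.astream_events(query, version="v2")'' instead of a list, and ''Chunk'' here is a stand-in for the SDK's message chunk type:

<code python>
class Chunk:
    """Stand-in for a streamed message chunk with a .content attribute."""
    def __init__(self, content: str):
        self.content = content

def tokens_from_events(events) -> list[str]:
    """Keep only on_chat_model_stream events and extract the token text."""
    out = []
    for event in events:
        if event["event"] == "on_chat_model_stream":
            out.append(event["data"]["chunk"].content)
    return out

sample = [
    {"event": "on_chain_start", "data": {}},
    {"event": "on_chat_model_stream", "data": {"chunk": Chunk("Hel")}},
    {"event": "on_chat_model_stream", "data": {"chunk": Chunk("lo")}},
]
print("".join(tokens_from_events(sample)))  # Hello
</code>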
===== Production Considerations =====
* **Edge buffering** -- Nginx and Cloudflare silently buffer responses by default, killing streaming. Disable with ''X-Accel-Buffering: no'' header or ''proxy_buffering off''
* **Connection timeouts** -- Set appropriate timeouts for long-running agent streams (tool calls can take seconds)
* **Backpressure** -- If the client consumes slower than the server produces, implement buffering or flow control
* **Error mid-stream** -- If an error occurs during streaming, emit an error event rather than closing the connection silently
* **Heartbeats** -- Send periodic keepalive messages to prevent proxy timeouts on idle connections
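Heartbeats are cheap with SSE because lines starting with '':'' are protocol comments that ''EventSource'' silently discards. A sketch of a keepalive-injecting wrapper, assuming agent events arrive on an ''asyncio.Queue'' with ''None'' as an end-of-stream sentinel (both assumptions for illustration):

<code python>
import asyncio

HEARTBEAT_SECS = 15  # assumed interval; tune below your proxy's idle timeout

async def with_heartbeats(events: asyncio.Queue):
    """Yield SSE frames, inserting comment-line keepalives (': ...') whenever
    the agent is quiet for HEARTBEAT_SECS."""
    while True:
        try:
            ev = await asyncio.wait_for(events.get(), timeout=HEARTBEAT_SECS)
        except asyncio.TimeoutError:
            yield ": keepalive\n\n"  # SSE comment, invisible to the client app
            continue
        if ev is None:               # sentinel: stream finished
            return
        yield f"data: {ev}\n\n"

async def demo():
    q = asyncio.Queue()
    q.put_nowait('{"type": "token"}')
    q.put_nowait(None)
    return [frame async for frame in with_heartbeats(q)]

print(asyncio.run(demo()))
</code>

Because the events arrive immediately in the demo, no keepalive fires; in production the timeout branch is what keeps idle proxies from dropping the connection during a slow tool call.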
===== References =====
* [[https://dev.to/nebulagg/how-to-stream-ai-agent-responses-in-5-min-1i0l|How to Stream AI Agent Responses in 5 Min]]
* [[https://arunbaby.com/ai-agents/0048-streaming-real-time-agents/|Streaming Real-Time Agents - Arun Baby]]
* [[https://dontpaniclabs.com/blog/post/2026/01/27/agent-chat-using-langchain-part-2-token-streaming-with-websockets/|Agent Chat using LangChain Part 2 - Token Streaming with WebSockets]]
* [[https://www.9.agency/blog/streaming-ai-responses-vercel-ai-sdk|Streaming AI Responses with Vercel AI SDK]]
* [[https://sparkco.ai/blog/mastering-streaming-responses-in-agent-systems|Mastering Streaming Responses in Agent Systems]]
===== See Also =====
* [[tool_result_parsing|Tool Result Parsing]]
* [[agent_error_recovery|Agent Error Recovery]]
* [[agent_ux_design|Agent UX Design]]