OpenVoiceUI is an open-source, browser-based voice agent platform that combines speech-to-text input, large language model reasoning, text-to-speech output, and a live web canvas rendering system into a unified interface. Built on top of the OpenClaw AI gateway, it allows users to interact with any supported LLM provider through voice and receive both spoken responses and dynamically generated visual artifacts such as dashboards, reports, and interactive web pages. The project is MIT-licensed and deployed via Docker Compose.
OpenVoiceUI follows a layered architecture that separates the interface layer from the intelligence layer. OpenClaw serves as the backend gateway handling LLM routing, tool orchestration, session management, and authentication, while OpenVoiceUI provides the browser-based frontend for voice interaction and visual rendering.
┌─────────────────────────────────────────────────────────┐
│ Browser Client │
│ ┌──────────┐ ┌──────────────┐ ┌───────────────────┐ │
│ │ Voice I/O│ │ Desktop Shell│ │ Web Canvas │ │
│ │ (STT/TTS)│ │ (Windows/ │ │ (iframe-based │ │
│ │ │ │ Menus) │ │ HTML renderer) │ │
│ └────┬─────┘ └──────┬───────┘ └────────┬──────────┘ │
│ └───────────┬────┘ │ │
│ ▼ │ │
│ ┌─────────────────┐ │ │
│ │ OpenVoiceUI │◄───────────────┘ │
│ │ Frontend │ │
│ └────────┬────────┘ │
└──────────────────┼──────────────────────────────────────┘
▼
┌─────────────────┐
│ OpenClaw │
│ (AI Gateway) │
│ - LLM Routing │
│ - Tool Use │
│ - Sessions │
│ - Auth Profiles │
└────────┬────────┘
▼
┌─────────────────────────────┐
│ LLM Providers │
│ Anthropic │ OpenAI │ Groq │
│ Z.AI │ Local Models │
└─────────────────────────────┘
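The hop from the frontend to the gateway shown above is an ordinary HTTPS call. The sketch below is illustrative only: the endpoint path, JSON field names, and the helper `buildGatewayRequest` are assumptions for this example, not OpenClaw's documented API.

```javascript
// Hypothetical sketch of the frontend-to-gateway hop. The endpoint
// path and JSON field names are illustrative assumptions, not
// OpenClaw's documented API.
function buildGatewayRequest(sessionId, transcript) {
  return {
    url: `/api/session/${encodeURIComponent(sessionId)}/message`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    // The gateway tracks state per session, so the body only needs the
    // transcribed utterance and its input modality.
    body: JSON.stringify({ input: transcript, modality: "voice" }),
  };
}

// In the browser, the descriptor would be sent with fetch():
// const req = buildGatewayRequest(sessionId, transcript);
// const res = await fetch(req.url, req);
```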
As the diagram shows, the platform consists of three main pieces: the browser client (voice I/O, the desktop shell, and the web canvas), the OpenVoiceUI frontend that coordinates them, and the OpenClaw gateway that brokers access to LLM providers.
The codebase is approximately 36% JavaScript, 36% Python, and 22% HTML/CSS. Deployment uses Docker and Docker Compose, and setup is initiated via `npx openvoiceui setup`, which scaffolds the project structure and configuration files. Key dependencies include Node.js 18+, Docker, the OpenClaw gateway, and at least one LLM provider API key.
The canvas system is the primary visual output mechanism. When a user issues a voice command such as “build me a sales dashboard,” the LLM generates HTML, CSS, and JavaScript code, which is injected into a sandboxed iframe and rendered immediately in the browser. This allows the AI to produce working interactive artifacts – charts, data tables, forms, and complete web pages – rather than text descriptions.
Canvas pages persist within the session and can be iteratively refined through follow-up voice commands. The user speaks a modification (e.g., “add a date filter”), and the LLM regenerates or patches the canvas content accordingly.
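The injection step above can be sketched in a few lines. This is a minimal illustration, not OpenVoiceUI's actual renderer: the helper name `createCanvasFrame` and the exact sandbox flags are assumptions. The key idea is that the iframe grants `allow-scripts` (so generated charts and forms work) but withholds `allow-same-origin` (so canvas code cannot reach the host page's session state).

```javascript
// Hypothetical sketch of rendering LLM-generated HTML in a sandboxed
// iframe. `createCanvasFrame` is an assumed helper, not part of
// OpenVoiceUI's API; it returns a plain descriptor of the element.
function createCanvasFrame(generatedHtml) {
  return {
    tag: "iframe",
    attributes: {
      // allow-scripts: generated charts/forms can run JS;
      // omitting allow-same-origin keeps the frame cross-origin,
      // so canvas code cannot read the host page's cookies or DOM.
      sandbox: "allow-scripts allow-forms",
      // srcdoc injects the markup directly, with no network round trip.
      srcdoc: generatedHtml,
    },
  };
}

// In the browser the descriptor would be applied like:
// const frame = document.createElement("iframe");
// frame.setAttribute("sandbox", spec.attributes.sandbox);
// frame.srcdoc = spec.attributes.srcdoc;
// canvasContainer.replaceChildren(frame);
```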
“Vibe brainstorming” is the project's term for the workflow pattern enabled by the combination of voice input and canvas output. The user speaks ideas in natural language, and the system responds with visual artifacts rather than text. This reduces the feedback loop from the traditional cycle of design, implement, review, and revise to a conversational iteration measured in seconds. The concept is similar to the rapid prototyping workflows found in tools like Bolt.new and v0 but uses voice as the primary input modality.
OpenClaw's session management layer maintains conversation state across interactions. Context windowing ensures that long conversations remain within LLM token limits while preserving relevant history. Canvas artifacts, agent profile selections, and conversation threads persist within a session, allowing cumulative refinement over time.
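Context windowing of the kind described can be approximated with a simple budget walk over the message history. The sketch below is an assumption about the technique in general, not OpenClaw's implementation; the 4-characters-per-token heuristic and the convention that the first message is the system prompt are both illustrative.

```javascript
// Rough token estimate: ~4 characters per token (heuristic assumption).
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Hypothetical context windowing: always keep the system prompt
// (assumed to be messages[0]), then walk backwards from the newest
// message, keeping whatever still fits in the token budget.
function windowContext(messages, maxTokens) {
  const [system, ...rest] = messages;
  let budget = maxTokens - estimateTokens(system.content);
  const kept = [];
  for (let i = rest.length - 1; i >= 0; i--) {
    const cost = estimateTokens(rest[i].content);
    if (cost > budget) break; // older messages are dropped first
    budget -= cost;
    kept.unshift(rest[i]); // unshift preserves chronological order
  }
  return [system, ...kept];
}
```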
The platform supports multiple specialized agents that can be invoked for different tasks within a single session. OpenClaw coordinates routing between agents, each with domain-specific system prompts and tool access. For example, one agent profile might specialize in data visualization while another handles copywriting. The user can switch between profiles or allow the system to route based on the request type. This pattern aligns with broader agent orchestration approaches in multi-agent systems.
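Request-type routing between profiles can be as simple as keyword matching, as in this hedged sketch. The profile names, keyword lists, and `routeRequest` helper are all illustrative assumptions; OpenClaw's actual router may use the LLM itself to classify requests.

```javascript
// Hypothetical agent-profile routing table; names and keywords are
// illustrative, not OpenClaw's actual configuration.
const profiles = [
  { name: "visualizer", keywords: ["chart", "dashboard", "graph", "plot"] },
  { name: "copywriter", keywords: ["headline", "tagline", "copy", "blog"] },
];

// Route an utterance to the first profile whose keywords match,
// falling back to a general-purpose profile otherwise.
function routeRequest(utterance, fallback = "generalist") {
  const text = utterance.toLowerCase();
  const match = profiles.find((p) =>
    p.keywords.some((kw) => text.includes(kw))
  );
  return match ? match.name : fallback;
}
```

A real router would also carry each profile's system prompt and tool allowlist; keyword matching is merely the smallest routing policy that demonstrates the pattern.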
| Tool | Primary Input | Output Type | LLM Flexibility | Self-Hosted |
|---|---|---|---|---|
| OpenVoiceUI | Voice + text | Live HTML canvas + speech | Any (via OpenClaw) | Yes (Docker) |
| Bolt.new | Text | Full-stack web apps | Limited | No |
| v0 (Vercel) | Text | React components | Limited | No |
| Cursor | Text + code context | Code edits | Multiple models | Desktop app |
| OpenVoiceChat | Voice | Voice responses | Multiple models | Yes |
OpenVoiceUI differs from text-based generative coding tools in that voice is the primary input modality and the output includes rendered visual artifacts rather than source code files. It differs from other voice agent platforms in that it includes a visual canvas system rather than being limited to audio-only interaction.
```shell
# Scaffold the project
npx openvoiceui setup

# Configure API keys in the generated .env file,
# then launch with Docker Compose
docker compose up -d
```
The system requires at least one LLM API key (Groq offers a free tier); Node.js 18+ and Docker are prerequisites. For production use, deployment to a VPS with SSL is recommended: browsers restrict microphone access to secure (HTTPS) contexts, and a persistent host keeps sessions available.