OpenVoiceUI

OpenVoiceUI is an open-source, browser-based voice agent platform that combines speech-to-text input, large language model reasoning, text-to-speech output, and a live web canvas rendering system into a unified interface. Built on top of the OpenClaw AI gateway, it allows users to interact with any supported LLM provider through voice and receive both spoken responses and dynamically generated visual artifacts such as dashboards, reports, and interactive web pages. The project is MIT-licensed and deployed via Docker Compose.

Architecture

OpenVoiceUI follows a layered architecture that separates the interface layer from the intelligence layer. OpenClaw serves as the backend gateway handling LLM routing, tool orchestration, session management, and authentication, while OpenVoiceUI provides the browser-based frontend for voice interaction and visual rendering.

System Overview

┌─────────────────────────────────────────────────────────┐
│                     Browser Client                      │
│  ┌──────────┐  ┌──────────────┐  ┌───────────────────┐  │
│  │ Voice I/O│  │ Desktop Shell│  │    Web Canvas     │  │
│  │ (STT/TTS)│  │ (Windows/    │  │  (iframe-based    │  │
│  │          │  │  Menus)      │  │   HTML renderer)  │  │
│  └────┬─────┘  └──────┬───────┘  └─────────┬─────────┘  │
│       └──────────┬────┘                    │            │
│                  ▼                         │            │
│         ┌─────────────────┐                │            │
│         │  OpenVoiceUI    │◄───────────────┘            │
│         │  Frontend       │                             │
│         └────────┬────────┘                             │
└──────────────────┼──────────────────────────────────────┘
                   ▼
          ┌─────────────────┐
          │     OpenClaw    │
          │   (AI Gateway)  │
          │  - LLM Routing  │
          │  - Tool Use     │
          │  - Sessions     │
          │  - Auth Profiles│
          └────────┬────────┘
                   ▼
    ┌─────────────────────────────┐
    │        LLM Providers        │
    │  Anthropic │ OpenAI │ Groq  │
    │  Z.AI │ Local Models        │
    └─────────────────────────────┘

Component Breakdown

The platform consists of the following components:

  • Voice I/O – Browser-based speech-to-text supporting push-to-talk, wake word activation, and continuous listening modes. TTS output supports multiple engines including a bundled local option and voice cloning via Qwen3-TTS.
  • OpenClaw Gateway – Handles LLM provider routing (Anthropic, OpenAI, Groq, Z.AI, local models), API key management, tool execution, agent orchestration, and session context windowing.
  • Web Canvas – A fullscreen iframe-based display system where the LLM generates complete HTML, CSS, and JavaScript artifacts during conversation. These render in real time within the browser.
  • Desktop Shell – A desktop-style interface layer providing windows, folders, right-click context menus, and wallpaper customization around the canvas and voice components.
  • Agent Profiles – Configurable AI persona definitions that can be hot-swapped during a session, each with distinct system prompts, model preferences, and behavioral parameters.
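
The profile hot-swap mechanism can be sketched as follows. This is a minimal illustration: the AgentProfile fields and the registry API are assumptions, not OpenVoiceUI's actual schema.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class AgentProfile:
    """Illustrative profile schema; field names are assumptions."""
    name: str
    system_prompt: str
    model: str
    temperature: float = 0.7

class ProfileRegistry:
    """Holds named profiles and tracks which one is active."""
    def __init__(self) -> None:
        self._profiles: Dict[str, AgentProfile] = {}
        self.active: Optional[AgentProfile] = None

    def register(self, profile: AgentProfile) -> None:
        self._profiles[profile.name] = profile

    def switch(self, name: str) -> AgentProfile:
        # Hot-swap: only the active pointer changes, so session
        # state held elsewhere survives the profile change.
        self.active = self._profiles[name]
        return self.active
```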

Tech Stack

The codebase is approximately 36% JavaScript, 36% Python, and 22% HTML/CSS. Deployment uses Docker and Docker Compose. The setup process is initiated via npx openvoiceui setup, which scaffolds the project structure and configuration files.

Key dependencies and integrations include:

  • Backend: Python (Flask), Node.js 18+
  • Containerization: Docker, Docker Compose
  • LLM Gateway: OpenClaw (supports Anthropic, OpenAI, Groq, Z.AI, and local models)
  • Speech-to-Text: Browser-native Web Speech API
  • Text-to-Speech: Multiple provider support, Qwen3-TTS for voice cloning
  • Image Generation: FLUX.1, Stable Diffusion 3.5
  • Music Generation: Suno API integration
  • Hosting: Designed for VPS deployment (Hetzner recommended); local execution is possible but discouraged, since browsers require a secure (HTTPS) context to grant microphone access

Key Concepts

Live Web Canvas

The canvas system is the primary visual output mechanism. When a user issues a voice command such as “build me a sales dashboard,” the LLM generates HTML, CSS, and JavaScript code, which is injected into a sandboxed iframe and rendered immediately in the browser. This allows the AI to produce working interactive artifacts – charts, data tables, forms, and complete web pages – rather than text descriptions.
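
The injection step can be sketched server-side, assuming the host page embeds each artifact through an iframe srcdoc attribute. The function name is hypothetical; the sandbox attribute values reflect standard browser behavior.

```python
import html

def wrap_in_sandboxed_iframe(artifact_html: str) -> str:
    """Embed LLM-generated HTML in a sandboxed iframe via srcdoc.

    `allow-scripts` lets the artifact's own JavaScript run, while
    omitting `allow-same-origin` denies it access to the host page's
    DOM, cookies, and storage.
    """
    escaped = html.escape(artifact_html, quote=True)
    return (
        '<iframe sandbox="allow-scripts" '
        f'srcdoc="{escaped}" '
        'style="width:100%;height:100%;border:0"></iframe>'
    )
```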

Canvas pages persist within the session and can be iteratively refined through follow-up voice commands. The user speaks a modification (e.g., “add a date filter”), and the LLM regenerates or patches the canvas content accordingly.
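
One way to model that refine loop, as a sketch only: the CanvasSession class and the llm_generate callback are illustrative, not the project's API.

```python
from typing import Callable, List

class CanvasSession:
    """Each refinement appends a new version, so earlier artifacts
    remain recoverable within the session."""
    def __init__(self) -> None:
        self.versions: List[str] = []

    def render(self, artifact: str) -> int:
        self.versions.append(artifact)
        return len(self.versions) - 1   # index of the version just rendered

    def refine(self, llm_generate: Callable[[str, str], str],
               instruction: str) -> int:
        # Feed the current artifact plus the spoken instruction back
        # to the LLM; render whatever it returns as the next version.
        current = self.versions[-1] if self.versions else ""
        return self.render(llm_generate(current, instruction))
```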

Vibe Brainstorming

“Vibe brainstorming” is the project's term for the workflow pattern enabled by the combination of voice input and canvas output. The user speaks ideas in natural language, and the system responds with visual artifacts rather than text. This reduces the feedback loop from the traditional cycle of design, implement, review, and revise to a conversational iteration measured in seconds. The concept is similar to the rapid prototyping workflows found in tools like Bolt.new and v0 but uses voice as the primary input modality.

Persistent Session Context

OpenClaw's session management layer maintains conversation state across interactions. Context windowing ensures that long conversations remain within LLM token limits while preserving relevant history. Canvas artifacts, agent profile selections, and conversation threads persist within a session, allowing cumulative refinement over time.
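
The windowing step might look like this. A rough sketch: token counts are approximated at ~4 characters per token, where a real gateway would use the provider's tokenizer.

```python
def window_context(messages, max_tokens):
    """Keep the system prompt plus the newest turns that fit the budget.

    `messages` is a list of {"role": ..., "content": ...} dicts.
    """
    def count_tokens(msg):
        return max(1, len(msg["content"]) // 4)   # crude approximation

    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(count_tokens(m) for m in system)
    kept = []
    for msg in reversed(rest):        # walk newest -> oldest
        cost = count_tokens(msg)
        if cost > budget:
            break
        kept.append(msg)
        budget -= cost
    return system + list(reversed(kept))
```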

Agent Orchestration

The platform supports multiple specialized agents that can be invoked for different tasks within a single session. OpenClaw coordinates routing between agents, each with domain-specific system prompts and tool access. For example, one agent profile might specialize in data visualization while another handles copywriting. The user can switch between profiles or allow the system to route based on the request type. This pattern aligns with broader agent orchestration approaches in multi-agent systems.
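
A toy version of request-type routing, using keyword matching purely for illustration: the profile names and trigger terms are invented, and a production router would more likely use an LLM classifier.

```python
def route_request(text, profiles, default):
    """Pick the profile whose trigger keywords best match the request."""
    lowered = text.lower()
    scores = {
        name: sum(1 for kw in keywords if kw in lowered)
        for name, keywords in profiles.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default

# Hypothetical profile-to-keyword mapping
profiles = {
    "data-viz": ["chart", "dashboard", "graph", "plot"],
    "copywriting": ["headline", "tagline", "blog", "copy"],
}
```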

Comparison

Tool           Primary Input        Output Type                LLM Flexibility     Self-Hosted
OpenVoiceUI    Voice + text         Live HTML canvas + speech  Any (via OpenClaw)  Yes (Docker)
Bolt.new       Text                 Full-stack web apps        Limited             No
v0 (Vercel)    Text                 React components           Limited             No
Cursor         Text + code context  Code edits                 Multiple models     Desktop app
OpenVoiceChat  Voice                Voice responses            Multiple models     Yes

OpenVoiceUI differs from text-based generative coding tools in that voice is the primary input modality and the output includes rendered visual artifacts rather than source code files. It differs from other voice agent platforms in that it includes a visual canvas system rather than being limited to audio-only interaction.

Use Cases

  • Rapid prototyping – Generating interactive dashboard mockups, landing pages, or form interfaces through voice commands without writing code.
  • Business intelligence – Creating ad-hoc data visualizations and reports during meetings or planning sessions.
  • Accessible development – Enabling non-technical users to produce working web interfaces through natural language.
  • Multi-modal agent interaction – Combining voice control with visual output for tasks that benefit from both modalities, such as design iteration or workflow visualization.

Installation

# Scaffold the project
npx openvoiceui setup
 
# Configure API keys in the generated .env file
# Then launch with Docker Compose
docker compose up -d

The system requires at least one LLM API key (Groq offers a free tier). Node.js 18+ and Docker are prerequisites. For production use, deployment to a VPS with SSL is recommended for reliable microphone access and persistent uptime.
