VIP is an open standard for voice-first applications. Like the DOM for visual interfaces, it gives every app a standard voice layer that works with any device, model, or platform.
Visual interfaces have the DOM and ARIA. Voice has nothing. Every implementation today is proprietary, brittle, and model-locked.
VIP is designed to outlast any specific model or platform. As AI capabilities improve, the protocol adapts so that applications do not have to.
Swap between OpenAI, Gemini, or a local on-device model with zero changes to application logic.
Identical protocol behaviour on web, iOS, Android, desktop, and embedded devices.
Any device with a mic, speaker, and network connection is a valid VIP client.
The AI can only invoke actions declared in the Action Registry. No hallucinated actions.
A strict Finite State Machine keeps the client and runtime synchronized through every turn. Both sides must agree on state at all times.
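The strict state machine described above can be illustrated with a small transition table. This is a minimal sketch, not the normative VIP transition matrix: only Idle and Listening are named in this overview, so the Processing, Speaking, and Executing states and all event names are assumptions made for illustration.

```typescript
// Hypothetical state and event names. Only "Idle" and "Listening" appear in
// the overview text; the rest are assumptions for this sketch.
type VipState = "Idle" | "Listening" | "Processing" | "Speaking" | "Executing";
type VipEvent =
  | "speech_start" // VAD or push-to-talk fired
  | "speech_end"   // the user stopped talking
  | "reply"        // the model chose to answer with speech
  | "invoke"       // the model chose a registered action
  | "done";        // playback or action execution finished

// Transition table: any (state, event) pair not listed here is illegal.
const TRANSITIONS: Record<VipState, Partial<Record<VipEvent, VipState>>> = {
  Idle: { speech_start: "Listening" },
  Listening: { speech_end: "Processing" },
  Processing: { reply: "Speaking", invoke: "Executing" },
  Speaking: { done: "Idle" },
  Executing: { done: "Idle" },
};

class VipStateMachine {
  private current: VipState = "Idle";

  get state(): VipState {
    return this.current;
  }

  // Applies an event, or throws so that a desync is surfaced instead of ignored.
  dispatch(event: VipEvent): VipState {
    const next = TRANSITIONS[this.current][event];
    if (next === undefined) {
      throw new Error(`Illegal transition: ${this.current} + ${event}`);
    }
    this.current = next;
    return next;
  }
}
```

If the client and the runtime each hold an instance and apply the same ordered event stream, they reach the same state after every event, which is one simple way to keep both sides in agreement.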
The client authenticates via the VIP Server. The Voice Interaction Tree (the current view and its available actions) is then transmitted to the runtime.
Voice activity detection (VAD) or push-to-talk activates the Listening state. Audio streams to the model provider, either directly or via the VIP Server.
The model transcribes the audio and infers intent against the Action Registry. It then chooses one of two outcomes: reply with speech or invoke an action.
The client executes the action or plays synthesized audio, returns a result to the runtime, then transitions back to Idle.
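The last two steps of the turn above can be sketched as a client-side handler. It takes the model's decision, rejects any action that is not declared in the current Voice Interaction Tree, and returns a result for the runtime. Every type, field, and function name here (`VoiceInteractionTree`, `Decision`, `handleDecision`, `remove_item`, and so on) is a hypothetical stand-in, since the message format is not yet specified.

```typescript
// All names below are illustrative assumptions, not VIP definitions.
interface ActionSpec {
  name: string;
  params: string[]; // names of required parameters
}

// Voice Interaction Tree: the current view plus the actions available in it.
interface VoiceInteractionTree {
  view: string;
  actions: ActionSpec[];
}

// The model's choice for a turn: reply with speech, or invoke an action.
type Decision =
  | { kind: "reply"; text: string }
  | { kind: "invoke"; action: string; args: Record<string, unknown> };

type TurnResult = { ok: true; output: string } | { ok: false; error: string };
type Handler = (args: Record<string, unknown>) => string;

function handleDecision(
  tree: VoiceInteractionTree,
  handlers: Record<string, Handler>,
  decision: Decision,
): TurnResult {
  if (decision.kind === "reply") {
    // A real client would play synthesized audio here.
    return { ok: true, output: `spoken: ${decision.text}` };
  }
  // Only actions declared for the current view may run.
  const spec = tree.actions.find((a) => a.name === decision.action);
  if (spec === undefined) {
    return { ok: false, error: `undeclared action: ${decision.action}` };
  }
  const missing = spec.params.filter((p) => !(p in decision.args));
  if (missing.length > 0) {
    return { ok: false, error: `missing params: ${missing.join(", ")}` };
  }
  const handler = handlers[decision.action];
  if (handler === undefined) {
    return { ok: false, error: `no handler for: ${decision.action}` };
  }
  return { ok: true, output: handler(decision.args) };
}

// Example: a cart view that declares a single action.
const cartTree: VoiceInteractionTree = {
  view: "cart",
  actions: [{ name: "remove_item", params: ["itemId"] }],
};
const cartHandlers: Record<string, Handler> = {
  remove_item: (args) => `removed ${String(args["itemId"])}`,
};
```

With this setup, a decision to invoke `delete_account` is returned to the runtime as an error rather than executed, because `cartTree` never declared that action.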
VIP separates the control plane from the media plane, giving implementations flexibility to optimize for latency or orchestration.
The VIP Server handles authentication and issues a short-lived ephemeral token. The client then streams audio directly to the model provider, minimizing latency. Ideal for production applications.
The VIP Server acts as a full proxy and orchestrator. The client streams everything through it, enabling server-side model chaining and provider swaps with zero client changes.
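The difference between the two modes above comes down to where the client sends audio and which credential it presents. The sketch below shows one way a client might make that choice. The mode names `direct` and `proxied`, the `SessionGrant` shape, and `resolveMediaRoute` are all assumptions made for illustration, and the example URLs are placeholders.

```typescript
// Mode names and all field names here are assumptions for illustration.
type MediaMode = "direct" | "proxied";

interface SessionGrant {
  mode: MediaMode;
  serverUrl: string;       // VIP Server endpoint (control plane)
  providerUrl?: string;    // model provider endpoint, direct mode only
  ephemeralToken?: string; // short-lived credential issued by the VIP Server
  tokenExpiresAt?: number; // expiry as epoch milliseconds
}

interface MediaRoute {
  audioUrl: string;
  credential: string | null;
}

// Decides where audio goes. The control plane always stays with the VIP Server.
function resolveMediaRoute(grant: SessionGrant, now: number): MediaRoute {
  if (grant.mode === "proxied") {
    // Everything flows through the server; no provider credential on the client.
    return { audioUrl: grant.serverUrl, credential: null };
  }
  if (
    !grant.providerUrl ||
    !grant.ephemeralToken ||
    grant.tokenExpiresAt === undefined
  ) {
    throw new Error("direct mode requires providerUrl, ephemeralToken, and tokenExpiresAt");
  }
  if (now >= grant.tokenExpiresAt) {
    throw new Error("ephemeral token expired; request a new one from the VIP Server");
  }
  return { audioUrl: grant.providerUrl, credential: grant.ephemeralToken };
}

// Example grants with placeholder URLs.
const proxiedGrant: SessionGrant = {
  mode: "proxied",
  serverUrl: "wss://vip.example.com/session",
};
const directGrant: SessionGrant = {
  mode: "direct",
  serverUrl: "wss://vip.example.com/session",
  providerUrl: "wss://provider.example.com/audio",
  ephemeralToken: "tok_123",
  tokenExpiresAt: 60_000,
};
```

Keeping the routing decision in one function means application code never needs to know which mode is active, which is the property that lets a deployment change providers without client changes in the proxied case.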
Browser, mobile app, or IoT device
Session gateway and policy enforcer
STT, LLM, TTS (any vendor)
Identity provider for session auth
Seven sections establish the full interaction model. Message format, transport, and authentication specs are next.
Problem statement, purpose, and audience
Boundaries, non-goals, design objectives
Normative glossary of all protocol terms
Architecture, modes, and component roles
FSM, turn-taking, and action invocation
Navigation, button, input, confirmation
Full transition matrix and interruption rules
JSON schemas, event payloads, error codes
WebSocket events and WebRTC channel setup
Token exchange and ephemeral credential lifecycle
VIP is an open effort. Read the draft, share your feedback, and help shape the standard for voice-native applications.