v0.1 Draft Specification

Voice Interaction Protocol

An open standard for voice-first applications. Like the DOM for visual interfaces, VIP gives every app a standard voice layer that works across any device, model, or platform.


The Problem

There is no standard for voice

Visual interfaces have the DOM and ARIA. Voice has nothing. Every implementation today is proprietary, brittle, and model-locked.

Without VIP

Apps are tightly coupled to a specific AI provider
Swapping models breaks the entire voice layer
AI can invoke actions the application never intended
Turn-taking, barge-in, and latency require bespoke logic
No shared vocabulary between the app and the voice agent
IoT and constrained devices are left out entirely

With VIP

Swap any model or provider with no frontend changes
Works on browsers, mobile apps, IoT, and feature phones
AI is bounded to actions the app explicitly declared
Turn-taking, barge-in, and states are standardized
The Voice Interaction Tree gives agents full app context
One protocol for any device with a mic and a speaker

Design Principles

Built to evolve

VIP is designed to outlast any specific model or platform. As AI capabilities improve, the protocol adapts so that applications do not have to.

Model-Agnostic

Swap OpenAI, Gemini, or a local on-device model with zero changes to application logic.

Platform-Agnostic

Identical protocol behaviour on web, iOS, Android, desktop, and embedded devices.

Device-Agnostic

Any device with a mic, speaker, and network connection is a valid VIP client.

Safety-First

The AI can only invoke actions declared in the Action Registry. No hallucinated actions.
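
To make the safety boundary concrete, here is a minimal sketch of how a client might enforce it. The `ActionRegistry` class, its `declare` and `invoke` methods, and the `add_to_cart` action are hypothetical names chosen for illustration; the draft does not define an API for the registry.

```python
from typing import Any, Callable, Dict


class UndeclaredActionError(Exception):
    """Raised when the model requests an action the app never declared."""


class ActionRegistry:
    """Hypothetical registry mapping declared action names to app handlers."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[..., Any]] = {}

    def declare(self, name: str, handler: Callable[..., Any]) -> None:
        # Only actions registered here can ever be invoked by the model.
        self._handlers[name] = handler

    def invoke(self, name: str, **params: Any) -> Any:
        # Anything not explicitly declared is rejected rather than guessed at.
        if name not in self._handlers:
            raise UndeclaredActionError(name)
        return self._handlers[name](**params)


registry = ActionRegistry()
registry.declare("add_to_cart", lambda item_id: f"added {item_id}")
```

With this shape, `registry.invoke("add_to_cart", item_id="sku-1")` runs the app's handler, while a request for an undeclared name such as `delete_account` raises `UndeclaredActionError` before any application code executes.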

How It Works

The interaction loop

A strict finite state machine (FSM) keeps the client and runtime synchronized through every turn. Both sides must agree on the current state at all times.

Initialize Session

The client authenticates via the VIP Server. The Voice Interaction Tree (the current view and its available actions) is then transmitted to the runtime.
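
As a rough illustration of what "current view and available actions" could look like on the wire, the sketch below serializes a tiny tree to JSON. Everything about the shape is an assumption: the message format is still planned (§8), so the `session.init` type, the field names, and the `confirm_order` action are placeholders, not spec-defined values.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Action:
    name: str
    description: str


@dataclass
class VoiceInteractionTree:
    view: str  # the screen or context the user is currently in
    actions: List[Action] = field(default_factory=list)

    def to_message(self) -> str:
        # "session.init" is an illustrative message type, not one from the spec.
        return json.dumps({"type": "session.init", "tree": asdict(self)})


tree = VoiceInteractionTree(
    view="checkout",
    actions=[Action("confirm_order", "Place the order currently in the cart")],
)
message = tree.to_message()
```

The point of the sketch is only that the runtime receives the view and its declared actions together, so the model reasons over what the app can do right now.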

User Speaks

Voice activity detection (VAD) or push-to-talk activates the Listening state. Audio streams to the model provider, either directly or via the VIP Server.

Model Reasons

The model transcribes the audio and infers intent against the Action Registry. It then chooses one of two outcomes: reply with speech or invoke an action.

App Responds

The client executes the action or plays synthesized audio, returns a result to the runtime, then transitions back to Idle.

Idle: waiting
Listening: input
Processing: thinking
Speaking: output
Action: execute
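
The five states above can be sketched as a small transition table. This is a minimal illustration, not the normative matrix: the full set of transitions is the subject of §7, and the edges below (including the Speaking to Listening edge used to model barge-in) are assumptions inferred from the loop described on this page. `State`, `ALLOWED`, and `Session` are hypothetical names.

```python
from enum import Enum


class State(Enum):
    IDLE = "Idle"
    LISTENING = "Listening"
    PROCESSING = "Processing"
    SPEAKING = "Speaking"
    ACTION = "Action"


# Assumed transitions; the normative matrix belongs to §7 of the spec.
ALLOWED = {
    State.IDLE: {State.LISTENING},
    State.LISTENING: {State.PROCESSING},
    State.PROCESSING: {State.SPEAKING, State.ACTION},
    State.SPEAKING: {State.IDLE, State.LISTENING},  # Listening = assumed barge-in
    State.ACTION: {State.IDLE},
}


class Session:
    """Tracks one side's view of the state and rejects illegal moves."""

    def __init__(self) -> None:
        self.state = State.IDLE

    def transition(self, target: State) -> None:
        if target not in ALLOWED[self.state]:
            raise ValueError(
                f"illegal transition {self.state.value} -> {target.value}"
            )
        self.state = target
```

If the client and the runtime each hold a `Session` and apply the same events, a rejected transition on either side is an early signal that the two have fallen out of sync.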

Architecture

Two modes, one protocol

VIP separates the control plane from the media plane, giving implementations flexibility to optimize for latency or orchestration.

Internal Mode (Recommended)

The VIP Server handles authentication and issues a short-lived token. The client then streams audio directly to the model provider, minimizing latency. Ideal for production applications.
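
A short-lived token can be pictured as a random value plus an expiry time. The sketch below is one possible shape under stated assumptions: the authentication flow is still planned (§10), so `EphemeralToken`, `issue_token`, and the 60-second lifetime are illustrative choices rather than anything the draft prescribes.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class EphemeralToken:
    value: str
    expires_at: float  # seconds since the epoch

    def is_valid(self, now: float) -> bool:
        # The token is usable only strictly before its expiry time.
        return now < self.expires_at


def issue_token(now: float, ttl_seconds: float = 60.0) -> EphemeralToken:
    # ttl_seconds=60 is an illustrative default, not a value from the spec.
    return EphemeralToken(secrets.token_urlsafe(32), now + ttl_seconds)


token = issue_token(now=1_000.0)
```

Passing `now` explicitly keeps the check deterministic: the token above is valid at `now=1_030.0` and rejected from `now=1_060.0` onward.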

External Mode (Flexible)

The VIP Server acts as a full proxy and orchestrator. The client streams everything through it, enabling server-side model chaining and provider swaps with zero client changes.
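
The difference between the two modes comes down to where the client sends its audio. The sketch below captures that reading of the descriptions above; `Mode`, `plan_connections`, and the `vip_server` and `model_provider` labels are hypothetical names, and treating the VIP Server as the control-plane peer in both modes is an interpretation rather than a quoted rule.

```python
from enum import Enum
from typing import Dict


class Mode(Enum):
    INTERNAL = "internal"
    EXTERNAL = "external"


def plan_connections(mode: Mode) -> Dict[str, str]:
    """Return which peer the client talks to on each plane."""
    if mode is Mode.INTERNAL:
        # The server handles authentication; audio goes straight to the provider.
        return {"control": "vip_server", "media": "model_provider"}
    # The server proxies everything, so the client only ever sees the server.
    return {"control": "vip_server", "media": "vip_server"}
```

Because only the `media` peer changes between modes, an application could in principle switch modes without touching its action handlers.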

User Agent

Browser, mobile app, or IoT device

VIP Server

Session gateway and policy enforcer

Model Provider

STT, LLM, TTS (any vendor)

App Backend

Identity provider for session auth

Specification Status

What is in the v0.1 Draft

Seven sections establish the full interaction model. The message format, transport, and authentication specs are next.

§1 Introduction (Draft): Problem statement, purpose, and audience
§2 Scope and Goals (Draft): Boundaries, non-goals, design objectives
§3 Terminology (Draft): Normative glossary of all protocol terms
§4 Protocol Overview (Draft): Architecture, modes, and component roles
§5 Core Interaction Model (Draft): FSM, turn-taking, and action invocation
§6 Interaction Primitives (Draft): Navigation, button, input, confirmation
§7 State and Flow Management (Draft): Full transition matrix and interruption rules
§8 Message Format (Planned): JSON schemas, event payloads, error codes
§9 Transport Layer (Planned): WebSocket events and WebRTC channel setup
§10 Authentication Flow (Planned): Token exchange and ephemeral credential lifecycle

Start with the specification

VIP is an open effort. Read the draft, share your feedback, and help shape the standard for voice-native applications.
