v0.1 Draft Specification
Voice Interaction Protocol

An open standard for voice-first applications. Just as the DOM gives visual interfaces a standard structure, VIP gives every app a standard voice layer that works across any device, model, or platform.

There is no standard for voice

Visual interfaces have the DOM and ARIA. Voice has nothing. Every implementation today is proprietary, brittle, and model-locked.

Without VIP

  • Apps are tightly coupled to a specific AI provider
  • Swapping models breaks the entire voice layer
  • AI can invoke actions the application never intended
  • Turn-taking, barge-in, and latency require bespoke logic
  • No shared vocabulary between the app and the voice agent
  • IoT and constrained devices are left out entirely

With VIP

  • Swap any model or provider with no frontend changes
  • Works on browsers, mobile apps, IoT, and feature phones
  • AI is bounded to actions the app explicitly declared
  • Turn-taking, barge-in, and states are standardized
  • The Voice Interaction Tree gives agents full app context
  • One protocol for any device with a mic and a speaker

Built to evolve

VIP is designed to outlast any specific model or platform. As AI capabilities improve, the protocol adapts so that applications do not have to change.

Model-Agnostic

Swap OpenAI, Gemini, or a local on-device model with zero changes to application logic.

Platform-Agnostic

Identical protocol behaviour on web, iOS, Android, desktop, and embedded devices.

Device-Agnostic

Any device with a mic, speaker, and network connection is a valid VIP client.

Safety-First

The AI can only invoke actions declared in the Action Registry. No hallucinated actions.
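
The bounding rule above can be illustrated with a minimal sketch. The class and method names here (`ActionRegistry`, `declare`, `invoke`, `UndeclaredActionError`) are hypothetical and are not taken from the VIP draft; the sketch only shows the idea that a request for an undeclared action is rejected before any application code runs.

```python
from typing import Callable, Dict


class UndeclaredActionError(Exception):
    """Raised when the model requests an action the app never declared."""


class ActionRegistry:
    """Hypothetical registry: maps declared action names to app handlers."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[..., object]] = {}

    def declare(self, name: str, handler: Callable[..., object]) -> None:
        # The application opts in to each action explicitly.
        self._handlers[name] = handler

    def invoke(self, name: str, **params: object) -> object:
        # Anything the model asks for that was not declared is refused.
        if name not in self._handlers:
            raise UndeclaredActionError(name)
        return self._handlers[name](**params)


registry = ActionRegistry()
registry.declare("open_cart", lambda: "cart opened")
```

With this shape, `registry.invoke("open_cart")` runs the declared handler, while `registry.invoke("delete_account")` raises `UndeclaredActionError` because the application never declared it.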

The interaction loop

A strict Finite State Machine keeps the client and runtime synchronized through every turn. Both sides must agree on state at all times.

01

Initialize Session

The client authenticates via the VIP Server. The Voice Interaction Tree (the current view and its available actions) is then transmitted to the runtime.

02

User Speaks

Voice activity detection (VAD) or push-to-talk activates the Listening state. Audio streams to the model provider, directly or via the VIP Server.

03

Model Reasons

The model transcribes the audio and infers intent against the Action Registry. It then selects one of two outcomes: reply with speech or invoke an action.

04

App Responds

The client executes the action or plays synthesized audio, returns a result to the runtime, then transitions back to Idle.

Idle (Waiting) → Listening (Input) → Processing (Thinking) → Speaking (Output) or Action (Execute) → Idle
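
The four steps above can be sketched as a small state machine. The transition table below is an assumption inferred only from those steps; the normative transition matrix and interruption rules (including barge-in) belong to §7 and are not reproduced here. `State`, `ALLOWED`, and `InteractionLoop` are illustrative names.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    LISTENING = "listening"
    PROCESSING = "processing"
    SPEAKING = "speaking"
    ACTION = "action"


# Assumed transitions, derived from the four steps on this page only.
ALLOWED = {
    State.IDLE: {State.LISTENING},                    # user starts speaking
    State.LISTENING: {State.PROCESSING},              # audio handed to the model
    State.PROCESSING: {State.SPEAKING, State.ACTION}, # reply or invoke
    State.SPEAKING: {State.IDLE},                     # playback finished
    State.ACTION: {State.IDLE},                       # result returned
}


class InteractionLoop:
    """Tracks the current state and refuses transitions not in ALLOWED."""

    def __init__(self) -> None:
        self.state = State.IDLE

    def transition(self, target: State) -> State:
        if target not in ALLOWED[self.state]:
            raise ValueError(
                f"illegal transition: {self.state.name} -> {target.name}"
            )
        self.state = target
        return self.state
```

If the client and the runtime each ran a table like this, a message implying an illegal transition could be detected on either side, which is one way to read the requirement that both sides agree on state.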

Two modes, one protocol

VIP separates the control plane from the media plane, giving implementations flexibility to optimize for latency or orchestration.

Internal Mode (Recommended)

The VIP Server handles authentication and issues a short-lived ephemeral token. The client then streams audio directly to the model provider, minimizing latency. Ideal for production applications.

External Mode (Flexible)

The VIP Server acts as a full proxy and orchestrator. The client streams everything through it, enabling server-side model chaining and provider swaps with zero client changes.
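
As a rough illustration of the two modes, the sketch below issues a short-lived token and picks the audio destination by mode. Everything specific here is an assumption: the 60-second lifetime, the mode strings, the function names, and the example URLs are hypothetical, since the token exchange (§10) and transport (§9) sections are still planned.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EphemeralToken:
    value: str
    expires_at: float  # Unix timestamp in seconds

    def is_valid(self, now: Optional[float] = None) -> bool:
        current = time.time() if now is None else now
        return current < self.expires_at


def issue_token(ttl_seconds: float = 60.0, now: Optional[float] = None) -> EphemeralToken:
    # The 60 s default is an arbitrary placeholder, not a value from the draft.
    issued = time.time() if now is None else now
    return EphemeralToken(secrets.token_urlsafe(32), issued + ttl_seconds)


def audio_endpoint(mode: str, provider_url: str, server_url: str) -> str:
    # Internal Mode: the client streams audio straight to the model provider.
    if mode == "internal":
        return provider_url
    # External Mode: the client streams everything through the VIP Server.
    if mode == "external":
        return server_url
    raise ValueError(f"unknown mode: {mode}")
```

In Internal Mode the client would present the token to the provider directly; in External Mode the token never needs to leave the VIP Server, which is what lets providers be swapped without client changes.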

User Agent

Browser, mobile app, or IoT device

VIP Server

Session gateway and policy enforcer

Model Provider

STT, LLM, TTS (any vendor)

App Backend

Identity provider for session auth

What is in the v0.1 Draft

Seven sections establish the full interaction model. Message format, transport, and authentication specs are next.

§1 Introduction (Draft)

Problem statement, purpose, and audience

§2 Scope and Goals (Draft)

Boundaries, non-goals, design objectives

§3 Terminology (Draft)

Normative glossary of all protocol terms

§4 Protocol Overview (Draft)

Architecture, modes, and component roles

§5 Core Interaction Model (Draft)

FSM, turn-taking, and action invocation

§6 Interaction Primitives (Draft)

Navigation, button, input, confirmation

§7 State and Flow Management (Draft)

Full transition matrix and interruption rules

§8 Message Format (Planned)

JSON schemas, event payloads, error codes

§9 Transport Layer (Planned)

WebSocket events and WebRTC channel setup

§10 Authentication Flow (Planned)

Token exchange and ephemeral credential lifecycle

Start with the specification

VIP is an open effort. Read the draft, share your feedback, and help shape the standard for voice-native applications.