3. Terminology and Definitions
This section defines key terms and concepts used throughout the Voice Interaction Protocol (VIP) specification. The definitions provided here are normative and strictly govern the interpretation of this document.
3.1 Roles and Entities
User Agent (Client) The software instance acting on behalf of the user, typically a web browser, mobile application, or IoT device. The User Agent is responsible for capturing audio input, rendering audio output, and managing the visual state of the application.
Voice Runtime (Server) The processing entity responsible for orchestration, also referred to in this document as the VIP Server. It manages session authentication, context state (Voice Interaction Tree), and the translation of natural language into executable application intents.
Model Provider The underlying service (third-party or local) that performs specific AI tasks, including Automatic Speech Recognition (ASR), Large Language Model (LLM) inference, and Text-to-Speech (TTS) synthesis.
Application Backend The existing server-side logic of the host application. It acts as the Identity Provider (IdP) for validating user permissions before a VIP session is authorized.
3.2 Protocol Architecture
Internal Mode An architectural configuration where the VIP Server is used solely for session negotiation and ephemeral token generation. Once authenticated, the User Agent establishes a direct, low-latency WebSocket or WebRTC connection to the Model Provider for media streaming.
External Mode An architectural configuration where the VIP Server acts as a full proxy. The User Agent streams media to the VIP Server, which then orchestrates the distinct ASR, LLM, and TTS services.
Session A bounded period of interaction between the User Agent and the Voice Runtime. A session begins with a Session Handshake and ends with a termination event (explicit hang-up or timeout).
Session Handshake The initial sequence of message exchanges used to negotiate protocol version, environment (Development/Production), and authentication credentials.
Ephemeral Token A short-lived, time-bound credential generated by the VIP Server (in Internal Mode) that allows the User Agent to connect directly to a Model Provider without exposing long-term API keys.
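The Ephemeral Token definition can be illustrated with a minimal sketch. The class name, field names, the 60-second default lifetime, and the use of a random opaque string are all assumptions made for illustration; this section does not prescribe a token format or lifetime.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EphemeralToken:
    value: str         # opaque credential handed to the User Agent
    expires_at: float  # Unix timestamp at which the token stops being accepted

    def is_valid(self, now: float) -> bool:
        # A token is usable only strictly before its expiry time.
        return now < self.expires_at


def issue_ephemeral_token(ttl_seconds: float = 60.0,
                          now: Optional[float] = None) -> EphemeralToken:
    """Hypothetical server-side helper: mint a short-lived token.

    The long-term Model Provider API key never appears in the result.
    """
    issued_at = time.time() if now is None else now
    return EphemeralToken(value=secrets.token_urlsafe(32),
                          expires_at=issued_at + ttl_seconds)
```

In this sketch the User Agent would present `value` when opening its direct connection, and the receiving side would reject it once `is_valid` returns false.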
3.3 Data Structures and Context
Voice Interaction Tree The hierarchical representation of the application's current state, optimized for auditory consumption. It is the voice-modality equivalent of the Document Object Model (DOM).
Narrated State Description A generated text description of the current view, derived from the Voice Interaction Tree. This serves as the system prompt context, informing the AI of what the user "sees."
Action Registry A dynamic map of all currently executable capabilities. This includes navigation routes, form inputs, buttons, and custom functions available to the user in the current context.
Visibility Registry A subset of the Action Registry representing only those elements currently perceptible to the user (e.g., elements within the viewport or active modal).
Conversation History A chronological record of messages and events exchanged during the session, including metadata such as speaker role (User, AI) and timestamps.
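The relationship between the Voice Interaction Tree, the two registries, and the Narrated State Description can be sketched as follows. The `Node` type, its fields, the rule that a hidden node hides its whole subtree, and the narration format are illustrative assumptions, not structures defined by this specification.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str                    # spoken label for this element
    action: Optional[str] = None  # hypothetical Action Registry key, if actionable
    visible: bool = True          # False when outside the viewport or active modal
    children: List["Node"] = field(default_factory=list)


def action_registry(node: Node) -> List[str]:
    """All actions reachable in the tree, perceptible or not."""
    found = [node.action] if node.action else []
    for child in node.children:
        found += action_registry(child)
    return found


def visibility_registry(node: Node) -> List[str]:
    """Actions on perceptible nodes only (assumption: a hidden node hides its subtree)."""
    if not node.visible:
        return []
    found = [node.action] if node.action else []
    for child in node.children:
        found += visibility_registry(child)
    return found


def narrate(node: Node, depth: int = 0) -> str:
    """Indented text description of the perceptible part of the tree."""
    if not node.visible:
        return ""
    suffix = f" (action: {node.action})" if node.action else ""
    lines = ["  " * depth + node.label + suffix]
    for child in node.children:
        text = narrate(child, depth + 1)
        if text:
            lines.append(text)
    return "\n".join(lines)
```

By construction, every entry returned by `visibility_registry` also appears in `action_registry`, which mirrors the subset relationship stated in the definitions.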
3.4 Interaction Logic
User Turn The period during which the User Agent captures input (audio or text) from the user.
System Turn The period during which the Voice Runtime processes input and generates a response (audio, text, or action).
Intent Invocation The execution of a specific function defined in the Action Registry. This is the voice-modality equivalent of a "click" or "keypress."
Prompt The instructional text (System Prompt) provided to the Voice Runtime that defines the agent's persona, boundaries, and operational rules.
Context Propagation The mechanism by which the User Agent updates the Voice Runtime with changes to the application state (e.g., route changes, form updates) in real time.
Barge-in (Interruption) The act of the user speaking while the system is outputting audio. The protocol defines specific handling for this event to ensure state consistency.
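Intent Invocation, as defined above, amounts to looking up a name in the Action Registry and executing the associated function. A minimal sketch follows; the `invoke_intent` helper, the keyword-argument calling convention, and the choice to raise `KeyError` for unknown intents are assumptions made for illustration.

```python
from typing import Any, Callable, Dict

# Hypothetical shape: intent name -> client-side function.
ActionRegistry = Dict[str, Callable[..., Any]]


def invoke_intent(registry: ActionRegistry, name: str, **arguments: Any) -> Any:
    """Execute the function registered under `name` with the resolved arguments."""
    if name not in registry:
        # An intent that is not currently registered must not be executed.
        raise KeyError(f"intent {name!r} is not in the Action Registry")
    return registry[name](**arguments)


# Example registry with a single navigation action.
example_registry: ActionRegistry = {
    "navigate": lambda route: f"navigated to {route}",
}
```

For example, `invoke_intent(example_registry, "navigate", route="/cart")` calls the registered function with `route="/cart"`, in the same way a click would trigger the corresponding handler.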
3.5 Interaction States
State: Idle The default state where the User Agent is connected but not actively capturing audio or processing requests.
State: Listening The state where the User Agent is actively capturing audio input, triggered manually (Push-to-Talk) or automatically (Voice Activity Detection).
State: Processing The state where input has been finalized and transmitted, and the Voice Runtime is computing a response.
State: Speaking The state where the User Agent is receiving and playing back an audio stream from the Voice Runtime.
State: Action The state where the Voice Runtime has requested the execution of a client-side function (e.g., navigating to a new page).
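The five interaction states can be modelled as a small state machine. The states come from the definitions above; the transition table is an assumption made for illustration, since this section names the states but does not enumerate the permitted edges. The Speaking to Listening edge is included to model Barge-in.

```python
from enum import Enum


class State(Enum):
    IDLE = "Idle"
    LISTENING = "Listening"
    PROCESSING = "Processing"
    SPEAKING = "Speaking"
    ACTION = "Action"


# Assumed transition table (not normative).
ALLOWED = {
    State.IDLE: {State.LISTENING},
    State.LISTENING: {State.PROCESSING, State.IDLE},
    State.PROCESSING: {State.SPEAKING, State.ACTION, State.IDLE},
    State.SPEAKING: {State.LISTENING, State.IDLE},  # Speaking -> Listening models Barge-in
    State.ACTION: {State.SPEAKING, State.IDLE},
}


def transition(current: State, target: State) -> State:
    """Return the new state, or raise if the edge is not in the assumed table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Rejecting edges that are not listed keeps the User Agent and the Voice Runtime from drifting into inconsistent states, for instance playing audio while no input has been processed.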
3.6 Error Handling and Determinism
Deterministic Behavior Protocol operations that must yield a guaranteed outcome, specifically the mapping of a resolved intent to a specific function in the Action Registry.
Probabilistic Behavior Operations involving the AI model's interpretation of natural language, where outcomes are generated based on likelihood rather than fixed rules.
Recognition Error A failure occurring within the ASR layer, resulting in an inability to transcribe user input.
System Error A failure at the protocol or transport level (e.g., connection timeout, authentication failure, malformed payload).
Fallback The defined behavior when a specific modality fails (e.g., reverting to touch input if the microphone is inaccessible).
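The distinction between Recognition Errors and System Errors, and the microphone example given for Fallback, can be sketched as below. The error code strings and function names are hypothetical; this section does not define an error code vocabulary.

```python
from enum import Enum


class ErrorKind(Enum):
    RECOGNITION = "recognition"  # ASR layer could not transcribe the input
    SYSTEM = "system"            # protocol- or transport-level failure


# Hypothetical error codes, chosen to mirror the examples in the definitions.
_ERROR_KINDS = {
    "asr_no_transcript": ErrorKind.RECOGNITION,
    "connection_timeout": ErrorKind.SYSTEM,
    "authentication_failure": ErrorKind.SYSTEM,
    "malformed_payload": ErrorKind.SYSTEM,
}


def classify(code: str) -> ErrorKind:
    """Map an error code to its category; unknown codes raise KeyError."""
    return _ERROR_KINDS[code]


def choose_input_modality(microphone_accessible: bool) -> str:
    """Fallback example from the text: revert to touch when the microphone is inaccessible."""
    return "voice" if microphone_accessible else "touch"
```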