models Package

Classes

ActionFind	A find action to search text within a page.
ActionOpenPage	An open page action.
ActionSearch	A web search action.
ActionSearchSource	A search action source URL.
AgentConfig	Configuration for the agent.
Animation	Configuration for animation outputs including blendshapes and visemes metadata.
AssistantMessageItem	An assistant message item within a conversation.
AudioEchoCancellation	Echo cancellation configuration for server-side audio processing.
AudioInputTranscriptionOptions	Configuration for input audio transcription.
AudioNoiseReduction	Configuration for input audio noise reduction.
AvatarConfig	Configuration for avatar streaming and behavior during the session.
AzureAvatarVoiceSyncVoice	Azure avatar voice sync configuration. Uses personal voice synthesis with avatar character.
AzureCustomVoice	Azure custom voice configuration.
AzurePersonalVoice	Azure personal voice configuration.
AzureSemanticDetection	Azure semantic end-of-utterance detection (default).
AzureSemanticDetectionEn	Azure semantic end-of-utterance detection (English-optimized).
AzureSemanticDetectionMultilingual	Azure semantic end-of-utterance detection (multilingual).
AzureSemanticVad	Server Speech Detection (Azure semantic VAD, default variant).
AzureSemanticVadEn	Server Speech Detection (Azure semantic VAD, English-only).
AzureSemanticVadMultilingual	Server Speech Detection (Azure semantic VAD).
AzureStandardVoice	Azure standard voice configuration.
AzureVoice	Base for Azure voice configurations. You probably want to use the sub-classes and not this class directly. Known sub-classes are: AzureAvatarVoiceSyncVoice, AzureCustomVoice, AzurePersonalVoice, AzureStandardVoice
Background	Defines a video background, either a solid color or an image URL (mutually exclusive).
CachedTokenDetails	Details of cached token usage.
ClientEvent	A voicelive client event. You probably want to use the sub-classes and not this class directly. Known sub-classes are: ClientEventConversationItemCreate, ClientEventConversationItemDelete, ClientEventConversationItemRetrieve, ClientEventConversationItemTruncate, ClientEventInputAudioClear, ClientEventInputAudioTurnAppend, ClientEventInputAudioTurnCancel, ClientEventInputAudioTurnEnd, ClientEventInputAudioTurnStart, ClientEventInputAudioBufferAppend, ClientEventInputAudioBufferClear, ClientEventInputAudioBufferCommit, ClientEventOutputAudioBufferClear, ClientEventResponseCancel, ClientEventResponseCreate, ClientEventSessionAvatarConnect, ClientEventSessionUpdate
ClientEventConversationItemCreate	Add a new Item to the Conversation's context, including messages, function calls, and function call responses. This event can be used both to populate a "history" of the conversation and to add new items mid-stream, but has the current limitation that it cannot populate assistant audio messages. If successful, the server will respond with a `conversation.item.created` event, otherwise an `error` event will be sent.
ClientEventConversationItemDelete	Send this event when you want to remove any item from the conversation history. The server will respond with a `conversation.item.deleted` event, unless the item does not exist in the conversation history, in which case the server will respond with an error.
ClientEventConversationItemRetrieve	Send this event when you want to retrieve the server's representation of a specific item in the conversation history. This is useful, for example, to inspect user audio after noise cancellation and VAD. The server will respond with a `conversation.item.retrieved` event, unless the item does not exist in the conversation history, in which case the server will respond with an error.
ClientEventConversationItemTruncate	Send this event to truncate a previous assistant message's audio. The server produces audio faster than real time, so this event is useful when the user interrupts: it truncates audio that has already been sent to the client but not yet played. This synchronizes the server's understanding of the audio with the client's playback. Truncating audio deletes the server-side text transcript to ensure there is no text in the context that the user hasn't heard. If successful, the server will respond with a `conversation.item.truncated` event.
ClientEventInputAudioBufferAppend	Send this event to append audio bytes to the input audio buffer. The audio buffer is temporary storage you can write to and later commit. In Server VAD mode, the audio buffer is used to detect speech and the server will decide when to commit. When Server VAD is disabled, you must commit the audio buffer manually. The client may choose how much audio to place in each event, up to a maximum of 15 MiB; for example, streaming smaller chunks from the client may allow the VAD to be more responsive. Unlike most other client events, the server will not send a confirmation response to this event.
ClientEventInputAudioBufferClear	Send this event to clear the audio bytes in the buffer. The server will respond with an `input_audio_buffer.cleared` event.
ClientEventInputAudioBufferCommit	Send this event to commit the user input audio buffer, which will create a new user message item in the conversation. This event will produce an error if the input audio buffer is empty. In Server VAD mode, the client does not need to send this event because the server commits the audio buffer automatically. Committing the input audio buffer will trigger input audio transcription (if enabled in session configuration), but it will not create a response from the model. The server will respond with an `input_audio_buffer.committed` event.
ClientEventInputAudioClear	Clears all input audio currently being streamed.
ClientEventInputAudioTurnAppend	Appends audio data to an ongoing input turn.
ClientEventInputAudioTurnCancel	Cancels an in-progress input audio turn.
ClientEventInputAudioTurnEnd	Marks the end of an audio input turn.
ClientEventInputAudioTurnStart	Indicates the start of a new audio input turn.
ClientEventOutputAudioBufferClear	Client request to clear the avatar output buffer.
ClientEventResponseCancel	Send this event to cancel an in-progress response. The server will respond with a `response.cancelled` event or an error if there is no response to cancel.
ClientEventResponseCreate	This event instructs the server to create a Response, which means triggering model inference. In Server VAD mode, the server will create Responses automatically. A Response will include at least one Item, and may have two, in which case the second will be a function call. These Items will be appended to the conversation history. The server will respond with a `response.created` event, events for Items and content created, and finally a `response.done` event to indicate the Response is complete. The `response.create` event includes inference configuration such as `instructions` and `temperature`. These fields will override the Session's configuration for this Response only.
ClientEventSessionAvatarConnect	Sent when the client connects and provides its SDP (Session Description Protocol) for avatar-related media negotiation.
ClientEventSessionUpdate	Send this event to update the session's default configuration. The client may send this event at any time to update any field, except for `voice`. However, note that once a session has been initialized with a particular `model`, it can't be changed to another model using `session.update`. When the server receives a `session.update`, it will respond with a `session.updated` event showing the full, effective configuration. Only the fields that are present are updated. To clear a field like `instructions`, pass an empty string.
ContentPart	Base for any content part; discriminated by `type`. You probably want to use the sub-classes and not this class directly. Known sub-classes are: ResponseAudioContentPart, RequestAudioContentPart, RequestImageContentPart, RequestTextContentPart, ResponseTextContentPart
ConversationItemBase	The item to add to the conversation.
ConversationRequestItem	Base for any response item; discriminated by `type`. You probably want to use the sub-classes and not this class directly. Known sub-classes are: FunctionCallItem, FunctionCallOutputItem, MCPApprovalResponseRequestItem, MessageItem
EouDetection	Top-level union for end-of-utterance (EOU) semantic detection configuration. You probably want to use the sub-classes and not this class directly. Known sub-classes are: AzureSemanticDetection, AzureSemanticDetectionEn, AzureSemanticDetectionMultilingual
ErrorResponse	Standard error response envelope.
FileSearchResult	A file search result entry.
FunctionCallItem	A function call item within a conversation.
FunctionCallOutputItem	A function call output item within a conversation.
FunctionTool	The definition of a function tool as used by the voicelive endpoint.
IceServer	ICE server configuration for WebRTC connection negotiation.
InputAudioContentPart	Input audio content part.
InputTextContentPart	Input text content part.
InputTokenDetails	Details of input token usage.
InterimResponseConfigBase	Base model for interim response configuration. You probably want to use the sub-classes and not this class directly. Known sub-classes are: LlmInterimResponseConfig, StaticInterimResponseConfig
LlmInterimResponseConfig	Configuration for LLM-based interim response generation. Uses LLM to generate context-aware interim responses when any trigger condition is met.
LogProbProperties	A single log probability entry for a token.
MCPApprovalResponseRequestItem	A request item that represents a response to an MCP approval request.
MCPServer	The definition of an MCP server as used by the voicelive endpoint.
MCPTool	Represents an MCP tool definition.
MessageContentPart	Base for any message content part; discriminated by `type`. You probably want to use the sub-classes and not this class directly. Known sub-classes are: InputAudioContentPart, InputTextContentPart, OutputTextContentPart
MessageItem	A message item within a conversation. You probably want to use the sub-classes and not this class directly. Known sub-classes are: AssistantMessageItem, SystemMessageItem, UserMessageItem
OpenAIVoice	OpenAI voice configuration with explicit type field. This provides a unified interface for OpenAI voices, complementing the existing string-based OAIVoice for backward compatibility.
OutputTextContentPart	Output text content part.
OutputTokenDetails	Details of output token usage.
RequestAudioContentPart	An audio content part for a request. This is supported only by realtime models (e.g., gpt-realtime). For text-based models, use `input_text` instead.
RequestImageContentPart	Input image content part.
RequestSession	Extended RequestSession that tracks explicitly set None values.
RequestTextContentPart	A text content part for a request.
Response	The response resource.
ResponseAudioContentPart	An audio content part for a response.
ResponseCancelledDetails	Details for a cancelled response.
ResponseCreateParams	Create a new VoiceLive response with these parameters.
ResponseFailedDetails	Details for a failed response.
ResponseFileSearchCallItem	A response item that represents a file search call.
ResponseFunctionCallItem	A function call item within a conversation.
ResponseFunctionCallOutputItem	A function call output item within a conversation.
ResponseIncompleteDetails	Details for an incomplete response.
ResponseItem	Base for any response item; discriminated by `type`. You probably want to use the sub-classes and not this class directly. Known sub-classes are: ResponseFileSearchCallItem, ResponseFunctionCallItem, ResponseFunctionCallOutputItem, ResponseMCPApprovalRequestItem, ResponseMCPApprovalResponseItem, ResponseMCPCallItem, ResponseMCPListToolItem, ResponseMessageItem, ResponseWebSearchCallItem
ResponseMCPApprovalRequestItem	A response item that represents a request for approval to call an MCP tool.
ResponseMCPApprovalResponseItem	A response item that represents a response to an MCP approval request.
ResponseMCPCallItem	A response item that represents a call to an MCP tool.
ResponseMCPListToolItem	A response item that lists the tools available on an MCP server.
ResponseMessageItem	Base type for message item within a conversation.
ResponseSession	Base for session configuration in the response.
ResponseStatusDetails	Base for all non-success response details. You probably want to use the sub-classes and not this class directly. Known sub-classes are: ResponseCancelledDetails, ResponseFailedDetails, ResponseIncompleteDetails
ResponseTextContentPart	A text content part for a response.
ResponseWebSearchCallItem	A response item that represents a web search call.
Scene	Configuration for avatar's zoom level, position, rotation and movement amplitude in the video frame.
ServerEvent	A voicelive server event. You probably want to use the sub-classes and not this class directly. Known sub-classes are: ServerEventConversationItemCreated, ServerEventConversationItemDeleted, ServerEventConversationItemInputAudioTranscriptionCompleted, ServerEventConversationItemInputAudioTranscriptionDelta, ServerEventConversationItemInputAudioTranscriptionFailed, ServerEventConversationItemRetrieved, ServerEventConversationItemTruncated, ServerEventError, ServerEventInputAudioBufferCleared, ServerEventInputAudioBufferCommitted, ServerEventInputAudioBufferSpeechStarted, ServerEventInputAudioBufferSpeechStopped, ServerEventMcpListToolsCompleted, ServerEventMcpListToolsFailed, ServerEventMcpListToolsInProgress, ServerEventOutputAudioBufferCleared, ServerEventResponseAnimationBlendshapeDelta, ServerEventResponseAnimationBlendshapeDone, ServerEventResponseAnimationVisemeDelta, ServerEventResponseAnimationVisemeDone, ServerEventResponseAudioDelta, ServerEventResponseAudioDone, ServerEventResponseAudioTimestampDelta, ServerEventResponseAudioTimestampDone, ServerEventResponseAudioTranscriptAnnotationAdded, ServerEventResponseAudioTranscriptDelta, ServerEventResponseAudioTranscriptDone, ServerEventResponseContentPartAdded, ServerEventResponseContentPartDone, ServerEventResponseCreated, ServerEventResponseDone, ServerEventResponseFileSearchCallCompleted, ServerEventResponseFileSearchCallInProgress, ServerEventResponseFileSearchCallSearching, ServerEventResponseFunctionCallArgumentsDelta, ServerEventResponseFunctionCallArgumentsDone, ServerEventResponseMcpCallCompleted, ServerEventResponseMcpCallFailed, ServerEventResponseMcpCallInProgress, ServerEventResponseMcpCallArgumentsDelta, ServerEventResponseMcpCallArgumentsDone, ServerEventResponseOutputItemAdded, ServerEventResponseOutputItemDone, ServerEventResponseTextDelta, ServerEventResponseTextDone, ServerEventResponseVideoDelta, ServerEventResponseWebSearchCallCompleted, ServerEventResponseWebSearchCallInProgress, ServerEventResponseWebSearchCallSearching, ServerEventSessionAvatarConnecting, ServerEventSessionAvatarSwitchToIdle, ServerEventSessionAvatarSwitchToSpeaking, ServerEventSessionCreated, ServerEventSessionUpdated, ServerEventWarning
ServerEventConversationItemCreated	Returned when a conversation item is created. Several scenarios produce this event: (1) the server is generating a Response, which if successful will produce either one or two Items of type `message` (role `assistant`) or type `function_call`; (2) the input audio buffer has been committed, either by the client or by the server (in `server_vad` mode), in which case the server takes the content of the input audio buffer and adds it to a new user message Item; (3) the client has sent a `conversation.item.create` event to add a new Item to the Conversation.
ServerEventConversationItemDeleted	Returned when an item in the conversation is deleted by the client with a `conversation.item.delete` event. This event is used to synchronize the server's understanding of the conversation history with the client's view.
ServerEventConversationItemInputAudioTranscriptionCompleted	This event is the output of audio transcription for user audio written to the user audio buffer. Transcription begins when the input audio buffer is committed by the client or server (in `server_vad` mode). Transcription runs asynchronously with Response creation, so this event may come before or after the Response events. VoiceLive API models accept audio natively, and thus input transcription is a separate process run on a separate ASR (Automatic Speech Recognition) model. The transcript may diverge somewhat from the model's interpretation, and should be treated as a rough guide.
ServerEventConversationItemInputAudioTranscriptionDelta	Returned when the text value of an input audio transcription content part is updated.
ServerEventConversationItemInputAudioTranscriptionFailed	Returned when input audio transcription is configured, and a transcription request for a user message failed. These events are separate from other `error` events so that the client can identify the related Item.
ServerEventConversationItemRetrieved	Returned when a conversation item is retrieved with `conversation.item.retrieve`.
ServerEventConversationItemTruncated	Returned when an earlier assistant audio message item is truncated by the client with a `conversation.item.truncate` event. This event is used to synchronize the server's understanding of the audio with the client's playback. This action will truncate the audio and remove the server-side text transcript to ensure there is no text in the context that hasn't been heard by the user.
ServerEventError	Returned when an error occurs, which could be a client problem or a server problem. Most errors are recoverable and the session will stay open. We recommend that implementors monitor and log error messages by default.
ServerEventErrorDetails	Details of the error.
ServerEventInputAudioBufferCleared	Returned when the input audio buffer is cleared by the client with an `input_audio_buffer.clear` event.
ServerEventInputAudioBufferCommitted	Returned when an input audio buffer is committed, either by the client or automatically in server VAD mode. The `item_id` property is the ID of the user message item that will be created, thus a `conversation.item.created` event will also be sent to the client.
ServerEventInputAudioBufferSpeechStarted	Sent by the server when in `server_vad` mode to indicate that speech has been detected in the audio buffer. This can happen any time audio is added to the buffer (unless speech is already detected). The client may want to use this event to interrupt audio playback or provide visual feedback to the user. The client should expect to receive an `input_audio_buffer.speech_stopped` event when speech stops. The `item_id` property is the ID of the user message item that will be created when speech stops and will also be included in the `input_audio_buffer.speech_stopped` event (unless the client manually commits the audio buffer during VAD activation).
ServerEventInputAudioBufferSpeechStopped	Returned in `server_vad` mode when the server detects the end of speech in the audio buffer. The server will also send a `conversation.item.created` event with the user message item that is created from the audio buffer.
ServerEventMcpListToolsCompleted	MCP list tools completed message.
ServerEventMcpListToolsFailed	MCP list tools failed message.
ServerEventMcpListToolsInProgress	MCP list tools in progress message.
ServerEventOutputAudioBufferCleared	Returned when the output audio buffer has been cleared.
ServerEventResponseAnimationBlendshapeDelta	Represents a delta update of blendshape animation frames for a specific output of a response.
ServerEventResponseAnimationBlendshapeDone	Indicates the completion of blendshape animation processing for a specific output of a response.
ServerEventResponseAnimationVisemeDelta	Represents a viseme ID delta update for animation based on audio.
ServerEventResponseAnimationVisemeDone	Indicates completion of viseme animation delivery for a response.
ServerEventResponseAudioDelta	Returned when the model-generated audio is updated.
ServerEventResponseAudioDone	Returned when the model-generated audio is done. Also emitted when a Response is interrupted, incomplete, or cancelled.
ServerEventResponseAudioTimestampDelta	Represents a word-level audio timestamp delta for a response.
ServerEventResponseAudioTimestampDone	Indicates completion of audio timestamp delivery for a response.
ServerEventResponseAudioTranscriptAnnotationAdded	Returned when an audio transcript annotation is added to a response.
ServerEventResponseAudioTranscriptDelta	Returned when the model-generated transcription of audio output is updated.
ServerEventResponseAudioTranscriptDone	Returned when the model-generated transcription of audio output is done streaming. Also emitted when a Response is interrupted, incomplete, or cancelled.
ServerEventResponseContentPartAdded	Returned when a new content part is added to an assistant message item during response generation.
ServerEventResponseContentPartDone	Returned when a content part is done streaming in an assistant message item. Also emitted when a Response is interrupted, incomplete, or cancelled.
ServerEventResponseCreated	Returned when a new Response is created. The first event of response creation, where the response is in an initial state of `in_progress`.
ServerEventResponseDone	Returned when a Response is done streaming. Always emitted, no matter the final state. The Response object included in the `response.done` event will include all output Items in the Response but will omit the raw audio data.
ServerEventResponseFileSearchCallCompleted	Returned when a file search call has completed.
ServerEventResponseFileSearchCallInProgress	Returned when a file search call is in progress.
ServerEventResponseFileSearchCallSearching	Returned when a file search call is searching.
ServerEventResponseFunctionCallArgumentsDelta	Returned when the model-generated function call arguments are updated.
ServerEventResponseFunctionCallArgumentsDone	Returned when the model-generated function call arguments are done streaming. Also emitted when a Response is interrupted, incomplete, or cancelled.
ServerEventResponseMcpCallArgumentsDelta	Represents a delta update of the arguments for an MCP tool call.
ServerEventResponseMcpCallArgumentsDone	Indicates the completion of the arguments for an MCP tool call.
ServerEventResponseMcpCallCompleted	Indicates the MCP call has completed.
ServerEventResponseMcpCallFailed	Indicates the MCP call has failed.
ServerEventResponseMcpCallInProgress	Indicates the MCP call is in progress.
ServerEventResponseOutputItemAdded	Returned when a new Item is created during Response generation.
ServerEventResponseOutputItemDone	Returned when an Item is done streaming. Also emitted when a Response is interrupted, incomplete, or cancelled.
ServerEventResponseTextDelta	Returned when the text value of a "text" content part is updated.
ServerEventResponseTextDone	Returned when the text value of a "text" content part is done streaming. Also emitted when a Response is interrupted, incomplete, or cancelled.
ServerEventResponseVideoDelta	Returned when avatar video frame data is streamed.
ServerEventResponseWebSearchCallCompleted	Returned when a web search call has completed.
ServerEventResponseWebSearchCallInProgress	Returned when a web search call is in progress.
ServerEventResponseWebSearchCallSearching	Returned when a web search call is searching.
ServerEventSessionAvatarConnecting	Sent when the server is in the process of establishing an avatar media connection and provides its SDP answer.
ServerEventSessionAvatarSwitchToIdle	Returned when the avatar switches to idle state.
ServerEventSessionAvatarSwitchToSpeaking	Returned when the avatar switches to speaking state.
ServerEventSessionCreated	Returned when a Session is created. Emitted automatically when a new connection is established as the first server event. This event will contain the default Session configuration.
ServerEventSessionUpdated	Returned when a session is updated with a `session.update` event, unless there is an error.
ServerEventWarning	Returned when a warning occurs that does not interrupt the conversation flow. Warnings are informational and the session will continue normally.
ServerEventWarningDetails	Details of the warning.
ServerVad	Base model for VAD-based turn detection.
SessionBase	VoiceLive session object configuration.
StaticInterimResponseConfig	Configuration for static interim response generation. Randomly selects from configured texts when any trigger condition is met.
SystemMessageItem	A system message item within a conversation.
TokenUsage	Overall usage statistics for a response.
Tool	The base representation of a voicelive tool definition. You probably want to use the sub-classes and not this class directly. Known sub-classes are: FunctionTool, MCPServer
ToolChoiceFunctionSelection	The representation of a voicelive tool_choice selecting a named function tool.
ToolChoiceSelection	A base representation for a voicelive tool_choice selecting a named tool. You probably want to use the sub-classes and not this class directly. Known sub-classes are: ToolChoiceFunctionSelection
TranscriptionPhrase	A transcribed phrase with timing information.
TranscriptionWord	A time-stamped word in the transcription.
TurnDetection	Top-level union for turn detection configuration. You probably want to use the sub-classes and not this class directly. Known sub-classes are: AzureSemanticVad, AzureSemanticVadEn, AzureSemanticVadMultilingual, ServerVad
UserMessageItem	A user message item within a conversation.
VideoCrop	Defines a video crop rectangle using top-left and bottom-right coordinates.
VideoParams	Video streaming parameters for avatar.
VideoResolution	Resolution of the video feed in pixels.
VoiceLiveErrorDetails	Error object returned in case of API failure.
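
The manual-commit flow described for ClientEventInputAudioBufferAppend, ClientEventInputAudioBufferCommit, and ClientEventResponseCreate can be sketched at the wire level as follows. This is a minimal illustration, not the SDK's API: it builds plain JSON messages rather than model instances, the helper `build_manual_turn` is hypothetical, and the event names `input_audio_buffer.append` and `input_audio_buffer.commit` as well as the base64 `audio` field are assumptions inferred from the class names and the `input_audio_buffer.*` events quoted in the table. Only the 15 MiB per-event ceiling, the empty-buffer error, and the fact that committing does not itself create a response come from the descriptions above.

```python
import base64
import json

# Per-event ceiling stated in the ClientEventInputAudioBufferAppend description.
MAX_EVENT_AUDIO_BYTES = 15 * 1024 * 1024


def build_manual_turn(pcm: bytes, chunk_size: int = 3200) -> list:
    """Return the JSON messages for one manually committed audio turn.

    Hypothetical helper: event names and the "audio" field are assumptions.
    """
    if not pcm:
        # The table notes that committing an empty buffer produces an error.
        raise ValueError("no audio to commit")
    if not 0 < chunk_size <= MAX_EVENT_AUDIO_BYTES:
        raise ValueError("chunk_size must be between 1 byte and 15 MiB")

    messages = []
    # Smaller chunks may let the VAD react sooner, per the description above.
    for start in range(0, len(pcm), chunk_size):
        chunk = pcm[start:start + chunk_size]
        messages.append(json.dumps({
            "type": "input_audio_buffer.append",  # assumed wire name
            "audio": base64.b64encode(chunk).decode("ascii"),  # assumed field
        }))

    # With Server VAD disabled, the client commits the buffer itself...
    messages.append(json.dumps({"type": "input_audio_buffer.commit"}))
    # ...and committing does not trigger inference, so ask for a Response.
    messages.append(json.dumps({"type": "response.create"}))
    return messages
```

Calling `build_manual_turn(audio_bytes)` yields the append messages in order, followed by one commit and one `response.create`; in Server VAD mode the last two would be unnecessary because the server commits and creates Responses automatically.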

Enums

AnimationOutputType	Specifies the types of animation data to output.
AudioTimestampType	Output timestamp types supported in audio response content.
AvatarConfigTypes	Avatar config types.
AvatarOutputProtocol	Avatar config output protocols.
AzureVoiceType	Union of all supported Azure voice types.
ClientEventType	Client event types used in VoiceLive protocol.
ContentPartType	Type of ContentPartType.
EouThresholdLevel	Threshold level settings for Azure semantic end-of-utterance detection.
InputAudioFormat	Input audio format types supported.
InterimResponseConfigType	Interim response configuration types.
InterimResponseTrigger	Triggers that can activate interim response generation.
ItemParamStatus	Indicates the processing status of an item or parameter.
ItemType	Type of ItemType.
MCPApprovalType	The available set of MCP approval types.
MessageRole	Type of MessageRole.
Modality	Supported modalities for the session.
OpenAIVoiceName	Supported OpenAI voice names (string enum).
OutputAudioFormat	Output audio format types supported.
PersonalVoiceModels	PersonalVoice models.
PhotoAvatarBaseModes	Photo avatar base modes.
ReasoningEffort	Constrains effort on reasoning for reasoning models. Check model documentation for supported values for each model. Reducing reasoning effort can result in faster responses and fewer tokens used on reasoning in a response.
RequestImageContentPartDetail	Specifies an image's detail level. Can be 'auto', 'low', 'high', or an unknown future value.
ResponseItemStatus	Indicates the processing status of a response item.
ResponseStatus	Terminal status of a response.
ServerEventType	Server event types used in VoiceLive protocol.
SessionIncludeOption	Options for what additional data to include in session responses.
ToolChoiceLiteral	The available set of mode-level, string literal tool_choice options for the voicelive endpoint.
ToolType	The supported tool type discriminators for voicelive tools. Currently, only 'function' tools are supported.
TurnDetectionType	Type of TurnDetectionType.
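
Several enums above, such as RequestImageContentPartDetail, are described as accepting the known values "or an unknown future value". One way a client might tolerate that is sketched below. `ImageDetail` and `parse_detail` are hypothetical stand-ins written for this illustration; they are not the SDK's classes, and the SDK may handle unknown values differently.

```python
from enum import Enum
from typing import Union


class ImageDetail(str, Enum):
    """Hypothetical stand-in mirroring the three documented detail levels."""

    AUTO = "auto"
    LOW = "low"
    HIGH = "high"


def parse_detail(value: str) -> Union[ImageDetail, str]:
    """Return the matching member, or the raw string for an unknown value."""
    try:
        return ImageDetail(value)
    except ValueError:
        # Unknown future value: keep it as-is instead of failing.
        return value
```

With this shape, `parse_detail("low")` returns `ImageDetail.LOW`, while a value the client does not know yet is passed through unchanged so that newer service responses do not break older clients.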
