models Package
Classes
| ActionFind |
A find action to search text within a page. |
| ActionOpenPage |
An open page action. |
| ActionSearch |
A web search action. |
| ActionSearchSource |
A search action source URL. |
| AgentConfig |
Configuration for the agent. |
| Animation |
Configuration for animation outputs including blendshapes and visemes metadata. |
| AssistantMessageItem |
An assistant message item within a conversation. |
| AudioEchoCancellation |
Echo cancellation configuration for server-side audio processing. |
| AudioInputTranscriptionOptions |
Configuration for input audio transcription. |
| AudioNoiseReduction |
Configuration for input audio noise reduction. |
| AvatarConfig |
Configuration for avatar streaming and behavior during the session. |
| AzureAvatarVoiceSyncVoice |
Azure avatar voice sync configuration. Uses personal voice synthesis with avatar character. |
| AzureCustomVoice |
Azure custom voice configuration. |
| AzurePersonalVoice |
Azure personal voice configuration. |
| AzureSemanticDetection |
Azure semantic end-of-utterance detection (default). |
| AzureSemanticDetectionEn |
Azure semantic end-of-utterance detection (English-optimized). |
| AzureSemanticDetectionMultilingual |
Azure semantic end-of-utterance detection (multilingual). |
| AzureSemanticVad |
Server Speech Detection (Azure semantic VAD, default variant). |
| AzureSemanticVadEn |
Server Speech Detection (Azure semantic VAD, English-only). |
| AzureSemanticVadMultilingual |
Server Speech Detection (Azure semantic VAD, multilingual). |
| AzureStandardVoice |
Azure standard voice configuration. |
| AzureVoice |
Base for Azure voice configurations. You probably want to use the sub-classes and not this class directly. Known sub-classes are: AzureAvatarVoiceSyncVoice, AzureCustomVoice, AzurePersonalVoice, AzureStandardVoice |
| Background |
Defines a video background, either a solid color or an image URL (mutually exclusive). |
| CachedTokenDetails |
Details of cached token usage. |
| ClientEvent |
A voicelive client event. You probably want to use the sub-classes and not this class directly. Known sub-classes are: ClientEventConversationItemCreate, ClientEventConversationItemDelete, ClientEventConversationItemRetrieve, ClientEventConversationItemTruncate, ClientEventInputAudioClear, ClientEventInputAudioTurnAppend, ClientEventInputAudioTurnCancel, ClientEventInputAudioTurnEnd, ClientEventInputAudioTurnStart, ClientEventInputAudioBufferAppend, ClientEventInputAudioBufferClear, ClientEventInputAudioBufferCommit, ClientEventOutputAudioBufferClear, ClientEventResponseCancel, ClientEventResponseCreate, ClientEventSessionAvatarConnect, ClientEventSessionUpdate |
| ClientEventConversationItemCreate |
Add a new Item to the Conversation's context, including messages, function calls, and function
call responses. This event can be used both to populate a "history" of the conversation and to
add new items mid-stream, but has the current limitation that it cannot populate assistant
audio messages. If successful, the server will respond with a |
| ClientEventConversationItemDelete |
Send this event when you want to remove any item from the conversation history. The server will
respond with a |
| ClientEventConversationItemRetrieve |
Send this event when you want to retrieve the server's representation of a specific item in the
conversation history. This is useful, for example, to inspect user audio after noise
cancellation and VAD. The server will respond with a |
| ClientEventConversationItemTruncate |
Send this event to truncate a previous assistant message's audio. The server will produce audio
faster than real time, so this event is useful when the user interrupts to truncate audio that
has already been sent to the client but not yet played. This will synchronize the server's
understanding of the audio with the client's playback. Truncating audio will delete the
server-side text transcript to ensure there is not text in the context that hasn't been heard
by the user. If successful, the server will respond with a |
| ClientEventInputAudioBufferAppend |
Send this event to append audio bytes to the input audio buffer. The audio buffer is temporary storage you can write to and later commit. In Server VAD mode, the audio buffer is used to detect speech and the server will decide when to commit. When Server VAD is disabled, you must commit the audio buffer manually. The client may choose how much audio to place in each event, up to a maximum of 15 MiB; for example, streaming smaller chunks from the client may allow the VAD to be more responsive. Unlike most other client events, the server will not send a confirmation response to this event. |
| ClientEventInputAudioBufferClear |
Send this event to clear the audio bytes in the buffer. The server will respond with an
|
| ClientEventInputAudioBufferCommit |
Send this event to commit the user input audio buffer, which will create a new user message
item in the conversation. This event will produce an error if the input audio buffer is empty.
When in Server VAD mode, the client does not need to send this event; the server will commit
the audio buffer automatically. Committing the input audio buffer will trigger input audio
transcription (if enabled in session configuration), but it will not create a response from the
model. The server will respond with an |
| ClientEventInputAudioClear |
Clears all input audio currently being streamed. |
| ClientEventInputAudioTurnAppend |
Appends audio data to an ongoing input turn. |
| ClientEventInputAudioTurnCancel |
Cancels an in-progress input audio turn. |
| ClientEventInputAudioTurnEnd |
Marks the end of an audio input turn. |
| ClientEventInputAudioTurnStart |
Indicates the start of a new audio input turn. |
| ClientEventOutputAudioBufferClear |
Client request to clear the avatar output buffer. |
| ClientEventResponseCancel |
Send this event to cancel an in-progress response. The server will respond with a
|
| ClientEventResponseCreate |
This event instructs the server to create a Response, which means triggering model inference.
When in Server VAD mode, the server will create Responses automatically. A Response will
include at least one Item, and may have two, in which case the second will be a function call.
These Items will be appended to the conversation history. The server will respond with a
|
| ClientEventSessionAvatarConnect |
Sent when the client connects and provides its SDP (Session Description Protocol) for avatar-related media negotiation. |
| ClientEventSessionUpdate |
Send this event to update the session's default configuration. The client may send this event
at any time to update any field, except for |
| ContentPart |
Base for any content part; discriminated by You probably want to use the sub-classes and not this class directly. Known sub-classes are: ResponseAudioContentPart, RequestAudioContentPart, RequestImageContentPart, RequestTextContentPart, ResponseTextContentPart |
| ConversationItemBase |
The item to add to the conversation. |
| ConversationRequestItem |
Base for any response item; discriminated by You probably want to use the sub-classes and not this class directly. Known sub-classes are: FunctionCallItem, FunctionCallOutputItem, MCPApprovalResponseRequestItem, MessageItem |
| EouDetection |
Top-level union for end-of-utterance (EOU) semantic detection configuration. You probably want to use the sub-classes and not this class directly. Known sub-classes are: AzureSemanticDetection, AzureSemanticDetectionEn, AzureSemanticDetectionMultilingual |
| ErrorResponse |
Standard error response envelope. |
| FileSearchResult |
A file search result entry. |
| FunctionCallItem |
A function call item within a conversation. |
| FunctionCallOutputItem |
A function call output item within a conversation. |
| FunctionTool |
The definition of a function tool as used by the voicelive endpoint. |
| IceServer |
ICE server configuration for WebRTC connection negotiation. |
| InputAudioContentPart |
Input audio content part. |
| InputTextContentPart |
Input text content part. |
| InputTokenDetails |
Details of input token usage. |
| InterimResponseConfigBase |
Base model for interim response configuration. You probably want to use the sub-classes and not this class directly. Known sub-classes are: LlmInterimResponseConfig, StaticInterimResponseConfig |
| LlmInterimResponseConfig |
Configuration for LLM-based interim response generation. Uses an LLM to generate context-aware interim responses when any trigger condition is met. |
| LogProbProperties |
A single log probability entry for a token. |
| MCPApprovalResponseRequestItem |
A request item that represents a response to an MCP approval request. |
| MCPServer |
The definition of an MCP server as used by the voicelive endpoint. |
| MCPTool |
Represents an MCP tool definition. |
| MessageContentPart |
Base for any message content part; discriminated by You probably want to use the sub-classes and not this class directly. Known sub-classes are: InputAudioContentPart, InputTextContentPart, OutputTextContentPart |
| MessageItem |
A message item within a conversation. You probably want to use the sub-classes and not this class directly. Known sub-classes are: AssistantMessageItem, SystemMessageItem, UserMessageItem |
| OpenAIVoice |
OpenAI voice configuration with explicit type field. This provides a unified interface for OpenAI voices, complementing the existing string-based OAIVoice for backward compatibility. |
| OutputTextContentPart |
Output text content part. |
| OutputTokenDetails |
Details of output token usage. |
| RequestAudioContentPart |
An audio content part for a request. This is supported only by realtime models (e.g.,
gpt-realtime). For text-based models, use |
| RequestImageContentPart |
Input image content part. |
| RequestSession |
Extended RequestSession that tracks explicitly set None values. |
| RequestTextContentPart |
A text content part for a request. |
| Response |
The response resource. |
| ResponseAudioContentPart |
An audio content part for a response. |
| ResponseCancelledDetails |
Details for a cancelled response. |
| ResponseCreateParams |
Create a new VoiceLive response with these parameters. |
| ResponseFailedDetails |
Details for a failed response. |
| ResponseFileSearchCallItem |
A response item that represents a file search call. |
| ResponseFunctionCallItem |
A function call item within a conversation. |
| ResponseFunctionCallOutputItem |
A function call output item within a conversation. |
| ResponseIncompleteDetails |
Details for an incomplete response. |
| ResponseItem |
Base for any response item; discriminated by You probably want to use the sub-classes and not this class directly. Known sub-classes are: ResponseFileSearchCallItem, ResponseFunctionCallItem, ResponseFunctionCallOutputItem, ResponseMCPApprovalRequestItem, ResponseMCPApprovalResponseItem, ResponseMCPCallItem, ResponseMCPListToolItem, ResponseMessageItem, ResponseWebSearchCallItem |
| ResponseMCPApprovalRequestItem |
A response item that represents a request for approval to call an MCP tool. |
| ResponseMCPApprovalResponseItem |
A response item that represents a response to an MCP approval request. |
| ResponseMCPCallItem |
A response item that represents a call to an MCP tool. |
| ResponseMCPListToolItem |
A response item that lists the tools available on an MCP server. |
| ResponseMessageItem |
Base type for a message item within a conversation. |
| ResponseSession |
Base for session configuration in the response. |
| ResponseStatusDetails |
Base for all non-success response details. You probably want to use the sub-classes and not this class directly. Known sub-classes are: ResponseCancelledDetails, ResponseFailedDetails, ResponseIncompleteDetails |
| ResponseTextContentPart |
A text content part for a response. |
| ResponseWebSearchCallItem |
A response item that represents a web search call. |
| Scene |
Configuration for the avatar's zoom level, position, rotation, and movement amplitude in the video frame. |
| ServerEvent |
A voicelive server event. You probably want to use the sub-classes and not this class directly. Known sub-classes are: ServerEventConversationItemCreated, ServerEventConversationItemDeleted, ServerEventConversationItemInputAudioTranscriptionCompleted, ServerEventConversationItemInputAudioTranscriptionDelta, ServerEventConversationItemInputAudioTranscriptionFailed, ServerEventConversationItemRetrieved, ServerEventConversationItemTruncated, ServerEventError, ServerEventInputAudioBufferCleared, ServerEventInputAudioBufferCommitted, ServerEventInputAudioBufferSpeechStarted, ServerEventInputAudioBufferSpeechStopped, ServerEventMcpListToolsCompleted, ServerEventMcpListToolsFailed, ServerEventMcpListToolsInProgress, ServerEventOutputAudioBufferCleared, ServerEventResponseAnimationBlendshapeDelta, ServerEventResponseAnimationBlendshapeDone, ServerEventResponseAnimationVisemeDelta, ServerEventResponseAnimationVisemeDone, ServerEventResponseAudioDelta, ServerEventResponseAudioDone, ServerEventResponseAudioTimestampDelta, ServerEventResponseAudioTimestampDone, ServerEventResponseAudioTranscriptAnnotationAdded, ServerEventResponseAudioTranscriptDelta, ServerEventResponseAudioTranscriptDone, ServerEventResponseContentPartAdded, ServerEventResponseContentPartDone, ServerEventResponseCreated, ServerEventResponseDone, ServerEventResponseFileSearchCallCompleted, ServerEventResponseFileSearchCallInProgress, ServerEventResponseFileSearchCallSearching, ServerEventResponseFunctionCallArgumentsDelta, ServerEventResponseFunctionCallArgumentsDone, ServerEventResponseMcpCallCompleted, ServerEventResponseMcpCallFailed, ServerEventResponseMcpCallInProgress, ServerEventResponseMcpCallArgumentsDelta, ServerEventResponseMcpCallArgumentsDone, ServerEventResponseOutputItemAdded, ServerEventResponseOutputItemDone, ServerEventResponseTextDelta, ServerEventResponseTextDone, ServerEventResponseVideoDelta, ServerEventResponseWebSearchCallCompleted, ServerEventResponseWebSearchCallInProgress, 
ServerEventResponseWebSearchCallSearching, ServerEventSessionAvatarConnecting, ServerEventSessionAvatarSwitchToIdle, ServerEventSessionAvatarSwitchToSpeaking, ServerEventSessionCreated, ServerEventSessionUpdated, ServerEventWarning |
| ServerEventConversationItemCreated |
Returned when a conversation item is created. Several scenarios produce this event: (1) the server is generating a Response, which if successful will produce either one or two Items, which will be of type message (role assistant) or type function_call; (2) the input audio buffer has been committed, either by the client or the server (in server_vad mode), and the server takes the content of the input audio buffer and adds it to a new user message Item; (3) the client has sent a conversation.item.create event to add a new Item to the Conversation. |
| ServerEventConversationItemDeleted |
Returned when an item in the conversation is deleted by the client with a
|
| ServerEventConversationItemInputAudioTranscriptionCompleted |
This event is the output of audio transcription for user audio written to the user audio
buffer. Transcription begins when the input audio buffer is committed by the client or server
(in |
| ServerEventConversationItemInputAudioTranscriptionDelta |
Returned when the text value of an input audio transcription content part is updated. |
| ServerEventConversationItemInputAudioTranscriptionFailed |
Returned when input audio transcription is configured, and a transcription request for a user
message failed. These events are separate from other |
| ServerEventConversationItemRetrieved |
Returned when a conversation item is retrieved with |
| ServerEventConversationItemTruncated |
Returned when an earlier assistant audio message item is truncated by the client with a
|
| ServerEventError |
Returned when an error occurs, which could be a client problem or a server problem. Most errors are recoverable and the session will stay open; we recommend that implementors monitor and log error messages by default. |
| ServerEventErrorDetails |
Details of the error. |
| ServerEventInputAudioBufferCleared |
Returned when the input audio buffer is cleared by the client with a
|
| ServerEventInputAudioBufferCommitted |
Returned when an input audio buffer is committed, either by the client or automatically in
server VAD mode. The |
| ServerEventInputAudioBufferSpeechStarted |
Sent by the server when in |
| ServerEventInputAudioBufferSpeechStopped |
Returned in |
| ServerEventMcpListToolsCompleted |
MCP list tools completed message. |
| ServerEventMcpListToolsFailed |
MCP list tools failed message. |
| ServerEventMcpListToolsInProgress |
MCP list tools in progress message. |
| ServerEventOutputAudioBufferCleared |
Returned when the output audio buffer has been cleared. |
| ServerEventResponseAnimationBlendshapeDelta |
Represents a delta update of blendshape animation frames for a specific output of a response. |
| ServerEventResponseAnimationBlendshapeDone |
Indicates the completion of blendshape animation processing for a specific output of a response. |
| ServerEventResponseAnimationVisemeDelta |
Represents a viseme ID delta update for animation based on audio. |
| ServerEventResponseAnimationVisemeDone |
Indicates completion of viseme animation delivery for a response. |
| ServerEventResponseAudioDelta |
Returned when the model-generated audio is updated. |
| ServerEventResponseAudioDone |
Returned when the model-generated audio is done. Also emitted when a Response is interrupted, incomplete, or cancelled. |
| ServerEventResponseAudioTimestampDelta |
Represents a word-level audio timestamp delta for a response. |
| ServerEventResponseAudioTimestampDone |
Indicates completion of audio timestamp delivery for a response. |
| ServerEventResponseAudioTranscriptAnnotationAdded |
Returned when an audio transcript annotation is added to a response. |
| ServerEventResponseAudioTranscriptDelta |
Returned when the model-generated transcription of audio output is updated. |
| ServerEventResponseAudioTranscriptDone |
Returned when the model-generated transcription of audio output is done streaming. Also emitted when a Response is interrupted, incomplete, or cancelled. |
| ServerEventResponseContentPartAdded |
Returned when a new content part is added to an assistant message item during response generation. |
| ServerEventResponseContentPartDone |
Returned when a content part is done streaming in an assistant message item. Also emitted when a Response is interrupted, incomplete, or cancelled. |
| ServerEventResponseCreated |
Returned when a new Response is created. The first event of response creation, where the
response is in an initial state of |
| ServerEventResponseDone |
Returned when a Response is done streaming. Always emitted, no matter the final state. The
Response object included in the |
| ServerEventResponseFileSearchCallCompleted |
Returned when a file search call has completed. |
| ServerEventResponseFileSearchCallInProgress |
Returned when a file search call is in progress. |
| ServerEventResponseFileSearchCallSearching |
Returned when a file search call is searching. |
| ServerEventResponseFunctionCallArgumentsDelta |
Returned when the model-generated function call arguments are updated. |
| ServerEventResponseFunctionCallArgumentsDone |
Returned when the model-generated function call arguments are done streaming. Also emitted when a Response is interrupted, incomplete, or cancelled. |
| ServerEventResponseMcpCallArgumentsDelta |
Represents a delta update of the arguments for an MCP tool call. |
| ServerEventResponseMcpCallArgumentsDone |
Indicates the completion of the arguments for an MCP tool call. |
| ServerEventResponseMcpCallCompleted |
Indicates the MCP call has completed. |
| ServerEventResponseMcpCallFailed |
Indicates the MCP call has failed. |
| ServerEventResponseMcpCallInProgress |
Indicates the MCP call is in progress. |
| ServerEventResponseOutputItemAdded |
Returned when a new Item is created during Response generation. |
| ServerEventResponseOutputItemDone |
Returned when an Item is done streaming. Also emitted when a Response is interrupted, incomplete, or cancelled. |
| ServerEventResponseTextDelta |
Returned when the text value of a "text" content part is updated. |
| ServerEventResponseTextDone |
Returned when the text value of a "text" content part is done streaming. Also emitted when a Response is interrupted, incomplete, or cancelled. |
| ServerEventResponseVideoDelta |
Returned when avatar video frame data is streamed. |
| ServerEventResponseWebSearchCallCompleted |
Returned when a web search call has completed. |
| ServerEventResponseWebSearchCallInProgress |
Returned when a web search call is in progress. |
| ServerEventResponseWebSearchCallSearching |
Returned when a web search call is searching. |
| ServerEventSessionAvatarConnecting |
Sent when the server is in the process of establishing an avatar media connection and provides its SDP answer. |
| ServerEventSessionAvatarSwitchToIdle |
Returned when the avatar switches to idle state. |
| ServerEventSessionAvatarSwitchToSpeaking |
Returned when the avatar switches to speaking state. |
| ServerEventSessionCreated |
Returned when a Session is created. Emitted automatically when a new connection is established as the first server event. This event will contain the default Session configuration. |
| ServerEventSessionUpdated |
Returned when a session is updated with a |
| ServerEventWarning |
Returned when a warning occurs that does not interrupt the conversation flow. Warnings are informational and the session will continue normally. |
| ServerEventWarningDetails |
Details of the warning. |
| ServerVad |
Base model for VAD-based turn detection. |
| SessionBase |
VoiceLive session object configuration. |
| StaticInterimResponseConfig |
Configuration for static interim response generation. Randomly selects from configured texts when any trigger condition is met. |
| SystemMessageItem |
A system message item within a conversation. |
| TokenUsage |
Overall usage statistics for a response. |
| Tool |
The base representation of a voicelive tool definition. You probably want to use the sub-classes and not this class directly. Known sub-classes are: FunctionTool, MCPServer |
| ToolChoiceFunctionSelection |
The representation of a voicelive tool_choice selecting a named function tool. |
| ToolChoiceSelection |
A base representation for a voicelive tool_choice selecting a named tool. You probably want to use the sub-classes and not this class directly. Known sub-classes are: ToolChoiceFunctionSelection |
| TranscriptionPhrase |
A transcribed phrase with timing information. |
| TranscriptionWord |
A time-stamped word in the transcription. |
| TurnDetection |
Top-level union for turn detection configuration. You probably want to use the sub-classes and not this class directly. Known sub-classes are: AzureSemanticVad, AzureSemanticVadEn, AzureSemanticVadMultilingual, ServerVad |
| UserMessageItem |
A user message item within a conversation. |
| VideoCrop |
Defines a video crop rectangle using top-left and bottom-right coordinates. |
| VideoParams |
Video streaming parameters for avatar. |
| VideoResolution |
Resolution of the video feed in pixels. |
| VoiceLiveErrorDetails |
Error object returned in case of API failure. |
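
The ClientEventInputAudioBufferAppend entry above mentions a 15 MiB cap per event and notes that smaller chunks may make the VAD more responsive. The following is a minimal sketch of that chunking, built with plain dictionaries rather than the package's model classes. The `"input_audio_buffer.append"` type string, the `audio` field name, the base64 encoding, and the 24 kHz 16-bit mono format behind the default chunk size are all assumptions for illustration; this listing does not confirm them.

```python
import base64
import json

MAX_EVENT_BYTES = 15 * 1024 * 1024  # 15 MiB cap mentioned in the listing


def chunk_audio(pcm: bytes, chunk_bytes: int = 4800):
    """Yield append-style event dicts for successive slices of `pcm`.

    The default of 4800 bytes is 100 ms of audio if the stream is
    24 kHz, 16-bit, mono (an assumed format, not taken from the listing).
    """
    if chunk_bytes <= 0 or chunk_bytes > MAX_EVENT_BYTES:
        raise ValueError("chunk_bytes must be between 1 and MAX_EVENT_BYTES")
    for start in range(0, len(pcm), chunk_bytes):
        piece = pcm[start:start + chunk_bytes]
        yield {
            "type": "input_audio_buffer.append",  # assumed wire name
            "audio": base64.b64encode(piece).decode("ascii"),  # assumed field
        }


# Usage: 10,000 bytes of silence split into 4800-byte pieces.
events = list(chunk_audio(b"\x00" * 10000))
print(len(events))
print(json.dumps(events[0])[:60])
```

Because the server sends no confirmation for this event, a client built along these lines would have to rely on later server events (for example, speech-started or committed notifications) rather than per-chunk acknowledgements.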
Enums
| AnimationOutputType |
Specifies the types of animation data to output. |
| AudioTimestampType |
Output timestamp types supported in audio response content. |
| AvatarConfigTypes |
Avatar config types. |
| AvatarOutputProtocol |
Avatar config output protocols. |
| AzureVoiceType |
Union of all supported Azure voice types. |
| ClientEventType |
Client event types used in VoiceLive protocol. |
| ContentPartType |
Type of ContentPartType. |
| EouThresholdLevel |
Threshold level settings for Azure semantic end-of-utterance detection. |
| InputAudioFormat |
Input audio format types supported. |
| InterimResponseConfigType |
Interim response configuration types. |
| InterimResponseTrigger |
Triggers that can activate interim response generation. |
| ItemParamStatus |
Indicates the processing status of an item or parameter. |
| ItemType |
Type of ItemType. |
| MCPApprovalType |
The available set of MCP approval types. |
| MessageRole |
Type of MessageRole. |
| Modality |
Supported modalities for the session. |
| OpenAIVoiceName |
Supported OpenAI voice names (string enum). |
| OutputAudioFormat |
Output audio format types supported. |
| PersonalVoiceModels |
PersonalVoice models. |
| PhotoAvatarBaseModes |
Photo avatar base modes. |
| ReasoningEffort |
Constrains effort on reasoning for reasoning models. Check model documentation for supported values for each model. Reducing reasoning effort can result in faster responses and fewer tokens used on reasoning in a response. |
| RequestImageContentPartDetail |
Specifies an image's detail level. Can be 'auto', 'low', 'high', or an unknown future value. |
| ResponseItemStatus |
Indicates the processing status of a response item. |
| ResponseStatus |
Terminal status of a response. |
| ServerEventType |
Server event types used in VoiceLive protocol. |
| SessionIncludeOption |
Options for what additional data to include in session responses. |
| ToolChoiceLiteral |
The available set of mode-level, string literal tool_choice options for the voicelive endpoint. |
| ToolType |
The supported tool type discriminators for voicelive tools. Currently, only 'function' tools are supported. |
| TurnDetectionType |
Type of TurnDetectionType. |
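
RequestImageContentPartDetail is described above as accepting 'auto', 'low', 'high', or an unknown future value. One way client code might tolerate such an open string enum is sketched below. `ImageDetail` and `parse_detail` are hypothetical names invented for this example and are not part of the package; only the three literal values come from the listing.

```python
from enum import Enum
from typing import Union


class ImageDetail(str, Enum):
    """Known detail levels; values taken from the enum description above."""
    AUTO = "auto"
    LOW = "low"
    HIGH = "high"


def parse_detail(value: str) -> Union[ImageDetail, str]:
    """Return the matching enum member, or the raw string if it is unknown.

    Passing unknown values through keeps older clients working when the
    service introduces a new level.
    """
    try:
        return ImageDetail(value)
    except ValueError:
        return value


# Usage: a known value maps to a member, an unknown one is passed through.
print(parse_detail("low"))
print(parse_detail("ultra"))
```

Since `ImageDetail` subclasses `str`, known members still compare equal to their plain string values, so callers can treat both outcomes as strings.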