The Agents SDK introduces several unique features, with one of the standout capabilities being voice functionality. The voice tutorial demonstrates how to build voice-enabled AI agents that can process spoken input, generate intelligent responses, and deliver those responses as natural-sounding speech.
Setup
If you're working in Google Colab or another remote notebook service, you can install the following requirements for this notebook. If running locally, refer to the uv setup instructions in the README.
text
!pip install -qU \
"matplotlib==3.10.1" \
"openai==1.68.2" \
"openai-agents[voice]==0.0.12" \
"sounddevice==0.5.1"
Working with Sound in Python
We'll be using the sounddevice library to handle audio input and output streaming, which allows us to record audio into a NumPy array and play audio back from a NumPy array.
Before recording or playing audio with sounddevice, we need to find the sample rate of our input and output devices. We can look up the input / output device details using the query_devices function.
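Below is a minimal sketch of that lookup. It stores the default sample rates as in_samplerate and out_samplerate for reuse later, then records a short test clip and plays it straight back so we can confirm the audio round trip works; the printed device list, default sample rates, and the three-second clip length are just illustrative and will vary per machine.
python
import sounddevice as sd

# Inspect the default input (microphone) and output (speaker) devices
print(sd.query_devices(kind="input"))
print(sd.query_devices(kind="output"))

# Store the default sample rates for later use
in_samplerate = int(sd.query_devices(kind="input")["default_samplerate"])
out_samplerate = int(sd.query_devices(kind="output")["default_samplerate"])

# Quick round trip: record a short clip from the microphone...
duration_seconds = 3  # arbitrary demo length
recording = sd.rec(
    int(duration_seconds * in_samplerate),
    samplerate=in_samplerate,
    channels=1,
    dtype="int16",
)
sd.wait()  # block until the recording is finished

# ...then play it straight back through the speakers
sd.play(recording, samplerate=in_samplerate)
sd.wait()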
We've seen how to work with audio in Python. Now it's time to jump into working with audio in the Agents SDK. First, we set our OpenAI API key:
python
import os
import getpass

# Set the OpenAI API key, prompting for it if it isn't already in the environment
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY") or getpass.getpass("OpenAI API Key: ")
python
from agents import Agent

# Define the agent that will power our voice assistant
agent = Agent(
    name="Assistant",
    instructions=(
        "Repeat the user's question back to them, and then answer it. Note that the user is "
        "speaking to you via a voice interface, although you are reading and writing text to "
        "respond. Nonetheless, ensure that your written response is easily translatable to voice."
    ),
    model="gpt-4.1-nano",
)
We will use the VoicePipeline from the Agents SDK, which requires two parameters.
Workflow Parameter
The workflow is our agent from above transformed into a voice workflow via the SingleAgentVoiceWorkflow object.
python
from agents.voice import SingleAgentVoiceWorkflow

# Wrap our agent in a single-agent voice workflow
workflow = SingleAgentVoiceWorkflow(agent)
Config Parameter
The config is where we pass our VoicePipelineConfig. Inside this config we provide a TTSModelSettings object, within which we give instructions describing how the voice should sound.
python
from agents.voice import TTSModelSettings, VoicePipelineConfig

# Define custom TTS model settings with the desired voice instructions
custom_tts_settings = TTSModelSettings(instructions="Speak in a friendly, upbeat, clearly articulated tone.")
voice_pipeline_config = VoicePipelineConfig(tts_settings=custom_tts_settings)
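With the workflow and config defined, we can construct the VoicePipeline and give it some recorded audio to work with. Below is a minimal sketch, assuming the in_samplerate value found earlier with query_devices; the fixed five-second recording length is just for illustration, and AudioInput's frame_rate is set so the pipeline knows the sample rate of the recording.
python
import numpy as np
import sounddevice as sd
from agents.voice import AudioInput, VoicePipeline

# Build the voice pipeline from our workflow and config
pipeline = VoicePipeline(workflow=workflow, config=voice_pipeline_config)

# Record a short question from the default microphone (5 seconds is an arbitrary demo length)
duration_seconds = 5
recording = sd.rec(
    int(duration_seconds * in_samplerate),
    samplerate=in_samplerate,
    channels=1,
    dtype="int16",
)
sd.wait()  # block until the recording is finished

# Wrap the raw samples in an AudioInput object for the pipeline
audio_input = AudioInput(buffer=np.squeeze(recording), frame_rate=in_samplerate)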
Now we can provide our audio_input to the pipeline to receive an audio output stream. This is handled asynchronously, so we must await the pipeline and capture the streamed audio events, which we identify via type == "voice_stream_event_audio".
python
result = await pipeline.run(audio_input=audio_input)

response_chunks = []
async for event in result.stream():
    if event.type == "voice_stream_event_audio":
        response_chunks.append(event.data)

# concatenate all of the chunks into a single audio buffer and play it back
response_audio = np.concatenate(response_chunks)
sd.play(response_audio, samplerate=out_samplerate)
sd.wait()  # this prevents the cell from finishing before the full audio is played
Great! We have our spoken response from the LLM. Now we can wrap this up into a more conversational interface: we press the Enter key to speak, and type "q" once we're finished.
python
async def voice_assistant_optimized():
    while True:
        # check for input to either provide voice or exit
        cmd = input("Press Enter to speak (or type 'q' to exit): ")
        if cmd.lower() == "q":
            print("Exiting...")
            break
        print("Listening...")
        recorded_chunks = []

        # start streaming from microphone until Enter is pressed
        with sd.InputStream(
            samplerate=in_samplerate,
            channels=1,
            dtype='int16',
            callback=lambda indata, frames, time, status: recorded_chunks.append(indata.copy())