💡  Beta product

This product is in Beta. This means that it and its documentation are subject to change, updates, or removal.
If you test this product, please let us know your feedback so that we can make it the best possible product for you. Please share your feedback with us here.

Realtime Voice API

The Realtime Voice API provides a websocket for streaming and processing bidirectional audio during phone calls. Use this API only if you need programmatic access to the audio data itself, for example to connect your virtual phone number to an AI agent.

Using the Realtime API requires both a regular virtual phone number and a websocket number, and your application must accept websocket connections. We have a detailed setup guide if you need help getting started.

Overview

Each time your websocket number receives a call, the API will establish a new websocket connection to your application at the websocket_url specified on the number.

API -- wss://example.com/incoming-call --> Application

Once the websocket connection is established, the API and your application use JSON messages to start the session, negotiate audio formats, stream audio, and gracefully end the call. Each message has a field t that denotes its type.

The session ALWAYS starts with the API sending a hello message containing some metadata about the call…

// API -> Application
{
  "t": "hello",
  "callid": "c13d1e772...",
  "from": "+46701234567",
  "to": "+46766861234"
}

…and ALWAYS ends with the API sending a bye message, which contains the reason the call ended, any errors that occurred, and a human-readable explanation of what happened.

// API -> Application
{
  "t": "bye",
  "reason": "hangup",
  "message": "the caller hung up"
}

After the initial hello your application can start sending and receiving audio. To send audio, you must first send a sending message that specifies which audio format you'll be sending in. Likewise, to receive audio you must first send a listening message that specifies which audio format you want to receive in. You can choose to just send audio, just receive audio, or both.

// Application -> API
{
  "t": "sending",
  "format": "pcm_24000"
}
// Application -> API
{
  "t": "listening",
  "format": "pcm_24000"
}

Audio is sent in both directions via audio messages. These contain Base64 encoded audio data in the format specified in the previous sending and listening messages.

// API <-> Application
{
  "t": "audio",
  "data": "<base64 encoded audio data>"
}
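Since every audio message is just JSON wrapping Base64 data, encoding and decoding can be factored into two small helpers (the function names here are illustrative, not part of the API):

```python
import base64
import json

def encode_audio_message(audio_bytes: bytes) -> str:
    """Wrap raw audio bytes in an audio message."""
    return json.dumps({
        "t": "audio",
        "data": base64.b64encode(audio_bytes).decode()
    })

def decode_audio_message(raw: str) -> bytes:
    """Extract the raw audio bytes from an audio message."""
    msg = json.loads(raw)
    if msg["t"] != "audio":
        raise ValueError(f"not an audio message: {msg['t']}")
    return base64.b64decode(msg["data"])
```

Keeping the Base64 handling in one place makes it harder to accidentally send raw bytes over the websocket.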

If you want to end the call before the caller hangs up, you can do so by sending a bye message to the API.

// Application -> API
{
  "t": "bye"
}

The API will hang up the call once all buffered audio data has been played, then send a bye message back to your application. Your application must ALWAYS wait for the final bye from the API before disconnecting; failure to do so will cause the call to disconnect before all buffered audio has been played.

Putting it all together, the control flow of a typical bidirectional audio session will look something like this:

API                           Your application
 |                                    |
 | *Websocket connection established* |
 |                                    |
 |         *API says hello*           |
 | ------------ hello --------------> |
 |                                    |
 |          *Audio setup*             |
 | <--------- listening ------------- |
 | <---------- sending -------------- |
 |                                    |
 |         *Audio streaming*          |
 | <----------- audio --------------- |
 | ------------ audio --------------> |
 | ------------ audio --------------> |
 | <----------- audio --------------- |
 | <------------ ... ---------------> |
 |                                    |
 |       *Application is done*        |
 | <------------ bye ---------------- |
 |                                    |
 | ------------ audio --------------> |
 | ------------ audio --------------> |
 | ------------- ... ---------------> |
 |                                    |
 |           *Call ends*              |
 | <------------ bye ---------------- |
 |                                    |
 |       *Websocket is closed*        |
 v                                    v

Example 1: An echo server

All examples use the websockets library in Python, and assume that a websocket connection between the API and your application has already been established. See our integration guide if you need help setting that up.

This code example implements an “echo server” that just plays the caller’s audio back to them.

import base64
import json

async def echo_server(ws):
    # Get the call metadata from the hello message
    hello = json.loads(await ws.recv())
    print(f"Received {hello['to']} <- {hello['from']} ({hello['callid']})")

    # Tell the API the format we want to receive audio in
    await ws.send(json.dumps({
        "t": "listening",
        "format": "ulaw"
    }))

    # Tell the API the format we'll be sending audio in
    await ws.send(json.dumps({
        "t": "sending",
        "format": "ulaw"
    }))

    # Loop over and play back each audio message until the call ends
    async for raw in ws:
        msg = json.loads(raw)

        if msg["t"] == "audio":
            audio = base64.b64decode(msg["data"])

            # Echo the audio back to the caller
            await ws.send(json.dumps({
                "t": "audio",
                "data": base64.b64encode(audio).decode()
            }))

        elif msg["t"] == "bye":
            print("Call ended:", msg["message"])
            break

Example 2: Playing an audio file

This example plays a WAV file from disk and then hangs up.

import base64
import json

async def play_file(ws):
    # Get the call metadata from the hello message
    hello = json.loads(await ws.recv())
    print(f"Received {hello['to']} <- {hello['from']} ({hello['callid']})")

    # Tell the API the format we'll be sending audio in
    await ws.send(json.dumps({
        "t": "sending",
        "format": "wav"
    }))

    # Send the WAV file bytes in chunks
    with open("audio.wav", "rb") as f:
        for chunk in iter(lambda: f.read(32 * 1024), b""):
            await ws.send(json.dumps({
                "t": "audio",
                "data": base64.b64encode(chunk).decode()
            }))

    # Hang up
    await ws.send(json.dumps({
        "t": "bye"
    }))

    # Wait for the call to end
    msg = json.loads(await ws.recv())
    print("Call ended:", msg["message"])

Example 3: A memo recorder

This example records the caller’s audio to a WAV file on disk.

import base64
import json
import wave

async def record_call(ws):
    # Get the phone call metadata from the hello message
    hello = json.loads(await ws.recv())
    print(f"Call started: {hello['to']} <- {hello['from']} ({hello['callid']})")

    # Tell the API the format we want to receive audio in
    await ws.send(json.dumps({
        "t": "listening",
        "format": "pcm_16000"
    }))

    # Open the wav file for writing
    with wave.open(f"{hello['callid']}.wav", "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)   # 16-bit
        wf.setframerate(16000)

        # Loop over each audio message until the call ends
        async for raw in ws:
            msg = json.loads(raw)

            if msg["t"] == "audio":
                audio = base64.b64decode(msg["data"])
                wf.writeframes(audio)

            elif msg["t"] == "bye":
                print("Call ended:", msg["reason"])
                break