BYOM (Bring Your Own Model)

BYOM lets you drive an agent’s conversation with any LLM or agent framework you choose — OpenAI, Anthropic Claude, Gemini, Llama on Ollama, a LangGraph pipeline, or a deterministic state machine. Voicebip handles telephony, STT, TTS, MNO failover, call control, billing, and HMAC-signed events; your webhook only has to return the next thing the agent should say.

If you want Voicebip to host the model for you, see Agents and set ai_provider to openai or gemini instead.

Hosted AI vs BYOM

| | Hosted AI | BYOM |
|---|---|---|
| Model choice | Gemini, OpenAI | Anything you can host |
| System prompt | Managed by Voicebip | Managed by you |
| Conversation state | Managed by Voicebip | Passed to you every turn |
| Turn latency (P95) | < 500ms end-to-end | < 500ms + your webhook round-trip |
| When to use | Fastest path, standard use cases | Custom logic, proprietary models, tool use |

Both modes use the same voice pipeline, the same phone numbers, and the same event stream. You can switch an agent between modes with a single PATCH — nothing else in your integration changes.

Configure an Agent for BYOM

Set ai_provider to byom and point webhook_url at the endpoint that will handle conversation turns.

    curl -X POST "https://api.voicebip.com/v1/agents" \
      -H "Authorization: Bearer pk_live_your_key" \
      -H "Content-Type: application/json" \
      -d '{
        "display_name": "Loan Recovery Agent",
        "language": "en",
        "ai_provider": "byom",
        "webhook_url": "https://your-server.com/voicebip/byom"
      }'

Your webhook_url receives everything for that agent — both BYOM turn requests and lifecycle events (call.initiated, call.transcription, call.completed). Distinguish them by the presence of an event_type field in the JSON body: if event_type is present it is a lifecycle event and you should return {"received": true}; if absent it is a BYOM turn request and you must return a BYOMVoiceWebhookResponse with a text field. See Webhooks for the lifecycle event payload shapes.
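
As a minimal sketch of that dispatch rule, the function below branches on whether event_type is present in the parsed body. The names handle_payload and next_reply are hypothetical, and next_reply is only a stand-in for your own model call.

```python
from typing import Any


def classify_payload(body: dict[str, Any]) -> str:
    """Return "lifecycle" if event_type is present, otherwise "turn"."""
    return "lifecycle" if "event_type" in body else "turn"


def next_reply(transcription: str, messages: list[dict[str, Any]]) -> str:
    """Hypothetical placeholder; swap in your LLM or state machine."""
    return f"You said: {transcription}"


def handle_payload(body: dict[str, Any]) -> dict[str, Any]:
    """Dispatch one parsed webhook body to the right kind of response."""
    if classify_payload(body) == "lifecycle":
        # Lifecycle events (call.initiated, call.transcription,
        # call.completed) only need an acknowledgement.
        return {"received": True}
    # No event_type: this is a BYOM turn request and must return text.
    return {"text": next_reply(body["transcription"], body.get("messages", []))}
```

Keeping the branch in one place means a lifecycle event can never be answered with agent speech by accident.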

The Turn Contract

On every caller turn, Voicebip POSTs a JSON body to your webhook_url and expects a JSON response. This is a single request/response — BYOM does not stream tokens.

Request

    {
      "agent_id": "agt_PAEZ_njcfm2kycpjs",
      "call_id": "call_01HXYZ...",
      "transcription": "I'd like to check my outstanding balance.",
      "messages": [
        { "role": "agent", "text": "Hi, this is Kemi from FairMoney. How can I help?", "timestamp": "2026-04-14T09:00:02Z" },
        { "role": "caller", "text": "I'd like to check my outstanding balance.", "timestamp": "2026-04-14T09:00:07Z" }
      ]
    }

| Field | Type | Description |
|---|---|---|
| agent_id | string | The agent receiving the call. |
| call_id | string | Stable ID for this call. Use it as the session key on your side. |
| transcription | string | The most recent caller utterance, i.e. the turn you need to respond to. |
| messages | array | Rolling conversation history (see Conversation History). |

Each entry in messages has:

| Field | Type | Description |
|---|---|---|
| turn_id | string | Stable ID for this turn. Useful for deduplication. |
| role | string | "caller" or "agent" |
| text | string | STT transcript (caller) or TTS text (agent) |
| timestamp | string (ISO 8601) | UTC time when the turn was recorded |

You can pass the array straight into an LLM’s messages parameter after mapping roles.
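
The role mapping might look like the sketch below. The target role names "user" and "assistant" follow the common chat-completion convention and are an assumption about your LLM client; to_llm_messages is a hypothetical helper name.

```python
ROLE_MAP = {"caller": "user", "agent": "assistant"}  # assumed LLM role names


def to_llm_messages(messages: list[dict]) -> list[dict]:
    """Convert Voicebip history entries into chat-style LLM messages."""
    converted = []
    for entry in messages:
        role = ROLE_MAP.get(entry.get("role"))
        if role is None:
            continue  # skip any role this sketch does not recognise
        converted.append({"role": role, "content": entry.get("text", "")})
    return converted
```

Fields such as timestamp and turn_id are dropped here because most chat APIs only accept a role and content per message.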

Response

Return a JSON body with the agent’s next action. Only text is required.

    {
      "text": "Your outstanding balance is ₦45,000, due on April 20th. Would you like to pay now?",
      "end_call": false,
      "transfer_to": null,
      "dtmf": null
    }

| Field | Type | Description |
|---|---|---|
| text | string | What the agent should say next. Rendered via TTS and played to the caller. Required. |
| end_call | boolean | If true, Voicebip speaks text, then hangs up cleanly. |
| transfer_to | string | E.164 number to transfer the caller to after speaking text. |
| dtmf | string | DTMF digits to play on the line (e.g. "1" to navigate an IVR). |

Each turn does exactly one thing: continue the conversation (text only), end the call, transfer the caller, or play DTMF. Set at most one of end_call, transfer_to, and dtmf per response.
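
One way to enforce the one-action rule on your side is a small builder that refuses conflicting fields before the response leaves your server. This is a sketch, not part of any Voicebip SDK; build_turn_response is a hypothetical name.

```python
from typing import Optional


def build_turn_response(
    text: str,
    *,
    end_call: bool = False,
    transfer_to: Optional[str] = None,
    dtmf: Optional[str] = None,
) -> dict:
    """Build a turn response that carries at most one call-control action."""
    if not text:
        raise ValueError("text is required on every response")
    chosen = [
        name
        for name, is_set in (
            ("end_call", end_call),
            ("transfer_to", transfer_to is not None),
            ("dtmf", dtmf is not None),
        )
        if is_set
    ]
    if len(chosen) > 1:
        raise ValueError(f"pick one action per turn, got: {', '.join(chosen)}")
    # Mirrors the response shape shown above; None serialises to JSON null.
    return {"text": text, "end_call": end_call, "transfer_to": transfer_to, "dtmf": dtmf}
```

Raising locally gives you a stack trace in your own logs rather than a failed turn on a live call.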

Voice Pipeline

When a call comes in on a BYOM agent:

Caller speaks
→ the voice pipeline (RTP media)
→ STT transcription
→ POST to your webhook_url
→ { text, end_call?, transfer_to?, dtmf? }
→ TTS synthesis
→ Caller hears the response

Your webhook sits squarely on the latency critical path. Voicebip enforces a 5-second hard timeout per request; responses that arrive later are treated as a webhook failure.

Latency Budget

The voice pipeline targets < 800ms P95 from your webhook response to TTS playback start. To stay inside a natural-feeling turn:

  • < 300ms is the target for your webhook round-trip end-to-end.
  • < 500ms + webhook RTT is the realistic total turn latency callers perceive.
  • 5s is the hard timeout. Beyond that, the turn is dropped.

Practical implications:

  • Host your webhook in the same region as the call. EU/Africa regions are recommended for Nigerian traffic; US regions add ~200ms baseline.
  • Use a warm, persistent HTTP server. Cold-started serverless functions blow the budget on the first turn of every call.
  • Stream from your LLM internally and return as soon as you have the first complete sentence rather than waiting for the full completion. The webhook response itself is a single JSON body, so that sentence becomes the text for the turn.
  • For tool calls that can’t finish in 300ms, return a filler like "One moment while I check that..." and resolve the real answer on the next turn.

See Best Practices for the full latency table.

Conversation History

Voicebip automatically trims messages to the last 20 turns before sending. This keeps payloads small and LLM context windows efficient. You do not need to maintain your own session store — pass the array straight through to your LLM as conversation context, keyed by call_id if you need to correlate with anything else on your side.

If you need the full transcript, subscribe to the call.completed event — it carries the complete turn-by-turn history when the call ends.

Errors and Failure Modes

BYOM has no hosted fallback. An agent configured with ai_provider: byom will only use your webhook — if it’s unreachable or the configuration is missing, the turn fails and the caller hears the configured failure prompt.

| Failure | Behavior |
|---|---|
| webhook_url missing | Agent creation/update fails validation. No valid BYOM agent without one. |
| Webhook returns non-200 | Turn fails. Per-workspace circuit breaker records the failure. |
| Webhook times out (> 5s) | Same as non-200: turn fails, breaker records the failure. |
| Webhook returns invalid JSON | Turn fails. |
| Response body > 1 MB | Truncated; turn fails. Keep responses small. |

Circuit Breaker

Each workspace has a BYOM circuit breaker. After 5 consecutive failures or > 50% error rate over 30 seconds, the breaker trips to OPEN and new turns fail fast for 30 seconds before entering HALF-OPEN probe mode. This protects the voice pipeline from a broken webhook dragging down every call on the workspace.

Watch the byom.webhook.errors and byom.breaker.state metrics (or the equivalent dashboard panels) if your agents go quiet.
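
To reason about when the breaker would trip, a toy model of the thresholds described above can help. This is not Voicebip's implementation. In particular, the min_samples guard on the error-rate rule and the half-open behaviour (one successful probe closes, one failed probe re-opens) are assumptions that this page does not specify.

```python
from collections import deque


class BreakerModel:
    """Toy model of the documented thresholds, driven by explicit timestamps."""

    def __init__(self, min_samples: int = 10):
        self.state = "CLOSED"
        self.consecutive_failures = 0
        self.window = deque()  # (timestamp, ok) pairs from the last 30 s
        self.opened_at = 0.0
        self.min_samples = min_samples  # assumption: not stated in the docs

    def allow(self, now: float) -> bool:
        """Return True if a turn may reach the webhook at time `now`."""
        if self.state == "OPEN" and now - self.opened_at >= 30:
            self.state = "HALF_OPEN"  # cool-down over, let a probe through
        return self.state != "OPEN"

    def record(self, ok: bool, now: float) -> None:
        """Record the outcome of one webhook call made at time `now`."""
        if self.state == "HALF_OPEN":
            # Assumption: a good probe closes, a bad probe re-opens.
            if ok:
                self._reset("CLOSED", now)
            else:
                self._reset("OPEN", now)
            return
        self.window.append((now, ok))
        while self.window and now - self.window[0][0] > 30:
            self.window.popleft()
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1
        failures = sum(1 for _, good in self.window if not good)
        rate_tripped = (
            len(self.window) >= self.min_samples
            and failures / len(self.window) > 0.5
        )
        if self.consecutive_failures >= 5 or rate_tripped:
            self._reset("OPEN", now)

    def _reset(self, state: str, now: float) -> None:
        self.state = state
        self.opened_at = now
        self.consecutive_failures = 0
        self.window.clear()
```

Feeding it five failures in a row moves it to OPEN, and allow() starts returning True again 30 seconds later, which matches the fail-fast window described above.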

Security

HMAC Signature Verification

Voicebip signs every BYOM request (voice and messaging) with two headers:

X-Voicebip-Signature: sha256=<hex_digest>
X-Voicebip-Timestamp: <unix_seconds>

The signature covers "{timestamp}.{raw_body}" — the timestamp string, a literal dot, then the exact raw request body bytes:

HMAC-SHA256(workspace_signing_secret, "{X-Voicebip-Timestamp}.{raw_body}")
digest = hex.EncodeToString(hmac_bytes)
expected_header = "sha256=" + digest

Verification checklist:

  1. Read X-Voicebip-Timestamp and reject if |now - timestamp| > 300 s (replay protection).
  2. Compute the HMAC over the raw body bytes — before any JSON parsing.
  3. Compare in constant time (hmac.Equal in Go, hmac.compare_digest in Python), never with a plain string compare.
  4. Reject requests where the signature is absent or wrong.

    # Python
    import hashlib
    import hmac
    import time

    def verify(secret: str, timestamp: str, raw_body: bytes, sig_header: str) -> bool:
        try:
            skew = abs(time.time() - int(timestamp))
        except ValueError:
            return False  # malformed timestamp header
        if skew > 300:
            return False  # outside the replay window
        message = f"{timestamp}.".encode() + raw_body
        digest = hmac.new(secret.encode(), message, hashlib.sha256).hexdigest()
        expected = "sha256=" + digest
        # Compare as bytes in constant time; str arguments must be ASCII-only.
        return hmac.compare_digest(expected.encode(), sig_header.encode())

    // Go
    import (
        "crypto/hmac"
        "crypto/sha256"
        "encoding/hex"
        "strconv"
        "time"
    )

    func Verify(secret, timestamp string, body []byte, sigHeader string) bool {
        ts, err := strconv.ParseInt(timestamp, 10, 64)
        // Go has no built-in integer abs, so check the skew in both directions.
        if age := time.Now().Unix() - ts; err != nil || age > 300 || age < -300 {
            return false
        }
        mac := hmac.New(sha256.New, []byte(secret))
        mac.Write([]byte(timestamp + "."))
        mac.Write(body)
        expected := "sha256=" + hex.EncodeToString(mac.Sum(nil))
        return hmac.Equal([]byte(expected), []byte(sigHeader))
    }

Secret Rotation

When you rotate via POST /v1/workspace/signing-secret/rotate, Voicebip moves the old secret to a 24-hour grace window. During that window, two signature headers appear on every BYOM request:

  • X-Voicebip-Signature — new secret
  • X-Voicebip-Signature-Previous — old secret

Accept either as valid during your rollout, then stop accepting the old one after 24 hours. This prevents dropped turns while you deploy the new secret.
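
During the grace window, a handler could check both headers against whichever secrets it currently holds, as in the sketch below. It assumes headers arrive as a plain dict with the exact header names shown above (real frameworks are usually case-insensitive), and it leaves out the timestamp freshness check, which the verification checklist already covers. The names sign and verify_either are hypothetical.

```python
import hashlib
import hmac

SIGNATURE_HEADERS = ("X-Voicebip-Signature", "X-Voicebip-Signature-Previous")


def sign(secret: str, timestamp: str, raw_body: bytes) -> str:
    """Compute the expected header value for one secret."""
    message = f"{timestamp}.".encode() + raw_body
    return "sha256=" + hmac.new(secret.encode(), message, hashlib.sha256).hexdigest()


def verify_either(headers: dict, raw_body: bytes, known_secrets: list[str]) -> bool:
    """Accept the request if any known secret matches either signature header."""
    timestamp = headers.get("X-Voicebip-Timestamp", "")
    for secret in known_secrets:
        expected = sign(secret, timestamp, raw_body).encode()
        for name in SIGNATURE_HEADERS:
            received = headers.get(name)
            if received and hmac.compare_digest(expected, received.encode()):
                return True
    return False
```

A server that still holds only the old secret matches on X-Voicebip-Signature-Previous, and one that already has the new secret matches on X-Voicebip-Signature, so a staggered deploy does not drop turns. Once the 24 hours are up, pass only the new secret.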

Other security practices

  • Idempotency: Voicebip does not retry BYOM turns (unlike event webhooks), but you may receive duplicate call_id + transcription pairs if a caller repeats themselves. Don’t assume uniqueness.
  • Never trust transcription as sanitized input. Treat it as user content and pass it to your LLM through the same guards you’d use for any user text.
  • Log call_id and agent_id on every turn so you can correlate with Voicebip’s request IDs when debugging.

Minimal Handler

A working Node.js/Express handler with HMAC verification lives in Code Examples. The shape below is the bare minimum if you already have signature verification wired up.

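    // The handler must be async because it awaits the model call.
    app.post("/voicebip/byom", async (req, res) => {
      try {
        // Assumes a JSON body parser such as express.json() is installed.
        const { call_id, transcription, messages } = req.body;

        // myLLM is a placeholder for your own model client.
        const reply = await myLLM.complete({
          system: "You are a friendly loan recovery agent for a Nigerian fintech.",
          messages: messages.map((m) => ({
            // Map Voicebip roles onto chat-style roles.
            role: m.role === "caller" ? "user" : "assistant",
            content: m.text,
          })),
        });

        res.json({ text: reply, end_call: false });
      } catch (err) {
        // Fail fast so the turn errors now instead of hitting the 5s timeout.
        res.sendStatus(500);
      }
    });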

Flask and Go versions are also in Code Examples.

Testing Your Webhook

Before wiring a real number, test the turn contract end-to-end:

  1. Point the agent at an ngrok tunnel, for example webhook_url: "https://abc123.ngrok-free.app/voicebip/byom".
  2. Use the sandbox — any API key prefixed with pk_test_ routes to sandbox mode. Sandbox calls synthesize transcription events and exercise the full BYOM path without touching real SIP/SMPP or being billed.
  3. Fire a test lifecycle event with POST /v1/webhooks/test to confirm your event handler works before a live call. See Webhooks → Testing for the full list of supported event_type values and a BYOM-specific end-to-end testing checklist.
  4. Place a real call to an agent’s number (still in sandbox) and watch your handler logs plus the dashboard’s live transcript view.
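
Before running through the steps above, it can also be useful to hit your handler locally with a synthetic, correctly signed turn. The sketch below builds the body and headers using the signing scheme from HMAC Signature Verification. The function names and the sample payload are hypothetical, and the secret you pass should be whatever your handler is configured to verify against.

```python
import hashlib
import hmac
import json
import time
import urllib.request
from typing import Optional


def signed_turn(secret: str, payload: dict, now: Optional[int] = None) -> tuple[bytes, dict]:
    """Serialise a turn request and compute matching signature headers."""
    raw_body = json.dumps(payload).encode()
    timestamp = str(int(time.time()) if now is None else now)
    message = f"{timestamp}.".encode() + raw_body
    digest = hmac.new(secret.encode(), message, hashlib.sha256).hexdigest()
    headers = {
        "Content-Type": "application/json",
        "X-Voicebip-Timestamp": timestamp,
        "X-Voicebip-Signature": "sha256=" + digest,
    }
    return raw_body, headers


def post_turn(url: str, secret: str, payload: dict) -> dict:
    """POST one signed turn to your handler and return its JSON reply."""
    raw_body, headers = signed_turn(secret, payload)
    request = urllib.request.Request(url, data=raw_body, headers=headers, method="POST")
    # Same 5-second ceiling that Voicebip enforces per turn.
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.loads(response.read())
```

For example, post_turn("http://localhost:3000/voicebip/byom", secret, {"agent_id": "agt_test", "call_id": "call_test", "transcription": "hello", "messages": []}) should come back with a body containing text if the handler and its signature check are wired correctly. The localhost URL and IDs here are placeholders.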

Gemini Live Mode

When you set ai_provider to gemini, Voicebip uses Gemini 2.0 Flash Live — a bidirectional audio streaming mode that bypasses the traditional STT → LLM → TTS pipeline entirely and saves ~400–600ms per turn.

The three pipeline modes

| Mode | How it works | Latency profile |
|---|---|---|
| byom | Voicebip runs STT, posts transcript to your webhook, runs TTS on your reply | STT + your round-trip + TTS |
| openai | Voicebip runs STT, calls the hosted LLM, runs TTS | STT + LLM + TTS (~500ms P95 target) |
| gemini | Raw audio streams bidirectionally to Gemini 2.0 Flash Live; no separate STT or TTS step | Gemini Live’s own pipeline (saves ~400–600ms per turn) |

Turn detection and barge-in

In Gemini Live mode, Gemini owns turn detection entirely — it decides when the caller has finished speaking and when the agent should respond.

  • Turn sensitivity is adjusted through Google’s Gemini Live session configuration, not through Voicebip agent settings.
  • The idle-silence prompt (played after a long caller silence) only applies to the classic STT path and does not run in Gemini Live mode.
  • Barge-in behavior is governed by Gemini’s real-time audio processing rather than the Voicebip barge coordinator used on classic STT calls.

SSE transcript stream

In Gemini Live mode, live transcript events may not appear in the SSE transcript stream (/v1/calls/{id}/transcript/stream). For authoritative call transcripts, use the call.completed webhook event — it carries the full turn-by-turn history when the call ends.

Messaging BYOM (SMS + WhatsApp)

The same webhook_url on an agent also handles inbound SMS and WhatsApp messages. Each inbound message triggers a POST to your webhook with the conversation history — route on channel to handle SMS and WhatsApp separately.

Messaging Request

    {
      "agent_id": "agt_PAEZ_njcfm2kycpjs",
      "conversation_id": "conv_msg_01HXY...",
      "channel": "whatsapp",
      "from_number": "+2348012345678",
      "to_number": "+2349012345678",
      "inbound_message": {
        "message_id": "msg_01HXY...",
        "body": "Hello, I need help with my account.",
        "type": "text",
        "received_at": "2026-05-23T12:34:56Z"
      },
      "history": [
        { "direction": "inbound", "body": "Hello, I need help with my account.", "created_at": "2026-05-23T12:34:56Z" },
        { "direction": "outbound", "body": "Hi! How can I help you today?", "created_at": "2026-05-23T12:35:01Z" }
      ],
      "metadata": {
        "whatsapp": {
          "window_open": true,
          "window_expires_at": "2026-05-24T12:34:56Z"
        }
      }
    }

| Field | Type | Description |
|---|---|---|
| agent_id | string | The agent this conversation belongs to. |
| conversation_id | string | Stable ID for the conversation. Use as session key. |
| channel | string | "whatsapp" or "sms" |
| from_number | string | Sender E.164 number. |
| to_number | string | Your Voicebip number (E.164). |
| inbound_message | object | The message that triggered this turn (see below). |
| history | array | Prior turns, oldest first, capped at 20. Empty on first turn. |
| metadata.whatsapp | object | Present on WhatsApp: window_open, window_expires_at. |
| metadata.sms | object | Present on SMS: encoding (GSM7/UCS2), segments. |

Messaging Response

    {
      "reply": "I can help you with that. Can I have your account number?"
    }

| Field | Type | Description |
|---|---|---|
| reply | string | Free-form reply text. Max 160 chars (GSM-7 single segment) on SMS. |
| end_conversation | boolean | Close the conversation. |
| escalate | string | Reason for escalating to a human. Sets ai_mode=human. |
| template_name | string | WhatsApp pre-approved template name (for closed 24h windows). |
| template_params | string[] | Values for {{1}}, {{2}} placeholders in the template. |

When multiple fields are set, precedence is: end_conversation > escalate > template_name > reply. An empty {} body is valid — use it to silently acknowledge and take action on your side without sending a reply.
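
The precedence rule and the channel-specific constraints can be sketched as two small helpers. effective_action reports which field would take effect under the documented order. plan_reply picks a response shape from the request's channel and WhatsApp window state. Both names are hypothetical, and truncating SMS text to 160 characters is one possible policy, not a Voicebip requirement.

```python
from typing import Optional

PRECEDENCE = ("end_conversation", "escalate", "template_name", "reply")


def effective_action(response: dict) -> Optional[str]:
    """Return the field that takes effect, or None for a silent acknowledgement."""
    for field in PRECEDENCE:
        if response.get(field):
            return field
    return None


def plan_reply(request: dict, text: str, fallback_template: str) -> dict:
    """Choose a messaging response shape for the request's channel."""
    if request.get("channel") == "whatsapp":
        window = request.get("metadata", {}).get("whatsapp", {})
        if not window.get("window_open", False):
            # Closed 24h window: only a pre-approved template can be sent.
            return {"template_name": fallback_template}
    elif request.get("channel") == "sms":
        # Assumed policy: keep the reply inside one 160-character segment.
        text = text[:160]
    return {"reply": text}
```

Running effective_action over a response before returning it is a cheap way to catch cases where a reply would be silently overridden by an escalate or end_conversation field set elsewhere in your code.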

HMAC signing is identical to voice BYOM — same headers (X-Voicebip-Signature, X-Voicebip-Timestamp), same algorithm, same rotation grace-period behaviour. See HMAC Signature Verification above.

Switching Modes Mid-Lifetime

Switching an agent between byom, openai, and gemini is a single PATCH on ai_provider. New calls use the new mode immediately; in-flight calls keep the mode they started with. This is the intended migration path if you prototype on hosted AI and later move to BYOM (or vice versa).

    curl -X PATCH "https://api.voicebip.com/v1/agents/{agent_id}" \
      -H "Authorization: Bearer pk_live_your_key" \
      -H "Content-Type: application/json" \
      -d '{"ai_provider": "byom", "webhook_url": "https://your-server.com/voicebip/byom"}'