Your AI Agent Can Make Phone Calls Now. Here's How to Set It Up.

Your AI agent can browse the web, read your email, and manage your calendar. But can it pick up the phone?
Voice is having a moment. Restaurants are using AI to take reservations. Clinics confirm appointments with it. Real estate offices qualify leads over the phone without a human in the loop. If you've called a mid-size business recently, there's a decent chance the voice on the other end wasn't a person.
Most of this is happening in call centers and enterprise sales floors, though. What about your agent? The one running on your machine, connected to your tools, already aware of your schedule and preferences?
That part is catching up fast.
What a voice-enabled agent actually does
A voice-enabled AI agent can answer your phone when you're busy, take a message, and text you a summary. It can make outbound calls to schedule appointments, confirm reservations, or follow up with someone. It can join conference calls as a silent note-taker. It can handle the soul-crushing calls that nobody wants to make, like checking a bank balance through an automated menu or renewing a subscription.
The pieces exist. Speech-to-text is fast and accurate. Text-to-speech sounds convincingly human. Language models already understand context, follow instructions, and use tools. What's been missing is the plumbing that wires everything together and connects it to the rest of your agent's world.
How voice works under the hood
A voice-capable agent stacks several layers. Here's what each one does and why it matters.
Speech recognition (STT) converts audio from a phone call into text. Whisper, Deepgram, and AssemblyAI handle this. Latency is the thing you care about. You need transcription fast enough that the conversation doesn't feel laggy. The best systems now return transcripts in under 300ms.
Language model processing is your agent's brain doing its normal thing. The transcribed text hits Claude, GPT-4o, or Gemini. The model decides what to say, whether to call a tool, or whether to transfer to a human. Same model that handles your messages.
Text-to-speech (TTS) turns the response back into spoken audio. ElevenLabs, OpenAI TTS, and Play.ht all produce voices that sound natural. Some can clone your voice with a few minutes of sample audio, which is either very useful or very unsettling depending on who you ask.
Telephony integration is the actual phone connection. Twilio, Vonage, or Plivo give your agent a real phone number. They handle call routing, button presses (DTMF tones), voicemail, and recording.
Orchestration wires all of it together. Something needs to manage conversation state, route audio streams, handle interruptions when someone talks over the agent, and keep context across the full call. This is where your agent framework earns its keep.
Total round-trip latency, from someone speaking to hearing a response? With solid infrastructure, under 800ms. That's fast enough for natural conversation. Not perfect, but good enough that most people won't notice.
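Stripped of the streaming machinery, that loop is only three calls per turn. Here's a minimal, non-streaming sketch that uses OpenAI for all three stages (Whisper for STT, GPT-4o for the reply, OpenAI TTS for the audio) simply because one SDK covers them all; a real deployment streams each stage over the call to hit the latency numbers above, but the shape is the same.

```python
# Minimal turn-by-turn pipeline: one chunk of caller audio in, agent audio out.
# Production systems stream each stage instead of batching whole files.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

history = [{
    "role": "system",
    "content": "You answer phone calls on my behalf. Keep replies short and spoken-friendly.",
}]

def handle_turn(audio_path: str) -> bytes:
    """One conversational turn: caller audio in, agent audio out."""
    # 1. Speech-to-text: what did the caller just say?
    with open(audio_path, "rb") as f:
        user_text = client.audio.transcriptions.create(model="whisper-1", file=f).text
    history.append({"role": "user", "content": user_text})

    # 2. Language model: decide what to say (tool calls omitted for brevity).
    reply = client.chat.completions.create(model="gpt-4o", messages=history).choices[0].message.content
    history.append({"role": "assistant", "content": reply})

    # 3. Text-to-speech: turn the reply into audio to play back on the call.
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply)
    return speech.content  # raw audio bytes (MP3 by default)
```

Swap Deepgram or ElevenLabs into steps 1 and 3 and nothing else about the loop changes.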
Scenarios that work right now
I want to be honest about what's practical today versus what's still wishful thinking.
Scheduling appointments. Your agent calls the dentist or the hair salon. It knows your availability from your calendar, negotiates a time, handles the "how about Tuesday instead?" back-and-forth, and adds the event when confirmed. This works because both sides want the same thing: a time slot. The conversation is structured and predictable.
Taking inbound calls. Someone calls you, your agent picks up, takes a message, and sends you a summary. If it's your mom, it warm-transfers to your actual phone. If it's spam, it hangs up. Think of it as an answering machine that can carry a conversation and make judgment calls about who matters.
Restaurant reservations. "Table for two, Saturday at 7, outdoor seating if possible." Your agent calls the restaurant, makes the request, handles the counter-offer of 7:30, and confirms. Simple enough to work reliably. I'd trust it for this.
Navigating automated phone systems. Calling your pharmacy's refill line, pressing the right buttons, confirming a pickup. Calling back a doctor's office to confirm a time. These calls follow scripts, and agents are good at scripts.
Delivering short messages. "Call the plumber back and tell them Thursday morning works." The agent dials, says the thing, handles a quick question about timing, and reports back. Two minutes. Done.
Where voice agents still fall short
I won't pretend this works everywhere.
Complex negotiations require reading tone, building rapport, and adapting strategy in real time. Talking down a car price, arguing with an insurance adjuster, handling a sensitive customer complaint. Models can do basic back-and-forth, but they miss the social undercurrents that matter in these conversations.
Emotional situations. Calling to cancel a deceased family member's subscription. Comforting a friend. Anywhere genuine empathy matters, not performed empathy. AI can approximate this, but most people sense the difference.
Badly designed phone trees. Some automated systems are so poorly built that humans struggle with them. AI agents fare worse. They can't always parse garbled audio prompts or know whether silence means "please wait" or "we disconnected you."
High-stakes conversations. Anything where a misstatement has real consequences: legal discussions, medical consultations, financial negotiations. Models still occasionally get facts wrong, and the liability question is genuinely unsettled.
Tough audio conditions. Thick accents, poor connections, background noise. STT has improved enormously, but call your neighborhood restaurant during dinner rush and expect a few transcription misses.
Setting up a voice-capable agent
If you want to wire this up, here's the practical architecture.
Get a phone number. Twilio's Voice API lets you answer and make calls programmatically. A number costs about $1/month. Per-minute charges run $0.01-0.02.
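As a sketch of the inbound side, here's a small Flask webhook that answers a call and records a message. Twilio fetches this TwiML whenever your number rings; to pipe live audio to your STT instead of recording, you'd use Twilio's Media Streams rather than the Record verb.

```python
# Inbound call handler: Twilio POSTs here when the number rings and speaks
# whatever TwiML we return. Configure this URL on the number in the console.
from flask import Flask, Response
from twilio.twiml.voice_response import VoiceResponse

app = Flask(__name__)

@app.route("/voice", methods=["POST"])
def voice():
    resp = VoiceResponse()
    # Simplest possible behavior: greet the caller and take a message.
    resp.say("Hi, you've reached an AI assistant. Please leave a message after the tone.")
    resp.record(max_length=120)
    return Response(str(resp), mimetype="text/xml")
```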
Add speech processing. Route incoming audio to Deepgram or Whisper for transcription. Send outgoing text to ElevenLabs or OpenAI TTS for synthesis. Budget roughly $0.006 per minute for STT and around $0.30 per 1,000 characters for TTS with ElevenLabs.
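Those two calls are plain HTTP if you'd rather skip the SDKs. A rough sketch, with endpoints and response shapes as the providers document them at the time of writing (verify against current docs before building on this):

```python
# Speech processing over raw HTTP: Deepgram for STT, ElevenLabs for TTS.
import requests

DEEPGRAM_KEY = "..."    # your Deepgram API key
ELEVENLABS_KEY = "..."  # your ElevenLabs API key
VOICE_ID = "..."        # an ElevenLabs voice ID from your account

def transcribe(audio_bytes: bytes) -> str:
    r = requests.post(
        "https://api.deepgram.com/v1/listen",
        headers={"Authorization": f"Token {DEEPGRAM_KEY}", "Content-Type": "audio/wav"},
        data=audio_bytes,
    )
    return r.json()["results"]["channels"][0]["alternatives"][0]["transcript"]

def synthesize(text: str) -> bytes:
    r = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
        headers={"xi-api-key": ELEVENLABS_KEY},
        json={"text": text},
    )
    return r.content  # audio bytes, MP3 by default
```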
Wire it to your agent. When a call arrives, your agent decides how to respond based on who's calling, what time it is, what's on your calendar, and what you've told it to prioritize. OpenClaw handles this orchestration natively. Your agent already has access to your calendar, contacts, and preferences. Voice is another interface, not a different product.
Define the rules. This part matters more than the tech. What can your agent commit to on a call? Maybe it schedules but never agrees to spend money. Maybe it takes messages but never pretends to be you. Write these boundaries down. Be specific.
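As a hypothetical starting point, here's the kind of policy block you might feed into the agent's system prompt. Every line in it is invented for illustration; the point is that the boundaries are written down and explicit, not implied.

```python
# Hypothetical call-handling policy. How you inject it depends on your agent
# framework; what matters is that the rules exist in writing and are specific.
CALL_POLICY = """
You are handling phone calls on behalf of Alex.
- Identify yourself as an AI assistant at the start of every call.
- You may schedule, reschedule, or cancel appointments on Alex's calendar.
- You may NOT agree to any payment, purchase, or contract.
- You may NOT share Alex's address, birthdate, or account numbers.
- If the other party asks for a human, offer to have Alex call back.
- Keep calls under five minutes; summarize and end politely if they run long.
"""
```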
Call yourself first. Seriously. Call your own agent five times. See how it handles pauses, interruptions, and confused responses. Fix the obvious stuff before inflicting it on your dentist's receptionist.
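Assuming the webhook sketch above is deployed somewhere Twilio can reach, placing that test call to yourself is a few lines (the numbers and URL below are placeholders):

```python
# Place an outbound test call: Twilio dials your phone and, when you answer,
# fetches TwiML from the webhook URL to drive the conversation.
from twilio.rest import Client

client = Client()  # reads TWILIO_ACCOUNT_SID / TWILIO_AUTH_TOKEN from the environment

call = client.calls.create(
    to="+15551234567",                # your own phone
    from_="+15559876543",             # your Twilio number
    url="https://example.com/voice",  # webhook that returns TwiML
)
print(call.sid)
```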
What this costs
Voice adds a few line items to your agent's monthly bill:
| Component | Rough cost |
|---|---|
| Phone number (Twilio) | $1/month |
| Inbound minutes | $0.01/min |
| Outbound minutes | $0.015/min |
| STT (Deepgram) | $0.006/min |
| TTS (ElevenLabs) | ~$0.30/1K chars |
| Model API calls | Varies |
For a personal agent handling 30-50 calls per month at an average of 3 minutes each? You're looking at $15-25/month on top of your existing agent hosting.
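If you want to sanity-check that estimate, the arithmetic is short. The characters-per-call figure is my assumption (roughly 90 seconds of agent speech per call), not a provider quote:

```python
# Back-of-envelope monthly cost using the rates in the table above.
calls_per_month = 40          # middle of the 30-50 range
minutes_per_call = 3
chars_spoken_per_call = 1200  # assumption: ~90 seconds of agent speech

minutes = calls_per_month * minutes_per_call        # 120 minutes
telephony = 1 + minutes * 0.015                     # number rental + outbound minutes
stt = minutes * 0.006
tts = calls_per_month * chars_spoken_per_call / 1000 * 0.30
total = telephony + stt + tts                       # model API calls are extra

print(f"telephony ${telephony:.2f}, stt ${stt:.2f}, tts ${tts:.2f}, total ${total:.2f}")
# roughly: telephony $2.80, stt $0.72, tts $14.40 -> about $18/month before model costs
```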
That's less than a human assistant charges for one hour.
Privacy and consent
Voice brings up privacy questions that text conversations don't.
Recording laws vary by jurisdiction. Some states and countries require all parties to consent before a call is recorded. If your agent transcribes the call, that probably counts as recording. Check your local rules before deploying.
Should your agent say it's an AI? Ethically, I think yes. Legally, it depends where you are. Some places now mandate disclosure. Even where they don't, being upfront about it tends to go better than being caught. People don't like feeling tricked.
Call audio contains voice biometric data. Some privacy laws treat this as sensitive information. Make sure your STT provider lets you control data retention and opt out of model training.
And be respectful. The person on the other end didn't sign up for a conversation with an AI. If they seem uncomfortable, your agent should offer to have you call back yourself.
What's changing fast
Voice is moving faster than any other part of the AI agent stack right now.
Real-time voice models from OpenAI and Google can now process audio directly without a separate speech-to-text step. This cuts latency and lets the model hear tone, emphasis, and hesitation rather than reading a flat transcript. Conversations feel noticeably more natural.
Voice cloning lets services like ElevenLabs create a voice from a few minutes of your audio. Your agent could call the mechanic and sound like you. Make of that what you will.
Proactive calling is the thing I find most interesting. Instead of waiting for instructions, your agent notices your car registration is about to expire and calls the DMV. It spots a flight price drop and calls the airline. Agents that take initiative over the phone, not just in text, feel like a real shift in what "autonomous" means.
Try it with UniClaw
If you already have an OpenClaw agent running on UniClaw, adding voice means connecting a telephony provider and configuring voice skills. Your agent has the context it needs: calendar, contacts, preferences, communication history. Voice becomes one more way to reach the outside world.
UniClaw gives your agent a dedicated cloud machine with enough power for real-time audio processing. Zero-exposure firewall keeps call data protected. And because the agent runs around the clock, it picks up at 3 AM just as well as it does at 3 PM.
The agent that handles your inbox, schedules your meetings, and browses the web? It's ready for phone calls too.
Get started at uniclaw.ai.
Ready to deploy your own AI agent?
Get Started with UniClaw