01
Feasibility: can AI hold a real phone conversation?
Yes, with real caveats. In favorable conditions, AI handles a scripted, goal-directed outbound call well enough to get work done. It cannot yet reliably replicate natural, unstructured human conversation at scale. The gap between a polished vendor demo and a production line handling thousands of calls is 15 to 25 percentage points of task completion.
Where it works in production
- Appointment scheduling, reminders, and confirmation calls (bounded, low emotion). Medical practices have automated up to 91% of reservation and appointment calls.
- Outbound surveys and notifications (no negotiation, predictable script).
- B2B qualification follow-up with phone-savvy recipients in quiet environments.
- Debt and payment reminders with clear resolution paths.
Where it still breaks
- Emotional attunement: upset customers, sensitive news, awkward pauses.
- Accents and background noise (55-65 dB) cut transcription accuracy 15-30%, and ASR produces plausible near-misses ("cancel my order" heard as "schedule my order").
- Multi-turn degradation: instruction-following falls sharply after 3-5 turns; GPT-4o drops to 50% on multi-turn function calling.
- Adversarial callers: under price or complaint pressure, an LLM "eventually concedes" unless a function layer enforces the position.
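The "function layer" in the last bullet can be a small check that sits between the model and the caller. The following is a minimal sketch under stated assumptions: `LIST_PRICE`, `MAX_DISCOUNT_PCT`, and the `Offer` type are hypothetical names chosen for illustration and do not come from any particular platform.

```python
from dataclasses import dataclass

# Hypothetical policy values, for illustration only.
LIST_PRICE = 100.00
MAX_DISCOUNT_PCT = 10.0


@dataclass
class Offer:
    discount_pct: float  # discount the LLM wants to offer, in percent


def enforce_offer(proposed: Offer) -> Offer:
    """Clamp whatever the model proposes to the allowed range.

    The model can phrase the offer however it likes, but the number that is
    spoken and booked comes from this function, not from the model's text.
    """
    allowed = min(max(proposed.discount_pct, 0.0), MAX_DISCOUNT_PCT)
    return Offer(discount_pct=allowed)


def quoted_price(offer: Offer) -> float:
    """Price after applying an (already enforced) discount."""
    return round(LIST_PRICE * (1 - offer.discount_pct / 100), 2)
```

The design choice is that the position lives in code: however long the caller pushes, the model has no path to a number outside the configured range.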
Architecture note: a cascaded pipeline (speech-to-text, then LLM, then text-to-speech) dominates production in 2026 because it is easier to debug, easier to audit for compliance, and cheaper to run. End-to-end speech-to-speech (OpenAI Realtime, Gemini Live) preserves tone but costs roughly 10x more and follows instructions worse.
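To make the cascaded shape concrete, here is a minimal sketch of one caller turn. The stage functions (`fake_stt`, `fake_llm`, `fake_tts`) are stand-ins invented for this example; a real build would call vendor SDKs at each hop.

```python
from typing import Callable

# Each stage is a plain function so it can be swapped or logged independently.
SpeechToText = Callable[[bytes], str]
LanguageModel = Callable[[str], str]
TextToSpeech = Callable[[str], bytes]


def run_turn(audio_in: bytes, stt: SpeechToText, llm: LanguageModel,
             tts: TextToSpeech, log: list) -> bytes:
    """Run one caller turn through speech-to-text, LLM, and text-to-speech."""
    transcript = stt(audio_in)
    reply = llm(transcript)
    # The text at each hop is inspectable, which is what makes this
    # architecture easier to debug and audit than speech-to-speech.
    log.append({"heard": transcript, "said": reply})
    return tts(reply)


# Stand-in stages for illustration only.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")
```

Calling `run_turn(b"confirm my appointment", fake_stt, fake_llm, fake_tts, log)` returns the synthesized reply and leaves a text record of the turn in `log`.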
02
The stack: build, buy, or hybrid
Three shapes: a managed platform (Vapi, Bland, Retell), a self-assembled pipeline (telephony plus your own speech-to-text, LLM, and text-to-speech), or an open framework like Pipecat that gives you the pipeline plumbing without writing raw audio-socket code.
| Option | Shape | Barge-in | Best for |
|---|---|---|---|
| Retell AI | Managed, custom-LLM mode | Built in (~800ms) | Fastest credible MVP. Your server returns each line over a websocket. |
| Vapi | Managed, swappable parts | Built in (sub-600ms) | Quick ship, tool-calling via webhooks. |
| Bland.ai | Managed, visual pathways | On by default | No-code flows, bring-your-own Twilio. |
| Twilio / Telnyx | Telephony + media-stream socket | Your code (VAD) | The transport layer under any self-build. Telnyx is ~30-50% cheaper. |
| Pipecat | Open framework (MIT) | Built-in VAD processor | Self-host at scale; swap any provider; native tool-call handlers. |
Recommended path
MVP: Retell AI in custom-LLM mode. It owns all telephony and audio, fires a websocket event on every turn, and your thin server returns the next line. You get barge-in, warm transfer, analytics, and number provisioning without operating audio infrastructure.
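The "thin server" logic can be sketched as a pure function, independent of any platform. The event and response fields used here (`transcript`, `role`, `text`, `end_call`) are hypothetical and are not Retell's actual schema; the websocket transport is omitted, and keyword matching stands in for what an LLM would decide.

```python
import re

AI_DISCLOSURE = "You are speaking with an AI assistant."


def next_line(event: dict) -> dict:
    """Return the agent's next line for an appointment-confirmation call.

    event["transcript"] is assumed to be a list of
    {"role": "agent" | "user", "text": str} entries, oldest first.
    """
    transcript = event.get("transcript", [])
    if not transcript:
        # First turn: disclose, then state the purpose of the call.
        return {"text": f"{AI_DISCLOSURE} I'm calling to confirm your "
                        "appointment. Can you still make it?",
                "end_call": False}

    last_user = next(
        (t["text"].lower() for t in reversed(transcript) if t["role"] == "user"),
        "",
    )
    # Match whole words so "know" does not count as "no".
    words = set(re.findall(r"[a-z']+", last_user))

    if "stop calling" in last_user:
        return {"text": "Understood. You will not be called again. Goodbye.",
                "end_call": True}
    if "yes" in words:
        return {"text": "Great, you're confirmed. Goodbye.", "end_call": True}
    if "no" in words:
        return {"text": "No problem. Someone will reach out to reschedule. Goodbye.",
                "end_call": True}
    return {"text": "Sorry, I didn't catch that. Can you still make it?",
            "end_call": False}
```

Keeping this logic as a transport-free function is one way to carry it unchanged from a managed platform to a self-hosted pipeline later.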
Scale: migrate to Telnyx (telephony) + Pipecat (framework) + Deepgram Nova-3 (speech-to-text) + a text LLM like Claude Haiku + Cartesia Sonic (text-to-speech). All-in around $0.045/min, which is what makes resale at $0.15 to $0.25/min profitable. Keep your LLM logic identical across the move.
03
What a call actually costs
Costs are modeled at 150 words per minute of agent speech and a blended 3-minute connected call. Speech-to-text and the text LLM are rounding errors; text-to-speech and the realtime-audio LLM are where the money goes.
| Layer | Vendor | $/min |
|---|---|---|
| Telephony | Telnyx (US outbound) | $0.008 |
| Telephony | Twilio (US outbound) | $0.014 |
| Speech-to-text | Deepgram Nova-3 | $0.005 |
| LLM (text path) | GPT-4o-mini text | $0.002 |
| LLM (realtime audio) | OpenAI gpt-realtime | $0.10-0.30 |
| Text-to-speech | Cartesia Sonic | $0.034 |
| Text-to-speech | ElevenLabs Flash | $0.045 |

| Stack | $/min | 3-min call |
|---|---|---|
| Self-assembled budget (Telnyx + Deepgram + GPT-4o-mini + ElevenLabs Flash) | $0.060 | $0.18 |
| Self-assembled premium | $0.111 | $0.33 |
| Managed (Bland Build plan) | $0.120 | $0.36 |
| Managed (Retell, mid) | $0.165 | $0.50 |
| Native realtime audio (speech-to-speech, mid) | $0.165 | $0.50 |
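The budget-stack row can be reproduced by summing the component rates. This is a small sketch of that arithmetic; the dictionary keys are labels chosen for this example, and the rates are copied from the component table.

```python
# Per-minute rates in USD, copied from the component table.
RATES = {
    "telnyx": 0.008,
    "twilio": 0.014,
    "deepgram_nova3": 0.005,
    "gpt4o_mini_text": 0.002,
    "cartesia_sonic": 0.034,
    "elevenlabs_flash": 0.045,
}


def stack_cost_per_min(components: list) -> float:
    """Sum the per-minute rate of each layer in the stack."""
    return round(sum(RATES[c] for c in components), 3)


def call_cost(components: list, minutes: float = 3.0) -> float:
    """Cost of one connected call of the given length."""
    return round(stack_cost_per_min(components) * minutes, 2)


budget = ["telnyx", "deepgram_nova3", "gpt4o_mini_text", "elevenlabs_flash"]
print(stack_cost_per_min(budget))  # 0.06
print(call_cost(budget))           # 0.18
```

Swapping `elevenlabs_flash` for `cartesia_sonic` gives $0.049/min from the same rows, close to the roughly $0.045/min quoted for the scale stack.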
Getting a number is instant via Twilio (local $1.15/mo, toll-free $2.15/mo) or Telnyx (from $1.00/mo). For cold outreach, calls from local numbers are answered 30-60% more often than calls from toll-free numbers, but carriers scrutinize high volume from a single local line.
04
What you can charge, and the margin
Per-minute pricing is commoditizing fast (managed rates fell from $0.25/min in 2023 to $0.11-0.15 in 2026). The leverage is in outcome pricing, where the customer buys a booked meeting, not a minute.
| Model | Market price | Your cost | Gross margin |
|---|---|---|---|
| Per-minute markup | $0.20-0.50/min | $0.06-0.12/min | 60-75% |
| Per-call | $0.75-2.00/call | $0.18-0.50/call | 40-75% |
| Per-seat / month | $99-499/mo | by usage | 30-75% |
| Per-appointment booked | $50-300/appt | $15-40/appt | 80-95%+ |
The per-appointment math
At a realistic 5% connect rate and 10% book rate, one booked meeting takes ~200 dial attempts, costing roughly $16 on the budget stack (or $30-40 premium). Sell that meeting at $100-200 and gross margin before overhead is 75-90%. The anchor: a fully-loaded human SDR costs $300-500 per booked meeting, so AI undercuts by 3x to 10x. The catch is conversion risk: if connect rates fall because of carrier spam filtering, or your list is weak, the margin evaporates fast.
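The dial count and the margin sensitivity can be checked with a few lines of arithmetic. This is a sketch, not a pricing model: the $16 cost per booking is taken as given from the text, and the per-dial cost is derived from it (about $0.08) rather than stated anywhere in the source.

```python
def dials_per_booking(connect_rate: float, book_rate: float) -> float:
    """Expected dial attempts needed for one booked meeting."""
    return 1.0 / (connect_rate * book_rate)


def gross_margin(price: float, cost: float) -> float:
    """Gross margin as a fraction of price, before overhead."""
    return (price - cost) / price


dials = dials_per_booking(connect_rate=0.05, book_rate=0.10)
cost_per_booking = 16.0                           # budget stack, as stated
implied_cost_per_dial = cost_per_booking / dials  # derived assumption

print(round(dials))                                     # 200
print(round(gross_margin(100.0, cost_per_booking), 2))  # 0.84

# Sensitivity: halve the connect rate at the same per-dial cost.
weak_dials = dials_per_booking(connect_rate=0.025, book_rate=0.10)
weak_cost = weak_dials * implied_cost_per_dial
print(round(gross_margin(100.0, weak_cost), 2))         # 0.68
```

Under these assumptions, halving the connect rate doubles the dials per booking and takes a $100 meeting from an 84% margin to 68%.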
05
The legal landmine (read this first)
This is the part that decides whether the business survives. It is not optional hygiene; it is the current law, with a private right of action that lets any individual sue.
Every AI voice call is a regulated "artificial voice" call
On February 8, 2024 the FCC ruled (FCC 24-17) that AI-generated voice, including realtime synthesis and voice cloning, is an "artificial or prerecorded voice" under the TCPA. There is no "it sounds natural" exception and no B2B carve-out. Calling a cell phone with an AI voice without prior express consent is a violation worth $500 per call, or $1,500 if willful, with no cap. A 100,000-call campaign is $50M to $150M of exposure. TCPA class actions routinely settle for tens of millions of dollars.
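The exposure figure is simple multiplication, shown here as a short sketch using the per-call amounts stated above. The function name is chosen for this example.

```python
STATUTORY_PER_CALL = 500   # USD per violating call
WILLFUL_PER_CALL = 1_500   # USD per call if the violation is willful


def tcpa_exposure(calls: int) -> tuple:
    """Return (baseline, willful) statutory exposure in USD for a campaign."""
    return calls * STATUTORY_PER_CALL, calls * WILLFUL_PER_CALL


low, high = tcpa_exposure(100_000)
print(f"${low:,} to ${high:,}")  # $50,000,000 to $150,000,000
```

Because there is no cap, exposure scales linearly with call volume.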
The compliance floor, before the first call
- Consent. Prior express written consent for any marketing call; prior express consent for informational calls. Pre-checked boxes and buried fine print do not count.
- Calling hours. 8am to 9pm local time at the recipient's location (several states are stricter). Use the called party's physical location, not their area code.
- Do Not Call. Scrub every list against the National DNC Registry, keep an internal DNC for 5 years, honor opt-outs promptly.
- Per-call identity + opt-out. State the caller and company, give a callback, and offer an automated "stop calling" the recipient can trigger mid-call.
- AI self-disclosure. Open every call with "You are speaking with an AI assistant." It costs nothing and pre-empts the wave of state bot-disclosure laws (California SB 1001, Florida, Texas SB 140 with treble damages).
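Several items on this checklist can be enforced in code before a number is ever dialed. The following is a minimal sketch of such a pre-call gate; the `Contact` fields and the `may_call` function are hypothetical, and it covers only consent, do-not-call, and the federal 8am to 9pm window (stricter state rules would need their own checks).

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta, timezone, tzinfo


@dataclass
class Contact:
    phone: str
    tz: tzinfo                 # zone of the recipient's physical location
    has_written_consent: bool


def may_call(contact: Contact, now_utc: datetime, dnc: set) -> tuple:
    """Return (allowed, reason). now_utc must be timezone-aware."""
    if not contact.has_written_consent:
        return False, "no consent on record"
    if contact.phone in dnc:
        return False, "on do-not-call list"
    local = now_utc.astimezone(contact.tz).time()
    # Conservative reading: calls must start at or after 8:00 and before 21:00.
    if not (time(8, 0) <= local < time(21, 0)):
        return False, "outside 8am-9pm local time"
    return True, "ok"
```

The example takes any `tzinfo` so it runs without a timezone database; in practice `zoneinfo.ZoneInfo("America/New_York")` or similar would handle daylight saving time. The zone should come from the called party's physical location, not their area code.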
Deliverability: staying off "Spam Likely" the legitimate way
Three engines drive spam labels: Hiya (AT&T), TNS (Verizon), First Orion (T-Mobile). They flag high volume from one number, low answer rates, fast hang-ups, missing CNAM, and low STIR/SHAKEN attestation.
- Aim for STIR/SHAKEN Level A attestation: get numbers from a carrier under a verified business relationship and call only from numbers registered to you.
- Keep CNAM accurate and enroll in Branded Caller ID / Rich Call Data so your name and reason-for-call show on the recipient's screen.
- Register at FreeCallerRegistry.com, keep volume human-paced, and do not churn through many numbers (that pattern is itself a spam signal).
The honest framing: these tools exist to route identifiable, consented calls. They work against you precisely when you are calling people who did not ask to be called, which is exactly what they are designed to stop.
06
B2B reality: the "businesses are exempt" myth
The federal B2B exemption is narrow: a live human, manually dialing a business landline. The moment you use an autodialer, an AI voice, or call a mobile number (which is nearly every business contact today), the TCPA applies in full and prior express consent is required (prior express written consent for any marketing call).
Where AI genuinely fits B2B: list-qualification sweeps, appointment reminders and confirmations, after-hours callback capture, later-touch follow-ups, and automatic CRM logging. Where it fails: replacing the human on a genuinely cold conversation, complex enterprise accounts, and regulated verticals. The teams that win run a hybrid: AI does research, prioritization, dialing, coaching, and logging; humans hold the conversation that matters.
07
Verdict and the MVP path
AI outbound calling is real, commercially deployed, and cheap to run. It is not human-equivalent, and for the hardest conversations it will not be for another 2 to 3 years. The unit economics are genuinely strong; the business risk is almost entirely legal and reputational, not technical.
Build a narrow, consented, structured product
Appointment confirmations, reminders, inbound overflow, after-hours callback, and qualification of opted-in leads. Compliance baked in from call one: consent records, AI self-disclosure, automated opt-out, DNC scrubbing.
Avoid cold AI dialing of strangers
Maximum legal exposure, fastest path to spam-flagging and burned lists, and the worst conversion. This is where the $500-per-call math turns lethal.
The concrete first build
- Stand up Retell AI in custom-LLM mode with your reasoning server returning each line.
- Pick one structured, consented use case (appointment confirmation or inbound-overflow answering) where success is "task done," not "stranger persuaded."
- Bake compliance in: opening AI disclosure, consent capture, automated opt-out, DNC scrub, local calling hours.
- Price per outcome (per confirmed appointment or per booked meeting), not per minute, to capture the 80%+ margin and dodge the commoditizing per-minute race.
- When volume justifies it, migrate to Telnyx + Pipecat + Deepgram + Cartesia at ~$0.045/min and keep the same logic.