[{"data":1,"prerenderedAt":194},["ShallowReactive",2],{"blog-/blog/call-thing-postmortem":3},{"_path":4,"_dir":5,"_draft":6,"_partial":6,"_locale":7,"title":8,"description":9,"date":10,"tags":11,"draft":6,"body":15,"_type":188,"_id":189,"_source":190,"_file":191,"_stem":192,"_extension":193},"/blog/call-thing-postmortem","blog",false,"","Building Call-Thing: A 7-Month AI Phone Agent Post-Mortem","I built a fully autonomous AI phone agent for restaurant reservations. Then OpenAI released Voice Mode and I killed the project. Here's what was under the hood.","2024-06-15",[12,13,14],"AI","Python","Post-Mortem",{"type":16,"children":17,"toc":182},"root",[18,26,31,36,43,48,104,110,120,130,151,161,167,172,177],{"type":19,"tag":20,"props":21,"children":22},"element","p",{},[23],{"type":24,"value":25},"text","Over seven months, I built Call-Thing—a fully autonomous AI phone agent designed to handle restaurant reservations. It dialed out via SIP, listened, transcribed, generated LLM responses, and spoke back in real-time.",{"type":19,"tag":20,"props":27,"children":28},{},[29],{"type":24,"value":30},"Then OpenAI released Voice Mode, and I killed the project.",{"type":19,"tag":20,"props":32,"children":33},{},[34],{"type":24,"value":35},"Here is a look under the hood of what I built before a single API update made it obsolete.",{"type":19,"tag":37,"props":38,"children":40},"h2",{"id":39},"the-architecture-7-parallel-processes",[41],{"type":24,"value":42},"The Architecture: 7 Parallel Processes",{"type":19,"tag":20,"props":44,"children":45},{},[46],{"type":24,"value":47},"To achieve real-time conversational speeds, I split the pipeline into seven parallel worker processes communicating over TCP sockets with a custom 10-byte-header IPC protocol.",{"type":19,"tag":49,"props":50,"children":51},"ul",{},[52,64,74,84,94],{"type":19,"tag":53,"props":54,"children":55},"li",{},[56,62],{"type":19,"tag":57,"props":58,"children":59},"strong",{},[60],{"type":24,"value":61},"Audio Capture:",{"type":24,"value":63}," Baresip streaming 48kHz Opus audio.",{"type":19,"tag":53,"props":65,"children":66},{},[67,72],{"type":19,"tag":57,"props":68,"children":69},{},[70],{"type":24,"value":71},"VAD & Silence Detection:",{"type":24,"value":73}," Silero VAD processing 400ms chunks, paired with a custom silence detector to stop Whisper from hallucinating on dead air.",{"type":19,"tag":53,"props":75,"children":76},{},[77,82],{"type":19,"tag":57,"props":78,"children":79},{},[80],{"type":24,"value":81},"Transcription & Agentic LLM:",{"type":24,"value":83}," Whisper transcribed speech to text, passing it to a fine-tuned GPT-3.5-turbo. Hooked up with function calling for live reservation modifications, it was essentially doing \"agentic AI\" before everyone started using the buzzword.",{"type":19,"tag":53,"props":85,"children":86},{},[87,92],{"type":19,"tag":57,"props":88,"children":89},{},[90],{"type":24,"value":91},"Caching:",{"type":24,"value":93}," Qdrant checked for semantic similarity (>0.85) to serve cached responses, saving both latency and cost.",{"type":19,"tag":53,"props":95,"children":96},{},[97,102],{"type":19,"tag":57,"props":98,"children":99},{},[100],{"type":24,"value":101},"TTS:",{"type":24,"value":103}," Google Cloud TTS generated audio for immediate SIP playback.",{"type":19,"tag":37,"props":105,"children":107},{"id":106},"the-hardest-battles",[108],{"type":24,"value":109},"The Hardest Battles",{"type":19,"tag":20,"props":111,"children":112},{},[113,118],{"type":19,"tag":57,"props":114,"children":115},{},[116],{"type":24,"value":117},"Python's GIL:",{"type":24,"value":119}," Threading could not beat the Global Interpreter Lock. Switching to multiprocessing in spawn mode (required for CUDA) was the breakthrough that finally enabled low-latency conversations.",{"type":19,"tag":20,"props":121,"children":122},{},[123,128],{"type":19,"tag":57,"props":124,"children":125},{},[126],{"type":24,"value":127},"Local vs. Cloud:",{"type":24,"value":129}," After weeks of testing local fine-tunes (Gemma, Mistral) on WSL, I pivoted to GPT-3.5-turbo. Relying on local models meant downtime whenever my PC in Germany was turned off—a dealbreaker for availability.",{"type":19,"tag":20,"props":131,"children":132},{},[133,138,140,149],{"type":19,"tag":57,"props":134,"children":135},{},[136],{"type":24,"value":137},"The TTS Rabbit Hole:",{"type":24,"value":139}," I spent a good chunk of time trying to get ",{"type":19,"tag":141,"props":142,"children":146},"a",{"href":143,"rel":144},"https://github.com/coqui-ai/TTS",[145],"nofollow",[147],{"type":24,"value":148},"Coqui TTS",{"type":24,"value":150}," running locally for faster response times. The quality was promising, but getting it to produce reliable, low-latency output in a real-time pipeline was a battle I eventually lost—Google Cloud TTS won on consistency.",{"type":19,"tag":20,"props":152,"children":153},{},[154,159],{"type":19,"tag":57,"props":155,"children":156},{},[157],{"type":24,"value":158},"The Docker Diet:",{"type":24,"value":160}," Packing ML models and audio libraries together initially resulted in a massive 65GB Docker image. Shrinking that down to 15GB through multi-stage builds and strict dependency pruning was a war story in itself.",{"type":19,"tag":37,"props":162,"children":164},{"id":163},"the-end",[165],{"type":24,"value":166},"The End",{"type":19,"tag":20,"props":168,"children":169},{},[170],{"type":24,"value":171},"By the end, Call-Thing was deployed on Scaleway. A TypeScript Job Coordinator watched a Firestore queue, spawning parallel Docker containers on demand. It supported English, German, Spanish, and Dutch, autonomously detecting and switching languages at runtime.",{"type":19,"tag":20,"props":173,"children":174},{},[175],{"type":24,"value":176},"It was an incredible technical gauntlet—right up until OpenAI's Voice Mode turned seven months of latency optimization and pipeline orchestration into a few lines of code. May it rest in peace.",{"type":19,"tag":20,"props":178,"children":179},{},[180],{"type":24,"value":181},"Maybe one day I'll resurrect it using ElevenLabs or Gemini Live.",{"title":7,"searchDepth":183,"depth":183,"links":184},2,[185,186,187],{"id":39,"depth":183,"text":42},{"id":106,"depth":183,"text":109},{"id":163,"depth":183,"text":166},"markdown","content:blog:call-thing-postmortem.md","content","blog/call-thing-postmortem.md","blog/call-thing-postmortem","md",1773128206838]