When Voice AI Meets the Suburbs
Voice AI is hard. A bot might transcribe "Geelong" correctly, or it might write "Jlong", which is closer to how locals pronounce it. Transcribing everyday speech is the easy part; high accuracy in a specific domain (medicine, suburbs, aviation) is the hard part. This is a field note on the gap in between.
Read these out loud
How a local actually says it
Many of these are Indigenous place names. Read them out loud, then reveal how a local actually says them. That gap between how a place is spelled and how it sounds is where most voice systems fail.
Initial impression
Hand a modern speech model a sentence and it will happily transcribe it. But it might also write Walomby for Wollombi, Bath-est for Bathurst, and J'long for Geelong. In a contact centre, where accurate inputs drive down average handle time (AHT) and cost, a near-enough value means lost dollars and a lost opportunity to serve another customer: a question asked twice, a transfer that should never have happened.
The real challenge is mapping speech to a domain. For this article, the domain is a shortlist of roughly two thousand Australian suburbs, a good share of them Indigenous names that a general model has barely heard. Imagine a contact centre that needs to capture callers’ suburbs accurately and quickly.
The pipeline
Because suburbs are a closed set, we can write them all down to stop the model from guessing. We use that list to bias the initial speech recognition toward known suburbs. We then post-process the transcript with a fast algorithm that aligns any remaining misses to the closest known suburb.
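The post-processing step can be sketched in a few lines. This is a minimal illustration, not the pipeline used here: the article does not name its matching algorithm, so the sketch swaps in the similarity ratio from Python's standard `difflib`. The five-name list, the `snap_to_suburb` helper and the 0.6 cutoff are all assumptions made for the example.

```python
import difflib

# Hypothetical sample of the closed set; a real list would hold ~2,000 names.
SUBURBS = ["Wollombi", "Bathurst", "Geelong", "Brewarrina", "Goonoo Goonoo"]


def snap_to_suburb(heard, suburbs=SUBURBS, cutoff=0.6):
    """Return the known suburb closest to `heard`, or None if nothing is close enough.

    `cutoff` is an assumed threshold on difflib's similarity ratio (0.0 to 1.0).
    """
    # Compare case-insensitively, but return the canonical spelling.
    canonical = {name.lower(): name for name in suburbs}
    matches = difflib.get_close_matches(
        heard.lower(), list(canonical), n=1, cutoff=cutoff
    )
    return canonical[matches[0]] if matches else None


if __name__ == "__main__":
    for heard in ["Walomby", "Bath-est", "J'long", "pizza"]:
        print(heard, "->", snap_to_suburb(heard))
```

Returning None rather than a forced guess matters in a contact centre: a low-confidence capture can trigger a confirmation question instead of silently recording the wrong suburb. Spelling similarity is only a stand-in here; a phonetic comparison would likely suit this problem better, since the errors come from how names sound.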
Quick Comparison
I can't claim this is scientific, but it does show that the approach works. This was a proof of concept on a laptop, and compared with the offerings of some big players it did pretty well!
The Experiment
Three systems ran against the same 14-second recording: a local pipeline running entirely on an M1 MacBook, Google Cloud Speech-to-Text with model adaptation, and Deepgram Nova-3. Each received identical suburb hints.
Errors are underlined in red; hover any underlined word to see the correct suburb it should have been.
The biggest, “smartest” model lost on this sentence despite being given the suburb list as hints. Its instinct to nudge unfamiliar words toward normal English is exactly the wrong behaviour for Brewarrina or Goonoo Goonoo. The local pipeline, running entirely offline, made only one error 😎. Google Cloud made five and Deepgram made four.
What this means depends on where you sit
If you’re a CTO choosing between a cloud API and self-hosting: the cheapest option here ran on a laptop with no network round-trip and made the fewest mistakes. We got there by bounding the domain with explicit hints, which let a small, constrained model beat large general ones.
If you’re hiring for an AI engineering role: this is what the job actually looks like. Not a prompt, but data you curate, an eval that reflects real calls, and a model fast enough to meet your real-time interaction requirements.
If you run the contact centre today: the path forward is smaller than the brochures suggest. Build the suburb list, build the eval, bias at every stage. The first two are data work, not model work; the pipeline can even run on a laptop.
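To make "build the eval" concrete, here is a minimal sketch of what such a check could look like. It assumes each test call is labelled with the suburb the caller actually said, and that the system returns one captured suburb (or nothing) per call. The function names and the sample values are hypothetical and are not results from the experiment above.

```python
def suburb_errors(expected, predicted):
    """Return the (expected, predicted) pairs where the captured suburb is wrong."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must be the same length")
    return [
        (want, got)
        for want, got in zip(expected, predicted)
        # A missing capture (None) counts as an error; case differences do not.
        if want.casefold() != (got or "").casefold()
    ]


def capture_accuracy(expected, predicted):
    """Fraction of calls where the suburb was captured correctly."""
    if not expected:
        raise ValueError("need at least one labelled call")
    return 1 - len(suburb_errors(expected, predicted)) / len(expected)


if __name__ == "__main__":
    # Made-up labels and outputs, for illustration only.
    expected = ["Geelong", "Wollombi", "Bathurst"]
    predicted = ["geelong", "Walomby", None]
    print(suburb_errors(expected, predicted))
    print(round(capture_accuracy(expected, predicted), 2))
```

Scoring the captured suburb rather than the full transcript keeps the eval tied to what the contact centre cares about: a transcript can be almost perfect word for word and still get the one field that matters wrong.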