When Voice AI Meets the Suburbs

Read these out loud

How a local actually says it

Geelong

VIC · 3220

“J’long”

/dʒəˈlɒŋ/

Cairns

QLD · 4870

“Caans”

/kænz/

Lalor

VIC · 3075

“Law-ler”

/ˈlɔːlə/

Moe

VIC · 3825

“Mow-ee”

/ˈmoʊi/

Brewarrina

NSW · 2839

“Brew-ar-een”

/brʊˈwɑːrɪn/

Goonoo Goonoo

NSW · 2340

“Gunna g’noo”

/ˌɡʌnə ɡəˈnuː/

Wollombi

NSW · 2325

“Wol-lum-bye”

/wəˈlɒmbaɪ/

Creswick

VIC · 3363

“Crez-zick”

/ˈkrɛzɪk/

Many of these are Indigenous place names. Read them out loud, then check how a local actually says them. The gap between how a place is spelled and how it sounds is where most voice systems fail.

Initial impression

Give a modern speech model a sentence and it will happily transcribe it. But it might also write Walomby for Wollombi, Bath-est for Bathurst, and J'long for Geelong. In a contact centre, where accurate inputs drive down average handle time (AHT) and cost, a near-enough value means lost dollars and a lost opportunity to serve another customer: a question asked twice, a transfer that should never have happened.

The real challenge is mapping speech to a domain. For this article, the domain is a shortlist of roughly two thousand Australian suburbs, a good share of them Indigenous names that a general model has barely heard. Imagine a contact centre that needs to capture callers’ suburbs accurately and quickly.

The pipeline

Fig 1 — processing pipeline.

Because suburbs are a closed set, we can write them all down and stop the model from guessing. We use the list to bias the initial speech recognition toward known suburbs. We then post-process the transcript with a fast matching algorithm that aligns anything the recogniser still missed to a known suburb.
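
The post-processing step can be sketched in a few lines of Python. The article does not name the matching algorithm, so this sketch swaps in the standard library's difflib as a stand-in. The suburb list, the snap_to_suburb helper and the 0.6 cutoff are all assumptions made for illustration.

```python
import difflib

# Hypothetical closed set for illustration; the real list has roughly two thousand entries.
SUBURBS = [
    "Geelong", "Cairns", "Lalor", "Moe", "Brewarrina",
    "Goonoo Goonoo", "Wollombi", "Creswick", "Bathurst",
]

# Map lowercase spellings back to the canonical suburb name.
_LOOKUP = {name.lower(): name for name in SUBURBS}


def snap_to_suburb(heard, cutoff=0.6):
    """Return the closest known suburb, or None if nothing is similar enough.

    The cutoff is an assumed value; it would need tuning against real calls.
    """
    matches = difflib.get_close_matches(
        heard.lower(), list(_LOOKUP), n=1, cutoff=cutoff
    )
    return _LOOKUP[matches[0]] if matches else None


# Near-miss spellings from a recogniser get pulled back to the closed set.
for heard in ["Walomby", "Bath-est"]:
    print(heard, "->", snap_to_suburb(heard))
```

Returning None instead of the best available guess matters here: a word that is not a suburb at all should stay untouched rather than be forced onto the list.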

Quick Comparison

I can't claim this is scientific, but it does show that the approach works. This was a proof of concept on a laptop, and compared to the offerings of some big players it did pretty well!

The Experiment

Three systems ran against the same 14-second recording: a local pipeline running entirely on an M1 MacBook, Google Cloud Speech-to-Text with model adaptation, and Deepgram Nova-3. Each received identical suburb hints.

Errors are underlined in red; hover any underlined word to see the correct suburb it should have been.

The biggest, “smartest” model lost on this sentence despite being given the suburb list as hints. Its instinct to nudge unfamiliar words toward normal English is exactly the wrong behaviour for Brewarrina or Goonoo Goonoo. The local pipeline, running entirely offline, made only one error 😎. GCP made five and Deepgram made four.
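
One simple way to arrive at counts like these is to check which expected suburbs are missing from each transcript. The sketch below is an assumption about how such a tally could be done, not the scoring used in the experiment. The count_suburb_errors helper and the sample transcript are made up for illustration.

```python
def count_suburb_errors(expected, transcript):
    """Count expected suburbs that do not appear verbatim in the transcript."""
    text = transcript.lower()
    return sum(1 for suburb in expected if suburb.lower() not in text)


# Made-up example data, not one of the transcripts from the experiment.
expected = ["Wollombi", "Bathurst", "Geelong"]
transcript = "I moved from Walomby to Bathurst and then Geelong"

errors = count_suburb_errors(expected, transcript)
print(f"{errors} of {len(expected)} suburbs missed")
```

Scoring only the suburbs, rather than every word, keeps the number tied to what the contact centre actually needs to capture.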

What this means depends on where you sit

If you’re a CTO choosing between a cloud API and self-hosting: the cheapest option here ran on a laptop with no network round-trip and made the fewest mistakes. By bounding the domain with explicit hints, a small constrained model beat large general ones.

If you’re hiring for an AI engineering role: this is what the job actually looks like. Not a prompt, but data you curate, an eval that reflects real calls, and a model fast enough to meet your real-time interaction requirements.

If you run the contact centre today: the path forward is smaller than the brochures suggest. Build the suburb list, build the eval, bias at every stage. The first two are data work, not model work; the pipeline can even run on a laptop.