VoiceAI
Word-level timestamping for Arabic speech — Kaldi to Whisper, with the alignment Whisper didn't ship.
NeuralSpace · Arabic speech intelligence
Overview
The speech team owned the models and brought me into their R&D. The brief: accurate Arabic speech recognition and, increasingly, knowing exactly when each word was spoken.
01 — The model
Kaldi to Whisper
We started where serious ASR did — Kaldi, the classic hybrid pipeline. Then Whisper changed the math: an end-to-end model where accuracy scaled with data. The strategy got simple — feed it more, and better, Arabic audio.
02 — The hard ask
When each word was said
Customers wanted word-level timestamps. Whisper, at the time, only returned timestamps per segment. We added word-level timing by aligning the transcript to the audio: Whisper's cross-attention weights run through dynamic time warping, with a wav2vec2 forced aligner for the hard cases, so every word landed on its own start and end time.
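To make the alignment step concrete, here is a minimal sketch of dynamic time warping over a token-by-frame attention matrix. It is an illustration under stated assumptions, not the production code: the toy `attention` matrix, the `word_of_token` mapping, the choice of `1 - weight` as the cost, and the 20 ms frame duration are all assumed for the example, and the wav2vec2 fallback is not shown.

```python
from typing import Dict, List, Tuple

SEC_PER_FRAME = 0.02  # assumption: one attention frame covers 20 ms of audio


def dtw_path(cost: List[List[float]]) -> List[Tuple[int, int]]:
    """Return the cheapest monotonic (token, frame) path through `cost`."""
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    # acc[i][j] = cheapest cost of any path ending at token i-1, frame j-1
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
            acc[i][j] = cost[i - 1][j - 1] + best_prev
    # Walk back from the last cell, always stepping to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def word_times(
    attention: List[List[float]],
    word_of_token: List[int],
    words: List[str],
    sec_per_frame: float = SEC_PER_FRAME,
) -> List[Tuple[str, float, float]]:
    """Turn a token-by-frame attention matrix into (word, start, end) tuples."""
    # Assumed cost: high attention means low cost.
    cost = [[1.0 - w for w in row] for row in attention]
    spans: Dict[int, Tuple[int, int]] = {}
    for token, frame in dtw_path(cost):
        word = word_of_token[token]
        first, last = spans.get(word, (frame, frame))
        spans[word] = (min(first, frame), max(last, frame))
    return [
        (words[w], round(s * sec_per_frame, 2), round((e + 1) * sec_per_frame, 2))
        for w, (s, e) in sorted(spans.items())
    ]


# Toy input: 3 tokens over 6 frames; tokens 1 and 2 belong to the second word.
attention = [
    [0.9, 0.8, 0.1, 0.0, 0.0, 0.0],
    [0.1, 0.1, 0.8, 0.9, 0.1, 0.0],
    [0.0, 0.1, 0.1, 0.1, 0.9, 1.0],
]
words = ["مرحبا", "بالعالم"]
word_of_token = [0, 1, 1]

for word, start, end in word_times(attention, word_of_token, words):
    print(word, start, end)
```

In this toy case the first word spans 0.0 to 0.04 s and the second 0.04 to 0.12 s. Grouping sub-word tokens back into words before reading off the span is what turns token alignment into word timing.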
03 — The fuel
Accuracy follows data
Since accuracy tracked data, we built for data: a collection pipeline and an augmentation layer that turned every clip into many variants through added noise, speed and pitch changes, and spectral masking. More coverage and more robustness, from the same source hours.
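As a rough illustration of how one clip becomes several, the sketch below implements three of the augmentations in plain Python. The function names, the SNR and speed values, and the mask widths are assumptions chosen for the example, not the pipeline's actual settings. Note that simple resampling changes pitch and duration together; shifting pitch on its own usually needs an extra time-stretch step, which is left out here.

```python
import math
import random
from typing import Dict, List


def add_noise(samples: List[float], snr_db: float, rng: random.Random) -> List[float]:
    """Add Gaussian noise scaled to a target signal-to-noise ratio in dB."""
    power = sum(x * x for x in samples) / len(samples)
    sigma = math.sqrt(power / (10 ** (snr_db / 10)))
    return [x + rng.gauss(0.0, sigma) for x in samples]


def change_speed(samples: List[float], factor: float) -> List[float]:
    """Resample by linear interpolation; factor > 1 plays faster (and higher)."""
    last = len(samples) - 1
    out = []
    for k in range(int(len(samples) / factor)):
        pos = k * factor
        i = min(int(pos), last)
        frac = pos - i
        out.append(samples[i] * (1 - frac) + samples[min(i + 1, last)] * frac)
    return out


def spec_mask(
    spec: List[List[float]], max_freq: int, max_time: int, rng: random.Random
) -> List[List[float]]:
    """Zero one random band of frequency bins and one random band of frames."""
    n_freq, n_time = len(spec), len(spec[0])
    f_width = rng.randint(0, min(max_freq, n_freq))
    t_width = rng.randint(0, min(max_time, n_time))
    f0 = rng.randint(0, n_freq - f_width)
    t0 = rng.randint(0, n_time - t_width)
    return [
        [
            0.0 if (f0 <= f < f0 + f_width or t0 <= t < t0 + t_width) else value
            for t, value in enumerate(row)
        ]
        for f, row in enumerate(spec)
    ]


def augment(samples: List[float], rng: random.Random) -> Dict[str, List[float]]:
    """Produce several waveform variants of one clip (values are illustrative)."""
    return {
        "noisy": add_noise(samples, snr_db=10.0, rng=rng),
        "slow": change_speed(samples, 0.9),
        "fast": change_speed(samples, 1.1),
    }


# Toy clip: 0.1 s of a 440 Hz tone at an assumed 16 kHz sample rate.
rng = random.Random(0)
clip = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
for name, variant in augment(clip, rng).items():
    print(name, len(variant))
```

Waveform-level changes (noise, speed) run before feature extraction, while `spec_mask` is applied to the spectrogram afterwards, so the two kinds of augmentation can be combined on the same clip.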
04 — Speaking back
Knowing when to build, when to buy
For Arabic ASR we ran our own fine-tuned Whisper, benchmarked against OpenAI's and ElevenLabs' models. For text-to-speech we built on Coqui TTS, but TTS lives or dies on data, and Arabic voice data was the bottleneck. So we made the pragmatic call: ship on ElevenLabs, and put our energy where we had the edge.
Coqui TTS
Built on it — the closest open option to ElevenLabs quality.
ElevenLabs
Chosen — Arabic voice data was the bottleneck, so data made the call.
Whisper heard Arabic. We taught it exactly when each word was said.
Role
Joined the speech team's R&D — the move to Whisper, the word-level alignment it didn't ship out of the box, and the data collection and augmentation pipeline behind the accuracy.