2022–24

VoiceAI

Word-level timestamping for Arabic speech — Kaldi to Whisper, with the alignment Whisper didn't ship.

NeuralSpace · Arabic speech intelligence

FIG. 07

Overview

The speech team owned the models and brought me into the R&D. The brief: accurate Arabic speech — and, increasingly, knowing exactly when each word was spoken.

01 — The model

Kaldi to Whisper

We started where serious ASR did — Kaldi, the classic hybrid pipeline. Then Whisper changed the math: an end-to-end model where accuracy scaled with data. The strategy got simple — feed it more, and better, Arabic audio.

02 — The hard ask

When each word was said

Customers wanted word-level timestamps. At the time, Whisper returned timestamps only per segment. We added word-level timing by aligning the transcript to the audio: Whisper's cross-attention weights run through dynamic time warping, with a wav2vec2 forced aligner for the hard cases, so every word got its own start and end time.
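
The dynamic time warping step can be sketched roughly as follows. This is a minimal illustration, not the production code: it assumes a (tokens x frames) attention matrix with values in [0, 1] has already been extracted from the decoder, and it uses `1 - attention` as the cost, which is an illustrative choice. The names `dtw_path` and `token_frame_spans` are hypothetical, and grouping tokens back into words is left out.

```python
import numpy as np

def dtw_path(cost):
    """Minimum-cost monotone path through a (tokens x frames) cost matrix."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)   # accumulated cost, padded by one
    acc[0, 0] = 0.0
    step = np.zeros((n + 1, m + 1), dtype=int)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # 0 = diagonal (next token, next frame)
            # 1 = up       (next token, same frame)
            # 2 = left     (same token, next frame)
            candidates = (acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
            k = int(np.argmin(candidates))
            acc[i, j] = cost[i - 1, j - 1] + candidates[k]
            step[i, j] = k
    # Walk back from the last cell to recover the path.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        k = step[i, j]
        if k == 0:
            i, j = i - 1, j - 1
        elif k == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def token_frame_spans(attention):
    """Return a (start_frame, end_frame_exclusive) pair for every token."""
    path = dtw_path(1.0 - attention)        # high attention = low cost
    spans = [[None, None] for _ in range(attention.shape[0])]
    for token, frame in path:               # path is ordered in time
        if spans[token][0] is None:
            spans[token][0] = frame
        spans[token][1] = frame + 1
    return [tuple(s) for s in spans]

# Toy example: three tokens, each attending to two consecutive frames.
attention = np.zeros((3, 6))
attention[0, 0:2] = 1.0
attention[1, 2:4] = 1.0
attention[2, 4:6] = 1.0
print(token_frame_spans(attention))  # [(0, 2), (2, 4), (4, 6)]
```

Multiplying the frame indices by the frame duration (assumed here to be about 20 ms per Whisper encoder frame) turns each span into start and end times in seconds.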

03 — The fuel

Accuracy follows data

Since accuracy tracked data, we built for data — a collection pipeline and an augmentation layer that turned every clip into many: noise, speed, pitch, spectral masking. More coverage and more robustness, from the same source hours.
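
As a rough sketch of what such an augmentation layer can look like, here are three of the four transforms in plain NumPy. Everything here is an assumption for illustration: the function names, the parameter ranges, and the use of simple resampling for speed are hypothetical, not the pipeline we ran. Pitch shifting without changing duration needs a phase vocoder or an audio library, so it is omitted.

```python
import numpy as np

def add_noise(wave, snr_db, rng):
    """Mix in Gaussian noise at a target signal-to-noise ratio (in dB)."""
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)
    return wave + noise

def change_speed(wave, factor):
    """Resample by linear interpolation; factor > 1 gives a shorter, faster clip.

    Plain resampling also shifts pitch along with speed.
    """
    n_out = int(round(len(wave) / factor))
    old_positions = np.arange(len(wave))
    new_positions = np.linspace(0, len(wave) - 1, n_out)
    return np.interp(new_positions, old_positions, wave)

def mask_spectrogram(spec, max_freq_width, max_time_width, rng):
    """Zero one random frequency band and one random time band of a copy."""
    spec = spec.copy()
    n_freq, n_time = spec.shape
    f = rng.integers(0, max_freq_width + 1)     # band height in bins
    f0 = rng.integers(0, n_freq - f + 1)
    spec[f0:f0 + f, :] = 0.0
    t = rng.integers(0, max_time_width + 1)     # band width in frames
    t0 = rng.integers(0, n_time - t + 1)
    spec[:, t0:t0 + t] = 0.0
    return spec

# One source clip becomes several training variants.
rng = np.random.default_rng(0)
clip = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)  # 1 s test tone
variants = [
    add_noise(clip, snr_db=10, rng=rng),
    change_speed(clip, 0.9),
    change_speed(clip, 1.1),
]
```

The waveform transforms run before feature extraction, while the masking runs on the spectrogram the model actually sees, so the two kinds compose freely.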

04 — Speaking back

Knowing when to build, when to buy

For Arabic ASR we ran our own fine-tuned Whisper, benchmarked against OpenAI's and ElevenLabs' models. For text-to-speech we built on Coqui TTS, but TTS lives or dies on data, and Arabic voice data was the bottleneck. So we made the pragmatic call: ship on ElevenLabs, and put our energy where we had the edge.

Coqui TTS: built on it; the closest open option to ElevenLabs quality.
ElevenLabs: chosen; Arabic voice data was the bottleneck, so data made the call.

Whisper heard Arabic. We taught it exactly when each word was said.

Role

Joined the speech team's R&D — the move to Whisper, the word-level alignment it didn't ship out of the box, and the data collection and augmentation pipeline behind the accuracy.
