DocAI
SOTA Arabic document understanding — pre-trained from scratch when nothing else worked.
NeuralSpace · Arabic document intelligence
Overview
Customers didn't want text pulled off a page. They wanted the document understood — in Arabic, where almost nothing worked well.
Getting there meant climbing through the entire OCR landscape, hitting its ceiling, and finally pre-training a model from scratch.
01 — The ladder
Every OCR approach, until one stuck
We started with Tesseract: synthetic training data and a full train-deploy-test pipeline. Results were good, but it broke on Arabic diacritics and skewed scans. DocTR did not move the needle. Transformer-based TrOCR finally fixed diacritics, but by then every inbound use case wanted more than characters off a page. They wanted the document understood.
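To make the diacritics failure mode concrete, here is a minimal sketch of how a diacritics-only OCR error could be detected. It assumes the marks of interest are the Arabic harakat in the Unicode range U+064B to U+0652; the function names are illustrative and not taken from the project.

```python
import unicodedata

# Arabic tanwin, short vowels, shadda and sukun occupy U+064B..U+0652.
# Assumption: these are the diacritics that matter for the comparison.
HARAKAT = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_diacritics(text: str) -> str:
    """Remove Arabic diacritic marks, keeping the base letters."""
    return "".join(ch for ch in text if ch not in HARAKAT)


def diacritics_only_mismatch(predicted: str, reference: str) -> bool:
    """True when the base letters agree but the diacritics differ."""
    # NFC puts stacked combining marks into a canonical order first,
    # so two encodings of the same text compare equal.
    pred = unicodedata.normalize("NFC", predicted)
    ref = unicodedata.normalize("NFC", reference)
    return pred != ref and strip_diacritics(pred) == strip_diacritics(ref)


# "kataba" with three fathas, and the same word as bare letters.
vocalised = "\u0643\u064E\u062A\u064E\u0628\u064E"
bare = "\u0643\u062A\u0628"
print(diacritics_only_mismatch(bare, vocalised))
```

An OCR output that returns the bare letters for a vocalised reference is flagged by this check, which is the kind of error a diacritics-blind pipeline produces.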
02 — The benchmark
A gold set of the hardest documents
So we built our own benchmark — a gold set of the weirdest, most complex documents customers actually sent. Then we evaluated the field, including the 2024 wave of vision-language models. None held up on Arabic. Pix2Struct showed the most promise; finetuned on Arabic it was okay — and okay wasn't the product.
- Donut
- LayoutLMv3
- UDOP
- Nougat
- Qwen2-VL
- Idefics2
- Pix2Struct
03 — From scratch
So we pre-trained our own
We built the engine ourselves. A web scraper with a random outline drawer collected 2M+ screenshots. On AWS SageMaker we pre-trained an Arabic Pix2Struct from scratch. A second engine synthesised a VQA dataset by simulating documents and adding noise augmentation, and we finetuned on the downstream task.
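The VQA synthesis step can be pictured roughly as follows. This is a simplified sketch, not the production engine: it assumes a simulated document is a dictionary of field names and values, that questions come from a single template, and that noise is salt-and-pepper corruption on a grayscale pixel grid. All names (`make_vqa_pairs`, `add_salt_and_pepper`) are hypothetical.

```python
import random


def make_vqa_pairs(fields: dict[str, str]) -> list[tuple[str, str]]:
    """Turn simulated document fields into (question, answer) pairs."""
    # Hypothetical question template; a real engine would vary phrasing
    # and language.
    return [(f"What is the {name}?", value) for name, value in fields.items()]


def add_salt_and_pepper(pixels: list[list[int]], amount: float,
                        rng: random.Random) -> list[list[int]]:
    """Return a copy of a grayscale grid with roughly `amount` of the
    pixels forced to black (0) or white (255)."""
    noisy = [row[:] for row in pixels]  # copy so the input stays untouched
    for row in noisy:
        for x in range(len(row)):
            if rng.random() < amount:
                row[x] = rng.choice((0, 255))
    return noisy


# A toy simulated document and a blank 4x8 mid-grey "page".
fields = {"invoice number": "INV-001", "issue date": "2024-03-01"}
page = [[128] * 8 for _ in range(4)]

pairs = make_vqa_pairs(fields)
noisy_page = add_salt_and_pepper(page, amount=0.1, rng=random.Random(0))
print(pairs[0])
```

Passing an explicit `random.Random` keeps each synthetic sample reproducible, which helps when a generated example needs to be regenerated for debugging.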
04 — Results
State of the art, and a clear trade
Zero diacritic errors. Real document understanding and Arabic VQA. OCR character error rate (CER) at 2%, state of the art. The one cost was throughput: a bulky model runs slower. But the market told us plainly that for documents like these, customers will pay more for accuracy.
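For readers unfamiliar with the metric, CER is conventionally the character-level edit distance between the OCR output and the reference, divided by the reference length. The sketch below shows that standard definition; the source does not say which tooling or normalisation was used for the reported figure, so treat the details as assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and
    substitutions needed to turn `a` into `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(prediction: str, reference: str) -> float:
    """Character error rate relative to the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(prediction, reference) / len(reference)


# One wrong character in a 50-character reference gives a CER of 2%.
reference = "a" * 50
prediction = "a" * 49 + "b"
print(cer(prediction, reference))
```

Because Python strings are sequences of code points, each Arabic diacritic counts as its own character here, so a dropped diacritic adds to the error rate just like a wrong letter.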
In production
- Ministry of Human Resources — Saudi Arabia
- Dubai RTA
- EDC Dubai
All are active users, running it on their most complex Arabic document use cases.
OCR was the easy part. Understanding Arabic was the product.
Role
Core developer and customer-facing lead. Drove the model research and the from-scratch pre-training, built the data engines (scraper and VQA synthesis), and ran training on SageMaker through to a SOTA Arabic document-understanding model in production.