Abstract:Multilingual document understanding remains limited for low-resource languages due to scarce training data and model-based annotation pipelines that perpetuate existing biases. We introduce DocAtlas, a framework that constructs high-fidelity OCR datasets and benchmarks covering 82 languages and 9 evaluation tasks. Our dual pipelines, differential rendering of native DOCX documents and synthetic LaTeX-based generation for right-to-left scripts produce precise structural annotations in a unified DocTag format encoding layout, text, and component types, without learned models for core annotation. Evaluating 16 state-of-the-art models reveals persistent gaps in low-resource scripts. We show that Direct Preference Optimization (DPO) using rendering-derived ground truth as positive signal achieves stable multilingual adaptation, improving both in-domain (+1.9%) and out-of-domain (+1.8%) accuracy without measurable base-language degradation, where supervised fine-tuning degrades out-of-domain performance by up to 21%. Our best variant, DocAtlas-DeepSeek, improves +1.7% over the strongest baseline.
| Comments: | Under submission |
| Subjects: | Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) |
| Cite as: | arXiv:2605.12623 [cs.CL] |
| (or arXiv:2605.12623v1 [cs.CL] for this version) | |
| https://doi.org/10.48550/arXiv.2605.12623 arXiv-issued DOI via DataCite (pending registration) |
Submission history
From: Ahmed Heakl [view email]
[v1]
Tue, 12 May 2026 18:09:38 UTC (4,813 KB)
