Abstract: Text simplification plays a crucial role in improving the accessibility and comprehensibility of written information for diverse audiences, including language learners and readers with limited literacy. Despite its importance, large-scale, high-quality datasets for training and evaluating text simplification models remain scarce for languages other than English. This paper reports an experimental study on collecting and processing crowd-sourced simplification data from comparable corpora to construct a corpus suitable for both training and testing text simplification systems in five languages (Catalan, English, French, Italian and Spanish). We describe mechanisms for deriving sentence-level alignments from document-level data. The resulting dataset of aligned sentence pairs is publicly available.
| Comments: | Accepted at BUCC 2026 workshop at LREC 2026 |
| Subjects: | Computation and Language (cs.CL); Artificial Intelligence (cs.AI) |
| ACM classes: | I.2.7; I.2.6 |
| Cite as: | arXiv:2605.09476 [cs.CL] (or arXiv:2605.09476v1 [cs.CL] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.09476 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Luis Kenji Hilasaca Sanchez
[v1] Sun, 10 May 2026 11:07:24 UTC (65 KB)
