Abstract: Many European languages possess rich biblical translation histories, yet existing corpora, in prioritizing linguistic breadth, often fail to capture this depth. To address this gap, we introduce a multilingual corpus of 651 New Testament translations, of which 334 are unique, spanning five languages with 2.4-5.0x more translations per language than any prior corpus: English (194 unique versions from 390 total), French (41 from 78), Italian (17 from 33), Polish (29 from 48), and Spanish (53 from 102). Aggregated from 12 online biblical libraries and one preexisting corpus, each translation is annotated with metadata that maps the text to a standardized identifier for the work, its specific edition, and its year of revision. This canonicalization allows researchers to define "uniqueness" for their own needs: they can perform micro-level analyses on translation families, such as the KJV lineage, or conduct macro-level studies by deduplicating closely related texts. By providing the first multilingual resource with sufficient depth per language for flexible, multilevel analysis, the corpus fills a gap in the quantitative study of translation history.
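The two notions of "uniqueness" described in the abstract can be illustrated with a minimal sketch. It assumes each translation record carries a work identifier, an edition identifier, and a revision year. The field names (`work_id`, `edition_id`, `revision_year`), the source names, and the sample records are hypothetical and are not taken from the corpus or its actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Translation:
    # All field names are assumptions made for illustration only.
    source: str         # library the text was collected from
    language: str
    work_id: str        # standardized identifier for the work
    edition_id: str     # identifier for the specific edition
    revision_year: int  # year of revision


def group_by(records, key):
    """Group records by an arbitrary key function."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    return dict(groups)


# Invented sample records; identifiers and sources are placeholders.
records = [
    Translation("library_a", "en", "kjv", "kjv-1769", 1769),
    Translation("library_b", "en", "kjv", "kjv-1769", 1769),
    Translation("library_a", "en", "kjv", "kjv-1611", 1611),
    Translation("library_c", "en", "other", "other-2000", 2000),
]

# Micro level: keep every distinct edition, collapsing only exact re-hostings.
by_edition = group_by(
    records, lambda r: (r.work_id, r.edition_id, r.revision_year)
)

# Macro level: collapse a whole translation family into one entry.
by_work = group_by(records, lambda r: r.work_id)
```

In this toy example the four collected texts reduce to three entries under the edition-level key and two under the work-level key, which is the kind of choice the metadata is meant to leave to the researcher.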
| Comments: | v3: fixed duplicated references section heading, fixed reference; v2: camera-ready version |
| Subjects: | Computation and Language (cs.CL) |
| Cite as: | arXiv:2602.09724 [cs.CL] (or arXiv:2602.09724v3 [cs.CL] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2602.09724 (arXiv-issued DOI via DataCite) |
Submission history
From: Maciej Rapacz
[v1] Tue, 10 Feb 2026 12:27:57 UTC (218 KB)
[v2] Mon, 16 Mar 2026 17:34:15 UTC (640 KB)
[v3] Tue, 12 May 2026 21:30:45 UTC (641 KB)
