Abstract: This article investigates how well automatic evaluation metrics (AEMs) and LLM-as-a-judge evaluation perform on literary translation across multiple languages, genres, and translation modalities. The aim is to assess how closely these tools align with professional translators when evaluating translation creativity (creative shifts and errors), and whether they can substitute for laborious manual annotation. A dataset of literary translations spanning three modalities (human translation, machine translation, and post-editing), three genres, and three language pairs was created and annotated in detail for creativity by experienced professional literary translators. The results show that both AEMs and LLM-as-a-judge evaluations correlate poorly with professional evaluations of creativity, with LLM-as-a-judge showing a systematic bias in favour of machine-translated texts and penalising creative and culturally appropriate solutions. Moreover, performance is consistently worse for more literary genres such as poetry. These findings highlight fundamental limitations of current automatic evaluation tools for literary translation and the need for new tools that do not frequently treat out-of-routine translations as errors.
| Comments: | This paper has been accepted to the EAMT Conference 2026 in Tilburg, June 15-18, 2026 |
| Subjects: | Computation and Language (cs.CL) |
| Cite as: | arXiv:2605.13596 [cs.CL] (or arXiv:2605.13596v1 [cs.CL] for this version) |
| https://doi.org/10.48550/arXiv.2605.13596 arXiv-issued DOI via DataCite (pending registration) |
Submission history
From: Kyo Gerrits
[v1]
Wed, 13 May 2026 14:30:41 UTC (160 KB)
