Abstract: Iterative self-refinement is a simple inference-time strategy for machine translation: an LLM revises its own translation over multiple passes. Yet document-scale refinement remains poorly understood: 1) which pipelines work best, 2) which quality dimensions improve, and 3) how refiners behave. In this paper, we present a systematic study of document-level literary translation, covering nine LLMs and seven language pairs. Across nine translation-refinement granularity combinations and five refinement strategies, we find a robust recipe: document-level MT followed by segment-level refinement yields strong and stable improvements. In contrast, document-level refinement often makes fewer edits and leads to smaller or less reliable gains. Beyond granularity, a simple general refinement prompt consistently outperforms error-specific prompting and evaluate-then-refine schemes. Our large-scale human evaluation shows that refinement gains come primarily from fluency, style, and terminology, with limited and less consistent improvements in adequacy. Experiments varying model strength reveal that refinement projects outputs toward the refiner's distribution rather than performing targeted error repair. These findings clarify the mechanisms and limitations of current refinement approaches.
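The recipe named in the abstract (document-level MT followed by segment-level refinement) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's implementation: the `translate` and `refine` callables are hypothetical stand-ins for LLM calls, and the sketch assumes the document-level draft comes back with one line per source segment.

```python
from typing import Callable, List

# Hypothetical stand-ins for LLM calls; these signatures are assumptions,
# not an API described in the paper.
Translator = Callable[[str], str]    # whole source document -> whole draft
Refiner = Callable[[str, str], str]  # (source segment, draft segment) -> revision


def doc_translate_then_segment_refine(
    source_segments: List[str],
    translate: Translator,
    refine: Refiner,
    passes: int = 1,
) -> List[str]:
    """Translate the whole document once, then refine segment by segment."""
    # Step 1: document-level MT. All segments go to the translator in one call.
    draft = translate("\n".join(source_segments)).split("\n")

    # Assumption: the draft keeps one line per source segment.
    if len(draft) != len(source_segments):
        raise ValueError("draft is not aligned one line per source segment")

    # Step 2: segment-level refinement, repeated for the requested passes.
    for _ in range(passes):
        draft = [refine(src, hyp) for src, hyp in zip(source_segments, draft)]
    return draft
```

With real models, `translate` and `refine` would wrap prompts to an LLM; passing them in as plain callables keeps the control flow testable without any model access.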
| Subjects: | Computation and Language (cs.CL) |
| Cite as: | arXiv:2605.13368 [cs.CL] (or arXiv:2605.13368v1 [cs.CL] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.13368 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Shaomu Tan
[v1] Wed, 13 May 2026 11:27:32 UTC (2,664 KB)
