Chain-of-Thought Reasoning Improves Context-Aware Translation with Large Language Models

ArXi:2510.18077v2 Announce Type: replace This paper assesses the ability of large language models (LLMs) to translate texts that include inter-sentential dependencies. We use the English-French DiscEvalMT benchmark (Bawden, 2018) with pairs of sentences containing translation challenges for pronominal anaphora and lexical cohesion. We evaluate 12 LLMs from the DeepSeek-R1, GPT, Llama, Mistral and Phi families on two tasks: (1) distinguish a correct translation from a wrong but plausible one; and (2) generate a correct translation.