Evaluating Machine Translation of the Shan Hai Jing: An MQM-Based Analysis of Google Translate vs. ChatGPT with Prompting Effects

Wenqi Duan; Chwee Fang Ng; Hazlina Abdul Halim; Zhongming Zhang

Abstract

Culturally dense classical texts pose persistent challenges for machine translation, particularly in reconstructing compressed semantic hierarchies and culture-specific references. Although Neural Machine Translation (NMT) and Large Language Models (LLMs) have substantially improved fluency and contextual coherence, previous studies have given limited attention to the evaluation of their performance on culturally embedded classical texts using MQM-based human evaluation alongside automatic metrics. Focusing on the English translation of the Shan Hai Jing, a culturally dense and semantically complex classical Chinese text, this study investigates whether different translation systems produce distinct error patterns in culturally compressed contexts and whether prompting strategies influence translation performance. Selected textual segments were translated using Google Translate and ChatGPT under minimal and enriched prompting strategies. Translation quality was assessed through MQM-based human evaluation alongside several automatic metrics (BLEU, chrF, BERTScore, and COMET-Kiwi). MQM analysis reveals clear differences in error patterns across systems: NMT outputs show a higher incidence of high-severity mistranslations, whereas LLM outputs tend to exhibit semantic generalisation and shifts in cultural references. By contrast, automatic metrics show limited differentiation in system rankings, with no significant main effect of system observed. Prompt enrichment does not produce consistent quality improvements and occasionally increases semantic drift. These findings suggest that translation quality in culturally compressed texts may be better interpreted through structural error patterns across MQM dimensions rather than metric-based rankings alone. Evaluation sensitivity appears to be shaped by text type, and increased prompt complexity does not necessarily enhance semantic precision in classical translation tasks.

Keywords: Neural Machine Translation (NMT); Large Language Models (LLMs); Multidimensional Quality Metrics (MQM); Shan Hai Jing; Prompt Strategies
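
The abstract contrasts MQM error patterns with metric-based rankings. As a minimal sketch of what an MQM-style tally might look like, the Python below counts annotated errors by category and severity and computes a severity-weighted penalty per 100 source units. The severity weights, category labels, annotations, and function names are all hypothetical and chosen for illustration only; they are not the study's scheme or data.

```python
from collections import Counter

# Assumed severity weights (hypothetical; the study's weighting is not given here).
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}


def mqm_profile(annotations):
    """Count errors per (category, severity) pair to expose the error pattern."""
    return Counter((a["category"], a["severity"]) for a in annotations)


def mqm_penalty(annotations, source_length, per=100):
    """Severity-weighted penalty, normalised per `per` source units."""
    total = sum(SEVERITY_WEIGHTS[a["severity"]] for a in annotations)
    return total * per / source_length


# Invented annotations for one 50-unit segment (illustration only).
nmt_errors = [
    {"category": "mistranslation", "severity": "critical"},
    {"category": "mistranslation", "severity": "major"},
    {"category": "grammar", "severity": "minor"},
]
llm_errors = [
    {"category": "semantic generalisation", "severity": "major"},
    {"category": "cultural reference shift", "severity": "minor"},
    {"category": "cultural reference shift", "severity": "minor"},
]

for name, errors in [("NMT", nmt_errors), ("LLM", llm_errors)]:
    print(name, dict(mqm_profile(errors)), mqm_penalty(errors, source_length=50))
```

Reporting the per-category profile next to the single penalty score keeps the kind of error visible, which is the distinction the abstract draws between error patterns and rankings.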

Abstract (Malay version)

Classical texts rich in cultural elements often pose challenges for machine translation systems, particularly in reconstructing dense semantic hierarchies and culture-specific references. Although Neural Machine Translation (NMT) and Large Language Models (LLMs) have markedly improved linguistic fluency and contextual coherence, previous studies have paid little attention to evaluating the performance of these systems on culturally embedded classical texts using MQM-based human evaluation together with automatic metrics. Focusing on the English translation of the Shan Hai Jing, a classical Chinese text known for its semantic density, mythological references, and complex cultural elements, this study aims to investigate whether different translation systems produce different error patterns in culturally dense contexts and whether prompting strategies influence translation performance. Selected text segments were translated using Google Translate and ChatGPT under two prompting strategies, namely minimal and enriched prompts. Translation quality was assessed through human evaluation based on Multidimensional Quality Metrics (MQM) alongside several automatic metrics such as BLEU, chrF, BERTScore, and COMET-Kiwi. The MQM analysis shows clear differences in error patterns between systems: NMT outputs show a higher rate of high-severity mistranslation errors, whereas LLM outputs tend to exhibit generalisation of meaning and shifts in cultural references. By contrast, the automatic metrics show limited differences in system rankings, with no significant main effect of system. Prompt enrichment does not produce consistent quality improvements and in some cases increases semantic drift. These findings suggest that translation quality in texts dense with cultural meaning is more appropriately interpreted through structural error patterns across MQM dimensions than through metric-based assessment alone. In addition, evaluation sensitivity is also influenced by text type, and increased prompt complexity does not necessarily improve semantic precision in the translation of classical texts.

Keywords: Neural Machine Translation (NMT); Large Language Models (LLMs); Multidimensional Quality Metrics (MQM); Shan Hai Jing; Prompt Strategies
