Constructing an Academic Thai Plagiarism Corpus for Benchmarking Plagiarism Detection Systems

Supawat Taerungruang, Wirote Aroonmanakun


Plagiarism is a major problem in the academic world. It not only undermines the credibility of educational institutions but also disrupts the creation of knowledge in the academic community. To lessen this problem, many plagiarism detection systems have been developed to detect plagiarized texts in academic works. In this paper, we describe the design and construction of an academic Thai plagiarism corpus. Such a corpus is necessary for training and testing plagiarism detection systems for Thai. To make the corpus a comprehensive representation of plagiarism, the data are divided into types according to the degree to which linguistic mechanisms are used in plagiarism. The data in the corpus come from two sources: texts created manually by participants and texts generated automatically by a program. After the corpus was created, its validity was verified with three analyses: a measurement of similarity between suspicious texts at the character level, a measurement of similarity between suspicious texts at the word level, and a comparison of the different types of data in the corpus based on the measured similarity. The results of the analyses indicate that the corpus created by the proposed methods is suitable for training and testing plagiarism detection systems.
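
The character-level and word-level similarity measurements mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not name the similarity measure, so the Dice coefficient over n-grams is used here purely as an assumption. The function names (`char_ngrams`, `word_ngrams`, `dice`) and the n-gram sizes are hypothetical. Because Thai is written without spaces between words, the word-level function assumes its input has already been segmented into tokens by some external tool.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return a multiset of character n-grams, ignoring whitespace."""
    compact = "".join(text.split())
    return Counter(compact[i:i + n] for i in range(len(compact) - n + 1))


def word_ngrams(tokens, n=1):
    """Return a multiset of word n-grams from a pre-segmented token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def dice(a, b):
    """Dice coefficient between two multisets: 2 * |A & B| / (|A| + |B|)."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0  # two empty inputs: treat as no measurable similarity
    overlap = sum((a & b).values())  # multiset intersection keeps minimum counts
    return 2 * overlap / total


# Hypothetical usage: compare a source passage with a suspicious passage.
source_tokens = ["a", "b", "c"]      # stand-ins for segmented Thai words
suspicious_tokens = ["a", "b", "d"]
char_score = dice(char_ngrams("abcd", 2), char_ngrams("abce", 2))
word_score = dice(word_ngrams(source_tokens), word_ngrams(suspicious_tokens))
```

Under these assumptions, a score near 1 suggests near-verbatim copying, and lower scores suggest heavier rewriting. Scores grouped by plagiarism type could then be compared across types, which is the third analysis the abstract describes.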



Keywords: plagiarism; Thai plagiarism detection; corpus creation; language resources; natural language processing











eISSN : 2550-2131

ISSN : 1675-8021