Part-of-Speech Tagger for Malay Social Media Texts

Siti Noor Allia Noor Ariffin, Sabrina Tiun


Processing the meaning of words in social media texts, such as tweets, is challenging in natural language processing. Malay tweets are no exception because they exhibit distinct linguistic phenomena, such as dialects from the different states of Malaysia; foreign-language terms borrowed into Malay; and mixed languages, abbreviations, spelling errors and mistakes in sentence structure. Tagging the word classes of tweets is an arduous task because tweets are characterised by their distinctive style, linguistic sounds and errors. Existing work on Malay part-of-speech (POS) tagging is based only on standard Malay and formal texts and is thus unsuitable for tagging tweets. A POS tagging model for non-standardised Malay tweets must therefore be developed. This study aims to design and implement a non-standardised Malay POS model for tweets and to assess it on the basis of word tagging accuracy on test data of unnormalised and normalised tweet texts. A solution that adopts a probabilistic POS tagger called QTAG is proposed. Results show that the Malay QTAG achieves best average POS tagging accuracies of 90% and 88.8% on the normalised and unnormalised test datasets, respectively.
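The evaluation criterion named in the abstract, word tagging accuracy, can be illustrated with a minimal sketch. It assumes accuracy means the fraction of tokens whose predicted tag matches the gold-standard tag. The function name `tagging_accuracy`, the example tokens and the tag labels are hypothetical and are not taken from the study's tagset or data.

```python
def tagging_accuracy(gold, predicted):
    """Return the fraction of tokens whose predicted tag equals the gold tag.

    Both arguments are lists of sentences; each sentence is a list of
    (word, tag) pairs. This is an assumed, generic definition of accuracy.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must contain the same number of sentences")
    total = 0
    correct = 0
    for gold_sent, pred_sent in zip(gold, predicted):
        if len(gold_sent) != len(pred_sent):
            raise ValueError("sentences must contain the same number of tokens")
        for (_, gold_tag), (_, pred_tag) in zip(gold_sent, pred_sent):
            total += 1
            if gold_tag == pred_tag:
                correct += 1
    # An empty test set is reported as 0.0 rather than dividing by zero.
    return correct / total if total else 0.0


# Hypothetical unnormalised tweet tokens with made-up tag labels.
gold = [[("sy", "PRON"), ("x", "NEG"), ("suka", "VERB"), ("mkn", "VERB")]]
predicted = [[("sy", "PRON"), ("x", "NOUN"), ("suka", "VERB"), ("mkn", "VERB")]]

print(tagging_accuracy(gold, predicted))
```

Running the same function over a normalised and an unnormalised version of a test set would give the two figures being compared, under the assumptions stated above.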


Keywords: part-of-speech; informal Malay text; Malay POS tagger; Malay tweet; QTAG











eISSN : 2550-2131

ISSN : 1675-8021