Domain-specific Stop Words in Malaysian Parliamentary Debates 1959 – 2018
Abstract
Removal of stop words is essential in Natural Language Processing and text-related analysis. Existing work on Malay stop words is based on standard Malay and on Quranic/Arabic translations into Malay. There is therefore no domain-specific stop word list, and the existing lists are ill-suited to processing Malay parliamentary discourse. In this paper, we propose a semantic approach to identifying and removing function words in Malay, in conventional Malay spelling, and in English when analysing a time-series corpus, namely the Malaysian Hansard Corpus (MHC), in order to extract a domain-specific Malay stop word list. The study combined three methods: the Z-method of most frequently occurring words, words that appear only once, and the classic method. The portion of the corpus evaluated spans Parliament 1 (1959) to Parliament 13 (2018). The study then categorised the stop word list according to domain-specific related words. The resulting list comprises 587 stop words. New stop words that emerged from the MHC include parliamentary words such as ‘Berhormat’ (salutation to Members of Parliament), ‘Pertua’ (salutation to the Speaker of the House), ‘ketawa’ (laugh) and ‘tepuk’ (clap). Besides typical English stop words such as ‘and’ and ‘the’, the list contains words such as ‘hon’ble’ (short for ‘Honourable’) and ‘honourable’. It also includes stop words in conventional Malay spelling such as ‘untok’ (for), ‘lebeh’ (more), and ‘kapada’ (to). The proposed set of stop words can be used to support natural language processing and text analysis.
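The combination of methods described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `candidate_stop_words`, the `top_n` cut-off, the tokenising pattern, and the tiny sample input are all assumptions made for illustration, and the "classic method" is approximated here as a simple lookup against an existing stop word list.

```python
import re
from collections import Counter

# Hypothetical tokeniser: lowercase alphabetic words, allowing an internal
# apostrophe so that forms like "hon'ble" stay in one piece.
TOKEN_PATTERN = re.compile(r"[a-z]+(?:['’][a-z]+)*")


def candidate_stop_words(documents, top_n=2, classic_list=frozenset()):
    """Collect stop word candidates from three illustrative sources.

    - most_frequent: the top_n words by raw frequency
    - hapaxes: words that appear exactly once in the corpus
    - classic: words from an existing list that occur in the corpus
    """
    counts = Counter()
    for text in documents:
        counts.update(TOKEN_PATTERN.findall(text.lower()))

    most_frequent = {word for word, _ in counts.most_common(top_n)}
    hapaxes = {word for word, freq in counts.items() if freq == 1}
    classic = {word for word in classic_list if word in counts}

    return {
        "most_frequent": most_frequent,
        "hapaxes": hapaxes,
        "classic": classic,
        "all": most_frequent | hapaxes | classic,
    }


# Toy input, not taken from the Malaysian Hansard Corpus.
sample = [
    "Yang Berhormat Tuan Pertua yang berhormat",
    "Tuan Pertua dan yang berhormat untok rakyat",
]
result = candidate_stop_words(sample, top_n=2, classic_list={"dan", "the"})
```

In practice the candidates produced this way would still need manual review, since the abstract describes a further step of categorising the list by domain-specific relevance.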
DOI: http://dx.doi.org/10.17576/gema-2021-2102-01
eISSN : 2550-2131
ISSN : 1675-8021