The Effectiveness of Bottom Up Technique with Probabilistic Approach for A Malay Parser

Muhammad Azhar Fairuzz Hiloh; Mohd Juzaiddin Ab Aziz; Lailatul Qadri Zakaria

doi:10.17576/gema-2018-1802-09


Abstract

Parsing is the process of analyzing an input sentence to determine its syntactic structure according to the rules of a grammar. This task is performed by a parser, which produces a parse tree as output. However, a problem arises when parsing yields two or more parse trees and the parser is unable to select a single precise one. This limitation is caused by ambiguity in sentence structure. Ambiguity occurs when a word can be classified under more than one syntactic category and the choice of category affects the semantics of the sentence. The parser therefore needs an approach that resolves the ambiguity and selects the most appropriate parse tree to represent a sentence. Like other languages, Malay, the national language of Malaysia, is not exempt from the ambiguity problem. However, because its grammar can be expressed as a context-free grammar, a probabilistic context-free grammar (PCFG) approach can be used to help the parser determine a more accurate parse tree. This study focuses on the development of a statistical parser for the Malay language using a bottom-up technique. The training data, in the form of simple Malay sentences, were collected from various sources. Based on these training data, a statistical lexical corpus of Malay, consisting of vocabulary, grammar rules and their probabilities, was developed. Bottom-up parsing is implemented with the Cocke–Younger–Kasami (CYK) algorithm. The parser's performance is evaluated on its effectiveness in overcoming ambiguity by suggesting a more precise parse tree. In conclusion, the Malay Language Parser can help users identify the appropriate parse tree and resolve ambiguity issues in Malay.
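As an illustration of how a probabilistic context-free grammar combined with bottom-up CYK parsing can rank competing parse trees, the following is a minimal Python sketch. It is not the authors' implementation. The toy grammar, the category labels (S, NP, VP, N, V, ADJ), the four-word lexicon and every probability are invented for this example and are not taken from the study's corpus. The grammar is assumed to be in Chomsky normal form.

```python
from collections import defaultdict

# Hypothetical toy PCFG in Chomsky normal form (all values are invented).
# Binary rules: (left child, right child) -> [(parent, probability)]
BINARY = {
    ("NP", "VP"): [("S", 1.0)],
    ("V", "NP"): [("VP", 0.8)],
    ("VP", "V"): [("VP", 0.2)],
    ("N", "ADJ"): [("NP", 0.4)],
}
# Lexical rules: word -> [(category, probability)]
# "nasi" and "goreng" each belong to two categories, which creates ambiguity.
LEXICAL = {
    "ali": [("NP", 0.3)],
    "makan": [("V", 0.6)],
    "nasi": [("N", 1.0), ("NP", 0.3)],
    "goreng": [("ADJ", 1.0), ("V", 0.4)],
}


def build_tree(best, i, j, label):
    """Follow back-pointers to rebuild the best tree for words[i:j]."""
    _, back = best[(i, j)][label]
    if isinstance(back, str):  # lexical entry: the back-pointer is the word
        return (label, back)
    k, left, right = back
    return (label, build_tree(best, i, k, left), build_tree(best, k, j, right))


def cyk_best_parse(words, start="S"):
    """Return (probability, tree) of the most probable parse, or None."""
    n = len(words)
    # best[(i, j)][label] = (probability, back-pointer) for span words[i:j]
    best = defaultdict(dict)
    for i, word in enumerate(words):
        for label, p in LEXICAL.get(word, []):
            best[(i, i + 1)][label] = (p, word)
    # Bottom-up: fill longer spans from the shorter spans they are built of.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point
                for left, (p_left, _) in best[(i, k)].items():
                    for right, (p_right, _) in best[(k, j)].items():
                        for parent, p_rule in BINARY.get((left, right), []):
                            p = p_rule * p_left * p_right
                            # Keep only the most probable analysis per label.
                            if p > best[(i, j)].get(parent, (0.0, None))[0]:
                                best[(i, j)][parent] = (p, (k, left, right))
    if start not in best[(0, n)]:
        return None
    return best[(0, n)][start][0], build_tree(best, 0, n, start)


if __name__ == "__main__":
    print(cyk_best_parse("ali makan nasi goreng".split()))
```

Under this toy grammar the sentence "ali makan nasi goreng" has two analyses: one that groups "nasi goreng" as a noun phrase, with a probability of about 0.0576, and one that treats "goreng" as a second verb, with a probability of about 0.0035. The sketch returns the first, which is the sense in which a probabilistic parser suggests a single, more precise parse tree when several are possible.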

Keywords

parsing technique; probabilistic context-free grammar; CYK; ambiguity; Malay language


