Pembangunan Taksonomi dari Teks Melayu Menggunakan Algoritma Kunang-Kunang Pembahagi Dua Sama (Taxonomy Development from Malay Text Using Firefly Bisection Algorithm)

Mohd Zakree Ahmad Nazri, Tri Basuki Kurniawan, Abdul Razak Hamdan, Salwani Abdullah, Mohammed Azlan Mis

Abstract


Taxonomy is used to explain that animals can be classified into several categories such as mammals, reptiles and crocodiles. This biological taxonomy allows similarities, differences and even relationships between animals to be defined. The concept and function of biological taxonomy have been 'borrowed' by Internet scientists and engineers in developing taxonomies for the Internet. As with biological taxonomy, developing a taxonomy for the Internet manually is neither easy nor cheap. The task takes time and requires domain expertise. Computer scientists have therefore used artificial intelligence approaches to build taxonomies automatically from text. Machine learning algorithms are created to enable a machine to 'read' text and then 'learn' to build a taxonomy from the context obtained from the text. The main objective of this study is to develop a taxonomy learning algorithm for the Malay language that is more effective than existing algorithms, using a hybridization method. This paper investigates the effectiveness of a hybrid of the Firefly Algorithm (AKK) and the Bisecting K-Means Algorithm (PDS), called the Firefly Bisection Algorithm (AKK-PD). This empirical study collects data from experiments conducted on three Malay texts from the fields of Fiqh, Biochemistry and Information Technology. A comparison of accuracy based on the F-measure shows that the hybrid AKK-PD algorithm builds more accurate taxonomies than the existing algorithms. AKK-PD was found to be more effective and robust than the comparison algorithms when handling the data sparseness problem. However, this exploratory study needs to be extended to a larger Malay corpus to test the resilience of the algorithm on a corpus that is more general in nature than technical texts focused on a single field.
The feature extraction technique based on syntactic dependencies also needs to be improved, because it clearly produced contexts that suffer from a serious data sparseness problem. This poses a new challenge for research on taxonomy learning from Malay text.

 

Keywords: Machine Learning; Taxonomy Learning; Firefly Algorithm; Features; Malay Text

  

ABSTRACT

 

Taxonomy is used to explain that animals can be classified into categories such as mammals, reptiles and crocodiles. This biological taxonomy allows similarities, differences and relationships between animals to be defined. The concept and function of biological taxonomy have been 'borrowed' by Internet scientists and engineers in developing taxonomies for the Internet. As with biological taxonomy, developing taxonomies for the Internet manually is neither easy nor cheap, because the task takes time and requires domain expertise. Thus, computer scientists have used artificial intelligence approaches to develop taxonomies automatically from text. Machine learning algorithms are created to allow a machine to 'read' the text and then 'learn' to construct a taxonomy from the context derived from the text. The main objective of this study is to develop, using hybridization, a taxonomy learning algorithm for Malay text that is more effective than the existing algorithms. This study investigates the effectiveness of a hybrid of the Firefly Algorithm (AKK) and the Bisecting K-Means Algorithm (PDS); this hybrid is called the Firefly-Bisecting Algorithm (AKK-PD). This empirical study collects data from experiments carried out on three Malay texts from the fields of Islamic jurisprudence (Fiqh), Biochemistry and Information Technology. A comparison of accuracy using the F-measure shows that AKK-PD builds more accurate taxonomies than the existing algorithms when dealing with the data sparseness problem. AKK-PD is shown to be more effective and robust than the seven existing algorithms. However, this exploratory study needs to be continued with a larger Malay corpus to test the robustness and resilience of the algorithm on a more general corpus, rather than on technical, domain-specific texts.
The syntactic dependency-based feature extraction technique also needs to be enhanced, as it clearly produced contexts that suffer from serious data sparseness. This opens up a new challenge for research on taxonomy learning from Malay texts.
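
To make the clustering idea concrete, the sketch below shows plain bisecting K-means building a binary hierarchy of terms from sparse context-count vectors. It is only an illustration under stated assumptions: it is not the authors' AKK-PD, the firefly component described in the abstract is not reproduced, and the function names, the toy terms and the counts are all hypothetical. Seeding each split with the two most distant terms is a choice made here so that the example is deterministic.

```python
from math import sqrt


def cosine_distance(a, b):
    """Return 1 - cosine similarity; an all-zero vector is treated as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def split_in_two(terms, vectors, max_iter=20):
    """One 2-means split of `terms` (at least two unique terms expected)."""
    # Assumption of this sketch: seed with the two most distant terms.
    _, seed_a, seed_b = max(
        (cosine_distance(vectors[a], vectors[b]), a, b)
        for i, a in enumerate(terms)
        for b in terms[i + 1:]
    )
    center_a, center_b = vectors[seed_a], vectors[seed_b]
    left, right = [], []
    for _ in range(max_iter):
        left = [t for t in terms
                if cosine_distance(vectors[t], center_a)
                <= cosine_distance(vectors[t], center_b)]
        right = [t for t in terms if t not in left]
        if not left or not right:
            break  # the terms cannot be separated
        new_a = centroid([vectors[t] for t in left])
        new_b = centroid([vectors[t] for t in right])
        if new_a == center_a and new_b == center_b:
            break  # assignments have stabilised
        center_a, center_b = new_a, new_b
    return left, right


def build_taxonomy(terms, vectors):
    """Recursively bisect a non-empty list of terms into nested tuples."""
    if len(terms) == 1:
        return terms[0]
    left, right = split_in_two(terms, vectors)
    if not left or not right:
        return tuple(terms)  # keep inseparable terms as one flat group
    return (build_taxonomy(left, vectors), build_taxonomy(right, vectors))


# Hypothetical counts of how often each term occurs with four context features.
toy_vectors = {
    "cat": [3, 2, 0, 0],
    "dog": [2, 3, 0, 0],
    "laptop": [0, 0, 3, 2],
    "server": [0, 0, 2, 3],
}
tree = build_taxonomy(list(toy_vectors), toy_vectors)
print(tree)
```

On this toy input the two animal terms and the two device terms end up under separate branches. With real dependency-based features most counts would be zero, which is the data sparseness problem the abstract refers to.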

 

Keywords: Machine Learning; Taxonomy Learning; Firefly Algorithm; Features; Malay Text







DOI: http://dx.doi.org/10.17576/gema-2018-1802-13


eISSN : 2550-2131

ISSN : 1675-8021