Preliminary Analysis of Malaysian Corpus of Financial English (MaCFE)

Roslan Sadjirin, Roslina Abdul Aziz, Norzie Diana Baharum, Noli Maishara Nordin, Mohd Rozaidi Ismail


This paper presents the findings of a preliminary analysis of the Malaysian Corpus of Financial English (MaCFE). MaCFE is a specialised corpus of written documents compiled from banks in Malaysia, and it currently holds approximately 4.3 million word tokens. The aim of the analysis was to evaluate the suitability of the texts chosen to represent the financial domain. The analysis involved generating a word list and lists of co-occurrences from MaCFE, using RapidMiner Studio Educational 7.5.001 and an in-house Java program. The word list and the lists of the 50 most frequent two-word and three-word co-occurrences reveal that the text compilation is representative of the financial domain in Malaysia. The study concludes by discussing the pedagogical implications of the findings.
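To make the procedure concrete, the following is a minimal Java sketch of how a word list and lists of two-word and three-word co-occurrences might be generated. It is not the in-house program used in the study. The class name `CooccurrenceSketch`, the tokenisation rule, and the treatment of co-occurrences as contiguous word sequences (n-grams) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not the authors' implementation.
class CooccurrenceSketch {

    // Assumed tokenisation: lower-case the text, then split on any run of
    // characters that are not letters, digits or apostrophes.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}']+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Count contiguous n-word sequences. n = 1 yields the plain word list,
    // n = 2 and n = 3 yield two-word and three-word co-occurrences.
    // Simplification: sequences may span sentence boundaries.
    static Map<String, Integer> countNGrams(List<String> tokens, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            String gram = String.join(" ", tokens.subList(i, i + n));
            counts.merge(gram, 1, Integer::sum);
        }
        return counts;
    }

    // Return the k most frequent entries, highest count first;
    // ties are broken alphabetically so the order is deterministic.
    static List<Map.Entry<String, Integer>> topK(Map<String, Integer> counts, int k) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(entries.subList(0, Math.min(k, entries.size())));
    }

    public static void main(String[] args) {
        // Invented sample sentence, not taken from MaCFE.
        String sample = "Islamic banking and Islamic banking services.";
        List<String> tokens = tokenize(sample);
        for (int n = 1; n <= 3; n++) {
            System.out.println(n + "-word sequences: " + topK(countNGrams(tokens, n), 50));
        }
    }
}
```

For a corpus of MaCFE's size, the same counting would be run over every document and the counts merged before taking the top 50. Whether function words are filtered out first is a design choice that this sketch leaves open.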


Keywords: Corpus linguistics; Co-occurrences; Financial corpus; Specialised corpus; Word list










eISSN : 2550-2247

ISSN : 0128-5157