The Development of Malaysian Corpus of Financial English (MaCFE)

This paper presents the design and development of the Malaysian Corpus of Financial English (MaCFE), a specialized corpus containing a wide range of online documents (i.e. communiqués) from various financial institutions in Malaysia. It describes in detail the collection and selection of data and the preprocessing of raw data, which includes data digitizing, cleansing and tagging. The paper also introduces the user interface for MaCFE with its built-in linguistic analysis features. MaCFE was designed and developed to provide corpus linguistic researchers with an avenue to explore the field, and to provide ESP/EAP practitioners in Malaysia with a resource for developing locally based ESP/EAP curricula and teaching and learning materials. It would also serve as a learning avenue for future financial professionals in their training. The corpus has approximately 4.3 million words from 1472 electronic documents retrieved from the official websites of banks and financial institutions. At present, users can query the MaCFE database using its built-in concordancer. In the future, its language-data-processing facilities will be expanded to include tools for keyword, wordlist and word collocation queries.


INTRODUCTION
A corpus is a subset of electronic texts library developed on a large scale, which contains extensive collections of transcribed utterances or written texts (McEnery & Hardie, 2011). It is built according to explicit design criteria for a specific purpose which not only serves as a findings contributing to the studies of English teaching and learning as a second language in Malaysia. An error analysis study by Arshad and Hawanum (2010), for instance, made use of the data from this corpus to investigate the use of auxiliary BE in the essays written by Malaysian Primary 5 students.
The study found many instances where students overgeneralized the use of was to show past tense and were unable to differentiate between the use of BE as an auxiliary and as a main verb. Besides that, Rafidah (2013), in her investigation of the use of six phrasal verbs with the particle UP by Malaysian ESL learners, also made use of the EMAS corpus. The Malaysian learners' use of phrasal verbs was compared to that of native speakers in the Bank of English (BoE) corpus. The findings revealed that wrong usage of common phrasal verbs (e.g. pick up, wake up, get up) is strongly associated with the learners' lexical knowledge, their awareness of common collocates, their familiarity with the context of use and their mother tongue. The appropriateness in the use of phrasal verbs was also found to improve over time, suggesting that learners had benefited from longer exposure to the target language. EMAS was also utilized by Zarifi and Jayakaran (2014) in a corpus-based analysis of the creativity and unnaturalness in the use of phrasal verbs among Malaysian ESL learners. The acceptability of the phrasal verbs used or created by learners was judged with the help of dictionaries, and those without a dictionary entry were judged against the BNC. Learners were found to use phrasal verbs quite frequently; however, some of the phrasal verbs created by the learners appeared unnatural. In discussing the pedagogical implications of the study, the researchers suggested that material developers and teachers should emphasize distinguishing the semantic functions of every single particle and the way to combine them with various lexical verbs.
The Corpus Archive of Learner English in Sabah-Sarawak (CALES) developed by Botley and Doreen (2007) is a complementary corpus for the University of Malaya's MACLE. As of 2007, the corpus contains around 400,000 words of argumentative essays produced by diploma and degree students taking English proficiency courses at four public universities in East Malaysia, namely UiTM Sarawak, UiTM Sabah, Universiti Malaysia Sarawak (UNIMAS) and Universiti Malaysia Sabah (UMS). The learner corpus is closely modelled after the International Corpus of Learner English (ICLE) (Granger, 1998; Granger, 2002). Among the studies utilizing the corpus archive is the one by Botley and Doreen (2007), which analysed spelling errors in 281 essays selected from the corpus. The errors were grouped according to the framework developed by James (1998, as cited in Botley et al., 2007), which identifies mechanical errors such as doubling (abbuse), omission (vacum) and mis-ordering (frobidden), as well as mis-spellings (prostitude, sofisticated) and interlingual mis-encoding (accaunting, karier), in the selection of CALES texts.
In addition to the corpora reviewed, there are also the Malaysian Corpus of Students' Argumentative Writing (MCSAW) developed by Jayakaran and Rezvani Kalajahi (2013) and the Written English Corpus for Malay ESL Learners (WECMEL), a 470,000-word collection of argumentative essays produced by Universiti Teknologi MARA pre-Law students (Shazila & Noorzan, 2013).
The literature suggests that corpus-based research is growing in step with corpora development in the country. This is especially true for the Malaysian learner corpora. However, the development of specialized corpora in the country has been rather limited. So far, only one specialized corpus containing data from Malaysia has been developed, i.e. the Corpus of Malaysia Memoranda of Understanding (MoA), which contains legal documents compiled by Su'ad (1999, 2003). Considering the importance of specialized corpora in ESP/EAP contexts and the need to provide language instructors and learners with data relevant to the local setting, the Malaysian Corpus of Financial English (MaCFE) was developed.

DESIGN AND DEVELOPMENT OF MaCFE
MaCFE is designed and developed following the current methodology of corpus linguistics. In its construction, the research team has adhered as closely as possible to the corpus design principles posited by Sinclair (2004), which are summarized below:
1. The contents of a corpus should be selected according to their function in the community in which they arise.
2. The corpus should be as representative as possible of the chosen language.
3. Only components in the corpus that are designed to be independently contrasted are contrasted.
4. Criteria determining the structure of the corpus are small in number, separate from each other, and efficient at delineating a corpus that is representative.
5. Any information about a text is stored separately from the plain text and only merged when needed.
6. Samples of language for the corpus, whenever possible, consist of entire texts.
7. The design and composition of the corpus are fully documented with full justifications.
8. The corpus design includes, as target notions, representativeness, and balance.
9. The control of subject matter in the corpus is imposed by the use of external, and not internal, criteria.
10. The corpus aims for homogeneity in its components while maintaining adequate coverage, and rogue texts should be avoided.
(cited in Warren, 2010, p. 170)
In addition, the work has also benefitted from previous practices of specialized corpus building. Much of the design framework, especially in data compilation (i.e. setting external criteria and text categories), generally follows the framework established by Warren (2010) in building HKFSC. Nevertheless, some adjustments had to be made to the design where needed. For instance, the text categories finalized in MaCFE did not include some of the text categories used for the development of HKFSC due to issues of confidentiality and accessibility.
Furthermore, MaCFE has also adapted the Aksan and Aksan (2009) workflow packages. The corpus development is divided into four major processes, namely: (1) data collection and selection, (2) data preprocessing, which includes data digitizing, data cleansing, part-of-speech (POS) tagging and meta-linguistic tagging, (3) user interface, and (4) text and linguistic analysis. This section briefly discusses these processes, and Fig. 1 depicts the framework of the MaCFE design.

FIGURE 1. MaCFE design framework
A corpus is designed to constitute a representative sample of a defined language type (Atkins et al., 1991). Therefore, data selection is key to the successful design and development of the specialized corpus. As mentioned earlier, MaCFE has adopted the text categories of the HKFSC (Warren, 2010). In determining that the range of text types is representative of the English used by professionals in the financial sectors in Hong Kong, Warren (2010) sought the expert advice of professional bodies, government departments, private sectors as well as individual professionals from the financial service sector. Based on the experts' advice, HKFSC comprises 26 text types, all of which characterize the language read and written by financial professionals in Hong Kong. Most of the text types also typify the written language of financial institutions in Malaysia, but some adjustments had to be made to the text categories to suit the Malaysian finance situation, for instance the descriptions of the products offered by the banking institutions (insurance, investment, credit cards, etc.).
Malaysia operates a dual banking system: a conventional banking system operating in tandem with an Islamic banking system. Since the enactment of the Islamic Banking Act 1983 and the establishment of Malaysia's first Islamic bank, a significant number of full-fledged Islamic banks have been established in the country, including Bank Islam Malaysia Berhad and Bank Muamalat Malaysia Berhad. In recent years, Malaysia has also seen an increase in local conventional banks establishing Islamic subsidiaries offering various products and services complying with Sharia Law (e.g. Public Islamic Bank Berhad, CIMB Islamic Bank Berhad, RHB Islamic Bank Berhad). The liberalization of the Islamic financial system and a government-facilitated business environment have also attracted a number of foreign-owned financial institutions to set up their Islamic banks and subsidiaries in the country (e.g. Al Rajhi Banking and Investment Corporation, OCBC Al-Amin Bank Berhad, Standard Chartered Saadiq Berhad). In fact, Islamic banking has become an integral part of the financial system in Malaysia; at present, Malaysia's Islamic banking assets have reached USD65.6 billion with an average growth rate of 18-20% annually (Bank Negara, 2017). Due to this development, data from local as well as international Islamic financial entities are gathered for the development of MaCFE. The final release of MaCFE will cover four major categories of finance institutions: Local Islamic Bank, Foreign Islamic Bank, Local Conventional Bank and Foreign Conventional Bank, as displayed in Fig. 2.
Presently, 1472 electronic documents related to the Malaysian financial domain have been gathered and compiled, amounting to approximately 4,373,230 tokens. These electronic documents were retrieved and collected from banks' official websites, which are accessible via the public domain.

DATA PREPROCESSING
After the targeted data were selected and collected, preprocessing steps were applied. According to Zimmermann and Weißgerber (2004), preprocessing has a direct impact on the quality of the results returned by an analysis. MaCFE underwent four stages of data preprocessing: (i) data digitizing, (ii) data cleansing, (iii) part-of-speech tagging, and (iv) meta-linguistic annotation/markup. Each of the stages is explained in the following subsections.

Data Digitizing
In order to transform the collected data into machine-readable texts and integrate them with MaCFE's user interface, all the documents compiled have to be converted into text files. Text file format is a human-readable sequence of characters, which can be encoded into machine-readable formats. Each converted file will be renamed as follows:
a. Naming convention for bank documents:
The plus ( + ) sign in the naming convention for {SequenceOfDocument} and {Month} indicates that encoding is optional, because some documents only provide the year of publication and do not include the sequence and month of publication. Table 2 shows the text types and the respective codes assigned for the document naming convention, and Table 3 presents examples of documents in the MaCFE text collection.

Data Cleansing
The next stage in preprocessing is data cleansing. Data cleansing, also known as data cleaning or data scrubbing, involves the process of removing or eliminating noise from the data, which includes tables, images and special characters (refer to Table 4 for examples of special characters). According to Chu et al. (2016), failure in data cleansing leads to inaccurate analysis and unreliable decisions. As an example, tables and images need to be removed as they contain isolated terms and figures that would be counted by lexical analysis software in its overall analysis, thus affecting the overall statistical findings of wordlists and concordances. In general, too much noise in the datasets might render the data unfit and unsuitable for data analytics. As for MaCFE, there are four mandatory data cleansing procedures, which are:
i. Remove/eliminate tables
ii. Remove/eliminate images
iii. Correct misspelling
iv. Remove/eliminate special characters (e.g. ^ % #)
Tables and images were automatically removed during the data digitizing process. This process involves converting the data sources into text files using PDF Foxit Reader software. During the conversion, tables and images were simultaneously removed. Spelling correction was performed with the aid of the Microsoft Word spelling checker, which was used to identify and correct misspelled words. Finally, special characters were removed automatically using the RapidMiner Studio Educational (7.5.001) Text Processing Package by utilizing an algorithm as shown in Fig. 3.
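The special-character removal step can be illustrated with a minimal Python sketch. This is not the RapidMiner process or the algorithm of Fig. 3; it is an approximation under a stated assumption, namely that a special character is anything other than an ASCII letter, a digit, whitespace or basic sentence punctuation. The pattern `SPECIAL_CHARS` and the function name `remove_special_characters` are hypothetical, and the authoritative list of characters is the one given in Table 4.

```python
import re

# Assumption: anything outside ASCII letters, digits, whitespace and basic
# sentence punctuation counts as a "special character" (e.g. ^ % #).
# The paper's own list is the one in Table 4.
SPECIAL_CHARS = re.compile(r"[^A-Za-z0-9\s.,;:'?!()-]")


def remove_special_characters(text: str) -> str:
    """Replace special characters with spaces, then collapse repeated whitespace."""
    cleaned = SPECIAL_CHARS.sub(" ", text)
    return re.sub(r"\s+", " ", cleaned).strip()


if __name__ == "__main__":
    print(remove_special_characters("Profit^ rose 5% in #2016."))
```

Replacing each character with a space rather than deleting it avoids accidentally fusing two neighbouring words into one token.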
Table 4 displays some examples of special characters that need to be removed from the text collection. The algorithm for removing the special characters is presented in Fig. 3.

Part-of-Speech (POS) Tagging
The next stage of preprocessing is part-of-speech (POS) tagging. POS tagging is a basic form of syntactic analysis (Gimpel et al., 2011) and, according to Leech (1997), is the most frequently used form of annotation. POS tagging involves assigning each lexical unit in the datasets a code to indicate its part of speech, for example NNP for singular proper noun, RB for adverb or JJ for adjective. Information regarding the parts of speech is essential for increasing the specificity of data retrieval and is an important foundation for further forms of analysis such as syntactic parsing and semantic field annotation (McEnery & Hardie, 2011). Additionally, it could also contribute to various computational linguistic applications.
Nevertheless, manually POS tagging each lexical unit in a large corpus is a time-consuming and tedious process. Therefore, MaCFE was tagged using an automated POS tagger developed by Toutanova and Manning (2000) at Stanford University. The tagger was further improved by Toutanova, Klein and Manning (2003). Toutanova and Manning's POS tagger can be downloaded from https://nlp.stanford.edu/software/tagger.shtml. Table 5 illustrates the encoding of POS tagsets and the respective descriptions, which are based on the tagsets of the Penn Treebank (Marcus, Santorini & Marcinkiewicz, 1993). The complete Penn Treebank tagsets can be viewed in Santorini (1990).
MaCFE is still at the initial stage of development and has yet to be equipped with its own range of text processing facilities. RapidMiner and an in-house stand-alone Java program were employed to generate the wordlist for MaCFE. The wordlist produced would then be used to evaluate the suitability of the texts chosen to represent the financial domain. Table 6 presents the fifty most frequent words in MaCFE. The wordlist was obtained using the following steps and procedures.
Step 1: In this step, the works of Verma and Gaur (2014) and Shterev (2013) were adapted. At this stage, the RapidMiner Studio Educational (7.5.001) Text Processing Package (see Appendix A for the steps taken to generate the wordlist in RapidMiner) was employed, and the operators were applied in the following order:
a. Transform Cases: Transforms all characters into lowercase.
b. Tokenize (mode: non-letters): Splits text containing non-letters into single tokens.
c. Tokenize (mode: linguistic sentences; language: English): Splits text containing linguistic sentences into single word tokens.
d. Tokenize (mode: linguistic tokens; language: English): Splits word tokens into single characters.
e. Tokenize (mode: specify character): Splits word tokens into single characters with a specified delimiter.
f. Filter Special Characters (Dictionary): Removes special characters (refer to Table 4). Although special characters have been removed during data cleansing, this operation needs to be performed to ensure the texts are free from all possible special characters.
g. Filter Stopwords (English): Removes tokens that are English stopwords (refer to Appendix C for samples).
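The operator chain in Step 1 can be approximated outside RapidMiner. The sketch below is a simplified Python illustration, not the authors' process: it lowercases the text, splits on non-letters, drops stopwords, and reports for each token the total number of occurrences and the number of documents containing it. The function names and the tiny `STOPWORDS` set are hypothetical placeholders; the stopword list actually used in RapidMiner is far longer.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "of", "and", "a", "to", "in", "is", "for"}


def tokenize(text):
    """Lowercase the text, split on non-letters, and drop stopwords."""
    tokens = re.split(r"[^a-z]+", text.lower())
    return [t for t in tokens if t and t not in STOPWORDS]


def build_wordlist(documents):
    """Return rows of (token, total occurrences, document occurrences),
    ordered from the most to the least frequent token."""
    total = Counter()
    doc_freq = Counter()
    for text in documents:
        tokens = tokenize(text)
        total.update(tokens)          # every occurrence counts
        doc_freq.update(set(tokens))  # each document counts once per token
    return [(tok, n, doc_freq[tok]) for tok, n in total.most_common()]


if __name__ == "__main__":
    docs = [
        "The bank offers Islamic banking.",
        "Bank financing for the customer; bank rates.",
    ]
    for row in build_wordlist(docs)[:3]:
        print(row)
```

Counting each document's distinct tokens separately is what distinguishes the document count from the total count: a word repeated many times in one report still contributes only one document occurrence.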
After performing all the actions in Step 1, a list that contains three tuples, namely Attribute Name, Total Occurrences, and Document Occurrences, was produced. The list generated is shown in Table 6. The explanation of each tuple is as follows:
• Attribute Name: Contains a set of word tokens extracted from the text collection.
• Total Occurrences: Contains the number of occurrences of each token in the whole text collection.
• Document Occurrences: Contains the number of documents in which the token appeared.
Step 2: The next step is tagging each of the extracted tokens with its POS tag using the automated POS tagger developed by Toutanova and Manning (2000) (refer to Appendix B for detailed steps). The list of tokens after POS tagging was performed is shown in Table 7.
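As an illustration of how a POS-tagged token list could be turned into a frequency table, the Python sketch below counts (word, tag) pairs. It is not the in-house Java program mentioned in the paper. It assumes the tagged text is plain text in which each token is written as word_TAG (for example bank_NN), with tokens separated by whitespace; the function name `count_tagged_tokens` and the `separator` parameter are hypothetical.

```python
from collections import Counter


def count_tagged_tokens(tagged_text, separator="_"):
    """Count (word, tag) pairs in text such as 'bank_NN offers_VBZ'.

    Assumption: each whitespace-separated item is word + separator + tag.
    Items that do not fit this shape are skipped.
    """
    counts = Counter()
    for item in tagged_text.split():
        # rpartition splits on the LAST separator, so words that themselves
        # contain the separator are still handled.
        word, sep, tag = item.rpartition(separator)
        if sep and word and tag:
            counts[(word, tag)] += 1
    return counts


if __name__ == "__main__":
    sample = "bank_NN offers_VBZ islamic_JJ bank_NN"
    for (word, tag), n in count_tagged_tokens(sample).most_common():
        print(word, tag, n)
```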

Meta-Linguistic Annotation/Markup
The final step in data preprocessing is meta-linguistic markup. Meta-linguistic annotation or markup is a process of adding descriptions to the datasets, for instance information about a text such as text type, year published and gender of author. For MaCFE, the added markup includes the title of the document, type of document and year of publication. The markup was administered manually using the system presented in Table 8. Common markup systems include the symbols <, ! and >; however, for the MaCFE datasets, those symbols were omitted because they are considered noise.
Typically, a markup system would also involve adding codes to indicate features of the original structure of a text, such as the start and end points of paragraphs, sentences and chapters, page breaks and headings, so that a word can be searched together with a markup code, for example the use of the pronoun we in the introduction section of scientific journal articles. However, the markup system applied in MaCFE was specifically designed to provide textual information about a text (or the header), i.e. title of document, type of document and year/month of publication. Other elements in the text (paragraph/sentence/chapter start and end points, page breaks, headings) were not annotated. The lack of markup to set boundaries on paragraphs and sentences in the text would not, however, affect the results of wordlist and concordance enquiries, as sentence and paragraph boundaries can still be distinguished through the use of punctuation (full stop) and spaces respectively. Table 8 presents the meta-linguistic markup system used for MaCFE, while Fig. 4 depicts the overview of text documents after performing meta-linguistic markup.

THE MaCFE PROTOTYPE
MaCFE is built entirely using the Hypertext Preprocessor or PHP, an open source scripting language for building web applications, and MySQL, an open source relational database management system. The PHP code runs on the web server and queries the MySQL database to render interaction with users via a web browser (e.g. Internet Explorer, Chrome, Firefox, Safari). The corpus can be accessed at http://learningdistance.org/mycorpus/macfe/.
As shown in Fig. 5 below, the interface has a basic, clean design with a welcome page and only three options: 'Home', which brings the user back to the welcome page; 'Login', to start using MaCFE; and 'Register', which the user has to complete before they can log in to the corpus.

FIGURE 5. MaCFE user-interface
Once logged in, users will be able to make queries to the MaCFE database. Using this prototype, users can generate concordance lines from the MaCFE database. A concordance line is a line of text from a corpus. It can be at the beginning, middle or end of the texts, and can be made up of one sentence, part of a sentence or parts of two sentences. To make a query, the user enters the target word in the word search box, e.g. 'finance' (see Fig. 6). The 'context' option allows the user to decide the number of words before and after the target word; in this case, 12 words before and after the target word 'finance'. Each concordance line (see Fig. 7) includes the target word, i.e. the word being studied. The target word is always in the middle of the concordance line, so when users search for a word in a set of concordance lines, they can see its context, that is, the words used before and after it. Note that there are complete sentences, incomplete sentences and also lines showing only part of a sentence.
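The behaviour of the 'context' option can be illustrated with a short keyword-in-context sketch in Python. This is not the PHP/MySQL implementation behind MaCFE, whose internals are not described here; the function name `concordance` and its parameters are hypothetical, and the text is assumed to be a plain string split on whitespace.

```python
import re


def concordance(text, target, context=12):
    """Return one (left, hit, right) tuple per occurrence of `target`,
    with up to `context` words on each side of the hit."""
    words = text.split()
    lines = []
    for i, word in enumerate(words):
        # Compare on letters only, so that "finance," still matches "finance".
        if re.sub(r"[^A-Za-z]", "", word).lower() == target.lower():
            left = " ".join(words[max(0, i - context):i])
            right = " ".join(words[i + 1:i + 1 + context])
            lines.append((left, word, right))
    return lines


if __name__ == "__main__":
    sample = "Islamic finance is growing and trade finance, too, supports exporters"
    hits = concordance(sample, "finance", context=2)
    for left, hit, right in hits:
        print(f"{left:>12} | {hit} | {right}")
    print("frequency:", len(hits))
```

Because the window is cut by word count rather than by sentence, a line may contain a whole sentence, part of one, or parts of two, which matches the description of concordance lines above. The number of tuples returned corresponds to the frequency count of the query.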
At the bottom of each query results table, users have the option of navigating through all the instances of the word 'finance'. The frequency count of the search query is displayed at the bottom right of the results table. By analyzing a set of concordance lines, users can analyze how a target word is used in context. They will also be able to analyze other linguistic elements relevant to the target word being studied. The MaCFE prototype is presently still quite basic. Further upgrades and improvements are necessary and are currently underway. At present the research team is adding several other query options, which include enabling users to search according to types of banks, types of documents, year and month. The process is still ongoing and is projected to be completed in June 2018. Once it is completed, users will be able to narrow their queries to specific areas of the data, depending on the purpose and scope of their analysis.

PROBLEMS ENCOUNTERED IN BUILDING MaCFE
MaCFE aims as much as possible to represent the language written and read by professionals in the financial sector in Malaysia, as well as to achieve the desired balance in its language representation. Nevertheless, compiling a large amount of data is not without its challenges. One of the major issues concerning data compilation is obtaining documents that are not accessible to the public. Documents like minutes of 'General Meetings' and 'Agreements' are generally not published online. Gaining access to these documents has proven difficult, as most banking institutions were reluctant to grant access due to issues of security and confidentiality. As a result, comparatively fewer of these documents were included in MaCFE. Nonetheless, the team obtained the summaries of the 'minutes', which are generally available online. The issue with 'Agreements' was resolved by compiling personal copies of 'Agreements' from clients of the financial institutions involved in this study. The number of 'Agreements' available in MaCFE is relatively small at present, so future work on MaCFE will include efforts to increase their number.
Data preprocessing of a sizeable corpus like MaCFE also involved laborious and tedious work. Some of the processes were not entirely automated and therefore required some manual labor. The data cleaning process, for instance, required each text to be examined manually to identify misspelled words and special characters. In checking and correcting spelling mistakes, the team utilized the Microsoft Word spell checker, which to an extent improved the speed of the process. Nonetheless, due to the number and length of the documents involved, the entire process took the research team several months to complete. Removing special characters from the data was also a time-consuming process, as it had to be administered to each individual document. Nevertheless, with the aid of an algorithm written in Java, the processing time was reduced by at least half. Each document, regardless of length, can be processed in less than 5 minutes, instead of the 10 to 20 minutes taken when administered manually.

FUTURE DIRECTION
Future planning for the corpus is to include language data-processing tools that would enable lexical analysis other than concordancing to be administered on the MaCFE platform. The research team is considering incorporating RapidMiner 7.5.001 (Text Processing Package) to generate wordlists, word occurrences, document occurrences and n-grams (bigrams and trigrams), and a Java program to compute word-form frequency and to generate the association of n-grams. In doing so, the team needs to conduct a preliminary analysis using these tools in order to gauge their suitability and reliability. The analysis will also determine whether the engine currently employed to operate MaCFE would be able to support this additional software. As mentioned earlier, MaCFE uses the MySQL database management system, and in order to support future extension, its database access has to be upgraded to PHP's MySQLi extension.
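As a rough illustration of the planned n-gram facility, the Python sketch below generates bigrams and trigrams from a token list and counts them. It is a generic sliding-window approach, not the RapidMiner operator or the Java program under consideration; the function names `ngrams` and `ngram_counts` are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples (empty if too few tokens)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(tokens, n):
    """Count how often each n-gram occurs in the token list."""
    return Counter(ngrams(tokens, n))


if __name__ == "__main__":
    tokens = "islamic banking and islamic banking products".split()
    print(ngram_counts(tokens, 2).most_common(2))  # bigrams
    print(ngrams(tokens, 3))                       # trigrams
```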
The completion of MaCFE has also enabled efforts in designing and developing discipline-specific language materials for EAP/ESP settings. The corpus will be utilized as a reference tool (Yoon, 2011), from which samples of authentic language will be extracted for use in the development of online language modules. The rich collection of authentic language data will be mined to provide authentic phrases, expressions or short passages for the language activities designed. Samples of how the language is used, in the form of concordances, will also be available for the learners to analyze. In order to complete the language activities, learners would be required to consult the concordances extracted. This approach to learning language promotes inductive learning (Johns, 1991). Johns (1991), in advocating data-driven learning (DDL), pointed out that the use of corpora can foster inductive learning through learners' active participation in analyzing the language samples. More importantly, the learners will also be presented with authentic language and benefit from the abundance of samples of how the language is actually used in the written communications of the financial sector in Malaysia. It is hoped that the modules, when complete, will equip learners with the language skills they require to function effectively in financial, business and corporate settings.
Efforts are also underway to design and develop training modules for future and current financial professionals in the country. Presently, the research team is preparing to conduct a needs analysis on the language needs and requirements of financial professionals serving local as well as international financial institutions in the country. The findings from the analysis will then be utilized in designing and developing the said modules. Upon completion, the modules will be the first to offer corpus-based training materials that cater to the needs and requirements of financial professionals in this country and beyond.
(Shterev, 2013; Verma & Gaur, 2014) to produce its wordlist and the automated POS Tagger (Toutanova & Manning, 2000) to facilitate the team in POS tagging the datasets. The online MaCFE, which was built entirely using the Hypertext Preprocessor or PHP and MySQL, can be freely accessed at http://learningdistance.org/mycorpus/macfe/ via a web browser such as Internet Explorer, Chrome, Firefox or Safari, among others. Once logged in, users are able to make queries to the MaCFE database and to generate concordance lines of searched items.

CONCLUSION
MaCFE is seen as a significant language resource not only for linguistic researchers and ESP/EAP practitioners, but also for financial professionals in their pursuit of further enhancing their professional communicative competence. Thus, it is imperative to inform professionals, researchers and EAP/ESP practitioners of MaCFE's existence and to promote the specialized corpus as an invaluable resource capable of further enhancing their professional communication, expanding their research horizons and enriching their teaching and learning avenues. In achieving these aims, the research team strives to publish as many works on MaCFE as possible in local as well as international journals and conferences. At the same time, we intend to reach a number of professional bodies, organizations and individual professionals by conducting a series of training workshops on how to use MaCFE as a language learning resource. Finally, it is hoped that the establishment of MaCFE will provide an impetus for the development of other specialized corpora, which consequently would benefit not only researchers and language practitioners, but also professionals and stakeholders in the respective sectors.

APPENDIX A
Steps in generating the wordlist using the RapidMiner Studio Educational (7.5.001) Text Processing Package:
Step 1: Create a process named "Process Documents from Files".
FIGURE 8. Process
Step 2: Assign the source of the folders and documents (text directories), and the value of vector creation on the parameters of the process.

APPENDIX B
Steps in POS tagging the wordlist:
Step 1: Employ stanford-postagger-2016-10-31 to tag each word with its part of speech. Figure 16 below shows the POS-tagged list of words generated from MaCFE.
Step 2: Execute the following Java application program (as shown in Table 9) to produce the list of formatted POS-tagged words and their frequency, as shown in Figure 17 and Figure 18.

FIGURE 3. Algorithm to remove non-letters and special characters

FIGURE 6. Query facility

FIGURE 14. The jobs of process

FIGURE 1 (labels): (A) Data collection and selection; end users perform query; (D) Results: text analysis, linguistic analysis, domain analysis
These additional text types are available in all banking institutions involved in this study and are considered important as they are means for the banks to communicate with their clients (e.g. Publication), to disclose information to the general public and to internal and external stakeholders (e.g. Media Coverage, Publication, CSR Reports) and to advertise their products (e.g. Advertisements). Table 1 summarizes the text types for MaCFE.

TABLE 2. Type and code for document naming convention

TABLE 3. Samples of MaCFE text collection

TABLE 4. Examples of special characters

TABLE 5. Part-of-speech tagsets used in coding MaCFE

TABLE 6. Wordlist containing 50 most frequent words produced after completing Step 1

TABLE 7. Wordlist after POS tagging process

TABLE 8. MaCFE meta-linguistic markup system
macfeBegin: Indicates the beginning of metalinguistics of text document.
macfeTitleBegin: Indicates the beginning of metalinguistics for document title.
macfeDocTypeBegin: Indicates the beginning of metalinguistics for type of document.
eISSN: 2550-2131 ISSN: 1675-8021
MaCFE was designed and developed with the intention of providing corpus linguistic researchers and ESP/EAP practitioners in Malaysia with an avenue to expand research in the field and a resource for the development of local-based ESP/EAP curriculum and teaching and learning materials. Currently, MaCFE has gathered and compiled 1472 electronic financial documents retrieved and collected from banks' official websites. It now contains approximately 4.3 million words. Its final release covers four major categories of finance institutions: Local Islamic Bank, Foreign Islamic Bank, Local Conventional Bank and Foreign Conventional Bank. MaCFE has also employed a computer-based methodology, RapidMiner Studio Educational (7.5.001) Text Processing Package