Arabic Nested Noun Compound Extraction Based on Linguistic Features and Statistical Measures

The extraction of Arabic nested noun compounds is significant for several research areas such as sentiment analysis, text summarization, word categorization, grammar checking, and machine translation. Much research has studied the extraction of Arabic noun compounds using linguistic approaches, statistical methods, or a hybrid of both. Most existing approaches concentrate on the extraction of bi-gram or tri-gram noun compounds. Extracting a 4-gram or 5-gram nested noun compound, however, is a challenging task due to morphological, orthographic, syntactic and semantic variations. Many features have an important effect on the efficiency of extracting a noun compound, such as unit-hood, contextual information, and term-hood. Hence, there is a need to improve the effectiveness of Arabic nested noun compound extraction. This paper therefore proposes a hybrid of a linguistic approach and a statistical method to enhance the extraction of Arabic nested noun compounds. A number of pre-processing phases are presented, including transformation, tokenisation, and normalisation. The linguistic approach used in this study consists of part-of-speech tagging and the named entity pattern, whereas the statistical methods consist of the NC-value, NTC-value, NLC-value, and the combination of these association measures. The results show that the combined association measures outperform the NLC-value, NTC-value, and NC-value in nested noun compound extraction, achieving 90%, 88%, 87%, and 81% for bigram, trigram, 4-gram, and 5-gram, respectively.


INTRODUCTION
A noun compound (NC) is a phrase made up of two or more nouns, which are sometimes joined together: for example, the words 'tooth' and 'paste' are both nouns, and when connected they produce the new word "toothpaste". Occasionally, compound nouns appear as two separate words, such as "Christmas tree"; sometimes they are joined using hyphens, such as "father-in-law" (Albared et al., 2016). Arabic text on the Internet is increasing rapidly, whether in social media, news agencies or advertisements. Hence, extracting meaningful information from these noun compounds is an essential demand. Noun compounds (NCs) appear frequently in Arabic text, which gives their extraction an important role in the field of Information Extraction. Extraction of Arabic NCs is one of the challenging tasks in natural language processing (NLP), because Arabic words have no capital or small letters. Moreover, Arabic NCs may contain semantic ambiguity; for example, "باب اليمن" means "the door of Yemen" and is also the name of a famous place in Yemen. This may lead to misunderstanding when attempting to identify this noun compound. In addition, most Arabic noun compounds rely on the co-occurrence of two or more words, such as "خبر عاجل", which means "breaking news". These two words occur together far more frequently than either does with a synonym, as in "خبر طارئ", which means "emergency news". Finally, there is a lack of available Arabic noun compound lexicons. To overcome this limitation, these compound nouns need to be extracted so that they can be processed further.
The identification and extraction of noun compounds have been widely researched. For example, Buckeridge and Sutcliffe (2002) propose that the modifier and the head should both be nouns. Arabic, one of the most widely used languages, also contains several kinds of NCs. The nested noun compound (NNC) is one of these kinds, which in turn consists of multiple NCs (Salehi, 2016). It may consist of two to five words. Moreover, NNCs are used very frequently, and new NNCs are created to describe the exact meaning of terms in the language. It is difficult to identify nested noun compounds manually because of the high cost and time involved. A further problem is that identifying an NNC relies on its frequent occurrence within the text. Extracting these noun compounds is important for numerous research domains such as Information Retrieval, Sentiment Analysis, and Question Answering, and remains a challenging issue (Korayem, Crandall & Abdul-Mageed, 2012).
There are some differences between Arabic and English noun phrases. While English has both definite and indefinite articles, and both occur before the noun, Arabic has only the definite article al 'the' and no overt indefinite article. English and Arabic also differ with regard to the position of ordinals in the noun phrase. In English, ordinals can only precede the headword of a noun phrase, but in Arabic they can also follow the headword. In English, adjectives are not inflected for number and gender, but in Arabic they are. In Arabic, the number and the gender of the possessor have an impact on the form of the headword (kitab-u-hu كتابه 'his book', kitab-u-ha كتابها 'her book', kitab-u-hum كتابهم 'their book'), but in English they have no impact on the form of the headword at all, e.g., his book, her book or their book. These differences make Arabic noun compound extraction more challenging than English.
Identifying Arabic noun compounds is an important issue in NLP. They need to be extracted automatically before they are translated into other languages or used in various tasks and applications. Much research has been conducted on this issue, and many methods have been proposed for identifying NCs (Hazaa, Omar, Ba-Alwi & Albared, 2016). Some studies have used linguistic pattern methods, statistical methods, or a combination of both to extract bi-gram and tri-gram NCs. Nevertheless, the Arabic language includes compounds made up of multiple noun compounds, named nested noun compounds, which make the process of extraction more difficult. The difficulty arises from the need to extract compounds ranging from two words, such as "رئيس الوزراء", which means Prime Minister, to five words, such as "رئيس الوزراء أحمد بن دغر", which means Prime Minister Ahmed bin Daghr. A method combining linguistic and statistical approaches was proposed by Al-Mashhadani and Omar (2015) for extracting Arabic NNCs using one- to five-gram candidates. However, this method has some limitations in terms of accuracy. First, it relies on POS tagging alone, and the resulting patterns are not accurate. Second, the statistical method, which uses the association measures NC-value, NTC-value, and NLC-value, lacks accuracy. Arabic has numerous types of NCs, which are associated with complexities arising from the morphological variation of the language. According to Bounhas and Slimani (2009), Arabic NCs have five classes, one of which is compound nouns linked by composite relations.

Several approaches have been introduced to identify Arabic multi-word expressions (MWEs). Most of them use a linguistic approach, a statistical approach, or a combination of both. For example, Attia et al. (2010) introduced three complementary approaches to automatically identify and evaluate multi-words for an Arabic dataset. In that study, cross-lingual correspondence asymmetry was applied to extract multi-words from the Arabic Wikipedia (Ar.Wikipedia), generating multi-word candidates based on a multi-lingual lexicon of named entities. Following this, the English multi-words extracted from the Princeton WordNet were translated into Arabic in order to validate the candidates. Lastly, a hybrid of point-wise mutual information and POS tagging was used to generate multi-word candidates as unigrams, bigrams, and tri-grams.

Saif and Aziz (2011) propose a combination of linguistic and statistical methods to identify Arabic collocations from a newspaper dataset. Lemmatisation and POS tagging are used as the linguistic approach to generate and filter unigram and bigram candidates. Following this, the authors use statistical methods consisting of the following association measures: PMI, chi-square, LLR and improved mutual information. The candidates are ranked based on co-occurrence. Lastly, the authors conclude that LLR outperforms the other association measures.

Mahdaouy, Ouatik and Gaussier (2014) present a hybrid of linguistic and statistical approaches to identify Arabic multi-words. First, the authors use a POS tagger as a linguistic pattern, borrowed from Diab, Hacioglu and Jurafsky (2004), to assign a tag to every word, which is essential for filtering candidates. Second, they use three statistical measures, namely NC-value, NTC-value, and NLC-value. Lastly, the authors show that the NLC-value outperforms the other measures.

Al-Balushi et al. (2014) present a combination of linguistic and statistical approaches to detect Arabic nested noun compounds. First, lemmatisation and POS tagging are used as linguistic patterns to filter candidates. Second, to rank the candidates, the authors use three association measures: LLR, PMI, and NC-value. Lastly, the authors demonstrate that NC-value outperforms the other association measures.

Al-Mashhadani and Omar (2015) suggest a combination of linguistic and statistical approaches to extract Arabic nested noun compounds. First, normalisation, stemming, and tokenisation are performed as pre-processing tasks to remove unwanted data. Second, the authors apply candidate extraction based on POS tagging. Third, the authors use the three statistical methods introduced by Mahdaouy et al. (2014), namely NC-value, NTC-value, and NLC-value. Lastly, the authors report that the NLC-value outperforms the NTC-value and NC-value in nested noun compound extraction, achieving 83%, 76%, 72% and 65% for bigram, trigram, 4-gram and 5-gram, respectively. [Table 1] shows a summary of the related work.

In short, according to Bounhas and Slimani (2009), identifying Arabic MWEs is a difficult task because of the complexity of their semantic, syntactic and morphological variations. Examples of MWE variations include graphical variants (the graphic alternation between the letters "ها" "ha'a" and "تاء مربوطه" "Ta'a marbutah"), inflectional variants (the number inflection of nouns, the number and gender inflections of adjectives, and the definite article "ال" (al)), morphosyntactic variants (the synonymy relationship between two MWEs of different structures) and syntactic variants (modifications of the internal structure of the base term that do not affect the grammatical categories of the main items, which remain identical).

Moreover, the current methods have some limitations in terms of the extraction of nested noun compounds. Most current approaches have been proposed to extract bi-gram and tri-gram candidates, and only two approaches have been proposed to extract nested noun compounds. The first approach is presented by Al-Balushi et al. (2014). This method has the following limitations: (i) the linguistic approach is limited to simple linguistic patterns containing Noun + Noun, Noun + Adjective and their extensions; (ii) the association measures used (LLR, PMI and NC-value) are each limited by their own features. The second approach was proposed by Al-Mashhadani and Omar (2015). This method has limitations in terms of accuracy: (i) candidate extraction relies on POS tagging, whose patterns are not accurate; (ii) the statistical method, which uses the association measures NC-value, NTC-value and NLC-value, lacks accuracy. Thus, this study proposes a method that combines POS tagging, named entities and a combination of the association measures in order to address these limitations.

METHODOLOGY
Several phases are involved in the presented approach, as shown in [Figure 1]. These phases are: (i) the dataset used in this study; (ii) the transformation phase, which converts the data into an internal representation; (iii) the pre-processing phase, which consists of two tasks: tokenisation, which divides the words of the dataset into groups of consecutive morphemes, and normalisation, which removes unwanted data; (iv) candidate extraction, which also consists of two tasks: POS tagging, which assigns word categories such as verb, noun, or adjective, and the named entity pattern, which improves the identification of Arabic nested noun compounds. This phase also includes the detection of noun compound candidates using the n-gram model to produce bi-gram, tri-gram, 4-gram and 5-gram candidates; (v) the association measures, namely NC-value, NTC-value, and NLC-value; (vi) the mechanism that combines the three association measures; and lastly (vii) the evaluation of the presented method.

CORPUS
The dataset used in this study was presented by Saif and Aziz (2011). It is a collection of text files from two online Arabic newspapers, namely Al-jazeera.net and Almotamar.net. [Table 2] provides the numerical details of the Arabic corpus used. The distribution of the noun compounds across the corpus is also provided. The purpose of the transformation phase is to convert the data into an internal representation that can be read accurately and that enables the application of the pre-processing steps. UTF-8 encoding has to be used to represent the Arabic letters (Selamat & Ng, 2011), since these letters cannot be represented in the ANSI code.

PRE-PROCESSING
The pre-processing phase performs several steps that turn the data into a format suitable for the application of statistical measures, by cleansing the dataset of unnecessary data. Two pre-processing tasks are carried out, namely tokenisation and normalisation, which are defined as follows:

TOKENIZATION
The task of tokenisation is to split words from text into sets of consecutive morphemes (Aliwy, 2012). For example, after applying the tokenisation process to "United Arab Emirates" it would become "United_ Arab_ Emirates". Similarly, after applying the process of tokenisation to the Arabic phrase 'التراث العربي الاسلامي', which means "Arab and Islamic heritage", it would become '

NORMALIZATION
The aim of normalisation is to clean the data by excluding unwanted data such as digits or numbers, special characters, punctuation, and stop-words.
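The two pre-processing tasks can be illustrated with a minimal Python sketch. It is not the implementation used in this study: tokenisation is approximated by whitespace splitting, the stop-word list `STOP_WORDS` is a tiny hypothetical sample, and filtering by Unicode letter category is an assumption about how digits, punctuation and special characters might be removed.

```python
import unicodedata

# Hypothetical, tiny stop-word list used only for illustration.
STOP_WORDS = {"في", "من", "على", "الى"}


def tokenize(text):
    """Approximate tokenisation by splitting the text on whitespace."""
    return text.split()


def normalize(tokens, stop_words=STOP_WORDS):
    """Drop digits, punctuation, special characters and stop-words."""
    cleaned = []
    for token in tokens:
        # Keep only characters whose Unicode category is a letter (L*).
        # Note: this also strips Arabic diacritics (category Mn).
        letters = "".join(
            ch for ch in token if unicodedata.category(ch).startswith("L")
        )
        if letters and letters not in stop_words:
            cleaned.append(letters)
    return cleaned


tokens = normalize(tokenize("رئيس الوزراء في 2015 ، اليمن!"))
print(tokens)
```

In this example the digit token, the Arabic comma, the exclamation mark and the stop-word are discarded, leaving only the three content words.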

EXTRACTION OF CANDIDATES
Candidate extraction consists of two methods, namely the Stanford POS tagger and a list of named entities. The goal of these two methods is to produce lists of n-gram noun compound candidates that can then be filtered according to the linguistic patterns. The two methods are described as follows:

POS TAGGING
According to Navigli (2009), POS tagging is a method of word-sense disambiguation whose purpose is to assign tags such as adjective, noun, verb, or adverb to all the words in a text. Many words have several potential tags, and POS tagging has been introduced to disambiguate them. The key characteristic of POS tagging therefore lies in its ability to assign each word in the dataset its exact tag. In this study, the Stanford POS tagger has been used. [Table 3] shows an example of Arabic POS tagging.

Initially, a list of unigrams is produced based on linguistic patterns such as (Preposition + Adjective), introduced by Boujelben, Mesfar and Hamadou (2010). Each word from this list is paired with words from the dataset with which it appears to combine, in order to create noun compound candidates. These candidates are kept with their linguistic tags and frequency of occurrence in a list named the 5-gram list. Next, relying on the POS tagger, the 5-gram list serves as the source of the 4-gram noun compound candidates, which are stored with their POS tags and frequency of occurrence. Likewise, the 4-gram list is the source of the tri-gram candidates, which are stored with their tags and frequency of occurrence in the tri-gram list. Lastly, the tri-gram list is reduced to bi-gram candidates, which are stored with their POS tags and frequency of occurrence in the bi-gram list. The purpose is to produce n-gram lists holding the bi-gram, tri-gram, 4-gram, and 5-gram noun compounds. Subsequently, the lists are filtered based on the structural patterns.

In more detail, the procedure first fetches the words from the unigram list built during the pre-processing task. Every word is combined with the words from the dataset with which it appears to integrate. Using the POS tagger, these combinations are stored with their linguistic classification and frequency of occurrence in the 5-gram list. From the 5-gram list, each 4-gram combination that appears to be a candidate according to the linguistic structural patterns is chosen and stored in the 4-gram list with its linguistic classification and frequency of occurrence. From the 4-gram list, each 3-gram combination that appears to be a candidate according to the structural patterns is then chosen and stored in the tri-gram list with its linguistic classification and frequency of occurrence. Likewise, the bi-gram list is constructed from the tri-gram list.
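The candidate generation step can be illustrated with a minimal Python sketch. It assumes a simplified tagset (`N`, `ADJ`, `V`) rather than the actual labels of the Stanford tagger, and a hypothetical filter `matches_pattern` that accepts tag sequences beginning with a noun and containing only nouns and adjectives; the real pattern set is the one defined by the study. For brevity, the sketch slides a window of each size over the tagged text instead of deriving each shorter list from the longer one.

```python
from collections import Counter


def matches_pattern(tags):
    """Hypothetical filter: starts with a noun, only nouns/adjectives inside."""
    return tags[0] == "N" and all(tag in {"N", "ADJ"} for tag in tags)


def extract_candidates(tagged, n_values=(5, 4, 3, 2)):
    """Return {n: Counter} mapping (words, tags) candidates to frequencies."""
    lists = {}
    for n in n_values:
        counts = Counter()
        for start in range(len(tagged) - n + 1):
            window = tagged[start:start + n]
            words = tuple(word for word, _ in window)
            tags = tuple(tag for _, tag in window)
            if matches_pattern(tags):
                counts[(words, tags)] += 1
        lists[n] = counts
    return lists


# Toy tagged sentence; the tags are illustrative, not tagger output.
tagged = [
    ("رئيس", "N"), ("الوزراء", "N"), ("اليمني", "ADJ"),
    ("يزور", "V"), ("مدينة", "N"), ("رداع", "N"),
]
lists = extract_candidates(tagged)
for n in (2, 3, 4, 5):
    print(n, len(lists[n]))
```

Each stored entry keeps the word sequence, its tag sequence and its frequency, mirroring the information that the n-gram lists are described as holding.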

NAMED ENTITY
The named entity pattern is proposed in order to improve the process of extracting Arabic nested noun compounds used by Al-Mashhadani and Omar (2015), since a large percentage of NCs are named entities, for example "Security Council". Therefore, to simplify the procedure of extracting noun compounds, a domain-specific list of named entities has been constructed, which consists of various kinds of names (e.g. persons, locations and organisations). This method is able to extract accurate noun compounds by checking whether these compound nouns are present in the proposed list. Essentially, the list method shortens the linguistic patterns used with the Stanford POS tagger. For example, the NC "Qaboos Said Sultan of Sultanate Oman" has the linguistic pattern (N + N + ADJ + PRE + N + N). "Qaboos Said" is found in the list as a person's name, so it is replaced with a single tag, Named Entity (NE). Likewise, since "Sultanate Oman" is found in the list as a location, it is also replaced with the single tag NE. Thus, this method can substantially improve the extraction of nested noun compounds, because a single NE tag can cover several nouns. [Table 4] shows examples of patterns generated using named entities.

The candidate ranking phase calculates the statistical measures for the candidates extracted into the n-gram lists, allocating each candidate a score of association strength (Ittoo & Bouma, 2013). The association measures used are the NC-value, NTC-value, NLC-value, and the combination of these three measures, so that both term-hood and unit-hood are considered. The definitions of these association measures are as follows:

NC-VALUE
NC-value was proposed by Frantzi, Ananiadou and Mima (2000). It builds on the C-value, a statistical method that measures the term-hood of a candidate based on the following features: number of occurrences, term nesting, and term length. It is measured as:

$$ \text{C-value}(a) = \begin{cases} \log_2 |a| \cdot f(a) & \text{if } a \text{ is not nested} \\ \log_2 |a| \cdot \left( f(a) - \dfrac{1}{|T_a|} \sum_{b \in T_a} f(b) \right) & \text{otherwise} \end{cases} \tag{1} $$

where $a$ is the candidate term, $|a|$ is the length of $a$ in words, $f(a)$ is the number of occurrences of $a$, and $T_a$ is the set of longer terms in which $a$ appears ($|T_a|$ is the cardinality of this set). For example, in the case of a non-nested noun compound such as 'الامعاء الغليظة', which means "large intestine", $|a| = 2$, which indicates that this phrase is a noun compound. In the nested case, an example is "رئيس الوزراء اليمني عبد الكريم الارياني", which means "Yemeni Prime Minister Abdulkarim Alaryani", where the length is 6, while the nested compound "رئيس الوزراء", which means "Prime Minister", has length 2. In this branch of the equation, the noun compound combines with other nouns to form a longer phrase. If $f(a) = 12$, the phrase "رئيس الوزراء اليمني عبد الكريم الارياني" has occurred 12 times in the dataset.

For instance, suppose the phrase "رئيس الوزراء اليمني عبد الكريم الارياني" has appeared 9 times and the phrase 'رئيس الوزراء اليمني' has appeared 17 times. Hence, $a$ = "رئيس الوزراء اليمني" with $f(a) = 17$, and the longer phrase, which belongs to $T_a$, has a frequency of 9.

Furthermore, the NC-value combines the C-value with contextual information, calculated as the N-value, which measures the terminological status of the context of a given candidate term. It is measured as:

$$ \text{N-value}(a) = \sum_{b \in C_a} f_a(b) \cdot \text{weight}(b), \qquad \text{weight}(b) = \frac{t(b)}{n} $$

where $C_a$ is the set of distinct context words of $a$, $f_a(b)$ is the number of times $b$ occurs in the context of $a$, $t(b)$ is the number of candidate terms with which $b$ appears, following Frantzi et al. (2000), and $n$ is the total number of terms considered. This measure is then combined with the C-value to provide the overall NC-value measure:

$$ \text{NC-value}(w) = \alpha \cdot \text{C-value}(w) + (1 - \alpha) \sum_{b \in C_w} f_w(b) \cdot \text{weight}(b) $$

where $f_w(b)$ is the frequency of $b$ as an MWE context word of $w$, $C_w$ is the collection of featured context words of $w$, and $\text{weight}(b)$ is the weight of $b$ as an MWLU context word. In addition, $\alpha$ is the weight that balances the two factors of the NC-value, namely the C-value and the contextual information.
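The C-value and NC-value computations can be sketched in Python as follows. This is an illustrative sketch rather than the implementation of this study: it assumes the standard form of the C-value from Frantzi et al. (2000), the weight α = 0.8 is an assumption, and the context word and its weight are hypothetical values. The frequencies 17 and 9 are taken from the example in the text.

```python
import math


def c_value(term, freq, longer_terms):
    """C-value of `term` (a tuple of words).

    freq maps each term to its number of occurrences; longer_terms lists
    the longer candidate terms in which `term` appears (the set T_a).
    """
    length_weight = math.log2(len(term))
    if not longer_terms:  # the term is not nested
        return length_weight * freq[term]
    nested_mean = sum(freq[b] for b in longer_terms) / len(longer_terms)
    return length_weight * (freq[term] - nested_mean)


def n_value(context_counts, weight):
    """Sum of context-word frequencies multiplied by their weights."""
    return sum(count * weight.get(word, 0.0)
               for word, count in context_counts.items())


def nc_value(c, n, alpha=0.8):
    """Weighted combination of C-value and N-value (alpha is assumed)."""
    return alpha * c + (1 - alpha) * n


a = ("رئيس", "الوزراء", "اليمني")
longer = a + ("عبد", "الكريم", "الارياني")
freq = {a: 17, longer: 9}

c = c_value(a, freq, [longer])
# Hypothetical context word, count and weight, for illustration only.
n = n_value({"قال": 3}, {"قال": 0.5})
print(c, n, nc_value(c, n))
```

With these figures the nested term is scored on 17 − 9 = 8 independent occurrences, scaled by the logarithm of its length.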

NTC-VALUE
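This method was presented by Vu, Aw and Zhang (2008). It combines the unit-hood feature, based on the T-score, with the NC-value in order to improve performance. The T-score is measured as follows:

$$ TS(w_i, w_j) = \frac{P(w_i, w_j) - P(w_i)\,P(w_j)}{\sqrt{P(w_i, w_j) / N}} $$

where $P(w_i, w_j)$ is the probability of the bigram $(w_i, w_j)$ in the corpus, $P(w_i)$ is the probability of the word $w_i$, and $N$ is the total number of words in the dataset. For instance, if the T-score were applied to the two words "مدينة رداع", which means "Rada'a City", it would be:

$$ TS(\text{مدينة}, \text{رداع}) = \frac{P(\text{مدينة}, \text{رداع}) - P(\text{مدينة})\,P(\text{رداع})}{\sqrt{P(\text{مدينة}, \text{رداع}) / N}} $$

where $P(\text{مدينة}, \text{رداع})$ is the probability of the two words occurring together, $P(\text{مدينة})$ is the probability of the word "مدينة", $P(\text{رداع})$ is the probability of the word "رداع", and $N$ is the total number of words in the dataset. After that, the T-score is incorporated into the NC-value through a reweighting of the number of occurrences that privileges terms with a positive T-score:

$$ F(a) = \begin{cases} f(a) & \text{if } \min(TS(a)) \le 0 \\ f(a) \cdot \ln\left(2 + \min(TS(a))\right) & \text{if } \min(TS(a)) > 0 \end{cases} $$

where $\min(TS(a))$ is the minimum T-score obtained from all the word pairs in $a$. Replacing $f(a)$ with $F(a)$ in Eq. (1) produces the TC-value, which is then combined with the N-value as before, leading to the NTC-value:

$$ \text{NTC-value}(a) = \alpha \cdot \text{TC-value}(a) + (1 - \alpha) \cdot \text{N-value}(a) $$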
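The T-score and the frequency reweighting can be sketched in Python. The sketch assumes the T-score form of Vu et al. (2008), that is (P(w_i, w_j) − P(w_i)P(w_j)) / sqrt(P(w_i, w_j) / N), and a reweighting of f(a) by ln(2 + min TS) when the minimum T-score is positive; the counts used are hypothetical and are not taken from the corpus.

```python
import math


def t_score(p_pair, p_first, p_second, n_words):
    """T-score of a word pair from its joint and marginal probabilities."""
    return (p_pair - p_first * p_second) / math.sqrt(p_pair / n_words)


def reweighted_frequency(freq, min_ts):
    """Boost the frequency of terms whose minimum T-score is positive."""
    if min_ts <= 0:
        return freq
    return freq * math.log(2 + min_ts)


# Hypothetical counts: 10,000 words, the pair occurs 20 times,
# and the two words occur 50 and 40 times respectively.
n_words = 10_000
ts = t_score(20 / n_words, 50 / n_words, 40 / n_words, n_words)
print(ts, reweighted_frequency(12, ts))
```

A candidate with a non-positive minimum T-score keeps its raw frequency, so only pairs that co-occur more often than chance are promoted.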

NLC-VALUE
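NLC-value is a combination of the NC-value, proposed by Frantzi et al. (2000), with the log-likelihood ratio (LLR), introduced by Dunning (1993). It can offer a more exact measure of unit-hood because of the capability of LLR to distinguish actual co-occurrence. It is measured by:

$$ F(a) = \begin{cases} f(a) & \text{if } \min(LLR(a)) \le 0 \\ f(a) \cdot \ln\left(2 + \min(LLR(a))\right) & \text{if } \min(LLR(a)) > 0 \end{cases} $$

where $\min(LLR(a))$ is the minimum LLR obtained from all the word pairs in $a$. Replacing $f(a)$ with $F(a)$ in Eq. (1) produces the LC-value, which leads to the NLC-value that integrates contextual information and both term-hood and unit-hood. The combined NLC-value can be expressed as:

$$ \text{NLC-value}(a) = \alpha \cdot \text{LC-value}(a) + (1 - \alpha) \cdot \text{N-value}(a) $$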
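As an illustration of the unit-hood component, the following Python sketch computes Dunning's log-likelihood ratio for a word pair from a 2x2 contingency table of observed counts. The formulation G² = 2 Σ O ln(O / E) is the standard one for this statistic; the function name, the argument layout and the counts are assumptions made for the example and are not taken from this study.

```python
import math


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: both words together; k12: first word without the second;
    k21: second word without the first; k22: neither word.
    """
    total = k11 + k12 + k21 + k22
    observed = [[k11, k12], [k21, k22]]
    row_sums = [k11 + k12, k21 + k22]
    col_sums = [k11 + k21, k12 + k22]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = row_sums[i] * col_sums[j] / total
                g2 += o * math.log(o / expected)
    return 2 * g2


# Hypothetical counts for a strongly associated pair.
print(llr(20, 30, 20, 9930))
# A table whose cells match their expected values scores zero.
print(llr(10, 10, 10, 10))
```

Larger values indicate that the two words co-occur more often than their individual frequencies would predict.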

COMBINATION
This method is a combination of the three association measures introduced above, in which the NC-value, NTC-value and NLC-value are integrated. This could lead to a more accurate measure of unit-hood, since the association measures together give a better indication of actual co-occurrence. The combination is introduced in two steps. In the first step, 80% of the result of the NLC-value is combined with 20% of the result of the NTC-value. Writing this intermediate score as $\text{NLTC-value}(a)$, it is defined as:

$$ \text{NLTC-value}(a) = 0.8 \cdot \text{NLC-value}(a) + 0.2 \cdot \text{NTC-value}(a) \tag{10} $$

In the second step, 80% of the result of the $\text{NLTC-value}(a)$, that is, the combination of the NLC-value with the NTC-value shown in Eq. (10), is combined with 20% of the result of the NC-value. This leads to the Combination-value, which integrates contextual information and both term-hood and unit-hood:

$$ \text{Combination-value}(a) = 0.8 \cdot \text{NLTC-value}(a) + 0.2 \cdot \text{NC-value}(a) $$
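The two-step combination can be written as a short Python sketch. The 80% and 20% weights come from the description above; the function name, the argument order and the three input scores are hypothetical and serve only to show the arithmetic.

```python
def combination_value(nc, ntc, nlc, w=0.8):
    """Two-step weighted combination of the three association measures."""
    # Step 1: 80% of the NLC-value with 20% of the NTC-value.
    step_one = w * nlc + (1 - w) * ntc
    # Step 2: 80% of the step-one score with 20% of the NC-value.
    return w * step_one + (1 - w) * nc


# Hypothetical scores for one candidate.
score = combination_value(nc=10.0, ntc=20.0, nlc=30.0)
print(score)
```

Expanding the two steps shows that the final score weights the NLC-value by 0.64, the NTC-value by 0.16 and the NC-value by 0.20, so the NLC-value dominates the ranking.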

EVALUATION
The evaluation method used in this study is the n-best method proposed by Evert (2005). Three stages of this evaluation method have been used. The first is the n-best selection, which gathers the candidates with the highest association values from the candidate ranking. The second stage is the annotation, in which each correct noun compound is manually annotated with one and each incorrect noun compound is annotated with zero. Lastly, the precision of the annotated noun compounds is calculated according to the following equation:

$$ \text{Precision} = \frac{TP}{TEC} $$

where TP is the number of correct noun compounds and TEC is the total number of extracted noun compounds.
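The three evaluation stages can be sketched in Python as follows. The function name, the candidate labels, the scores and the annotations are hypothetical; the sketch only assumes that precision is the number of correct candidates among the n best divided by the number of candidates selected.

```python
def n_best_precision(scores, annotations, n):
    """Precision of the n highest-scoring candidates.

    scores maps candidate -> association score;
    annotations maps candidate -> 1 (correct) or 0 (incorrect).
    """
    # Stage 1: select the n candidates with the highest association score.
    ranked = sorted(scores, key=scores.get, reverse=True)
    top = ranked[:n]
    if not top:
        return 0.0
    # Stage 2 is the manual annotation, supplied here as a dictionary.
    # Stage 3: precision = correct candidates (TP) / extracted candidates (TEC).
    true_positives = sum(annotations.get(candidate, 0) for candidate in top)
    return true_positives / len(top)


scores = {"A": 9.1, "B": 7.4, "C": 5.0, "D": 2.2}
annotations = {"A": 1, "B": 1, "C": 0, "D": 1}
precision = n_best_precision(scores, annotations, 3)
print(precision)
```

Evaluating the same ranking at several cut-offs, as is done with the different values of N in the results, only requires calling the function with different values of `n`.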

RESULTS AND DISCUSSION
In this section, the results of the association measures, namely NC-value, NTC-value, NLC-value, and the combination of all of them, are presented. As shown in [Table 5], bi-gram candidates have the highest precision, and precision decreases as the n-gram length increases. This indicates the difficulty of extracting accurate candidates when the n-gram is longer. In addition, the results when N = 100 are higher than those for the other values of N for all n-gram forms. This is because the likelihood of including incorrect candidates increases as N grows, and identifying candidates longer than a bi-gram may result in the detection of invalid noun compounds. The Combination-value has outperformed the NC-value, NTC-value and NLC-value as a result of combining all three association measures. The highest precision, 97%, was achieved when N = 100 with bi-grams, while the lowest precision, 81%, was obtained when N = 500 with 5-grams.

To summarise, the Combination-value has outperformed the NLC-value, NTC-value, and NC-value in terms of the extraction of bi-grams and tri-grams. However, this study has demonstrated a similar performance for the Combination-value compared with the NLC-value, NTC-value, and NC-value in terms of the extraction of Arabic nested noun compounds (ANNCs) involving bi-grams, tri-grams, 4-grams, and 5-grams. This is because the Combination-value is a combination of the NLC-value, NTC-value, and NC-value, which in other words means a combination of multiple features: contextual information, unit-hood, and term-hood. Contextual information measures the terminological rank of a given candidate term. The unit-hood feature gives the degree of strength of combinations or collocations (Fahmi, 2005). Lastly, term-hood treats the terms as linguistic units (Vu et al., 2008). These features are able to improve the extraction of nested noun compounds in the Arabic language. Furthermore, using named entities as a linguistic pattern can improve nested noun compound extraction for the Combination-value, NLC-value, NTC-value, and NC-value, and facilitates the task of recognising named entities, which usually occur as noun compounds.
[Table 6] shows a sample result of bi-gram compound noun extraction based on the proposed method. The extracted candidate is the extracted compound noun. The combination value indicates the strength of the correlation between the two nouns extracted with the given pattern types. Similar sample results for tri-gram, 4-gram and 5-gram are shown in Tables 7, 8 and 9, respectively. To clarify the improvement, a comparison with the related work, which serves as the baseline, is performed. The related work for this research is the study of Al-Mashhadani and Omar (2015). The results show that the proposed method clearly outperformed the work of Al-Mashhadani and Omar (2015). Here, a combination of Stanford POS tagging, named entities, and combined association measures has been proposed in order to identify nested noun compounds. [Table 10] shows the experimental results for the method introduced in this study and for the related work.

CONCLUSION
This study proposed a combination of linguistic and statistical methods for the extraction of Arabic nested noun compounds. The linguistic approach consists of POS tagging, which allows candidates to be selected according to word categories and linguistic patterns, and the named entity pattern, which uses a list of Arabic named entities. The presented statistical approach consists of the following association measures: the Combination-value, NLC-value, NTC-value, and NC-value. The experimental results have been evaluated using the n-best method and compared with the related work. The Combination-value has outperformed the other three association measures in identifying Arabic nested noun compounds. This research demonstrates that the extraction of nested noun compounds in Arabic, especially 4-grams such as 'الممثل الكوميدي ناصر القصبي', which means 'Comedian actor Naser Al-Qasabi', and 5-grams such as 'مخرج الفيلم التسجيلي جورج سكوت', which means 'Documentary film director George Scott', can be improved using the proposed combination methods. The automatic extraction of compound nouns may assist many language processing tasks such as machine translation, named entity recognition and question answering. Moreover, extracting such noun compounds helps the field of Arabic language studies to discover the different types of nested noun compounds that exist in numerous linguistic patterns.


TABLE 1. Summary of related work

TABLE 2. Details of the corpus used

TABLE 3. Example of Arabic POS tagger

TABLE 4. Example of named entity patterns

TABLE 5. Results of association measures

TABLE 8. Sample results of