Automatic Multi-lingual Script Recognition Application

Waleed Abdel Karim Abu-Ain, Siti Norul Huda Sheikh Abdullah, Khairuddin Omar, Siti Zaharah Abd. Rahman

Abstract


Document Image Analysis and Recognition (DIAR) techniques are used to recognize text components and convert them into an editable format. A script is a set of graphical representations used to express a particular writing system, or a subset of one. A single language may adopt the writing styles of more than one script family: old Malay (Jawi), for example, uses the Arabic script, while modern Malay uses the Roman script. This research covers seven major scripts in handwritten form: Arabic, Devanagari, Hebrew, Thai, Greek, Cyrillic and Korean. Automatic Multi-lingual Script Recognition (AMSR) is one of the main challenges in the DIAR domain. To date, only a few attempts have been made at automated script identification for off-line handwritten document images. Most available AMSR applications deal only with printed documents and script types, and neglect handwritten and multi-lingual documents. The objective of this study is to propose a multi-lingual AMSR framework. The framework is tested on the Multilingual-HW datasets, which contain more than seven international unconstrained handwritten scripts, using the Grey-Level Co-occurrence Matrix (GLCM) and Local Binary Pattern (LBP). The average accuracies of the two methods are about 97.01% and 85.29%, respectively. The proposed multilingual AMSR framework is expected to benefit communities that need to sort multi-lingual documents automatically. The research can also be extended to document forensics, or used by international relations agencies to identify documents written in an unknown native script.

 


Keywords


Automatic Multi-lingual Script Recognition (AMSR); feature extraction; statistical texture analysis; Grey-Level Co-occurrence Matrix (GLCM); Local Binary Pattern (LBP)
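As a rough illustration of the two texture descriptors named above, the following minimal NumPy sketch computes a normalised GLCM with three common Haralick-style statistics and a basic 3x3 LBP histogram. The function names (`glcm`, `glcm_features`, `lbp_histogram`), the single pixel offset, the number of grey levels and the choice of statistics are all assumptions made for illustration; the abstract does not state the parameters or the classifier used in the study.

```python
import numpy as np


def glcm(image, levels, dx=1, dy=0):
    """Normalised grey-level co-occurrence matrix for one offset (dx, dy >= 0).

    `image` must hold integer grey levels in the range [0, levels).
    """
    img = np.asarray(image, dtype=np.intp)
    h, w = img.shape
    ref = img[:h - dy, :w - dx]   # reference pixels
    nbr = img[dy:, dx:]           # their neighbours at the given offset
    counts = np.zeros((levels, levels), dtype=float)
    # Count how often grey level i occurs next to grey level j.
    np.add.at(counts, (ref.ravel(), nbr.ravel()), 1.0)
    return counts / counts.sum()


def glcm_features(p):
    """Contrast, energy and homogeneity of a normalised GLCM (illustrative choice)."""
    i, j = np.indices(p.shape)
    diff2 = (i - j) ** 2
    return {
        "contrast": float((p * diff2).sum()),
        "energy": float((p ** 2).sum()),
        "homogeneity": float((p / (1.0 + diff2)).sum()),
    }


def lbp_histogram(image):
    """Normalised 256-bin histogram of basic 8-neighbour LBP codes."""
    img = np.asarray(image, dtype=np.int64)
    h, w = img.shape
    centre = img[1:-1, 1:-1]
    codes = np.zeros_like(centre)
    # Neighbours visited clockwise, starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (oy, ox) in enumerate(offsets):
        nb = img[1 + oy:h - 1 + oy, 1 + ox:w - 1 + ox]
        # Set this bit where the neighbour is at least as bright as the centre.
        codes |= (nb >= centre).astype(np.int64) << bit
    hist = np.bincount(codes.ravel(), minlength=256).astype(float)
    return hist / hist.sum()


# Toy 3x3 "image" with three grey levels, used only to exercise the functions.
toy = np.array([[0, 0, 1],
                [0, 0, 1],
                [2, 2, 2]])
toy_features = glcm_features(glcm(toy, levels=3))
toy_lbp = lbp_histogram(toy)
```

In a script-recognition setting, feature vectors of this kind would be computed per text block and passed to a classifier; which classifier and block size the authors used is not given here.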





DOI: http://dx.doi.org/10.17576/gema-2018-1803-12



 

 

 

eISSN : 2550-2131

ISSN : 1675-8021