Image-based Character Embedding for Arabic Document Classification

Abstract

This work introduces an image-based character embedding model for Arabic text classification. Classical approaches to Arabic text classification depend on complex morphological analysis and hand-crafted feature engineering; character-level embeddings remove the need for both. We propose a new Arabic document classification model that combines the CE-CLCNN, in which text is represented as an array of character images, with a class-balanced loss. The CE-CLCNN consists of two parts: a character autoencoder (CE) and a character-level CNN (CLCNN). The CE learns to encode image-based character embeddings, and the CLCNN then classifies the document using those embeddings. We created two datasets to test the effectiveness of our system: the Arabic Wikipedia Title dataset (AWT) and the Arabic Poetry dataset (APD). The proposed model outperforms the classical SVM by 21.23% and 13.82% in micro and macro F-score, respectively, on the APD dataset, and by 4.02% and 3.95% on the AWT dataset. To the best of our knowledge, this is the first time an image-based character embedding model has been used to address Arabic text classification. It is also the first time a deep learning text classification model has been tested on datasets that contain the three types of Arabic.
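
The abstract reports both micro and macro F-score. As a minimal sketch of how the two differ (using the standard definitions, not the authors' evaluation code), the following Python function computes both for single-label predictions. The function name `f_scores` and the toy labels are illustrative assumptions, not data from the paper.

```python
from collections import Counter


def f_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0, 0.0

    # Per-class true positives, false positives, and false negatives.
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Toy, made-up labels: an imbalanced two-class problem in which the
# classifier always predicts the majority class "a".
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 10
micro, macro = f_scores(y_true, y_pred)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

On imbalanced data the macro score drops sharply when a minority class is missed, while the micro score is dominated by the majority class, which is one reason to report both.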

Publication
The 26th Annual Meeting of the Association for Natural Language Processing (言語処理学会第26回年次大会), 2020.

Related