
Improved Arabic Natural Language Processing through Semisupervised and Cross-Lingual Learning

cmu-admin
April 22, 2025

Natural language processing (NLP) promises to change the landscape of human-computer and human-human communication. We envision Arabic language processing systems for applications such as automatic summarization, question answering, and, most importantly, automatic translation. These applications depend on robust, domain-general Arabic NLP components such as information extractors, parsers, and entity resolvers. Somewhat surprisingly, we and others have found that English NLP tools, together with techniques from machine translation and examples of human translations, can be used to build better NLP tools for languages like Arabic. This proposal aims to use cross-lingual and semisupervised learning to improve Arabic NLP, exploiting the ever-growing Arabic Wikipedia alongside existing annotated and translated Arabic corpora. Because Arabic has a high degree of orthographic and morphological ambiguity, our focus will be on core NLP tasks with integrated morphological disambiguation. The project builds on the PIs’ prior work in statistical NLP, including morphological disambiguation for Arabic and other morphologically complex languages, and learning from unannotated data. It will give researchers in Qatar and the United States a unique opportunity to collaborate on applying recent statistical modeling and machine learning approaches to the development of the robust, wide-coverage methods that Arabic language technologies require.
