Nizar Habash

Papers by Nizar Habash

Arabic preprocessing for Statistical Machine Translation

John Benjamins Publishing Company eBooks, 2012

Arabic is a morphologically rich language. This poses some problems for statistical machine translation (SMT) approaches. In this chapter, we study the effect of different Arabic word-level preprocessing schemes and techniques on the quality of phrase-based SMT. We also present and evaluate different methods for combining preprocessing schemes. Our results show that given large training data sets, splitting off proclitics only performs best. However, for small training data sets, it is best to apply English-like tokenization using part-of-speech tags, and sophisticated morphological analysis and disambiguation. Moreover, choosing the appropriate preprocessing scheme produces a significant increase in BLEU score if there is a change in genre between training and test data. We also found that combining different preprocessing schemes leads to improved translation quality.
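As a rough illustration of what a word-level preprocessing scheme does, the sketch below renders one analyzed word under three toy schemes. The scheme names, the `render` function, the `+` clitic markers, and the transliterated example word are all assumptions made for illustration; they are not the schemes evaluated in the chapter, and real Arabic tokenization also involves orthographic adjustments that plain concatenation ignores.

```python
def render(analysis, scheme):
    """Tokenize one analyzed word under a toy preprocessing scheme.

    analysis: dict with 'proclitics' (list), 'stem' (str), 'enclitics' (list).
    scheme:   'none', 'proclitics', or 'all' (hypothetical scheme names).
    """
    pro = analysis["proclitics"]
    stem = analysis["stem"]
    enc = analysis["enclitics"]
    if scheme == "none":
        # Leave the word as a single unsegmented token.
        return ["".join(pro) + stem + "".join(enc)]
    if scheme == "proclitics":
        # Split off proclitics only; enclitics stay attached to the stem.
        return [p + "+" for p in pro] + [stem + "".join(enc)]
    if scheme == "all":
        # Split off every clitic on both sides of the stem.
        return [p + "+" for p in pro] + [stem] + ["+" + e for e in enc]
    raise ValueError(f"unknown scheme: {scheme}")


# Hypothetical analysis of a transliterated word: w+ s+ yktb +hA
word = {"proclitics": ["w", "s"], "stem": "yktb", "enclitics": ["hA"]}
for scheme in ("none", "proclitics", "all"):
    print(scheme, render(word, scheme))
```

In a real pipeline the analysis would come from a morphological analyzer and disambiguator rather than a hand-written dictionary.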

Orthographic and Morphological Processing for English-Arabic Statistical Machine Translation

Many studies in Statistical Machine Translation (SMT) for morphologically rich source languages show that morphological segmentation and orthographic normalization improve translation quality by reducing data sparsity. In this paper, we study the impact of this preprocessing for SMT into a morphologically rich target language, such as Arabic. We explore the space of possible segmentation schemes and normalization options. We evaluate the output only in its desegmented and orthographically enriched form. Our results show, first, that the best segmentation scheme is that of the Penn Arabic Treebank. Second, the best preprocessing procedure is to train the system on orthographically normalized data and then to enrich and desegment the output translations.

Arabic Natural Language Processing

The Arabic language continues to be the focus of an increasing number of projects in natural language processing (NLP) and computational linguistics (CL). This tutorial provides NLP/CL system developers and researchers (computer scientists and linguists alike) with the necessary background information for working with Arabic in its various forms: Classical, Modern Standard and Dialectal. We discuss various Arabic linguistic phenomena and review the state-of-the-art in Arabic processing from enabling technologies and resources, to common tasks and applications. The tutorial will explain important concepts, common wisdom, and common pitfalls in Arabic processing. Given the wide range of possible issues, we invite tutorial attendees to bring up interesting challenges and problems they are working on to discuss during the tutorial.

Combination of Arabic preprocessing schemes for statistical machine translation

Statistical machine translation is quite robust when it comes to the choice of input representation. It only requires consistency between training and testing. As a result, there is a wide range of possible preprocessing choices for data used in statistical machine translation. This is even more so for morphologically rich languages such as Arabic. In this paper, we study the effect of different word-level preprocessing schemes for Arabic on the quality of phrase-based statistical machine translation. We also present and evaluate different methods for combining preprocessing schemes resulting in improved translation quality.

Improved Arabic-to-English statistical machine translation by reordering post-verbal subjects for word alignment

Machine Translation, Oct 30, 2011

We study challenges raised by the order of Arabic verbs and their subjects in statistical machine translation (SMT). We show that the boundaries of post-verbal subjects (VS) are hard to detect accurately, even with a state-of-the-art Arabic dependency parser. In addition, VS constructions have highly ambiguous reordering patterns when translated to English, and these patterns are very different for matrix (main clause) VS and non-matrix (subordinate clause) VS. Based on this analysis, we propose a novel method for leveraging VS information in SMT: we reorder VS constructions into pre-verbal (SV) order for word alignment. Unlike previous approaches to source-side reordering, phrase extraction and decoding are performed using the original Arabic word order. This strategy significantly improves BLEU and TER scores, even on a strong large-scale baseline. Limiting reordering to matrix VS yields further improvements.
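The idea of reordering only for word alignment can be sketched as follows: move the subject span in front of the verb, align against the reordered sentence, then map the alignment links back to the original word order. This is a minimal sketch under stated assumptions, not the authors' implementation; the function names, the span convention, the toy transliterated sentence, and the hand-written alignment links are all hypothetical.

```python
def reorder_vs_to_sv(tokens, verb_idx, subj_start, subj_end):
    """Move the subject span [subj_start, subj_end) in front of the verb.

    Returns the reordered tokens and, for each new position,
    the index that token had in the original sentence.
    """
    if not (0 <= verb_idx < subj_start < subj_end <= len(tokens)):
        raise ValueError("expected a verb followed by a non-empty subject span")
    order = (
        list(range(verb_idx))                  # material before the verb
        + list(range(subj_start, subj_end))    # subject, now pre-verbal
        + list(range(verb_idx, subj_start))    # verb and anything up to the subject
        + list(range(subj_end, len(tokens)))   # rest of the sentence
    )
    return [tokens[i] for i in order], order


def project_alignment(links, order):
    """Map (reordered_source_pos, target_pos) links back to original source positions."""
    return sorted((order[s], t) for s, t in links)


# Toy VS sentence (transliterated): "wrote the-boy the-letter"
source = ["ktb", "Alwld", "AlrsAlp"]
reordered, order = reorder_vs_to_sv(source, verb_idx=0, subj_start=1, subj_end=2)

# Hypothetical links between the reordered source and
# the target "the boy wrote the letter" (0-based target positions).
links = [(0, 1), (1, 2), (2, 4)]
original_links = project_alignment(links, order)
print(reordered, original_links)
```

Because the links are projected back, later stages such as phrase extraction and decoding can keep working on the original Arabic word order, as the abstract describes.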

Hebrew Morphological Preprocessing for Statistical Machine Translation

This paper presents a range of preprocessing solutions for Hebrew-English statistical machine translation. Our best system, using a morphological analyzer, gains 3.5 BLEU points over a no-tokenization baseline on a blind test set. The next best system uses Morfessor, an unsupervised morphological segmenter, and gains almost 3.0 BLEU points over the baseline.

Syntactic reordering for English-Arabic phrase-based machine translation

We investigate syntactic reordering within an English to Arabic translation task. We extend a pre-translation syntactic reordering approach developed on a close language pair (English-Danish) to the distant language pair, English-Arabic. We achieve significant improvements in translation quality over related approaches, measured by manual as well as automatic evaluations. These results prove the viability of this approach for distant languages.

Arabic preprocessing schemes for statistical machine translation

In this paper, we study the effect of different word-level preprocessing decisions for Arabic on SMT quality. Our results show that given large amounts of training data, splitting off only proclitics performs best. However, for small amounts of training data, it is best to apply English-like tokenization using part-of-speech tags, and sophisticated morphological analysis and disambiguation. Moreover, choosing the appropriate preprocessing produces a significant increase in BLEU score if there is a change in genre between training and test data.

Dialectal Arabic to English Machine Translation: Pivoting through Modern Standard Arabic

North American Chapter of the Association for Computational Linguistics, Jun 1, 2013

Modern Standard Arabic (MSA) has a wealth of natural language processing (NLP) tools and resources. In comparison, resources for dialectal Arabic (DA), the unstandardized spoken varieties of Arabic, are still lacking. We present ELISSA, a machine translation (MT) system for DA to MSA. ELISSA employs a rule-based approach that relies on morphological analysis, transfer rules and dictionaries in addition to language models to produce MSA paraphrases of DA sentences. ELISSA can be employed as a general preprocessor for DA when using MSA NLP tools. A manual error analysis of ELISSA's output shows that it produces correct MSA translations over 93% of the time. Using ELISSA to produce MSA versions of DA sentences as part of an MSA-pivoting DA-to-English MT solution improves BLEU scores on multiple blind test sets by between 0.6% and 1.4%.

Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop

We present an approach to using a morphological analyzer for tokenizing and morphologically tagging (including part-of-speech tagging) Arabic words in one process. We learn classifiers for individual morphological features, as well as ways of using these classifiers to choose among entries from the output of the analyzer. We obtain accuracy rates on all tasks in the high nineties.

The Impact of Preprocessing on Arabic-English Statistical and Neural Machine Translation

Neural networks have become the state-of-the-art approach for machine translation (MT) in many languages. While linguistically-motivated tokenization techniques were shown to have significant effects on the performance of statistical MT, it remains unclear if those techniques are well suited for neural MT. In this paper, we systematically compare neural and statistical MT models for Arabic-English translation on data preprocessed by various prominent tokenization schemes. Furthermore, we consider a range of data and vocabulary sizes and compare their effect on both approaches. Our empirical results show that the best choice of tokenization scheme is largely based on the type of model and the size of data. We also show that we can gain significant improvements using a system selection that combines the output from neural and statistical MT.

MADARi: A Web Interface for Joint Arabic Morphological Annotation and Spelling Correction

ArXiv, 2018

In this paper, we introduce MADARi, a joint morphological annotation and spelling correction system for texts in Standard and Dialectal Arabic. The MADARi framework provides intuitive interfaces for annotating text and managing the annotation process of a large number of sizable documents. Morphological annotation includes indicating, for a word, in context, its baseword, clitics, part-of-speech, lemma, gloss, and dialect identification. MADARi has a suite of utilities to help with annotator productivity. For example, annotators are provided with pre-computed analyses to assist them in their task and reduce the amount of work needed to complete it. MADARi also allows annotators to query a morphological analyzer for a list of possible analyses in multiple dialects or look up previously submitted analyses. The MADARi management interface enables a lead annotator to easily manage and organize the whole annotation process remotely and concurrently. We describe the motivation, design and...

CamelParser: A system for Arabic Syntactic Analysis and Morphological Disambiguation

In this paper, we present CamelParser, a state-of-the-art system for Arabic syntactic dependency analysis aligned with contextually disambiguated morphological features. CamelParser uses a state-of-the-art morphological disambiguator and improves its results using syntactically driven features. The system offers a number of output formats that include basic dependency with morphological features, two tree visualization modes, and traditional Arabic grammatical analysis.

Morphological analysis and disambiguation for Breton

Lang. Resour. Evaluation, 2021

In this paper we present an extended description of two resources for natural language processing of Breton, a morphological analyser and constraint grammar-based disambiguator. The constraint grammar was developed using a novel methodology by a linguist and a language consultant creating rules to solve specific errors in disambiguation in a machine translation system. In addition we introduce a new morphologically-disambiguated corpus of Breton and evaluate both the morphological analyser and constraint grammar for coverage and accuracy. For comparison we use the same corpus to train several reference systems for part-of-speech tagging and lemmatisation and compare the performance. The experiments show that our system outperforms the reference systems by a wide margin when the reference systems are trained without an external full-form list, and performs comparably when they are trained with a full-form list generated from our morphological analyser.

Annotation Guidelines and Framework for Arabic Machine Translation Post-Edited Corpus

Qatar Foundation Annual Research Conference Proceedings Volume 2016 Issue 1, 2016

1. Introduction. Machine translation (MT) has become widely used by translation companies to reduce their costs and improve their speed. Therefore, the demand for quick and accurate machine translations is growing. MT systems often produce incorrect output with many grammatical and lexical choice errors. Correcting machine-produced translation errors, or MT post-editing (PE), can be done automatically or manually. Such approaches require annotated resources. For Arabic, to the best of our knowledge, no manually post-edited MT corpora are available to build such systems. Therefore, there is a clear need to build such resources for the Arabic language. In this abstract, we present our guidelines and annotation procedure for creating a human-corrected MT corpus for Modern Standard Arabic (MSA). The creation of any manually annotated corpus usually presents many challenges. To address these challenges, we created comprehensive and simplified annotation guidelines, which were used by a team of five annotators and one lead annotator. To ensure high agreement between the annotators, multiple training sessions were held and regular inter-annotator agreement (IAA) measures were performed to check the annotation quality.
2. Corpus. We collected a 100K corpus of English news articles taken from the collaborative journalism website Wikinews. The collected corpus was then automatically translated from English to Arabic using the paid Google Translate API service.
3. Guidelines. To annotate the MT corpus, we use the general annotation correction guidelines we previously designed for L1, described in Zaghouani et al. (2014), and we add specific MT post-editing correction rules. In the general correction guidelines, we place the errors to be corrected into seven categories: spelling, word choice, morphology, syntax, proper names, dialectal usage and punctuation. We refer to Zaghouani et al. (2014) for more details about these errors. In the MT post-editing guidelines, we provide the annotators with a detailed annotation procedure and explain how to deal with borderline cases. We include many annotated examples to illustrate specific cases of machine translation correction rules. Since there are equally accurate alternative ways to edit the machine translation output, all considered correct and some using fewer edits than others, we explained in the guidelines that the machine-translated texts should be corrected with the minimum number of edits necessary to achieve an acceptable translation quality. However, correcting accuracy errors and producing a semantically coherent text is more important than minimizing the number of edits. The annotators were therefore asked to pay attention to three aspects: accuracy, fluency and style.
4. Annotation Pipeline. The annotation team consisted of a lead annotator and six annotators. The lead annotator is also the annotation workflow manager of this project. He frequently evaluates the quality of the annotation, and monitors and reports on the annotation progress. A clearly defined protocol is set, including a routine for the post-editing annotation job assignment and the inter-annotator agreement evaluation. The lead annotator is also responsible for the corpus selection and normalization process, as well as the annotation of the gold standard used to compute the inter-annotator agreement (IAA) on a portion of the corpus. The annotation itself is done using an in-house web annotation framework built originally for the manual correction of errors in L1 and L2 texts (Obeid et al., 2013). This framework includes two major components: (1) the annotation management interface, which assists the lead annotator in the general workflow process and allows the user to upload, assign, monitor, evaluate and export annotation tasks; and (2) the MT post-editing annotation interface, the actual annotation tool, which allows the annotators to manually correct the Arabic MT output.
5. Evaluation. The low average WER of 4.92 obtained shows high agreement with the post-editing done in the first round between three annotators. The results obtained with the MT corpus are comparable to those obtained with the L2 corpus; this can be explained by the difficult nature of both corpora and the multiple acceptable corrections for both.
6. Related Work. Large-scale manually corrected MT corpora are not yet widely available due to the high cost of building such resources. For the Arabic language, we cite the effort of Bouamor et al. (2014), who created a medium-scale human judgment corpus of Arabic machine translation using the output of six MT systems and a total of 1892 sentences and 22k rankings. Our corpus is a part of the Qatar Arabic Language Bank (QALB) project, a large-scale manual annotation project (Zaghouani et al., 2014;…
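The evaluation above reports agreement as a word error rate (WER). The abstract does not say exactly how the score was computed, so the following is only a minimal sketch of the standard definition: word-level edit distance divided by the reference length, expressed here as a percentage. The function name and the example strings are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between two strings, as a percentage of the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return 100.0 * prev[-1] / len(ref)


# Two hypothetical post-edits of the same sentence, differing in one word out of four.
print(word_error_rate("a b c d", "a x c d"))
```

For inter-annotator agreement, one annotator's post-edit would serve as the reference and another's as the hypothesis; a lower score means the two post-edits are closer.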

MADAMIRA: A fast, comprehensive tool for morphological analysis and disambiguation of Arabic

In this paper, we present MADAMIRA, a system for morphological analysis and disambiguation of Arabic that combines some of the best aspects of two previously commonly used systems for Arabic processing, MADA (Habash et al., 2009) and AMIRA. MADAMIRA improves upon the two systems with a more streamlined Java implementation that is more robust, portable, and extensible, and is faster than its ancestors by more than an order of magnitude. We also discuss an online demo that highlights these aspects.

Combination of Arabic preprocessing schemes for statistical machine translation

Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the ACL - ACL '06, 2006

Statistical machine translation is quite robust when it comes to the choice of input representation. It only requires consistency between training and testing. As a result, there is a wide range of possible preprocessing choices for data used in statistical machine translation. This is even more so for morphologically rich languages such as Arabic. In this paper, we study the effect of different word-level preprocessing schemes for Arabic on the quality of phrase-based statistical machine translation. We also present and evaluate different methods for combining preprocessing schemes resulting in improved translation quality.

Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop

Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics - ACL '05, 2005

We present an approach to using a morphological analyzer for tokenizing and morphologically tagging (including part-of-speech tagging) Arabic words in one process. We learn classifiers for individual morphological features, as well as ways of using these classifiers to choose among entries from the output of the analyzer. We obtain accuracy rates on all tasks in the high nineties.

Syntactic reordering for English-Arabic phrase-based machine translation

Proceedings of the EACL 2009 Workshop on Computational Approaches to Semitic Languages - Semitic '09, 2009

We investigate syntactic reordering within an English to Arabic translation task. We extend a pre-translation syntactic reordering approach developed on a close language pair (English-Danish) to the distant language pair, English-Arabic. We achieve significant improvements in translation quality over related approaches, measured by manual as well as automatic evaluations. These results prove the viability of this approach for distant languages.