
Michel Simard

National Research Council Canada, Information and Communication Technologies, Faculty Member


Papers by Michel Simard

Machine Translation of Canadian Court Decisions

Conference of the Association for Machine Translation in the Americas, 2016

Second Language Writing Assistant System Description

We describe the system entered by the National Research Council Canada in the SemEval-2014 L2 writing assistant task. Our system relies on a standard Phrase-Based Statistical Machine Translation trained on generic, publicly available data. Translations are produced by taking the already translated part of the sentence as fixed context. We show that translation systems can address the L2 writing assistant task, reaching out-of-five word-based accuracy above 80 percent for 3 out of 4 language pairs. We also present a brief analysis of remaining errors.

Real-time automatic insertion of accents in French text

Natural Language Engineering, 2001

Automatic Accent Insertion (AAI) is the problem of re-inserting accents (diacritics) into a text where they are missing. Unaccented French texts are still quite common in electronic media, as a result of a long history of character encoding problems and the lack of well-established conventions for typing accented characters on computer keyboards. An AAI method for French is presented, based on a statistical language model. Next, it is shown how this AAI method can be used to do real-time accent insertions within a word processing environment, making it possible to type in French without having to type accents. Various mechanisms are proposed to improve the performance of real-time AAI, by exploiting online corrections made by the user. Experiments show that, on average, such a system produces less than one accentuation error for every 200 words typed.
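As a rough illustration of the general idea, and not the model used in the paper, the sketch below re-inserts accents with the simplest possible statistical model: for each unaccented word, it picks the accented form seen most often in a training corpus. The toy corpus and the names `strip_accents`, `train` and `reaccent` are assumptions made for this example; a realistic system would also use context to choose between candidates.

```python
import unicodedata
from collections import Counter, defaultdict


def strip_accents(word: str) -> str:
    """Remove diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def train(accented_words):
    """Count, for each unaccented form, how often each accented form occurs."""
    counts = defaultdict(Counter)
    for word in accented_words:
        counts[strip_accents(word)][word] += 1
    return counts


def reaccent(text: str, counts) -> str:
    """Replace each whitespace-separated word by its most frequent accented form."""
    restored = []
    for word in text.split():
        candidates = counts.get(word)
        # Unknown words are left unchanged.
        restored.append(candidates.most_common(1)[0][0] if candidates else word)
    return " ".join(restored)


# Hypothetical toy training data (already accented French words).
corpus = ["un", "été", "très", "chaud", "été", "déjà"]
model = train(corpus)
restored = reaccent("un ete tres chaud", model)
```

A unigram choice like this cannot separate forms that differ only by accent and are both frequent; resolving those cases requires context, which is where a language model over word sequences comes in.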

Automatic restoration of accents in French text

Recent Advances in Automatic Post-Editing


Workshop on Post-Editing Technology and Practice

Machine Translation and Self-post-editing for Academic Writing Support: Quality Explorations

Machine translation, 2018

Scholars who need to publish in English and who have English as a Foreign Language might consider and already be deploying free online MT engines to aid their writing processes. This raises the obvious question of whether MT is actually a useful aid for academic writing and what impact it might have on the quality of the written product. The work described in this chapter attempts to address these two broad questions. After a brief introduction, Sect. 2 reviews literature on three topics: English as a lingua franca in academic writing and the consequences this might have for individual authors and for academic disciplines, second-language writing, and the use of MT as a second-language writing aid. In Sect. 3, the methodology is presented. As will be detailed, the experiment involved ten participants, who were asked to write an abstract in their field of expertise. One half of the text was written in English, while the other half was written in their L1 and then machine-translated into English. Section 4 describes the results: subjective feedback of the participants acquired through a post-task survey, revision activity of a professional reviser, number and types of errors identified by a grammar-checking tool. The results suggest that MT and self-post-editing did not impact negatively on the text produced. However, the participants were divided in their opinions about which task was easier and whether they would consider using MT again for academic writing support. In Sect. 5, we offer a discussion on those results and provide future research ideas.

Automatic Text Simplification of News Articles in the Context of Public Broadcasting

arXiv (Cornell University), Dec 26, 2022

This report summarizes the work carried out by the authors during the Twelfth Montreal Industrial Problem Solving Workshop, held at Université de Montréal in August 2022. The team tackled a problem submitted by CBC/Radio-Canada on the theme of Automatic Text Simplification (ATS). In order to make its written content more widely accessible, and to support its second-language teaching activities, CBC/RC has recently been exploring the potential of automatic methods to simplify texts. They have developed a modular lexical simplification system (LSS), which identifies complex words in French and English texts, and replaces them with simpler, more common equivalents. Recently, however, the ATS research community has proposed a number of approaches that rely on deep learning methods to perform more elaborate transformations, not limited to just lexical substitutions, but covering syntactic restructuring and conceptual simplifications as well. The main goal of CBC/RC's participation in the workshop was to examine these new methods and to compare their performance to that of their own LSS. This report is structured as follows: In Section 2, we detail the context of the proposed problem and the requirements of the sponsor. We then give an overview of current ATS methods in Section 3. Section 4 provides information about the relevant datasets available, both for training and testing ATS methods. As is often the case in natural language processing applications, there is much less data available to support ATS in French than in English; therefore, we also discuss in that section the possibility of automatically translating English resources into French, as a means of supplementing the French data. The outcome of text simplification, whether automatic or not, is notoriously difficult to evaluate objectively; in Section 5, we discuss the various evaluation methods we have considered, both manual and automatic. Finally, we present the ATS methods we have tested and the outcome of their evaluation in Section 6, then Section 7 concludes this document and presents research directions.

"The developed lexical simplification system outperformed the LSBert baseline on 27 of 51 evaluation metrics, showing significant improvements in text simplification."

Human or Neural Translation?

Deep neural models tremendously improved machine translation. In this context, we investigate whether distinguishing machine from human translations is still feasible. We trained and applied 18 classifiers under two settings: a monolingual task, in which the classifier only looks at the translation; and a bilingual task, in which the source text is also taken into consideration. We report on extensive experiments involving 4 neural MT systems (Google Translate, DeepL, as well as two systems we trained) and varying the domain of texts. We show that the bilingual task is the easiest one and that transfer-based deep-learning classifiers perform best, with mean accuracies around 85% in-domain and 75% out-of-domain.

Fully Unsupervised Crosslingual Semantic Textual Similarity Metric Based on BERT for Identifying Parallel Data

Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), 2019

We present a fully unsupervised crosslingual semantic textual similarity (STS) metric, based on contextual embeddings extracted from BERT, Bidirectional Encoder Representations from Transformers (Devlin et al., 2019). The goal of crosslingual STS is to measure to what degree two segments of text in different languages express the same meaning. Not only is it a key task in crosslingual natural language understanding (XLU), it is also particularly useful for identifying parallel resources for training and evaluating downstream multilingual natural language processing (NLP) applications, such as machine translation. Most previous crosslingual STS methods relied heavily on existing parallel resources, thus leading to a circular dependency problem. With the advent of massively multilingual context representation models such as BERT, which are trained on the concatenation of non-parallel data from each language, we show that the deadlock around parallel resources can be broken. We perform intrinsic evaluations on crosslingual STS data sets and extrinsic evaluations on parallel corpus filtering and human translation equivalence assessment tasks. Our results show that the unsupervised crosslingual STS metric using BERT without fine-tuning achieves performance on par with supervised or weakly supervised approaches.

Measuring sentence parallelism using Mahalanobis distances: The NRC unsupervised submissions to the WMT18 Parallel Corpus Filtering shared task

Proceedings of the Third Conference on Machine Translation: Shared Task Papers, 2018

The WMT18 shared task on parallel corpus filtering (Koehn et al., 2018b) challenged teams to score sentence pairs from a large high-recall, low-precision web-scraped parallel corpus (Koehn et al., 2018a). Participants could use existing sample corpora (e.g. past WMT data) as a supervisory signal to learn what a "clean" corpus looks like. However, in lower-resource situations it often happens that the target corpus of the language is the only sample of parallel text in that language. We therefore made several unsupervised entries, setting ourselves an additional constraint that we not utilize the additional clean parallel corpora. One such entry fairly consistently scored in the top ten systems in the 100M-word conditions, and for one task, translating the European Medicines Agency corpus (Tiedemann, 2009), scored among the best systems even in the 10M-word conditions.

"A novel parallelism measure based on Mahalanobis distances outperformed traditional metrics on synthetic data."
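For readers unfamiliar with the term, the standard definition of the Mahalanobis distance is recalled below. The symbols are assumptions introduced here: $x$ is a feature vector, and $\mu$ and $\Sigma$ are the mean and covariance matrix of a reference distribution. How the paper derives such vectors from sentence pairs is not stated in the abstract, so this is only the generic formula.

```latex
\[
  d_M(x) \;=\; \sqrt{(x - \mu)^{\top}\, \Sigma^{-1}\, (x - \mu)}
\]
```

When $\Sigma$ is the identity matrix, this reduces to the ordinary Euclidean distance between $x$ and $\mu$.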

Segment choice models

Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, 2006

This paper presents a new approach to distortion (phrase reordering) in phrase-based machine translation (MT). Distortion is modeled as a sequence of choices during translation. The approach yields trainable, probabilistic distortion models that are global: they assign a probability to each possible phrase reordering. These "segment choice" models (SCMs) can be trained on "segment-aligned" sentence pairs; they can be applied during decoding or rescoring. The approach yields a metric called "distortion perplexity" ("disperp") for comparing SCMs offline on test data, analogous to perplexity for language models. A decision-tree-based SCM is tested on Chinese-to-English translation, and outperforms a baseline distortion penalty approach at the 99% confidence level.
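The abstract does not give the exact definition of "disperp", only that it is analogous to language-model perplexity. As a hedged illustration of that analogy, the sketch below computes ordinary perplexity from a list of per-event probabilities; applying it to the probabilities a model assigns to each segment choice is an assumption made here, and the function name `perplexity` is hypothetical.

```python
import math


def perplexity(probabilities):
    """Exponential of the average negative log-probability of the events."""
    if not probabilities:
        raise ValueError("need at least one probability")
    avg_neg_log = -sum(math.log(p) for p in probabilities) / len(probabilities)
    return math.exp(avg_neg_log)


# A model that spreads its mass uniformly over four options at every step
# has a perplexity close to 4.
uniform_over_four = perplexity([0.25, 0.25, 0.25, 0.25])
```

Lower values mean the model assigned higher probability to what was actually observed, which is what makes the measure usable for comparing models offline on held-out data.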

Une approche à la traduction automatique statistique par segments discontinus

This article presents a statistical machine translation method based on non-contiguous segments, that is, segments made up of words that do not necessarily appear contiguously in the text. We propose a method for producing such segments from word-aligned corpora. We also present a statistical translation model able to take such segments into account, as well as a method for training the model parameters that aims to maximize the accuracy of the translations produced, as measured with the NIST metric. Optimal translations are produced by means of a beam search. Finally, we present experimental results showing how the proposed method allows better generalization from the training data.

PEPr: Post-Edit Propagation Using Phrase-based Statistical Machine Translation

Translators who work by post-editing machine translation output often find themselves repeatedly correcting the same errors. We propose a method for Post-edit Propagation (PEPr), which learns post-editor corrections and applies them on-the-fly to further MT output. Our proposal is based on a phrase-based SMT system, used in an automatic post-editing (APE) setting with online learning. Simulated experiments on a variety of data sets show that for documents with high levels of internal repetition, the proposed mechanism could substantially reduce the post-editing effort.

NRC's PORTAGE system for WMT 2007

Proceedings of the Second Workshop on Statistical Machine Translation - StatMT '07, 2007

We present the PORTAGE statistical machine translation system which participated in the shared task of the ACL 2007 Second Workshop on Statistical Machine Translation. The focus of this description is on improvements which were incorporated into the system over the last year. These include adapted language models, phrase table pruning, an IBM1-based decoder feature, and rescoring with posterior probabilities.

Natural Language Engineering, 2005

Parallel texts have become a vital element for natural language processing. We present a panora...

WPTP 2012

Récupération de segments sous-phrastiques dans une mémoire de traduction

The usefulness of translation support tools based on translation memories is often limited by the nature of the text segments that they connect, generally whole sentences. This article examines the potential of a type of system that would be able to retrieve the translation of arbitrary sequences of words. Keywords: sub-sentential translation memory, computer-assisted translation, example-based machine translation.

De la traduction probabiliste aux mémoires de traduction (ou l’inverse)

Despite the exciting work accomplished over the past decade in the field of Statistical Machine Translation (SMT), we are still far from the point of being able to say that machine translation fully meets the needs of real-life users. In a previous study (Langlais, 2002), we showed how an SMT engine could benefit from terminological resources, especially when translating texts very different from those used to train the system. In the present paper, we discuss the opening of SMT to examples automatically extracted from a Translation Memory (TM). We report results on a fair-sized translation task using the database of a commercial bilingual concordancer.

Using parallel web pages for multi-lingual IR

In this report, we describe the approach we used in CLEF Cross-Language IR (CLIR) tasks. In our experiments, we used statistical models estimated from parallel texts automatically mined from the Web. In our previous experiments, we tested CLIR for English-French and English-Chinese. Our goal in this series of experiments is to see whether the approach can be extended to multilingual IR (with other languages). In particular, we compare models trained on Web documents with models that also combine other resources such as dictionaries.