Towards Simplification: A Supervised Learning Approach
clg.wlv.ac.uk
7 pages
Abstract
The aim of this study is to train a computer system to distinguish between translated and original text, in order to investigate the simplification phenomenon. The experiments are based on Spanish comparable corpora with two different genres: medical and technical texts. The classifiers ...
Related papers
Expert Syst. Appl., 2019
The current bottleneck of all data-driven lexical simplification (LS) systems is the scarcity and small size of the parallel corpora (original sentences and their manually simplified versions) used for training. This is especially pronounced for languages other than English. We address this problem, taking Spanish as an example of such a language, by building new simplification-specific datasets of synonyms and paraphrases using freely available resources. We test their usefulness in the LS task by adding them, in various combinations, to the existing text simplification (TS) training dataset in a phrase-based statistical machine translation (PBSMT) approach. Our best systems significantly outperform the state-of-the-art LS systems for Spanish in the number of transformations performed and in the grammaticality, simplicity and meaning preservation of the output sentences. The results of a detailed manual analysis show that some of the newly built TS resources, although they have good lexical coverage and lead to a high number of transformations, often change the original meaning and do not generate simpler output when used in this PBSMT setup. Good combinations of these additional resources with the TS training dataset and a good choice of language model, in contrast, improve the lexical coverage and produce sentences which are grammatical, simpler than the original, and preserve the original meaning well.
ACM Transactions on Accessible Computing (TACCESS) - Special Issue on Speech and Language Processing for AT (Part 2), 2015
The way in which a text is written can be a barrier for many people. Automatic text simplification is a natural language processing technology that, when mature, could be used to produce texts that are adapted to the specific needs of particular users. Most research in the area of automatic text simplification has dealt with the English language. In this article, we present results from the Simplext project, which is dedicated to automatic text simplification for Spanish. We present a modular system with dedicated procedures for syntactic and lexical simplification that are grounded on the analysis of a corpus manually simplified for people with special needs. We carried out an automatic evaluation of the system's output, taking into account the interaction between three different modules dedicated to different simplification aspects. One evaluation is based on readability metrics for Spanish and shows that the system is able to reduce the lexical and syntactic complexity of the texts. We also show, by means of a human evaluation, that sentence meaning is preserved in most cases. Our results, even if our work represents the first automatic text simplification system for Spanish that addresses different linguistic aspects, are comparable to the state of the art in English Automatic Text Simplification.
2012
This paper addresses the problem of automatic text simplification. Automatic text simplification aims at reducing reading difficulty for people with cognitive disabilities, among other target groups. We describe an automatic text simplification system for Spanish which combines a rule-based core module with a statistical support module that controls the application of rules in the wrong contexts. Our system is integrated in a service architecture which includes a web service and mobile applications.
Procesamiento Del Lenguaje Natural, 2012
In this article we present the results of a study that aims to lay the groundwork for developing a lexical simplification module for Spanish. Drawing on studies for other languages, we first analyze the distribution of word frequency and word length in original texts and their manual simplifications. Second, we focus on cases where information is clarified by introducing definitions into simplified texts. Finally, we study the reduction of the informational content of the text and propose a system for handling it based on summarization techniques. Our empirical study lays the groundwork for developing a lexical processing component in a text simplification system currently under development.
Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR), 2014
This study explores the possibility of replacing the costly and time-consuming human evaluation of the grammaticality and meaning preservation of the output of text simplification (TS) systems with some automatic measures. The focus is on six widely used machine translation (MT) evaluation metrics and their correlation with human judgements of grammaticality and meaning preservation in text snippets. As the results show a significant correlation between them, we go further and try to classify simplified sentences into: (1) those which are acceptable; (2) those which need minimal post-editing; and (3) those which should be discarded. The preliminary results, reported in this paper, are promising.
2008
In this paper we investigate the main linguistic phenomena that can make texts complex and how they could be simplified. We focus on a corpus analysis of simple account texts available on the web for Brazilian Portuguese and propose simplification strategies for this language. This study illustrates the need for text simplification to facilitate access to information for low-literacy readers and potentially for people with other cognitive disabilities. It also highlights characteristics of simplification for Portuguese, which may differ from other languages. This study is the first step towards building Brazilian Portuguese text simplification systems. One of the scenarios in which these systems could be used is that of reading electronic texts produced, e.g., by the Brazilian government or by relevant news agencies.
AMIA ... Annual Symposium proceedings. AMIA Symposium, 2017
Simplifying medical texts facilitates readability and comprehension. While most simplification work focuses on English, we investigate whether features important for simplifying English text are similarly helpful for simplifying Spanish text. We conducted a user study on 15 Spanish medical texts using Amazon Mechanical Turk and measured perceived and actual difficulty. Using the median of the difficulty scores, we split the texts into easy and difficult groups and extracted 10 surface, 2 semantic and 4 grammatical features. Using t-tests, we identified those features that significantly distinguish easy text from difficult text in Spanish and compared them with prior work in English. We found that easy Spanish texts use more repeated words and adverbs, fewer negations and more familiar words, similar to English. Also like English, difficult Spanish texts use more nouns and adjectives. However, in contrast to English, easier Spanish texts contained longer sentences and used grammatical struct...
2011
We present a method for the sentence-level alignment of short simplified texts to the original texts from which they were adapted. Our goal is to align a medium-sized corpus of parallel text, consisting of short news texts in Spanish with their simplified counterparts. No training data is available for this task, so we have to rely on unsupervised learning. In contrast to bilingual sentence alignment, in this task we can exploit the fact that the probability of sentence correspondence can be estimated from lexical similarity between sentences. We show that the algorithm employed performs better than a baseline which approaches the problem with a TF*IDF sentence similarity metric. The alignment algorithm is being used for the creation of a corpus for the study of text simplification in the Spanish language.
Proceedings of the ... International Florida Artificial Intelligence Research Society Conference, 2022
Natural language processing encompasses several tasks, one of which is automatic text simplification. Telling whether one text is simpler than another involves not only knowledge about the language being analyzed, but also cultural knowledge of the target audience to which the text is directed. Most of the current metrics used to measure text simplification are based on the use of parallel corpora prepared by humans, which makes it difficult to apply the metrics to automatic text simplification in real time. In this paper, we present ISiM (Independent Simplification Metric), a metric that dispenses with a parallel corpus, is simple, fast, and independent of language and human annotation, and is capable of quantifying the simplicity/complexity of a sentence, thus contributing to improving automatic text simplification. The results of the tests performed indicate that the proposed metric has the potential to be used to evaluate automatic methods of simplification.
Ruslan Mitkov