
Towards Simplification: A Supervised Learning Approach



Iustina Ilisei (1), Diana Inkpen (2), Gloria Corpas Pastor (3), and Ruslan Mitkov (4)

(1, 4) Research Institute in Information and Language Processing, University of Wolverhampton, UK
(2) School of Information Technology and Engineering, University of Ottawa, Ottawa, Canada
(3) Department of Translation and Interpreting, University of Malaga, Malaga, Spain

Abstract

The aim of this study is to train a computer system to distinguish between translated and original text, in order to investigate the simplification phenomenon. The experiments are based on Spanish comparable corpora with two different genres: medical and technical texts. The classifiers achieve an accuracy of 87.16% on a test set, and reach up to 97.62% for separate test datasets from the technical domain. When we remove the features related to simplification from the machine learning process, the accuracy of the classifiers decreases. This can be considered an argument for the existence of the simplification universal.

1 Introduction

The characteristics exhibited by translated texts, compared to original, non-translated texts, have always been of great interest in Translation Studies. The unnatural language of translations has certain universal features that arise as a consequence of the translation process. Translations exhibit their own peculiar lexico-grammatical and syntactic characteristics (Borin and Prütz, 2001; Hansen, 2003; Teich, 2003). These “fingerprints” left by the translation process were first described by Gellerstam (1986) and named translationese. Fairly recently, it has been stated that there are common characteristics which all translations share, regardless of the source and target languages (Baker, 1993).
Toury (1995) proposed two laws of translation, the law of standardisation and the law of interference, while Baker (1993, 1996) defined four possible translation universals. However, the explanations and foundations of these universals are based on intuition and introspection. Laviosa (2002) continued this line of research by proposing features for simplification in a corpus-based study. Despite some evidence for the existence of such a phenomenon, defining the features needed to capture the simplification universal and its degree in translated texts remains a considerable challenge.

The aim of this study is twofold: to test the validity of the simplification hypothesis, and to use a language-independent feature vector in the process of training a system to distinguish between translated and original texts. Using only language-independent features has two main advantages: the system is widely applicable to other languages and, more importantly, the universality of the hypothesis is easier to investigate.

2 Related Work

One of the translation universals defined by Baker (1993) is simplification, which is described as the tendency of translators to produce simpler and easier-to-follow texts. The follow-up research methodology in the investigation of translation universals is based on comparable corpora, and some empirical results supporting the universal have been provided (Laviosa, 2002). More recently, Corpas et al. (2008) shed more light on the topic by proposing several statistically significant features in support of the simplification universal. Since previous studies lacked specific guidelines for defining the features of the universals, the authors based their experiments on a set of assumptions: translated texts have less varied and more familiar vocabulary; contain a greater rate of simple sentences and a higher rate of shorter sentences; are characterised by fewer discourse markers; and are easy to follow and generally more readable.
Simplification appears to be validated only for a few parameters (translated texts showed lower lexical density and richness, and seemed to be more readable), while others revealed unexpected results (the assumptions about sentence length, simple sentences and discourse markers were contradicted).

A different perspective on this research topic is taken by Baroni and Bernardini (2006), who report experiments using machine learning algorithms for the task of classifying Italian texts as translated or original. They use several features, including the words in the texts. In our work, we avoid this type of feature because it introduces language and domain dependence: it is difficult to ensure that the translated and the original texts cover exactly the same topics and sub-topics. They represent each document as a feature vector, varying both the size of the units (unigrams, bigrams, trigrams) and their type (word forms, lemmas, part-of-speech tags, and mixed representations). They show that the SVM classifier depends heavily on lexical cues, on the distribution of n-grams of function words and morpho-syntactic categories in general, and on personal pronouns and adverbs in particular. Their results show that shallow data representations can be sufficient to automatically distinguish professional translations from original texts with accuracy above the chance level, and they therefore hypothesise that this representation captures the distinguishing features of translationese.

3 Methodology

Our approach is based on supervised machine learning algorithms which aim to distinguish between translated and original texts. We train classifiers by including in the data representation vector specific features that we propose for the simplification universal. Therefore, if the accuracy obtained by the machine learning classification is high, it can be assumed that the simplification universal has been detected in these experiments.
If the accuracy of the classifiers decreases when we remove the simplification features from the feature vectors, this can be regarded as an argument for the existence of the simplification universal.

For our experiments, we use three comparable corpora, described in Corpas (2008). They are Spanish comparable corpora of original and translated texts. Two are from the medical domain, written by translation students and professional translators, respectively. The third one is from the technical domain, written by professionals. The three paired corpora are the following:

• Corpus of Medical Translations by Professionals (MTP), which is comparable to the Corpus of Original Medical texts by Professionals (MTPC);
• Corpus of Medical Translations by Students (MTS), which is comparable to the Corpus of Original Medical texts by Students (MTSC);
• Corpus of Technical Translations by Professionals (TT), which is comparable to the Corpus of Original Technical texts by Professionals (TTC).

We extract a training set of 450 randomly selected instances and a test set of 150 randomly selected instances from all three pairs of comparable texts. We keep the same proportion of texts of each kind in the selected training and test sets.

We propose 21 language-independent features for the training of our system. The first 12 are general parameters, while the next 9 are designed to capture the simplicity of texts. On the assumption that the simplification universal is valid, the latter features are expected to improve the performance of the classifiers. The first 12 features are the proportions in each text of the following: grammatical words, nouns, finite verbs, auxiliary verbs, adjectives, adverbs, numerals, pronouns, prepositions, determiners, and conjunctions, together with the ratio of grammatical words per lexical words.
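As an illustration of the general features listed above, the following minimal sketch computes per-text proportions from part-of-speech-tagged tokens. The tag names, the grouping of tags into grammatical and lexical words, and the function name `proportion_features` are assumptions made for this example; the paper does not specify its tagger, its tag set, or how it separates finite from non-finite verbs.

```python
from collections import Counter

# Hypothetical coarse tag names; the paper does not specify its tag set.
# Treating these groups as "lexical" and "grammatical" is an assumption.
LEXICAL = {"NOUN", "VERB", "ADJ", "ADV"}
GRAMMATICAL = {"AUX", "PRON", "ADP", "DET", "CONJ"}


def proportion_features(tagged_tokens):
    """Return per-text proportions for a list of (token, tag) pairs."""
    total = len(tagged_tokens)
    if total == 0:
        raise ValueError("text has no tokens")
    counts = Counter(tag for _, tag in tagged_tokens)
    # One proportion per tag: occurrences of the tag divided by all tokens.
    features = {f"prop_{tag.lower()}": counts[tag] / total
                for tag in sorted(LEXICAL | GRAMMATICAL | {"NUM"})}
    grammatical = sum(counts[t] for t in GRAMMATICAL)
    lexical = sum(counts[t] for t in LEXICAL)
    features["prop_grammatical"] = grammatical / total
    # Ratio of grammatical words per lexical words; undefined if no lexical words.
    features["gram_per_lex"] = grammatical / lexical if lexical else float("nan")
    return features


# Toy tagged sentence, for illustration only.
example = [("el", "DET"), ("paciente", "NOUN"), ("ha", "AUX"),
           ("tomado", "VERB"), ("dos", "NUM"), ("dosis", "NOUN")]
feats = proportion_features(example)
```

Because every value is a proportion or a ratio over tag counts, the resulting vector does not depend on the vocabulary of the text, which is the property the language-independent design relies on.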
The proposed simplification features are the following: the average sentence length, the parse tree depth, the proportions of simple sentences, complex sentences and sentences without any finite verb, the ambiguity level of sentences, the word length measured as the proportion of syllables per word, the ratio of lemmas to the number of tokens, and the ratio of lexical words to the total number of tokens.

The next stage of the experiments consists of a separate evaluation on the three parts of the datasets corresponding to each corpus category, in order to determine the performance of the text classification for each type and genre. Therefore, we have separate training and test sets from the technical domain written by professional translators, from the medical domain written by students, and from the medical domain written by professionals.

The classification was done with the following machine learning algorithms (Witten and Frank, 2005): Jrip, Decision Tree (J48), Naïve Bayes, BayesNet, SVM, Simple Logistic, and one meta-classification algorithm that combines the results of three algorithms: J48, Jrip and Simple Logistic. This meta-classifier was the one that obtained the best results among the many combinations of classifiers that we tried.

To assess the statistical significance of the improvement of the machine learning system when the simplification features are included, compared to the learning system without these features, we apply the paired two-tailed t-test (with a 0.5 significance level). T-tests have been applied to the evaluation measurements that we calculated: the accuracy, the precision, the recall and the F-measure of the classifiers.

4 Experiments

In order to investigate the simplification universal, we compare the accuracy of the classification task when the data representation contains all the parameters with the accuracy of the system without the simplification features.
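The comparison described here can be sketched as a paired t statistic over matched measurements, for example per-fold accuracies obtained with and without the simplification features. This is a minimal sketch under stated assumptions: the function name `paired_t_statistic` is hypothetical, the per-fold values are illustrative and not taken from the paper, and only the statistic and its degrees of freedom are computed; the two-tailed decision still requires looking up a critical value for the chosen significance level.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(with_feats, without_feats):
    """Return (t, degrees_of_freedom) for two paired samples of equal length."""
    if len(with_feats) != len(without_feats):
        raise ValueError("samples must be paired")
    # Work on the per-pair differences, as a paired test requires.
    diffs = [a - b for a, b in zip(with_feats, without_feats)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (sd / math.sqrt(n)), n - 1


# Illustrative per-fold accuracies, NOT taken from the paper.
acc_all = [0.80, 0.78, 0.82, 0.79, 0.81]
acc_ablated = [0.74, 0.75, 0.76, 0.73, 0.77]
t, df = paired_t_statistic(acc_all, acc_ablated)
```

The absolute value of `t` is then compared with the two-tailed critical value of the t distribution with `df` degrees of freedom.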
Our assumption is the following: if the lack of simplification features causes a statistically significant difference, this can be considered an argument for the existence of the simplification universal.

4.1 Classification results

In Table 1, we present the main results of the classification when all three corpora are used as one larger corpus. We report accuracy results for 10-fold cross-validation on the training data (to see how well the classifiers were able to learn), and for the test data (to make sure that what was learnt on the training data remains valid when the classifiers are applied to unseen test data). In all the tables, a star next to the result for a classifier indicates that the result with the simplification features is better, in a statistically significant way, than the result of the same classifier without the simplification features. Stars therefore appear only on the side of the classifier that included all features, and only where the improvement brought by the simplification features is statistically significant.

As is usual for machine learning techniques, our classifiers need to be better than a baseline classifier that chooses a class by chance. The baseline in our experiments (the ZeroR classifier from Weka) always predicts the majority class of the data set, which happens to be the original class. Therefore, our baseline is about 65% overall (65.33% for cross-validation and 64.86% on the test set, see Table 1), as we kept the same proportion of instances in both the training dataset and the test dataset.
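A majority-class baseline of this kind can be sketched in a few lines. This is not Weka's ZeroR implementation, only a minimal illustration of the same idea; the function name `zero_r_accuracy` and the tiny training list are hypothetical, while the test counts (66 original and 36 translated instances) are those given for test set pair 2 in Section 4.2.

```python
from collections import Counter


def zero_r_accuracy(train_labels, test_labels):
    """Predict the majority training class for every test instance."""
    majority, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority)
    return majority, hits / len(test_labels)


# Hypothetical training labels in which "original" is the majority class.
train = ["original", "original", "translated"]
# Class counts of test set pair 2 as reported in Section 4.2.
test = ["original"] * 66 + ["translated"] * 36
majority, accuracy = zero_r_accuracy(train, test)
print(majority, round(accuracy * 100, 2))
```

With these counts the baseline accuracy is 66/102, which rounds to the 64.71% reported for test set pair 2 in Table 3.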
Accuracy             Including Simplification Features     Excluding Simplification Features
                     10-fold cross-val.   Test set         10-fold cross-val.   Test set
Baseline (ZeroR)     65.33%               64.86%           65.33%               64.86%
Naive Bayes          *76.67%              79.05%           69.33%               75.00%
BayesNet             78.67%               79.73%           75.11%               77.03%
Jrip                 79.56%               83.11%           73.33%               77.03%
Decision Tree        78.22%               81.76%           78.22%               81.76%
Simple Logistic      *77.33%              83.11%           71.11%               80.41%
SVM                  *79.11%              *81.76%          69.33%               73.65%
Meta-classifier      80.00%               87.16%           73.33%               85.81%

Table 1: Classification Results: Accuracies for several classifiers.

The meta-classifier, which takes the majority vote between J48, Jrip and Simple Logistic, reaches 87.16% for the randomly selected test set and 80% for 10-fold cross-validation. In Table 2, the results for the test set reach 0.83 precision and 0.63 recall, with a statistically significant improvement in F-measure to 0.69 for the SVM classifier, when the simplification features are included in the data representation. BayesNet is the one classifier that shows a significant improvement on all three evaluation measurements when the simplification features are included; excluding them reduces its F-measure to 0.08.

                     Including Simplification Features     Excluding Simplification Features
                     Precision   Recall   F-measure        Precision   Recall   F-measure
Naive Bayes          0.77        *0.64    0.68             0.8         0.52     0.61
BayesNet             *0.55       *0.43    *0.41            0.07        0.09     0.08
Jrip                 0.61        0.55     0.56             0.67        0.68     0.65
Decision Tree        0.64        0.58     0.59             0.75        0.61     0.65
Simple Logistic      0.77        0.66     0.7              0.69        0.54     0.58
SVM                  0.83        *0.63    *0.69            0.73        0.47     0.54
Meta-classifier      0.76        0.63     0.66             0.73        0.65     0.66

Table 2: Classification results for the test set (precision, recall, and F-measure).

4.2 Classification results for the three separate test sets

We continued our experiments with the evaluation of our system on three test data subsets according to the three types of corpora: test set pair 1 for MTP-MTPC, test set pair 2 for MTS-MTSC, and test set pair 3 for TT-TTC.
We keep the same proportion as in the previous stage: test set pair 2 has 66 instances of the original class and 36 of the translated class; test set pair 3 has 28 instances of the original class and 14 of the translated class. For the first pair we take only 2 original and 2 translated instances, as we have only 5 in total for this pair. Thus, the accuracy for this pair varies between 50% and 100%, as the following table shows.

In Table 3, we present the accuracies for several classifiers on these three datasets. As expected from our previous experiment, none of the classifiers is significantly worse when the simplification features are included. Furthermore, the SVM classifier shows a statistically significant improvement for the technical domain written by professionals, reaching the highest performance of 97.62% accuracy. Other classifiers, such as BayesNet, Simple Logistic, and the meta-classifier, register the same value for the same pair (technical domain), but without statistical significance according to the t-test. Overall, among the accuracies presented in this table, the results for test set pair 3 are striking.

Accuracy             Including Simplification Features     Excluding Simplification Features
                     Pair 1     Pair 2     Pair 3          Pair 1     Pair 2     Pair 3
Baseline (ZeroR)     50.00%     64.71%     66.67%          50.00%     64.71%     66.67%
Naive Bayes          100.00%    71.57%     95.24%          100.00%    71.57%     80.95%
BayesNet             50.00%     73.53%     97.62%          50.00%     71.57%     92.86%
Jrip                 50.00%     79.42%     95.24%          50.00%     72.55%     92.86%
Decision Tree        50.00%     77.45%     92.86%          100.00%    75.49%     95.24%
Simple Logistic      75.00%     77.45%     97.62%          75.00%     79.41%     83.33%
SVM                  75.00%     75.49%     *97.62%         100.00%    74.51%     69.05%
Meta-classifier      75.00%     82.35%     97.62%          75.00%     78.43%     92.86%

Table 3: Classification results on the three test sets (columns are test set pairs 1, 2 and 3).

4.3 Preliminary results analysis

The Jrip classifier and the J48 decision tree are the only classifiers whose output is intuitive enough for human analysis (Quinlan, 1986).
In the decision tree that was learnt, the first level is the proportion between lemmas and tokens, a feature considered to be indicative of simplification (Corpas et al., 2008). The second level of the decision tree contains the sentence length and the proportion between grammatical words and lexical words. Sentence length is a characteristic widely discussed in similar studies, and one whose results were difficult to interpret in Corpas et al. (2008). The proportion between grammatical words and lexical words is an original feature proposed in this paper, considered to stand for the translationese phenomenon rather than to be an indicator strictly of the simplification universal. The third level of the tree contains the proportions of pronouns and conjunctions. Personal pronouns have been considered before, while in this study we take all pronouns, regardless of their type. As conjunctions have not previously been proposed as a feature of simplification, these results point to a new direction in the investigation of translation studies. Thus, the top features taken into account by the decision tree algorithm are: the proportion of lemmas by tokens, the ratio between grammatical words and lexical words, and the sentence length, followed by the ratios of pronouns and conjunctions.

The Jrip classifier outputs its rules in a readable format, showing that the following features helped in the categorisation task: firstly, the proportion of lemmas by tokens and the proportion of finite verbs; secondly, the sentence length, the proportion of nouns and the proportion of syllables per word; and thirdly, the proportions of finite verbs and pronouns in the text.

We applied feature selection techniques to determine which features are strongly correlated with one of the two classes (original vs. translated texts).
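Information gain, one of the ranking criteria used in this analysis, measures how much knowing a feature reduces the entropy of the class label. The sketch below is a simplified illustration, assuming a single binary split of a numeric feature at a fixed threshold; Weka discretises numeric attributes with its own procedure, which is not reproduced here. The function names and the toy values are hypothetical and are not taken from the corpora.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting a numeric feature at `threshold`."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    # Weighted entropy of the two partitions; empty partitions contribute nothing.
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (left, right) if part)
    return entropy(labels) - remainder


# Toy lemma/token ratios and labels, for illustration only.
ratios = [0.30, 0.32, 0.45, 0.50]
classes = ["translated", "translated", "original", "original"]
gain_clean_split = information_gain(ratios, classes, 0.40)
gain_poor_split = information_gain(ratios, classes, 0.31)
```

A threshold that separates the two classes perfectly yields the maximum gain for a balanced binary problem (one bit), while a threshold that leaves mixed partitions yields a smaller gain; ranking features by this quantity gives an ordering of the kind reported in this section.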
The features ranked by the information gain feature selection method are as follows (starting with the top-ranked feature): the ratio of lemmas to the number of tokens, the ratio of grammatical words per lexical words, the ratios of finite verbs, numerals, and adjectives, the sentence length, the ratio of pronouns, the simple sentences, the proportion of syllables per word, the ratio of grammatical words, the sentences without any finite verb, the proportion of nouns in texts, the ratio of lexical words, the proportions in texts of determiners, complex sentences, and conjunctions, the parse tree depth, the proportions of auxiliary verbs and adverbs, the ambiguity level of sentences, and the ratio of prepositions to the total number of tokens. The top-ranked features according to chi-square are slightly different: the ratio of lemmas to the number of tokens, the ratio of grammatical words per lexical words, the ratios of finite verbs, numerals, and adjectives, the sentence length, the ratio of pronouns, the proportion of syllables per word, the simple sentences, the sentences without any finite verb, the proportion of nouns in texts, the ratio of lexical words, and the ratio of grammatical words.

5 Conclusions and further work

This study uses a supervised learning approach to identify the features that characterise translated texts as opposed to original texts. The system is based on Spanish comparable corpora from the medical and technical genres. The novelty of our study lies in a model based on language-independent and domain-independent features that include indicators of simplification, which performs better than the system trained without the simplification features. On the categorisation task, our system has an accuracy of up to 87.16% on a test set, and obtains up to 97.62% for the technical test dataset.
When we remove the features related to simplification from the machine learning process, the performance of the classifiers decreases. This can be considered an argument for the existence of the simplification universal. The results indicate that our algorithms perform the task by relying mainly on the following features: the proportion of lemmas by tokens, the proportion between grammatical words and lexical words, the sentence length, the proportion of syllables per word, and the proportions of nouns, pronouns, finite verbs, and conjunctions.

In future work, we would like to experiment with a similar approach to investigate the other universals, for example the explicitation hypothesis. Another line of research consists of the analysis of the features used in the detection of explicitation, and we would also like to experiment with different representations of these universals.

References

Baker, M. (1993). 'Corpus Linguistics and Translation Studies – Implications and Applications'. In: M. Baker, M.G. Francis & E. Tognini-Bonelli (eds.). Text and Technology: In Honour of John Sinclair. Amsterdam & Philadelphia: John Benjamins. 233-250.

Baker, M. (1996). 'Corpus-based Translation Studies: The Challenges that Lie Ahead'. In: H. Somers (ed.). 1996. Terminology, LSP and Translation: Studies in Language Engineering, in Honour of Juan C. Sager. Amsterdam & Philadelphia: John Benjamins. 175-186.

Baroni, M. and Bernardini, S. (2006). 'A new approach to the study of translationese: Machine-learning the difference between original and translated text'. Literary and Linguistic Computing. 21, 3: 259-274.

Bernardini, S. and Zanettin, F. (2004). 'When is a Universal not a Universal?' In Mauranen, A. and Kujamäki, P. (eds), Translation Universals. Do they exist? Amsterdam: Benjamins, pp. 51-62.

Borin, L. and Prütz, K. (2001). 'Through a dark glass: part of speech distribution in original and translated text'. In Daelemans, W., Sima’an, K., Veenstra, J. and Zavrel, J. (eds), Computational Linguistics in the Netherlands 2000. Amsterdam: Rodopi, pp. 30-44.

Corpas Pastor, G. (2008). Investigar con corpus en traducción: los retos de un nuevo paradigma. Frankfurt am Main, Berlin & New York: Peter Lang.

Corpas Pastor, G., Mitkov R., Afzal N., Pekar V. (2008). Translation Universals: Do they exist? A corpus-based NLP study of convergence and simplification. In Proceedings of the AMTA (2008). Waikiki, Hawaii.

Frawley, W. (1984). 'Prolegomenon to a theory of translation'. In Frawley, W. (ed.), Translation: Literary, Linguistic and Philosophical Perspectives. Newark: University of Delaware Press, pp. 159-75.

Gellerstam, M. (1986). 'Translationese in Swedish novels translated from English'. In Wollin, L. and Lindquist, H. (eds), Translation Studies in Scandinavia. Lund: CWK Gleerup, pp. 88-95.

Hansen, S. (2003). The Nature of Translated Text. Saarbrücken: Saarland University.

Laviosa, S. (2002). Corpus-based Translation Studies. Theory, Findings, Applications. Amsterdam & New York: Rodopi.

Teich, E. (2003). Cross-linguistic Variation in System and Text. Berlin: Mouton de Gruyter.

Toury, G. (1995). Descriptive Translation Studies and Beyond. Amsterdam: John Benjamins.

Witten, I. and Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques. Second Edition. Morgan Kaufmann.

Quinlan, J.R. (1986). 'Induction of Decision Trees'. Machine Learning, 1:81-106.
