Translation Quality Estimation
– Useful references [PDF]
– WMT16 QE shared task [LINK]
– List of simple words [txt]
– Hands-on QuEst++ [PDF]
In this tutorial we will introduce the background and state of the art on translation quality estimation at different levels of granularity (word, sentence and document) and discuss open challenges in the area. We will then demonstrate QuEst++, a framework for quality estimation, including how to set up an experiment and run the code, and how to implement new features and add other machine learning algorithms to the pipeline.
Description and Relevance to PROPOR
Quality Estimation (QE) of Machine Translation (MT) has become increasingly popular over the last decade. With the goal of providing a prediction of the quality of a machine translated text, QE systems have the potential to make MT more useful in a number of scenarios, for example: improving post-editing efficiency by filtering out segments that would require more effort or time to correct than to translate from scratch [Specia, 2011]; selecting high quality segments [Soricut and Echihabi, 2010]; selecting a translation from either an MT system or a translation memory [He et al., 2010]; selecting the best translation from multiple MT systems [Shah and Specia, 2014]; and highlighting words or phrases that need revision [Bach et al., 2011].
Sentence-level QE is addressed as a supervised machine learning task, using a variety of algorithms to induce models from examples of sentence translations annotated with quality labels (e.g. 1–5 Likert scores). This level has been covered in shared tasks organised annually by the Workshop on Statistical Machine Translation (WMT) since 2012 [Callison-Burch et al., 2012, Bojar et al., 2013, Bojar et al., 2014, Bojar et al., 2015]. While standard algorithms can be used to build prediction models, the key to this task is feature engineering. Two open source feature extraction toolkits are available for that: Asiya [Gonzàlez et al., 2012] and QuEst [Specia et al., 2013]. The latter has been used as the official baseline for the WMT shared tasks and extended by a number of participants, leading to improved results over the years.
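The supervised setup described above can be sketched as follows. This is a minimal illustration, not QuEst++ code: the feature values and quality scores are synthetic stand-ins for real sentence-level features and human annotations, and the choice of SVR is just one common option for this kind of regression task.

```python
# Minimal sketch of sentence-level QE as supervised regression.
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error

rng = np.random.RandomState(42)

# Pretend each row holds 17 features for one translated sentence
# (e.g. source length, LM scores, translation probabilities).
X_train = rng.rand(100, 17)
y_train = rng.uniform(1, 5, size=100)   # human quality scores (1-5)
X_test = rng.rand(20, 17)
y_test = rng.uniform(1, 5, size=20)

# Train a regressor and predict quality scores for unseen sentences.
model = SVR(kernel="rbf", C=1.0, epsilon=0.1)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print("MAE: %.3f" % mean_absolute_error(y_test, predictions))
```

With real annotated data, the learned model maps feature vectors extracted from new translations to predicted quality scores.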
Word-level QE [Blatz et al., 2004, Ueffing and Ney, 2005, Luong et al., 2014] has recently received more attention. It is arguably a more challenging task, where a quality label is to be produced for each target word. An additional challenge is the acquisition of sizable training sets. Significant efforts have been made (including three years of shared tasks at WMT), leading to a noticeable increase in research on word-level QE over the last year.
An application that can beneﬁt from word-level QE is spotting errors (wrong words) in a post-editing/revision scenario.
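The per-word prediction task can be framed as token classification, as in the WMT shared tasks with binary OK/BAD labels. The sketch below uses synthetic per-word feature vectors and a plain logistic regression classifier purely for illustration; real systems rely on richer lexical and contextual features.

```python
# Illustrative sketch: word-level QE as binary token classification.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)

# One row per target word: e.g. word length, LM score of its
# context, alignment features (all synthetic here).
X_train = rng.rand(200, 5)
y_train = rng.randint(0, 2, size=200)   # 1 = OK, 0 = BAD

clf = LogisticRegression()
clf.fit(X_train, y_train)

# Predict an OK/BAD label for each word of a new 8-word translation.
X_new = rng.rand(8, 5)
word_labels = clf.predict(X_new)
print(word_labels)
```

In a post-editing interface, the words predicted as BAD would be the ones highlighted for revision.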
Document-level QE has received much less attention than the other two levels. This task consists of predicting a single label for an entire document, be it an absolute score [Scarton and Specia, 2014] or a relative ranking of translations by one or more MT systems [Soricut and Echihabi, 2010], which is useful for gisting purposes, where post-editing is not an option. The first shared task on document-level QE was organised last year at WMT15. Although feature engineering is the focus of this tutorial, it is worth mentioning that one important research question in document-level QE is how to define ideal quality labels for documents [Scarton et al., 2015].
QuEst++ [Specia et al., 2015] is a framework for translation QE. It has two main modules: a feature extraction module and a machine learning module. The feature extraction module can operate at sentence, word and document levels. Machine learning algorithms specifically designed for each of these levels are available.
Theoretical aspects of QE: in the first part of this tutorial, we will introduce the task of QE, show the standard framework for it and describe the three common levels of prediction. Challenges and future work for each level will be discussed, including ideas on how to extend the framework to other Natural Language Processing (NLP) tasks. In addition, applications in research and industry will be illustrated.
Hands-on QuEst++: the second part of the tutorial will cover a hands-on activity with QuEst++, showing how to install and run it on examples for all prediction levels. We will present the two modules of the framework: Feature Extractor (implemented in Java) and Machine Learning (implemented in Python). An example feature will be added to QuEst++, using external resources and showing the interaction between classes and configuration files. A new machine learning algorithm from scikit-learn will be included in the Machine Learning module, showing how to create wrappers and configuration files for it.
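The general wrapper pattern for plugging a scikit-learn estimator into a configuration-driven pipeline can be sketched as below. The class and method names here are illustrative only, not QuEst++'s actual interface; the configuration dictionary stands in for what would be read from a YAML file.

```python
# Sketch of a wrapper adapting a scikit-learn estimator to a
# train/predict interface driven by a configuration dictionary.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

class SklearnWrapper(object):
    """Builds an estimator from configuration parameters
    (as they might be loaded from a YAML file)."""

    def __init__(self, config):
        params = config.get("parameters", {})
        self.model = GradientBoostingRegressor(**params)

    def train(self, X, y):
        self.model.fit(X, y)
        return self

    def predict(self, X):
        return self.model.predict(X)

# Configuration as it might appear after parsing a YAML file.
config = {"algorithm": "GradientBoostingRegressor",
          "parameters": {"n_estimators": 50, "max_depth": 3}}

rng = np.random.RandomState(1)
X, y = rng.rand(50, 17), rng.uniform(1, 5, size=50)
wrapper = SklearnWrapper(config).train(X, y)
scores = wrapper.predict(X[:5])
```

Keeping the algorithm choice and its parameters in a configuration file means new learners can be swapped in without touching the pipeline code.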
This tutorial will be structured in 2.5 hours, as follows:
• 1h: theoretical part, presenting the QE task, its levels of feature extraction and prediction, and open challenges.
• 1h30: hands-on with QuEst++: how to install and run it, add a new feature, and add a new machine learning algorithm.
Participants will be asked to install tools and dependencies on their laptops prior to the tutorial (e.g. SRILM, TreeTagger).
Carolina Scarton is a PhD student at the University of Sheffield, UK, supervised by Professor Lucia Specia. Her thesis is on document-level assessment for QE of MT. More specifically, her research focuses on how to assess machine translated documents in order to build document-level QE models, and on the development of features for document-level QE (exploring discourse information). She is also interested in Readability Assessment and Language Acquisition. Carolina received a Master's degree in Computer Science from the University of São Paulo, Brazil, in 2013.
email: c.<<surname>> at sheffield dot ac dot uk
The technical requirements are as follows:
• Internet connection (participants will be asked to download material, e.g. code, from the instructors' websites).
• In advance of the tutorial, participants will be asked to install external software required for QuEst++ to run:
– Apache Ant (>= 1.9.3 – http://ant.apache.org/). NetBeans has issues building on Linux; use Apache Ant instead to build from the command line:
sudo apt-get install ant
– Python 2.7.6 (or above – only 2.7 stable distributions – https://www.python.org/)
– NumPy and SciPy (NumPy >= 1.6.1 and SciPy >= 0.9 – http://www.scipy.org/install.
– scikit-learn (version 0.15.2 – https://pypi.python.org/pypi/
– PyYAML (http://pyyaml.org/)
• Feature extraction requirements, sentence-level:
– Perl 5 (or above – https://www.perl.org/get.html)
– SRILM (http://www.speech.sri.
Some tools (e.g. SRILM) might require Cygwin (https://www.cygwin.com/) to run on Windows.
• Linux and MacOS are recommended.
[Bach et al., 2011] Bach, N., Huang, F., and Al-Onaizan, Y. (2011). Goodness: a method for measuring MT confidence. In ACL11.
[Blatz et al., 2004] Blatz, J., Fitzgerald, E., Foster, G., Gandrabur, S., Goutte, C., Kulesza, A., Sanchis, A., and Ueffing, N. (2004). Confidence Estimation for Machine Translation. In COLING04.
[Bojar et al., 2013] Bojar, O., Buck, C., Callison-Burch, C., Federmann, C., Haddow, B., Koehn, P., Monz, C., Post, M., Soricut, R., and Specia, L. (2013). Findings of the 2013 Workshop on SMT. In WMT13.
[Bojar et al., 2014] Bojar, O., Buck, C., Federmann, C., Haddow, B., Koehn, P., Leveling, J., Monz, C., Pecina, P., Post, M., Saint-Amand, H., Soricut, R., Specia, L., and Tamchyna, A. (2014). Findings of the 2014 Workshop on SMT. In WMT14.
[Bojar et al., 2015] Bojar, O., Chatterjee, R., Federmann, C., Haddow, B., Huck, M., Hokamp, C., Koehn, P., Logacheva, V., Monz, C., Negri, M., Post, M., Scarton, C., Specia, L., and Turchi, M. (2015). Findings of the 2015 Workshop on SMT. In WMT15.
[Callison-Burch et al., 2012] Callison-Burch, C., Koehn, P., Monz, C., Post, M., Soricut, R., and Specia, L. (2012). Findings of the 2012 Workshop on SMT. In WMT12.
[Gonzàlez et al., 2012] Gonzàlez, M., Giménez, J., and Màrquez, L. (2012). A Graphical Interface for MT Evaluation and Error Analysis. In ACL12.
[He et al., 2010] He, Y., Ma, Y., van Genabith, J., and Way, A. (2010). Bridging SMT and TM with translation recommendation. In ACL10.
[Luong et al., 2014] Luong, N. Q., Besacier, L., and Lecouteux, B. (2014). LIG System for Word Level QE task. In WMT14.
[Scarton and Specia, 2014] Scarton, C. and Specia, L. (2014). Document-level translation quality estimation: exploring discourse and pseudo-references. In EAMT14.
[Scarton et al., 2015] Scarton, C., Zampieri, M., Vela, M., van Genabith, J., and Specia, L. (2015). Searching for Context: a Study on Document-Level Labels for Translation Quality Estimation. In EAMT15.
[Shah and Specia, 2014] Shah, K. and Specia, L. (2014). Quality estimation for translation selection. In EAMT14.
[Soricut and Echihabi, 2010] Soricut, R. and Echihabi, A. (2010). Trustrank: Inducing trust in automatic translations via ranking. In ACL10.
[Specia, 2011] Specia, L. (2011). Exploiting objective annotations for measuring translation post-editing effort. In EAMT11.
[Specia et al., 2015] Specia, L., Paetzold, G. H., and Scarton, C. (2015). Multi-level Translation Quality Prediction with QuEst++. In ACL-IJCNLP15 -System Demonstrations.
[Specia et al., 2013] Specia, L., Shah, K., de Souza, J. G. C., and Cohn, T. (2013). Quest -a translation quality estimation framework. In ACL13.
[Ueffing and Ney, 2005] Ueffing, N. and Ney, H. (2005). Word-level confidence estimation for machine translation using phrase-based translation models. In HLT/EMNLP.