Corpora and Tools for Processing Corpora

July 13, 2016 — Tomar, Portugal
Workshop co-located with PROPOR 2016

Workshop Proceedings
Corpora and Tools for Processing Corpora — Proceedings
Abstracts in Portuguese


Welcome and Introduction
ELRC and CEF.AT initiatives and MT@EC — António Branco and Hilário Leal Fontes
09:30 Processing of EU Multilingual Corpora — M.T. Carrasco
10:00 Language processing in MT@EC — Hilário Leal Fontes

10:30 coffee break

Multilingual resources
11:00 CM2News: Towards a Corpus for Multilingual Multi-document Summarization — Ariani Di Felippo
11:30 Language Resources and Processing Tools at the University of Lisbon in the NLX Group Collection — António Branco et al
12:00 Language resources for information extraction and semantic computing – NLP at PUCRS — Renata Vieira et al

12:30 lunch

Task-specific resources
14:00 MWE-aware corpus processing with the mwetoolkit and word embeddings — Aline Villavicencio et al
14:30 ZAC: Zero Anaphora Corpus, A Corpus for Zero Anaphora Resolution in Portuguese — Jorge Baptista et al

Beyond machine translation
15:00 Resources for Monolingual Translation: a case study of Text Simplification for Portuguese — Rodrigo Wilkens et al
15:30 Building a Brazilian Portuguese – Brazilian Sign Language Parallel Corpus using Motion Capture Data — José Mario De Martino et al
16:00 Discussion and wrap up


Call for papers

Much of the popularity of statistical machine translation solutions is due to the availability of software packages that make it increasingly easy and fast to train a working machine translation system. For this deployment to take place, these packages have been seen as requiring only a sufficiently large volume of data, including some form of parallel corpora of raw text.

While advances in ever more sophisticated aspects of language technology have made this increasingly feasible, the fact that the data needed to feed these systems still requires considerable preparation has been left in the shadow. Given the volume of appropriate corpora needed, this preparation is only practical if, on the one hand, suitable datasets are available and, on the other hand, the preparation is supported by a number of shallow processing tools, such as boilerplate removers, tokenisers, orthographic normalisers, hyphenators, foreign word detectors, inflectional analysers, etc.

While the construction of this type of tool is no longer a hot topic for cutting-edge research in language technology, resorting to such tools may in many cases turn out to be less easy than finding and using the much more sophisticated modules needed to deploy machine translation systems. The situation is especially acute for the vast majority of languages, which are comparatively less resourced than English in terms of language technology, and for tools that both perform at the state-of-the-art level and are openly available for reuse.

It goes without saying that these negative circumstances go hand in hand with, and are aggravated by, the fact that suitable parallel texts are not available or not easy to obtain. Interestingly, such tools and datasets often exist, and yet their development has never been documented in a publication or their availability has never been publicised.

The present workshop seeks to help improve this state of affairs by mapping both the available parallel datasets suitable to feed statistical machine translation systems and the available language processing tools useful for their preparation.

While pursuing this goal, the workshop also seeks to exchange ideas and disseminate best practices that help to foster the ELRC and CEF.AT initiatives.

We thus invite submissions reporting on language resources suitable to support statistical machine translation from/into Portuguese and on processing tools for their preparation. Presentations may take the form of an oral presentation, a demonstration, or both. While the workshop seeks to attract and promote papers concerning language resources and tools not yet documented in previous publications, for the sake of representative coverage, updated papers on other tools and resources are also welcome.

Submissions should be in PDF format, should not exceed 8 pages, and should use the article template that can be found here (see the sections under the header “CS Proceedings and Other Multiauthor Volumes”). Papers shall be submitted via the EasyChair online platform.

Accepted papers will be published in a special issue of the journal of the Portuguese Language Department of the Directorate-General for Translation of the European Commission, freely available online.

Participation in the workshop is free of charge for authors and non-authors of papers alike, for an estimated attendance of up to 40 people. The organization of the workshop is supported by the Portuguese Language Department of the Directorate-General for Translation of the European Commission.

The workshop invites submissions on resources and tools for any language that fit the stated aim of this workshop. English is the working language both for submissions and for the workshop itself.

February 25: First call for papers
March 24: Final call for papers
April 15: Deadline for submissions
May 16: Notification sent to authors
June 1: Camera-ready papers due
July 13, 2016: Workshop takes place

Organizing Committee
Hilário Leal Fontes, DGT – European Commission (chair)
Paulo Batista, DGT – European Commission
António Branco, University of Lisbon

The workshop is co-organized by the QTLeap project.

Programme Committee
António Branco, University of Lisbon (co-chair)
Hilário Leal Fontes, European Commission (co-chair)
Alexandru Ceausu, AMPLEXOR Luxembourg
Aline Villavicencio, Universidade Federal do Rio Grande do Sul
Amália Mendes, Centro de Linguística da Universidade de Lisboa
Belinda Maia, Universidade do Porto
Francis Tyers, Universitetet i Tromsø
Gabriel Lopes, Faculdade de Ciências e Tecnologia, UNL
Gorka Labaka, University of the Basque Country
Jorge Baptista, CECL/U. Algarve and L2F-Spoken Language Lab/INESC ID Lisboa
José João Almeida, Departamento de Informática – Universidade do Minho
José Ramom Pichel Campos, imaxin|software
Luísa Coheur, IST/INESC-ID Lisboa
M.T. Carrasco Benitez, European Commission
Maria José Machado, European Commission
Michael Jellinghaus, European Commission
Mikel Forcada, DLSI – Universitat d’Alacant
Paulo Quaresma, Universidade de Évora
Paulo Correia, European Commission
Renata Vieira, PUCRS
Thiago Pardo, Universidade de São Paulo
Xavier Gómez Guinovart, Universidade de Vigo

Questions about this workshop may be sent to hilario.fontes @