Avaliação de Similaridade Semântica e Inferência Textual

ASSIN (Avaliação de Similaridade Semântica e Inferência Textual) is an evaluation forum for two related tasks: semantic similarity and textual entailment recognition. It introduces a large-scale dataset annotated for both phenomena in Portuguese, enabling the development of machine-learning-based NLP systems that address them.

The task of measuring semantic similarity was introduced at SemEval 2012 in the Semantic Textual Similarity (STS) track, and textual entailment recognition first appeared in the RTE Challenges. The SICK shared task brought the two together at SemEval 2014, and ASSIN now offers both tasks with Portuguese data.

The ASSIN workshop will promote discussion of these tasks, their difficulties and possible solutions, and will compare the contributions of different computational techniques, tools and linguistic resources.

Workshop Program and Accepted Papers

09h00 – 09h30  Overview of the Shared Task and Corpus ASSIN [SLIDES]

09h30 – 09h50  Solo Queue [SLIDES]

09h50 – 10h10  Reciclagem [SLIDES]

10h10 – 10h30  Blue Man Group [SLIDES]

10h30 – 11h00  Coffee Break

11h00 – 11h20  ASAPP [SLIDES]

11h20 – 11h40  LEC-UNIFOR [SLIDES]

11h40 – 12h00  L2F/INESC-ID [SLIDES]

12h00 – 12h30  Talk by Hugo Gonçalo Oliveira: Portuguese Lexical Knowledge Bases (in conjunction with the LexSem+Logics workshop)

Please note that slides should be in English, but talks may be given in either English or Portuguese.

Task Description

The ASSIN dataset contains 10,000 sentence pairs collected from Google News, half from Brazilian sources and half from Portuguese ones. 6,000 pairs are released for training and the remaining 4,000 will serve as a blind test set. Each pair is annotated for both semantic relatedness and textual entailment.

Semantic relatedness is measured on a scale from 1 to 5. The general guidelines for each score are:

  1. Completely different sentences, on different subjects
  2. Sentences are not related, but are roughly on the same subject
  3. Sentences are somewhat related; they may describe different facts but share some details
  4. Sentences are strongly related, but some details differ
  5. Sentences mean essentially the same thing

A sentence T (the text) entails another sentence H (the hypothesis) if, after reading both and knowing that T is true, a person concludes that H must also be true. ASSIN also distinguishes bidirectional entailment cases, or paraphrases.
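The relationship between the three labels can be sketched in a few lines. This is only an illustration of the definition above, not the annotation procedure: the function name and its two boolean inputs are hypothetical, and the label strings are the ones listed for the submission format.

```python
def entailment_label(t_entails_h: bool, h_entails_t: bool) -> str:
    """Map two directional judgments to one of the three labels.

    Hypothetical helper: the inputs stand for "T entails H" and
    "H entails T" as judged by a person or a system.
    """
    if t_entails_h and h_entails_t:
        # Entailment in both directions is treated as a paraphrase.
        return "Paraphrase"
    if t_entails_h:
        return "Entailment"
    # Assumption of this sketch: entailment only from H to T does not
    # count for the pair (T, H), so it falls under "None".
    return "None"
```

For example, `entailment_label(True, False)` gives `"Entailment"`, while `entailment_label(True, True)` gives `"Paraphrase"`.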

 

Examples

Semantic similarity:

  Score 1
    Mas esta é a primeira vez que um chefe da Igreja Católica usa a palavra em público.
    A Alemanha reconheceu ontem pela primeira vez o genocídio armênio.

  Score 2
    Como era esperado, o primeiro tempo foi marcado pelo equilíbrio.
    No segundo tempo, o panorama da partida não mudou.

  Score 3
    Houve pelo menos sete mortos, entre os quais um cidadão moçambicano, e 300 pessoas foram detidas.
    Mais de 300 pessoas foram detidas por participar de atos de vandalismo.

  Score 4
    A organização criminosa é formada por diversos empresários e por um deputado estadual.
    Segundo a investigação, diversos empresários e um deputado estadual integram o grupo.

  Score 5
    Outros 8.869 fizeram a quadra e ganharão R$ 356,43 cada um.
    Na quadra 8.869 apostadores acertaram, o prêmio é de R$ 356,43 para cada.

Entailment:

  Entailment
    Como não houve acordo, a reunião será retomada nesta terça, a partir das 10h.
    As partes voltam a se reunir nesta terça, às 10h.

  Paraphrase
    Vou convocar um congresso extraordinário para me substituir enquanto presidente.
    Vou organizar um congresso extraordinário para se realizar a minha substituição como presidente.

  No relation
    As apostas podem ser feitas até as 19h (de Brasília).
    As apostas podem ser feitas em qualquer lotérica do país.

 

Participants

Participants may develop systems to solve both subtasks or a single one. Additionally, performance can be measured separately on Brazilian and European data. The release of systems employed by participants as open source is highly encouraged, as it will contribute to the advancement of the state of the art.
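Scoring the two variants separately could look like the sketch below. The page does not name the evaluation metrics, so accuracy (for entailment) and mean squared error (for similarity) are used purely as illustrative assumptions; the record keys and the variant tags `pt-br` and `pt-pt` are hypothetical as well. The official evaluation scripts remain the reference.

```python
from collections import defaultdict


def evaluate_by_variant(records):
    """Compute per-variant scores from an iterable of dicts.

    Each record is assumed to have the keys: variant, gold_label,
    pred_label, gold_sim and pred_sim (all names are hypothetical).
    """
    groups = defaultdict(list)
    for record in records:
        groups[record["variant"]].append(record)

    report = {}
    for variant, rows in groups.items():
        # Fraction of pairs whose predicted label matches the gold label.
        correct = sum(r["gold_label"] == r["pred_label"] for r in rows)
        # Mean squared difference between gold and predicted similarity.
        sq_err = sum((r["gold_sim"] - r["pred_sim"]) ** 2 for r in rows)
        report[variant] = {
            "accuracy": correct / len(rows),
            "mse": sq_err / len(rows),
        }
    return report
```

Grouping first and scoring afterwards keeps the metric code independent of how many variants the data contains.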

Interested participants should register with the workshop organizers to receive the test data on the scheduled date. To register, send an email to propor2016assin@gmail.com.

Each participant may submit the results of up to 3 runs, which allows different configurations to be tried. Each run must be an XML file in the same format as the training data. The pair elements should carry the system's answer in the attributes entailment (one of Entailment, Paraphrase and None) and similarity (a real value between 1 and 5, inclusive). If the system solves only one of the tasks, the other attribute may be omitted. Files must be encoded in UTF-8.
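A run file could be produced along the following lines. This is a minimal sketch under stated assumptions, not an official tool: it assumes each `pair` element carries an `id` attribute and that the answers are keyed by that id; the function names `annotate` and `write_run` are hypothetical. Only the `pair` element and the `entailment` and `similarity` attributes come from the description above; the training data should be checked for the exact layout.

```python
import xml.etree.ElementTree as ET

VALID_LABELS = {"Entailment", "Paraphrase", "None"}


def annotate(root, predictions):
    """Fill in system answers on every <pair> element under root.

    predictions maps a pair id (assumed `id` attribute) to a tuple
    (label_or_None, similarity_or_None); None means the system does
    not solve that subtask, so the attribute is left out.
    """
    for pair in root.iter("pair"):
        label, score = predictions[pair.get("id")]
        if label is None:
            pair.attrib.pop("entailment", None)
        elif label in VALID_LABELS:
            pair.set("entailment", label)
        else:
            raise ValueError(f"unknown entailment label: {label!r}")
        if score is None:
            pair.attrib.pop("similarity", None)
        elif 1.0 <= score <= 5.0:
            pair.set("similarity", f"{score:.2f}")
        else:
            raise ValueError(f"similarity out of range: {score!r}")
    return root


def write_run(root, path):
    """Serialize the annotated tree as UTF-8 with an XML declaration."""
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)
```

Starting from the parsed test file (for instance `ET.parse(path).getroot()`) and only setting attributes keeps the sentences and any other fields exactly as distributed, which is the simplest way to stay in the same format as the training data.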

Papers

Participants must submit papers describing their systems (algorithms used, strategies to solve the problem, linguistic resources, NLP pipeline, etc.). Papers may have up to 8 pages and should preferably be written in Portuguese.

They will be published in a special issue of Linguamática. Templates for formatting papers can be found here.

Submissions should be made via EasyChair.

Data and Software

The full corpus can be downloaded here. Scripts implementing the baselines and the evaluation can be found on GitHub.

Schedule

  • Release of the full Brazilian training set: November 20, 2015 (available)
  • Release of the full European training set: January 20, 2016 (originally December 20, 2015; available)
  • Release of the test data to registered participants: March 4, 2016 (originally February 27, 2016)
  • Evaluation period: March 4 – March 12, 2016 (originally February 27 – March 8, 2016)
  • Paper submission: April 30, 2016 (originally April 15, 2016)
  • Reviews due: May 20, 2016 (originally May 5, 2016)
  • Camera ready papers: July 8, 2016 (originally June 20, 2016)
  • Public release of the full dataset: July 13, 2016, at the start of PROPOR (available)
  • Workshop: July 13, 2016

Workshop Chairs

  • Erick Fonseca (ICMC/University of São Paulo, Brazil)
  • Sandra Aluísio (ICMC/University of São Paulo, Brazil)
  • Marcelo Criscuolo (ICMC/University of São Paulo, Brazil)
  • Leandro Santos (ICMC/University of São Paulo, Brazil)

Program Committee

  • Arnaldo Candido Junior (Universidade Tecnológica Federal do Paraná, Brazil)
  • Erick Fonseca (ICMC/University of São Paulo, Brazil)
  • Leandro Borges dos Santos (ICMC/University of São Paulo, Brazil)
  • Marcelo Criscuolo (ICMC/University of São Paulo, Brazil)
  • Maria das Graças Volpe Nunes (ICMC/University of São Paulo, Brazil)
  • Sandra Aluísio (ICMC/University of São Paulo, Brazil)
  • Thiago Pardo (ICMC/University of São Paulo, Brazil)

 

Contact

Questions about the workshop may be sent to propor2016assin@gmail.com.