The MeLLANGE Learner Translator Corpus (LTC)

  1. What it is
  2. LTC annotation
  3. Storing the annotated translations
  4. Storing the non-annotated translations
  5. Querying the LTC
  6. Short tutorial on using the MeLLANGE query interface
What it is

The LTC is a multilingual annotated corpus whose core is composed of translations produced by trainee translators. Its primary purpose is to provide insights into the most significant characteristics of such texts in order to inform translation pedagogy. It comprises originals of four different text types together with translations of these texts produced by students and professional translators. The text types selected by the MeLLANGE partnership are: legal, technical, administrative, and journalistic.

The chosen materials are 350 words long on average and are available in at least all of the languages that project partners regularly use in their translation classes. Consequently, the legal text was offered for translation out of da, de, el, en, es, fi, fr, it, nl and pt; the technical and administrative texts out of de, en, es, fr and it; and the journalistic text out of ca, de, en, es, fr and it. You can access these texts, as well as contribute translations to the MeLLANGE project, from this link.

The LTC currently contains 429 student translations, 232 of which have been annotated for translation errors. Furthermore, by collecting new translations submitted by professionals and by re-purposing the corpus of parallel originals, we have added 55 reference translations to the LTC.

LTC annotation

In order to enhance the analysis of the student translations, a subset of the corpus was annotated with metadata and linguistic information, as well as with error categories from an error typology designed specifically for the MeLLANGE project. It must be noted, however, that unlike the models which were considered, the MeLLANGE error typology is not meant to contribute to any evaluative process: the focus is on describing and studying specific translation phenomena rather than on passing any quality judgment. Therefore, it does not provide for the encoding of the perceived “seriousness” of errors.

The error typology is a hierarchical scheme based on the fundamental distinction between content-related and language-related errors. These two main categories are further divided into subcategories, such as SL Intrusion or Terminology and Lexis, which in turn group more specific error types, such as Too Literal and Inappropriate Collocation. Each error type is identified by a code which is attached to erroneous words, phrases or sentences during the annotation process. The taxonomy also includes User-Defined categories that annotators can resort to if the error they wish to mark does not belong to any of the MeLLANGE error categories. Annotators can also assign more than one category to a given stretch of text, provide explanatory notes and suggest correct solutions.
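As a rough illustration of the scheme described above, the following Python sketch models a tiny fragment of such a hierarchy and an annotation that carries several codes, a note and a suggested correction. Only the category names are taken from the text. The codes (for example "TR-SI-TL"), the placement of each subcategory under a main category, and all class and function names are assumptions made for this example and are not the actual MeLLANGE definitions.

```python
from dataclasses import dataclass, field

# Hypothetical fragment of a three-level typology:
# main category > subcategory > error type -> code.
# The codes and the exact placement of categories are invented for this sketch.
TYPOLOGY = {
    "Content": {
        "SL Intrusion": {"Too Literal": "TR-SI-TL"},
        "User-Defined": {"User-Defined": "TR-UD"},
    },
    "Language": {
        "Terminology and Lexis": {"Inappropriate Collocation": "LA-TL-INC"},
    },
}


def index_codes(tree, path=()):
    """Flatten the nested typology into {code: (main, sub, error type)}."""
    index = {}
    for name, child in tree.items():
        if isinstance(child, dict):
            index.update(index_codes(child, path + (name,)))
        else:
            index[child] = path + (name,)
    return index


CODES = index_codes(TYPOLOGY)


@dataclass
class ErrorAnnotation:
    start: int                      # index of the first annotated token
    end: int                        # index of the last annotated token (inclusive)
    codes: list = field(default_factory=list)  # one span may carry several codes
    note: str = ""                  # optional explanatory note
    correction: str = ""            # optional suggested solution


def describe(annotation):
    """Return the full category path for every code on an annotation."""
    return [" > ".join(CODES[code]) for code in annotation.codes]


example = ErrorAnnotation(1, 3, ["TR-SI-TL", "LA-TL-INC"],
                          note="calque of the source phrase",
                          correction="took a photo")
for line in describe(example):
    print(line)
```

Keeping the code-to-path index separate from the annotations means an annotation only needs to store short codes, while the full category path can be recovered whenever it is needed for display or counting.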

The English version of the error typology can be accessed from this link. The typology was developed in XML. You can download the English XML file, as well as its localised French, German and Italian versions, from this link. Finally, if you need instructions and scripts for generating image files from the XML files on Linux, you can download this archive.

The annotation task was carried out using a version of MMAX that was adapted for the MeLLANGE project by Andrei Popescu-Beliş' team at the École de Traduction et d'Interprétation, Université de Genève. The annotators were translation tutors in the MeLLANGE academic institutions. To get an idea of how they used MMAX, you can watch this short animation, which was used for the project's internal training purposes (you need a Flash-enabled browser).

Furthermore, translations were also annotated with part-of-speech and lemma information. You can access the tagsets used by the MeLLANGE project from this link. The following table is a summary of the tools used for each language and of the size of the tagsets used by each partner:

Language  Tagger                                 Tagset size
EN        Tree-tagger                            49 tags (Tree-tagger EN tagset)
DE        TnT-tagger                             54 tags (Stuttgart-Tübingen-Tagset)
FR        In-house probabilistic tagger          43 tags (selected from the 300 tags used in the Paris 7 "Le Monde" Corpus)
IT        Tree-tagger                            52 tags (Tree-tagger IT tagset)
ES        Connexor tagger                        36 tags (Connexor Tagset for Spanish)
CA        Catcg (Constraint Grammar formalism)   380 tags

Storing the annotated translations

The partnership has chosen to use stand-off annotation – i.e. storing each level of annotation in a separate XML file – because of the numerous advantages of this approach in terms of information management and maintenance.

For example, each student translation that was annotated with linguistic information as well as MeLLANGE error typology data appears, in the corpus queried by our scripts, in the following form:

An example of this structure can be downloaded from this link.
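To make the stand-off principle more concrete, here is a minimal Python sketch in which the tokens, the part-of-speech and lemma layer, and the error layer live in three separate XML documents that are linked only by token identifiers. The element names, attribute names, the error code "LA-TL-INC" and the sample sentence are all assumptions chosen for this illustration; they are not the actual file format used in the LTC.

```python
import xml.etree.ElementTree as ET

# Base layer: the tokens of one translated sentence, each with an id.
WORDS_XML = """<words>
  <word id="w1">He</word>
  <word id="w2">made</word>
  <word id="w3">a</word>
  <word id="w4">photo</word>
</words>"""

# Linguistic layer: refers to tokens by id, stored separately.
POS_XML = """<layer name="pos">
  <tag ref="w1" pos="PP" lemma="he"/>
  <tag ref="w2" pos="VVD" lemma="make"/>
  <tag ref="w3" pos="DT" lemma="a"/>
  <tag ref="w4" pos="NN" lemma="photo"/>
</layer>"""

# Error layer: each error covers a span of tokens (from .. to, inclusive).
ERRORS_XML = """<layer name="errors">
  <error from="w2" to="w4" code="LA-TL-INC" note="take a photo"/>
</layer>"""


def merge_layers(words_xml, pos_xml, errors_xml):
    """Combine the separate layers into one list of token records."""
    words = ET.fromstring(words_xml).findall("word")
    order = [w.get("id") for w in words]
    tokens = {w.get("id"): {"form": w.text, "errors": []} for w in words}

    # Attach part-of-speech and lemma to the token each tag points at.
    for tag in ET.fromstring(pos_xml).findall("tag"):
        tokens[tag.get("ref")].update(pos=tag.get("pos"), lemma=tag.get("lemma"))

    # Attach each error code to every token inside its span.
    for err in ET.fromstring(errors_xml).findall("error"):
        start = order.index(err.get("from"))
        end = order.index(err.get("to"))
        for word_id in order[start:end + 1]:
            tokens[word_id]["errors"].append(err.get("code"))

    return [tokens[word_id] for word_id in order]


for token in merge_layers(WORDS_XML, POS_XML, ERRORS_XML):
    print(token)
```

Because no layer modifies the token file, a layer can be corrected, replaced or added (for instance a second annotator's error layer) without touching the others, which is the maintenance advantage mentioned above.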

Storing the non-annotated translations

We still regard the student translations that have not been annotated as a valuable part of the LTC. We offer them as possible alternatives and store the following information about them:

Querying the LTC

The modular structure of the LTC has made it possible to build a complex query tool that retrieves error, metadata and linguistic categories.

The interface gives access to two corpora: the MeLLANGE LTC and the eCoLoRe TMX corpus, which has proved beneficial in translation classes. You can access the query interface here.

The MeLLANGE query tool has been designed to retrieve sentences containing the translation errors that the user selects for the query from among those identified and annotated by the consortium. The query results also include the original text, the reference translation(s) provided by professional translators and alternative translations (annotated or not) provided by other students. For more information, see the section below on how to use the MeLLANGE query interface.
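The kind of result described above can be sketched in a few lines of Python. This is not the implementation of the MeLLANGE query tool: the in-memory data model, the field names, the error code "LA-TL-INC" and the sample sentences are all hypothetical and serve only to show how a hit can be returned together with its original, its reference translations and the alternative student translations.

```python
from dataclasses import dataclass, field


@dataclass
class Translation:
    source_id: str                  # id of the original sentence
    text: str
    kind: str                       # "student" or "reference"
    error_codes: set = field(default_factory=set)  # empty if not annotated


# Hypothetical miniature corpus: one original and three translations of it.
SOURCES = {"s1": "Il a pris une photo."}
CORPUS = [
    Translation("s1", "He made a photo.", "student", {"LA-TL-INC"}),
    Translation("s1", "He took a photo.", "reference"),
    Translation("s1", "He took a picture.", "student"),
]


def query(code):
    """Return student sentences annotated with `code`, with their context."""
    results = []
    for hit in CORPUS:
        if hit.kind != "student" or code not in hit.error_codes:
            continue
        # Every other translation of the same original sentence.
        same_source = [t for t in CORPUS
                       if t.source_id == hit.source_id and t is not hit]
        results.append({
            "match": hit.text,
            "original": SOURCES[hit.source_id],
            "references": [t.text for t in same_source if t.kind == "reference"],
            "alternatives": [t.text for t in same_source if t.kind == "student"],
        })
    return results


for result in query("LA-TL-INC"):
    print(result)
```

Note that the non-annotated student translation still appears among the alternatives, which reflects the role given to non-annotated translations in the corpus.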

Short tutorial on using the MeLLANGE query interface

Copyright MeLLANGE 2007
For more information about the MeLLANGE project, visit the project website.