The MeLLANGE Learner Translator Corpus (LTC)
- What it is
- LTC annotation
- Storing the annotated translations
- Storing the non-annotated translations
- Querying the LTC
- Short tutorial on using the MeLLANGE query interface
What it is
The LTC is a multilingual annotated corpus whose core is composed of translations produced by trainee translators and whose primary purpose is to provide insights into the most significant characteristics of such texts in order to inform translation pedagogy. It comprises originals of 4 different text types together with translations of these texts done by students and professional translators. The text types selected by the MeLLANGE partnership are: legal, technical, administrative, and journalistic.
The chosen materials are 350 words long on average and they are also available in at least all of the languages that project partners use regularly in their translation classes. Consequently, the legal text was offered for translation out of da, de, el, en, es, fi, fr, it, nl and pt, the technical and administrative ones out of de, en, es, fr and it, and finally the journalistic text out of ca, de, en, es, fr and it. You can access these texts, as well as contribute translations to the MeLLANGE project, from this link.
The LTC currently contains 429 student translations, of which 232 have been annotated for translation errors. Furthermore, by using new translations submitted by professionals and by re-purposing the corpus of parallel originals, we have added 55 reference translations to the LTC.
LTC annotation
In order to enhance the analysis of the student translations, a subset of the corpus was annotated with metadata and linguistic information, as well as with error categories from an error typology designed specifically for the MeLLANGE project. It must be noted, however, that unlike the models which were considered, the MeLLANGE error typology is not meant to contribute to any evaluative process: the focus is on describing and studying specific translation phenomena rather than on giving any quality judgment. Therefore, it does not provide for the encoding of the perceived “seriousness” of errors.
The error typology is a hierarchical scheme based on the fundamental distinction between content-related and language-related errors. These two main categories are further divided into subcategories, such as SL Intrusion or Terminology and Lexis, which in turn group more specific error types, such as Too Literal and Inappropriate Collocation. Each error type is marked by a code which will be attached to erroneous words/phrases/sentences during the annotation process. The taxonomy also includes User-Defined categories that annotators can resort to in case they find that the error they wish to mark does not belong to any of the MeLLANGE error categories. It is also possible for annotators to assign more than one category to given text parts, as well as to provide explanatory notes and/or suggest correct solutions.
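The three-level hierarchy described above can be pictured as a small tree. The following Python sketch is illustrative only: it uses just the category names mentioned in this section, the placement of each subcategory under a main category is an assumption, and the dictionary layout and the `find_path` helper are hypothetical rather than part of the MeLLANGE XML file.

```python
# Illustrative sketch only. The grouping of subcategories under the two
# main categories is an ASSUMPTION; consult the typology XML for the
# real hierarchy and the real error codes.
TYPOLOGY = {
    "Content transfer": {"SL Intrusion": ["Too Literal"]},
    "Language": {"Terminology and Lexis": ["Inappropriate Collocation"]},
}


def find_path(typology, error_type):
    """Return (main category, subcategory, error type), or None if absent."""
    for main, subcategories in typology.items():
        for sub, error_types in subcategories.items():
            if error_type in error_types:
                return (main, sub, error_type)
    return None


# One annotated text part may carry several categories plus a note
# (hypothetical record layout).
annotation = {
    "categories": ["Too Literal", "Inappropriate Collocation"],
    "note": "example explanatory note",
}
paths = [find_path(TYPOLOGY, name) for name in annotation["categories"]]
print(paths)
```

An error type that is missing from the tree yields `None`, which is the situation the User-Defined categories are meant to cover.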
The English version of the error typology can be accessed from this link. The format in which the error typology was developed was XML. You can download the English XML file, as well as its French, German and Italian localised versions from this link. Finally, if you need instructions and scripts for generating image files from XML files on Linux, you can download this archive.
The annotation task was carried out using a version of MMAX that was adapted for the MeLLANGE project by Andrei Popescu-Beliş' team in École de Traduction et d'Interprétation, Université de Genève. The annotators were translator tutors in the MeLLANGE academic institutions. In order to get an idea of how they used MMAX, you can watch this short animation which was used for the project's internal training purposes (you need to have a Flash-enabled browser).
Furthermore, translations were also annotated with part-of-speech and lemma information. You can access the tagsets used by the MeLLANGE project from this link. The following table is a summary of the tools used for each language and of the size of the tagsets used by each partner:
| Language | Tagger | Tagset size |
| EN | Tree-tagger | 49 tags (Tree-tagger EN tagset) |
| DE | TnT-tagger | 54 tags (Stuttgart-Tübingen-Tagset) |
| FR | In-house probabilistic tagger | 43 tags (selected from the 300 tags used in the Paris 7 "Le Monde" Corpus) |
| IT | Tree-tagger | 52 tags (Tree-tagger IT tagset) |
| ES | Connexor tagger | 36 tags (Connexor Tagset for Spanish) |
| CA | Catcg (Constraint Grammar formalism) | 380 tags |
Storing the annotated translations
The partnership has chosen to use stand-off annotation – i.e. storing each level of annotation in a separate XML file – because of the numerous advantages of this approach in terms of information management and maintenance.
For example, each student translation that was annotated with both linguistic information and MeLLANGE error typology data appears in the corpus queried by our scripts in the following form:
- one XML file containing the tokenised translation
- one XML file containing information about Content Transfer errors which reference token IDs
- one XML file containing information about Language errors which reference token IDs
- one XML file containing POS and lemma information which reference token IDs
- one XML file containing information about sentence spans which reference token IDs
- one XML file containing metadata about the trainee translator and the circumstances in which the translation was carried out
An example of this structure can be downloaded from this link.
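To make the stand-off principle concrete, here is a minimal Python sketch of how annotation layers kept in separate files can be joined back to the tokenised text through token IDs. The element names, attribute names and the `TOO_LITERAL` code are all hypothetical; the actual MeLLANGE files may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical file contents: element and attribute names are assumptions,
# not the actual MeLLANGE schema.
TOKENS_XML = """<tokens>
  <token id="t1">The</token>
  <token id="t2">contract</token>
  <token id="t3">is</token>
  <token id="t4">actual</token>
</tokens>"""

ERRORS_XML = """<errors>
  <error code="TOO_LITERAL" from="t4" to="t4"/>
</errors>"""

SENTENCES_XML = """<sentences>
  <sentence id="s1" from="t1" to="t4"/>
</sentences>"""


def load_tokens(xml_text):
    """Return the tokens as an ordered list of (token_id, text) pairs."""
    return [(tok.get("id"), tok.text) for tok in ET.fromstring(xml_text).iter("token")]


def resolve_spans(tokens, annotations_xml, tag):
    """Attach the covered token text to each stand-off annotation."""
    position = {tok_id: pos for pos, (tok_id, _) in enumerate(tokens)}
    resolved = []
    for ann in ET.fromstring(annotations_xml).iter(tag):
        start, end = position[ann.get("from")], position[ann.get("to")]
        text = " ".join(word for _, word in tokens[start:end + 1])
        resolved.append({**ann.attrib, "text": text})
    return resolved


tokens = load_tokens(TOKENS_XML)
errors = resolve_spans(tokens, ERRORS_XML, "error")
sentences = resolve_spans(tokens, SENTENCES_XML, "sentence")
print(errors)
print(sentences)
```

Because every layer only points at token IDs, a layer can be added, corrected or dropped without touching the tokenised translation or the other layers.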
Storing the non-annotated translations
As far as the student translations which have not been annotated are concerned, we still see them as a valuable part of the LTC. We offer them as possible alternatives and we store the following information about them:
- one XML file containing the tokenised translation
- one XML file containing information about sentence spans which reference token IDs
- one XML file containing metadata about the trainee translator and the circumstances in which the translation was carried out
Querying the LTC
This modular structure of the LTC has enabled the building of a complex query tool that retrieves error, metadata and linguistic categories.
The interface allows access to two corpora: the MeLLANGE LTC and the eCoLoRe TMX corpus, which has proved beneficial in translation classes. You can access the query interface here.
The MeLLANGE query tool has been designed to retrieve sentences containing translation errors identified and annotated by the consortium and chosen for the query by the user. Moreover, the query results will also include original text, reference translation(s) provided by professional translators and alternative translations (annotated or not) provided by other students. For more information, see the section below on how to use the MeLLANGE query interface.
Short tutorial on using the MeLLANGE query interface
- in order to maximise efficiency, the interfaces to the MeLLANGE linguistic information (but not to the student and text metadata) and to the eCoLoRe TMX corpus are presented as a series of connected menus which only offer information that can actually be found in the corpora
- when using elements of the student and text metadata as search criteria, make sure you also specify what texts and language combinations you would like to retrieve before clicking on the "Get concordances" button
- once the results of a MeLLANGE query start being displayed, the user sees a series of tables whose colour-coded rows contain the original text corresponding to the student sentence in which the error was found, as well as the corresponding reference translations and the alternative translations given by other students
- hovering with the mouse over words or error codes will bring up more information if available: POS and lemma information, and trainer feedback for those errors respectively
- the query results will also include Alternative student contexts extracted from unannotated and untagged translations in order to capture the true breadth of the LTC and offer as many potential solutions to translation problems as possible
- the MeLLANGE LTC corpus is NOT sentence-aligned. Our hypothesis was that, given the small size of the texts, sufficiently reliable data can be returned by relying only on sentence IDs. In practice, this means that if a user is searching for a particular error and our query system finds it in sentence 4 of a student translation, it will return that sentence in the Target sentence row and will display a context of one sentence either side of the target sentence ID (i.e. sentences 3 to 5) in the Original context, Reference context and Alternative student context rows.
- clicking on the links in the left column brings up the available metadata about the translation displayed in that row
- not all reference translations have associated metadata
- clicking on the links in the right column brings up the whole text - in the case of the annotated translations, it will include the annotations, too
- when the annotation took place, the annotation tool did not store paragraph information, only sentence boundaries. This is why the entire text appears unformatted when viewed. The formatted translations will be available from this location. Use the download mechanism to obtain parts of the corpus or the entire corpus.
- at the moment the query page is not refreshed automatically, so in order to make a new query use the link back to the query interface rather than rely on the Back button in your browser. If you have used the Back button, refresh the page.
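The sentence-ID-based context retrieval described in the tutorial (one sentence either side of the target sentence ID, with no sentence alignment) can be sketched as follows. This is a minimal illustration in Python under stated assumptions: the function name, the dictionary-based storage and the sample sentences are hypothetical and do not reproduce the actual query scripts.

```python
def context_window(sentences, target_id, width=1):
    """Return (sentence_id, text) pairs within `width` of target_id.

    `sentences` maps integer sentence IDs to sentence text. IDs that do
    not exist in a given text are skipped, so the window is clipped at
    the start and end of the text.
    """
    lo, hi = target_id - width, target_id + width
    return [(sid, sentences[sid]) for sid in range(lo, hi + 1) if sid in sentences]


# Hypothetical data: the original has six sentences, the reference
# translation only four (e.g. because sentences were merged).
original = {n: f"Original sentence {n}." for n in range(1, 7)}
reference = {n: f"Reference sentence {n}." for n in range(1, 5)}

hit = 4  # sentence ID of the student sentence containing the error
original_context = context_window(original, hit)
reference_context = context_window(reference, hit)
print([sid for sid, _ in original_context])   # [3, 4, 5]
print([sid for sid, _ in reference_context])  # [3, 4]
```

The second result shows the trade-off of working without alignment: when two versions of a text split sentences differently, the same sentence IDs may cover slightly different stretches of text, which is why a window is shown rather than a single sentence.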

