School of Computing

FACULTY OF ENGINEERING

 

Professor Bogdan Babych

Professor of Translation Studies, Heidelberg University and Visiting Research Fellow, University of Leeds

Member of the Natural Language Processing Group

Summary of research results

Overview

Method of discovering unseen translation equivalents in comparable corpora
Methods of measuring comparability of multilingual documents/corpora and evaluating comparability metrics
Improving Machine Translation quality with Information Extraction techniques
MT evaluation: improving accuracy of automated MT evaluation metrics and characterising their applicability and limitations
A methodology of automated error analysis for MT
A Tree Adjoining Grammar-based model of word order variation in Ukrainian
Systems of syntaxeme groups: a semantically-oriented formalism for syntactic representations
History of diphthongs in Ukrainian Northern dialects: system coherency method as evidence for the later origin of Northern diphthongs
Pragmatics of logical connectives: scalar implicature model for conversational semantics of 'if', 'or' & 'and'

Method of discovering unseen translation equivalents in comparable corpora

Multiword expressions frequently cannot be translated literally, word for word, and in many cases their translation equivalents cannot be found in dictionaries or parallel corpora. A method is proposed for discovering such non-literal translation equivalents for multiword expressions in non-parallel, or comparable, corpora (collections of documents in the same subject domain and genre). In the first stage, a set of distributionally similar synonyms is produced for each word in a multiword expression; then the word and each of its synonyms are translated into the target language using a traditional dictionary, and distributionally similar synonyms are generated for each of the translations. In the next stage, all possible combinations (a Cartesian product) of these translations and synonyms are generated for each word in the multiword expression. Usually there is a large number of such possibilities. Each of them is tested for being a possible collocation in the target-language corpus, and the candidates which pass this test are ranked by the degree of their distributional similarity to the original expression. The method generates an N-best list of candidates, which human translators found useful for finding non-literal solutions to hard translation problems. Formal evaluation confirmed that the method outperforms searches in parallel corpora using word alignment tools, such as GIZA++, in terms of the range of suggested translation equivalents. The method models the strategies of paraphrasing and synonymous translation used by human translators, which makes it possible to generate new, previously unseen translation equivalents.
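
The sketch below walks through these stages on toy data. All resources and names here (synonyms_src, bilingual, synonyms_tgt, is_collocation, similarity) are illustrative stand-ins for the distributional thesauri, bilingual dictionary and target-language corpus that the method assumes; it is not the published implementation.

```python
from itertools import product

# Toy stand-ins (illustrative assumptions, not the published resources):
# a distributional thesaurus per language, a bilingual dictionary, and a
# target-language comparable corpus used for the collocation test.
synonyms_src = {"heavy": ["intense"], "rain": ["rainfall"]}
bilingual = {"heavy": ["lourde"], "intense": ["intense"],
             "rain": ["pluie"], "rainfall": ["precipitation"]}
synonyms_tgt = {"lourde": ["forte"], "pluie": ["averse"]}
target_corpus = "de la forte pluie est tombee apres une pluie forte"

def is_collocation(candidate: str) -> bool:
    # Crude test: all words of the candidate are attested in the corpus;
    # a real implementation would use co-occurrence statistics.
    tokens = set(target_corpus.split())
    return all(w in tokens for w in candidate.split())

def similarity(candidate: str) -> float:
    # Stub for distributional similarity to the original expression;
    # here just a frequency proxy in the toy corpus.
    return float(target_corpus.count(candidate))

def translate_mwe(mwe: list[str]) -> list[str]:
    per_word = []
    for w in mwe:
        variants = [w] + synonyms_src.get(w, [])   # stage 1: source synonyms
        translations = set()
        for v in variants:                          # stage 2: translate and
            for t in bilingual.get(v, []):          # expand with target-side
                translations.add(t)                 # synonyms
                translations.update(synonyms_tgt.get(t, []))
        per_word.append(sorted(translations))
    # Stage 3: Cartesian product of per-word candidates, in both word
    # orders to allow for word-order differences between languages.
    candidates = {" ".join(c) for c in product(*per_word)}
    candidates |= {" ".join(reversed(c.split())) for c in candidates}
    # Stage 4: filter by the collocation test, rank by similarity (N-best).
    return sorted((c for c in candidates if is_collocation(c)),
                  key=similarity, reverse=True)

print(translate_mwe(["heavy", "rain"]))  # -> ['forte pluie', 'pluie forte']
```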

Main publication:

Babych, B., Sharoff, S., Hartley, A., and Mudraya, O. (2007). Assisting Translators in Indirect Lexical Transfer. Paper presented at the 45th Annual Meeting of the Association for Computational Linguistics (ACL 2007), Prague, Czech Republic. [video]; [pdf]

Methods of measuring comparability of multilingual documents/corpora and evaluating comparability metrics

For many computational linguistic applications it is important to understand the degree of closeness, or comparability, between documents or larger document collections. Several methods have been suggested, including mapping words across languages using dictionaries or an on-line machine translation system, and then computing comparability scores by combining several lexical and structural features. As the intuitive notion of comparability is not well defined, it is difficult to interpret and evaluate different types of comparability scores. The methods developed here are based on measuring Pearson's correlation between the values of the scores computed on different types of texts and independent comparability benchmarks, e.g., baseline intuitive human judgements about the degree of closeness between the texts. However, a more promising benchmark is a performance-based indicator: the number of translation equivalents which can be successfully extracted from comparable documents. The proposed measures showed a high correlation between computed comparability scores and the average numbers of extracted translation equivalents, which creates a new, technical, task-based definition of comparability.
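
A minimal sketch of this task-based evaluation, assuming that per-document-pair comparability scores and counts of extracted translation equivalents are already available (all numbers below are hypothetical):

```python
from statistics import correlation  # Pearson's r (Python 3.10+)

# Hypothetical data for a set of document pairs: one metric's comparability
# scores, and the number of translation equivalents extracted per pair.
comparability_scores = [0.91, 0.74, 0.55, 0.32, 0.18]
extracted_equivalents = [142, 97, 61, 25, 9]

# The higher this correlation, the better the metric predicts how useful
# a document pair is for translation-equivalent extraction.
r = correlation(comparability_scores, extracted_equivalents)
print(f"task-based benchmark correlation: r = {r:.3f}")
```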

Another method for evaluating the quality of comparability metrics is based on a metric's consistency across languages: comparability scores are computed between monolingual documents on the source side and, separately, between their counterparts on the target side. The correlation between these two sets of scores is then computed, and the best metric is the most consistent one, i.e., the one yielding the highest correlation.
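
The consistency check can be sketched in the same way: for the same set of document pairs, each candidate metric produces one set of scores computed on the source side and one on the target side, and the metric whose two sets agree best wins (again, all numbers are hypothetical):

```python
from statistics import correlation  # Pearson's r (Python 3.10+)

# Hypothetical monolingual comparability scores for the same document
# pairs, computed on the source side and on the target side.
metric_a = {"source": [0.9, 0.7, 0.5, 0.2], "target": [0.88, 0.69, 0.52, 0.24]}
metric_b = {"source": [0.9, 0.7, 0.5, 0.2], "target": [0.40, 0.81, 0.33, 0.55]}

for name, scores in [("A", metric_a), ("B", metric_b)]:
    r = correlation(scores["source"], scores["target"])
    print(f"metric {name}: cross-language consistency r = {r:.3f}")
# The most consistent metric (here A) yields the highest correlation.
```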

Main publications:

Fangzhong Su and Bogdan Babych (2012) Measuring Comparability of Documents in Non-Parallel Corpora for Efficient Extraction of (Semi-)Parallel Translation Equivalents. In: Proceedings of the Joint Workshop on Exploiting Synergies between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra) at EACL-2012, pp. 10-19. [pdf]

Babych, B. and Hartley, A. (2011). Meta-evaluation of comparability metrics using parallel corpora. International Journal of Computational Linguistics and Applications, Proceedings volume of CICLing-2011. [pdf] (preprint)

Mārcis Pinnis, Radu Ion, Dan Ştefănescu, Fangzhong Su, Inguna Skadiņa, Andrejs Vasiļjevs, Bogdan Babych (2012) ACCURAT Toolkit for Multi-Level Alignment and Information Extraction from Comparable Corpora. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012): Demo Session, July 8 - July 14, 2012, Jeju, Korea. [pdf]

Improving Machine Translation quality with Information Extraction techniques

The quality of Machine Translation of texts which contain proper names can be improved by pre-processing the source texts with Information Extraction tools, such as Named Entity recognition systems. MT can take full advantage of this annotation if it is fully integrated into the MT architecture. The improvements in translation quality occur both inside the annotated named entities and in their sentential context, since the annotation often improves the segmentation of the sentences. These experiments point to the potential of Information Extraction techniques for improving the translation quality of MT systems.
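
The published experiments integrate the annotation into the MT architecture itself; purely as an external illustration of the pre-processing idea, the sketch below shields recognised named entities from a black-box translator with placeholders and restores them afterwards. Both ner() and translate() are hypothetical toy stand-ins.

```python
import re

def ner(text: str) -> list[str]:
    # Hypothetical NE recogniser: naive detection of capitalised multiword
    # sequences (a real system would be used instead).
    return re.findall(r"(?:[A-Z][a-z]+ )+[A-Z][a-z]+", text)

def translate(text: str) -> str:
    # Stand-in for a black-box MT system.
    toy_mt = {"was born in": "est ne a", "lives in": "habite a"}
    for en, fr in toy_mt.items():
        text = text.replace(en, fr)
    return text

def translate_with_ne_preprocessing(text: str) -> str:
    entities = ner(text)
    # Replace each named entity with an opaque placeholder so the MT
    # system neither mistranslates it nor mis-segments around it.
    for i, e in enumerate(entities):
        text = text.replace(e, f"__NE{i}__")
    translated = translate(text)
    # Restore the entities after translation.
    for i, e in enumerate(entities):
        translated = translated.replace(f"__NE{i}__", e)
    return translated

print(translate_with_ne_preprocessing("John Smith was born in New York"))
# -> John Smith est ne a New York
```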

Main publications:

Babych, B., and Hartley, A. (2003). Improving Machine Translation quality with automatic Named Entity recognition. Paper presented at the 7th International EAMT workshop on MT and other language technology tools at the 10th Conference of the European Chapter of the Association for Computational Linguistics EACL 2003, Budapest, Hungary. [pdf]

Babych, B., and Hartley, A. (2004). Selecting Translation Strategies in MT using Automatic Named Entity Recognition. Paper presented at the European Association for Machine Translation (EAMT) Workshop, Malta. [pdf]

MT evaluation: improving accuracy of automated MT evaluation metrics and characterising their applicability and limitations

Automated MT evaluation methods fall into two groups. So-called reference proximity methods (which include BLEU, NIST, Meteor, TER, etc.) measure the distance between MT output and a gold-standard human translation, or between MT output and the source text. So-called performance-based methods measure the performance of certain automated systems and tasks -- such as parsing, Information Extraction, Named Entity recognition and text mining -- on MT output, and compare it with the systems' performance on human translations or the original source texts.

For proximity-based methods such as BLEU, correlation with intuitive human judgements of the Adequacy and Fluency of translation can be improved if matched words in the MT output and the gold-standard human translation are weighted by their statistical salience in target texts, e.g., using measures such as TF.IDF and S-scores, which are typically used in Information Retrieval.
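
A minimal sketch of the weighting idea, shown here only for clipped unigram precision with IDF-style weights; the published metric extends full BLEU/NIST-style n-gram matching, and the S-score formula is not reproduced here:

```python
import math
from collections import Counter

def idf_weights(documents: list[list[str]]) -> dict[str, float]:
    # IDF over a target-language collection, as a simple proxy for the
    # statistical salience of a word.
    n = len(documents)
    df = Counter(tok for doc in documents for tok in set(doc))
    return {tok: math.log(n / d) for tok, d in df.items()}

def weighted_unigram_precision(hyp, ref, w, default=0.1):
    # Like BLEU's clipped unigram precision, but each matched token
    # contributes its salience weight instead of a count of 1.
    matches = Counter(hyp) & Counter(ref)
    num = sum(w.get(t, default) * c for t, c in matches.items())
    den = sum(w.get(t, default) * c for t, c in Counter(hyp).items())
    return num / den if den else 0.0

docs = [["the", "cat", "sat"],
        ["the", "treaty", "was", "signed"],
        ["the", "summit", "opened"]]
w = idf_weights(docs)
ref = ["the", "treaty", "was", "signed"]
# Matching salient content words scores much higher than matching
# ubiquitous function words such as 'the'.
print(weighted_unigram_precision(["the", "treaty", "signed"], ref, w))  # high
print(weighted_unigram_precision(["the", "was", "cat"], ref, w))        # low
```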

It is important to understand the limitations of different types of automated MT evaluation metrics for different evaluation tasks. Proximity-based methods produce reliable correlation with human judgements if the size of the evaluated output is more than ~7,000 words. When automated methods are used for comparing different versions of the same system developed over time, the correlation is usually reliable across different target languages and text types. However, when they are used for predicting human evaluation scores on the basis of automated scores, or for assessing whether MT output is fit for purpose (i.e., whether MT reaches a certain quality threshold), the regression parameters (the slope and the intercept) differ across target languages and text types, so automated scores need to be re-calibrated against human evaluation for each new combination of target language and text type (subject domain, genre).

Proximity-based metrics usually lose sensitivity on higher-quality MT output, e.g., for MT between closely related languages, such as Russian and Ukrainian, where the lexical level is already properly covered and higher-level quality parameters matter more: discourse coherence, appropriateness of style. Performance-based metrics are more stable across different quality levels of the evaluated texts. This can be interpreted as the difference between evaluation on a structural level for proximity-based metrics and on a functional level for performance-based metrics, which capture a system's performance on the level of text-external functions.
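
The re-calibration step can be sketched as a simple linear fit on hypothetical paired scores; the slope and intercept obtained are valid only for the target language and text type they were estimated on:

```python
from statistics import linear_regression  # least-squares fit (Python 3.10+)

# Hypothetical calibration data for one target language / text type:
# automated metric scores and human Adequacy judgements for the same systems.
bleu_scores = [0.18, 0.24, 0.31, 0.42, 0.55]
human_adequacy = [2.1, 2.6, 3.0, 3.8, 4.5]

fit = linear_regression(bleu_scores, human_adequacy)
print(f"slope = {fit.slope:.2f}, intercept = {fit.intercept:.2f}")

# Predict a human score from a new automated score -- valid only for the
# language/text-type combination the parameters were calibrated on.
new_bleu = 0.37
print(f"predicted adequacy: {fit.slope * new_bleu + fit.intercept:.2f}")
```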

Main publications:

Babych, B., and Hartley, A. (2004). Extending the BLEU MT Evaluation Method with Frequency Weightings. Paper presented at the 42nd Annual Meeting of the Association for Computational Linguistics (ACL 2004), Barcelona, Spain. [pdf]

Babych, B., Hartley, A., and Elliott, D. (2005). Estimating the predictive power of n-gram MT evaluation metrics across language and text types. Paper presented at Machine Translation Summit X, Phuket, Thailand. [pdf]

Babych, B., and Hartley, A. (2008). Sensitivity of Automated MT Evaluation Metrics on Higher Quality MT Output: BLEU vs Task-Based Evaluation Methods. Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco, 28-30 May 2008. [pdf]

A methodology of automated error analysis for MT

Automated evaluation metrics such as BLEU produce reliable results on test sets of about 7,000 words or more, so their applicability to document-level or sentence-level evaluation is limited: lexical mismatches can be interpreted either as MT errors or as the result of legitimate variation in translation. For the same reason it is difficult to use automated metrics for the identification and analysis of individual errors in MT. A method of automated error analysis is proposed which makes it possible to benchmark translation quality for individual multiword expressions and linguistic constructions, and to rank the importance of these expressions for the MT development workflow using a risk-assessment framework: the most frequent constructions with the lowest evaluation scores need to receive the highest priority. The method is based on extracting a large set of multiword expressions, terms and constructions whose quality is checked in the error analysis. Then concordance lines within a limited window (up to 5 words) are generated for each of these expressions using an aligned parallel corpus. For each of the expressions these concordances are treated as an evaluation corpus. It was established empirically that BLEU-type scores for these concordances reflect the translation quality of the evaluated constructions, so BLEU scores have a specific 'island of stability' at much smaller sizes of evaluated data, if those data are generated in a controlled way -- as concordances with a narrow context window.
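
A minimal sketch of the risk-assessment ranking, with plain unigram precision standing in for the BLEU-type score and toy data in place of the concordance lines extracted from an aligned corpus:

```python
from collections import Counter

def unigram_precision(hyp: list[str], ref: list[str]) -> float:
    # Crude stand-in for the BLEU-type score over one concordance line.
    matches = Counter(hyp) & Counter(ref)
    return sum(matches.values()) / len(hyp) if hyp else 0.0

def construction_score(concordance_pairs) -> float:
    # Treat all concordance lines for one construction (MT output vs.
    # reference, each within a narrow +/-5-word window) as its own
    # evaluation corpus.
    scores = [unigram_precision(hyp, ref) for hyp, ref in concordance_pairs]
    return sum(scores) / len(scores) if scores else 0.0

def rank_for_development(constructions: dict) -> list:
    # Risk assessment: frequency times error rate (1 - score), so frequent,
    # badly translated constructions receive the highest priority.
    ranked = [(len(pairs) * (1 - construction_score(pairs)), expr)
              for expr, pairs in constructions.items()]
    return sorted(ranked, reverse=True)

# Toy data: for each extracted expression, aligned MT/reference windows.
constructions = {
    "in terms of": [(["mt", "line", "one"], ["ref", "line", "one"]),
                    (["mt", "line", "two"], ["mt", "line", "two"])],
    "due to":      [(["a", "b"], ["a", "b"])],
}
print(rank_for_development(constructions))
# -> [(0.33..., 'in terms of'), (0.0, 'due to')]
```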

Main publication:

Babych, B. and Hartley, A. (2009). Automated error analysis for multiword expressions: using BLEU-type scores for automatic discovery of potential translation errors. Linguistica Antverpiensia, New Series (8/2009): Journal of Translation and Interpreting Studies, Special Issue on Evaluation of Translation Technology, 8, pp. 81-104. [pdf] (preprint)

A Tree Adjoining Grammar-based model of word order variation in Ukrainian

Limitations on the comprehensibility of centre-embedded sentences are often illustrated by examples such as 'The rat the cat the dog chased killed ate the malt' (Chomsky & Miller, 1963, p. 286), where a sentence with two levels of centre embedding becomes incomprehensible; cf. 'The rat the cat killed ate the malt', which has one level of centre embedding and is still comprehensible. The traditional explanation for this limitation is the insufficient size of human short-term memory for processing sentences which are otherwise grammatically well-formed. However, in Ukrainian and other languages with morphological case marking there are other types of centre-embedded constructions which remain comprehensible with up to 4 levels of centre embedding. This is consistent with Yngve's model of the limitations of short-term memory for linguistic processing, and with other psychological evidence, where the limit is 7±2 units (assuming that 1 level of embedding consumes 2 memory units). However, embedding of subordinate clauses in object position still becomes incomprehensible in Ukrainian starting from 2 levels of centre embedding. A morphosyntactic model is proposed, based on the formalism of Tree Adjoining Grammars enhanced with morphological features and word order variation constraints, which accounts for the limits of comprehensibility of different types of centre-embedded linguistic constructions across languages. Specifically, for 2-level centre-embedded clauses in object position, morphological case features clash at a certain point in the proposed syntactic representation, which does not happen in the other types of constructions that allow higher levels of centre embedding. The model allows us to predict grammatical limits on the comprehensibility of sentences, and to understand the nature of the psychological comprehensibility of linguistic constructions -- a phenomenon which is different from ungrammaticality, since a sentence can remain comprehensible even though it is not well-formed. The model can be used for establishing limitations on the comprehensibility of automatically generated text, e.g., the output of Machine Translation systems.
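
The arithmetic behind the 4-level figure can be checked directly, assuming (as stated above) 2 memory units per level of embedding and a capacity of 7±2 units:

```python
# Capacity check in the spirit of Yngve's model: each level of centre
# embedding is assumed (as in the text) to consume 2 short-term memory
# units, against a capacity of 7 +/- 2 units.
CAPACITY_UPPER = 7 + 2

for levels in range(1, 6):
    units = 2 * levels
    verdict = "within capacity" if units <= CAPACITY_UPPER else "exceeds capacity"
    print(f"{levels} level(s) of embedding: {units} units -> {verdict}")
# 4 levels (8 units) still fit under the upper bound of 9; 5 levels do not.
```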

Main publications and reports:

Babych, B. (2002). Word order variation and comprehensibility of centre embedding: evidence from Ukrainian. Technical Report. TRE-CTS-Babych-2002 (Unpublished) [pdf].
Presentation at a research seminar of the Natural Language Processing group at Leeds: 11 October 2002: Comprehensibility limits on centre-embedded structures. [pdf]

Babych, B. (2001). The model of word order variation in Ukrainian declarative sentences. Technical Report TRE-Ieper-Babych-2001 (Unpublished) [pdf]

Babych, B. (2000). Interpretational model of formal syntactic structures in Ukrainian. Thesis submitted in accordance with the requirements for the degree of Candidate of Sciences. Unpublished manuscript. [pdf] (in Ukrainian)

Systems of syntaxeme groups: a semantically-oriented formalism for syntactic representations

Existing linguistic formalisms give no clear way of integrating semantic and syntactic representations within the same descriptive framework, even though syntactic relations are normally interpreted in semantic terms, can be paraphrased lexically, and contribute to the general semantic content of sentences. A new formalism for the semantic-level description of syntactic representations is proposed -- systems of syntaxeme groups -- which integrates meanings expressed lexically and syntactically. The formalism can be used to compute the semantic effect of a text on an ontological model of a subject domain which represents speakers' knowledge about a communicative situation.

Main publications:

Babych, B. (1998). Systems of syntaxeme groups and their procedural semantics. Movoznavstvo (Linguistics) no. 6 (190), November-December 1998. Kyiv.
[pdf] (in English);
[pdf] (in Ukrainian)

Babych, B. (1999). Lexical semantics in syntactic structure of a text: formal representation and interpretation. In: Proc. of the all-Ukrainian scholarly conference "Semantics, Syntactics and Pragmatics of Speaking", 25th-27th January, 1999. Lviv, "Litopys", 1999, pp. 101-108.
[pdf] (English)
[pdf] (Ukrainian)

History of diphthongs in Ukrainian Northern dialects: system coherency method as evidence for the later origin of Northern diphthongs

A distinctive feature of the Ukrainian phonological system is the sound [i] in newly-closed syllables, in place of [o] and [e], which has led to the well-known historical vowel alternations: [st'il] (table.Nom) vs. [stola] (table.Gen); [selo] (village.Nom.Sing) vs. [s'il] (village.Gen.Plur). [i] appeared where the syllable became closed owing to the decline of the reduced vowels [o] and [e] at the end of the following syllable, while the short [o]/[e] vocalised into full vowels where the syllable remained open. This process took place in Ukrainian around the 14th century. According to a hypothesis of Franz Miklosic, it passed through a stage when [o] and [e] became long vowels, then diphthongs, and then transformed into the present [i] sound. Miklosic believed that the remnants of these diphthongs survive in the Ukrainian Northern dialects: [stuol] (table.Nom). However, other linguists, such as Volodymyr Hantsov and Olena Kurylo, challenged this view: according to them, the Northern Ukrainian diphthongs are a relatively new feature and are not related to the history of the transformation of /o, e/ into /i/ in the rest of the Ukrainian dialects. Their argument was that diphthongisation is stress-dependent in the North, which does not fit Miklosic's model.

In a research project based on a field trip to the Northern Ukrainian village of Pidlisne (former Tserkovyshche), Kozelets district of the Chernihiv region, new data were recorded which add weight to this alternative hypothesis of Hantsov and Kurylo. More specifically, diphthongs were found in words such as /muozh/ (Standard Ukrainian /muzh/ 'man', historically /monzh/, with nasalised /on/) and /hariechyj/ (Standard Ukrainian /har'achyj/ 'hot', historically /harenchyj/, with nasalised /en/), i.e., in positions where they should not occur according to Miklosic's model. In the proposed account these diphthongs are explained by relating them to another feature of the Northern dialects: the de-nasalisation of nasal vowels, as in /pamet'/ (Standard Ukrainian /pamjat'/ 'memory') from /pament'/ (with nasal /en/). The diphthongs in /muozh/ and /hariechyj/ appeared after the nasalised vowels /en/, /on/ lost their nasal component in the Northern dialects and became indistinguishable from the historical non-nasalised /e/ and /o/. But in the Southern dialects /en/ and /on/ had already behaved differently: they transformed into /ja/ and /ju/ respectively. This indicates that at that stage the phonological systems of the Northern and Southern Ukrainian dialects were already separated and worked independently. So the subsequent transformations of /o, e/ into /i/ in the South (which later became the basis of Standard Ukrainian) and of /o, e/ into the diphthongs /uo/, /ie/ in the North were independent processes; the Northern diphthongs are thus a relatively new phenomenon and cannot be treated as a preserved intermediate stage of a single all-Ukrainian change in the vowel system, as Miklosic thought.

This result shows that the understanding of historical changes in a language can benefit from taking into account several seemingly unrelated phenomena and understanding their function as parts of a coherent and interdependent linguistic system, which functions not only on the level of synchrony but also maintains its coherence over time (usually spanning several centuries) and geographically, across relatively distant territories.

Main publication:

Babych, B. (1994). Diphthongs in Northern Ukrainian dialects and in the history of the Ukrainian language. Bulletin of Kyiv Taras Shevchenko University. Literature, Folklore and Linguistic Studies. Issue 2. [pdf] (In Ukrainian)

Pragmatics of logical connectives: scalar implicature model for conversational semantics of 'if', 'or' & 'and'

Logical connectives such as 'and', 'or' and 'if' in their everyday use often have usages which go beyond their standard truth-conditional meanings: 'and' is often used with an additional temporal meaning 'and then'; 'or' is used as an exclusive 'or', not allowing both conjoined propositions to be true if the combined sentence is to be true; 'if' is used in the sense 'if and only if'. In Gricean pragmatics these phenomena are traditionally explained as 'conventional implicatures' -- additions to meanings that are part of a linguistic convention shared by speakers, which can still be challenged in conversation. The problem is that this explanation is not systematic and has to deal with the meanings of each individual connective separately. A new model is proposed which characterises deviations from the standard truth-conditional values of logical connectives as 'scalar implicature' -- an additional part of meaning that arises from systematic relations between elements of the linguistic system, measured against some kind of scale. The scales operate over 'shared domains' of the connectives -- the parts of the truth tables where these connectives have the same truth values. This model makes it possible to predict systematic changes in the meanings of all logical connectives as part of a single explanatory system, and it predicts the linguistic potential for systematic changes in the conversational 'implicature' meanings of connectives.
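
The notion of a 'shared domain' can be illustrated directly from the truth tables (a minimal sketch; the scales themselves are not modelled here):

```python
from itertools import product

# Standard truth-conditional connectives (material implication for 'if').
CONNECTIVES = {
    "and": lambda p, q: p and q,
    "or":  lambda p, q: p or q,
    "if":  lambda p, q: (not p) or q,
}
ROWS = list(product([True, False], repeat=2))

def shared_domain(c1: str, c2: str) -> list[tuple[bool, bool]]:
    # Rows of the truth table where two connectives agree -- the 'shared
    # domain' over which, in the proposed model, the scales operate.
    f, g = CONNECTIVES[c1], CONNECTIVES[c2]
    return [row for row in ROWS if f(*row) == g(*row)]

for a, b in [("and", "or"), ("or", "if"), ("and", "if")]:
    print(f"shared domain of '{a}' and '{b}':", shared_domain(a, b))

# The enriched, exclusive reading of 'or' differs from inclusive 'or' only
# on the (True, True) row -- a row inside the shared domain of 'or' and 'and'.
xor = lambda p, q: p != q
print([row for row in ROWS if xor(*row) != CONNECTIVES["or"](*row)])
```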

Report:

Babych, B. (1997). Scalar implicature of logical connectives. Technical Report TRE-Cornell-Babych-1997. (Unpublished). [pdf]