After more than fifty years of empty promises and repeated failures, interest in machine translation (MT), amazingly, continues to grow. It is still something that almost everybody hopes will work someday. Is MT finally ready to deliver on its promise? What are the issues with this technology, and what will it take to make it work? And why do we continue to try after 50 years of minimal success? This overview offers a lay perspective on the evolution of the two main approaches to machine translation in use today, and attempts to answer these questions. While other technical approaches to MT do exist, this overview focuses only on Rule-based MT (RbMT) and Statistical Machine Translation (SMT), as these approaches underlie virtually all the production MT systems in use today.
Why It Matters
We live in a world where knowledge is power and information access has become a human right. In 2006, the amount of digital information created, captured, and replicated was 1,288 x 10^18 bits. In computer parlance, that's 161 exabytes or 161 billion gigabytes …
This is about 3 million times the information in all the books ever written!
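The figures quoted in this section can be checked with a few lines of arithmetic. The sketch below assumes decimal (SI) units, i.e. one exabyte is 10^18 bytes and one gigabyte is 10^9 bytes; the variable names are purely illustrative.

```python
BITS_PER_BYTE = 8
EXABYTE = 10**18   # bytes, decimal (SI) definition (assumption)
GIGABYTE = 10**9   # bytes, decimal (SI) definition (assumption)

bits_2006 = 1288 * 10**18                  # 1,288 x 10^18 bits, as quoted for 2006
bytes_2006 = bits_2006 // BITS_PER_BYTE
exabytes_2006 = bytes_2006 // EXABYTE      # size in exabytes
gigabytes_2006 = bytes_2006 // GIGABYTE    # size in gigabytes

# Growth factor from the 2006 figure to the 988 exabytes projected for 2010
growth = 988 / exabytes_2006

print(exabytes_2006, gigabytes_2006, round(growth, 2))
```

The bit count and the exabyte figure agree with each other, and the projected growth works out to slightly more than six times the 2006 volume.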
Between 2006 and 2010, the information added annually to the digital universe will increase more than sixfold, from 161 exabytes to 988 exabytes. It is likely that the bulk of this new information will originate in just a few key languages. So are we heading into a global digital divide in the not so distant future? Two references support these figures: http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/ and http://www.universityofcalifornia.edu/news/article/17949
Peter Brantley (2008) at Berkeley says it quite eloquently: “For the Internet to fulfill its most ambitious promises, we need to recognize translation as one of the core challenges to an open, shared and collectively governed internet. Many of us share a vision of the Internet as a place where the good ideas of any person, in any country, can influence thought and opinion around the world. This vision can only be realized if we accept the challenge of a polyglot internet and build tools and systems to bridge and translate between the hundreds of languages represented online. …. Mass machine translation is not a translation of a work, per se, but it is rather, a liberation of the constraints of language in the discovery of knowledge.”
Today, the world faces a new kind of poverty. While we in the West face a glut of information, much of the world faces information poverty. Much of the world’s knowledge is created and remains in a handful of languages, inaccessible to most who don’t speak these languages. Access to knowledge is one of the keys to economic prosperity. Automated translation is one of those technologies that offers a way to reduce the digital divide and raise living standards across the world. As imperfect as it is, this technology may even be the key to real people-to-people contact across the globe. This has been pointed out eloquently by Ethan Zuckerman in his essay on The Polyglot Internet http://www.ethanzuckerman.com/blog/the-polyglot-internet/ and by Vint Cerf in a recent interview http://www.guardian.co.uk/commentisfree/2008/aug/17/internet.google . Ray Kurzweil and Bill Gates have also spoken on the huge potential to change the world using this technology. The Asia Online project is focused on breaking the language barriers for knowledge content using a combination of automated translation and crowdsourcing. The complete English Wikipedia will be translated into several content-starved Asian languages using SMT and crowdsourcing http://www.asiaonline.net/corporate/index.aspx. There are many more examples, but I think the point is clear.
While stories of MT mishaps and mistranslations abound (we all know how easy it is to make MT look bad), it is becoming increasingly apparent to many that it is important to learn how to use and extend the capabilities of this technology successfully. While MT is unlikely to replace human beings in any application where quality is really important, a growing number of cases show that MT is suitable for:
· Highly repetitive content where productivity gains with MT can dramatically exceed what is possible with just using TM alone
· Content that would just not get translated otherwise
· Content that cannot afford human translation
· High value content that is changing every hour and every day
· Knowledge content that facilitates and enhances the global spread of critical knowledge
· Content that is created to enhance and accelerate communication with global customers who prefer a self-service model
· Content that does not need to be perfect but just approximately understandable
How They Work
As with all engineering problems, automated translation starts with a basic goal: to take a text in a given language (the source language) and convert it into a second text in another language (the target language), in such a way as to preserve the meaning and information contained in the source. While several approaches have been tried, to date, two approaches stand out: Rule-based MT and Statistical Machine Translation (SMT). Some will argue that Example-based MT (EBMT) is also important; many say this is the approach that underlies translation memory technology.
Rule Based Machine Translation -- RbMT
The foundation for rule-based systems is relatively easy to understand intuitively. Languages can be considered to have two foundational elements:
1. The meaning of the words — the semantics, and,
2. The structure of how the words are put together — the grammar, syntax and morphology etc…
So basically, a RbMT system attempts to map these two elements of the source language to the target language. While this may sound simple on the surface, it quickly gets complicated. Developers of RbMT solutions combine the theories of traditional grammarians and linguists and attempt to convert this linguistic knowledge into systematic, encyclopedic sets of rules encompassing grammar, morphology, syntax and meaning across a language pair. Programmers encode this information into rule sets and dictionaries and try to get as much linguistic knowledge as possible into these rule sets. Linguistic knowledge refers to information about word structure (singulars and plurals, first, second and third person endings and so on), word meanings (dictionary definitions), grammar (word order, part-of-speech [POS], typical phrasing), and homonyms (ambiguous terms that can have different meanings in different contexts). Very simply put, a RbMT system consists of a dictionary and a set of rules for the language combinations that the system can process.
The Vauquois triangle shown above is useful to describe the various approaches to the MT engineering problem, and to better understand how they might evolve. What we see is that the simplest approaches used by early RbMT systems were based on direct transfer, with simple dictionaries and very simple rules to change word order in the target language, e.g. Subject-Verb-Object (SVO) to Subject-Object-Verb (SOV). These early systems were improved by switching to a more modular approach that separated the analysis of the source language sentence from the synthesis of the target language sentence by means of a transfer stage. Thus, they have evolved into systems where linguistic analysis is done on the source language and then, through a transfer process (more complex rules + dictionary), the sentence is reconstructed in the target language. In the analysis phase, the system will look at inflections and conjugations of words (morphology), word sequence and structure of sentences (syntax), and to a certain degree the meaning of words in context (semantics). In other words, it will “parse” the source language sentence, so that the so-called “part of speech” or word class is determined for each word or phrase in the sentence, by checking it in a comprehensive dictionary of the language. Sometimes this analysis is repeated for the target language. Unlike the direct transfer mode, which is specific to a single language pair, this approach offers some possibility of creating a more generalized model. To the best of my knowledge, none of the systems have really evolved to a point where semantics are deeply incorporated into the automated translation process. This is an area of possible evolution.
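The direct transfer idea described above (a simple dictionary plus a simple word order rule) can be sketched in a few lines. This is only a toy illustration under stated assumptions: the mini-lexicon, the romanized Japanese glosses and the function name are hypothetical, and particles and inflection, which a real system would have to handle, are ignored entirely.

```python
# Toy direct-transfer translator: dictionary lookup plus one reordering rule.
# Hypothetical mini-dictionary: source word -> (target word, part of speech).
LEXICON = {
    "cat": ("neko", "NOUN"),
    "dog": ("inu", "NOUN"),
    "fish": ("sakana", "NOUN"),
    "eats": ("taberu", "VERB"),
    "sees": ("miru", "VERB"),
}

def translate_svo_to_sov(sentence):
    """Translate word by word, then move the verb to the end (SVO -> SOV)."""
    words = sentence.lower().split()
    # Unknown words are passed through unchanged and tagged UNK.
    tagged = [LEXICON.get(word, (word, "UNK")) for word in words]
    # The single transfer rule: NOUN VERB NOUN becomes NOUN NOUN VERB.
    if [pos for _, pos in tagged] == ["NOUN", "VERB", "NOUN"]:
        subject, verb, obj = tagged
        tagged = [subject, obj, verb]
    return " ".join(target for target, _ in tagged)

print(translate_svo_to_sov("cat eats fish"))
```

Any sentence that does not match the one pattern the rule knows about is left in source word order, which hints at why such systems needed ever larger rule sets.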
There are several core problems for RbMT systems:
1. Human Language Inconsistency. Human languages are filled with exceptions that do not follow the rules.
2. Disambiguation. Ambiguity remains the core challenge of RbMT systems. Words may have different meanings depending on their grammatical and semantic references.
3. Local Sentence Focus. RbMT systems analyze and translate one sentence at a time. They have little understanding about the context or the broader corpus from which the sentence originates.
4. Inherent System Conflicts. Dictionaries are the key mechanism used to tune and refine and improve RbMT systems. However, the lack of semantic features on words and expressions means that dictionary maintenance must be monitored very carefully. A new dictionary entry that improves the translation in one sentence may introduce an error in another context.
5. Skills Required. The development of these dictionaries is expensive as it requires a highly specialized skill set. This lexical work requires people conversant in linguistics, corporate terminology, and software technology and programming. The lexical information required goes beyond bilingual word lists and requires knowledge of part-of-speech (POS) information and morphology. The developers must have linguistic skills as well as language fluency.
6. Maintenance Overhead. The complex rule sets that drive these systems become cumbersome and difficult to maintain over a period of time. Often different teams develop different parts of the rule sets and it can be nearly impossible for developers to have a full understanding of the rules and their interactions. This problem is compounded when very specific rules are introduced to handle special cases of grammar or translation.
7. Diminishing Returns. As human language is inherently inconsistent – or infinitely nuanced – the law of diminishing returns comes into effect. Eventually, modifying a rule to improve results in one context weakens or destroys results in other contexts. Most RbMT systems have hit a ceiling on achievable quality after a few years, so that further modifications introduce as much degradation as improvement. Once RbMT systems reach this plateau improvements are slow even if they are possible.
8. New Language Pairs. Given the process and the difficulties described above, we see that the development of new language pairs is arduous and slow if any reasonable quality is desired. The effort requires development of grammars, lexicons, morphologies, transfer rules, and generation rules. The people involved must have highly specialized skills and deep knowledge of the languages involved.
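The disambiguation and dictionary conflict problems listed above can be made concrete with a small sketch. It assumes a hypothetical English to Spanish dictionary with one default entry for "bank" and one hand-written context rule; the data structures and the function name are illustrative and not taken from any actual RbMT product.

```python
# Context-free default entries (hypothetical): one target word per source word.
DEFAULT_DICTIONARY = {"bank": "banco"}  # the financial sense

# Hand-written context rules (hypothetical): if any clue word appears in the
# sentence, the listed target word overrides the default entry.
CONTEXT_RULES = {
    "bank": [({"river", "shore"}, "orilla")],  # the riverside sense
}

def translate_in_context(word, sentence_words):
    """Pick a target word using clue words, falling back to the default."""
    for clues, target in CONTEXT_RULES.get(word, []):
        if clues & set(sentence_words):
            return target
    return DEFAULT_DICTIONARY.get(word, word)

print(translate_in_context("bank", "the bank approved the loan".split()))
print(translate_in_context("bank", "the river bank flooded".split()))
```

Without the context rule, changing the default entry to fix the river sentence would break the loan sentence, which is the kind of conflict that makes dictionary maintenance delicate. With it, every new sense needs another hand-written rule, and the rules begin to interact.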
Many Asian languages have the additional issue of segmentation to consider. How is a computer to decide what a word is in a continuous block of Chinese or Thai characters? Often, several options for character combinations could exist and computers have great difficulty knowing how to approach this without the context that humans can easily place on a sentence. This is further complicated by the sparse use of punctuation and other elements that are common in western languages.
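The segmentation problem can be illustrated with a small sketch. Since it would be easy to get real Chinese or Thai examples wrong, the sketch assumes an unspaced English string as a stand-in; the vocabulary and the function name are hypothetical. It simply enumerates every way of splitting the string into known words.

```python
def segmentations(text, vocab):
    """Return every way of splitting text into a sequence of words from vocab."""
    if not text:
        return [[]]  # one valid segmentation of the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in vocab:
            # Keep this word and segment the remainder recursively.
            for rest in segmentations(text[end:], vocab):
                results.append([head] + rest)
    return results

# Hypothetical vocabulary, chosen so that the string is ambiguous.
VOCAB = {"the", "theta", "table", "bled", "down", "own", "there"}

for option in segmentations("thetabledownthere", VOCAB):
    print(" ".join(option))
```

Both readings are made up entirely of dictionary words, so the dictionary alone cannot choose between them; some notion of context or likelihood is needed.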
Customization of RbMT
The ‘free’ MT systems that we see on the web today can all be characterized as baseline systems. Successful enterprise use of this technology is, however, characterized by special tuning efforts to raise the quality above these free systems. All MT systems can be optimized to specific use cases and will tend to perform better when this specific optimization is done.
For RbMT systems there are two ways to tune the system for specific customer requirements:
1. Modify the rule sets
2. Expand and extend the existing dictionaries and vocabulary to match the needs of the language used in the customer’s target application
The primary means of tuning most RbMT systems in use today involves the development of dictionaries. Most RbMT systems today require dictionaries in which terms are coded for their inflections (morphology) and for their role in a sentence (part of speech).
While it is theoretically possible to change the rule sets in RbMT systems, this is usually a very complex task that can only be done by very specialized development staff. Vendors of RbMT systems may be willing to do this for significant development dollars sometimes, but basically, this option is not available to general or even specialist users of the technology.
Statistical Machine Translation -- SMT
Statistical machine translation approaches have been gaining considerable momentum since 2004. The premise of these new approaches is that purely linguistic knowledge is far less important than having large volumes of human translated data to analyze and process. By analyzing large corpora of texts instead of just one sentence, the new systems attempt to simulate the way human translators work. Human translators have general knowledge of the everyday world, and they quickly grasp the context and the domain in which they operate. As computer storage and processing capacity increased exponentially, and as large digital bilingual corpora became available, researchers began to suggest that computers could “learn” and extract the systematic patterns and knowledge that humans have embedded into historical translations. SMT systems have in a few short years overtaken the RbMT systems in quality in most baseline systems available on the web today.
These second-generation SMT solutions adopt a data and probability-based approach to translation and are also often called “data-driven” approaches. Simply speaking, SMT systems are developed by computationally analyzing large bodies of parallel bilingual text, which they treat as strings of characters. They determine patterns in this text and exploit these regularities by matching the patterns in new material that is presented for translation.
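To give a feel for how patterns can be extracted from parallel text, here is a minimal sketch that guesses word translations purely from co-occurrence counts. It is a deliberate simplification: production SMT systems use word alignment models and phrase tables rather than the Dice coefficient used here, and the tiny English to Spanish corpus and all names are hypothetical.

```python
from collections import Counter
from itertools import product

# Hypothetical toy parallel corpus of (source sentence, target sentence) pairs.
PARALLEL = [
    ("the house", "la casa"),
    ("the blue house", "la casa azul"),
    ("the flower", "la flor"),
    ("a blue flower", "una flor azul"),
]

def train(pairs):
    """Count in how many sentence pairs each word and each word pair occurs."""
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        pair_count.update(product(src_words, tgt_words))
    return src_count, tgt_count, pair_count

def best_translation(word, model):
    """Return the target word with the highest Dice score, or None if unseen."""
    src_count, tgt_count, pair_count = model
    scores = {
        tgt: 2 * count / (src_count[src] + tgt_count[tgt])
        for (src, tgt), count in pair_count.items()
        if src == word
    }
    return max(scores, key=scores.get) if scores else None

model = train(PARALLEL)
print(best_translation("house", model), best_translation("blue", model))
```

No dictionary or grammar is supplied; the associations come only from which words tend to appear in the same sentence pairs, which is why the quantity and cleanliness of the data matter so much.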
They learn by feeding on parallel corpora of texts, ingesting huge volumes of already translated content. SMT developers require large amounts (millions of TUs if possible) of both parallel corpora and monolingual corpora to train their systems. Parallel corpora consist of perfectly aligned texts in source and target languages, similar to translation memories. Typically, sentences and phrases are used rather than individual words or very large paragraphs. The statistical software picks out two-, three-, and four-word phrases (ngrams), and so on, usually up to eight-word phrases, in the source language that match the target language. These ngrams are then used to produce several hundred or even thousands of hypothetical translations for new sentences. The monolingual corpus in the target language is also used to ‘calculate’ the best word and phrase combinations for these hypothetical translations and to help the system determine the best (statistically most significant) single translation to present to the user. This is done using the language model (LM), which is a statistical model encapsulating fluency and usage frequency in the target language. Disambiguation is much less of an issue for SMT systems, as they have a much better sense of context.
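The role of the language model in choosing among hypothetical translations can be sketched as follows. This is a minimal bigram model with add-one smoothing, trained on a hypothetical two-sentence monolingual corpus; real systems use far larger corpora, longer ngrams and better smoothing, and combine the LM score with translation model scores.

```python
import math
from collections import Counter

def train_bigram_lm(sentences):
    """Count unigram contexts and bigrams, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens[:-1])               # words that act as context
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(set(unigrams) | {"</s>"})
    return unigrams, bigrams, vocab_size

def log_prob(sentence, lm):
    """Add-one smoothed log probability of a sentence under the bigram model."""
    unigrams, bigrams, vocab_size = lm
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
        for a, b in zip(tokens, tokens[1:])
    )

# Hypothetical monolingual corpus in the target language.
MONOLINGUAL = ["the cat sat on the mat", "the dog sat on the mat"]
lm = train_bigram_lm(MONOLINGUAL)

# Two candidate outputs containing the same words in a different order.
hypotheses = ["mat the on sat cat the", "the cat sat on the mat"]
best = max(hypotheses, key=lambda h: log_prob(h, lm))
print(best)
```

Note that the model rewards fluency only. It has no way of knowing whether the most fluent candidate is also a faithful translation of the source.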
Thus, the internet is a natural source of material that can be used to train and build SMT engines. Also, this is essentially a technology that requires data processing and computational linguistic skills more than real bilingual competence in the development process. However, competent bilingual humans involved in vetting the training data and the output can add considerably to the quality of the systems. The EU has, in a few short years, generated 460+ SMT language pair combinations (22 EU languages, each into 21 possible targets) to facilitate production of content into all the different EU languages. This Euromatrix project has also spawned a thriving open source SMT movement.
As with RbMT there are, however, several core problems with SMT:
1. Data Requirements. SMT needs very large amounts of bitext (parallel bilingual data) to be able to build good systems. Large in this case can mean tens of millions of sentence pairs which is often very difficult to find.
2. Randomness of Errors. Often when dealing with such large amounts of data SMT developers will scrape the web to collect data and introduce ‘noise’ into the core training corpus. Noise refers to errors introduced by training on erroneous data e.g. mistranslations, wrong alignment and MT rather than human translations in the training material. This “dirty” data can produce many strange and unpredictable error patterns that are hard to eliminate. Clean data however, appears to dramatically reduce this problem.
3. Transparency. Users do not have enough control over the system or ability to tweak it, and many complain that it is too much of a black box. Users have complained that often the only refrain they hear from developers is to get more data (which is usually not available). This is changing with 2nd generation SMT vendors, who are opening up the insides and letting users see how the SMT engine comes to its translation conclusions. This allows the possibility of dynamic improvement and correction with translator feedback.
4. Power of the Language Model. The SMT Language Model can often produce very fluent and natural sounding but incorrect translations.
5. System Resources. Many powerful computers are necessary for both training and running the final systems, so it can only be a server based solution for the most part.
6. Difficulty with Unlike LPs. Some language combinations work very well (FIGS, BP), but language pairs that have large differences in morphology and syntax do not do well with the basic pattern matching approach. While often more successful with Asian languages than RbMT, SMT too has a long way to go with Asian languages which have many special issues.
7. Lack of Linguistics. The first generation SMT systems are direct transfer systems that have no knowledge of linguistics and thus have serious word order and morphology problems. As SMT systems evolve, they will begin to incorporate linguistics and produce even better quality. We are beginning to see the first of these emerging in 2008.
Customization of SMT
SMT systems are very easy to tune to a specific customer’s requirement if there is enough data available to do so. The most critical ingredient is a sizable amount of bitext comprising 100,000 or more TUs. Glossaries and other terminology assets can also be helpful in addition to monolingual content in the target language and domain to develop a language model. If enough data exists, it is possible to create a system out of a single customer’s data.
In addition to the basic ease in getting a system customized, SMT systems can be designed to use and respond to real-time corrective feedback and are very well matched to massive online collaboration. SMT customization can involve frequent corrections, so that systems actually get better over time as users use them. Every correction adds to the “linguistic knowledge” of the system and so it continues to “learn”. Theoretically it is conceivable that these systems can become quite compelling in their quality.
SMT is an approach that is much better suited to work with emerging business models driven by social networking and Web 2.0 concepts. Thus the growing momentum behind web based collaboration, crowdsourcing, volunteer translation and the ever growing volume of data on the Internet all align perfectly to drive SMT forward. The diagram on the previous page illustrates how this would work in new human driven automated translation models that are being established by Microsoft and Asia Online.
Given the state of the technology today, where no MT provides perfect output, it is necessary to involve human beings in the process to get publishable or human-like quality. Thus, while historically we have grown accustomed to “gist” quality from MT, the future will see much more compelling quality. This quality will be possible because humans will be much more integrated into the process and the basic infrastructure will be designed to promote and facilitate rapid evolution in quality.
The Current Status Quo
Today, both approaches can claim success in many different kinds of applications. RbMT systems are in use at the EU, Symantec, Cisco, Fortis bank and many other places to enhance translation productivity and accelerate translation work. The PAHO (Pan American Health Organization) system is one of the most respected and actively used MT systems in the world with a tightly integrated post-editing capability built into it. Several vendors offer RbMT solutions that can also run on the desktop and are sometimes (rarely) used by translators as a productivity tool. These vendors include Systran and ProMT; even SDL and Lionbridge have offerings, though relatively weak ones. Additionally there are vendors who focus on regional languages like Apptek, Sakhr (Arabic, Middle Eastern) and Open Logos, BrainTribe and Linguatec (German). The Japanese also have a whole suite of RbMT systems to their credit, with Toshiba and Fujitsu having the best reputations. Many RbMT companies have started and failed along the way. Systran is probably the best known name in the RbMT world and has the broadest range of languages available. However, the reputation of MT in general has been based on these systems, and several of the RbMT systems we see today are the result of 30+ years of effort and refinement. Many now say that RbMT systems have reached the limit of their possibilities and that we should not expect much more evolution in future.
The SMT world is where most of the excitement in MT is today. Perhaps the most successful MT application in the world today, the Microsoft knowledge base, used by hundreds of millions of users across the globe, is mostly a SMT based effort. Today, Google and the Microsoft Live free translation portals are powered by SMT. In just a few years, SMT systems have caught up in quality to RbMT systems that have decades of development efforts behind them. Given that we are just at the start of SMT technology, and that today's systems are mostly “phrase based SMT” (PBSMT) and really just simple direct transfer systems, there is real reason for optimism as these systems start to incorporate linguistics, add more data and get access to more computing power. SMT systems are also much better suited to massive online collaboration. Commercially, there are now several alternatives available from vendors like Asia Online, Alfabetics, ESTeam, Languagelens and Language Weaver, and probably many others will appear in the coming years. There is a growing open source movement (Moses) around the technology that is already outperforming the systems produced by SMT pioneers. Several automotive companies, Intel and others have implemented SMT based translation productivity or technical knowledge base systems.
In a recent report on post-editing best practices, TAUS reported that, “In theory and also in practice, data-driven MT systems combined with machine learning systematically improve the output, reducing the post-editing load over continuous cycles of translation and machine learning”. They also state that, “translation memory output is increasingly being seen as part of MT output and the two are being compared in terms of post-editing practices.” In this report, they describe a test at Autodesk comparing quality and productivity of RbMT vs. SMT systems on the same data. Autodesk discovered that for its particular content, SMT outperformed RbMT on every indicator, especially post-editing productivity, where a rate of 1,340 words an hour (4 or 5 pages an hour or 30 pages a day) was achieved. However, there are also studies that show that post-editors often prefer the more systematic error patterns of RbMT output, so no definite conclusions can be drawn yet. It is clear, though, that SMT is on the march.
RbMT: In development since the early 1960s; many systems have been around for 30+ years.
SMT: First systems began to emerge from 2004 onwards, based on IBM patents.
RbMT: 40 to 50 language combinations after 50+ years of effort. New LPs take at least 6 months to years to develop.
SMT: 600+ language combinations developed in less than 5 years. The EU alone has 462 engines built out of Euromatrix project data. Development is possible wherever bilingual data is available.
RbMT: Customization is based on complex dictionary and rule set modifications.
SMT: Customization is easily done when domain specific data is available.
Effort to Customize
RbMT: Complex, long and expensive.
SMT: Easy if data and computing resources are available.
RbMT: Has essentially reached a plateau and has remained fundamentally the same for many years.
SMT: Most systems, especially Google systems, have been improving rapidly and are expected to continue to improve over the coming years. Best quality approaches human draft quality.
RbMT: Very little ability to incorporate community feedback except for dictionary contributions.
SMT: Microsoft, Google and Asia Online actively seek and incorporate massive “crowd” collaboration to correct raw MT and enable rapid improvements in quality. This practice may set this SMT quality apart from all previous MT.
RbMT: Computing requirements are relatively low and desktop installation is also possible. A single server can run 10+ language pairs.
SMT: Significant computing resources are required to both build and run SMT engines, so it is not really suitable for single user desktop installation. Better suited to a server or computing cloud based solution.
The real promise of SMT is yet to come. All the systems we are looking at today are essentially first generation systems. They will only get better and more robust in years to come.
What is a Hybrid System?
It has become very fashionable and desirable for developers to describe their approaches as hybrid. Many informed observers state that both RbMT and SMT need to learn and draw from each other to evolve. The expert opinion is that hybrid systems are the answer.
Systran 6.0 now uses a statistical language model approach to improve the fluency of its output. ProMT is also adding similar capabilities. Several SMT developers like Microsoft, Asia Online and LW are experimenting with syntax based SMT. Thus, SMT is evolving from pure non-linguistic, pattern matching techniques to the incorporation of grammar and parts of speech. Most SMT systems I am aware of, already use some rules in pre-processing and post-processing steps. The use of linguistic rather than data concepts will increase to enable the next wave of improvements.
This will continue as linguistic information will enter the SMT development process and RbMT engines will try and incorporate statistical methods around their rule structures. However, all the systems in active use today are predominantly one or the other.
As we look at MT technology today, we see that while there have been decades of failures, SMT appears to be the increasingly dominant way of the future, but we still have some distance to go. Raw MT technology still falls short of any widely understood and accepted standard of quality.
The path to higher quality has to involve humans. Language is too complex, too varied and too filled with irregularities to be resolved just by algorithms and computers with lots of data. One of the most promising new trends is a movement to a more intensive man-machine collaboration in large scale translation projects. The combination of SMT with massive online collaboration could bring us to a tipping point that really does help MT technology to become more pervasive. Microsoft has already led the way in showing how much value domain focused MT can provide to a global customer base. They are now adding crowdsourcing into the mix and having their best resellers manage crowdsourced editing to improve the raw MT that is in the bulk of the knowledge base. Asia Online has embarked on a project to translate the English Wikipedia into several South Asian languages to reduce information poverty in these regions. Initially, internal staff linguists post-edit and help raise the quality of raw MT to a level where 100% comprehensibility is reached. This still imperfect content is then released and people who use the content or roaming bands of bilingual surfers come and help to put finishing touches to the output. Some do it because they want to help and others because they might win prizes. These edit changes flow back into the learning systems and the systems continue to improve. Based on early results, it is conceivable that the translation quality will get to a level where students could use it directly in homework assignments with very minor grammatical adjustments. Even Google invites the casual user to suggest a better translation. These efforts will steadily drive SMT quality higher.
So, is MT possible in your future? It is important to note that the best MT systems will need to get the respect of real translators, professional and amateur, and these translators will only be interested in using these systems if they have evidence that they can work faster, more efficiently and more effectively using the technology.
Also, real translators are among the most competent people to judge what is good, and what is not, on issues related to translation. Until MT vendors are willing to submit to their judgment and earn an approval or even an endorsement from them, the MT market will stumble along in the doldrums as it has for the last 50 years, making empty promises.
The combination of massive computing power, ever increasing volumes of clean bilingual text and a growing band of motivated bilingual humans (not always professional translators) will be the key forces that drive this technology forward. The scientists may also produce better algorithms along the way, but their contributions will be relatively small. The most important driving force will be: the human need to know and understand what other humans across the globe are saying, the need to share and the urge to learn. The breakthrough that will end (or at least reduce) the language barrier will be social not technical.
Note: I should state for the record that I am currently employed by Asia Online, which has developed a largely SMT based technology platform, and that I have previously been associated with other commercial SMT initiatives.
Brantley, Peter (2008). http://blogs.lib.berkeley.edu/shimenawa.php/2008/11/02/losing-what-we-don-t-see-translation