The best guess is that humans currently speak about 6,900 different languages. More than half the global population communicates using just a handful of them—Chinese, English, Hindi, Spanish, and Russian. Indeed, 95 percent of people communicate using just 100 languages.
The other argots are much less common. Indeed, linguists estimate that about a third of the world’s languages are spoken by fewer than 1,000 people and are in danger of dying out in the next 100 years or so. With them will go the unique cultural heritage that they embody—stories, phrases, jokes, herbal remedies, and even unique emotions.
It’s easy to think that machine learning can help. The problem is that machine translation relies on huge annotated data sets to ply its trade. These data sets consist of vast corpora of books, articles, and websites that have been manually translated into other languages. Together, they act like a Rosetta Stone for machine-learning algorithms, and the bigger the data set, the better the algorithms learn.
A map showing how the past tense indicators cluster for 100 of the languages investigated.
But these huge data sets simply do not exist for most languages. That’s why machine translation works only for a tiny fraction of the most common lingos. Google Translate, for example, only speaks about 90 languages.
So an important challenge for linguists is to find a way to automatically analyze less common languages to better understand them.
Today, Ehsaneddin Asgari and Hinrich Schutze at Ludwig-Maximilian University of Munich in Germany say they have done just that. Their new approach reveals important elements of almost any language that can then be used as a stepping stone for machine translation.
The new technique is based around a single text that has been translated into at least 2,000 different languages. This is the Bible, and linguists have long recognized its importance in their discipline.
Consequently, linguists have created a database called the Parallel Bible Corpus, which consists of translations of the New Testament in 1,169 languages. This data set is not big enough for the kind of industrial machine learning that Google and others perform. So Asgari and Schutze have come up with another approach based on the way tenses appear in different languages.
Most languages use specific words or letter combinations to signify tenses. So the new trick is to manually identify these signals in several languages and then use data-mining techniques to hunt through other translations looking for words or strings of letters that play the same role.
For example, in English the present tense is signified by the word “is,” the future tense by the word “will,” and the past tense by the word “was.” Of course, there are other signifiers too.
Asgari and Schutze’s idea is to find all these words in the English translation of the Bible, along with other examples from a handful of other language translations, and then to look for words or letter strings that play the same role in other languages. For example, the letter string “-ed” also signifies the past tense in English.
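To make the idea concrete, here is a minimal Python sketch of how a known tense marker in one translation might be projected onto another through verse alignment. It is an illustration under stated assumptions, not the researchers' implementation: the chi-squared association score, the function names `char_ngrams` and `rank_markers`, the toy verses, and the invented pivot token `pst` are all hypothetical choices made for this example.

```python
from collections import Counter


def char_ngrams(text, n_min=2, n_max=4):
    """Return the set of character n-grams in a verse ('_' marks word edges)."""
    grams = set()
    for word in text.lower().split():
        padded = f"_{word}_"
        for n in range(n_min, n_max + 1):
            for i in range(len(padded) - n + 1):
                grams.add(padded[i:i + n])
    return grams


def rank_markers(pivot_verses, target_verses, pivot_marker, top_k=3):
    """Rank target-language n-grams by association with the pivot marker.

    Assumption: association is scored with a 2x2 chi-squared statistic over
    aligned verses; this is an illustrative choice, not a confirmed detail.
    """
    # Verse IDs where the pivot marker appears as a whole word.
    marked = {vid for vid, text in pivot_verses.items()
              if pivot_marker in text.lower().split()}
    shared = [vid for vid in target_verses if vid in pivot_verses]
    n = len(shared)
    n_marked = sum(1 for vid in shared if vid in marked)

    total = Counter()      # verses containing each n-gram
    in_marked = Counter()  # marked verses containing each n-gram
    for vid in shared:
        grams = char_ngrams(target_verses[vid])
        total.update(grams)
        if vid in marked:
            in_marked.update(grams)

    scores = {}
    for gram, t in total.items():
        a = in_marked[gram]      # marked, has gram
        b = t - a                # unmarked, has gram
        c = n_marked - a         # marked, lacks gram
        d = n - n_marked - b     # unmarked, lacks gram
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        # Skip uninformative n-grams and negative associations.
        if denom == 0 or a * d <= b * c:
            continue
        scores[gram] = n * (a * d - b * c) ** 2 / denom
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda g: (-scores[g], g))[:top_k]


# Toy verse-aligned data. The pivot "language" is invented: 'pst' stands in
# for a manually identified past-tense marker.
pivot = {1: "he pst walk", 2: "he walk", 3: "they pst talk",
         4: "they talk", 5: "she pst jump", 6: "she jump"}
target = {1: "he walked", 2: "he walks", 3: "they talked",
          4: "they talk", 5: "she jumped", 6: "she jumps"}

result = rank_markers(pivot, target, "pst")
print(result)  # ['d_', 'ed', 'ed_']
```

On this toy data the top-ranked strings are all variants of the word-final “-ed,” because it appears in every verse where the pivot marker occurs and in none of the others. A real corpus would be far noisier, with thousands of verses and many competing candidates.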
But there is a twist. Asgari and Schutze do not start with English because it is a relatively old language with many exceptions to the rule, which makes it hard to learn.
Instead, they start with a set of Creole languages that have developed from a mixture of other languages. Because they are younger, Creole languages have had less time to develop these linguistic idiosyncrasies. And that means they generally contain better markers of linguistic features such as tense. “Our rationale is that Creole languages are more regular than other languages because they are young and have not accumulated ‘historical baggage’ that may make computational analysis more difficult,” they say.