At the F8 Developer Conference on April 19–20, 2017, Necip Fazil Ayan, Engineering Manager at Facebook, gave a 20-minute update on the state of the art in machine translation at the social networking giant.
Slator reported in June 2016 on Facebook’s big expectations for NMT. At the time, Alan Packer, Engineering Director and head of the Language Technology team at Facebook, predicted that “statistical or phrase-based MT has kind of reached the end of its natural life” and the way to go was NMT.
Ten months on and Facebook says it is halfway there. The company claims that more than 50% of machine translations across the company’s three platforms — Facebook, Instagram, and Workplace — are powered by NMT today.
Facebook says it began exploring a migration from phrase-based MT to neural MT two years ago and deployed its first system using the neural net architecture (German to English) in June 2016.
Ayan said 15 systems have been deployed since then, including high-traffic language pairs like English to Spanish, English to French, and Turkish to English.
No tech presentation would be complete without a healthy dose of very large numbers. Ayan said Facebook now supports translation in more than 45 languages (2,000 language combinations), generates two billion “translation impressions” per day, and serves translations to 500 million people daily and 1.3 billion monthly (that is, everyone, basically).
Ayan admitted that translation continues to be a very hard problem. He pointed to informal language as being one of the biggest obstacles, highlighting odd spellings, hashtags, urban slang, dialects, hybrid words, and emoticons as issues that can throw language identification and machine translation systems off balance.
Another key challenge for Facebook: low-resource languages. Ayan admitted Facebook has very limited resources for the majority of the languages it translates.
“For most of these languages, we don’t have enough data,” he said, meaning parallel data or high-quality translation corpora. What is available, even for many low-resource languages, is large corpora of monolingual data.