I've built a simple tool to help answer an old question of mine: “Which engine translates best?” You can find it here: www.gabble-on.com
Throughout three years of high school Spanish and three years of college Chinese, I used translation sites like BabelFish a lot. They were far from perfect and I always rotated between several sites to try to find the best tool for any given situation.
10 years later, my question still hasn’t been answered. So I’ve put together this open research project to allow anyone who speaks two languages to type any phrase into our dynamic engine, compare the results of multiple translation engines side by side, and vote on the best.
I appreciate your help and I hope you’re curious too. My goal is to collect 10,000 votes over the 6 weeks between February 15th and March 29th, analyze the data, and publish the results.
As a thank you for everyone's participation in this project, I'm also holding a fun little March Madness contest with an iPad as a giveaway to one lucky winner. Come check it out! www.gabble-on.com
An approach to calculate the best Online Translation Engine
Mar 23, 2010
Hello Ethan,
I cast my vote already with English to Spanish, Finnish, and Swedish.
I look forward to seeing the results.
FYI: We have experimented with an automated way of finding the best translation. For practical reasons, we used the same set of online translation engines as you do.
My personal sense is that this is a pretty meaningless exercise unless one has some upfront clarity on why you are doing this. It depends on what you measure, how you measure, for what objective and when you measure. On any given day, any one of these engines could be the best for what you specifically want to translate. Measuring random snippet translations on baseline capabilities will only provide the crudest measure, one that may or may not be useful to a casual internet user but is completely useless for understanding the possibilities that exist for professional enterprise use, where you hopefully have a much more directed purpose. In the professional context, knowledge about customization strategies and key control parameters is much more important. The more important question for the professional is: Can I make it do what I want relatively well and relatively easily?
This is another criticism by Alon Lavie who is a professor of computational statistics at CMU:
Ethan Shen of Gabble On “is hoping to be able to detect predictive patterns in the data that he could use to predict future engine performance. But he has no control over the input data (participants choose to translate anything they want), and he's collecting just about no real extrinsic information about the data. So beyond very basic things such as language-pair and length of source, he's unlikely to find any characteristics that are predictive of any future performance with any certainty whatsoever.
What can be done (but Ethan is not doing) is to use intrinsic properties of the MT translations themselves (for example, word and sequence agreement between the MT translations) to identify the better translation. In MT research, that's called "hypothesis selection". My students and I work extensively on a more ambitious problem than that - we do MT system combination, where we attempt to create a new and improved translation by combining pieces from the various original MT translations. Rather than select which translation is best, we leverage all of them. We have had some significant success with this. At the NIST 2009 evaluation, we (and others working on this) were able to get improvements of about six BLEU points beyond the best MT system for Arabic-to-English. That was about a 10% relative improvement. That was a particularly effective setting. Strong but diverse MT engines that each produce good but different translations are the best input to system combination.”
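To make the "word and sequence agreement" idea in the quote above concrete, here is a minimal sketch of agreement-based selection. It is an illustration only, not the method used at CMU or on Gabble On: the function names, the Dice overlap over unigrams and bigrams, and the engine labels are all assumptions chosen for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap(a, b, max_n=2):
    """Average Dice overlap of 1..max_n-grams between two token lists."""
    scores = []
    for n in range(1, max_n + 1):
        ga, gb = ngrams(a, n), ngrams(b, n)
        total = sum(ga.values()) + sum(gb.values())
        if total == 0:
            continue  # both outputs are too short to have n-grams of this size
        shared = sum((ga & gb).values())  # multiset intersection
        scores.append(2 * shared / total)
    return sum(scores) / len(scores) if scores else 0.0


def select_by_agreement(candidates):
    """Return the name of the output that agrees most with the other outputs.

    `candidates` maps a (hypothetical) engine name to its translation and
    must contain at least two entries.
    """
    tokens = {name: text.lower().split() for name, text in candidates.items()}

    def mean_agreement(name):
        others = [o for o in tokens if o != name]
        return sum(overlap(tokens[name], tokens[o]) for o in others) / len(others)

    return max(tokens, key=mean_agreement)


# Hypothetical outputs from three unnamed engines for the same source phrase.
candidates = {
    "engine_a": "the cat sits on the mat",
    "engine_b": "the cat sits on a mat",
    "engine_c": "cat is sitting at rug",
}
best = select_by_agreement(candidates)
```

Note that this picks the output closest to the consensus, which is only a proxy for quality: if most engines make the same mistake, the consensus is wrong too.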
I meant to say that Alon Lavie is a professor of Computational Linguistics (NLP) above.
[Edited at 2010-04-19 21:11 GMT]
Neil Coffey, United Kingdom
Flaws
Apr 20, 2010
Kirti Vashee wrote:
My personal sense is that this is a pretty meaningless exercise unless one has some upfront clarity on why you are doing this.
Well, actually they do present a list of hypotheses on the site. But in a sense, that's a flaw-- when you conduct an experiment and don't want your subjects to bias your results, you don't usually tell your subjects in advance what your hypothesis is...
Of course, every experiment has flaws, and you have to weigh up the difficulty of removing these flaws vs practical constraints. Some other problems in this case are:
- they say they want 10,000 votes -- but votes of *what*? Is this per language pair? What methodology have they used to estimate that this will be enough to get statistically significant results in the language pair with the likely lowest number of votes?
- how will they assess and compensate for natural biases in the type of people taking part in the experiment? (e.g. the site is in English and located in the US, so more people are likely to find the site in a US search engine configured for English; an MT system trained/designed more for US English will then be inherently likely to fare better)
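One rough way to look at the "is it enough votes?" question is to put a confidence interval around an engine's share of the votes in a single language pair. The sketch below uses the standard Wilson score interval; the vote counts are invented for illustration, and the 1/3 baseline assumes three engines with no real preference between them.

```python
import math


def wilson_interval(wins, total, z=1.96):
    """Wilson score interval for a vote share (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = wins / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# Hypothetical counts: one engine gets 40% of the votes in a language pair.
# With three engines, "no preference" would be a share of about 1/3.
small = wilson_interval(40, 100)    # a sparsely voted language pair
large = wilson_interval(400, 1000)  # the same share with ten times the votes

baseline = 1 / 3
small_is_conclusive = small[0] > baseline
large_is_conclusive = large[0] > baseline
```

Under these made-up numbers the same 40% share is inconclusive at 100 votes and clearly above the baseline at 1,000, which is why the per-language-pair count matters more than the 10,000 total.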
Kirti Vashee wrote:
This is another criticism by Alon Lavie who is a professor of computational statistics at CMU:
Ethan Shen of Gabble On “is hoping to be able to detect predictive patterns in the data that he could use to predict future engine performance. But he has no control over the input data (participants choose to translate anything they want)
Though they do have control over how they *filter* the data they get-- e.g. they can say "we'll only include input between X and Y words in length"-- and this can be a viable approach if done properly. But they obviously need to be careful not to bias their results by "peeking" (e.g. they have to make decisions about how to filter using a sample of the data that is then removed from the data actually analysed, and the decision of which sentences are used for experimental design and which are actually analysed should be random).
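As a sketch of the "no peeking" procedure described above: randomly set aside a small design sample, choose the filter using only that sample, then apply the filter to the remainder and analyse only the remainder. The record layout, the 10% design fraction, and the word-count limits are all assumptions made for the example.

```python
import random


def split_design_and_analysis(records, design_fraction=0.1, seed=0):
    """Randomly split records into a design sample and an analysis set."""
    rng = random.Random(seed)  # fixed seed so the split can be reproduced
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * design_fraction)
    return shuffled[:cut], shuffled[cut:]


def within_length(record, min_words, max_words):
    """Filter rule: keep inputs between min_words and max_words long."""
    return min_words <= len(record["source"].split()) <= max_words


# Hypothetical vote records; only the source text matters for this filter.
records = [{"id": i, "source": "word " * (1 + i % 30)} for i in range(200)]

design, analysis = split_design_and_analysis(records)

# The limits would be chosen by inspecting `design` only (values assumed here),
# and the design sample is then excluded from the analysed data.
min_words, max_words = 3, 20
analysed = [r for r in analysis if within_length(r, min_words, max_words)]
```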
and he's collecting just about no real extrinsic information about the data
Yes, that's a potential problem, though arguably one that can be overcome by collecting a large amount of data. (OTOH, I'm not sure that 10,000 sentences is large enough.)
So beyond very basic things such as language-pair and length of source, he's unlikely to find any characteristics that are predictive of any future performance with any certainty whatsoever.
Arguably true, but if you look at their actual list of hypotheses, they probably *are* collecting enough in principle for those specific hypotheses. (Whether the testing of those hypotheses tells us much about future performance of MT, I'm not sure...)
What can be done (but Ethan is not doing) is to use intrinsic properties of the MT translations themselves (for example, word and sequence agreement between the MT translations) to identify the better translation.
I think this definitely has some advantages in terms of experimental design (your measurements are more "objective"; you can effectively run any text through the system "instantaneously", so you can run arbitrary numbers of sentences/sentences from well-defined sources). What I'd be interested to know is how you then remove the problem of circularity from your results-- in other words, if your experiment shows that Google Translate comes out top, how do you know that this result isn't biased by Google Translate using similar measures in their (essentially unpublished) training process to the ones that you're using in your evaluation?
www.gabble-on.com - Results of my Google Translate vs. Bing Translator vs. Yahoo Babelfish
May 3, 2010
I appreciate all of the input and comments that have been given in this forum, especially those very well supported ones from Alon Lavie.
I do agree that there are some flaws in the experimental design and assumptions; however, I think the results can still provide some interesting insight.
We've found that while Google Translate is widely preferred when translating long passages, Microsoft Bing Translator and Yahoo Babelfish often produce better translations for phrases below 140 characters. Also, in general Babelfish performs well in East Asian languages such as Chinese and Korean, and Bing Translator performs well in Spanish, German, and Italian.
This project is only the first in a series. Many of the comments you have given have been incorporated in the design of our current Phase 2 research which focuses on short phrases and will constrain some of the user's input options.
I hope you all will continue to take interest in my work and share it with your friends.
- Ethan