How do Google Translate crawlers assemble parallel corpora?
Thread poster: UmarAwan
UmarAwan
Pakistan
Dec 14, 2016

Forenote: this is a question about a mildly technical side of Google Translate that I've been failing to get answers to on Quora and the good old Google search engine.

I have read on Quora that if a text and its translation are available on the internet, Google Translate crawlers will assemble a parallel corpus out of the two texts for the Google Translate AI to train on.

I want to know: how exactly do Google Translate crawlers find the translation of, say, a Japanese novel that has been uploaded to a different URL and file than its original text? How would they assemble the two if they were in fact on the same page? How would the crawler find its way from A to B for the text?

Basically, I want to know the system and criteria by which a Google Translate crawler is able to find translations of a text in order to assemble its parallel corpora (for the AI to train on).

I'd really appreciate it if you guys from the machine translation community could help out.



Michael Joseph Wdowiak Beijer
United Kingdom
Local time: 18:50
Member (2009)
Dutch to English
+ ...
my 2 cents Dec 14, 2016

UmarAwan wrote:

I want to know how exactly Google Translate crawlers find the translation of, say, a Japanese novel that has been uploaded to a different URL and file than its original text. [...]


I don't think anyone here knows, and Google will probably want to keep it a secret.

On a related note, the latest version of AlignFactory now also includes such functionality. You can basically point it at a website, and it will carry out all kinds of tricks to try to extract bilingual data. As far as I know, it looks at the source code and file names of the website and tries to identify things that look like they might belong to two linked languages. I suppose stuff like:

66789_nl.html + 66789_en.html
66hhghg89_nl.html + 66hhghg89_en.html
etc.

and then also stuff in the code itself, e.g. it could try to guess the language on a page using algorithms, and then try to pair pages up.
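For what it's worth, that file-name heuristic could look something like this in Python (purely my own illustration; AlignFactory's actual logic isn't public, and the list of language codes is an assumption):

```python
import re
from collections import defaultdict

# Hypothetical sketch: group URLs whose names differ only in a
# language suffix such as "_nl"/"_en", as in the examples above.
LANG_SUFFIX = re.compile(r"_(en|nl|fr|de|ja)\.html?$")

def pair_by_filename(urls):
    groups = defaultdict(dict)
    for url in urls:
        m = LANG_SUFFIX.search(url)
        if m:
            stem = url[:m.start()]          # shared part, e.g. "66789"
            groups[stem][m.group(1)] = url  # language code -> URL
    # keep only stems that resolved to two or more languages
    return {stem: langs for stem, langs in groups.items() if len(langs) >= 2}

pairs = pair_by_filename([
    "66789_nl.html", "66789_en.html",
    "66hhghg89_nl.html", "66hhghg89_en.html",
    "readme.html",
])
```

A real tool would of course cover far more suffix conventions (directories like /en/ and /nl/, query parameters, etc.), but the grouping principle is the same.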

Michael

see e.g.:

http://www.terminotix.com/index.asp?name=AlignFactory&content=item&brand=1&item=4&lang=en
http://www.terminotix.com/docs/factsheet_alignfactory_en.pdf
http://terminotix.com/news/newsletter_2016-11-en.html

PS: if you are really serious about this, I think that with quite a bit of work, you could create your own with the following three tools:

https://www.httrack.com/ (free website crawler)
https://sourceforge.net/projects/aligner/ (best open source aligner, with automatic website download functionality)
https://autohotkey.com/ (free scripting)




[Edited at 2016-12-14 11:58 GMT]


FarkasAndras
Local time: 19:50
English to Hungarian
+ ...
Why Dec 14, 2016

Why do you ask and what do you want to know?

Obviously, Google (and others) have an enormous and varied corpus, and not every bitext got into that corpus the same way. You can be sure they started by adding the easy stuff like EU and UN legislation: that's a lot of material in a bunch of languages where the translations closely match each other and the format is easy to process. It's almost certain that they also added the books they scanned for Google Books. They probably also scanned Gutenberg and a million other major sources.

And yes, I believe their crawlers search the web and auto-identify texts and translations on random websites. I'm sure there are multiple layers to that. Of course, pages/files like www.some/page/document_en.html and www.some/page/document_fr.html are prime targets, but they are not the only targets. For HTML content, it is fairly easy to check the structure of the pages to see if they match (the exact same page layout/formatting with similar amounts of text in the same spots indicates a translation). Perhaps they also check for those ubiquitous "switch language" buttons.

Ultimately, whatever algorithm IDs two texts as possible translations of each other, there are fairly easy automated means of confirming the match. First, there are language identification algorithms out there; I know of one open-source library that is pretty easy to implement and works quite well, and Google will obviously have something that works even better. So you feed the two texts to it to make sure what languages they are. Then you check how long each of them is. If they are close enough, you align them. Autoaligners spit out a confidence/quality score for each sentence pair and/or for the whole text; if the quality score is high enough, you can be pretty sure you/the algorithm got it right. All of these steps can be done without human intervention, and you can introduce human checks where needed.

Google probably only does very rare spot checks to find major screwups and to optimise things. Perhaps they have published something about their process; they occasionally release articles about their stuff, usually without the full recipe for the secret sauce.
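The confirmation steps described above can be sketched roughly like this (a toy illustration, not Google's actual pipeline; the length-ratio threshold and the naive 1:1 sentence-length comparison are my own stand-ins for real language identification and a proper length-based aligner in the Gale and Church tradition):

```python
# Toy sketch of the match-confirmation steps described above.
# Assumes language identification has already happened elsewhere;
# thresholds and scoring here are illustrative, not Google's.

def plausible_pair(text_a: str, text_b: str, max_ratio: float = 1.5) -> bool:
    """Cheap filter: translations are usually of comparable length."""
    la, lb = len(text_a), len(text_b)
    return min(la, lb) > 0 and max(la, lb) / min(la, lb) <= max_ratio

def alignment_confidence(sents_a, sents_b):
    """Naive whole-text score: compare sentence lengths pairwise (1:1 only).
    A real autoaligner also handles 1:2 and 2:1 merges and scores each pair."""
    if not sents_a or len(sents_a) != len(sents_b):
        return 0.0
    scores = [min(len(a), len(b)) / max(len(a), len(b))
              for a, b in zip(sents_a, sents_b)]
    return sum(scores) / len(scores)
```

The point is only that every stage, from the cheap filter to the final score, is automatable, which is why the whole pipeline can run without human intervention.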

[Edited at 2016-12-14 13:08 GMT]



Robin Levey
Chile
Local time: 15:50
Spanish to English
+ ...
My 2 pesos Dec 14, 2016

I have no inside knowledge on this topic. That said, I venture to suggest:

Google probably doesn't (as the OP seems to suggest) first get a text in language A, decide it might be worth looking for a translation in language B, and then send its crawlers to find it so that a parallel corpus can be made from the two versions.

Rather, I believe that Google's crawlers are constantly indexing pretty much anything they can find in both languages A and B, and that a different tool (not a crawler) then goes looking for possible matches (translations) among the data once it is safely lodged on Google's own servers.

Google has a number of tricks up its sleeve that are not available to us (including some that used to be available, but were removed because they were too useful*). Google has privileged access to all its own data and can apply numerous search 'n' match algorithms that are not available to us mere mortals. And they have more computing power than most of us, so they can afford to spend (and risk wasting) computing resources running hugely complicated algorithms, rapidly, against vast quantities of potentially matching material.

I would imagine that, regardless of the language pair (i.e. including pairs such as ja/en), any true translation of an entire document will have certain traits in common with the original version which could be recognised by suitable statistical algorithms. A top-down alignment process, starting with structural basics like document length, relative length of chapters, number of headings and sub-headings, number of illustrations, etc., could probably generate a very short shortlist, even before determining the languages used. Short-listed texts could then be examined further. For example, most translations will contain a textual reference to the original version: name of author, title, date of publication, and so on; much of that data will transcend the language barrier. If Google has got this far, it'll be worth moving on to actual text alignment.
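As a rough illustration of that top-down idea (the fingerprint fields and tolerances below are my own guesses, not anything Google or anyone else has published):

```python
# Minimal sketch of a structural "fingerprint" comparison used to
# shortlist candidate translations before any language-aware step.
from dataclasses import dataclass

@dataclass
class DocFingerprint:
    n_chars: int
    n_paragraphs: int
    n_headings: int

def structurally_similar(a: DocFingerprint, b: DocFingerprint,
                         length_tolerance: float = 0.3) -> bool:
    """Cheap top-down filter: compare structure, not words."""
    if a.n_headings != b.n_headings:          # headings usually map 1:1
        return False
    if abs(a.n_paragraphs - b.n_paragraphs) > 2:
        return False
    ratio = abs(a.n_chars - b.n_chars) / max(a.n_chars, b.n_chars)
    return ratio <= length_tolerance

original = DocFingerprint(n_chars=120_000, n_paragraphs=840, n_headings=12)
candidate = DocFingerprint(n_chars=138_000, n_paragraphs=841, n_headings=12)
```

Anything that survives such a filter would then be worth the more expensive language-identification and alignment steps.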

RL


* Example: Verbatim text matching with wildcards, which was very useful for evaluating things like the preferred word-order in standard phrases, or for plagiarism detection. This has been “designed out” of the public Google search engine in favour of supposedly “intelligent” searches which ignore word order and (too) many stop-words, even when the search phrase is between quotes (see discussion here: http://www.proz.com/forum/internet_for_translators/309884-exact_phrase_search_no_bolded_terms_in_the_results_yet_no_message_stating_no_results_found.html ).


UmarAwan
Pakistan
TOPIC STARTER
Thanks for the input Dec 14, 2016

FarkasAndras wrote:

[...] Of course, pages/files like www.some/page/document_en.html and www.some/page/document_fr.html are prime targets, but they are not the only targets. [...]


Well, I can at least confirm that the page/file method is in use.

However, I have it from a Google Translate engineer on Quora that if a novel and its translation have been uploaded to the internet, Google's crawlers will find them and assemble corpora out of them. Further questioning about how these crawlers operate only brought me back to the page/file method, which was confirmed to be used by Google Translate's crawlers.

What I want to know is whether these web crawlers would assemble parallel corpora out of a novel and its translations, as claimed by this SMT expert, and if so, how. The following are the difficulties I can see arising:

How are the translations found in the first place? There is no link between the ISBN of a text and that of its translation, so how would the crawler get from A to B in the specific case of novels? Translations of e-books won't just be listed under page/file naming systems.

The reason I want to know this is so that I can confirm whether there is a system for Google Translate crawlers to assemble corpora from novels.


FarkasAndras
Local time: 19:50
English to Hungarian
+ ...
Come again? Dec 14, 2016

I'm not sure why you're fixated on novels.
I'm sure Google added a bunch of aligned books to its DB, most likely including the books it scanned itself. There's little doubt that if it finds a book in two language versions on a website, it will add that too. However, I doubt that Google will find a random novel on a random website and then hunt down its translation from whatever other website might have it. Maybe I'm wrong: it could be done by gradually building up an author/title/language database, if they can auto-identify the author, but I don't think there's much point. There's a lot of other material on the web to mine. Why would they bother with novels specifically? They are not exactly ideal source material: the translations might be from different editions, they might be abridged, the texts might be very archaic... Trust me, it's a mess compared to things like legislation. I would know; I made the largest collection of freely available aligned novels on the web.
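For what it's worth, the author/title/language database idea could be sketched like this (entirely hypothetical; the normalisation is deliberately crude and would fail for translated titles, which is part of why novels are such a mess to mine):

```python
# Hypothetical sketch: bucket editions of the same work by a
# normalised (author, title) key so that different-language
# editions end up together. All data below is made up.
from collections import defaultdict

def work_key(author: str, title: str) -> tuple:
    # Crude normalisation; a real system would also transliterate
    # scripts and resolve translated titles via a bibliographic DB.
    return (author.strip().lower(), title.strip().lower())

def bucket_editions(editions):
    """editions: iterable of (author, title, lang, url) tuples."""
    works = defaultdict(dict)
    for author, title, lang, url in editions:
        works[work_key(author, title)][lang] = url
    # only works found in two or more languages are alignment candidates
    return {k: v for k, v in works.items() if len(v) >= 2}

editions = [
    ("Natsume Soseki", "Kokoro", "ja", "kokoro_ja.html"),
    ("natsume soseki", " Kokoro", "en", "kokoro_en.html"),
    ("Leo Tolstoy", "War and Peace", "en", "wp_en.html"),
]
works = bucket_editions(editions)
```

Even this toy version shows the catch: the key only matches when the metadata happens to agree across editions, which for novels it often doesn't.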

[Edited at 2016-12-14 22:28 GMT]

