ParaCrawl corpus released (as .tmx) on OPUS corpora website! (Dutch>English .TMX = 2.5 million TUs!)
Thread poster: Michael Beijer

Michael Beijer  Identity Verified
United Kingdom
Local time: 12:57
Member (2009)
Dutch to English
+ ...
Apr 12

@ http://opus.nlpl.eu/ParaCrawl.php

E.g., the Dutch-English TMX contains 2,560,421 translation units (TUs)!
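
For anyone who wants to check a TU count like this on a downloaded file, here is a minimal Python sketch. It assumes a plain TMX 1.4 layout in which every translation unit is a `<tu>` element without an XML namespace. The function name `count_tus` and the tiny inline sample are made up for illustration; they are not part of OPUS or of any CAT tool.

```python
import io
import xml.etree.ElementTree as ET

# Tiny made-up sample standing in for a real TMX download.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="nl" datatype="plaintext" segtype="sentence"
          adminlang="en" o-tmf="none" creationtool="example"
          creationtoolversion="0"/>
  <body>
    <tu>
      <tuv xml:lang="nl"><seg>Goedemorgen.</seg></tuv>
      <tuv xml:lang="en"><seg>Good morning.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="nl"><seg>Tot ziens.</seg></tuv>
      <tuv xml:lang="en"><seg>Goodbye.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def count_tus(source):
    """Count <tu> elements in a TMX file or stream, parsing incrementally."""
    count = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "tu":
            count += 1
            # Drop the finished unit's children to limit memory use.
            elem.clear()
    return count


print(count_tus(io.BytesIO(SAMPLE_TMX.encode("utf-8"))))  # prints 2
```

`ET.iterparse` also accepts a file name, so for a real corpus you could pass the path of the downloaded .tmx file instead of the in-memory sample.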

Am going to add it to my huge TMX collection in Memsource!

Michael



Hans Lenting  Identity Verified
Netherlands
Member (2006)
German to Dutch
+ ...
Large TMs in times of NMT Apr 13

Michael Beijer wrote:

Am going to add it to my huge TMX collection in Memsource!


What would be the benefit of that compared to running 4 or more NMT resources?



Michael Beijer  Identity Verified
United Kingdom
Local time: 12:57
Member (2009)
Dutch to English
+ ...
TOPIC STARTER
Hmm Apr 13

Hans Lenting wrote:

Michael Beijer wrote:

Am going to add it to my huge TMX collection in Memsource!


What would be the benefit of that compared to running 4 or more NMT resources?


Well, I really like being able to run a quick concordance on such massive databases from inside my CAT tool. Of course, it's also possible to use TMLookup, but this way I also get any potential matches (fuzzy or otherwise). CafeTran of course has Total Recall (for huge databases), but I find the search UI a bit clunky, and it takes forever to import them all.
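
To make the idea of a concordance search concrete, here is a rough Python sketch of a substring lookup over a TMX file, outside any CAT tool. It is only an illustration under stated assumptions: the names `iter_pairs` and `concordance` are hypothetical, the sample data is invented, language codes are matched on their primary subtag only (so "nl-NL" counts as "nl"), and there is no fuzzy matching of the kind Memsource, TMLookup or CafeTran provide.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this expanded name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented two-unit sample standing in for a large TMX.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="nl" datatype="plaintext" segtype="sentence"
          adminlang="en" o-tmf="none" creationtool="example"
          creationtoolversion="0"/>
  <body>
    <tu>
      <tuv xml:lang="nl"><seg>De overeenkomst wordt ontbonden.</seg></tuv>
      <tuv xml:lang="en"><seg>The agreement is terminated.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="nl"><seg>Het weer is mooi.</seg></tuv>
      <tuv xml:lang="en"><seg>The weather is nice.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def iter_pairs(source, src_lang="nl", tgt_lang="en"):
    """Yield (source_text, target_text) for each TU that has both languages."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segs = {}
        for tuv in elem.findall("tuv"):
            # Keep only the primary language subtag, lower-cased.
            lang = (tuv.get(XML_LANG) or "").lower().split("-")[0]
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline tags.
                segs[lang] = "".join(seg.itertext()).strip()
        if src_lang in segs and tgt_lang in segs:
            yield segs[src_lang], segs[tgt_lang]
        elem.clear()  # drop the processed unit's children


def concordance(source, term, src_lang="nl", tgt_lang="en"):
    """Return all pairs whose source segment contains term, ignoring case."""
    needle = term.casefold()
    return [
        (src, tgt)
        for src, tgt in iter_pairs(source, src_lang, tgt_lang)
        if needle in src.casefold()
    ]


hits = concordance(io.BytesIO(SAMPLE_TMX.encode("utf-8")), "overeenkomst")
for src, tgt in hits:
    print(src, "=>", tgt)
```

A linear scan like this is fine for a sample, but on a TMX with millions of TUs every query rereads the whole file, which is why dedicated tools build an index first.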



Anton Konashenok  Identity Verified
Czech Republic
Local time: 13:57
English to Russian
+ ...
A different opinion Apr 13

Wow. A TM containing several million entries of unknown origin, unknown quality and not even guaranteed to be translations, merely parallel texts, automatically crawled and unlikely to have been reviewed by a human. How quaint.

Michael Beijer wrote:

Am going to add it to my huge TMX collection in Memsource!

If the word "collection" meant a cabinet of curiosities, it would certainly make an interesting exhibit. But a useful working tool? Don't think so. In fact, publicly admitting to using it for translation work is a reputational risk – potential clients may consider it unscrupulous.



Michael Beijer  Identity Verified
United Kingdom
Local time: 12:57
Member (2009)
Dutch to English
+ ...
TOPIC STARTER
Ho ho ho. Apr 13

Anton Konashenok wrote:

Wow. A TM containing several million entries of unknown origin, unknown quality and not even guaranteed to be translations, merely parallel texts, automatically crawled and unlikely to have been reviewed by a human. How quaint.

Michael Beijer wrote:

Am going to add it to my huge TMX collection in Memsource!

If the word "collection" meant a cabinet of curiosities, it would certainly make an interesting exhibit. But a useful working tool? Don't think so. In fact, publicly admitting to using it for translation work is a reputational risk – potential clients may consider it unscrupulous.


I have no problem admitting I use massive TMs found online. I also use machine translation. And a mouse. What else is new?

Of course there may be lots of noise in these huge databases, but also lots of little treasures. Have you ever run a concordance search? It provides you with a nice list of possible translations. The art is of course to be able to separate the wheat from the chaff, which I am actually quite good at.

I obviously don't just accept high fuzzy matches from these massive TMs willy-nilly and race on to the next segment. In fact, high fuzzy matches hardly ever occur between texts in everyday work. What I mainly use these databases for is to find difficult terms, or at least pointers in the right direction. Because of their size (my TM collection currently contains around 50,000,000 TUs altogether), you are quite likely to run into the term you're having trouble with. You can then use this to do further research, although what you find in your first search will often already be correct.

Anyway, as with any tool, you need to know how to use it.



Noe Tessmann  Identity Verified
Local time: 13:57
English to German
+ ...
Thanks for sharing Apr 13

Thanks Michael for finding all these treasures and letting us know. I'll also add it somewhere.

KR

Noe



Michael Beijer  Identity Verified
United Kingdom
Local time: 12:57
Member (2009)
Dutch to English
+ ...
TOPIC STARTER
you're welcome! Apr 14

Noe Tessmann wrote:

Thanks Michael for finding all these treasures and letting us know. I'll also add it somewhere.

KR

Noe


Hi Noe,

I thought you'd appreciate this one!

Michael



