ParaCrawl corpus released (as .tmx) on OPUS corpora website! (Dutch>English .TMX = 2.5 million TUs!)
Thread poster: Michael Beijer
Michael Beijer
United Kingdom
Member (2009)
Dutch to English
Apr 12, 2018

@ http://opus.nlpl.eu/ParaCrawl.php


E.g., the Dutch-English TMX contains 2,560,421 translation units (TUs)!
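For anyone who wants to check a TU count like this on their own download, here is a minimal sketch using only the Python standard library. It assumes the file is plain TMX with un-namespaced `<tu>` elements; the function name `count_tus` and the tiny in-memory sample are made up for illustration and are not part of the OPUS release.

```python
import io
import xml.etree.ElementTree as ET

def count_tus(source):
    """Count <tu> elements in a TMX file, reading it as a stream."""
    count = 0
    # iterparse yields each element once its closing tag has been read,
    # so a multi-gigabyte TMX never has to be parsed in one go.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "tu":
            count += 1
            elem.clear()  # drop the finished unit's children and text
    return count

# Hypothetical two-unit sample standing in for a real download.
SAMPLE_TMX = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header srclang="nl" datatype="plaintext" segtype="sentence"/>
  <body>
    <tu>
      <tuv xml:lang="nl"><seg>Goedemorgen.</seg></tuv>
      <tuv xml:lang="en"><seg>Good morning.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="nl"><seg>Dank u wel.</seg></tuv>
      <tuv xml:lang="en"><seg>Thank you.</seg></tuv>
    </tu>
  </body>
</tmx>
"""

print(count_tus(io.BytesIO(SAMPLE_TMX)))  # prints 2
```

`count_tus` also accepts a file path, so it can be pointed at a TMX file on disk instead of the in-memory sample.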


Am going to add it to my huge TMX collection in Memsource!

Michael


 
Hans Lenting
Netherlands
Member (2006)
German to Dutch
Large TMs in times of NMT Apr 13, 2018

Michael Beijer wrote:

Am going to add it to my huge TMX collection in Memsource!


What would be the benefit of that compared to running 4 or more NMT resources?


 
Michael Beijer
United Kingdom
Member (2009)
Dutch to English
TOPIC STARTER
Hmm Apr 13, 2018

Hans Lenting wrote:

Michael Beijer wrote:

Am going to add it to my huge TMX collection in Memsource!


What would be the benefit of that compared to running 4 or more NMT resources?


Well, I really like being able to run a quick concordance on such massive databases from inside my CAT tool. Of course, it's also possible to use TMLookup, but this way I also get any potential matches (fuzzy or otherwise). CafeTran of course has Total Recall (for huge databases), but I find the search UI a bit clunky, and it takes forever to import them all.


 
Anton Konashenok
Czech Republic
French to English
A different opinion Apr 13, 2018

Wow. A TM containing several million entries of unknown origin, unknown quality and not even guaranteed to be translations, merely parallel texts, automatically crawled and unlikely to have been reviewed by a human. How quaint.

Michael Beijer wrote:

Am going to add it to my huge TMX collection in Memsource!

If the word "collection" meant a cabinet of curiosities, it would certainly make an interesting exhibit. But a useful working tool? Don't think so. In fact, publicly admitting to using it for translation work is a reputational risk – potential clients may consider it unscrupulous.


 
Michael Beijer
United Kingdom
Member (2009)
Dutch to English
TOPIC STARTER
Ho ho ho. Apr 13, 2018

Anton Konashenok wrote:

Wow. A TM containing several million entries of unknown origin, unknown quality and not even guaranteed to be translations, merely parallel texts, automatically crawled and unlikely to have been reviewed by a human. How quaint.

Michael Beijer wrote:

Am going to add it to my huge TMX collection in Memsource!

If the word "collection" meant a cabinet of curiosities, it would certainly make an interesting exhibit. But a useful working tool? Don't think so. In fact, publicly admitting to using it for translation work is a reputational risk – potential clients may consider it unscrupulous.


I have no problem admitting I use massive TMs found online. I also use machine translation. And a mouse. What else is new?

Of course there may be lots of noise in these huge databases, but also lots of little treasures. Have you ever run a concordance search? It provides you with a nice list of possible translations. The art is of course to be able to separate the wheat from the chaff, which I am actually quite good at.

I obviously don't just accept high fuzzy matches from these massive TMs willy-nilly and race on to the next segment. In fact, high fuzzy matches hardly ever occur between texts in everyday work. What I mainly use these databases for is finding difficult terms, or at least pointers in the right direction. Because of their size (altogether, my TM collection currently contains around 50,000,000 TUs), you are quite likely to run into the term you're having trouble with. You can then use this to do further research, although what you find in your first search will often already be correct.

Anyway, as with any tool, you need to know how to use it.
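For readers who want to try this kind of term lookup outside a CAT tool, a rough sketch of a concordance search over a TMX file might look like the following. It is only an illustration, not how Memsource, TMLookup or CafeTran work internally: the function name `concordance`, the sample data, and the language codes `nl` and `en` are assumptions, and real files may use different codes (for example `nl-NL`).

```python
import io
import xml.etree.ElementTree as ET

# ElementTree expands the xml:lang attribute to this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def concordance(source, term, src_lang="nl", tgt_lang="en"):
    """Return (source, target) pairs whose source segment contains term."""
    term = term.lower()
    hits = []
    # Stream the file so that very large TMX files stay manageable.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segs = {}
        for tuv in elem.iter("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also picks up text nested inside inline tags.
                segs[lang] = "".join(seg.itertext()).strip()
        src = segs.get(src_lang, "")
        if term in src.lower():  # plain case-insensitive substring match
            hits.append((src, segs.get(tgt_lang, "")))
        elem.clear()  # drop the finished unit's children and text
    return hits

# Hypothetical two-unit sample standing in for a real corpus.
SAMPLE_TMX = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header srclang="nl" datatype="plaintext" segtype="sentence"/>
  <body>
    <tu>
      <tuv xml:lang="nl"><seg>De overeenkomst is ondertekend.</seg></tuv>
      <tuv xml:lang="en"><seg>The agreement has been signed.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="nl"><seg>Het weer is mooi.</seg></tuv>
      <tuv xml:lang="en"><seg>The weather is nice.</seg></tuv>
    </tu>
  </body>
</tmx>
"""

for src, tgt in concordance(io.BytesIO(SAMPLE_TMX), "overeenkomst"):
    print(src, "=>", tgt)
```

This only does exact substring matching. It returns candidate translations to judge by eye, with no fuzzy matching and no quality filtering, so separating the wheat from the chaff is still up to the translator.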


 
Noe Tessmann
English to German
Thanks for sharing Apr 13, 2018

Thanks Michael for finding all these treasures and letting us know. I'll also add it somewhere.

KR

Noe


 
Michael Beijer
United Kingdom
Member (2009)
Dutch to English
TOPIC STARTER
you're welcome! Apr 14, 2018

Noe Tessmann wrote:

Thanks Michael for finding all these treasures and letting us know. I'll also add it somewhere.

KR

Noe


Hi Noe,

I thought you'd appreciate this one!

Michael


 

