ParaCrawl corpus released (as .tmx) on OPUS corpora website! (Dutch>English .TMX = 2.5 million TUs!)
Thread poster: Michael Beijer

Michael Beijer  Identity Verified
United Kingdom
Local time: 13:37
Member (2009)
Dutch to English
+ ...
Apr 12

@ http://opus.nlpl.eu/ParaCrawl.php


E.g. the Dutch-English TMX contains 2,560,421 translation units (TUs)!
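
For anyone who wants to check a TU count like this on a downloaded file, here is a minimal Python sketch. It assumes the file is ordinary TMX with un-namespaced `<tu>` elements; the function name `count_tus` and the tiny `SAMPLE_TMX` document are made up for illustration and are not part of ParaCrawl or OPUS.

```python
import io
import xml.etree.ElementTree as ET

# Made-up three-unit sample standing in for a real (much larger) TMX file.
SAMPLE_TMX = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header srclang="nl"/>
  <body>
    <tu><tuv xml:lang="nl"><seg>Goedemorgen</seg></tuv>
        <tuv xml:lang="en"><seg>Good morning</seg></tuv></tu>
    <tu><tuv xml:lang="nl"><seg>Dank u</seg></tuv>
        <tuv xml:lang="en"><seg>Thank you</seg></tuv></tu>
    <tu><tuv xml:lang="nl"><seg>Tot ziens</seg></tuv>
        <tuv xml:lang="en"><seg>Goodbye</seg></tuv></tu>
  </body>
</tmx>"""


def count_tus(source):
    """Count <tu> elements by streaming the file instead of loading it whole.

    `source` may be a file path or a binary file object.
    """
    count = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "tu":
            count += 1
            elem.clear()  # drop the unit's text and children once counted
    return count


print(count_tus(io.BytesIO(SAMPLE_TMX)))
```

On a real download you would pass the path to the .tmx file instead of the in-memory sample. Streaming matters here because a file with millions of TUs may not fit comfortably in memory if parsed in one go.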


Am going to add it to my huge TMX collection in Memsource!

Michael


 

Hans Lenting  Identity Verified
Netherlands
Member (2006)
German to Dutch
Large TMs in times of NMT Apr 13

Michael Beijer wrote:

Am going to add it to my huge TMX collection in Memsource!


What would be the benefit of that compared to running 4 or more NMT resources?


 

Michael Beijer  Identity Verified
United Kingdom
Local time: 13:37
Member (2009)
Dutch to English
+ ...
TOPIC STARTER
Hmm Apr 13

Hans Lenting wrote:

Michael Beijer wrote:

Am going to add it to my huge TMX collection in Memsource!


What would be the benefit of that compared to running 4 or more NMT resources?


Well, I really like being able to run a quick concordance on such massive databases from inside my CAT tool. Of course, it's also possible to use TMLookup, but this way I also get any potential matches (fuzzy or otherwise). CafeTran of course has Total Recall (for huge databases), but I find the search UI a bit clunky, and it takes forever to import them all.


 

Anton Konashenok  Identity Verified
Czech Republic
Local time: 14:37
English to Russian
+ ...
A different opinion Apr 13

Wow. A TM containing several million entries of unknown origin, unknown quality and not even guaranteed to be translations, merely parallel texts, automatically crawled and unlikely to have been reviewed by a human. How quaint.

Michael Beijer wrote:

Am going to add it to my huge TMX collection in Memsource!

If the word "collection" meant a cabinet of curiosities, it would certainly make an interesting exhibit. But a useful working tool? Don't think so. In fact, publicly admitting to using it for translation work is a reputational risk – potential clients may consider it unscrupulous.


 

Michael Beijer  Identity Verified
United Kingdom
Local time: 13:37
Member (2009)
Dutch to English
+ ...
TOPIC STARTER
Ho ho ho. Apr 13

Anton Konashenok wrote:

Wow. A TM containing several million entries of unknown origin, unknown quality and not even guaranteed to be translations, merely parallel texts, automatically crawled and unlikely to have been reviewed by a human. How quaint.

Michael Beijer wrote:

Am going to add it to my huge TMX collection in Memsource!

If the word "collection" meant a cabinet of curiosities, it would certainly make an interesting exhibit. But a useful working tool? Don't think so. In fact, publicly admitting to using it for translation work is a reputational risk – potential clients may consider it unscrupulous.


I have no problem admitting I use massive TMs found online. I also use machine translation. And a mouse. What else is new?

Of course there may be lots of noise in these huge databases, but also lots of little treasures. Have you ever run a concordance search? It provides you with a nice list of possible translations. The art is of course to be able to separate the wheat from the chaff, which I am actually quite good at.

I obviously don't just accept high fuzzy matches from these massive TMs, willy-nilly, and race on to the next segment. In fact, high fuzzy matches hardly ever occur between texts in everyday work. What I mainly use these databases for is to find difficult terms, or at least pointers in the right direction. Because of their size (altogether my TM collection currently contains around 50,000,000 TUs), you are quite likely to run into the term you're having trouble with. You can then use this to do further research, although what you find in your first search will often already be correct.

Anyway, as with any tool, you need to know how to use it.
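
To make the concordance idea concrete outside a CAT tool, here is a rough Python sketch of a term lookup over a TMX file. It is not how Memsource, TMLookup or CafeTran work internally; it is just a plain case-insensitive substring search. The function name `concordance`, its parameters and the sample data are all made up, and it assumes the language codes in the file match `nl` and `en` exactly (a file using `nl-NL` would need that code passed in instead).

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this expanded name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Made-up sample standing in for a real TMX file.
SAMPLE_TMX = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header srclang="nl"/>
  <body>
    <tu><tuv xml:lang="nl"><seg>De overeenkomst is ondertekend.</seg></tuv>
        <tuv xml:lang="en"><seg>The agreement has been signed.</seg></tuv></tu>
    <tu><tuv xml:lang="nl"><seg>Goedemorgen</seg></tuv>
        <tuv xml:lang="en"><seg>Good morning</seg></tuv></tu>
    <tu><tuv xml:lang="nl"><seg>Deze Overeenkomst treedt in werking.</seg></tuv>
        <tuv xml:lang="en"><seg>This Agreement enters into force.</seg></tuv></tu>
  </body>
</tmx>"""


def concordance(source, term, src_lang="nl", tgt_lang="en", limit=20):
    """Return up to `limit` (source, target) pairs whose source contains `term`."""
    needle = term.casefold()
    hits = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segs = {}
        for tuv in elem.iter("tuv"):
            # TMX 1.4 uses xml:lang; fall back to a plain lang attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline tags.
                segs[lang] = "".join(seg.itertext()).strip()
        src, tgt = segs.get(src_lang), segs.get(tgt_lang)
        if src and tgt and needle in src.casefold():
            hits.append((src, tgt))
            if len(hits) >= limit:
                break
        elem.clear()  # drop the unit's contents before moving on
    return hits


for src, tgt in concordance(io.BytesIO(SAMPLE_TMX), "overeenkomst"):
    print(src, "=>", tgt)
```

The output is only a list of candidates. Deciding which pairs are wheat and which are chaff remains the translator's job.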


 

Noe Tessmann  Identity Verified
Local time: 14:37
English to German
+ ...
Thanks for sharing Apr 13

Thanks Michael for finding all these treasures and letting us know. I'll also add it somewhere.

KR

Noe


 

Michael Beijer  Identity Verified
United Kingdom
Local time: 13:37
Member (2009)
Dutch to English
+ ...
TOPIC STARTER
you're welcome! Apr 14

Noe Tessmann wrote:

Thanks Michael for finding all these treasures and letting us know. I'll also add it somewhere.

KR

Noe


Hi Noe,

I thought you'd appreciate this one!

Michael


 

