Looking for properly aligned TMXs of the European Medicines Agency (EMEA) corpus! (Dutch-English)
Thread poster: Michael Joseph Wdowiak Beijer

Michael Joseph Wdowiak Beijer  Identity Verified
United Kingdom
Local time: 05:15
Member (2009)
Dutch to English
+ ...
Sep 26

Every once in a while I look things up in my EMEA.tmx, which I downloaded from the OPUS corpus a while ago (http://opus.lingfil.uu.se/EMEA.php … now offline). Although it is full of very useful info, it also contains a vast number of misalignments, which makes it annoying to use for concordance searches. I haven't had time lately to look into whether anyone has managed to get their hands on a properly aligned version. Have you?

I am specifically interested in the Dutch-English part of the corpus.

Michael

(For future googlers) this is where I found it:

http://opus.lingfil.uu.se -> EMEA - European Medicines Agency documents (EMEA0.3.tar.gz - 5.0 GB) -> http://opus.lingfil.uu.se/EMEA.php

[Edited at 2017-09-27 10:10 GMT]


CafeTran Training
Netherlands
Local time: 06:15
DeepL? Sep 27

Michael Joseph Wdowiak Beijer wrote:

Every once in a while I will look stuff up in my EMEA.tmx (which I downloaded via the OPUS corpus a while ago)(http://opus.lingfil.uu.se/EMEA.php … now offline),
Michael


Aren't these also harvested by Linguee and offered by DeepL?



Michael Joseph Wdowiak Beijer  Identity Verified
United Kingdom
Local time: 05:15
Member (2009)
Dutch to English
+ ...
TOPIC STARTER
I would expect so, but... Sep 27

CafeTran Training wrote:

Michael Joseph Wdowiak Beijer wrote:

Every once in a while I will look stuff up in my EMEA.tmx (which I downloaded via the OPUS corpus a while ago)(http://opus.lingfil.uu.se/EMEA.php … now offline),
Michael


Aren't these also harvested by Linguee and offered by DeepL?


I just did a quick check, and it seems that they aren't. I mean, I'm sure that they have found them and that they are part of their system, somehow, but when you take a specific source segment from one of the EMEA TMXs, both Linguee and DeepL produce different results. This usually wouldn't be such a problem, but when I am translating stuff like medicine inserts/leaflets, I need to stick to extremely specific terminology, some of which is present in the specialized medical TMXs.

By the way, have you tried using CafeTran yet as a TMX cleaner, for stuff like: 1. Remove any TU with either side = longer than the other, 2. Remove any TU that is etc.?


FarkasAndras
Local time: 06:15
English to Hungarian
+ ...
Maybe Sep 27

I could look into cleaning it up or repeating the alignment from scratch.
What problems are you seeing and how bad is it? I don't think I even downloaded the EMEA corpus so all I know about it is that it is available.

I have a tool that removes TUs where one side is X% longer/shorter than the other. That can help remove most of the misaligned stuff. Just "longer" is not good as a criterion as one side is always bound to be a bit longer than the other. Removing very long (and very short) segments can also be useful.

[Edited at 2017-09-27 10:10 GMT]


CafeTran Training
Netherlands
Local time: 06:15
It's a feature that I miss, however .... Sep 27

Michael Joseph Wdowiak Beijer wrote:

By the way, have you tried using CafeTran yet as a TMX cleaner, for stuff like: 1. Remove any TU with either side = longer than the other, 2. Remove any TU that is etc.?


It's a feature that I miss, but it's very unlikely that it'll ever be implemented. The developer announced today that he's thinking about 'refactoring' CafeTran (just like Kilgray did some time ago) in order to reduce its complexity. This may have consequences for CafeTran's feature set.

New features are only likely to be added in future if they reduce CafeTran's complexity/simplify its operation/GUI and make it more intuitive/easy to understand.



Michael Joseph Wdowiak Beijer  Identity Verified
United Kingdom
Local time: 05:15
Member (2009)
Dutch to English
+ ...
TOPIC STARTER
the heat is on! Sep 27

CafeTran Training wrote:

Michael Joseph Wdowiak Beijer wrote:

By the way, have you tried using CafeTran yet as a TMX cleaner, for stuff like: 1. Remove any TU with either side = longer than the other, 2. Remove any TU that is etc.?


It's a feature that I miss, but it's very unlikely that it'll ever be implemented. The developer announced today that he's thinking about 'refactoring' CafeTran (just like Kilgray did some time ago) in order to reduce its complexity. This may have consequences for CafeTran's feature set.

New features are only likely to be added in future if they reduce CafeTran's complexity/simplify its operation/GUI and make it more intuitive/easy to understand.


Have you seen this? https://www.proz.com/forum/déj_vu_support/319028-come_check_out_our_new_website.html (forum software broke the link, but you get the picture)

After months of hard work, we are proud to announce the launch of our new website!

The design has been improved to achieve a more modern and fresh look, and to enhance the user experience.
Check it out, it's here: https://atril.com/

Expect intuitive navigation, easy access to our products and a wealth of valuable tips and resources.
We have also optimized it for smartphones and tablets to make it easier for you to access it wherever you are.

And there’s more to come!
The new DVX4, packed up with exciting new features, is just around the corner!
Go to the website and click the red X4 button to stay in the know.

Tell us what you think!

kind regards,
Matylda


Has the sleeping giant finally woken?



Michael Joseph Wdowiak Beijer  Identity Verified
United Kingdom
Local time: 05:15
Member (2009)
Dutch to English
+ ...
TOPIC STARTER
it's this one! Sep 27

FarkasAndras wrote:

I could look into cleaning it up or repeating the alignment from scratch.
What problems are you seeing and how bad is it? I don't think I even downloaded the EMEA corpus so all I know about it is that it is available.

I have a tool that removes TUs where one side is X% longer/shorter than the other. That can help remove most of the misaligned stuff. Just "longer" is not good as a criterion as one side is always bound to be a bit longer than the other. Removing very long (and very short) segments can also be useful.

[Edited at 2017-09-27 10:10 GMT]


Thanks!

src/trgt files separate: http://opus.lingfil.uu.se/download.php?f=EMEA/en-nl.txt.zip
tmx: http://opus.lingfil.uu.se/download.php?f=EMEA/en-nl.tmx.gz

Michael
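
(For future googlers who want to inspect the TMX outside a CAT tool: below is a minimal Python sketch for reading segment pairs from a gzipped TMX. It assumes a locally saved copy of the file, a standard TMX layout (tu/tuv/seg with an xml:lang or lang attribute), and the language codes "en" and "nl". The function name read_tmx_pairs is made up for illustration, and I have not checked it against the actual OPUS file.)

```python
import gzip
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this namespaced name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(path, src_lang="en", tgt_lang="nl"):
    """Yield (source, target) text pairs from a gzipped TMX file.

    Hypothetical helper: assumes each <tu> holds one <tuv> per language.
    """
    with gzip.open(path, "rb") as handle:
        # Stream the file instead of loading it whole; the corpus is large.
        for _event, elem in ET.iterparse(handle, events=("end",)):
            if elem.tag != "tu":
                continue
            segs = {}
            for tuv in elem.iter("tuv"):
                lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
                seg = tuv.find("seg")
                if seg is not None:
                    # Keep only the two-letter language code, e.g. "en-GB" -> "en".
                    segs[lang[:2]] = "".join(seg.itertext()).strip()
            if src_lang in segs and tgt_lang in segs:
                yield segs[src_lang], segs[tgt_lang]
            elem.clear()  # release the processed unit's children
```

With a local copy saved as en-nl.tmx.gz (an assumed file name), something like `for src, tgt in read_tmx_pairs("en-nl.tmx.gz"): ...` would let you eyeball the pairs or feed them into a filter.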


CafeTran Training
Netherlands
Local time: 06:15
Public use? Sep 27

FarkasAndras wrote:

I have a tool that removes TUs where one side is X% longer/shorter than the other. That can help remove most of the misaligned stuff. Just "longer" is not good as a criterion as one side is always bound to be a bit longer than the other. Removing very long (and very short) segments can also be useful.

[Edited at 2017-09-27 10:10 GMT]


Would you be willing to make this useful tool publicly available?


FarkasAndras
Local time: 06:15
English to Hungarian
+ ...
Not really Sep 27

CafeTran Training wrote:

FarkasAndras wrote:

I have a tool that removes TUs where one side is X% longer/shorter than the other. That can help remove most of the misaligned stuff. Just "longer" is not good as a criterion as one side is always bound to be a bit longer than the other. Removing very long (and very short) segments can also be useful.

[Edited at 2017-09-27 10:10 GMT]


Would you be willing to make this useful tool publicly available?


It's part of the secret sauce I'm offering for people who want to buy TMs from me. As implemented, it's not very user friendly anyway.
BTW programming-wise, it's a trivial problem to solve:
if (length a) > (length b) * X then remove unit
if (length b) > (length a) * X then remove unit

I ended up using the formula if (length a) > ((length b) + 5) * 2 but you can always play with that.
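
For anyone who wants to try this, the check above could be sketched in Python roughly as follows. This is only an illustration of the quoted formula, not the actual tool: the function names, the default values (factor 2, slack 5), the use of character counts, and applying the formula in both directions are all assumptions.

```python
def is_suspicious(a: str, b: str, factor: float = 2.0, slack: int = 5) -> bool:
    """Return True if one side is much longer than the other.

    Follows the quoted formula, length(a) > (length(b) + slack) * factor,
    applied in both directions. Lengths are character counts (an assumption).
    """
    len_a, len_b = len(a.strip()), len(b.strip())
    return len_a > (len_b + slack) * factor or len_b > (len_a + slack) * factor


def filter_pairs(pairs, factor: float = 2.0, slack: int = 5):
    """Keep only the (source, target) pairs that pass the length check."""
    return [(a, b) for a, b in pairs if not is_suspicious(a, b, factor, slack)]


if __name__ == "__main__":
    # Made-up sample pairs, purely for illustration.
    sample = [
        ("Take one tablet daily.", "Neem dagelijks één tablet in."),
        ("Keep out of the reach and sight of children.", "6."),
    ]
    for src, tgt in filter_pairs(sample):
        print(src, "|", tgt)
```

The slack term keeps very short segments (a number, a heading) from being thrown out just because a few extra characters double their length; both values are there to be played with.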


FarkasAndras
Local time: 06:15
English to Hungarian
+ ...
? Sep 27

Michael Joseph Wdowiak Beijer wrote:

Thanks!

src/trgt files separate: http://opus.lingfil.uu.se/download.php?f=EMEA/en-nl.txt.zip
tmx: http://opus.lingfil.uu.se/download.php?f=EMEA/en-nl.tmx.gz

Michael



FarkasAndras wrote:


What problems are you seeing and how bad is it?


Also, are you willing to pay for getting it fixed*? If it's trivial I'll do it for free but not if it takes hours or days.

*Obviously it will never be perfect but perhaps it's possible to make it much better than it is now.



Michael Joseph Wdowiak Beijer  Identity Verified
United Kingdom
Local time: 05:15
Member (2009)
Dutch to English
+ ...
TOPIC STARTER
sorry, wasn't clear Sep 27

FarkasAndras wrote:

Michael Joseph Wdowiak Beijer wrote:

Thanks!

src/trgt files separate: http://opus.lingfil.uu.se/download.php?f=EMEA/en-nl.txt.zip
tmx: http://opus.lingfil.uu.se/download.php?f=EMEA/en-nl.tmx.gz

Michael



FarkasAndras wrote:


What problems are you seeing and how bad is it?


Also, are you willing to pay for getting it fixed*? If it's trivial I'll do it for free but not if it takes hours or days.

*Obviously it will never be perfect but perhaps it's possible to make it much better than it is now.


Yes, I'd be willing to pay something for it. Not too much though (if it's too much work just drop it), as I only use it occasionally.

Michael



Emma Goldsmith  Identity Verified
Spain
Local time: 06:15
Member (2010)
Spanish to English
I'm interested... Sep 27

... to hear the outcome of Andras' alignment of the EN-NL EMEA corpus.
The automatic alignment is very poor; I just use the tmx for concordance.

The other problem with the EMEA corpus is that it hasn't been updated since 2009 (as its name suggests).


FarkasAndras
Local time: 06:15
English to Hungarian
+ ...
So? Sep 27

So, what are the problems? With a few typical examples if possible. Just misalignments?

Also, what documents are included in the TM? If there's interest I could look into collecting the newer documents.



Michael Joseph Wdowiak Beijer  Identity Verified
United Kingdom
Local time: 05:15
Member (2009)
Dutch to English
+ ...
TOPIC STARTER
a few examples of misalignments Sep 28

Here are a few examples of the misalignments I am talking about:

[Three screenshots of misaligned segments: Capture2, Capture, Capture3]

I'm not sure which documents are included in the TM, as I'm not much of a medical specialist. Maybe someone else in this thread knows? Emma? I know you specialise in medical stuff.

