Pages in topic:   [1 2] >
Downloading the Acquis Communautaire
Thread poster: Parrot

Parrot  Identity Verified
Spain
Local time: 09:41
Member
Spanish to English
+ ...
Oct 18, 2011

Colleagues who often come across texts of EU Directives and jurisprudence may be interested in downloading the corpora of the Acquis Communautaire as *.tmx files. They come as zipped volumes that can be extracted into the language pairs of interest using the tool made available on the same page: http://langtech.jrc.it/DGT-TM.html

Once you have obtained the *.tmx files of interest, they may be converted for use with most CAT tools.
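
For anyone who wants to peek inside the extracted files before converting them, here is a minimal Python sketch that reads a TMX file and pulls out the segment pairs for two language codes. It assumes standard TMX structure (`tu`, `tuv` and `seg` elements, with the language in an `xml:lang` attribute, or `lang` in older files). The function name `read_tmx_pairs` and the language codes in the usage note are illustrative assumptions, not something taken from the DGT-TM documentation.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(path, src_lang, tgt_lang):
    """Yield (source, target) segment pairs for one language pair from a TMX file."""
    src_lang, tgt_lang = src_lang.lower(), tgt_lang.lower()
    # iterparse streams the file, so a large TMX is never fully held in memory.
    for _event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag != "tu":
            continue
        segs = {}
        for tuv in elem.iter("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text that sits inside inline tags.
                segs[lang] = "".join(seg.itertext()).strip()
        if src_lang in segs and tgt_lang in segs:
            yield segs[src_lang], segs[tgt_lang]
        elem.clear()  # drop the processed unit's children to free memory
```

Something like `for en, es in read_tmx_pairs("my_file.tmx", "EN-GB", "ES-ES"): print(en, "|", es)` would then list the pairs; check the header of your own file for the language codes it actually uses.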

Please note the Conditions for Use.



Daniela Zambrini  Identity Verified
Italy
Local time: 09:41
Member (2005)
English to Italian
+ ...
Thanks! Oct 18, 2011

Very useful :)


Maria Amorim  Identity Verified
Sweden
Local time: 09:41
Swedish to Portuguese
+ ...
Thanks Oct 19, 2011

for sharing!


FarkasAndras
Local time: 09:41
English to Hungarian
+ ...
Europarl Oct 19, 2011

If you're posting about the DGT-TM, we might as well mention the Europarl corpus too.
http://www.statmt.org/europarl/
These are autoaligned corpora of the transcripts of EP plenaries.
The easiest way to get a TMX out of them is probably the following:

- Download and extract a corpus file pair from the statmt site
- Download the grab bag (1.5) and the aligner package from sourceforge.net/projects/aligner/files
- Use generate_tabbed.exe from the grab bag to make a tabbed txt out of the two files
- Use the tmx maker from the aligner package to generate a tmx

You can also try and shove the files into your preferred aligner instead of using these command-line tools, but that could go badly wrong in a number of ways.
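
For readers who would rather see what the two conversion steps above amount to, here is a rough Python sketch. It is not the code of generate_tabbed.exe or of the TMX maker. It only assumes that the two corpus files are plain UTF-8 text with one sentence per line and matching line numbers, which you should verify for the release you download. The function names and language codes are made up for illustration.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def make_tabbed(src_path, tgt_path, out_path):
    """Pair line N of one file with line N of the other, tab-separated."""
    written = 0
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt, \
         open(out_path, "w", encoding="utf-8") as out:
        for s, t in zip(src, tgt):
            # Tabs inside a sentence would break the two-column layout.
            s = s.strip().replace("\t", " ")
            t = t.strip().replace("\t", " ")
            if s and t:  # skip pairs where either side is empty
                out.write(f"{s}\t{t}\n")
                written += 1
    return written


def tabbed_to_tmx(tabbed_path, tmx_path, src_lang, tgt_lang):
    """Write a minimal TMX 1.4 file from a two-column tab-delimited file."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "sketch", "creationtoolversion": "0",
        "segtype": "sentence", "o-tmf": "tabbed-text",
        "adminlang": "en", "srclang": src_lang, "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    with open(tabbed_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 2:
                continue  # ignore malformed rows
            tu = ET.SubElement(body, "tu")
            for lang, text in zip((src_lang, tgt_lang), parts):
                tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
                ET.SubElement(tuv, "seg").text = text
    ET.ElementTree(tmx).write(tmx_path, encoding="utf-8", xml_declaration=True)
```

Note that this sketch builds the whole TMX in memory before writing it, so for a corpus of millions of lines you would want to split the tabbed file into chunks first.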



Parrot  Identity Verified
Spain
Local time: 09:41
Member
Spanish to English
+ ...
TOPIC STARTER
Great! Oct 19, 2011

Now this is what I call return on taxes :D


Christophe Lefrancois  Identity Verified
Local time: 09:41
French
+ ...
Another one Oct 19, 2011

Hi,

another interesting link with several TMs available in many languages.

http://www.globalization-group.com/edge/2010/05/download-translation-memory/

Government Translation Memory

European Commission (millions of TUs in 22 EU languages)
EU Constitution (thousands of TUs in 21 EU languages)
European Parliament (millions of TUs in 11 EU languages)
Stockholm Parallel Corpora (thousands of TUs in English, Greek, and Chinese)

Localization and Technical Translation Memory

OpenOffice.org (tens of thousands of TUs in German, English, Spanish, French, Japanese, and Swedish)
KDE (hundreds of thousands of TUs in 92 languages)
PHP Manuals (thousands of TUs in 22 languages)
European Medicines Agency (millions of TUs in 22 EU languages)

Media Translation Memory

OpenSubtitles.org (millions of TUs in 30 languages)
SETimes.com (millions of TUs in 9 Southeastern European languages)

Enjoy!!

Christophe



Siegfried Armbruster  Identity Verified
Germany
Local time: 09:41
Member (2004)
English to German
+ ...
"Autoaligned" is a synonym for GIGO Oct 19, 2011

I consider autoaligned corpora or TMX files a waste of our tax money. The alignment is pretty useless.


FarkasAndras
Local time: 09:41
English to Hungarian
+ ...
And you base that on... Oct 19, 2011

Siegfried Armbruster wrote:

I consider autoaligned corpora or TMX files a waste of our tax money. The alignment is pretty useless.

Might I ask what data you're basing this on?
I'd also love to hear your ideas about alternative solutions for making, say, a corpus of 1 million sentences translated into 27 languages available for the general public - and searchable - for less than it costs to run it through an autoaligner.

Autoalignments are remarkably good - but even if they were only mediocre, they are the only option we have for mining this huge dataset.
By the way, the data I'm basing the above statement on is the numerical data from various academic researchers who tested various aligners on real-world texts, comparing the results to a manually prepared perfect alignment. Good aligners consistently produce around 95% correct alignments on mixed texts, and they can easily reach 98% or more if you tell them to automatically discard dubious sentence pairs. Of course they often exceed 99% on good quality input texts even without this "filtering".

http://papers.ldc.upenn.edu/LREC2006/Champollion.pdf
http://www.lrec-conf.org/proceedings/lrec2008/pdf/126_paper.pdf
http://utkl.ff.cuni.cz/~rosen/public/slovko05.pdf
ftp://ontologia.hu/Hunglish/doc/ranlp05.pdf


My own experience backs this result: I use autoaligned texts daily and rarely come across misaligned sentences - but I would never dream of making categorical statements based on anecdotal personal experience, of course.

[Edited at 2011-10-19 17:34 GMT]



Parrot  Identity Verified
Spain
Local time: 09:41
Member
Spanish to English
+ ...
TOPIC STARTER
Easier said than done Oct 20, 2011

FarkasAndras wrote:

The easiest way to get a TMX out of them is probably the following:


I'm no good with sourceforge tools on such large, unparsed and somehow distorted files. Fortunately, Christophe's link provides us with the ready-made Europarl *.tmxs. Thanks to everyone!



FarkasAndras
Local time: 09:41
English to Hungarian
+ ...
Old release Oct 20, 2011

Parrot wrote:

FarkasAndras wrote:

The easiest way to get a TMX out of them is probably the following:


I'm no good with sourceforge tools on such large, unparsed and somehow distorted files. Fortunately, Christophe's link provides us with the ready-made Europarl *.tmxs. Thanks to everyone!


That's an older release. The ready-made TMXes are based on version 3 of the corpus, which is now already at version 6. Since version 3, many new languages and more recent texts were added, and the quality improved somewhat as well.

The command-line tools I wrote are not what I'd call user friendly, but it's all fairly straightforward once you get started. The source files are large - that's why you can't use Excel or something like that to process them. Other than that, they are nice and neat.
It would be fairly easy to write a tool that generates the TMXes completely automatically after you launch it, but it shouldn't be necessary.



Siegfried Armbruster  Identity Verified
Germany
Local time: 09:41
Member (2004)
English to German
+ ...
My opinion is based on the content of the TMX files Oct 20, 2011

The following are more or less random screenshots of parts of one of the "autoaligned" TMX files from one of the sources mentioned above. I guess everybody will agree that this alignment is crap.

[screenshots]

Of the 365,000 segments in the file, I have already deleted more than 40,000, and the file is still full of crap.

Perhaps my approach is completely wrong, and I would be really interested to hear how the "experts" use the uncleaned autoaligned TMX files and get something useful out of them.

[Edited at 2011-10-20 10:28 GMT]



xxxjacana54  Identity Verified
Uruguay
English to Spanish
+ ...
Thank you, Parrot! Oct 20, 2011

:)


FarkasAndras
Local time: 09:41
English to Hungarian
+ ...
Poor input files Oct 20, 2011

Siegfried Armbruster wrote:

The following are more or less random screenshots of parts of one of the "autoaligned" TMX files from one of the sources mentioned above. I guess everybody will agree that this alignment is crap.


Of the 365,000 segments in the file, I have already deleted more than 40,000, and the file is still full of crap.

Perhaps my approach is completely wrong, and I would be really interested to hear how the "experts" use the uncleaned autoaligned TMX files and get something useful out of them.


There are a couple of things going on there. I agree that 40,000 useless TUs out of a total of 365,000 is too many, but that's not due to the alignment or autoaligners as such. It's due to the original input files, which obviously don't have the same content in your screenshots. The GIGO principle applies, of course: if the source files are crap, there is only so much an automated system can do to fix them. The source files in the Europarl corpus and the DGT-TM corpus are very good in my experience, so I wouldn't expect much crap like this in those. Certainly not 10+ percent.
Your excerpts look like they might be from the EMEA corpus, which I'm not familiar with. It looks like they should have done a better job of cleaning the files. There are a couple of things they could have done with automated solutions, from throwing out dodgy source files to throwing out individual dodgy TUs or even cleaning the source files themselves (removing footnotes, purely numerical sections, etc.). At the very least, low-quality alignments like this should be made available as tab-delimited text as well, which is easier to evaluate, process and clean up than TMX.
Either way, one saving grace is that these won't come up as concordance hits often, as there isn't much text in there. It also wouldn't be too difficult to do some automated post-production on this material: throw out every TU where one language or the other doesn't contain at least one word (3 or more consecutive letter characters), for instance.
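
As a minimal sketch of that kind of post-production, assuming the material has already been exported as two-column tab-delimited UTF-8 text (the function names here are hypothetical), something like the following Python would drop every pair where either side lacks a run of at least three letters:

```python
import re

# A "word" here is 3 or more consecutive letter characters.
# [^\W\d_] matches any Unicode letter, so accented and non-Latin text counts too.
WORD = re.compile(r"[^\W\d_]{3,}")


def has_word(text):
    """Return True if the text contains at least one run of 3+ letters."""
    return WORD.search(text) is not None


def filter_tabbed(in_path, out_path):
    """Copy only rows where both columns contain a word; return (kept, dropped)."""
    kept = dropped = 0
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as out:
        for line in src:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 2 and has_word(parts[0]) and has_word(parts[1]):
                out.write(line)
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```

This only catches the crudest junk (rows of numbers, stray punctuation, empty cells). It says nothing about whether the two sides actually correspond to each other.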



Parrot  Identity Verified
Spain
Local time: 09:41
Member
Spanish to English
+ ...
TOPIC STARTER
Comments FWIW Oct 20, 2011

I'm no expert, but in Studio 2011 the Acquis doesn't present problems. Europarl, on the other hand, is a bit of a nightmare. The raw files show distorted characters and need a lot of pre-processing. By contrast, the old-release *.tmx files from the site Christophe linked just seem to require some legacy conversion before Studio accepts them.

I used to work with EMEA as a reference, and the texts never quite tallied. I moved out of the field to concentrate on law, so I'm no longer sure about their status; still, I can imagine the material was not originally intended for alignment.



Michael Beijer  Identity Verified
United Kingdom
Local time: 08:41
Member (2009)
Dutch to English
+ ...
generate_tabbed.exe doesn't seem to be working Feb 26, 2015

Hi András,

Any tips on how to get it to work? 2 simple txt files, UTF-8 without signature. I followed all the instructions, but nothing’s happening.

Right before the cmd.exe window closes, I briefly see "Can't open…"

[Edited at 2015-02-26 20:59 GMT]

Weird. It doesn't work if I run it from:

C:\Users\michaelbeijer\Desktop\PatTR Patent Translation Resource\de-en\pattr\description\1 - Copy (1)

But it does work if I run it from:

C:\Users\michaelbeijer\Desktop\

Is it the spaces or brackets in the path maybe?

[Edited at 2015-02-26 21:05 GMT]

