(Part of) the IATE database can now be downloaded as a massive TBX!
Thread poster: Michael Joseph Wdowiak Beijer

Michael Joseph Wdowiak Beijer  Identity Verified
United Kingdom
Local time: 11:46
Member (2009)
Dutch to English
+ ...
Jul 10, 2014

Download IATE

IATE is a living database, i.e. translators and terminologists are continuously updating its content. In 2013, almost 97 000 new terms were added and 158 000 existing terms were modified. These changes were also reviewed and validated. Using the IATE search interface (http://iate.europa.eu/) thus ensures that you are accessing the most complete and up-to-date data. However, in order to cater for specific needs, you can also download a copy of some of the data contained in IATE.

The download file contains about 8 million terms in 24 official EU languages. It is provided in TermBase eXchange (TBX) format. For further details see: TBXcoreStructV02.dtd, TBXXCS.xcs, tbxxcsdtd.dtd.
The size of the uncompressed file is about 2.2 gigabytes.
For information on the data structure and the data categories included in the download file, please see: IATE Data fields explained
You can download the file by clicking on the link below.
IATE_download_25062014.zip (Publication date: 25/06/2014)

Statistics: The download file contains 1.3 million concepts.


--------

A quick look at nl-en, and I count over 450,000 entries! And all of them reviewed and validated! Sadly, the definitions are not present, but that is understandable as the multilingual TBX is already 2GB!

-----> http://iate.europa.eu/tbxPageDownload.do


Noe Tessmann  Identity Verified
Local time: 12:46
English to German
+ ...
Another massive one Jul 11, 2014

Dear Michael,

thanks for pointing me to that massive piece of terminology, but it seems indigestible for my PC. I have 6 GB free on my C drive and 25 GB on my D drive, but when I try to import the TBX file I immediately get an error message saying that there is not enough free space left.
memoQ still tries to load the file in the background, but after an hour with no response I will have to cancel the process.

Is there a way to pre-extract only the language pairs I am really interested in, to make memoQ's job easier?

Have a nice week-end

Noe


PS: I have now put the TBX file on my USB stick, and some sort of import seems to be running in the background.

[Edited at 2014-07-11 16:12 GMT]
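
One possible way to pre-extract a single language pair without loading the whole 2 GB file into memory is to stream it entry by entry. The sketch below is only an illustration and rests on assumptions that should be checked against the IATE data field description: that each concept is a `termEntry` element, that each language sits in a `langSet` with an `xml:lang` attribute, and that terms are in `term` elements. The function names `iter_pairs` and `write_tsv` are made up for this example. Only the first term per language is kept here.

```python
import csv
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this expanded name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def iter_pairs(tbx_path, src, tgt):
    """Yield (source_term, target_term) for entries that contain both languages.

    Assumed layout (not verified against the real export):
    termEntry > langSet[xml:lang] > ... > term
    """
    stack = []  # open elements, so each finished entry can be detached
    for event, elem in ET.iterparse(tbx_path, events=("start", "end")):
        if event == "start":
            stack.append(elem)
            continue
        stack.pop()
        if elem.tag != "termEntry":
            continue
        first = {}
        for lang_set in elem.iter("langSet"):
            lang = lang_set.get(XML_LANG)
            term = lang_set.find(".//term")
            if (lang in (src, tgt) and lang not in first
                    and term is not None and term.text):
                first[lang] = term.text.strip()
        # Detach the finished entry from its parent so memory use stays flat.
        if stack:
            stack[-1].remove(elem)
        if src in first and tgt in first:
            yield first[src], first[tgt]


def write_tsv(pairs, out_path):
    """Write the pairs as a tab-delimited UTF-8 file and return the row count."""
    count = 0
    with open(out_path, "w", encoding="utf-8", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
        for row in pairs:
            writer.writerow(row)
            count += 1
    return count
```

A call such as `write_tsv(iter_pairs("IATE_export_25062014.tbx", "en", "de"), "en-de.txt")` would then produce a glossary small enough for most tools to import. The language codes are assumed to be plain two-letter codes.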



Erik Freitag  Identity Verified
Germany
Local time: 12:46
Member (2006)
Dutch to German
+ ...
No luck with MultiTerm either Jul 11, 2014

Dear Michael,

Thanks for the link! I have tried to import the TBX into MultiTerm, but the process is aborted within a couple of seconds with the message "System.OutOfMemoryException". I'm using a Win7 PC, 64-bit, with 16 GB of RAM.

Seeing that memoQ fails here as well, maybe something's wrong with the TBX?

Has anyone been able to successfully import the TBX?

Kind regards,
Erik


FarkasAndras
Local time: 12:46
English to Hungarian
+ ...
Nice Jul 11, 2014

Thanks, Michael. I will look into converting it to some more palatable format / extracting language pairs. It shouldn't be hard to generate a basic tabbed glossary, but there is quite a lot of metadata in there and some of it may be worth conserving. Doing that is trickier of course.

The download fails for me every time... it's supposed to be a 113 MB file, but the download stops early and I'm stuck with a partial file. Someone rehost it in dropbox or something, please.

[Edited at 2014-07-11 09:32 GMT]
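
A download that stops early can be confirmed before anyone tries to import the result. The following is a minimal sketch using Python's standard `zipfile` module; the helper name `check_zip` is invented for this example. A truncated zip normally lacks the directory stored at the end of the archive, so it fails the first check.

```python
import zipfile


def check_zip(path):
    """Return (ok, message) describing whether a downloaded archive is intact."""
    if not zipfile.is_zipfile(path):
        return False, "not a complete zip archive (download probably truncated)"
    with zipfile.ZipFile(path) as zf:
        # testzip() reads every member and returns the first corrupt name, if any.
        bad = zf.testzip()
        if bad is not None:
            return False, f"corrupt member: {bad}"
        total = sum(info.file_size for info in zf.infolist())
    return True, f"archive OK, {total} bytes uncompressed"
```

Running `check_zip("IATE_download_25062014.zip")` on a complete copy should report roughly the 2.2 GB uncompressed size mentioned on the download page. Expect the check to take a while, because it reads the whole archive.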



Michael Joseph Wdowiak Beijer  Identity Verified
United Kingdom
Local time: 11:46
Member (2009)
Dutch to English
+ ...
TOPIC STARTER
Xbench is working fine here… Jul 11, 2014

Hi everyone,

I am using the latest (paid version of) Xbench, and everything is working fine. I'm on a Dell Precision laptop, with Win7 64-bit and 16 GB of RAM.

Here is the process in Xbench (copied over from the CafeTran mailing list, where I posted this first):

‘Project > Properties’ (or F2),
then click ‘Add’,
then select TBX/MARTIFF Glossary,
then ‘Next’,
then ‘Add File’,
select the file,
then ‘Next’,
and ‘Next’ again ...

The ‘Getting language list’ message should now pop up, which will take quite a while. After that, all the languages should appear.

Once the file has been imported into your Xbench project, you can export it via ‘Tools > Export items (Ctrl+R)’ … as either:

– a TMX
– a tab-delimited UTF-8 text file, or
– an Excel file

I am converting a number of languages for colleagues over on the CafeTranslators mailing list, but it might be better if András did it, as he has much more experience with data from such large multilingual projects.

Michael

See also: https://groups.google.com/forum/?fromgroups=#!topic/cafetranslators/lAopgfpC1Sw

PS: here is the original TBX I downloaded from the site (zipped and unzipped):

https://www.dropbox.com/s/ck67kppuis7e050/IATE_download_25062014.zip (113.38MB)
https://www.dropbox.com/s/zv5aavl0baq316h/IATE_export_25062014.tbx (2,117.09MB)

As tab-delimited text (created with ‘Include segments even if the source or target is missing’ OFF):

en-de: https://www.dropbox.com/s/vyvx9lnmkiboemf/en-de.txt (529,778 entries)
de-nl: https://www.dropbox.com/s/ou0fu1r2t1q2h5a/IATE_de-nl-(396,933-entries).txt (396,933-entries)

src: Download IATE, European Union, 2014.

[Edited at 2014-07-11 13:04 GMT]


FarkasAndras
Local time: 12:46
English to Hungarian
+ ...
Xbench Jul 11, 2014

I'm not sure how much of the data and the data structure xbench conserves. The description says there are synonyms in there (i.e. two English terms in the same entry). I expect that xbench will discard all but the first. Ditto for subject domains, reliability indexes etc. It's a viable option but it shouldn't be too hard to do it better. I've had a look inside and there are quite a few acronyms (labeled as 'abbreviations') + full expressions entered as synonyms. It's worth a bit of extra work to conserve those.

[Edited at 2014-07-11 12:33 GMT]
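
To illustrate what conserving synonyms and metadata could look like, here is a rough sketch that pairs every source term in an entry with every target term and carries the term type, reliability code and subject field along. The element and attribute names (`tig`, `termNote type="termType"`, `descrip type="reliabilityCode"`, `descrip type="subjectField"`) are assumptions based on common TBX usage and on the fields mentioned in this thread. They have not been checked against the actual file, and `terms_for` and `entry_rows` are hypothetical helper names.

```python
import itertools
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def terms_for(entry, lang):
    """Return [(term, term_type, reliability)] for every term in one language."""
    found = []
    for lang_set in entry.iter("langSet"):
        if lang_set.get(XML_LANG) != lang:
            continue
        for tig in lang_set.iter("tig"):
            term = tig.findtext("term", default="").strip()
            if not term:
                continue
            # Assumed metadata fields; an empty string is used when absent.
            term_type = tig.findtext("termNote[@type='termType']", default="")
            reliability = tig.findtext(
                "descrip[@type='reliabilityCode']", default="")
            found.append((term, term_type, reliability))
    return found


def entry_rows(entry, src, tgt):
    """Return one row per source/target combination, synonyms included."""
    subject = entry.findtext(".//descrip[@type='subjectField']", default="")
    rows = []
    for (s, s_type, _s_rel), (t, t_type, t_rel) in itertools.product(
            terms_for(entry, src), terms_for(entry, tgt)):
        rows.append((s, t, s_type, t_type, t_rel, subject))
    return rows
```

With this approach an entry holding a full form and an abbreviation in English plus one German term yields two rows instead of one, so the acronym is not lost. Whether a full cross product is desirable for entries with many synonyms is a design choice; it inflates the glossary but keeps every term searchable.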



Giovanni Guarnieri MITI, MIL  Identity Verified
United Kingdom
Local time: 11:46
Member (2004)
English to Italian
yep... Jul 11, 2014

Like Erik, I get the message "System.OutOfMemoryException" in MultiTerm Convert...

I have a very powerful PC with tons of memory...



Michael Joseph Wdowiak Beijer  Identity Verified
United Kingdom
Local time: 11:46
Member (2009)
Dutch to English
+ ...
TOPIC STARTER
@András: Jul 11, 2014

FarkasAndras wrote:

I'm not sure how much of the data and the data structure xbench conserves. The description says there are synonyms in there (i.e. two English terms in the same entry). I expect that xbench will discard all but the first. Ditto for subject domains, reliability indexes etc. It's a viable option but it shouldn't be too hard to do it better. I've had a look inside and there are quite a few acronyms (labeled as 'abbreviations') + full expressions entered as synonyms. It's worth a bit of extra work to conserve those.

[Edited at 2014-07-11 12:33 GMT]


I seem to have managed to conserve the ‘reliabilityCodes’ and the ‘subjectField’ (numbers), but no synonyms or acronyms (using Xbench). I'll have to have a look at the info on the data structure when I have a moment:

http://iate.europa.eu/tbx/IATE%20Data%20Fields%20Explaind.htm
http://www.ttt.org/oscarstandards/tbx/TBXcoreStructV02.dtd
http://iate.europa.eu/downloadXcs.do

• en-de: https://www.dropbox.com/s/vyvx9lnmkiboemf/en-de.txt (529,778 entries)
• de-nl: https://www.dropbox.com/s/ou0fu1r2t1q2h5a/IATE_de-nl-(396,933-entries).txt (396,933-entries)
• nl-en https://www.dropbox.com/s/nmznnfotyuzl1tl/IATE_nl-en-(401,625-entries).txt (401,625-entries)

Michael

[Edited at 2014-07-11 13:26 GMT]



Michael Joseph Wdowiak Beijer  Identity Verified
United Kingdom
Local time: 11:46
Member (2009)
Dutch to English
+ ...
TOPIC STARTER
Heartsome? Jul 11, 2014

I wonder if any of the now open-source Heartsome tools can handle this TBX better?

Michael

http://www.heartsome.net/en-US/hsde.html
http://www.heartsome.net/en-US/downloads.html


xxx2nl  Identity Verified
Netherlands
Local time: 12:46
Thank you for making this possible! Jul 11, 2014

Michael Beijer wrote:

• de-nl: https://www.dropbox.com/s/ou0fu1r2t1q2h5a/IATE_de-nl-(396,933-entries).txt (396,933-entries)


Thanks Michael!

Hans



Tamas Elek  Identity Verified
United Kingdom
Local time: 11:46
Member (2014)
Hungarian to English
+ ...
Problem with memoQ Jul 12, 2014

I simply cannot import the database into memoQ. I was trying to import the Hungarian - English language pair, but after a few hours of processing, it stops with the following message:

Warnings
--------------------------
Line 2, column 2: TBX is not valid against DTD. Details: No DTD found.
Error during TBX validation: '_', hexadecimal value 0x03, is an invalid character. Line 16387643, position 47.. Skipping TBX validation.


General error.
TYPE:
System.Xml.XmlException

MESSAGE:
'_', hexadecimal value 0x03, is an invalid character. Line 16387643, position 47.

SOURCE:
System.Xml

CALL STACK:
at System.Xml.XmlTextReaderImpl.Throw(String res, String[] args)
at System.Xml.XmlTextReaderImpl.ParseNumericCharRefInline(Int32 startPos, Boolean expand, StringBuilder internalSubsetBuilder, Int32& charCount, EntityType& entityType)
at System.Xml.XmlTextReaderImpl.ParseCharRefInline(Int32 startPos, Int32& charCount, EntityType& entityType)
at System.Xml.XmlTextReaderImpl.ParseText(Int32& startPos, Int32& endPos, Int32& outOrChars)
at System.Xml.XmlTextReaderImpl.ParseText()
at System.Xml.XmlTextReaderImpl.ParseElementContent()
at MemoQ.Termbase.TBXImporter`1.readTbxAndGetLanguages(String tbxFilePath, XmlReaderSettings tbxSettings, Boolean collectLangCodes)
at MemoQ.Termbase.TBXImporter`1.checkDTD(Boolean validateXCS, Boolean collectLangCodesFromTBX)
at MemoQ.Termbase.TBXImporter`1.prepare()
at MemoQ.Termbase.GUI.Import.TBXLocalImporterJob.DoJob()
at MemoQ.Common.Job.JobBase.Execute(Object o)

Any idea how to resolve this issue? I tried three times, but it is always the same.

Thank you in advance.

[Edited at 2014-07-12 22:06 GMT]
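
The call stack above passes through `ParseNumericCharRefInline`, which suggests the offending 0x03 is written as a numeric character reference (something like `&#x3;` or `&#3;`) rather than as a raw byte. One possible workaround, sketched below, is to write a cleaned copy of the file with such references and any raw control characters removed before importing. This is an untested idea for this particular file, and it assumes the TBX is UTF-8 encoded; `clean_line` and `clean_file` are names made up for the example.

```python
import re

# Numeric character references such as &#3; or &#x03;
CHAR_REF = re.compile(r"&#(x[0-9A-Fa-f]+|[0-9]+);")
# Raw control characters that XML 1.0 forbids (tab, LF and CR are allowed).
RAW_CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def _is_valid_xml_char(cp):
    """True if the code point is allowed by the XML 1.0 Char production."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


def _drop_bad_ref(match):
    body = match.group(1)
    cp = int(body[1:], 16) if body[0] == "x" else int(body)
    # Keep legitimate references untouched, drop the forbidden ones.
    return match.group(0) if _is_valid_xml_char(cp) else ""


def clean_line(line):
    """Remove forbidden character references and raw control characters."""
    return RAW_CONTROL.sub("", CHAR_REF.sub(_drop_bad_ref, line))


def clean_file(src_path, dst_path, encoding="utf-8"):
    """Write a cleaned copy line by line; return the number of changed lines."""
    changed = 0
    with open(src_path, encoding=encoding, newline="") as src, \
            open(dst_path, "w", encoding=encoding, newline="") as dst:
        for line in src:
            cleaned = clean_line(line)
            changed += cleaned != line
            dst.write(cleaned)
    return changed
```

Because the file is processed one line at a time, memory use stays small even for the full 2 GB export. The separate warning about the missing DTD is not addressed by this sketch.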



Michael Joseph Wdowiak Beijer  Identity Verified
United Kingdom
Local time: 11:46
Member (2009)
Dutch to English
+ ...
TOPIC STARTER
multifarious.filkin.com Jul 13, 2014

Interesting post on Paul Filkin's blog: http://multifarious.filkin.com/2014/07/13/what-a-whopper/

He has found a way to get the data into MultiTerm (apparently with all the metadata intact).

Michael



Giovanni Guarnieri MITI, MIL  Identity Verified
United Kingdom
Local time: 11:46
Member (2004)
English to Italian
done! Jul 13, 2014

Michael Beijer wrote:

Hi everyone,

I am using the latest (paid version of) Xbench, and everything is working fine. I'm on a Dell Precision laptop, with Win7 64-bit and 16 GB of RAM.

Here is the process in Xbench (copied over from the CafeTran mailing list, where I posted this first):

‘Project > Properties’ (or F2),
then click ‘Add’,
then select TBX/MARTIFF Glossary,
then ‘Next’,
then ‘Add File’,
select the file,
then ‘Next’,
and ‘Next’ again ...

The ‘Getting language list’ message should now pop up, which will take quite a while. After that, all the languages should appear.

Once the file has been imported into your Xbench project, you can export it via ‘Tools > Export items (Ctrl+R)’ … as either:

– a TMX
– a tab-delimited UTF-8 text file, or
– an Excel file


I used the 30 day trial of XBench for this... took 1 hour overall to convert EN>IT into TMX and then to import the TMX into a Studio 2014 memory... out of 883327 entries, only 425106 were imported... rest all errors... no idea why!



SDL Community  Identity Verified
United Kingdom
Local time: 12:46
English
You should have clicked on the error! Jul 13, 2014

Giovanni Guarnieri MITI, MIL wrote:

I used the 30 day trial of XBench for this... took 1 hour overall to convert EN>IT into TMX and then to import the TMX into a Studio 2014 memory... out of 883327 entries, only 425106 were imported... rest all errors... no idea why!


Then it would probably tell you. Things like duplicates and missing data are common, especially for a conversion like this where it's quite possible you would have source info with no corresponding target info for example.

Regards

Paul
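
For anyone who wants a rough count of such cases before importing, the check below may help. It is a sketch that assumes the same language pair has also been exported from Xbench as a tab-delimited UTF-8 file with the source in the first column and the target in the second; it says nothing about how Studio itself classifies import errors. The function name `diagnose` is invented for the example.

```python
from collections import Counter


def diagnose(tsv_path, encoding="utf-8"):
    """Count rows with a missing side and exact duplicate pairs in a glossary."""
    stats = Counter()
    seen = set()
    with open(tsv_path, encoding=encoding, newline="") as fh:
        for line in fh:
            stats["rows"] += 1
            cols = line.rstrip("\r\n").split("\t")
            src = cols[0].strip()
            tgt = cols[1].strip() if len(cols) > 1 else ""
            if not src or not tgt:
                stats["missing source or target"] += 1
            elif (src, tgt) in seen:
                stats["exact duplicates"] += 1
            else:
                seen.add((src, tgt))
                stats["usable"] += 1
    return dict(stats)
```

If the "usable" figure comes out close to the number of entries that were actually imported, duplicates and one-sided entries would be a plausible explanation for the rest.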



Giovanni Guarnieri MITI, MIL  Identity Verified
United Kingdom
Local time: 11:46
Member (2004)
English to Italian
I did! Jul 13, 2014

SDL Support wrote:

Giovanni Guarnieri MITI, MIL wrote:

I used the 30 day trial of XBench for this... took 1 hour overall to convert EN>IT into TMX and then to import the TMX into a Studio 2014 memory... out of 883327 entries, only 425106 were imported... rest all errors... no idea why!


Then it would probably tell you. Things like duplicates and missing data are common, especially for a conversion like this where it's quite possible you would have source info with no corresponding target info for example.

Regards

Paul


I don't remember seeing any explanation... only the number of entries imported out of the total and the number of entries not imported... maybe I didn't look properly...

