https://www.proz.com/forum/translator_resources/271879-part_of_the_iate_database_can_now_be_downloaded_as_a_massive_tbx.html

Pages in topic:   [1 2 3] >
(Part of) the IATE database can now be downloaded as a massive TBX!
Thread poster: Michael Beijer
Michael Beijer
United Kingdom
Member (2009)
Dutch to English
Jul 10, 2014

Download IATE

IATE is a living database, i.e. translators and terminologists are continuously updating its content. In 2013, almost 97 000 new terms were added and 158 000 existing terms were modified. These changes were also reviewed and validated. Using the IATE search interface (http://iate.europa.eu/) thus ensures that you are accessing the most complete and up-to-date data. However, in order to cater for specific needs, you can also download a copy of some of the data contained in IATE.

The download file contains about 8 million terms in 24 official EU languages. It is provided in TermBase eXchange (TBX) format. For further details see: TBXcoreStructV02.dtd, TBXXCS.xcs, tbxxcsdtd.dtd.
The size of the uncompressed file is about 2.2 gigabytes.
For information on the data structure and the data categories included in the download file, please see: IATE Data fields explained
You can download the file by clicking on the link below.
IATE_download_25062014.zip (Publication date: 25/06/2014)

Statistics: The download file contains 1.3 million concepts.


--------

A quick look at nl-en, and I count over 450,000 entries! And all of them reviewed and validated! Sadly, the definitions are not present, but that is understandable as the multilingual TBX is already 2GB!

-----> http://iate.europa.eu/tbxPageDownload.do


 
Noe Tessmann
English to German
Another massive one Jul 11, 2014

Dear Michael,

Thanks for pointing me to that massive piece of terminology, but it seems indigestible for my PC. I have 6 GB of free space on my C drive and 25 GB on my D drive, but when I try to import the TBX file I immediately get an error message saying there is not enough free space left.
MQ still tries to load the file in the background, but after one hour with no response I'll have to cancel the process.

Is there a way to pre-extract only the language pairs I am really interested in, to alleviate the job for MemoQ?

Have a nice weekend

Noe


PS: I put the tbx file now on my USB stick and it seems that in background some sort of import is going on.

[Edited at 2014-07-11 16:12 GMT]


 
Erik Freitag
Germany
Member (2006)
Dutch to German
No luck with MultiTerm either Jul 11, 2014

Dear Michael,

Thanks for the link! I have tried to import the TBX into MultiTerm, but the process is aborted within a couple of seconds with the message "System.OutOfMemoryException". I'm using a Win7 PC, 64-bit, with 16 GB of RAM.

Seeing that MemoQ fails here as well, maybe something's wrong with the TBX?

Has anyone been able to successfully import the TBX?

Kind regards,
Erik


 
FarkasAndras
English to Hungarian
Nice Jul 11, 2014

Thanks, Michael. I will look into converting it to some more palatable format / extracting language pairs. It shouldn't be hard to generate a basic tabbed glossary, but there is quite a lot of metadata in there and some of it may be worth conserving. Doing that is trickier of course.

The download fails for me every time... it's supposed to be a 113 MB file, but the download stops early and I'm stuck with a partial file. Someone rehost it in dropbox or something, please.

[Edited at 2014-07-11 09:32 GMT]
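
A minimal sketch of the "basic tabbed glossary" idea, not anyone's actual converter from this thread. It streams the TBX entry by entry instead of loading all 2.2 GB at once, which is the usual way around out-of-memory failures. It assumes the standard TBX nesting (termEntry > langSet with an xml:lang attribute > term) and keeps only the first term per language; the function names extract_pairs and write_tsv are made up for illustration.

```python
import csv
import xml.etree.ElementTree as ET

# ElementTree reports the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def extract_pairs(source, src_lang, tgt_lang):
    """Yield (source_term, target_term) for entries that contain both languages.

    Assumption: entries look like termEntry > langSet[xml:lang] > ... > term.
    Only the first term of each langSet is kept in this sketch.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "termEntry":
            continue
        first_term = {}
        for lang_set in elem.iter("langSet"):
            lang = lang_set.get(XML_LANG)
            term = lang_set.find(".//term")
            if lang and term is not None and term.text:
                first_term.setdefault(lang, term.text.strip())
        if src_lang in first_term and tgt_lang in first_term:
            yield first_term[src_lang], first_term[tgt_lang]
        # Drop the processed entry's children to keep memory use low.
        elem.clear()


def write_tsv(pairs, out):
    """Write the pairs as tab-delimited text to an open text file object."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(pairs)
```

To use it, open an output file with encoding="utf-8" and newline="", then call write_tsv(extract_pairs("IATE_export_25062014.tbx", "en", "nl"), out). Entries that lack either language are skipped, which roughly corresponds to exporting with the "include segments even if the source or target is missing" option switched off.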


 
Michael Beijer
United Kingdom
Member (2009)
Dutch to English
TOPIC STARTER
Xbench is working fine here… Jul 11, 2014

Hi everyone,

I am using the latest (paid version of) Xbench, and everything is working fine. I'm on a Dell Precision laptop, with Win7 64-bit and 16 GB of RAM.

Here is the process in Xbench (copied over from the CafeTran mailing list, where I posted this first):

‘Project > Properties’ (or F2),
then click ‘Add’,
then select TBX/MARTIFF Glossary,
then ‘Next’,
then ‘Add File’,
select the file,
then ‘Next’,
and ‘Next’ again ...

The ‘Getting language list’ message should now pop up, which will take quite a while. After that, all the languages should appear.

Once the file has been imported into your Xbench project, you can export it via ‘Tools > Export items (Ctrl+R)’ … as either:

– a TMX
– a tab-delimited UTF-8 text file, or
– an Excel file

I am converting a number of languages for colleagues over on the CafeTranslators mailing list, but it might be better if András did it, as he has much more experience with data from such large multilingual projects.

Michael

See also: https://groups.google.com/forum/?fromgroups=#!topic/cafetranslators/lAopgfpC1Sw

PS: here is the original TBX I downloaded from the site (zipped and unzipped):

https://www.dropbox.com/s/ck67kppuis7e050/IATE_download_25062014.zip (113.38MB)
https://www.dropbox.com/s/zv5aavl0baq316h/IATE_export_25062014.tbx (2,117.09MB)

As tab-delimited text (created with ‘Include segments even if the source or target is missing’ OFF):

en-de: https://www.dropbox.com/s/vyvx9lnmkiboemf/en-de.txt (529,778 entries)
de-nl: https://www.dropbox.com/s/ou0fu1r2t1q2h5a/IATE_de-nl-(396,933-entries).txt (396,933 entries)

src: Download IATE, European Union, 2014.

[Edited at 2014-07-11 13:04 GMT]


 
FarkasAndras
English to Hungarian
Xbench Jul 11, 2014

I'm not sure how much of the data and the data structure xbench conserves. The description says there are synonyms in there (i.e. two English terms in the same entry). I expect that xbench will discard all but the first. Ditto for subject domains, reliability indexes etc. It's a viable option but it shouldn't be too hard to do it better. I've had a look inside and there are quite a few acronyms (labeled as 'abbreviations') + full expressions entered as synonyms. It's worth a bit of extra work to conserve those.

[Edited at 2014-07-11 12:33 GMT]
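
One possible way to conserve synonyms and metadata, sketched on a single made-up entry. The element layout below (termNote type="termType", descrip type="reliabilityCode", descrip type="subjectField") is an assumption based on the data categories named in this thread, and the sample values are hypothetical; check the real file and the "IATE Data fields explained" page before relying on it.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical entry; the structure and values are assumptions, not copied from IATE.
ENTRY = """<termEntry id="example-1">
  <descripGrp><descrip type="subjectField">1011</descrip></descripGrp>
  <langSet xml:lang="en">
    <tig>
      <term>European Central Bank</term>
      <termNote type="termType">fullForm</termNote>
      <descrip type="reliabilityCode">4</descrip>
    </tig>
    <tig>
      <term>ECB</term>
      <termNote type="termType">abbreviation</termNote>
      <descrip type="reliabilityCode">4</descrip>
    </tig>
  </langSet>
  <langSet xml:lang="nl">
    <tig>
      <term>Europese Centrale Bank</term>
      <termNote type="termType">fullForm</termNote>
      <descrip type="reliabilityCode">3</descrip>
    </tig>
  </langSet>
</termEntry>"""


def entry_terms(entry):
    """Return {lang: [term records]} keeping every term, not just the first."""
    terms = {}
    for lang_set in entry.iter("langSet"):
        lang = lang_set.get(XML_LANG)
        for group in lang_set:  # each child is one term group
            term = group.find(".//term")
            if term is None or not term.text:
                continue
            terms.setdefault(lang, []).append({
                "term": term.text.strip(),
                "type": group.findtext(".//termNote[@type='termType']", default=""),
                "reliability": group.findtext(".//descrip[@type='reliabilityCode']", default=""),
            })
    return terms


def rows(entry, src_lang, tgt_lang):
    """Yield one row per source/target combination so synonyms are kept."""
    subject = entry.findtext(".//descrip[@type='subjectField']", default="")
    terms = entry_terms(entry)
    for src in terms.get(src_lang, []):
        for tgt in terms.get(tgt_lang, []):
            yield (src["term"], tgt["term"], src["type"], tgt["type"],
                   tgt["reliability"], subject)
```

Calling rows(ET.fromstring(ENTRY), "en", "nl") gives two rows here, one for the full form and one for the abbreviation, each carrying the term types, the target reliability code and the subject field as extra columns. Emitting every combination is only one design choice; joining synonyms into a single cell would be another.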


 
Giovanni Guarnieri MITI, MIL
United Kingdom
Member (2004)
English to Italian
yep... Jul 11, 2014

Like Erik, I get the message "System.OutOfMemoryException" in MultiTerm Convert...

I have a very powerful PC with tons of memory...


 
Michael Beijer
United Kingdom
Member (2009)
Dutch to English
TOPIC STARTER
@András: Jul 11, 2014

FarkasAndras wrote:

I'm not sure how much of the data and the data structure xbench conserves. The description says there are synonyms in there (i.e. two English terms in the same entry). I expect that xbench will discard all but the first. Ditto for subject domains, reliability indexes etc. It's a viable option but it shouldn't be too hard to do it better. I've had a look inside and there are quite a few acronyms (labeled as 'abbreviations') + full expressions entered as synonyms. It's worth a bit of extra work to conserve those.

[Edited at 2014-07-11 12:33 GMT]


I seem to have managed to conserve the ‘reliabilityCodes’ and the ‘subjectField’ (numbers), but no synonyms or acronyms (using Xbench). I'll have to have a look at the info on the data structure when I have a moment:

http://iate.europa.eu/tbx/IATE%20Data%20Fields%20Explaind.htm
http://www.ttt.org/oscarstandards/tbx/TBXcoreStructV02.dtd
http://iate.europa.eu/downloadXcs.do

• en-de: https://www.dropbox.com/s/vyvx9lnmkiboemf/en-de.txt (529,778 entries)
• de-nl: https://www.dropbox.com/s/ou0fu1r2t1q2h5a/IATE_de-nl-(396,933-entries).txt (396,933 entries)
• nl-en: https://www.dropbox.com/s/nmznnfotyuzl1tl/IATE_nl-en-(401,625-entries).txt (401,625 entries)

Michael

[Edited at 2014-07-11 13:26 GMT]


 
Michael Beijer
United Kingdom
Member (2009)
Dutch to English
TOPIC STARTER
Heartsome? Jul 11, 2014

I wonder if any of the now open-source Heartsome tools can handle this TBX better?

Michael

http://www.heartsome.net/en-US/hsde.html
http://www.heartsome.net/en-US/downloads.html


 
2nl (X)
Netherlands
Thank you for making this possible! Jul 11, 2014

Michael Beijer wrote:

• de-nl: https://www.dropbox.com/s/ou0fu1r2t1q2h5a/IATE_de-nl-(396,933-entries).txt (396,933 entries)


Thanks Michael!

Hans


 
Tamas Elek
Hungary
English to Hungarian
Problem with memoQ Jul 12, 2014

I simply cannot import the database into memoQ. I was trying to import the Hungarian - English language pair, but after a few hours of processing, it stops with the following message:

Warnings
--------------------------
Line 2, column 2: TBX is not valid against DTD. Details: No DTD found.
Error during TBX validation: '_', hexadecimal value 0x03, is an invalid character. Line 16387643, position 47.. Skipping TBX validation.


General error.
TYPE:
System.Xml.XmlException

MESSAGE:
'_', hexadecimal value 0x03, is an invalid character. Line 16387643, position 47.

SOURCE:
System.Xml

CALL STACK:
at System.Xml.XmlTextReaderImpl.Throw(String res, String[] args)
at System.Xml.XmlTextReaderImpl.ParseNumericCharRefInline(Int32 startPos, Boolean expand, StringBuilder internalSubsetBuilder, Int32& charCount, EntityType& entityType)
at System.Xml.XmlTextReaderImpl.ParseCharRefInline(Int32 startPos, Int32& charCount, EntityType& entityType)
at System.Xml.XmlTextReaderImpl.ParseText(Int32& startPos, Int32& endPos, Int32& outOrChars)
at System.Xml.XmlTextReaderImpl.ParseText()
at System.Xml.XmlTextReaderImpl.ParseElementContent()
at MemoQ.Termbase.TBXImporter`1.readTbxAndGetLanguages(String tbxFilePath, XmlReaderSettings tbxSettings, Boolean collectLangCodes)
at MemoQ.Termbase.TBXImporter`1.checkDTD(Boolean validateXCS, Boolean collectLangCodesFromTBX)
at MemoQ.Termbase.TBXImporter`1.prepare()
at MemoQ.Termbase.GUI.Import.TBXLocalImporterJob.DoJob()
at MemoQ.Common.Job.JobBase.Execute(Object o)

Any idea how to resolve this issue? I tried three times, but it is always the same.

Thank you in advance.

[Edited at 2014-07-12 22:06 GMT]
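
The call stack above fails in ParseNumericCharRefInline, which suggests the TBX contains a numeric character reference (something like &#x3;) to a control character that XML 1.0 does not allow. One possible workaround, sketched here under that assumption and the further assumption that the file is UTF-8, is to write a cleaned copy of the file before importing it. The function names are illustrative, and this has not been tested against the actual IATE export.

```python
import re

# Numeric character references such as &#3; or &#x03;
CHAR_REF = re.compile(r"&#(x[0-9A-Fa-f]+|[0-9]+);")
# Raw control characters that XML 1.0 forbids (tab, LF and CR are allowed).
RAW_CONTROL = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F]")


def is_valid_xml_char(code_point):
    """True if the code point is allowed by the XML 1.0 Char production."""
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )


def _drop_invalid_ref(match):
    """Keep a character reference only if it points at a legal XML character."""
    ref = match.group(1)
    code_point = int(ref[1:], 16) if ref.startswith("x") else int(ref)
    return match.group(0) if is_valid_xml_char(code_point) else ""


def clean_line(line):
    """Remove illegal character references and raw control characters."""
    return RAW_CONTROL.sub("", CHAR_REF.sub(_drop_invalid_ref, line))


def clean_file(src_path, dst_path, encoding="utf-8"):
    """Copy src_path to dst_path line by line, dropping invalid characters."""
    with open(src_path, encoding=encoding, newline="") as src, \
            open(dst_path, "w", encoding=encoding, newline="") as dst:
        for line in src:
            dst.write(clean_line(line))
```

Working line by line keeps memory use small, which matters for a file with more than 16 million lines. Legal references such as &#233; are left untouched; only the ones an XML parser would reject are removed.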


 
Michael Beijer
United Kingdom
Member (2009)
Dutch to English
TOPIC STARTER
multifarious.filkin.com Jul 13, 2014

Interesting post on Paul Filkin's blog: http://multifarious.filkin.com/2014/07/13/what-a-whopper/

He has found a way to get the data into MultiTerm (apparently with all the metadata intact).

Michael


 
Giovanni Guarnieri MITI, MIL
United Kingdom
Member (2004)
English to Italian
done! Jul 13, 2014

Michael Beijer wrote:

Hi everyone,

I am using the latest (paid version of) Xbench, and everything is working fine. I'm on a Dell Precision laptop, with Win7 64-bit and 16 GB of RAM.

Here is the process in Xbench (copied over from the CafeTran mailing list, where I posted this first):

‘Project > Properties’ (or F2),
then click ‘Add’,
then select TBX/MARTIFF Glossary,
then ‘Next’,
then ‘Add File’,
select the file,
then ‘Next’,
and ‘Next’ again ...

The ‘Getting language list’ message should now pop up, which will take quite a while. After that, all the languages should appear.

Once the file has been imported into your Xbench project, you can export it via ‘Tools > Export items (Ctrl+R)’ … as either:

– a TMX
– a tab-delimited UTF-8 text file, or
– an Excel file


I used the 30 day trial of XBench for this... took 1 hour overall to convert EN>IT into TMX and then to import the TMX into a Studio 2014 memory... out of 883327 entries, only 425106 were imported... rest all errors... no idea why!


 
RWS Community
United Kingdom
English
You should have clicked on the error! Jul 13, 2014

Giovanni Guarnieri MITI, MIL wrote:

I used the 30 day trial of XBench for this... took 1 hour overall to convert EN>IT into TMX and then to import the TMX into a Studio 2014 memory... out of 883327 entries, only 425106 were imported... rest all errors... no idea why!


Then it would probably tell you. Things like duplicates and missing data are common, especially for a conversion like this where it's quite possible you would have source info with no corresponding target info for example.

Regards

Paul
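
To get a rough idea of how many rows fall into the categories Paul mentions, one could count them in the tab-delimited export before importing. This is only a sketch: it does not reproduce Studio's actual import rules, which are not described in this thread, and classify_rows is a made-up helper name.

```python
from collections import Counter


def classify_rows(rows):
    """Count rows that an importer might plausibly skip.

    rows is an iterable of (source, target) tuples; a missing side may be
    an empty string or None.
    """
    seen = set()
    counts = Counter()
    for source, target in rows:
        source = (source or "").strip()
        target = (target or "").strip()
        if not source or not target:
            counts["missing_side"] += 1
        elif (source, target) in seen:
            counts["duplicate"] += 1
        else:
            seen.add((source, target))
            counts["unique"] += 1
    return counts
```

If the "unique" count comes out close to the number of units that were actually imported, duplicates and empty sides would be a plausible explanation for the rest; if not, the cause lies elsewhere.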


 
Giovanni Guarnieri MITI, MIL
United Kingdom
Member (2004)
English to Italian
I did! Jul 13, 2014

SDL Support wrote:

Giovanni Guarnieri MITI, MIL wrote:

I used the 30 day trial of XBench for this... took 1 hour overall to convert EN>IT into TMX and then to import the TMX into a Studio 2014 memory... out of 883327 entries, only 425106 were imported... rest all errors... no idea why!


Then it would probably tell you. Things like duplicates and missing data are common, especially for a conversion like this where it's quite possible you would have source info with no corresponding target info for example.

Regards

Paul


I don't remember seeing any explanation... only the number of entries imported out of the total and the number of entries not imported... maybe I didn't look properly...


 