tmx from Parallel corpus of Patent Translation Resource?
Thread poster: Noe Tessmann
2nl (X)
Local time: 08:44
Did you already post the complete version Feb 27, 2015

Hello Mr. B.,

did you already post the complete versions?

Mr. H.

Michael Beijer
United Kingdom
Local time: 07:44
Member (2009)
Dutch to English
uploading now! Feb 27, 2015

2nl wrote:

Hello Mr. B.,

did you already post the complete versions?

Mr. H.

Hello Mr Hans,

No, not yet.

And sorry for all the dead links. Proz doesn't let you edit older posts!

I am finishing/uploading them all today, and will post one final list of download links for all 23 TMXs (totalling 22 million TUs)!



Noe Tessmann
Local time: 08:44
English to German
Massive Data attack Feb 28, 2015

Dear Michael,

Wow, impressive: you really did it, and not just 10 batches but an incredible 23. This is data mining on a big scale. That's too much for a standard CAT environment.

Enjoy your weekend

All the best


Michael Beijer
United Kingdom
Local time: 07:44
Member (2009)
Dutch to English

PatTR: Patent Translation Resource files

converted to .TMXs
by Michael Beijer


"PatTR is a sentence-parallel corpus extracted from the MAREC patent collection. The current version contains more than 22 million German-English and 18 million French-English parallel sentences collected from all patent text sections as well as 5 million German-French sentence pairs from patent titles, abstracts and claims."


Original files here:

Original workflow:

1. Append ".txt" to file names
2. Open files in EmEditor (or a good text editor capable of opening large files; UltraEdit is also good)
3. In Ron's CSV Editor, create an empty file and paste in the contents of the .txt files (src + trgt language) to create a tab-delimited .csv
4. In Xbench, convert the aforementioned .csv to .tmx
5. In Heartsome TMX editor, edit the TMX custom attributes and clean up the TMX (remove duplicates)
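
For anyone who would rather script the duplicate removal from step 5, here is a minimal Python sketch. It assumes the tab-delimited layout from step 3 (one source/target pair per line, separated by a tab). The function name dedupe_pairs is made up for this example and is not part of Heartsome or any other tool mentioned in this thread.

```python
def dedupe_pairs(lines):
    """Yield each (source, target) pair once, in first-seen order.

    Assumption: every valid line is "source<TAB>target".
    Lines without a tab are skipped as malformed.
    """
    seen = set()
    for line in lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue  # not a source/target pair
        src, tgt = line.split("\t", 1)
        key = (src, tgt)
        if key in seen:
            continue  # exact duplicate of an earlier pair
        seen.add(key)
        yield src, tgt


if __name__ == "__main__":
    sample = ["Hallo\tHello\n", "Hallo\tHello\n", "Welt\tWorld\n"]
    for src, tgt in dedupe_pairs(sample):
        print(src, "->", tgt)
```

It only catches exact duplicates, and it keeps every unique pair in memory, so memory use grows with the size of the file.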

Improved workflow:

1. Append ".txt" to file names
2. Use the "split" command in cmd.exe (a GNU tool, not built into Windows) to split each large text file into smaller files based on number of lines (1,000,000 lines): split -l 1000000 filename.txt
3. Use "generate_tabbed.exe" (in András Farkas’s "Grab Bag", included in LF Aligner download) to convert src and trgt language .txt files into tab-delimited .txt containing both src + trgt
4. Use Heartsome TMX editor to convert bilingual tab-del .txt files into .tmx
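
The improved workflow can also be approximated in a few lines of Python. This is only a sketch under stated assumptions, not what generate_tabbed.exe or Heartsome actually do internally: it assumes the source and target .txt files are line-aligned (line N in one file translates line N in the other), the function names chunks and pairs_to_tmx are hypothetical, the default language codes "de" and "en" are placeholders, and the header values are dummy entries for the attributes a TMX 1.4 header requires.

```python
import xml.etree.ElementTree as ET
from itertools import islice


def chunks(lines, size):
    """Split an iterable of lines into lists of at most `size` items."""
    it = iter(lines)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def pairs_to_tmx(src_lines, tgt_lines, src_lang="de", tgt_lang="en"):
    """Build a TMX document (as a string) from two line-aligned lists."""
    tmx = ET.Element("tmx", {"version": "1.4"})
    # Placeholder header values; adjust them for real use.
    ET.SubElement(tmx, "header", {
        "creationtool": "sketch",
        "creationtoolversion": "0",
        "segtype": "sentence",
        "o-tmf": "plain text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in zip(src_lines, tgt_lines):
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {"xml:lang": lang})
            # ElementTree escapes &, < and > in the text on output.
            ET.SubElement(tuv, "seg").text = text.rstrip("\n")
    return ET.tostring(tmx, encoding="unicode")


if __name__ == "__main__":
    print(pairs_to_tmx(["Hallo Welt\n"], ["Hello world\n"]))
```

For the real files you would read both .txt files lazily, pass zip(src_file, tgt_file) through chunks with a size of 1,000,000, and write one .tmx per chunk, since building a single tree for millions of TUs needs a lot of RAM.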

The 4 Sections:





COPYRIGHT: Wäschle, Katharina; Riezler, Stefan, 2012, "PatTR: Patent Translation Resource", doi:10.11588/data/10002 V3 [Version]

PatTR is available under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Please cite Wäschle & Riezler (2012b), if you use the corpus in your work, or use the data citation specified in the HeiDATA entry.


PS #1: I highly recommend using the Firefox addon DownThemAll if you want to download them all at once.
PS #2: I will try not to move the files in my Dropbox folder, so the links stay alive (unlike all other links in this thread).
PS #3: I will probably be creating some kind of dedicated page for my TMX conversions (similar to my glossary resource), maybe with a Donate button, one of these days.

Noe Tessmann
Local time: 08:44
English to German
Claims claimed for me Mar 2, 2015

Dear Michael,

I downloaded the full claims batch and imported it into a separate lookup tool database. It works perfectly. I repeat myself, but I have to thank you once again for the incredible job you did. You're one of the very few people in the translator community who are able to deal with these data mountains.

Yes, please make a website and a donate button.

All the best



Michael Beijer
United Kingdom
Local time: 07:44
Member (2009)
Dutch to English
Uploaded single zip of all files! Mar 3, 2015

You're welcome Noe!

---> (2.97 GB)

By the way, Igor (the developer of CafeTran) is working on adding SQLite functionality to CafeTran, meaning: it will soon be able to read TMLookup databases directly, and use them in its Total Recall and pretranslation systems! That is, you will be able to work with databases of this size directly in a CAT tool, and it will be fast – much, much faster than in SDL Studio, memoQ, Deja-Vu, etc. In fact, as far as I am aware, CafeTran will be the first CAT tool on the planet to be able to handle "Big Data" comfortably.

Anna Sarah Krämer
Local time: 08:44
Member (2011)
English to German
Wow! Mar 3, 2015

Hi Michael,

I am now downloading and I would like to join Noe in asking for a donate button. You did a wonderful job here!

Thank you so much!

Best regards,

FarkasAndras
Local time: 08:44
English to Hungarian
Interesting Mar 3, 2015

Michael Beijer wrote:

By the way, Igor (the developer of CafeTran) is working on adding SQLite functionality to CafeTran, meaning: it will soon be able to read TMLookup databases directly, and use them in its Total Recall and pretranslation systems! That is, you will be able to work with databases of this size directly in a CAT tool, and it will be fast – much, much faster than in SDL Studio, memoQ, Deja-Vu, etc. In fact, as far as I am aware, CafeTran will be the first CAT tool on the planet to be able to handle "Big Data" comfortably.

Does the developer plan to support TMLookup db files specifically, or will he just use SQLite? Just because the db files of both tools are SQLite doesn't mean that they are interchangeable. For instance, Studio also uses SQLite, but their format is completely different, so TMLookup and Studio can't read each other's files.
I used a very simple format for TMLookup, so it's very easy to support if desired. SDL obviously added a bunch of fancy features to their TMs (fuzzy search, storing 'language resources' aka segmentation data in the TM etc.) and they needed a vastly more complex format for that.

Meta Arkadia
Local time: 13:44
English to Indonesian
Rather simple Mar 3, 2015

FarkasAndras wrote:
Does the developer plan to support TMLookup db files specifically, or will he just use SQLite?

The feature also works on a Mac, so it's not TMLookup support only. However, it's quite possible Igor took TMLookup into account. As far as I can see, the set-up is very simple:

But it works:

By the way, I don't think this feature is "official" yet.


Hans (guinea pig, but a happy one)

Michael Beijer
United Kingdom
Local time: 07:44
Member (2009)
Dutch to English
Has Big Data finally reached CAT tool country? Mar 3, 2015

FarkasAndras wrote:

Michael Beijer wrote:

By the way, Igor (the developer of CafeTran) is working on adding SQLite functionality to CafeTran, meaning: it will soon be able to read TMLookup databases directly, and use them in its Total Recall and pretranslation systems! That is, you will be able to work with databases of this size directly in a CAT tool, and it will be fast – much, much faster than in SDL Studio, memoQ, Deja-Vu, etc. In fact, as far as I am aware, CafeTran will be the first CAT tool on the planet to be able to handle "Big Data" comfortably.

Does the developer plan to support TMLookup db files specifically, or will he just use SQLite? Just because the db files of both tools are SQLite doesn't mean that they are interchangeable. For instance, Studio also uses SQLite, but their format is completely different, so TMLookup and Studio can't read each other's files.
I used a very simple format for TMLookup, so it's very easy to support if desired. SDL obviously added a bunch of fancy features to their TMs (fuzzy search, storing 'language resources' aka segmentation data in the TM etc.) and they needed a vastly more complex format for that.

Hi András,

Not sure really. The other day, Igor asked me to send him a small example of a TMLookup db, which I did. He then surprised us all by adding support for SQLite dbs to CafeTran (with the ability to open TMLookup-generated dbs directly). Previously, we had used H2 dbs, but performance was not always great, and they had problems. I can now open my TMLookup default.db directly in CT. Because the language codes are slightly different ("nl-NL" in CT and "nl" in TMLookup), you need to select the languages from a dropdown menu when opening a db in CT. Igor said that wouldn't be necessary if I could change the TMLookup .db languages to the nl-NL flavour.

The actual feature hasn't been officially released yet. That is, added to the ChangeLog. A few of us are testing it at the moment, and things might change.

I think Igor added it because we kept telling him how fast TMLookup is, and now we have access to the same lookup speeds from inside CT, a CAT tool first, as far as I know. What CAT tool has lightning-fast searching of databases of 40 million entries? And it doesn't stop there: we can now even use these dbs for pretranslation and "Total Recall" purposes, which might open up very interesting new possibilities. Running Pretranslate in e.g. memoQ with my 40 million TUs connected brings the program to a crawl and takes forever. Yesterday, however, I ran Pretranslate on my job against my TMLookup-derived .db, and everything was fast and responsive.

Stay tuned!



Here is what Igor emailed me re the new feature (which might interest you as you are both a programmer and the creator of TMLookup):

Hi Michael,

[…] SQLite (the same as in the TMLookup) as an alternative to H2 DB in the Total Recall system.

1. Download the sqlite java driver 3.8.7 from here
2. Paste it into the lib folder in the CT installation.
3. Paste the attached SQLite.res into the infos/databases/ folder.
4. Replace Cafetran.jar as usual.
5. Run CT and go to Edit > Options > Database tab.
6. Select SQLite from the drop-down list in the Database connection:
7. By default it is configured to create a new database. However, you may access TMLookup base as well. Just press the Database connection button and set the path to your .db file in the URL field (jdbc:sqlite:C:\path\to\may\file\TMLTest.db)

Note: TMLookup bases do not set the country codes (only language codes) in the columns so CT will display a pop-up dialog to select the source and the target language columns.
You might wish to populate the SQLiteMemoryBase.db newly created by CT with TMXs and compatible language and country codes for columns.
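
Since the note above is about which columns a .db file contains, here is a small Python sketch for checking that yourself before pointing CT at a file. It uses only the standard sqlite3 module and makes no assumption about the TMLookup or CafeTran schema, which is not described in this thread: it simply lists whatever tables and columns the file has. The function name describe_sqlite is made up for this example.

```python
import sqlite3


def describe_sqlite(path):
    """Return {table_name: [column names]} for the SQLite file at `path`."""
    con = sqlite3.connect(path)
    try:
        tables = [row[0] for row in con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )]
        schema = {}
        for table in tables:
            # PRAGMA table_info returns one row per column; index 1 is the name.
            quoted = table.replace('"', '""')
            cols = con.execute(f'PRAGMA table_info("{quoted}")').fetchall()
            schema[table] = [col[1] for col in cols]
        return schema
    finally:
        con.close()


if __name__ == "__main__":
    import sys
    for table, columns in describe_sqlite(sys.argv[1]).items():
        print(table, columns)
```

If the column names turn out to be bare language codes such as "nl", that would match Igor's remark about the pop-up dialog for selecting source and target columns.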


****PS: In case any of you reading this are wondering how this relates to the original topic of the thread … this means we can effectively use the entire "PatTR: Patent Translation Resource" (all 22 million de-en TUs of it, e.g.) inside a CAT tool: both for lookup purposes and even as a massive TM!

[Edited at 2015-03-03 15:11 GMT]

FarkasAndras
Local time: 08:44
English to Hungarian
OK Mar 3, 2015

That's nice to know. A little more compatibility between tools is always good to have (especially because having to keep two or three versions of a very large database eats up unreasonable amounts of disk space).

2nl (X)
Local time: 08:44
I share your enthusiasm! Mar 3, 2015

Michael Beijer wrote:

****PS: In case any of you reading this are wondering how this relates to the original topic of the thread … this means we can effectively use the entire "PatTR: Patent Translation Resource" (all 22 million de-en TUs of it, e.g.) inside a CAT tool: both for lookup purposes and even as a massive TM!

Dear Michael Beijer,

I share your enthusiasm about CafeTran's new feature. In fact, this might be the reason for me to start using it.

I'd also like to thank you for your BIG EFFORT in making these patents available for all of us! May prosperity be your reward. Please add a donut button.



Markgraf001 (X)
Local time: 08:44
English to German
Great work! Mar 4, 2015

I really appreciate your work, thanks a lot for that.
While downloading, I noticed that Description files 08 and 09 are the same (both named 09, with TUs 8,000,000-9,000,000).
So could you please upload Description Part 08, just to have it complete?

Thanks again for that huge project!!!!

Michael Beijer
United Kingdom
Local time: 07:44
Member (2009)
Dutch to English
Oops! Mar 4, 2015

Markgraf001 wrote:

I really appreciate your work, thanks a lot for that.
While downloading, I noticed that Description files 08 and 09 are the same (both named 09, with TUs 8,000,000-9,000,000).
So could you please upload Description Part 08, just to have it complete?

Thanks again for that huge project!!!!

I knew something would go wrong

Here is Description #8:,000,000-8,000,000).zip


[Edited at 2015-03-04 20:18 GMT]

Markgraf001 (X)
Local time: 08:44
English to German
Perfect :) Mar 5, 2015

Thank you so much!


