Pages in topic:   [1 2 3 4] >
tmx from Parallel corpus of Patent Translation Resource?
Thread poster: Noe Tessmann
Noe Tessmann  Identity Verified
Local time: 12:16
English to German
+ ...
Dec 28, 2014

Dear colleagues,

I just found the Patent Translation Resource here: http://www.cl.uni-heidelberg.de/statnlpgroup/pattr/
It's really huge, about 2.8 GB zipped for EN-DE, but I couldn't find a way to make a tmx file out of it.

Has anybody already tried to do this? Is there an aligner tool somewhere out there, like the one for the EU DTM?

Thanks in advance and happy rest of the year

Noe


FarkasAndras
Local time: 12:16
English to Hungarian
+ ...
Format? Dec 28, 2014

Well, what is the format? I don't want to download 2+ GB just to see. If it's tabbed text, you can use the TMX maker in LF Aligner to generate tmx files.
From the description, the texts were aligned with Gargantua, i.e. you don't need an aligner but a converter of some sort.


Noe Tessmann  Identity Verified
Local time: 12:16
English to German
+ ...
TOPIC STARTER
Separated by hard returns; extensions en, de, meta Dec 28, 2014

FarkasAndras wrote:

Well, what is the format? I don't want to download 2+ GB just to see. If it's tabbed text, you can use the TMX maker in LF Aligner to generate tmx files.


Hi Andras,

Hard to say. I just see different directories with files like pattr.de-en.claims.en, pattr.de-en.claims.de and pattr.de-en.claims.meta; unzipped, they are 2.9, 3.2 and 0.3 GB.

I can open the titles file. The sentences are separated by hard returns. See text below.

Happy last days of 2014

Noe

A lifting device.
Method for the oxidation of quinine to quininone and quinidinone.
Process for the polymerisation of alpha-olefins and method for preparing solid catalytic complexes for use in this polymerisation process.
Process for obtaining acrylic acid from its solutions in tri-n-butyl phosphate.
Apparatus for the thermal treatment of a padding material formed from fibres with a thermosetting bonding material.
Triazol-substituted sulphur derivatives, their preparation and their utilisation as fungicides.
Process for image-wise modifying the surface of an etchable support and material suitable therefor comprising a colloid layer containing polymers with oxime-ester groups.
Use of vinylchloride polymer powders in the manufacture of battery separators.
Roundabout propelled by movement of the human body.
Etch bleaching liquid.
Resin binders containing amino groups and process for their preparation.
Aqueous air-drying alkyd dispersions and their use.
Use of alpha-polyolefin compositions for extrusion.
Process for the production and separation of hydrogen iodide and sulphuric acid and their respective uses in the production of hydrogen and oxygen.
Scanning radiographic apparatus and method.


[Edited at 2014-12-29 07:10 GMT]


FarkasAndras
Local time: 12:16
English to Hungarian
+ ...
OK Dec 29, 2014

Looks like they just put the text in separate files in parallel. I.e. line 1 in pattr.de-en.claims.en is an English sentence, line 1 in pattr.de-en.claims.de is the corresponding German sentence and line 1 in pattr.de-en.claims.meta is the corresponding metadata. Line 2 in each file is segment 2 etc. IMO it's a silly format to distribute TM files in, but it could be worse. At least it's not pdf.
I don't know of any ready-made software that can convert it into something usable, but I could do it. Email me if you want me to look into it.
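For anyone who wants to try it themselves, the line-parallel layout described above can be converted with a short script. This is only a minimal sketch, not a finished tool: it assumes the files are UTF-8 (I haven't checked), the function names `make_tu` and `parallel_to_tmx` are made up for illustration, and it ignores the .meta file. It reads both files line by line, so the whole corpus never has to fit in memory.

```python
from xml.sax.saxutils import escape


def make_tu(src, tgt, src_lang="en", tgt_lang="de"):
    """Return one TMX translation unit for an aligned segment pair."""
    return (
        "<tu>"
        f'<tuv xml:lang="{src_lang}"><seg>{escape(src)}</seg></tuv>'
        f'<tuv xml:lang="{tgt_lang}"><seg>{escape(tgt)}</seg></tuv>'
        "</tu>\n"
    )


def parallel_to_tmx(src_path, tgt_path, out_path, src_lang="en", tgt_lang="de"):
    """Pair line N of src_path with line N of tgt_path and write a TMX file.

    Assumes UTF-8 input (an assumption, not confirmed for PatTR).
    Returns the number of translation units written.
    """
    count = 0
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt, \
         open(out_path, "w", encoding="utf-8") as out:
        out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        out.write('<tmx version="1.4">\n')
        out.write(
            '<header creationtool="sketch" creationtoolversion="0" '
            'segtype="sentence" o-tmf="plain" adminlang="en" '
            f'srclang="{src_lang}" datatype="plaintext"/>\n'
        )
        out.write("<body>\n")
        # zip() reads both files in lockstep and stops at the shorter one.
        for s, t in zip(src, tgt):
            s, t = s.strip(), t.strip()
            if not s or not t:
                continue  # skip pairs where either side is empty
            out.write(make_tu(s, t, src_lang, tgt_lang))
            count += 1
        out.write("</body>\n</tmx>\n")
    return count
```

Usage would be something like `parallel_to_tmx("pattr.de-en.claims.en", "pattr.de-en.claims.de", "claims.tmx")`. For the multi-GB files the resulting TMX will be larger still, so splitting the output into several files is probably wise.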



Michael Joseph Wdowiak Beijer  Identity Verified
United Kingdom
Local time: 11:16
Member (2009)
Dutch to English
+ ...
AlignFactory? Dec 29, 2014

Hi Noe,

AlignFactory might be able to do it. I'll have a look.

Michael


FarkasAndras
Local time: 12:16
English to Hungarian
+ ...
Oh right Dec 29, 2014

Some aligners have a primitive "mesh" mode where they don't even try to align the texts; they just dump them into a tmx as is. IIRC bitext2tmx does this too (with the difference that the primitive mode is all it has). I.e. you can feed the files to it and it generates a tmx... if it can handle many file pairs at once. I'm not sure if bitext2tmx can, but AlignFactory almost certainly can.
The only issue is if you want the metadata to be conserved. I could add it as a metadata field with some custom code, AlignFactory probably can't.
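To give an idea of what "adding it as a metadata field" could look like: TMX allows `<prop>` elements inside a `<tu>`, and custom property types conventionally start with "x-". The sketch below is one way to do it, not necessarily the way any particular tool does; the function name `make_tu_with_meta` and the property type "x-meta" are made up for illustration.

```python
from xml.sax.saxutils import escape


def make_tu_with_meta(src, tgt, meta, src_lang="en", tgt_lang="de",
                      prop_type="x-meta"):
    """Build one TMX <tu> that carries the metadata line in a <prop> element.

    prop_type is a hypothetical name; any "x-..." type would do.
    """
    # Leave the <prop> out entirely when there is no metadata for this line.
    prop = f'<prop type="{prop_type}">{escape(meta)}</prop>' if meta else ""
    return (
        "<tu>"
        f"{prop}"
        f'<tuv xml:lang="{src_lang}"><seg>{escape(src)}</seg></tuv>'
        f'<tuv xml:lang="{tgt_lang}"><seg>{escape(tgt)}</seg></tuv>'
        "</tu>"
    )
```

Whether a given CAT tool displays or filters on such a custom property is a separate question and would need testing.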


Noe Tessmann  Identity Verified
Local time: 12:16
English to German
+ ...
TOPIC STARTER
Livedocs gets stuck Dec 29, 2014

Michael Beijer wrote:

Hi Noe,

AlignFactory might be able to do it. I'll have a look.

Michael


Hi Michael,

I simply tried to import the smallest files (titles) into MemoQ Livedocs. It does something but gets stuck after a while. What's the use of making a parallel corpus when you have to align it yourself?

Kind regards and a happy new year with a lot of new data mines

Noe



Michael Joseph Wdowiak Beijer  Identity Verified
United Kingdom
Local time: 11:16
Member (2009)
Dutch to English
+ ...
OK, I did one. Dec 29, 2014

OK, I did one: the folder called "abstract". I stuck the metadata in a custom attribute in the TMX and separated the individual entries with semicolons. I also removed duplicates from the TMX, which produced this:

.csv: 720,571 TUs
.tmx (after removing duplicates): 718,201 TUs
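For those who would rather script the duplicate removal, here is a rough sketch. It assumes a "duplicate" means an exact match of both source and target segment, which may not be the rule the TMX editor applies; `unique_pairs` is just an illustrative name.

```python
def unique_pairs(pairs):
    """Yield each (source, target) pair once, keeping first-seen order.

    Assumption: only exact duplicates (same source AND same target) are dropped.
    """
    seen = set()
    for src, tgt in pairs:
        key = (src, tgt)
        if key in seen:
            continue  # already emitted this exact pair
        seen.add(key)
        yield src, tgt
```

For example, `list(unique_pairs([("a", "b"), ("a", "b"), ("a", "c")]))` gives `[("a", "b"), ("a", "c")]`: the same source with a different target is kept.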

I uploaded the TMX and CSV to my server:

http://wordbook.nl/content/PatTR/


Michael

PS: I used EmEditor (open text files), Ron's CSV Editor (paste txt file content in to create .csv), Xbench (convert .csv to .tmx; Xbench automatically added the third column as a custom attribute called "x-col0"), and the Heartsome TMX editor (to edit the TMX custom attribute and clean the TMX).
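The copy-and-paste step (three text files into one tab-delimited file) could also be scripted. A minimal sketch, assuming UTF-8 files and a made-up function name `write_tab_delimited`; it replaces any tab inside a segment with a space so the three-column layout stays intact, and it does no quoting at all.

```python
def write_tab_delimited(en_path, de_path, meta_path, out_path):
    """Merge three line-parallel files into one file with three tab-separated columns.

    Assumes UTF-8 input (not verified for PatTR). Returns the number of rows written.
    """
    rows = 0
    with open(en_path, encoding="utf-8") as en, \
         open(de_path, encoding="utf-8") as de, \
         open(meta_path, encoding="utf-8") as meta, \
         open(out_path, "w", encoding="utf-8", newline="\n") as out:
        # zip() stops at the shortest file, so a truncated file shortens the output.
        for fields in zip(en, de, meta):
            # Strip line endings and neutralise stray tabs inside a field.
            clean = [f.rstrip("\r\n").replace("\t", " ") for f in fields]
            out.write("\t".join(clean) + "\n")
            rows += 1
    return rows
```

The output should then be importable wherever a tab-delimited glossary or TM file is accepted, with the metadata in the third column.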

[Edited at 2014-12-29 14:59 GMT]

[Edited at 2014-12-29 15:26 GMT]

[Edited at 2014-12-30 12:01 GMT]



Michael Joseph Wdowiak Beijer  Identity Verified
United Kingdom
Local time: 11:16
Member (2009)
Dutch to English
+ ...
stronger PC? Dec 29, 2014

Noe Tessmann wrote:

Michael Beijer wrote:

Hi Noe,

AlignFactory might be able to do it. I'll have a look.

Michael


Hi Michael,

I simply tried to import the smallest files (titles) into MemoQ Livedocs. It does something but gets stuck after a while. What's the use of making a parallel corpus when you have to align it yourself?

Kind regards and a happy new year with a lot of new data mines

Noe


Hi Noe,

Yeah, I started it in memoQ, and I think it would have completed fine, but I killed it because it seemed faster to do it manually. It might depend on your computer. Mine has 32 GB of RAM, 2 SSDs and all kinds of other bells and whistles; maybe you need a stronger PC.

My German isn't great, but is it correct that the files are already correctly aligned, in the sense that what is on line 1 in "pattr.de-en.abstract.de" corresponds to what is on line 1 in "pattr.de-en.abstract.en", etc.?

Michael

I just uploaded the TMX and CSV of the first folder ("abstract") to my server:

http://wordbook.nl/content/PatTR/

[Edited at 2014-12-30 12:01 GMT]


Noe Tessmann  Identity Verified
Local time: 12:16
English to German
+ ...
TOPIC STARTER
Wow, looks great Dec 29, 2014

Hi Michael,

My PC is broken, so I am working on a feeble replacement laptop. The alignment looks great. You did a wonderful job and you even uploaded the stuff. Wow, I am deeply amazed.

I'll see whether my old machine can handle the import. Otherwise I'll have to wait until my powerful Asus comes back from Amazon land.

All the best

Noe



Michael Joseph Wdowiak Beijer  Identity Verified
United Kingdom
Local time: 11:16
Member (2009)
Dutch to English
+ ...
@Noe: Dec 29, 2014

I'll do the others when I have a moment to spare. I'll put all the links here so others can also download them if they want. It looks like pretty good content.

In case you don't want or need the metadata, here is a TMX without it: http://wordbook.nl/content/PatTR/PatTR_abstract-(without-metadata)(TMX).zip (removing the metadata makes it a bit smaller)

Michael

Entire folder here: http://wordbook.nl/content/PatTR/


Noe Tessmann  Identity Verified
Local time: 12:16
English to German
+ ...
TOPIC STARTER
Download stops Dec 30, 2014

Dear Michael,

I don't know why but the download of any of these files stops at around 30 MB. Maybe it's my internet connection. I'll try again next year when I am home. It's not urgent at all. Thanks again for your big effort. These are not even your language pairs.

Happy last days of 2014 and a wonderful 2015

Noe



Michael Joseph Wdowiak Beijer  Identity Verified
United Kingdom
Local time: 11:16
Member (2009)
Dutch to English
+ ...
oops Dec 30, 2014

Hi Noe,

Yes, I just checked one of them and something definitely went wrong with the uploads. I'll have a look. I removed all the broken ones.

This one should work: PatTR_abstract-(without-metadata)(TMX).zip: https://www.dropbox.com/s/q62dppwexwkj9zh/PatTR_abstract-(without-metadata)(TMX).zip?dl=0

Michael



Robert Bononno  Identity Verified
United States
Local time: 05:16
French to English
+ ...
Conversion on a Mac? Jan 2, 2015

I have the source files in FR and EN but don't believe I have any software or text editor that can manipulate and join the larger files (2.7 GB, 3+ GB). One of my text editors, TextWrangler, refuses to open them. TextEdit will open the smaller ones, but I haven't tried it on the larger files. I'm very reluctant to try to manipulate these in Excel; it's going to generate a humongous file. I have 8 GB of RAM on the machine, but these are big files. It might be easier to simply search the contents of the corpus online (if possible).
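One way around editors that refuse to open multi-GB files is to read them as a stream instead of loading them whole. The sketch below (Python, which ships with a Mac) peeks at the first few lines and counts lines without holding the file in memory. It assumes the files are UTF-8, and `head` and `count_lines` are illustrative names, not part of any tool mentioned here.

```python
from itertools import islice


def head(path, n=5, encoding="utf-8"):
    """Return the first n lines of a file without reading the rest."""
    with open(path, encoding=encoding) as f:
        return [line.rstrip("\r\n") for line in islice(f, n)]


def count_lines(path):
    """Count lines by streaming the file in binary mode; memory use stays small."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)
```

If `count_lines` returns the same number for the .en and .fr files, that is at least consistent with the two files being line-aligned, which is what a line-by-line join relies on.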


Michael Joseph Wdowiak Beijer  Identity Verified
United Kingdom
Local time: 11:16
Member (2009)
Dutch to English
+ ...
my Dropbox link above is dead Jan 2, 2015

Sorry about that. I am in the process of migrating to OneDrive. The new link will be up ASAP.

Michael

OK, here are the first ones:

https://onedrive.live.com/redir?resid=66B0D0A01BB0B893!45567 ("PatTR_abstract-(with-metadata)(tab-delimited-CSV).zip")
https://onedrive.live.com/redir?resid=66B0D0A01BB0B893!45630 ("PatTR_abstract-(with-metadata)(TMX).zip")

https://onedrive.live.com/redir?resid=66B0D0A01BB0B893!45629 ("PatTR_abstract-(without-metadata)(TMX).zip")

[Edited at 2015-01-02 21:28 GMT]

