TM cleaning/shrinking (in MBs). (CAT Tools Technical Help)

Thread poster: Michael Beijer

Michael Beijer

United Kingdom
Local time: 22:48
Member (2009)
Dutch to English

Sep 23, 2010

I have 2 questions:

1. How can I easily strip out all of the formatting from a TMX, in order to make it smaller (in MBs)?

2. What is the easiest way to split a big TMX up into smaller parts?
It's around 300 MB and I'd like to cut it up into 50 MB pieces.

Any suggestions?

Michael

Heinrich Pesch

Finland
Local time: 00:48
Member (2003)
Finnish to German

Zip

Sep 23, 2010

You could try to pack the file into a zip-archive. I don't think removing formatting would save a lot of memory.
There are applications for cutting files up and gluing them back together.
Regards
Heinrich
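
As a rough illustration of the zip suggestion, here is a minimal Python sketch using the standard `zipfile` module. The function name and file names are placeholders, and how much space is saved depends entirely on the TM's content.

```python
import os
import zipfile


def zip_file(src_path, zip_path):
    """Pack src_path into a new zip archive using DEFLATE compression.

    Returns (original_size, archive_size) in bytes.
    """
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # arcname stores only the file name, not the full directory path
        zf.write(src_path, arcname=os.path.basename(src_path))
    return os.path.getsize(src_path), os.path.getsize(zip_path)


# Example call (placeholder file names):
#   before, after = zip_file("original.tmx", "original.zip")
#   print(f"{before} bytes -> {after} bytes")
```

Note that this only helps with storage and transfer: the archive has to be unpacked again before a CAT tool can read the TMX.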

FarkasAndras

Local time: 23:48
English to Hungarian

What I would do

Sep 23, 2010

Michael J.W. Beijer wrote:

2. What is the easiest way to split a big TMX up into smaller parts?
It's around 300 MB and I'd like to cut it up into 50 MB pieces.

I would use sed, a command-line text editor of sorts, coupled with wc, to make line counting easy.
TMX has a header, then TUs (text), then a footer, which is just made up of 2 tags (</body></tmx>)
So,
wc -l original.tmx
to see how many lines you have, then decide how to chop up the TMX. Say, it's 300,000 lines, then you'd want to cut it to 6 bits of 50,000 lines each.

sed -e "50000q" original.tmx > part_1.tmx
to copy the first 50,000 lines into a new file. That should be small enough to edit with a text editor, so just open it and add </body></tmx> to the end and you have your first TMX because it already has the header. Alternatively, try
sed -e "$ s|$|</body></tmx>|" part_1.tmx > part_1b.tmx
to add the footer.

Then extract the header from the original file, extract lines 50,001 to 100,000 with
sed -n "50001,100000 p" original.tmx > part_2.tmx
...merge this with the header into one file etc.

Of course this way you will chop TUs in half. You can fix that manually (open the created tmx files and move bits manually) or, if they are still too large to handle, use sed to extract lines 49995-50005 and find a TU boundary, then chop the files at the boundary instead of at round numbers like 50,000.
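
The boundary search described above can be sketched in Python as follows. This is only an illustration: the function name is made up, and it assumes the file has already been read into a list of lines and that each closing </tu> tag sits at the end of its line.

```python
def next_tu_boundary(lines, target):
    """Return the 1-based number of the first line at or after `target`
    that contains a closing </tu> tag, or None if there is none."""
    for number in range(max(target, 1), len(lines) + 1):
        if "</tu>" in lines[number - 1]:
            return number
    return None
```

The returned number could then be used in the sed commands in place of a round number like 50000, so that a part ends on a complete TU.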

Sed has a steep learning curve, but I think it's worth it.
Sed for windows
Great sed tutorial

Of course there may be tools that can do this stuff (Olifant, maybe?) but at sizes like 300MB and up, they are likely to fail, and you can often do more with these low-level solutions.
If Olifant has no such features, maybe I'll write a perl script one day that can merge and split TMX memories... Should be a nice afternoon or weekend project.

FarkasAndras

Local time: 23:48
English to Hungarian

Fine tuning

Sep 24, 2010

Come to think of it, there is a neater solution.
Separate off the header, delete the footer, insert line breaks after </tu> tags and remove all other line breaks. Then each line is one TU so you can slice & dice the TM any way you like, add the header and footer and off you go.
I'll probably write it in perl next week, then you can just drag & drop a TMX, decide the size of the chunks you want and have the TMX files made for you automagically.
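
A minimal Python sketch of that idea (not the perl script mentioned above). It treats each <tu>...</tu> block as one unit instead of literally rewriting line breaks, and it assumes the whole file fits in memory as a string, which may not hold for a 300 MB TM. The function name and the `tus_per_file` parameter are made up for illustration.

```python
import re


def split_tmx(text, tus_per_file):
    """Split TMX text into a list of complete TMX documents.

    Each returned document has the original header, at most
    `tus_per_file` translation units, and the closing footer.
    """
    body_start = re.search(r"<body[^>]*>", text)
    body_end = text.rfind("</body>")
    if body_start is None or body_end == -1:
        raise ValueError("no <body> element found")

    header = text[:body_start.end()]      # everything up to and including <body>
    footer = text[body_end:]              # </body></tmx> and anything after it
    body = text[body_start.end():body_end]

    # One list entry per translation unit, internal line breaks included.
    # The \b stops <tu from matching <tuv ...>.
    tus = re.findall(r"<tu\b.*?</tu>", body, flags=re.DOTALL)

    parts = []
    for start in range(0, len(tus), tus_per_file):
        chunk = "\n".join(tus[start:start + tus_per_file])
        parts.append(header + "\n" + chunk + "\n" + footer)
    return parts
```

Anything in the body that sits between TUs (whitespace, comments) is dropped by this sketch. Splitting by TU count rather than by MB is a simplification; the chunk size would have to be estimated from the average TU size.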

Michael Beijer

United Kingdom
Local time: 22:48
Member (2009)
Dutch to English

TOPIC STARTER

Thanks FarkasAndras!

Sep 24, 2010

FarkasAndras wrote:

Come to think of it, there is a neater solution.
Separate off the header, delete the footer, insert line breaks after </tu> tags and remove all other line breaks. Then each line is one TU so you can slice & dice the TM any way you like, add the header and footer and off you go.
I'll probably write it in perl next week, then you can just drag & drop a TMX, decide the size of the chunks you want and have the TMX files made for you automagically.

Wow, that would be a great little program to have!

Thanks,

Michael

FarkasAndras

Local time: 23:48
English to Hungarian

perl is a good friend to have

Sep 24, 2010

Michael J.W. Beijer wrote:

Wow, that would be a great little program to have!

Thanks,

Michael

Yes, things like this often come in handy for me, too.

I'll probably add tmx merging and conversion to tab delimited txt while I'm at it. Perl makes these relatively easy.
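
For comparison, a minimal Python sketch of the TMX to tab-delimited conversion using the standard `xml.etree.ElementTree` module. The function names are made up, the language codes are whatever your TMX uses, and the whole file is parsed in memory here; for a very large TM an incremental parser such as `ET.iterparse` would probably be a better fit.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this expanded name
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_rows(tmx_text, langs):
    """Return one list of segment strings per <tu>, ordered as in `langs`.

    A language missing from a TU yields an empty string.
    """
    root = ET.fromstring(tmx_text)
    wanted = [code.lower() for code in langs]
    rows = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            # TMX 1.4 uses xml:lang; older files use a plain lang attribute
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also returns the text inside inline elements
                text = "".join(seg.itertext())
                # Collapse tabs and line breaks so each TU stays on one line
                segs[lang] = " ".join(text.split())
        rows.append([segs.get(code, "") for code in wanted])
    return rows


def rows_to_tsv(rows):
    """Join rows into tab-delimited text, one TU per line."""
    return "\n".join("\t".join(row) for row in rows)
```

Note that inline formatting codes end up in the output as plain text, because only the tags around them are dropped.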

Michael Beijer

United Kingdom
Local time: 22:48
Member (2009)
Dutch to English

TOPIC STARTER

TmTxT 1.0

Sep 24, 2010

Cool, tmx merging and conversion to tab delimited txt would be great.

While you're at it, how about tab delimited txt -> TMX as well, that would make it truly complete;)

1. TMX cutting/merging
2. TMX <-> tab delimited txt

Michael

FarkasAndras

Local time: 23:48
English to Hungarian

TMX creator

Sep 25, 2010

Michael J.W. Beijer wrote:

Cool, tmx merging and conversion to tab delimited txt would be great.

While you're at it, how about tab delimited txt -> TMX as well, that would make it truly complete;)

1. TMX cutting/merging
2. TMX <-> tab delimited txt

Michael

I have that covered, I have a pretty full-featured TMX creator in my aligner project with an improved version on the way:
sourceforge.net/projects/aligner

There are a couple of other solutions as well, such as plustools and xbench.

FarkasAndras

Local time: 23:48
English to Hungarian

Done, sort of

Oct 2, 2010

I have a working Windows program available for download: http://www.mediafire.com/?1mzcsym0a0t8pbf
Here's the perl script for Linux/mac users: http://www.mediafire.com/?v6os257p6qj424n

There is one issue with it, and, as always, it's encoding. I can find no easy way of handling all encodings automatically, so I just let the user handle it. Try UTF-8 first: just press enter on the first prompt without typing anything. If it doesn't work (you get corrupted characters), try UCS-2LE, UCS-2BE, UTF-16, UTF-16LE and UTF-16BE.
Trados exports into UTF-16LE encoding, but I couldn't get the script to work with Trados TMX files and I can't summon the mental strength to troubleshoot encoding problems for the 783297823th time in the last 2 months.

Of course this is only an issue with very large TMs as in the OP. With a smaller TM, you can just open the file in a text editor and simply resave it in UTF-8 (if the header announces the encoding, change it to say "utf-8"). Then again, with smaller TMs, you don't really need a tool like this in the first place... oops.
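
For large files that a text editor cannot open, the resave-as-UTF-8 step can be sketched in Python. This is not what the program above does; it is an illustration that assumes the file starts with a byte order mark (BOM) when it is UTF-16, and falls back to UTF-8 otherwise. The function names are made up, and files without a BOM in other encodings will not be detected.

```python
import codecs

# Python's "utf-16" codec reads the BOM to pick the byte order and strips it;
# "utf-8-sig" strips a UTF-8 BOM if present.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(first_bytes):
    """Guess a Python codec name from the first bytes of a file."""
    for bom, name in BOMS:
        if first_bytes.startswith(bom):
            return name
    # No BOM found: fall back to UTF-8
    return "utf-8"


def convert_to_utf8(src_path, dst_path, chunk_chars=1 << 20):
    """Copy src_path to dst_path as UTF-8 without loading it all at once.

    Returns the codec name that was used for reading.
    """
    with open(src_path, "rb") as f:
        encoding = guess_encoding(f.read(4))
    # newline="" keeps the original line endings untouched
    with open(src_path, "r", encoding=encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        while True:
            chunk = src.read(chunk_chars)
            if not chunk:
                break
            dst.write(chunk)
    return encoding
```

The sketch does not touch the XML declaration, so if the header announces the encoding, it still has to be changed to say "utf-8" as described above.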

The program needs to write temporary files to the folder where the input tmx is, so if you have files with the same name there, they will get deleted. If for example your tmx is called largememory.tmx, the file names you have to avoid are largememory_[anything].txt and largememory_part[number].tmx.

Apart from the encoding problem, I don't really expect any issues. The files get the original header and the original content so this shouldn't go wrong.

I'll probably also write a TMX merger, but that's more tricky: one has to decide how to merge the headers and how to match up the language codes in the files. And if the files are in different encodings, it gets really messy.

[Edited at 2010-10-02 21:19 GMT]

FarkasAndras

Local time: 23:48
English to Hungarian

New version

Oct 3, 2010

New TMX chopper that can strip formatting: http://www.mediafire.com/?9c7c7imq3raudtx
(perl script: http://www.mediafire.com/?m6anf0yn8y2wx8y )

TMX -> tab delimited converter: http://www.mediafire.com/?gddr3v4iy5vo8kg (perl script: http://www.mediafire.com/?pbsed2opdg8mf7k )
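
How the chopper strips formatting is not shown in the thread. As an illustration only, here is a minimal Python sketch of one way to do it, assuming the TM uses the TMX 1.4 inline elements (bpt, ept, it, ph, ut for native codes, hi for highlighted text). The function name is made up, and regular expressions are a shortcut that will not handle every nesting case.

```python
import re

# Self-closing inline elements, e.g. <ph x="1"/>
EMPTY_CODES = re.compile(r"<(?:bpt|ept|it|ph|ut)\b[^>]*/>")
# Inline elements whose content is native formatting code: drop tag and content
CODE_ELEMENTS = re.compile(r"<(bpt|ept|it|ph|ut)\b[^>]*>.*?</\1>", re.DOTALL)
# <hi> wraps translatable text, so only the tags themselves are removed
HI_TAGS = re.compile(r"</?hi\b[^>]*>")


def strip_inline_codes(tmx_text):
    """Remove inline formatting elements from TMX text, keeping the segment text."""
    # Self-closing forms go first so they are not mistaken for opening tags
    text = EMPTY_CODES.sub("", tmx_text)
    text = CODE_ELEMENTS.sub("", text)
    return HI_TAGS.sub("", text)
```

Any text inside a <sub> element nested in one of the code elements is removed along with the code.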

Michael Beijer

United Kingdom
Local time: 22:48
Member (2009)
Dutch to English

TOPIC STARTER

Thanks a lot FarkasAndras!

Oct 13, 2010

I am right now trying them out.

The TMX chopper seems to work perfectly. I just chopped a 35 MB (11,5130 ) TMX into 6 smaller TMXs.

I am going to try the converter now...

Michael

p.s. Hope you don't mind, but I added them to my "interesting computer stuff for translators" list, here: http://beijer.mx/computer.html ....

[Edited at 2010-10-13 20:40 ...]

FarkasAndras

Local time: 23:48
English to Hungarian

Glad it worked

Oct 13, 2010

So your TMX was in UTF-8 then?
Do check characters with diacritic marks to make sure they survived... if they did, everything should be fine. Of course if you find one correct occurrence of a given character in the file, they will all be fine so this should take no more than a minute.

Of course I don't mind if you drop a link to this on your site, although the files won't be available for long. Mediafire takes them off after a while. My aligner project on sourceforge is a more...

Michael Beijer

United Kingdom
Local time: 22:48
Member (2009)
Dutch to English

TOPIC STARTER

@FarkasAndras

Jun 14, 2012

Hey, I was just playing around with your TMX Chopper again today and I was wondering if there is a way to NOT remove so many line endings in the conversion process? The reason I am asking is that the TMXs it produces don't display very nicely in Copernic Desktop Search. (I am using your tool to chop up all my very large TMXs so I can index them.)

Michael

FarkasAndras

Local time: 23:48
English to Hungarian

line breaks

Jun 15, 2012

Michael Beijer wrote:

Hey, I was just playing around with your TMX Chopper again today and I was wondering if there is a way to NOT remove so many line endings in the conversion process? The reason I am asking is that the TMXs it produces don't display very nicely in Copernic Desktop Search. (I am using your tool to chop up all my very large TMXs so I can index them.)

Michael

Lines are merged so that the programme can count TUs and make sure that it chops the file between two TUs and not in the middle of one.
It would be complicated to restore the original line breaks, but I inserted line breaks after the various parts of each TU.
I also just noticed that the script writes the output files in the folder where the script itself is, instead of the folder where the original tmx is. I can't be bothered to fix that at the moment, though...
Download (grab bag 1.7): www.sourceforge.net/projects/aligner

MikeTrans
Germany
Local time: 23:48
Italian to German

Olifant & UltraEdit

Jun 15, 2012

For such operations, especially with very big databases, I use

Olifant (Okapi Framework; open source)
A TMX manager that loads the file into your available RAM, so it is very fast. It can import tab-delimited text, and there you can tick the option "remove XML codes".
Supports RegEx and SQL filtering and flagging features.

UltraEdit:
A powerful text editor, also based on your RAM for handling files. I use this one to split any big files, also TMX files. In this case, just open a new file, copy the TMX header and the end closing XML statement, then copy-paste all the [open tu] up to [end tu] you want in your split file.
With UltraEdit you can easily build macros and scripts for complex, repeating operations.

Note:
I think you know this, but anyway: removing codes is not always to your advantage, and it only shrinks your TM minimally.
For example, if a client asks you to update a document from months ago and you no longer have the codes, you are in trouble and will lose time dealing with the codes again.

Greets,
Mike

[Edited at 2012-06-15 13:04 GMT]
