TM cleaning/shrinking (in MBs). Thread poster: Michael Beijer
| Michael Beijer United Kingdom Local time: 22:48 Member (2009) Dutch to English + ...
I have 2 questions: 1. How can I easily strip out all of the formatting from a TMX, in order to make it smaller (in MBs)? 2. What is the easiest way to split a big TMX up into smaller parts? It's around 300 MB and I'd like to cut it up into 50 MB pieces. Any suggestions? Michael | | | Heinrich Pesch Finland Local time: 00:48 Member (2003) Finnish to German + ...
You could try packing the file into a zip archive. I don't think removing formatting would save a lot of space. There are applications for cutting files up and gluing them back together. Regards Heinrich | | | What I would do | Sep 23, 2010 |
Michael J.W. Beijer wrote: 2. What is the easiest way to split a big TMX up into smaller parts? It's around 300 MB and I'd like to cut it up into 50 MB pieces.
I would use sed, a command-line text editor of sorts, coupled with wc to make line counting easy. A TMX has a header, then the TUs (the text), then a footer, which is just made up of 2 closing tags (</body></tmx>). So, run
    wc -l original.tmx
to see how many lines you have, then decide how to chop up the TMX. Say it's 300,000 lines; then you'd want to cut it into 6 pieces of 50,000 lines each.
    sed -e "50000q" original.tmx > part_1.tmx
copies the first 50,000 lines into a new file. That should be small enough to edit with a text editor, so just open it and add </body></tmx> to the end and you have your first TMX, because it already has the header. Alternatively, try
    sed -e "$ s|$|</body></tmx>|" part_1.tmx > part_1b.tmx
to add the footer (the | delimiter is needed because the footer itself contains slashes). Then extract the header from the original file, extract lines 50,001 to 100,000 with
    sed -n "50001,100000 p" original.tmx > part_2.tmx
...merge this with the header into one file, add the footer, etc.
Of course this way you will chop TUs in half. You can fix that manually (open the created tmx files and move bits manually) or, if they are still too large to handle, use sed to extract lines 49995-50005 and find a TU boundary, then chop the files at the boundary instead of at round numbers like 50,000. Sed has a steep learning curve, but I think it's worth it. (Links: Sed for Windows, Great sed tutorial.)
Of course there may be tools that can do this stuff (Olifant, maybe?) but at sizes like 300 MB and up, they are likely to fail, and you can often do more with these low-level solutions. If Olifant has no such features, maybe I'll write a perl script one day that can merge and split TMX memories... Should be a nice afternoon or weekend project. | | |
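The "find a TU boundary near the round number" step can also be automated. Below is a minimal Python sketch (not the sed workflow itself, and not the Perl tool discussed later in the thread). The function name `tu_safe_cuts` is made up for illustration, and the sketch assumes the TMX is pretty-printed so that every closing </tu> tag is followed by a line break.

```python
def tu_safe_cuts(lines, chunk_lines):
    """Return line indices at which to cut so that no TU is split.

    Each cut starts at a round multiple of chunk_lines and is pushed
    forward to just after the next line that contains </tu>.
    Assumption: every </tu> is followed by a line break.
    """
    if chunk_lines < 1:
        raise ValueError("chunk_lines must be at least 1")
    cuts = []
    target = chunk_lines
    while target < len(lines):
        i = target - 1
        # Walk forward until the current line closes a TU.
        while i < len(lines) and "</tu>" not in lines[i]:
            i += 1
        cut = i + 1
        # Stop if nothing but the footer would remain after this cut.
        if not any("<tu" in line for line in lines[cut:]):
            break
        cuts.append(cut)
        target = cut + chunk_lines
    return cuts
```

Each returned index is where one piece ends and the next begins, for example `lines[:cuts[0]]` is the first piece. The header still has to be prepended to every piece after the first, and the footer appended to every piece except the last, as described above.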
Come to think of it, there is a neater solution. Separate off the header, delete the footer, insert line breaks after </tu> tags and remove all other line breaks. Then each line is one TU so you can slice & dice the TM any way you like, add the header and footer and off you go. I'll probably write it in perl next week, then you can just drag & drop a TMX, decide the size of the chunks you want and have the TMX files made for you automagically. | |
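The one-TU-per-line idea could look roughly like this in Python. This is only a sketch under stated assumptions, not the Perl script mentioned here: the names `split_tmx_text`, `TU_RE` and `LINEBREAK_RE` are invented for the example, the whole file is held in memory as one string, and a line break inside a segment is replaced by a single space.

```python
import re

# Matches one complete <tu>...</tu> element (but not <tuv>).
TU_RE = re.compile(r"<tu[\s>].*?</tu>", re.DOTALL)
# A run of line breaks together with the whitespace around it.
LINEBREAK_RE = re.compile(r"\s*[\r\n]+\s*")


def split_tmx_text(text, tus_per_chunk):
    """Split TMX text into complete TMX documents of tus_per_chunk TUs each."""
    matches = list(TU_RE.finditer(text))
    if not matches:
        raise ValueError("no <tu> elements found")
    # Everything before the first TU is the header, everything after the last is the footer.
    header = text[:matches[0].start()].rstrip()
    footer = text[matches[-1].end():].strip()
    # Put each TU on exactly one line.
    tus = [LINEBREAK_RE.sub(" ", m.group(0)) for m in matches]
    chunks = []
    for start in range(0, len(tus), tus_per_chunk):
        body = "\n".join(tus[start:start + tus_per_chunk])
        chunks.append(f"{header}\n{body}\n{footer}\n")
    return chunks
```

Every returned string carries the original header and footer, so each can be written straight to its own .tmx file. Chunks are sized in TUs rather than MB here; a size limit would need an extra running byte count.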
Michael Beijer United Kingdom Local time: 22:48 Member (2009) Dutch to English + ... TOPIC STARTER Thanks FarkasAndras! | Sep 24, 2010 |
FarkasAndras wrote: Come to think of it, there is a neater solution. Separate off the header, delete the footer, insert line breaks after </tu> tags and remove all other line breaks. Then each line is one TU so you can slice & dice the TM any way you like, add the header and footer and off you go. I'll probably write it in perl next week, then you can just drag & drop a TMX, decide the size of the chunks you want and have the TMX files made for you automagically.
Wow, that would be a great little program to have! Thanks, Michael | | | perl is a good friend to have | Sep 24, 2010 |
Michael J.W. Beijer wrote: FarkasAndras wrote: Come to think of it, there is a neater solution. Separate off the header, delete the footer, insert line breaks after </tu> tags and remove all other line breaks. Then each line is one TU so you can slice & dice the TM any way you like, add the header and footer and off you go. I'll probably write it in perl next week, then you can just drag & drop a TMX, decide the size of the chunks you want and have the TMX files made for you automagically. Wow, that would be a great little program to have! Thanks, Michael
Yes, things like this often come in handy for me, too. I'll probably add tmx merging and conversion to tab delimited txt while I'm at it. Perl makes these relatively easy. | | | Michael Beijer United Kingdom Local time: 22:48 Member (2009) Dutch to English + ... TOPIC STARTER
Cool, tmx merging and conversion to tab delimited txt would be great. While you're at it, how about tab delimited txt -> TMX as well? That would make it truly complete ;) 1. TMX cutting/merging 2. TMX <-> tab delimited txt Michael | | |
Michael J.W. Beijer wrote: Cool, tmx merging and conversion to tab delimited txt would be great. While you're at it, how about tab delimited txt -> TMX as well, that would make it truly complete;) 1. TMX cutting/merging 2. TMX <-> tab delimited txt Michael
I have that covered: I have a pretty full-featured TMX creator in my aligner project, with an improved version on the way: sourceforge.net/projects/aligner There are a couple of other solutions as well, such as plustools and xbench. | |
Done, sort of | Oct 2, 2010 |
I have a working Windows program available for download: http://www.mediafire.com/?1mzcsym0a0t8pbf Here's the perl script for Linux/mac users: http://www.mediafire.com/?v6os257p6qj424n
There is one issue with it, and, as always, it's encoding. I can find no easy way of handling all encodings automatically, so I just let the user handle it. Try UTF-8 first: just press enter on the first prompt without typing anything. If it doesn't work (you get corrupted characters), try UCS-2LE, UCS-2BE, UTF-16, UTF-16LE and UTF-16BE. Trados exports into UTF-16LE encoding, but I couldn't get the script to work with Trados TMX files and I can't summon the mental strength to troubleshoot encoding problems for the 783297823rd time in the last 2 months. Of course this is only an issue with very large TMs as in the OP. With a smaller TM, you can just open the file in a text editor and simply resave it in UTF-8 (if the header announces the encoding, change it to say "utf-8"). Then again, with smaller TMs, you don't really need a tool like this in the first place... oops.
The program needs to write temporary files to the folder where the input tmx is, so if you have files with the same name there, they will get deleted. If for example your tmx is called largememory.tmx, the file names you have to avoid are largememory_[anything].txt and largememory_part[number].tmx.
Apart from the encoding problem, I don't really expect any issues. The files get the original header and the original content so this shouldn't go wrong. I'll probably also write a TMX merger, but that's more tricky: one has to decide how to merge the headers and how to match up the language codes in the files. And if the files are in different encodings, it gets really messy.
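For files too large to resave from a text editor, the guessing can often be shortened by looking at the first few bytes. The following is a hedged Python sketch, not part of the tool above: `sniff_tmx_encoding` and `convert_to_utf8` are hypothetical names, only UTF-8 and UTF-16 are considered (UTF-32 and legacy code pages are ignored), and a file without a byte order mark is assumed to start with a "<" character.

```python
import codecs


def sniff_tmx_encoding(raw):
    """Guess the encoding of a TMX file from its first bytes."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        # Python's "utf-16" codec reads the BOM and picks the byte order itself.
        return "utf-16"
    # No BOM: an XML file starts with "<", padded with a NUL byte in UTF-16.
    if raw[:2] == b"<\x00":
        return "utf-16-le"
    if raw[:2] == b"\x00<":
        return "utf-16-be"
    return "utf-8"


def convert_to_utf8(src_path, dst_path, chunk_chars=1 << 20):
    """Re-encode a file to UTF-8 block by block, so it never sits fully in memory."""
    with open(src_path, "rb") as f:
        encoding = sniff_tmx_encoding(f.read(4))
    with open(src_path, "r", encoding=encoding, newline="") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for block in iter(lambda: src.read(chunk_chars), ""):
            dst.write(block)
    return encoding
```

The sketch does not touch the XML declaration, so if the header announces an encoding it still has to be changed to say "utf-8" afterwards.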
[Edited at 2010-10-02 21:19 GMT] | | | | Michael Beijer United Kingdom Local time: 22:48 Member (2009) Dutch to English + ... TOPIC STARTER Thanks a lot FarkasAndras! | Oct 13, 2010 |
I am right now trying them out. The TMX chopper seems to work perfectly. I just chopped a 35 MB (11,5130 ) TMX into 6 smaller TMXs. I am going to try the converter now... Michael p.s. Hope you don't mind, but I added them to my "interesting computer stuff for translators" list, here: http://beijer.mx/computer.html ....
[Edited at 2010-10-13 20:40 GMT] | | | Glad it worked | Oct 13, 2010 |
So your TMX was in UTF-8 then? Do check characters with diacritic marks to make sure they survived... if they did, everything should be fine. If you find one correct occurrence of a given character in the file, all its other occurrences will be fine too, so this should take no more than a minute. Of course I don't mind if you drop a link to this on your site, although the files won't be available for long. Mediafire takes them down after a while. My aligner project on sourceforge is a more serious deal that will probably be available for years. I'll probably include these tools in the aligner download packages at some point. | |
Michael Beijer United Kingdom Local time: 22:48 Member (2009) Dutch to English + ... TOPIC STARTER @FarkasAndras | Jun 14, 2012 |
Hey, I was just playing around with your TMX Chopper again today and I was wondering if there is a way to NOT remove so many line endings in the conversion process? The reason I am asking is that the TMXs it produces don't display very nicely in Copernic Desktop Search. (I am using your tool to chop up all my very large TMXs so I can index them.) Michael | | |
Michael Beijer wrote: Hey, I was just playing around with your TMX Chopper again today and I was wondering if there is a way to NOT remove so many line endings in the conversion process? The reason I am asking is that the TMXs it produces don't display very nicely in Copernic Desktop Search. (I am using your tool to chop up all my very large TMXs so I can index them.) Michael
Lines are merged so that the programme can count TUs and make sure that it chops the file between two TUs and not in the middle of one. It would be complicated to restore the original line breaks, but I inserted line breaks after the various parts of each TU. I also just noticed that the script writes the output files in the folder where the script itself is, instead of the folder where the original tmx is. I can't be bothered to fix that at the moment, though... Download (grab bag 1.7): www.sourceforge.net/projects/aligner | | | MikeTrans Germany Local time: 23:48 Italian to German + ... Olifant & UltraEdit | Jun 15, 2012 |
For such operations, especially with very big databases, I use Olifant (Okapi Framework; open source), a TMX manager that loads a file based on your available RAM and is thus very fast. It can import tab-delimited text, and there you can tick the option "remove XML codes". It supports RegEx and SQL filtering and flagging features.
UltraEdit: A powerful text editor that also handles files based on your RAM. I use this one to split any big files, including TMX files. In this case, just open a new file, copy in the TMX header and the closing XML tags at the end, then copy-paste all the TUs (from [open tu] up to [end tu]) you want in your split file. With UltraEdit you can easily build macros and scripts for complex, repetitive operations.
Note: I think you know this, but anyway: removing codes is not always to your advantage, and it only minimally shrinks your TM. For example, if a client asks you to update a document from months ago and you don't have the codes anymore, you are in trouble and will lose time dealing with the codes again. Greets, Mike
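For anyone who still wants to strip the codes with a script (the first question in this thread), here is a rough Python sketch. It is not how Olifant's "remove XML codes" option works, which this thread does not describe. The names are invented, it assumes the standard TMX inline elements (bpt, ept, it, ph, ut, hi) inside plain <seg> tags, and any <sub> text nested inside a code is dropped along with the code.

```python
import re

# Inline elements whose content is native formatting code: drop tag and content.
CODE_ELEMENTS = re.compile(r"<(bpt|ept|it|ph|ut)\b[^>]*?(?:/>|>.*?</\1>)", re.DOTALL)
# <hi> wraps translatable text: drop only the tags, keep the text.
HI_TAGS = re.compile(r"</?hi\b[^>]*>")
SEG_RE = re.compile(r"(<seg>)(.*?)(</seg>)", re.DOTALL)


def strip_inline_codes(seg_inner):
    """Remove inline formatting codes from the inside of one <seg>."""
    return HI_TAGS.sub("", CODE_ELEMENTS.sub("", seg_inner))


def strip_codes_in_tmx(text):
    """Apply strip_inline_codes to every <seg> in a piece of TMX text."""
    return SEG_RE.sub(
        lambda m: m.group(1) + strip_inline_codes(m.group(2)) + m.group(3),
        text,
    )
```

Run it on a copy and keep the original TM, since the codes cannot be recovered afterwards, which is exactly the risk described in the note above.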
[Edited at 2012-06-15 13:04 GMT]