Cleaning up Translation Memories: Best practices
Thread poster: trhanslator (X)

trhanslator (X)
Jan 25, 2013

Today I started migrating yet another client's TMs from Transit to CafeTran. Below I describe what I've done so far and what I still plan to do. If you can think of any additions, please don't hesitate to reply to this message.

Purpose: Convert a legacy TM into a clean TM for use in CafeTran that is as small as possible.

Steps in Transit:

1. Export folder structure with Transit language pairs to TMX UTF-8. This results in one large TMX file.

Steps in Olifant:

Note: The order of the steps is critical.

1. Load the TM, remove all inline codes from all TUs.
2. Since CafeTran is very well able to automatically adapt numbers in TUs: Replace all numbers in both source and target with 0.
3. Remove all TU attributes (change date, user, file name etc.)
4. Perform some Find and Replace actions to insert non-breaking spaces before position numbers, make all curly/straight quotes in the target identical etc.
5. Remove all identical TUs (leaving only one TU per series of identical TUs); I chose to flag all TUs with identical source, case-insensitive.
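For anyone who would rather script this than click through Olifant, here is a minimal Python sketch of steps 1, 3 and 5 applied to a TMX file. It is not what Olifant does internally. The function names and the `src_lang` parameter are my own, and I'm assuming that "TU attributes" means the attributes on `<tu>` plus its `<prop>` children, and that everything inside `<bpt>`, `<ept>`, `<ph>`, `<it>` and `<ut>` can be thrown away.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
CODE_TAGS = {"bpt", "ept", "ph", "it", "ut"}  # TMX inline code elements


def plain_text(elem):
    """Text of elem without inline codes; text wrapped in e.g. <hi> is kept."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag not in CODE_TAGS:
            parts.append(plain_text(child))
        parts.append(child.tail or "")
    return "".join(parts)


def strip_inline_codes(seg):
    """Step 1: replace the content of a <seg> with its plain text."""
    text = plain_text(seg)
    for child in list(seg):
        seg.remove(child)
    seg.text = text


def source_text(tu, src_lang):
    """Return the segment text of the <tuv> in src_lang, or None."""
    for tuv in tu.findall("tuv"):
        lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
        if lang.lower() == src_lang.lower():
            return tuv.findtext("seg", default="")
    return None


def clean_tmx(root, src_lang):
    """Apply steps 1, 3 and 5 in that order to a parsed <tmx> element."""
    body = root.find("body")
    seen = set()
    for tu in list(body.findall("tu")):
        for seg in tu.findall("tuv/seg"):
            strip_inline_codes(seg)              # step 1
        tu.attrib.clear()                        # step 3 (assumption: <tu> level only)
        for prop in tu.findall("prop"):
            tu.remove(prop)
        source = source_text(tu, src_lang)
        if source is None:
            continue
        key = source.casefold()                  # step 5: case-insensitive source
        if key in seen:
            body.remove(tu)                      # keep only the first TU of a series
        else:
            seen.add(key)
    return root
```

To use it on a file, parse the export with `ET.parse(...)`, pass `tree.getroot()` and the source language code to `clean_tmx`, and write the tree back out. Stripping the codes first matters for the last step: two TUs that differ only in their tags end up with the same source text and are then caught as duplicates.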

Steps in CafeTran:

1. Assign the TM to a project.
2. Click the Filter button for the assigned TM.

Tick:
Remove source=target

(this step is possible in Olifant too, I know)
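Outside CafeTran or Olifant, the same source=target filter could be sketched in Python roughly as follows. This is only my guess at the rule (a TU goes when all of its segments have identical text); `drop_untranslated` is a made-up name, not a CafeTran function.

```python
import xml.etree.ElementTree as ET


def drop_untranslated(root):
    """Remove every TU whose segments all have the same text; return the count."""
    body = root.find("body")
    removed = 0
    for tu in list(body.findall("tu")):
        # itertext() also collects text that sits inside or after inline codes.
        segs = ["".join(seg.itertext()).strip() for seg in tu.findall("tuv/seg")]
        if len(segs) >= 2 and len(set(segs)) == 1:
            body.remove(tu)
            removed += 1
    return removed
```

The comparison here is exact and case-sensitive, so a product name that stays the same in both languages is removed, while a segment that only changes its capitalisation is kept.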

What other possibilities are there to further reduce the size of the TM?

BTW: I've sent an RFE to Igor to implement step 5 of the Olifant cleaning in CafeTran too. Perhaps step 3 would be nice too. Step 1 is already possible in CafeTran.


[Edited at 2013-01-25 19:27 GMT]


 

Dominique Pivard  Identity Verified
Local time: 10:37
Finnish to French
Why replace numbers with 0? Jan 26, 2013

trhanslator wrote:
2. Since CafeTran is very well able to automatically adapt numbers in TUs: Replace all numbers in both source and target with 0.

I'm not sure I follow you on this one. Are you saying that if you have the following segment in your TM:
Apple says it made profit of $13.1 bn on revenue of $54.5 bn in fiscal quarter that ended on 29 Dec

... you would change it so that it looks like this:
Apple says it made profit of $0 bn on revenue of $0 bn in fiscal quarter that ended on 0 Dec

What would be the rationale of such a change?



[Edited at 2013-01-26 07:30 GMT]


 

trhanslator (X)
TOPIC STARTER
Eliminate double TUs that only differ in numbers Jan 26, 2013

The reason for resetting all numbers to zero is to remove duplicate TUs that differ only in their numerical data:

The value is increased to 5 degrees.

The value is increased to 6 degrees.

The value is increased to 7 degrees.

There is no need to save this TU three times in the TM. I replace 5, 6 and 7 with a zero (which I've chosen arbitrarily).

The idea for this came from Transit, which offers 'Remove all segments that only differ in numbers' in the TM maintenance menu. Transit doesn't reset to zero; it keeps the number of the first TU in the TM (in this case: 5).
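To make the idea concrete, here is a small Python sketch of the zero-reset followed by de-duplication. The regular expression is my own assumption about what counts as "a number" (digits, optionally with `.` or `,` groups); neither Olifant nor Transit necessarily uses this definition.

```python
import re

# Assumption: a number is a run of digits with optional . or , separated groups.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def mask_numbers(text):
    """Replace every number in text with a single 0."""
    return NUMBER.sub("0", text)


def dedupe_ignoring_numbers(pairs):
    """Keep one (source, target) pair per source that differs only in numbers."""
    seen = set()
    kept = []
    for source, target in pairs:
        key = mask_numbers(source)
        if key not in seen:
            seen.add(key)
            kept.append((key, mask_numbers(target)))
    return kept
```

With the three example segments above, only one pair survives, with the source "The value is increased to 0 degrees." To get the Transit behaviour instead, append the original `(source, target)` of the first TU rather than the masked versions.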


 

trhanslator (X)
TOPIC STARTER
Replace multiple spaces with one space Jan 26, 2013

The very first step should have been:

In TextWrangler:

Find and Replace dialog:

Find: \s+
Replace: enter a real space
Tick the Grep checkbox

Note that this procedure removes double spaces both from source and target. I use TW for this because Olifant will freeze when you have a large TM with many double spaces (probably a RAM issue).
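The same clean-up can be sketched in Python. One caveat: in Python, `\s` also matches line breaks, tabs and (for text strings) non-breaking spaces, so this sketch deliberately uses a narrower pattern that only touches runs of ordinary spaces. That choice is mine, not part of the TextWrangler recipe above.

```python
import re

# Two or more ordinary spaces in a row; non-breaking spaces and line breaks are left alone.
MULTI_SPACE = re.compile(r" {2,}")


def collapse_spaces(text):
    """Replace each run of ordinary spaces with a single space."""
    return MULTI_SPACE.sub(" ", text)
```

Leaving non-breaking spaces untouched keeps the ones inserted on purpose (for example before position numbers) intact.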


 

trhanslator (X)
TOPIC STARTER
Frodo the time saver cleans TMs Feb 1, 2013

trhanslator wrote:
5. Remove all identical TUs (leaving only one TU per series of identical TUs); I chose to flag all TUs with identical source, case-insensitive.

CafeTran's developers must be reading along here: Frodo now offers this very handy feature!

Thanks CafeTran!


 

trhanslator (X)
TOPIC STARTER
Easy TMX editing with CafeTran Feb 8, 2013

I spent this morning cleaning up one of my large TMs with the newest build of CafeTran.

As you probably know, CafeTran doesn't import TMX files, it just opens them and uses them directly. No conversion, no time wasting.

Some examples of cleaning actions:

CafeTran can now filter on segments that don't contain any letters. After the filtering, you can delete them.

CafeTran can filter on any regular expression (e.g. \APosition) to find TUs that contain a certain string at a certain position (e.g. at the start of a TU). After that you can quickly remove those TUs from the TMX file.
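For readers who want to try these two filters outside CafeTran, a rough Python equivalent might look like this. The helper names are invented for the example, and I can't promise that CafeTran's regex flavour behaves exactly like Python's `re` module, although `\A` (start of the text) works the same way here.

```python
import re


def has_no_letters(text):
    """True if the segment contains no letters at all (numbers, codes, punctuation)."""
    return not any(ch.isalpha() for ch in text)


def matches(pattern, text):
    """True if the regular expression is found anywhere in the segment."""
    return re.search(pattern, text) is not None


def remove_matching(pairs, predicate):
    """Drop every (source, target) pair whose source satisfies predicate."""
    return [(src, tgt) for src, tgt in pairs if not predicate(src)]
```

For example, `remove_matching(pairs, has_no_letters)` drops number-only segments, and `remove_matching(pairs, lambda s: matches(r"\APosition", s))` drops everything that starts with "Position".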

The beauty of all this is that cleaning/modifying your TM is so fast and simple that you can do it in a few seconds during your translation work. No re-import needed.

Thanks Igor!


 

Michael Beijer  Identity Verified
United Kingdom
Local time: 08:37
Member (2009)
Dutch to English
Looking for a standalone TMX validator Jul 6, 2013

Hi Hans (or anyone else that's reading),

Does anyone know where I can get my hands on a TMX validator, preferably a free one, although I'd be willing to pay a small price? I am also looking for one that is standalone. That is, not like the one (TMXValidator) that now comes bundled with Swordfish.

It should be able to:

1. validate a file
2. clean invalid characters
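As a stopgap, both points can be approximated with a few lines of Python. To be clear about the limits: this only checks that the file is well-formed XML with a `<tmx>` root; it does not validate against the TMX DTD (Python's standard library can't do that), so it is no replacement for a real validator. The function names are made up for this sketch.

```python
import re
import xml.etree.ElementTree as ET

# Everything outside the character ranges that XML 1.0 allows.
INVALID_XML_CHARS = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def is_well_formed_tmx(text):
    """True if text parses as XML and its root element is <tmx>."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return False
    return root.tag == "tmx"


def clean_invalid_chars(text):
    """Remove characters that are not allowed in an XML 1.0 document."""
    return INVALID_XML_CHARS.sub("", text)
```

A typical run would read the TMX into a string, call `is_well_formed_tmx`, and if that fails, try again on the output of `clean_invalid_chars` before saving the cleaned text.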

Michael


 

Michael Beijer  Identity Verified
United Kingdom
Local time: 08:37
Member (2009)
Dutch to English
UPDATE Jul 10, 2013

I found a free validator, created by the guy behind OmegaT+ (the fork of OmegaT), which is available here: http://omegatplus.sourceforge.net/applications.html#Validator

It both validates and cleans TMXs.

Michael


 

Rodolfo Raya  Identity Verified
Local time: 05:37
English to Spanish
TMXValidator old copy Jul 10, 2013

As TMXValidator is open source and the code is available to the public, the author of OmegaT+ used it and deployed the program as his own without respecting the license terms.

What you found is based on a very old version of TMXValidator. It's not updated.

Just make your life easier. Download Swordfish and use the real TMXValidator. You don't need to purchase a license or anything. TMXValidator is free.

Regards,
Rodolfo




[Edited at 2013-07-10 11:06 GMT]


 

Michael Beijer  Identity Verified
United Kingdom
Local time: 08:37
Member (2009)
Dutch to English
Thanks Rodolfo, Jul 10, 2013

The reason I was still looking for a standalone validator was that I thought that I would no longer be able to use your TMXValidator once the Swordfish trial period ran out.

It's a shame the author of OmegaT+ didn't respect the licence terms. That's not the first time I've heard negative things about that project.

Michael


 

