Is there an automated way of removing all company/brand names from a translation memory (tmx)?
Thread poster: Michael Beijer

Michael Beijer  Identity Verified
United Kingdom
Local time: 14:44
Member (2009)
Dutch to English
+ ...
Jun 21, 2010

TMX-Anonymizer Pro - 3.14 ??? :-)

 

FarkasAndras
Local time: 15:44
English to Hungarian
+ ...
Various ways... Jun 21, 2010

I'm sure some of the more advanced CAT tools offer decent search and replace options for TMX memories. Programs like Olifant should work too, I believe.

If you want something more powerful and versatile, you can't beat tools like sed.
If there are numerous TMX files to process, or numerous company names to replace, you could write (or have somebody write) a script (a batch file on Windows, a Bash script on Linux or Mac) that does all these operations in one fell swoop. Converting all numbers of any length to "XXX" is also pretty easy, in case you want to remove numerical data as well.

A basic sed command for this would look something like:
sed -e "s/Company name/XXX/g" originalmemory.tmx > anonymizedmemory.tmx
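For several names and several files, the same idea can be wrapped in a small shell function. This is only a sketch under assumptions: the function name anonymize_tmx, the names file (one company name per line) and the .anon.tmx output suffix are made up for illustration and do not come from any existing tool. Names are escaped first so that characters such as "." are matched literally rather than as regex wildcards.

```shell
#!/bin/sh
# anonymize_tmx NAMES_FILE FILE.tmx...
# Replaces every name listed in NAMES_FILE (one per line) with XXX and
# writes the result next to each input file as FILE.anon.tmx.
# (Hypothetical helper; all names here are illustrative.)
anonymize_tmx() {
    names=$1
    shift
    script=$(mktemp) || return 1
    # Build a sed script with one "s/name/XXX/g" command per name:
    #   1. drop empty lines,
    #   2. backslash-escape characters that are special in a basic regex,
    #   3. wrap each escaped name in a substitute command.
    sed -e '/^$/d' \
        -e 's#[][\\/.^$*]#\\&#g' \
        -e 's#.*#s/&/XXX/g#' "$names" > "$script"
    for f in "$@"; do
        sed -f "$script" "$f" > "${f%.tmx}.anon.tmx"
    done
    rm -f "$script"
}
```

It would be called as something like: anonymize_tmx names.txt *.tmx. Because this works on the raw text of the file, a name has to be listed the way it is stored in the TMX, so a name containing "&" would normally be listed with "&amp;".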

To remove even more stuff (convert the TMX to a tab delimited file first):

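- To replace all numbers of two or more digits (the first command joins the digits on either side of a decimal or thousands separator, so that such numbers are caught as a whole; the second edits the file in place with -i, which is GNU sed syntax):
sed -e "s/[0-9][.,][0-9]/11/g" originalmemory.txt > anonymizedmemory.txt
sed -i -e "s/[0-9]\{2,\}/XXX/g" anonymizedmemory.txt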
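To see what the two substitutions do when chained in a single call, here is a small sketch. The function name mask_numbers and the sample line are made up for illustration, and the "two or more" quantifier is written in the standard \{2,\} form.

```shell
# mask_numbers: read text on stdin and write it back with every number
# of two or more digits replaced by XXX. (Illustrative helper name.)
mask_numbers() {
    # 1st expression: fuse the digits around a decimal/thousands
    #                 separator, so "1,234.50" becomes one run of digits.
    # 2nd expression: replace every run of two or more digits with XXX.
    sed -e 's/[0-9][.,][0-9]/11/g' -e 's/[0-9]\{2,\}/XXX/g'
}

printf 'Invoice 2010-06: 1,234.50 EUR for 7 items\n' | mask_numbers
# prints: Invoice XXX-XXX: XXX EUR for 7 items
```

Single digits such as the "7" above are left alone, which is the intended behaviour of the two-digit threshold.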

- Sort the segments in the tab-delimited file alphabetically (to effectively scramble their original order) with sort (sort for Windows is here: http://sourceforge.net/projects/unxutils/files/unxutils).

The great thing about sed is that it can easily do any number of operations on files of any size.


 

Samuel Murray  Identity Verified
Netherlands
Local time: 15:44
Member (2006)
English to Afrikaans
+ ...
Well, what would you like? Jun 21, 2010

Michael J.W. Beijer wrote:
Is there an automated way of removing all company/brand names from a translation memory (tmx)?


Well, what would such a program do? Here are my ideas, but let me know yours and what you think of mine:

* The user can specify to either delete the company name, replace it with a fake name, replace it with a numbered placeholder, or replace it with a generic placeholder.
* The user can optionally specify a list of company names that the program should look for.
* The program looks for words that start with a capital letter.
* The user can decide what to do with words at the start of a sentence (as they will usually start with a capital letter anyway).
* The program can work in two cycles, i.e. first find a list of suspected company names, then let the user trim the list to the actual names, and then the program uses that list for the deletions.

* For very large TMs, the program can create a list of words that start with a capital letter, but then reduce the list by comparing it to a list of the 10 000 most common words in that language. This will be less than 100% secure, but it will be a lot faster for the user because he'll have to review fewer words.
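
The first cycle described above (collect capitalised words, then shrink the list with a common-words list) could be roughed out with standard command-line tools. This is a sketch, not a finished program: the function name capitalised_candidates and the common-words file (one lower-case word per line) are assumptions, and it only recognises unaccented A-Z capitals, so it would need adjusting for languages with accented capitals.

```shell
# capitalised_candidates COMMON_WORDS_FILE < segments.txt
# Prints "count<TAB>word" for every capitalised word that is not in
# COMMON_WORDS_FILE, most frequent first. (Illustrative helper name.)
capitalised_candidates() {
    common=$1
    # LC_ALL=C keeps [A-Z] strictly ASCII upper case in every locale.
    LC_ALL=C grep -oE '[A-Z][A-Za-z]+' |      # words starting with a capital
        LC_ALL=C grep -vixFf "$common" |      # drop common words, ignoring case
        sort | uniq -c | sort -rn |           # count and rank
        awk '{ print $1 "\t" $2 }'            # tidy the output columns
}
```

The resulting list is what the user would then trim by hand before any deletion is done. Sentence-initial words survive only if they are missing from the common-words list, which ties in with the point about words at the start of a sentence.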

Your thoughts?



[Edited at 2010-06-21 15:23 GMT]


 

FarkasAndras
Local time: 15:44
English to Hungarian
+ ...
To Samuel Jun 21, 2010

The ideas are nice and quite easy to execute.
One suggestion: the program should also look for multiword expressions (as company and personal names tend to be) or at least print (some instances of) the context of the suspect words it finds.


 

Samuel Murray  Identity Verified
Netherlands
Local time: 15:44
Member (2006)
English to Afrikaans
+ ...
@Farkas Jun 21, 2010

FarkasAndras wrote:
One suggestion: the program should also look for multiword expressions (as company and personal names tend to be) or at least print (some instances of) the context of the suspect words it finds.


Hmm, well, the program could have a feature whereby it removes only capitalised words if there is more than 1 of them next to each other, but then the program would miss a lot of company names.
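
As a rough sketch of both suggestions (adjacent capitalised words, and showing some context for a suspect word), something along these lines might do. The function names multiword_candidates and context_of are made up for illustration, and only unaccented A-Z capitals separated by single spaces are recognised.

```shell
# multiword_candidates < segments.txt
# Prints runs of two or more adjacent capitalised words with their
# frequency, most frequent first. (Illustrative helper name.)
multiword_candidates() {
    LC_ALL=C grep -oE '[A-Z][A-Za-z]+( [A-Z][A-Za-z]+)+' |
        sort | uniq -c | sort -rn
}

# context_of WORD < segments.txt
# Prints up to three segments that contain WORD as a whole word, so the
# user can judge whether it really is a name. (Illustrative helper name.)
context_of() {
    grep -wF -- "$1" | head -n 3
}
```

A capitalised word at the start of a sentence that happens to precede a name will be pulled into the run, so the list still needs a human look.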

Also, one could feed the program a list of common company extensions, and then tell it to simply delete all TUs that contain them (LLC, Ltd, bv, etc).
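
On a tab-delimited export with one TU per line, dropping every unit that contains one of those extensions is a one-liner. A minimal sketch, assuming that file layout; the function name drop_company_units is made up, and the default list holds only the three extensions mentioned here.

```shell
# drop_company_units [PATTERN] < tab_delimited.txt > filtered.txt
# Removes every line (= translation unit) that contains one of the
# company extensions as a whole word. PATTERN is an extended regex of
# alternatives; it defaults to the three examples from this thread.
# (Illustrative helper name.)
drop_company_units() {
    grep -vwE "${1:-LLC|Ltd|bv}"
}
```

A longer list can be passed as the argument, for example drop_company_units 'LLC|Ltd|bv|GmbH|Inc'. The match is case-sensitive, so "BV" would have to be listed separately from "bv".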


 

