Is there an automated way of removing all company/brand names from a translation memory (tmx)?
Thread poster: Michael Joseph Wdowiak Beijer

Michael Joseph Wdowiak Beijer  Identity Verified
United Kingdom
Local time: 07:50
Member (2009)
Dutch to English
+ ...
Jun 21, 2010

TMX-Anonymizer Pro - 3.14 ???

FarkasAndras
Local time: 08:51
English to Hungarian
+ ...
Various ways... Jun 21, 2010

I'm sure some of the more advanced CAT tools offer decent search and replace options for TMX memories. Programs like Olifant should work too, I believe.

If you want something more powerful and versatile, you can't beat tools like sed.
If there are numerous TMX files to do the replacement in, or numerous company names to replace, you could (have somebody) write a script (a BAT file if you're on Windows, a Bash script if you're on Linux or Mac) to do all these operations in one fell swoop. Converting all numbers of any length to "XXX" is also pretty easy, in case you want to remove numerical data as well.

A basic sed command for this would look something like:
sed -e "s/Company name/XXX/g" originalmemory.tmx > anonymizedmemory.tmx
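If there is a whole list of names, the same idea can be sketched in Python. This is only a rough illustration, not a finished tool: the function name anonymize_names and the sample names are made up for the example, and like the sed command it does plain text replacement without looking at the XML structure.

```python
import re

def anonymize_names(text, names, placeholder="XXX"):
    """Replace every occurrence of each name in `names` with `placeholder`."""
    if not names:
        return text
    # Longest names first, so "Acme Holdings" is tried before "Acme".
    ordered = sorted(names, key=len, reverse=True)
    # re.escape stops characters such as "." from being read as regex syntax.
    pattern = re.compile("|".join(re.escape(name) for name in ordered))
    return pattern.sub(placeholder, text)

# Hypothetical sample line; for a real file, read originalmemory.tmx into a
# string and write the result to anonymizedmemory.tmx.
sample = "<seg>Acme Holdings and Acme signed with Foo Ltd.</seg>"
print(anonymize_names(sample, ["Acme", "Acme Holdings", "Foo Ltd."]))
```

Like the sed version, this is case-sensitive and also matches inside longer words, so short names may need extra care.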

To remove even more stuff (convert the TMX to a tab delimited file first):

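- To replace all 2-digit or longer numbers (the first command joins digits separated by a decimal point or comma, so that a number like 1,500.75 is replaced as a whole):
sed -e "s/[0-9][.,][0-9]/11/g" originalmemory.txt > anonymizedmemory.txt
sed -i -e "s/[0-9]\{2,\}/XXX/g" anonymizedmemory.txt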

- Sort the segments in the converted file alphabetically (to effectively scramble their original order) with sort (sort for Windows is here: http://sourceforge.net/projects/unxutils/files/unxutils).
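For anyone who prefers a single script, the steps above (convert to tab-delimited, mask the numbers, sort) could look roughly like this in Python. It is a sketch under assumptions: the function names and the tiny sample TMX are invented for illustration, and real TMX files may need more care with inline tags.

```python
import re
import xml.etree.ElementTree as ET

def tmx_to_rows(root):
    """Return one list of segment strings per <tu>, in document order."""
    rows = []
    for tu in root.iter("tu"):
        cells = []
        for seg in tu.iter("seg"):
            # itertext() flattens the segment, including content of inline elements.
            text = "".join(seg.itertext())
            # Tabs and line breaks would break the one-TU-per-line layout.
            cells.append(re.sub(r"\s+", " ", text).strip())
        rows.append(cells)
    return rows

def mask_numbers(line):
    """Same two steps as the sed commands: join separated digits, then mask."""
    joined = re.sub(r"[0-9][.,][0-9]", "11", line)
    return re.sub(r"[0-9]{2,}", "XXX", joined)

# Hypothetical sample; for a real file use ET.parse("originalmemory.tmx").getroot().
SAMPLE = """<tmx version="1.4"><header/><body>
<tu><tuv xml:lang="en"><seg>Invoice 2010 totals 1,500.75 euros</seg></tuv>
<tuv xml:lang="nl"><seg>Factuur 2010 bedraagt 1.500,75 euro</seg></tuv></tu>
<tu><tuv xml:lang="en"><seg>Deliver 7 boxes</seg></tuv>
<tuv xml:lang="nl"><seg>Lever 7 dozen</seg></tuv></tu>
</body></tmx>"""

root = ET.fromstring(SAMPLE)
# Sorting the finished lines discards the original segment order.
lines = sorted(mask_numbers("\t".join(row)) for row in tmx_to_rows(root))
for line in lines:
    print(line)
```

Single digits are left alone, just as with the sed commands.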

The great thing about sed is that it can easily do any number of operations on files of any size.



Samuel Murray  Identity Verified
Netherlands
Local time: 08:51
Member (2006)
English to Afrikaans
+ ...
Well, what would you like? Jun 21, 2010

Michael J.W. Beijer wrote:
Is there an automated way of removing all company/brand names from a translation memory (tmx)?


Well, what would such a program do? Here are my ideas, but let me know yours and what you think about mine:

* The user can specify to either delete the company name, replace it with a fake name, replace it with a numbered placeholder, or replace it with a generic placeholder.
* The user can optionally specify a list of company names that the program should look for.
* The program looks for words that start with a capital letter.
* The user can decide what to do with words at the start of a sentence (as they will usually start with a capital letter anyway).
* The program can work in two cycles, i.e. first find a list of suspected company names, then let the user trim the list to the actual names, and then the program uses that list for the deletions.

* For very large TMs, the program can create a list of words that start with a capital letter, but then reduce the list by comparing it to a list of the 10 000 most common words in that language. This will be less than 100% secure, but it will be a lot faster for the user because he'll have to review fewer words.
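To make the two-cycle idea concrete, here is a rough Python sketch. Everything in it is an assumption made for illustration: the function names, the tiny common-word list (a real one would hold the 10 000 most common words) and the sample segments. The first function collects suspects, the second applies whatever replacement the user chose for each confirmed name.

```python
import re
from collections import Counter

WORD = re.compile(r"[^\W\d_]+")  # runs of letters, Unicode-aware

def at_sentence_start(segment, pos):
    """True if the word at `pos` opens the segment or follows . ! ? or :"""
    before = segment[:pos].rstrip()
    return before == "" or before[-1] in ".!?:"

def candidate_names(segments, common_words, skip_sentence_start=True):
    """First cycle: count capitalised words that are not common words."""
    counts = Counter()
    for segment in segments:
        for m in WORD.finditer(segment):
            word = m.group()
            if not word[0].isupper():
                continue
            if skip_sentence_start and at_sentence_start(segment, m.start()):
                continue
            if word.lower() in common_words:
                continue
            counts[word] += 1
    return counts

def replace_names(segments, replacements):
    """Second cycle: `replacements` maps each confirmed name to its substitute."""
    if not replacements:
        return list(segments)
    names = sorted(replacements, key=len, reverse=True)
    # \b assumes every name starts and ends with a letter or digit.
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, names)) + r")\b")
    return [pattern.sub(lambda m: replacements[m.group()], s) for s in segments]

segments = [
    "The contract with Globex starts in March.",
    "Please send the invoice to Globex and Initech.",
    "Initech will reply on Monday.",
]
common_words = {"the", "contract", "with", "starts", "in", "march", "please",
                "send", "invoice", "to", "and", "will", "reply", "on", "monday"}

suspects = candidate_names(segments, common_words)
confirmed = sorted(suspects)  # in practice the user trims this list by hand
numbered = {name: f"[NAME{i}]" for i, name in enumerate(confirmed, 1)}
cleaned = replace_names(segments, numbered)
print(suspects)
print(cleaned)
```

Mapping every name to "XXX" gives the generic placeholder, mapping to an empty string deletes the name, and mapping to invented names gives the fake-name option, so one function covers all four modes.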

Your thoughts?



[Edited at 2010-06-21 15:23 GMT]


FarkasAndras
Local time: 08:51
English to Hungarian
+ ...
To Samuel Jun 21, 2010

The ideas are nice and quite easy to execute.
One suggestion: the program should also look for multiword expressions (as company and personal names tend to be) or at least print (some instances of) the context of the suspect words it finds.
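Something along these lines, as a rough Python sketch (the function names and sample sentences are made up, and it is only one possible way to do it): find runs of two or more adjacent capitalised words and print each with some context so the user can judge it.

```python
import re

TOKEN = re.compile(r"\w+")

def capitalised_runs(segment, min_words=2):
    """Return (start, end) spans of `min_words` or more adjacent capitalised words."""
    runs, run = [], []
    for m in TOKEN.finditer(segment):
        is_cap = m.group()[0].isupper()
        # A run continues only across plain whitespace, not across punctuation.
        adjacent = bool(run) and segment[run[-1].end():m.start()].isspace()
        if is_cap and adjacent:
            run.append(m)
            continue
        if len(run) >= min_words:
            runs.append((run[0].start(), run[-1].end()))
        run = [m] if is_cap else []
    if len(run) >= min_words:
        runs.append((run[0].start(), run[-1].end()))
    return runs

def show_context(segment, width=20):
    """Return each run in brackets with up to `width` characters on either side."""
    out = []
    for start, end in capitalised_runs(segment):
        left = segment[max(0, start - width):start]
        right = segment[end:end + width]
        out.append(f"{left}[{segment[start:end]}]{right}")
    return out

sample = "Our partner Initech Software Solutions delivers to Globex Corporation."
found = [sample[s:e] for s, e in capitalised_runs(sample)]
print(found)
for line in show_context(sample):
    print(line)
```

A capitalised word at the start of a sentence will be pulled into a run that follows it directly, so the output still needs a human eye.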



Samuel Murray  Identity Verified
Netherlands
Local time: 08:51
Member (2006)
English to Afrikaans
+ ...
@Farkas Jun 21, 2010

FarkasAndras wrote:
One suggestion: the program should also look for multiword expressions (as company and personal names tend to be) or at least print (some instances of) the context of the suspect words it finds.


Hmm, well, the program could have a feature whereby it removes only capitalised words if there is more than 1 of them next to each other, but then the program would miss a lot of company names.

Also, one could feed the program a list of common company extensions, and then tell it to simply delete all TUs that contain them (LLC, Ltd, bv, etc).
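The suffix idea could be sketched like this in Python. The function name and the sample TMX are hypothetical, and matching case-insensitively is my own assumption (so that "BV" is caught as well as "bv"); written forms such as "B.V." would need their own entries.

```python
import re
import xml.etree.ElementTree as ET

SUFFIXES = ["LLC", "Ltd", "bv"]  # the examples from the post; extend as needed

def drop_tus_with_suffix(root, suffixes=SUFFIXES):
    """Remove every <tu> whose segments contain a suffix as a whole word."""
    pattern = re.compile(
        r"\b(?:" + "|".join(map(re.escape, suffixes)) + r")\b", re.IGNORECASE
    )
    body = root.find("body")
    removed = 0
    # Iterate over a copy, because the body is modified inside the loop.
    for tu in list(body.findall("tu")):
        text = " ".join("".join(seg.itertext()) for seg in tu.iter("seg"))
        if pattern.search(text):
            body.remove(tu)
            removed += 1
    return removed

# Hypothetical sample; for a real file use ET.parse(...) and tree.write(...).
SAMPLE = """<tmx version="1.4"><header/><body>
<tu><tuv xml:lang="en"><seg>Globex Ltd accepts the terms.</seg></tuv>
<tuv xml:lang="nl"><seg>Globex Ltd aanvaardt de voorwaarden.</seg></tuv></tu>
<tu><tuv xml:lang="en"><seg>Click Save to continue.</seg></tuv>
<tuv xml:lang="nl"><seg>Klik op Opslaan om door te gaan.</seg></tuv></tu>
<tu><tuv xml:lang="en"><seg>Supplied by Initech.</seg></tuv>
<tuv xml:lang="nl"><seg>Geleverd door Initech BV.</seg></tuv></tu>
</body></tmx>"""

root = ET.fromstring(SAMPLE)
removed = drop_tus_with_suffix(root)
remaining = len(root.find("body").findall("tu"))
print(removed, "removed,", remaining, "kept")
```

A TU is dropped if either language side contains a suffix, which errs on the side of removing too much rather than too little.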



