Extracting acronyms from a TM
Thread poster: Mathieu Jacquet

Mathieu Jacquet  Identity Verified
France
Local time: 02:52
English to French
Jan 31, 2011

Dear all,

I am currently working on extracting all the acronyms stored in a huge translation memory (the .txt export is 75 MB).

The acronyms are basically 2, 3, or maybe 4 characters long, in uppercase.

I am looking for the best way to automate the extraction. Do you think that Multiterm Extract would be of any help for this kind of task? Is there a filter that can extract only 4-character words in uppercase, for example?

Another option would be to copy and paste the whole content of my .txt export into a Word document, then run a macro that would extract every 4-character word in uppercase and copy it into an Excel file.

Do you know of any other method? Can you recommend a Word macro guru?

Thank you very much in advance,

Mathieu.



Jabberwock  Identity Verified
Poland
Local time: 02:52
Member (2004)
English to Polish
Bilingual? Jan 31, 2011

The key question is: do you want it bilingual? I.e. do you want to extract both the source acronym and the (possible) translation?

It is important, as the method applied would differ dramatically. If you want bilingual content, only tools such as Extract will do it (although I doubt it will be very reliable with acronyms).

On the other hand, extracting just the strings of capital letters from plain text is rather trivial...



Jabberwock  Identity Verified
Poland
Local time: 02:52
Member (2004)
English to Polish
Simple search and replace Jan 31, 2011

To get monolingual acronyms in Word:

Open the "Find and Replace" window, select "Regular expressions" (called "Use wildcards" in English versions of Word) and put:

[A-Z]{2;6}

in the Find box - this searches for all occurrences of strings of capital letters 2 to 6 characters long.

In the Replace box put:

^p^&^p

and select "Format / Font", choosing a font which is not used within the document.

This separates all acronyms with paragraph marks and formats them with a different font.

Now remove anything that is in the original font. You might get too many paragraph marks, but you can easily get rid of them (replace ^p^p with ^p making sure regular expressions are turned off).
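
Outside Word, the same search can be sketched in a few lines of Python. This is only a rough equivalent, not the Word method itself: the function name count_acronyms and the file name tm_export.txt are made up for illustration, and the pattern adds word boundaries so that capitals inside a longer word are skipped.

```python
import re
from collections import Counter

# Whole words made of 2 to 6 capital letters (comma separator in Python).
# Note: [A-Z] covers unaccented ASCII capitals only, so "É" is not matched.
ACRONYM_RE = re.compile(r"\b[A-Z]{2,6}\b")

def count_acronyms(lines):
    """Count every acronym found in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(ACRONYM_RE.findall(line))
    return counts

sample = [
    "The EU and NATO met.",
    "The EU agreed; see Unesco and A4.",
]
counts = count_acronyms(sample)
for acronym, n in sorted(counts.items()):
    print(acronym, n)

# For a large export, pass the open file so it is read line by line
# (hypothetical file name and encoding):
# with open("tm_export.txt", encoding="utf-8") as f:
#     counts = count_acronyms(f)
```

Counting the occurrences makes it easier to spot one-off capitalised words that are not real acronyms.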



Mathieu Jacquet  Identity Verified
France
Local time: 02:52
English to French
TOPIC STARTER
Thanks for the wildcard course :) Jan 31, 2011

Thank you Jabberwock,

I'll try your Find/Replace solution with regular expressions, and also the Multiterm Extract tool.

The ideal would be to end up with a list of bilingual acronyms, together with their "expanded form" (I do not know if the term is correct), in a 4-column Excel sheet (EN, EN expanded, FR, FR expanded).

For instance:

JDI | Just do it | JFL | Juste fais le

So the process would be:

1. Extract all EN acronyms, starting from a monolingual EN export.
2. Copy them into the first Excel column.
3. Extract all FR acronyms, starting from a monolingual FR export.
4. Copy them into the third Excel column.
5. Check that the acronyms are correctly aligned (some FR acronyms have no equivalent in English and vice versa; for instance, BF in French is "Bon fonctionnement", which is simply "healthy" in English, with no acronym).
6. Manually fill in the second and fourth columns.
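
Steps 1 to 4 could be scripted roughly as follows. This is a minimal sketch that assumes the two acronym lists have already been extracted; build_sheet is a made-up name, and the rows are written as CSV, which Excel can open. The rows come out side by side but not aligned, so steps 5 and 6 remain manual.

```python
import csv
import io
from itertools import zip_longest

def build_sheet(en_acronyms, fr_acronyms):
    """Return rows (EN, EN expanded, FR, FR expanded); expanded columns stay empty."""
    rows = [("EN", "EN expanded", "FR", "FR expanded")]
    en_sorted = sorted(set(en_acronyms))  # unique, alphabetical
    fr_sorted = sorted(set(fr_acronyms))
    # zip_longest pads the shorter list with empty cells.
    for en, fr in zip_longest(en_sorted, fr_sorted, fillvalue=""):
        rows.append((en, "", fr, ""))
    return rows

rows = build_sheet(["JDI", "EU", "JDI"], ["JFL", "UE", "BF"])

# Written to an in-memory buffer here; a real file opened with newline="" works the same way.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```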

Thanks for your help!
Mathieu



Jabberwock  Identity Verified
Poland
Local time: 02:52
Member (2004)
English to Polish
Trick Jan 31, 2011

In that case, a trick that might help:

Convert the TM in such a way that each segment is in a different column (i.e. you have two columns with multiple rows) - copying from Excel would probably be the easiest way to do that.

That way, when you do the search and replace, you get all the acronyms from corresponding segments aligned. Of course, there will be cases when there are several acronyms in one segment - you have to skip them manually...

Or an extended solution:

Get the TM into Word in four columns - source; source; target; target. Then do the acronym search and replace on the first and third columns only. Now you have: first column with just the source acronym(s), second column with the source text from which you can extract the full phrase, third column with the target acronym(s), fourth column with the full target translation.
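
The same four-column layout can also be sketched in Python instead of Word. The sketch below assumes the TM has already been read into (source, target) segment pairs; acronym_rows and the sample sentences are invented for illustration.

```python
import re

# Whole words made of 2 to 6 unaccented capital letters.
ACRONYM_RE = re.compile(r"\b[A-Z]{2,6}\b")

def acronym_rows(segment_pairs):
    """Yield (source acronyms, source text, target acronyms, target text).

    Pairs with no acronym on either side are skipped.
    """
    for source, target in segment_pairs:
        src = ACRONYM_RE.findall(source)
        tgt = ACRONYM_RE.findall(target)
        if src or tgt:
            yield (" ".join(src), source, " ".join(tgt), target)

# Made-up sample segments.
pairs = [
    ("Check the BIOS settings.", "Vérifiez les paramètres du BIOS."),
    ("The unit is healthy.", "L'unité est en BF."),
    ("Nothing to report.", "Rien à signaler."),
]
rows = list(acronym_rows(pairs))
for row in rows:
    print(" | ".join(row))
```

Segments with several acronyms end up with all of them in one cell, so they still have to be sorted out by hand.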



Mathieu Jacquet  Identity Verified
France
Local time: 02:52
English to French
TOPIC STARTER
TM to Excel Jan 31, 2011

For information, there are interesting posts about how to convert a TM into an Excel file:

http://deu.proz.com/forum/sdl_trados_support/127498-exporting_tm_files_into_excel_or_csv.html

http://est.proz.com/forum/sdl_trados_support/178702-exporting_tm_to_excel_and_vice_versa.html

Mathieu.



István Hirsch  Identity Verified
Local time: 02:52
English to Hungarian
Syntax in the system used Jan 31, 2011

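In Hungary (i.e. with Hungarian regional settings) the syntax is the same as the one Jabberwock suggested, but with the settings of other countries it can be [A-Z]{2,6}, with a comma instead of a semicolon, as far as I know.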

