Extracting acronyms from TM
Thread poster: Mathieu Jacquet

Mathieu Jacquet  Identity Verified
France
Local time: 17:17
English to French
Feb 2, 2011

Hi all,

I am currently looking for the best way to extract all acronyms from a big TM (75 MB), an acronym being defined as a word of 2 to 6 characters, all in capital letters (IA, SCADA, etc.).
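To make that definition concrete, here is a minimal Python sketch of it as a regular expression. The names `ACRONYM_RE` and `find_acronyms` are made up for illustration, and the pattern only covers unaccented A-Z capitals, so accented capitals or acronyms containing digits would need a wider character class.

```python
import re

# 2 to 6 ASCII capital letters, with a word boundary on both sides so that
# longer all-caps words (e.g. "INTERNATIONAL") are not matched in part.
ACRONYM_RE = re.compile(r"\b[A-Z]{2,6}\b")

def find_acronyms(text):
    """Return the acronym candidates found in text, in order of appearance."""
    return ACRONYM_RE.findall(text)

print(find_acronyms("The SCADA system uses IA and INTERNATIONAL standards."))
```

Run on the text of a TM export, this would return every candidate, including repeats, so the results would still need deduplicating and reviewing.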

I know that Multiterm Extract includes Minimum term length and Maximum term length parameters, which specify the minimum/maximum number of words required to form a term candidate, but I do not think extracted terms can be filtered either by upper/lowercase or by the number of characters in a word, right?

Any ideas on how to achieve such an extraction?

Thank you very much in anticipation,
Mathieu.

[Edited at 2011-02-02 11:08 GMT]



István Hirsch  Identity Verified
Local time: 17:17
English to Hungarian
Deja vu Feb 2, 2011

A few days ago you asked this question and got an answer suggesting a regular expression. Did that not solve the problem? If not, why?


Mathieu Jacquet  Identity Verified
France
Local time: 17:17
English to French
TOPIC STARTER
Exploration Feb 2, 2011

Hi István,

The thing is that I want to explore different ways to solve this problem. I might have to repeat the exercise, so the easier the way, the better.

I have not finished working through the Search & Replace solution yet, but Multiterm can also be a solution (especially because Multiterm Extract allows you to attach a termbase for excluding already extracted terms). Imagine that you created a termbase including acronyms. Two years later, you want to extract all acronyms from a TM, knowing that maybe half of the acronyms in it are already present in your termbase. Excluding the acronyms already in your termbase from the extraction process could save time.
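Outside Multiterm, the same exclusion idea can be sketched in a few lines of Python as a set difference. This is only an illustration: `new_acronyms` is a made-up name, and the known acronyms are assumed to have already been exported from the termbase into a plain collection of strings.

```python
import re

# Same working definition as in the first post: 2 to 6 capital letters.
ACRONYM_RE = re.compile(r"\b[A-Z]{2,6}\b")

def new_acronyms(segments, known):
    """Return sorted acronym candidates from segments that are not in known."""
    found = set()
    for segment in segments:
        found.update(ACRONYM_RE.findall(segment))
    # Set difference drops everything the termbase already contains.
    return sorted(found - set(known))

# "known" stands in for acronyms exported from an existing termbase.
print(new_acronyms(["The SCADA and PLC units", "IA is used"], {"SCADA"}))
```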

I have logged a case with SDL support about ways to do that using Multiterm Extract. I am waiting for an answer and will post it here.

I'll also post the Search & Replace solution when I'm done with it.

Cheers,
Mathieu.



István Hirsch  Identity Verified
Local time: 17:17
English to Hungarian
Multiterm Extract... Feb 2, 2011

...seems a good choice for the purpose, but I cannot help, I have never used it.


FarkasAndras
Local time: 17:17
English to Hungarian
MT Extract <-> acronyms Feb 2, 2011

I for one would never consider buying MT Extract to get acronyms out of a TM. It's pretty much impossible for it to work much better than a simple regex, but it's guaranteed to cost a hell of a lot more than learning how to build simple regular expressions. It'll take about the same amount of time as well, or, considering that some kind colleague has already proposed a regex, more.

If you want to use it to extract terminology in general, that's another story. I'm still pessimistic about its usefulness, but I have no information to back up that pessimism apart from the fact that this sort of advanced natural language processing is very tricky and users generally don't say very flattering things about MT Extract.



Mathieu Jacquet  Identity Verified
France
Local time: 17:17
English to French
TOPIC STARTER
SDL support answer Feb 3, 2011

SDL support phoned me today and said that it is not possible.

@Andras: you do not buy Multiterm Extract, you buy SDL Trados and it comes with the package. I have created a number of termbases using it, and it can be pretty useful. But not this time, unfortunately.

I have tried the Okapi Olifant solution too.
I have one problem though: its filter is based on a SQL "LIKE" command, but apparently a very limited one that does not support character ranges ([A-Z] works with LIKE in some SQL dialects, but not in the Olifant filter).

So the regex/Word macro solution will be the winner for this issue, I guess.

Mathieu.



FarkasAndras
Local time: 17:17
English to Hungarian
MT Extract sold separately Feb 3, 2011

Mathieu Jacquet wrote:

The SDL support phoned me today, saying that it is not possible.

@Andras: you do not buy Multiterm Extract, you buy SDL Trados and it comes with the package. I have created a number of termbases using it, and it can be pretty useful. But not this time unfortunately.

No it doesn't. If you got MT Extract with your Trados Suite, you must have bought some special megapack.
As SDL says, SDL MultiTerm Extract 2009 and SDL PhraseFinder are not included with SDL MultiTerm 2009.

[Edited at 2011-02-03 15:30 GMT]



Jabberwock  Identity Verified
Poland
Local time: 17:17
Member (2004)
English to Polish
Various methods Feb 3, 2011

For what it's worth, you can also apply the "old termbase" solution using Excel at some point to eliminate duplicate entries (a Word macro would be a bit inefficient, I think).

The Word S&R was proposed as a quick solution for a one-time problem. If you think you will have to do it more often, there are probably better options - Perl, for example, would beat anything performance-wise, but it would take more time to set up...



Mathieu Jacquet  Identity Verified
France
Local time: 17:17
English to French
TOPIC STARTER
Word macro Feb 3, 2011

After cleaning my export (250,000 translation units), I end up with more than 60,000 acronyms (including names, words of titles in capital letters, etc.).

I have asked a friend to develop a Word macro that extracts acronyms into a four-column Excel sheet (EN, EN context, FR, FR context), excluding:

. words preceded by "[A-Z] plus dot", because in our docs names of reviewers, etc. are written, for example, "M. JACQUET",
. words in capital letters followed by a blank space plus a word in capital letters; this will exclude titles.
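The two exclusion rules above can also be expressed as regex lookarounds. The following is a rough Python sketch, not the Word macro itself: the name `filter_acronyms` and the sample name "DUPONT" are made up, and a single whitespace character is assumed between the initial and the name.

```python
import re

FILTERED_RE = re.compile(
    r"(?<![A-Z]\.\s)"      # rule 1: not preceded by a capital, a dot and a space
    r"\b[A-Z]{2,6}\b"      # the candidate: 2 to 6 capital letters
    r"(?!\s[A-Z]{2,}\b)"   # rule 2: not followed by a space and an all-caps word
)

def filter_acronyms(text):
    """Return acronym candidates that survive both exclusion rules."""
    return FILTERED_RE.findall(text)

print(filter_acronyms("Reviewed by M. DUPONT using SCADA."))
print(filter_acronyms("ANNUAL REPORT on the PLC"))
```

Note that rule 2 as stated only looks forward, so the last word of an all-caps title (here "REPORT") still gets through; catching it would need an extra check on the preceding word.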

This will take between 2 and 4 hours of development.

So far this is a one-time process, but who knows...

Cheers,
Mathieu.



FarkasAndras
Local time: 17:17
English to Hungarian
Tool Mar 5, 2011

I've built a tool for the job.
The task was a lot more complicated than I first imagined, so the script does quite a bit of trickery behind the scenes. I'm quite happy with the results, though: if you have a large TM with acronyms repeated several times, you get fairly reliable pairings even if several acronyms tend to occur in the same segment.
Download the grab bag from here: https://sourceforge.net/projects/aligner/files/?
It's Windows-only and it's called ACR_extract.

I've copy-pasted the readme here:


This tool extracts acronyms from translation memories. The TMs can be in TMX or tab delimited txt format. Txt files need to be in UTF-8 encoding. You can run the tool on more than one file at a time, even a mix of tmx and txt. Make sure the languages are in the same order in all files, though.
The script was intended for large memories/corpora, and should work on up to several million segments if you have a couple of GB of RAM in your computer.

Acronyms are taken to be words made up of 2-6 capital letters. Acronym pairs are identified based on how many times they co-occur. The script attempts to filter out segments that were written in ALL CAPS, but some may evade the filter and screw up your results.
The output is a tab delimited file that contains the acronym pair, the number of occurrences, and one example segment where the acronym pair occurred. You can copy-paste the output file into Excel and sort it by the third column to get a list sorted by probability and relevance.
You can exclude acronyms that occur only once to filter out less likely candidates. You can also set the inclusion threshold to a higher number by answering the prompt with a number instead of y/n.
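
For readers who want to see the co-occurrence idea in code, here is a deliberately simplified Python sketch. It is not the ACR_extract script and does none of its "trickery": the names `count_pairs` and `min_count` are made up, the input is assumed to be a list of (source, target) segment pairs, and an all-caps segment is detected simply with `str.isupper()`.

```python
import re
from collections import Counter

# 2 to 6 capital letters, as in the readme's definition.
ACRONYM_RE = re.compile(r"\b[A-Z]{2,6}\b")

def count_pairs(units, min_count=1):
    """Count how often each (source, target) acronym pair shares a segment pair."""
    counts = Counter()
    for source, target in units:
        # Skip segments written entirely in capitals (titles and the like).
        if source.isupper() or target.isupper():
            continue
        for src in set(ACRONYM_RE.findall(source)):
            for tgt in set(ACRONYM_RE.findall(target)):
                counts[(src, tgt)] += 1
    # Keep only pairs at or above the inclusion threshold, most frequent first.
    return [(pair, n) for pair, n in counts.most_common() if n >= min_count]

units = [
    ("The SCADA system", "Le système SCADA"),
    ("SCADA and PLC", "SCADA et API"),
    ("ALL CAPS TITLE", "TITRE EN MAJUSCULES"),
]
print(count_pairs(units, min_count=2))
```

With a threshold of 1, every acronym in a segment is paired with every acronym in its translation, which is why a large TM with repeated acronyms and a higher threshold gives more reliable results.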



