Term & phrase extraction tools
Thread poster: MikeTrans
MikeTrans
Germany
Local time: 14:19
Member (2005)
Italian to German
+ ...
Oct 26, 2010

Dear colleagues,

I would welcome some advice about extraction tools.
What I need is a tool capable of extracting terms & phrases based on a stopword list. Such an extracted term list comes in handy when starting a large project and building up a termbase for it.
Years ago, I did some research and found the solutions listed below; in parentheses are the reasons why I'm not using them.

- Multiterm Extract (too slow, too expensive)
- Synchroterm 2005 (too slow, not designed for monolingual docs, Trados-dependent)
- DVX Lexicon (doesn't take stopwords into account)
- Extphr32

Extphr32 has huge advantages over the other tools: it's freeware and it's incredibly fast. In fact, it takes less than 30 seconds to process an ANSI text file of 20 MB!
The results of the extraction are presented along with their frequencies, but alas the case of the original text is not kept. You can, however, choose to display them all in uppercase or all in lowercase.
This makes it useless for non-English languages (such as German), or for terms containing names, acronyms etc. Also, entering those terms in a termbase means you must change the case on the fly, which is a considerable loss of time.
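
For anyone who wants to experiment with the idea, here is a minimal Python sketch of stopword-based extraction that reports frequencies and keeps the spelling found in the text. It is not how Extphr32 works internally; the function names, the `max_len` parameter and the choice to split phrases at punctuation are all assumptions made for this illustration.

```python
import re
from collections import Counter


def _count_run(run, max_len, counts, surface):
    """Count every contiguous sub-phrase (1..max_len words) of a stopword-free run."""
    for i in range(len(run)):
        for j in range(i + 1, min(i + max_len, len(run)) + 1):
            phrase = " ".join(run[i:j])
            key = phrase.lower()             # frequencies are case-insensitive
            counts[key] += 1
            surface.setdefault(key, phrase)  # remember the first spelling seen


def extract_phrases(text, stopwords, max_len=3):
    """Return (phrase, frequency) pairs, most frequent first, in original case."""
    stop = {w.lower() for w in stopwords}
    counts = Counter()
    surface = {}
    # Split at punctuation so that a phrase never crosses a clause boundary.
    for chunk in re.split(r"[.,;:!?()\n]+", text):
        run = []
        # The trailing None is a sentinel that flushes the last run of the chunk.
        for word in re.findall(r"\w+(?:-\w+)*", chunk) + [None]:
            if word is None or word.lower() in stop:
                _count_run(run, max_len, counts, surface)
                run = []
            else:
                run.append(word)
    return [(surface[key], freq) for key, freq in counts.most_common()]


if __name__ == "__main__":
    sample = "Die Steuerung der Pumpe. Die Pumpe und die Steuerung der Anlage."
    for phrase, freq in extract_phrases(sample, ["die", "der", "und"]):
        print(freq, phrase)
```

Because the frequency key is lowercased while the displayed form is taken from the text, "Pumpe" and "pumpe" are counted together but shown with the capitalization of the first occurrence.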

Do you know any additional extraction tools? Maybe one like Extphr32?
I would be happy if you could just mention them in this thread. Thank you very much!

Mike



Cristina Valenza
Italy
Local time: 14:19
Italian to English
+ ...
Terminology Wizard Oct 26, 2010

You can try Terminology Wizard or Term-minator (open source).


Samuel Murray  Identity Verified
Netherlands
Local time: 14:19
Member (2006)
English to Afrikaans
+ ...
Tenka... Oct 26, 2010

MikeTrans wrote:
Do you know any additional extraction tools? Maybe one like Extphr32?


This rings a bell:
http://sourceforge.net/projects/corsis/
...although I can't remember if it does extraction.


MikeTrans
Germany
Local time: 14:19
Member (2005)
Italian to German
+ ...
TOPIC STARTER
Thank you, Oct 26, 2010

Samuel,
Thanks for your tip.
In the meantime I managed to make a macro in UltraEdit in order to solve the upper/lower case problem. So I think I can happily continue to use ExtPhr32.

Regards,
Mike



Jabberwock  Identity Verified
Poland
Local time: 14:19
Member (2004)
English to Polish
Okapi Rainbow Oct 26, 2010

Okapi Rainbow has term extraction as well - it is comparable to Extphr32, but quite convenient if you process the source files anyway...


Stanislaw Czech, MCIL  Identity Verified
United Kingdom
Local time: 13:19
Member (2006)
English to Polish
+ ...
Where can I download it? Oct 27, 2010

Cristina Valenza wrote:

You can try Terminology Wizard or Term-minator (open source).



I have found this website http://www.synthema.it/index.php/en/Prodotti/terminologywizard/Terminology-Wizard.html but there is no option for download.

Best Regards
Stanislaw



Samuel Murray  Identity Verified
Netherlands
Local time: 14:19
Member (2006)
English to Afrikaans
+ ...
Where to find it? Oct 27, 2010

Cristina Valenza wrote:
You can try Terminology Wizard or Term-minator (open source).


Terminology Wizard was, as far as I can see, never open source and never free either. However, in past years (2006 etc.) it was possible to download a demo version of it, named twizdwld.exe.

Term-inator may be open source but I can't see how to download it. It exists only as an online search interface. Or, did you mean "free" perhaps when you said "open source"?



Samuel Murray  Identity Verified
Netherlands
Local time: 14:19
Member (2006)
English to Afrikaans
+ ...
Rainbow can't do it, or can it? Oct 27, 2010

Jabberwock wrote:
Okapi Rainbow has term extraction as well - it is comparable to Extphr32, but quite convenient if you process the source files anyway...


There is a "text extraction" utility in Rainbow, but it doesn't extract terms or phrases. It extracts only lines or segments (which is useful for some purposes, but not what the OP asked for). Or... what am I missing?



Samuel Murray  Identity Verified
Netherlands
Local time: 14:19
Member (2006)
English to Afrikaans
+ ...
@Mike Oct 27, 2010

MikeTrans wrote:
In the meantime I managed to make a macro in UltraEdit in order to solve the upper/lower case problem. So I think I can happily continue to use ExtPhr32.


How does that work? Do you add a marker to every capital letter before doing the extraction?



Samuel Murray  Identity Verified
Netherlands
Local time: 14:19
Member (2006)
English to Afrikaans
+ ...
@Mike (more) Oct 27, 2010

MikeTrans wrote:
What I need is a tool capable of extracting terms & phrases based on a stopword list.


I remembered something posted on another list, and went and looked for it, and found it:

http://www.archivepub.co.uk/FRedit/
http://www.archivepub.co.uk/TheBook/

This is a whole bunch of macros useful for editors, but I do believe that there is also a text extraction utility somewhere in there. Some of the links in the package are dead links, but a bit of googling gets you by, e.g. http://www.lunerouge.org/gnu/wx/textstat_win32.zip .



Mette Melchior  Identity Verified
Sweden
Local time: 14:19
English to Danish
+ ...
Try AntConc Oct 27, 2010

You might try using the free corpus tool AntConc. I mainly use it for concordance searches in monolingual corpora but you can use the Word List and Clusters tools to identify terms and phrases and copy-paste the results to Excel or similar.
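
To give a rough idea of what a word list and a cluster list contain, here is a small Python sketch. It is not AntConc's implementation; the function names are invented for this example, and "clusters" is simplified here to plain word n-grams. The output is tab-separated so that it can be pasted into Excel or similar.

```python
import re
from collections import Counter


def word_list(text):
    """Frequency of every word form, case kept as in the text."""
    return Counter(re.findall(r"\w+", text))


def clusters(text, n=2):
    """Frequency of every sequence of n consecutive words."""
    tokens = re.findall(r"\w+", text)
    # zip over n shifted copies of the token list yields all n-grams.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


def to_tsv(counter):
    """One 'item<TAB>frequency' line per entry, most frequent first."""
    return "\n".join(f"{item}\t{freq}" for item, freq in counter.most_common())


if __name__ == "__main__":
    sample = "term base term base entry"
    print(to_tsv(word_list(sample)))
    print(to_tsv(clusters(sample, n=2)))
```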

There are also some suggestions in this thread if you haven't seen it already:
http://www.proz.com/forum/software_applications/96347-terminology_extraction_software.html



Jabberwock  Identity Verified
Poland
Local time: 14:19
Member (2004)
English to Polish
How about... Oct 27, 2010

Samuel Murray wrote:
There is a "text extraction" utility in Rainbow, but it doesn't extract terms or phrases. It extracts only lines or segments (which is useful for some purposes, but not what the OP asked for). Or... what am I missing?


How about Utilities/Term extraction... ? (In Rainbow 6.0.8).



Samuel Murray  Identity Verified
Netherlands
Local time: 14:19
Member (2006)
English to Afrikaans
+ ...
Egg on my face Oct 27, 2010

Jabberwock wrote:
Samuel Murray wrote:
There is a "text extraction" utility in Rainbow, but it doesn't extract terms or phrases.

How about Utilities/Term extraction... ? (In Rainbow 6.0.8).


Aaah, yes, now I see it. It is further down the list. In a weird location. But it's there.


MikeTrans
Germany
Local time: 14:19
Member (2005)
Italian to German
+ ...
TOPIC STARTER
Thank you all, Oct 27, 2010

Thank you all very much for posting your suggestions! I will certainly take a look at all these alternatives.

Samuel Murray wrote:

How does that work? Do you add a marker to every capital letter before doing the extraction?


Samuel,
yes, I handle all special characters like ö, Ö, ä, Ä, ü, Ü, ß, and even special characters like ® or ©. I first change ö to oe44, Ö to oe55, ß to ss44 etc.
I then change all capital letters: A to a33, B to b33 etc.

Here is the UltraEdit macro for changing everything to LOWERCASE (shortened):

InsertMode
ColumnModeOff
HexOff
UltraEditReOn
Top
Find MatchCase "ü"
Replace All "ue44"
UltraEditReOn
Top
Find MatchCase "Ü"
Replace All "ue55"
UltraEditReOn
Top
Find MatchCase "ä"
Replace All "ae44"
UltraEditReOn
Top
Find MatchCase "Ä"
Replace All "ae55"
UltraEditReOn
Top
Find MatchCase "ö"
Replace All "oe44"
UltraEditReOn
Top
Find MatchCase "Ö"
Replace All "oe55"
InsertMode
ColumnModeOff
HexOff
UltraEditReOn
Top
Find MatchCase "A"
Replace All "a33"
UltraEditReOn
Top
Find MatchCase "B"
Replace All "b33"

(...)

Find MatchCase "ß"
Replace All "ss44"
UltraEditReOn
Top
Find MatchCase "®"
Replace All "666"
UltraEditReOn
Top
Find MatchCase "-"
Replace All "777"
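
For comparison, the same forward substitution can be sketched in Python. This is only an illustration and not part of the UltraEdit macro: the names `SPECIALS` and `encode` are made up here, and the table covers only the characters that appear in the shortened macro.

```python
import re

# Marker table copied from the shortened macro; extend it for other characters.
SPECIALS = {
    "ü": "ue44", "Ü": "ue55",
    "ä": "ae44", "Ä": "ae55",
    "ö": "oe44", "Ö": "oe55",
    "ß": "ss44", "®": "666", "-": "777",
}


def encode(text):
    """Rewrite text as lowercase ASCII, marking umlauts and capital letters."""
    # Step 1: replace the special characters first, as the macro does.
    for char, marker in SPECIALS.items():
        text = text.replace(char, marker)
    # Step 2: each capital A-Z becomes its lowercase letter followed by "33".
    return re.sub(r"[A-Z]", lambda m: m.group().lower() + "33", text)


if __name__ == "__main__":
    print(encode("Größe"))
    print(encode("Über-Maß"))
```

Like the macro, this assumes that the marker strings ("33", "44", "55", "666", "777") do not already occur in the source text in a position where they could be mistaken for markers.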

You must also adapt the stopword list for ExtPhr32: apply this macro to it and add the resulting expressions to the stoplist, which must then contain both the "normal" words and the manipulated words.
Once you have applied the macro to a text, you proceed with the extraction.
Afterwards, you choose "copy keywords" and change ", " (comma + space) into "^p" within UltraEdit.
Then you apply the reverse macro, which restores the capitalization where applicable.
Note: In UltraEdit you can save a series of macros as a single one and set the program to load them at every start.

The reverse macro, which restores the UPPERCASE (shortened), is:

InsertMode
ColumnModeOff
HexOff
UltraEditReOn
Top
Find MatchCase "a33"
Replace All "A"
UltraEditReOn
Top
Find MatchCase "b33"
Replace All "B"

(...)

UltraEditReOn
Top
Find MatchCase "z33"
Replace All "Z"
UltraEditReOn
Top
Find MatchCase "ue44"
Replace All "ü"
UltraEditReOn
Top
Find MatchCase "ue55"
Replace All "Ü"
UltraEditReOn
Top
Find MatchCase "ae44"
Replace All "ä"
UltraEditReOn
Top
Find MatchCase "ae55"
Replace All "Ä"
UltraEditReOn
Top
Find MatchCase "oe44"
Replace All "ö"
UltraEditReOn
Top
Find MatchCase "oe55"
Replace All "Ö"
UltraEditReOn
Top
Find MatchCase "ss44"
Replace All "ß"
UltraEditReOn
Top
Find MatchCase "666"
Replace All "®"
UltraEditReOn
Top
Find MatchCase "777"
Replace All "-"
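
The reverse step can be sketched in Python in the same way. Again this is just an illustration with invented names (`REVERSE`, `decode`), limited to the markers shown in the shortened macro, and it follows the macro's order: capitals first, then the special characters.

```python
import re

# Inverse of the marker table used by the shortened macro.
REVERSE = {
    "ue44": "ü", "ue55": "Ü",
    "ae44": "ä", "ae55": "Ä",
    "oe44": "ö", "oe55": "Ö",
    "ss44": "ß", "666": "®", "777": "-",
}


def decode(text):
    """Restore capitals and special characters in an extracted keyword list."""
    # Step 1: a lowercase letter followed by "33" becomes the capital letter.
    text = re.sub(r"([a-z])33", lambda m: m.group(1).upper(), text)
    # Step 2: turn the remaining markers back into the special characters.
    for marker, char in REVERSE.items():
        text = text.replace(marker, char)
    return text


if __name__ == "__main__":
    print(decode("g33roe44ss44e"))
    print(decode("ue55ber777m33ass44"))
```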

All this was designed to handle German text, but you can also use it for English.
ExtPhr32 doesn't support Unicode, so you must convert any other special characters accordingly, e.g. all possible accents in Spanish or French (not done in the macros above), but the principle is the same.

Hope this helps,

Mike

P.S.
Hell, if only I could use UltraEdit to make a decent CAT-Tool



[Edited at 2010-10-27 19:11 GMT]

[Edited at 2010-10-27 19:23 GMT]

