Glossary extraction tool
Thread poster: Emmanuel V
Jan 19, 2010

I am trying to take the first step into building glossaries and came across an article that mentioned "term extraction from source documents".

Can anyone shed some light on if indeed there are "automated" term extraction tools/software and how they actually "choose" which words should go into a glossary?

Thank you


 

Attila Piróth  Identity Verified
France
Local time: 20:41
Member
English to Hungarian
+ ...
Monolingual term extraction in PlusTools Jan 19, 2010

Hi Emmanuel,

There are several such tools, including the +Extract feature of PlusTools (Wordfast's free add-on), which is downloadable here.

The tool uses a statistical approach: it selects all word combinations that occur a certain number of times in the document. This number can be set by the user -- just like the maximum length of the combination ("look for all combinations of 10 words or less that occur at least twice in the document").

This purely statistical approach usually produces a long list that you need to trim manually. You can get slightly better results by excluding such trivial words as "of, and, or, with, on, in, the, for, about, ..."; the list of such stop words is also customizable. But even if you use such a list, the automatically obtained results need to be checked and weeded out.
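To make the idea concrete, here is a minimal Python sketch of this kind of frequency-based extraction. It is not how +Extract itself is implemented (I don't know its internals); the function name `extract_candidates`, its parameters, and the rule of dropping any combination that begins or ends with a stop word are assumptions made for illustration. It also ignores sentence boundaries, which a real tool would probably respect.

```python
import re
from collections import Counter

# Illustrative stop-word list (taken from the examples above); meant to be customized.
STOP_WORDS = {"of", "and", "or", "with", "on", "in", "the", "for", "about"}


def extract_candidates(text, max_len=3, min_count=2, stop_words=STOP_WORDS):
    """Return {phrase: count} for word combinations of up to max_len words
    that occur at least min_count times in text."""
    # Lowercase and split into words (letters, with inner hyphens/apostrophes).
    words = re.findall(r"[^\W\d_]+(?:[-'][^\W\d_]+)*", text.lower())
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            # Assumption: skip combinations that start or end with a stop word.
            if gram[0] in stop_words or gram[-1] in stop_words:
                continue
            counts[" ".join(gram)] += 1
    # Keep only combinations that reach the repetition threshold.
    return {phrase: c for phrase, c in counts.items() if c >= min_count}


if __name__ == "__main__":
    sample = (
        "The translation memory stores segments. A translation memory speeds "
        "up work on the translation memory of the team."
    )
    # Print candidates, most frequent first; this list still needs manual trimming.
    for phrase, c in sorted(extract_candidates(sample).items(),
                            key=lambda item: (-item[1], item[0])):
        print(c, phrase)
```

Raising `max_len` or lowering `min_count` makes the candidate list grow quickly, which is why the manual check afterwards is still needed.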

Some ballpark figures: if your source text is about 30,000 words, and you set the minimum number of repetitions to as low as 2, you may end up with an automatic list of 1500-2000 words. Depending on your computer, the program takes between 5 and 20 minutes to produce this first list. After manual trimming, about 10% of these terms (200 words) are kept. This manual part takes 2 hours. (These figures are just very rough estimates.)

Whether or not this investment of time is justified depends heavily on the specific details of the project. In my experience, the invested time is often justified for team projects, where having the vocabulary established in advance saves time in the long run. (If terminology unification is done after translators have produced their first version, reworking their translations will be more time-consuming.) If you work alone, there is a good case for skipping this step and building the terminology database on the fly.

Kind regards,
Attila


 

Emmanuel V
TOPIC STARTER
Thanks Jan 19, 2010

Thank you very much for the input and the extra information.
Much appreciated.


 

Pablo Bouvier  Identity Verified
Local time: 20:41
German to Spanish
+ ...
Glossary extraction tool Jan 19, 2010

Emmanuel V wrote:

I am trying to take the first step into building glossaries and came across an article that mentioned "term extraction from source documents".

Can anyone shed some light on if indeed there are "automated" term extraction tools/software and how they actually "choose" which words should go into a glossary?

Thank you



Automatic, monolingual, free (German and English only): Beosphere

Automatic, bilingual, paid: Synchroterm

Manual, bilingual, paid: TermiDOG

[Edited at 2010-01-19 17:23 GMT]


 

