Extracting most frequent terms from source files in Trados project
Thread poster: Helen Portefaix

Helen Portefaix  Identity Verified
United Kingdom
Local time: 10:32
Member (2012)
English to French
Oct 14, 2016

Hello,

I work with SDL Trados 2014.
I created a new project containing 48 source files (all Excel), and I would like to know if there is a way to extract recurring terms within this project in order to produce a list of the most frequent words used across all source files.
Is there any such tool? An app or an add-on maybe?

Thank you in advance for your help.



Lianne van de Ven  Identity Verified
United States
Local time: 05:32
Member (2008)
English to Dutch
+ ...
word count & frequency / sobolsoft Oct 14, 2016

I use PDF Word Count & Frequency by Sobolsoft for this:
http://www.sobolsoft.com/pdffreq/

I chose the pdf version (rather than excel, word etc.) because I can convert other file types to pdf and let it count words. In addition to just counting the number of words, it can output frequency statistics for individual words in an excel spreadsheet, which is what you are looking for. This has indeed helped me focus on keywords first.


xxxToon Theuwis  Identity Verified
Belgium
Local time: 11:32
English to Dutch
+ ...
Interesting Oct 14, 2016

Lianne van de Ven wrote:
it can output frequency statistics for individual words in an excel spreadsheet, which is what you are looking for. This has indeed helped me focus on keywords first.


This sounds like an interesting tool. I just looked at the website. Does it filter out the obvious articles, pronouns, etc.? Does it handle compound nouns too?



Lianne van de Ven  Identity Verified
United States
Local time: 05:32
Member (2008)
English to Dutch
+ ...
video Oct 14, 2016

Toon Theuwis wrote:

This sounds like an interesting tool. I just looked at the website. Does it filter out the obvious articles, pronouns, etc.? Does it handle compound nouns too?



I use it very basically with default settings, but if you watch the instruction video, you will see that you can play with all kinds of settings. If you want to ignore articles, for example, you could also delete that particular line in the spreadsheet, e.g. "it - 275" (or: het - 275), and that count will be subtracted from the sum. You can easily delete all irrelevant words, create a dictionary from the remaining ones and import it into your CAT tool.

I just tried converting an excel sheet to pdf and use it in pdf count, but for some reason it read all words in a cell as one, so something went wrong with the conversion from excel to pdf. But I am still reinstalling my system from a crash, and I am getting some font errors and have some pdf converter issues as well that I am still ironing out.

Other than that, not a tool that I use a lot, but it can definitely do what Helen is asking for.

[Edited at 2016-10-14 18:57 GMT]



Helen Portefaix  Identity Verified
United Kingdom
Local time: 10:32
Member (2012)
English to French
TOPIC STARTER
Need a list of most frequent words for all 48 files combined, not file by file. Oct 14, 2016

Thank you very much for your reply, Lianne.
It looks very interesting, but what I am looking for is something I could use directly within Trados, and that would extract the words appearing most frequently in the source content of my Trados project as a whole, not file by file.
I really would like to know if a simple add-on can do this, without having to convert or manipulate source documents. Something within Trados that could tell me: here are the xxx most frequent words in the source content of this project, regardless of the number or type of individual files that make up the project.
But thanks again!



Lianne van de Ven  Identity Verified
United States
Local time: 05:32
Member (2008)
English to Dutch
+ ...
Would be nice Oct 14, 2016

Have you searched for any add-ons?
Something like SDL Multiterm Extract?
http://producthelp.sdl.com/SDL_MultiTerm_2015/client_en/Guides/SDL_MultiTerm_Extract%20User%20Guide.pdf

http://www.sdl.com/cxc/language/terminology-management/multiterm/extract.html
(leading to a 1 hr video....)

There is nothing I can find in the app store:
http://appstore.sdl.com/

Such an add-on as you suggest would be great. They just had a contest for suggestions. I think what was chosen for development was a quick-count add-on that lets you do a word count without setting up a project. Maybe a frequency counter can be added to that.



Helen Portefaix  Identity Verified
United Kingdom
Local time: 10:32
Member (2012)
English to French
TOPIC STARTER
Will definitely look into SDL Multiterm Extract Oct 17, 2016

Thank you so much Lianne for your valuable tips and contribution.


rthomas  Identity Verified
Local time: 10:32
German to English
How about a corpus tool like AntConc or Sketch Engine? Oct 31, 2016

Hi Helen,

This answer might be a couple of weeks too late, but I wanted to mention the tools I've used to do something similar. AntConc is the one I've used the most, and it's freeware - there's a quick guide here and you can download the tool from http://www.laurenceanthony.net/software/antconc/. Briefly, I convert the source files to simple TXT and load them into AntConc, then use the 'Word List' function to sort all the individual words by frequency (then copy and paste the results into e.g. Excel and manually delete the "obvious" words/rows to leave genuine key terms only), and then use the 'Clusters/N-grams' function to identify the frequently-occurring phrases rather than individual words, and do the same again (copy/paste to Excel and delete the irrelevant phrases). Then I add translations for any key terms/phrases I want to include, and use either the SDL Glossary Converter or Multiterm Convert to create a termbase. The process is a bit more manual and time-consuming than I'd like, but for big/repetitive projects I find that spending even an hour or two up front still saves time later.
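For anyone comfortable with a little scripting, the Word List and N-gram steps above can be roughly approximated in Python. This is only a minimal sketch and not what AntConc does internally: the stoplist, the function names and the idea of a folder of `*.txt` exports are all assumptions made for illustration. It pools every file into one frequency list, which is the "project as a whole" view Helen asked about.

```python
import re
from collections import Counter
from pathlib import Path

# Hypothetical stoplist of "obvious" words; extend it for your own language.
STOPLIST = {"the", "a", "an", "of", "and", "to", "in", "is", "it"}


def tokenize(text):
    """Lowercase the text and return its words.

    Apostrophes and hyphens inside a word are kept, so "don't" stays whole
    instead of being chopped into "don" and "t".
    """
    return re.findall(r"[^\W\d_]+(?:['’-][^\W\d_]+)*", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a space-joined string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequency_lists(texts, stoplist=STOPLIST, n=2):
    """Count single words and n-word phrases across all texts combined."""
    words, phrases = Counter(), Counter()
    for text in texts:
        tokens = tokenize(text)
        words.update(t for t in tokens if t not in stoplist)
        for gram in ngrams(tokens, n):
            parts = gram.split(" ")
            # Keep a phrase only if it neither starts nor ends with a stopword.
            if parts[0] not in stoplist and parts[-1] not in stoplist:
                phrases[gram] += 1
    return words, phrases


def load_texts(folder):
    """Read every .txt file in a folder (assumes UTF-8 exports)."""
    return [p.read_text(encoding="utf-8") for p in sorted(Path(folder).glob("*.txt"))]


# Small inline demo; in practice you would call load_texts("some_folder").
demo = ["Read the instruction manual.", "The instruction manual is in the box."]
demo_words, demo_phrases = frequency_lists(demo)
for term, count in demo_words.most_common(5):
    print(term, count)
```

The output still needs the same manual review as the AntConc lists. Note that this sketch ignores sentence boundaries, so a phrase can span the end of one sentence and the start of the next.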

I've also used Sketch Engine to a limited extent - I've barely scratched the surface but it's a brilliant tool. You'd need to take out a paid subscription and you have to upload the files you're working on (which may not be an option, of course). Both Sketch Engine and AntConc let you specify 'stoplists' of obvious terms that will be ignored, though I've only tried this in Sketch Engine. From my limited experience, extracting the genuine key terms/phrases is a lot quicker with Sketch Engine than with AntConc. Since both tools are designed for all kinds of corpus work, rather than being specifically for translators, there are plenty of functions that I haven't got my head round properly, but both could be worth a look. Hope this helps!

Richard


EL_isa
United Kingdom
Local time: 10:32
English to Italian
+ ...
Antconc / relevant terms and alphabetical list Nov 9, 2016

Dear Richard,

I am so glad to read your post. I hope you will be able to answer the following three questions.

1. Is it of any use to create alphabetical lists?
2. I read that you manually deleted the words/phrases that were not relevant to your search. Is this the only way? I am deleting words manually too.
3. Did you happen to get many "mistakes"? For example, I get chopped words or single letters for no reason.


Thanks a lot.
Regards,
Elisa


EL_isa
United Kingdom
Local time: 10:32
English to Italian
+ ...
Other questions (Antconc) Nov 9, 2016

Another couple of questions, Richard:

4. I wanted to find relevant word phrases, so I thought I had to use the "Cluster" function with "Words" as the search term. However, that gave me inaccurate results (i.e. three-word clusters instead of two-word clusters, even though I had selected size min. 2, max. 2). So I switched to "N-Grams" as the search term and got precise results. Do you think this procedure is correct? What is the actual difference between Clusters and N-Grams?

5. Is it possible to zoom out the view so as to see more results in a single screenshot?
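
To make the distinction in question 4 concrete, here is a small Python sketch. It rests on an assumption about how the two options differ (worth checking against the AntConc help file): that N-Grams lists every run of n words in the corpus, while Clusters keeps only the runs that contain the search term. The function names are made up for illustration and nothing here is AntConc's own code.

```python
from collections import Counter


def all_ngrams(tokens, n):
    """Count every run of n consecutive tokens (the assumed "N-Grams" view)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clusters(tokens, n, term):
    """Count only the n-word runs that contain `term` (the assumed "Clusters" view)."""
    return Counter({gram: count
                    for gram, count in all_ngrams(tokens, n).items()
                    if term in gram})


tokens = "press the start button then press the stop button".split()
print(all_ngrams(tokens, 2).most_common())
print(clusters(tokens, 2, "button").most_common())
```

Under that assumption, the cluster list is always a subset of the n-gram list, so N-Grams is the natural choice when you have no particular search term in mind.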

Many Thanks. Regards,
Elisa



Selcuk Akyuz  Identity Verified
Turkey
Local time: 13:32
Member (2006)
English to Turkish
+ ...
Buy a separate SDL tool for it? Nov 9, 2016

Lianne van de Ven wrote:

Have you searched for any add-ons?
Something like SDL Multiterm Extract?
http://producthelp.sdl.com/SDL_MultiTerm_2015/client_en/Guides/SDL_MultiTerm_Extract%20User%20Guide.pdf

http://www.sdl.com/cxc/language/terminology-management/multiterm/extract.html
(leading to a 1 hr video....)

There is nothing I can find in the app store:
http://appstore.sdl.com/

Such an add-on as you suggest would be great. They just had a contest for suggestions. I think what was chosen for development was a quick-count add-on that lets you do a word count without setting up a project. Maybe a frequency counter can be added to that.


MultiTerm Extract is EUR 400. Surprisingly, it is not included in Studio Freelance Plus, which is EUR 655. Some CAT tools, on the other hand (e.g. Deja Vu X3), can extract frequent terms, all included in the price. So why should one buy an additional tool for it? Moreover, all files are opened as a single project in Deja Vu X3, so you can extract frequent terms in a few clicks. This feature is called Lexicon in Deja Vu X3.

@Helen:
Even a trial version of Deja Vu X3 is sufficient.



Alyssa Yorgan  Identity Verified
United States
Local time: 05:32
Russian to English
CafeTran has frequent term extraction Dec 12, 2016

Title says it all: CafeTran includes this function in its standard version and lets you search along several parameters.

MikeTrans
Germany
Local time: 11:32
Member (2005)
Italian to German
+ ...
Extraphr32 is a free monolingual extraction tool Dec 12, 2016

There is a Windows and a Java version of it (I don't have the link). The only quirk is that it doesn't support Unicode text (only ANSI), but that's all I have to complain about. Otherwise it has a plethora of options: stoplists, breaklists, statistics, how often a word must occur to be extracted, how long an expression must be to be extracted, etc.
I find this one better than MultiTerm Extract (about $300-$400!), which doesn't give you the expression frequencies in a project.
For big projects, I often convert my termbases to TMX, create a special TM from them, then extract terms from my project files with Extraphr32 and arrange the list like this:

machine 38
manual 27
correct function 05
instruction manual 03
etc...

I then pre-translate this list with the special TMs from my termbases at a fuzzy match value of 65%. In the blink of an eye I can tell which terminology I have at my disposal and which I don't, so I can estimate the time needed for term research, delivery times, etc.
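
The same "what do I already have?" check can be sketched in Python. This is only a rough illustration under stated assumptions: the standard library's `difflib` similarity ratio stands in for a CAT tool's fuzzy match score (the two are calculated differently, so 0.65 here is not equivalent to a 65% match in Trados), and the function names and the tiny termbase are hypothetical.

```python
import difflib


def parse_term_list(lines):
    """Parse 'term count' lines such as 'instruction manual 03'.

    Lines that do not end in a number (e.g. 'etc...') are skipped.
    """
    terms = []
    for line in lines:
        term, _, count = line.strip().rpartition(" ")
        if term and count.isdigit():
            terms.append((term, int(count)))
    return terms


def split_by_coverage(terms, termbase, threshold=0.65):
    """Split extracted terms into those with a close termbase entry and the rest.

    difflib's ratio is a stand-in for a real fuzzy match score (assumption).
    """
    covered, missing = [], []
    for term, count in terms:
        best = max(
            (difflib.SequenceMatcher(None, term.lower(), entry.lower()).ratio()
             for entry in termbase),
            default=0.0,
        )
        (covered if best >= threshold else missing).append((term, count))
    return covered, missing


extracted = parse_term_list([
    "machine 38",
    "manual 27",
    "correct function 05",
    "instruction manual 03",
])
termbase = ["machine", "instruction manual"]  # hypothetical source-side entries
covered, missing = split_by_coverage(extracted, termbase)
print("covered:", covered)
print("missing:", missing)
```

Weighting the missing terms by their frequency gives a quick feel for how much term research a project will need.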
The problem is that it takes time to get there (more than 20 minutes), so it's not useful if you have to tell a client in a hurry whether you accept a project or not. It would be much better if Trados had such a feature...

CAT tools I used with term extraction:
MemoQ (very fast and useful term extraction with subsequent checking against termbases; dedicated menu and window)
Déjà Vu (Lexicon, which generates combinations of project terms; it's NOT a term extraction because the brute-force method produces a lot of 'garbage', but it can check the output against termbases.)
Fusion Translate (an old CAT tool from Canada, works with term extractions and sub-segment matches)

What I miss most in Trados Studio is pre-translating using only termbases. This is not currently possible, but SDLX can do that! Maybe the makers should remember their old versions...

Greetings,
Mike

[Edited at 2016-12-12 11:09 GMT]

