How to find a list of most repeated words/phrases in a document
Thread poster: Zeki Guler

Zeki Guler  Identity Verified
Ireland
Local time: 23:17
Member (2012)
English to Turkish
+ ...
May 22, 2015

Hi everybody,

While translating in my CAT tool (memoQ), I add some frequent terms to my term base to avoid translating them several times later in the document. Is there a way to list the most frequently used terms in the text?

That would let me add them to my TM at the very beginning of the project, rather than after I have already translated some terms several times, which would save time and effort.

Best,



Selcuk Akyuz  Identity Verified
Turkey
Local time: 02:17
Member (2006)
English to Turkish
+ ...
try CafeTran May 23, 2015

Hi Zeki,

This feature and many other useful ones are available at CafeTran. See http://cafetran.wikidot.com/extracting-frequent-words



Zeki Guler  Identity Verified
Ireland
Local time: 23:17
Member (2012)
English to Turkish
+ ...
TOPIC STARTER
Thanks May 23, 2015

Hi Mr. Selçuk,

That's exactly what I was looking for. Many thanks.

Kind Regards,



Meta Arkadia
Local time: 06:17
English to Indonesian
+ ...
The free version of CafeTran... May 23, 2015

... will do. Download the free version for your operating system (OS X or Linux; it's even available for Windows). It's only about 10 MB. Run CafeTran and a window will appear. Drop your document on the Dashboard in that window, and that's it. I don't think there's a need to set languages or change other settings. Your document will load. Next (and this is different from what the Wiki entry Selçuk mentioned says), go to Menu | Task | Frequent Words and take your pick.

If Selçuk hadn't mentioned the link to the Wiki, I wouldn't even have replied to this topic, but it solves a problem I have at the moment. I was going to use AntConc to extract the words from a rather large file, only because I can make AntConc "deduct" stopwords from the resulting list. That will reduce the list of frequent words considerably. But AntConc is incredibly slow. And just now I read the Wiki page referred to. Large files are no problem for CafeTran, it's fast, and I can add the stopwords list to deduct them from the frequent words. Better still, I can use that stopwords list together with my termbases to eliminate all the stopwords and all the words already in my termbase, leaving me with only the terms that still need to be translated. Incredible!
However, the free version of CafeTran is limited to a certain number of TM and termbase entries (1,000?), so if you want to use larger termbases, you'll have to buy CafeTran (€80/year), or... repeat the process a number of times.
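
For anyone who would rather see the idea outside a CAT tool, here is a minimal Python sketch of the same filtering: count the words, then drop stopwords and anything already in a termbase. It is only an illustration of the principle, not how CafeTran or AntConc work internally. The function name `frequent_terms` and the sample sentence are made up for this example.

```python
import re
from collections import Counter


def frequent_terms(text, stopwords=(), known_terms=(), top_n=20):
    """Return the top_n most frequent words, skipping stopwords and known terms."""
    # Lowercase the text and pull out runs of letters/digits,
    # allowing one internal apostrophe (e.g. "don't").
    words = re.findall(r"[^\W_]+(?:'[^\W_]+)?", text.lower())
    # Anything in either list is excluded from the count.
    excluded = {w.lower() for w in stopwords} | {w.lower() for w in known_terms}
    counts = Counter(w for w in words if w not in excluded)
    return counts.most_common(top_n)


# Hypothetical sample text; "the" stands in for a stopword list,
# "valve" for a term that is already in the termbase.
sample = "The pump feeds the valve. The valve controls the pump pressure."
print(frequent_terms(sample, stopwords={"the"}, known_terms={"valve"}, top_n=3))
```

With real data you would load the stopwords and the source column of the termbase from files. This sketch only handles single words, so multi-word termbase entries are not filtered out.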

Cheers,

Hans



Meta Arkadia
Local time: 06:17
English to Indonesian
+ ...
Thinking aloud May 23, 2015

This is what you'll get:



It's pretty obvious that those stopwords ruin the results. You'll have to deduct them from the frequent words list. You can find a list of stopwords for various languages here. However, I don't think you can use them as they are, because they contain no target language. So you'll have to add something as the second column of a tab-delimited file. I don't think it matters what you add: "I cheat" for all entries will do. The files from the link I mentioned seem to consist of fewer than 1,000 words, so that's OK for CafeTran. There are several files per language, though, but since CafeTran can handle an unlimited number of resource files, you should be able to deduct all stopwords in one go.

As per the subject line, I'm thinking aloud. This is completely new to me. Where am I going wrong?

[Edit] You probably can't deduct them in one go, because the limit for termbases applies to all termbases combined. Unfortunately, I forgot what that limit is, and I can't seem to find it. [/Edit]
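
A rough sketch of that conversion in Python, assuming the stopword file has one word per line. It turns each word into a "word, tab, dummy target" line. Whether CafeTran accepts exactly this layout as a glossary is untested here, and the function name `stopwords_to_glossary` is made up for the example.

```python
def stopwords_to_glossary(stopwords, placeholder="I cheat"):
    """Build tab-delimited glossary text: source word, tab, dummy target."""
    # Strip whitespace and skip empty lines from the stopword list.
    cleaned = (word.strip() for word in stopwords)
    return "".join(f"{word}\t{placeholder}\n" for word in cleaned if word)


# Hypothetical three-word stopword list; a real one would be read from a file.
print(stopwords_to_glossary(["the", "and", "of"]), end="")
```

To process a real list, pass the lines of the stopword file to the function and write the returned text to a new file, using UTF-8 for both.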

Cheers,

Hans

[Edited at 2015-05-23 03:58 GMT]



Emin Arı  Identity Verified
Turkey
Local time: 02:17
Member
English to Turkish
+ ...
Extract term? May 23, 2015

Though I have not used it much, there is a term extraction function in memoQ. Doesn't that help?


Dominique Pivard  Identity Verified
Local time: 01:17
Finnish to French
Term extraction in memoQ May 23, 2015

Zeki G. wrote:
While translating in my CAT tool (memoQ), I add some frequent terms to my term base to avoid translating them several times later in the document. Is there a way to list the most frequently used terms in the text?

Did you have a look at this:

http://kilgray.com/memoq/2015/help-en/index.html?term_extraction.html



M Pradeep Kumar  Identity Verified
India
English to Telugu
+ ...
Try this link May 23, 2015

Hi,

You can try this link.

http://www.textfixer.com/tools/online-word-counter.php

It also has an option to remove common words.



DZiW
Ukraine
English to Russian
+ ...
PlusTools May 23, 2015

If it's about txt/doc/rtf and some other formats, the MS Word add-on PlusTools served me well in MS Word XP and 2003 (not sure about newer versions, though). It analyzes weighted words and common phrases/synonyms, and its settings are quite flexible.
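
To give an idea of what counting "common phrases" can mean, here is a small Python sketch that counts repeated word sequences (n-grams). It is not how PlusTools does it, just one simple way to get a list of repeated phrases from plain text. The name `frequent_phrases` and the sample text are invented for the example.

```python
import re
from collections import Counter


def frequent_phrases(text, n=2, min_count=2):
    """Count n-word sequences and return those seen at least min_count times."""
    counts = Counter()
    # Split into rough sentences first so phrases do not cross sentence ends.
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = re.findall(r"[^\W_]+", sentence)
        # Slide a window of n words over the sentence.
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return [(phrase, c) for phrase, c in counts.most_common() if c >= min_count]


# Hypothetical sample text.
sample = "Check the control valve. Replace the control valve if the control valve leaks."
print(frequent_phrases(sample, n=2))
```

Raising `n` gives longer phrases. Phrases that start or end with a stopword (such as "the control") would still need filtering.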

[Edited at 2015-05-23 15:14 GMT]



