Word count for html files
Thread poster: gocacp

gocacp  Identity Verified
Germany
Local time: 17:59
German to English
Jun 1, 2011

Hi,
I am using the beta version of OmegaT to translate HTML files on a Mac. When I use the word count feature in OmegaT, it gives me a completely different result from OpenOffice (a difference of more than 1,000 words). I also tried an online tool, http://www.wordcounttool.com/, just to test the difference, and this gave me yet another result. When I tested this using a short test document, the online tool was the most accurate, but since I am billing a customer I need to be sure that the method is accurate.

What is the best way of counting the words? I wouldn't expect OmegaT to include words inside < > in the word count, but it seems like it does. Is there a way of removing the tags so that I can use the OmegaT word count function?

Thanks for your help,
Amy


 

Didier Briel  Identity Verified
France
Local time: 17:59
Member (2007)
English to French
+ ...
Check the options in the HTML filter Jun 1, 2011

gocacp wrote:
I am using the beta version of OmegaT to translate HTML files on a Mac. When I use the word count feature in OmegaT, it gives me a completely different result from OpenOffice (a difference of more than 1,000 words).

Such a difference is not usual.

Check the options in the HTML filter (or the XHTML filter, depending on your source files), and uncheck things you are not translating.


I also tried an online tool, http://www.wordcounttool.com/, just to test the difference, and this gave me yet another result. When I tested this using a short test document, the online tool was the most accurate, but since I am billing a customer I need to be sure that the method is accurate.

There is no such thing as an accurate word count. There are different methods, all giving different results. The important thing is to understand what is being counted.

I wouldn't expect OmegaT to include words inside < > in the word count, but it seems like it does.

It doesn't.


Is there a way of removing the tags so that I can use the OmegaT word count function?

OmegaT doesn't count the tags.
What may happen is that you are declaring things as translatable (e.g., images) while they are not to be translated for this project.

Didier
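To illustrate the point about understanding what is being counted, here is a minimal Python sketch that counts the same HTML snippet three ways: text only, text plus selected attributes (such as `alt` and `title`), and a naive whitespace split of the raw markup. It is not how OmegaT or any other tool actually counts; the word pattern, the `TextCollector` class, the function names, and the sample snippet are all assumptions made for the example.

```python
import re
from html.parser import HTMLParser

# Assumed definition of a "word": a run of letters/digits, optionally
# joined by an apostrophe or hyphen. Real tools use their own rules.
WORD_RE = re.compile(r"\w+(?:['’-]\w+)*")


class TextCollector(HTMLParser):
    """Collects visible text and, optionally, selected attribute values."""

    def __init__(self, translatable_attrs=()):
        super().__init__(convert_charrefs=True)
        self.translatable_attrs = set(translatable_attrs)
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        for name, value in attrs:
            if name in self.translatable_attrs and value:
                self.chunks.append(value)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def count_words(html, translatable_attrs=()):
    """Count words in text nodes plus any attributes declared translatable."""
    parser = TextCollector(translatable_attrs)
    parser.feed(html)
    parser.close()
    return sum(len(WORD_RE.findall(chunk)) for chunk in parser.chunks)


def count_raw(html):
    """Naive count: split the raw markup on whitespace, tags included."""
    return len(html.split())


SAMPLE = (
    '<p>Read the <a href="guide.html" title="User guide">manual</a>.</p>'
    '<img src="logo.png" alt="Company logo">'
)

print(count_words(SAMPLE))                    # text only
print(count_words(SAMPLE, ("alt", "title")))  # text plus alt/title attributes
print(count_raw(SAMPLE))                      # raw markup split on whitespace
```

The three calls give different totals for the same file, which is the kind of gap that appears when one tool treats attributes as translatable and another does not.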


 

Manticore (X)  Identity Verified

Local time: 17:59
English to German
+ ...
@Didier Jun 2, 2011

It might interest you - I have just started translating a large *.docx text. OmegaT is better than anything else on the market, irrespective of price.

 

Didier Briel  Identity Verified
France
Local time: 17:59
Member (2007)
English to French
+ ...
Thank you for the feedback Jun 3, 2011

Roland Fischer wrote:
It might interest you - I have just started translating a large *.docx text. OmegaT is better than anything else on the market, irrespective of price.

Thank you for the feedback.

OmegaT relies on its user community.

There are plenty of ways of getting involved, from a simple "yes" on Sourceforge, to more active roles.

Didier


 
Post removed: This post was hidden by a moderator or staff member because it was not in line with site rule

ma1cius
United Kingdom
Local time: 16:59
French to English
+ ...
large variance between OmegaT, SmartCAT and pasting text into Word, when counting words in Excel Mar 5

I just tried assessing 9 xlsx files in OmegaT and got a total word count of 11,063. I ran the same files through SmartCAT and got a total word count of 18,336.

Copying out the text from the largest file into a Word file, not including segments that were just numbers, I got a count of 12,053 from Microsoft Word's built-in word count. Including the numbers, this came to 19,147.
SmartCAT counted the same document at 13,187 words and counted 2,372 segments that contained just numbers or symbols, which, I think, were not included in the word count. OmegaT counted this file at 8,029 words.

This variance seems enormous. I can understand if it's not counting number/symbol-only segments, which, I think, accounts for much of the discrepancy between SmartCAT and Word but, even allowing for that, OmegaT's count comes out at about two thirds of MS Word's count. There is a lot of repetition but this should, surely, just be shown in the statistics and not affect the total words.

Do I have something majorly wrong in my OmegaT settings or have I somehow misunderstood how OmegaT presents word counts?

Can anyone explain how I might have got such different word counts and what I can do to restore my faith in the statistics generated by these CAT tools? I am using OmegaT 3.6.0 update 8.

Thanks.

Malc


 

Didier Briel  Identity Verified
France
Local time: 17:59
Member (2007)
English to French
+ ...
OmegaT does not count repetitions on XLSX files Mar 5

ma1cius wrote:
I just tried assessing 9 xlsx files in OmegaT and got a total word count of 11,063. I ran the same files through SmartCAT and got a total word count of 18,336.

OmegaT does not count repetitions in XLSX files, simply because they are not in the file (Microsoft removes them). To get a word count including repetitions, save the XLSX file under another format (e.g., XML 2003 spreadsheet).

Copying out the text from the largest file into a Word file, not including segments that were just numbers, I got a count of 12,053 from Microsoft Word's built-in word count. Including the numbers, this came to 19,147.
SmartCAT counted the same document at 13,187 words and counted 2,372 segments that contained just numbers or symbols, which, I think, were not included in the word count. OmegaT counted this file at 8,029 words.

It's not usual. Generally, OmegaT is rather close to Word.

This variance seems enormous. I can understand if it's not counting number/symbol-only segments,

Indeed, OmegaT does not count numbers.

which, I think, accounts for much of the discrepancy between SmartCAT and Word but, even allowing for that, OmegaT's count comes out at about two thirds of MS Word's count. There is a lot of repetition but this should, surely, just be shown in the statistics and not affect the total words.

As I wrote above, this is not usual for Word documents. Have you checked what is loaded or not in OmegaT for the Word filter? Options > File Filters > Microsoft XML.

Do I have something majorly wrong in my OmegaT settings or have I somehow misunderstood how OmegaT presents word counts?

Another setting that might affect word count is Options > Tag processing (whether you include custom tags or not in statistics).

Can anyone explain how I might have got such different word counts and what I can do to restore my faith in the statistics generated by these CAT tools? I am using OmegaT 3.6.0 update 8.

For XLSX files, the explanation is obvious. For Word, it's hard to say without details.

Didier
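The two effects mentioned above (repeated strings stored only once in an XLSX file, and numbers left out of the count) can be sketched in Python. An XLSX file is a zip archive in which text cells normally point by index into a shared-strings table, so each distinct string appears in that table once. The sketch below counts words over the table alone and then over every cell reference, with and without number-only tokens. Whether OmegaT's filter reads the file exactly this way is an assumption; the function names, the word pattern, and the sample XML are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# SpreadsheetML main namespace used by sharedStrings.xml and sheet XML.
MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
NS = {"m": MAIN_NS}
WORD_RE = re.compile(r"\w+")  # assumed word definition, for illustration


def count_words(text, include_numbers=True):
    """Count tokens, optionally dropping tokens made only of digits."""
    tokens = WORD_RE.findall(text)
    if not include_numbers:
        tokens = [t for t in tokens if not t.isdigit()]
    return len(tokens)


def shared_strings(shared_xml):
    """Return each distinct string stored in the shared-strings table."""
    root = ET.fromstring(shared_xml)
    return [
        "".join(t.text or "" for t in si.iter("{%s}t" % MAIN_NS))
        for si in root.findall("m:si", NS)
    ]


def string_cell_indexes(sheet_xml):
    """Return the shared-string index of every text cell in a sheet."""
    root = ET.fromstring(sheet_xml)
    return [
        int(c.find("m:v", NS).text)
        for c in root.iter("{%s}c" % MAIN_NS)
        if c.get("t") == "s"
    ]


def totals(shared_xml, sheet_xml, include_numbers=True):
    """Return (words in unique strings, words counting every repetition)."""
    strings = shared_strings(shared_xml)
    unique = sum(count_words(s, include_numbers) for s in strings)
    repeated = sum(
        count_words(strings[i], include_numbers)
        for i in string_cell_indexes(sheet_xml)
    )
    return unique, repeated


# Hypothetical miniature workbook: two distinct strings, four text cells.
SHARED = (
    '<sst xmlns="%s" count="4" uniqueCount="2">'
    "<si><t>Unit price</t></si><si><t>Total for 2011</t></si></sst>" % MAIN_NS
)
SHEET = (
    '<worksheet xmlns="%s"><sheetData><row r="1">'
    '<c r="A1" t="s"><v>0</v></c><c r="B1" t="s"><v>1</v></c>'
    '<c r="C1" t="s"><v>0</v></c><c r="D1" t="s"><v>0</v></c>'
    "</row></sheetData></worksheet>" % MAIN_NS
)

print(totals(SHARED, SHEET))                         # numbers included
print(totals(SHARED, SHEET, include_numbers=False))  # numbers excluded
```

For a real file, the two XML documents could be read with `zipfile.ZipFile(path).read(...)`; Excel normally writes them as `xl/sharedStrings.xml` and `xl/worksheets/sheet1.xml`, though those paths are set by the package relationships and may differ.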


 

Daniel Frisano
Switzerland
Local time: 17:59
Member (2008)
English to Italian
+ ...
Word Mar 5

Right-click, select "Open with...", select MS Word, use that word count.

 

