Why? : Why do match counts change when files are analyzed together vs. separately?
Thread poster: lkopplin
Feb 4, 2008

I have a batch of 30 files. I'm using Workbench (Trados 6.5) and the same TM to analyze the files in 2 ways.

When I analyze the files in 3 separate batches of 10 files each, then add the readouts together, I get the following:

Reps: 4117
100%: 5788
95-99%: 384
85-94%: 477
75-84%: 427
50-74%: 129
No Match: 12,805

When I analyze the files together (as a single batch of 30), the readout is as follows:

Reps: 4848
100%: 5788
95-99%: 286
85-94%: 387
75-84%: 344
50-74%: 112
No Match: 12,362

Repetitions go up when the files are analyzed together. 100% matches stay the same. Fuzzy matches go down.
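As a quick sanity check on the two readouts above, here is a minimal sketch that totals each one and prints the per-category difference. The dictionary names are only illustrative; the figures are copied from the readouts.

```python
# Figures copied from the two readouts above.
separate = {"Reps": 4117, "100%": 5788, "95-99%": 384, "85-94%": 477,
            "75-84%": 427, "50-74%": 129, "No Match": 12805}
together = {"Reps": 4848, "100%": 5788, "95-99%": 286, "85-94%": 387,
            "75-84%": 344, "50-74%": 112, "No Match": 12362}

# Both analyses should cover the same overall volume.
total_separate = sum(separate.values())
total_together = sum(together.values())
print(total_separate, total_together)

# Per-category change when moving from three batches to one batch.
diff = {band: together[band] - separate[band] for band in separate}
for band, change in diff.items():
    print(f"{band}: {change:+d}")
```

Both totals come to 24,127, so nothing is gained or lost overall. The 731 extra repetitions are exactly offset by 288 fewer fuzzy matches and 443 fewer No Match counts.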

My question: Why do the match counts differ?
Can someone explain why fuzzy matches go down? And why 100% matches stay the same?

Here is my theory, but perhaps I am wrong: When files are analyzed together, there are more occurrences of the same fuzzy match than there would be if the files were analyzed separately. When there are multiple occurrences of the same fuzzy match, Trados would count the fuzzy match once as a fuzzy match, then count the remaining occurrences as repetitions. So repetitions go up, fuzzy matches go down.
But if that theory is correct, why wouldn't it apply to 100% matches as well? Is it because a 100% match is the maximum benefit you can achieve through TM, so there is no need to re-assign a recurring 100% match as a repetition?

Any help or insight anyone can provide would be greatly appreciated!

Lauren Nemec
Marketing Manager
Translatus, Inc.

[Edited at 2008-02-04 17:09]

Claudia Digel
Your theory is correct Feb 4, 2008

Hi Lauren,

Your theory is correct. If you analyze all files in one batch, a recurring fuzzy match will be counted as a fuzzy match just once; all other occurrences will be counted as repetitions.

The number of 100% matches does not change because a 100% match is different from a repetition. A 100% match is an identical match from the TM, i.e. the sentence is already in the TM before you start your translation. All occurrences of this sentence in your files will be counted as 100% matches, not as repetitions, no matter how often the sentence appears in the files. Since you use the same TM for both analyses, the number of 100% matches doesn't change.

A repetition is a 'new 100% match' which you generate from within your translation files, i.e. the sentence is not in the TM before you start the translation. (There might be a fuzzy match for the sentence but not a 100% match.) Of course, this new sentence can occur in several files, which means you get cross-file repetitions. This is why the number of repetitions rises when you analyze more files in one batch.
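The counting rules described above can be sketched as a toy model. This is only an illustration of the logic in this thread, not Trados's actual algorithm: the function `analyze`, the lookup table `tm_band` and the one-letter "segments" are all made up for the example, and it counts segments rather than words.

```python
from collections import Counter

def analyze(files, tm_band):
    """Toy analysis of one batch (hypothetical model, not Trados itself).

    files   -- list of files, each a list of segment strings
    tm_band -- assumed lookup: segment -> best TM match band, e.g. "100%"
               or "85-94%"; segments missing from it count as "No Match"
    """
    counts = Counter()
    seen = set()  # non-100% segments already counted once in this batch
    for segments in files:
        for seg in segments:
            band = tm_band.get(seg, "No Match")
            if band == "100%":
                # Already in the TM: always a 100% match, never a repetition.
                counts["100%"] += 1
            elif seg in seen:
                # Second and later occurrences within the same batch.
                counts["Reps"] += 1
            else:
                # First occurrence: counted in its fuzzy or No Match band.
                counts[band] += 1
                seen.add(seg)
    return counts

# "A" is in the TM, "B" has a fuzzy match, "C" has no match at all.
tm_band = {"A": "100%", "B": "85-94%"}
file1 = ["A", "B", "C"]
file2 = ["A", "B", "C", "B"]

separate = analyze([file1], tm_band) + analyze([file2], tm_band)
together = analyze([file1, file2], tm_band)
print("separate:", dict(separate))
print("together:", dict(together))
```

In this toy run the 100% count is 2 either way, while analyzing both files together moves one fuzzy and one No Match occurrence into repetitions (Reps goes from 1 to 3).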

Hope this helps.

Best regards,

ViktoriaG
You are right Feb 4, 2008

Your theory is right, and Claudia explains it well.

I just wanted to add that the best way to count words when using CAT tools is to analyse the files in the same batches that go to each translator. So, if you are splitting a project of six files between two translators (three files each), don't analyse them one by one, but don't analyse them all in one batch either. If you send files 1, 2 and 3 to translator A and files 4, 5 and 6 to translator B, then analyse files 1, 2 and 3 together for translator A, and files 4, 5 and 6 together for translator B.

This will help leverage the TM to the max, and it will also be fairer to each translator. Of course, use this analysis method to quote a rate to the end client as well.
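As a small sketch of that grouping step, assuming you keep a simple file-to-translator list (the file names and the helper `batches_by_translator` are hypothetical):

```python
from collections import defaultdict

def batches_by_translator(assignments):
    """Group files into one analysis batch per translator."""
    batches = defaultdict(list)
    for filename, translator in assignments.items():
        batches[translator].append(filename)
    return dict(batches)

# Hypothetical assignment: files 1-3 go to translator A, files 4-6 to B.
assignments = {
    "file1": "A", "file2": "A", "file3": "A",
    "file4": "B", "file5": "B", "file6": "B",
}

batches = batches_by_translator(assignments)
for translator, files in batches.items():
    # Each of these lists is what you would analyse together in one run.
    print(translator, files)
```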

[Edited at 2008-02-04 19:37]

Thank you! Feb 6, 2008

Thank you Claudia and Viktoria for your help. Everything is clear now.

