Internal fuzzy matching: WFP 3.4 x Trados 2011 x MemoQ 6.2
Thread poster: Samuel Murray

Samuel Murray  Identity Verified
Netherlands
Local time: 06:44
Member (2006)
English to Afrikaans
+ ...
Aug 22, 2015

Hello everyone

I was curious about how the three tools on my computer that can calculate internal fuzzy matching actually calculate it. On my computer I have WFP 3.4, Trados 2011 and MemoQ 6.2 (trial). If what I write here is no longer true for newer versions of these tools, please let me know.

I took a test file of 5000 words (no repetitions) and created five versions of it, namely:
(a) the original file,
(b) one with all segments sorted A-Z,
(c) one sorted Z-A,
(d) one with all segments sorted by length (long to short), and
(e) one sorted short to long.

I then analysed these five files in my three CAT tools.

I found that it is difficult to compare the tools without knowing what an agency's (or translator's) word weighting grid is. So for testing, I just assumed the following hypothetical grid (let me know what your grid is, and I'll recalculate):

99-95% = 20%
94-85% = 35%
84-75% = 50%
74-50% = 75%
0-49% = 100%

The resulting weighted word counts (for the original file) are:

WFP: 84.75%
Trados: 84.05%
MemoQ: 82.13%
(in other words, if you accepted/offered discounts, you'd be getting this percentage of the total amount that you would have gotten if you hadn't accepted/offered discounts, based on the grid mentioned above)
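To make the arithmetic behind such a weighted word count concrete, here is a minimal Python sketch. The grid is the hypothetical one given above; the per-band word counts in `example` are invented for illustration and are not the figures from the test described in this post.

```python
# Hypothetical grid from the post:
# (lowest match %, highest match %, share of the full rate that is paid).
GRID = [
    (95, 99, 0.20),
    (85, 94, 0.35),
    (75, 84, 0.50),
    (50, 74, 0.75),
    (0, 49, 1.00),
]


def weighted_percentage(words_per_band):
    """Return the weighted word count as a percentage of the raw word count.

    words_per_band maps the lower bound of each band (95, 85, 75, 50, 0)
    to the number of words the analysis placed in that band.
    """
    total = sum(words_per_band.values())
    weighted = sum(
        words_per_band.get(low, 0) * rate for low, _high, rate in GRID
    )
    return 100.0 * weighted / total


# Invented analysis of a 5000-word file (NOT the real test data).
example = {95: 100, 85: 200, 75: 300, 50: 900, 0: 3500}
print(round(weighted_percentage(example), 2))
```

With a different grid, only the third value in each `GRID` row needs to change.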

Both WFP and MemoQ analyse fuzzy matching from the start of the file to the end of the file (which means that my five files had five different weighted word counts in WFP and MemoQ), but Trados always first sorts the file by segment length descending before doing the analysis (I deduce/guess this because all five files had exactly the same analysis in Trados, which was identical to WFP's analysis for the file sorted by segment length descending).
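The difference between the two approaches can be sketched in Python. This is only an illustration under stated assumptions: `similarity` is a simple word-level edit-distance score chosen as a stand-in (none of the tools publish their formula), and `internal_analysis_sorted` implements the guess made above about Trados, not confirmed behaviour. The three segments are invented.

```python
def similarity(a, b):
    """Word-level edit-distance similarity in percent.

    A stand-in measure for illustration, not any CAT tool's algorithm.
    """
    x, y = a.split(), b.split()
    if not x and not y:
        return 100.0
    prev = list(range(len(y) + 1))
    for i, wx in enumerate(x, 1):
        cur = [i]
        for j, wy in enumerate(y, 1):
            cur.append(min(prev[j] + 1,                # delete a word
                           cur[j - 1] + 1,             # insert a word
                           prev[j - 1] + (wx != wy)))  # keep or replace
        prev = cur
    return 100.0 * (1 - prev[-1] / max(len(x), len(y)))


def internal_analysis(segments):
    """Scan top to bottom; each segment is matched only against earlier ones."""
    results = []
    for i, seg in enumerate(segments):
        best = max((similarity(seg, earlier) for earlier in segments[:i]),
                   default=0.0)
        results.append((seg, best))
    return results


def internal_analysis_sorted(segments):
    """Guessed variant: sort by word count (descending) before scanning."""
    ordered = sorted(segments, key=lambda s: (-len(s.split()), s))
    return internal_analysis(ordered)


segs = [
    "Press the red button to start the machine",
    "Press the red button",
    "Close the cover",
]
print(internal_analysis(segs))
print(internal_analysis(segs[::-1]))
print(internal_analysis_sorted(segs) == internal_analysis_sorted(segs[::-1]))
```

In the plain scan, the same segment gets a different best score depending on what happens to precede it, whereas the sorted variant returns the same result for any input order.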

MemoQ reported a much, much higher number of 50-74% matches (typically twice as many as WFP and Trados). This can be detrimental for the translator if (a) there are many such matches and (b) the discount for 50-74% matches is too high. However, since the discount for 50-74% matches is usually quite low or non-existent, the fact that MemoQ reported a disproportionately high number of 50-74% matches in my test affected the overall weighted word count by only 2-4 percentage points.

In MemoQ, once you select internal fuzzy matching for one project, it will automatically be selected again for the next project that you create. In Trados, you have to specifically select it for every new project. WFP allows you to do analyses without creating projects (nice!).

In both Trados and WFP, internal fuzzy matches are listed separately in the statistics, but in MemoQ there is no option to list them separately -- internal and external fuzzy matches are not distinguished in the statistics. This means that if a client analysed the files, and the translator then doesn't want to offer discounts for internal fuzzy matches, the client would have to re-analyse the project all over again in MemoQ, but in WFP and Trados the translator could simply inform the client of the fact, and still use the client's original analysis when he creates his invoice.

Samuel


 

Samuel Murray  Identity Verified
Netherlands
Local time: 06:44
Member (2006)
English to Afrikaans
+ ...
TOPIC STARTER
And DVX3 Aug 24, 2015

Samuel Murray wrote:
The resulting weighted word counts (for the original file) are:
WFP: 84.75%
Trados: 84.05%
MemoQ: 82.13%
(in other words, if you accepted/offered discounts, you'd be getting this percentage of the total amount that you would have gotten if you hadn't accepted/offered discounts, based on the grid mentioned above)


DVX3: 85.80%

To get internal fuzzy matches in DVX3, simply go to Project > Analyze and select "Intra-Project Analysis". The five test files all had different internal fuzzy statistics (in other words, it matters in which order the segments occur in the file). As far as I can tell, internal fuzzy matches are not stated separately on the analysis but are included in the external match counts. The setting to count internal fuzzy matches does not "stick" in DVX3 -- you have to reselect it every time you do an analysis, even in the same project.


 

Kevin Dias
Local time: 13:44
SITE STAFF
Great analysis Aug 24, 2015

Hi Samuel,

Thanks for providing this. I am not a translator, but it has always struck me as strange that translators don't demand more transparency in how fuzzy matches/internal fuzzy matches are calculated as your pay can be highly dependent on this calculation.

Of course, the tool makers seem to always counter with "translators wouldn't be able to understand it anyway"; however, I don't think this is true. To this I would say to the tool makers, "If you can't explain it simply, you don't understand it well enough."

I wrote a blog post back in March about some of the points/areas that could cause fuzzy match scores to differ. In my opinion I think the industry should agree on a standardized way to calculate fuzzy matches/internal fuzzy matches.

Kevin


 

Manuel Arcedillo
Spain
Local time: 06:44
English to Spanish
Minimum match value Sep 17, 2015

Hi,

Interesting thread. What values did you use as minimum match thresholds? I think defaults are 60 in memoQ and 70 in Studio, so that may have had an impact on the different leverage reported for the 50-74 band (for which I have never encountered discounts, by the way).

It is strange that you get different leverage depending on segment order. I recall testing this some time ago and found that the fuzzy score of A matching B was the same as B matching A, but I'll try it again in more current environments.
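One possible explanation, offered only as a sketch: even if the score of A against B equals the score of B against A, the weighted word count can still depend on order, because only the segment that comes later is counted as a match, and the two segments need not have the same number of words. The word counts, the score and the threshold below are all invented for illustration.

```python
def discounted_words(word_counts_in_order, score, threshold=75):
    """Words that land in a fuzzy band when two segments match each other.

    The first segment has nothing earlier to match against, so only the
    second segment's words are counted as a fuzzy match.
    """
    _first, second = word_counts_in_order
    return second if score >= threshold else 0


long_seg, short_seg = 12, 10  # hypothetical word counts
score = 80                    # hypothetical score, identical in both directions

print(discounted_words((long_seg, short_seg), score))  # short segment is discounted
print(discounted_words((short_seg, long_seg), score))  # long segment is discounted
```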


 

Samuel Murray  Identity Verified
Netherlands
Local time: 06:44
Member (2006)
English to Afrikaans
+ ...
TOPIC STARTER
@Manuel Sep 17, 2015

Manuel Arcedillo wrote:
What values did you use as minimum match thresholds? I think defaults are 60 in memoQ and 70 in Studio, so that may have had an impact on the different leverage reported for the 50-74 band.


I'm not really sure. I did try at the time to make sure all tools perform as much matching as they can, but perhaps there is an additional setting somewhere in e.g. MemoQ that I was not aware of.

I'm under the impression that in most tools the "minimum match threshold" does not relate to the analysis but rather to whether a match is displayed (or inserted) when you use the editor.


 

John Fossey  Identity Verified
Canada
Local time: 22:44
Member (2008)
French to English
Homogeneity? Sep 17, 2015

MemoQ has a feature "Calculate homogeneity". This has something to do with internal matches and when it is checked on a document with internal matches you will get a lower score than if it is unchecked. I'm not too clear as to how it works, but try it with the feature checked and unchecked. I don't know how it compares with other CAT tools.

[Edited at 2015-09-17 12:28 GMT]


 

Bernhard Sulzer  Identity Verified
United States
Local time: 00:44
English to German
+ ...
Rudimentary fuzzy word matches anyone? Sep 17, 2015

Kevin Dias wrote:

Hi Samuel,

Thanks for providing this. I am not a translator, but it has always struck me as strange that translators don't demand more transparency in how fuzzy matches/internal fuzzy matches are calculated as your pay can be highly dependent on this calculation.

Of course, the tool makers seem to always counter with "translators wouldn't be able to understand it anyway"; however, I don't think this is true. To this I would say to the tool makers, "If you can't explain it simply, you don't understand it well enough.".

I wrote a blog post back in March about some of the points/areas that could cause fuzzy match scores to differ. In my opinion I think the industry should agree on a standardized way to calculate fuzzy matches/internal fuzzy matches.

Kevin


Hi Kevin,

Since this thread has been pushed up again to greater visibility, I just wanted to add something.
These kinds of calculations and calculated results have no great impact on the price I charge. Especially these so-called internal fuzzies. Let me explain.

This analysis and other general fuzzy word analyses will give you a certain insight into how segments or the content of such segments repeat, or, if compared with a TM, they will show that there are certain similarities with previously translated segments of other texts.

But to make this the basis of calculating a price or let agencies dictate to you some arbitrarily reduced payment per certain fuzzy repeat percentages is unacceptable and exploitation.

I am telling others who don't know this that charges for language services must be based on the quality you deliver, more than anything else. Of course you will factor in how long it will take you, but the price or rate you arrive at needs to reflect an amount that enables you to deliver that quality. Meaning, you factor in your skills, your professional and life experience, your knowledge of the fields you work in, and of using CAT tools and TMs, timely delivery and many other considerations to arrive at that price. And that will go into calculating a fair price nevertheless, not an unrealistic high price. But realistic it must be!

Charging per word is really simply a way to express all that goes into delivering a flawless product.
It's not "just" the words, what kind of words, how many words there are, and how many repetitions of text occur (as per a CAT tool analysis) that determine professional prices.

This is important for any translator to understand; clients often have no idea about adequate prices in our profession and unprofessional agencies compete in the low ball section of our industry and are trying to use schemes like internal and other fuzzy word counts to DEMAND discounts for repetitions. But as the number of amateurs increases, the harder they will fight it out on the bottom of our industry. You don't want to be a part of that because in the end, it/you will go nowhere.

Just because we have tools to perform certain (often very insignificant) word analyses doesn't mean we are now calculating our prices based on these analyses.

Just to make it clear: no one should ever let their price/fee depend on fuzzy word analyses.

The analysis isn't equivalent to (and doesn't express) the actual amount of work that needs to be performed;

it is no measure of the quality of the translation;

it doesn't express the knowledge necessary to judge how significant or insignificant that count is;

it is no measure of the experience, the skills and commitment of the translator;

and when a translator uses a CAT tool, he/she doesn't use it to arrive at the lowest price anyone will accept but as part of his/her overall "tool box;" quicker and more accurate delivery, consistency regarding terminology can be but isn't always a benefit reaped by using CAT tools. And if it is, it doesn't follow logically that the translator should charge less for his/her work. The opposite seems more logical. And that goes for faster delivery too.

And that goes for any fields of expertise unless you have 100% repetitions of certain words or numbers in a text that not only are the same in the original text but also in the target text. Just because a certain word occurs 70 times in the source text doesn't mean it will occur exactly the same way 70 times in the target text. And that goes for previous TMs as well. That's just one example of why one cannot base prices on fuzzy word counts.

What's next to demand lower prices? The rudimentary internal fuzzy match?


 

Samuel Murray  Identity Verified
Netherlands
Local time: 06:44
Member (2006)
English to Afrikaans
+ ...
TOPIC STARTER
Yes Sep 17, 2015

Bernhard Sulzer wrote:
These kinds of calculations and calculated results have no great impact on the price I charge.


One can calculate rates in many ways, and one method is to align the rate to the amount of time it will take to do the translation.

Charging per word is really simply a way to express all that goes into delivering a flawless product.


That is true, and giving discounts for fuzzy matching is really a simple way to express the duration variable of "all that goes into" it.

Just because we have tools to perform certain (often very insignificant) word analyses doesn't mean we are now calculating our prices based on these analyses.


I think you will find that it does.

1. The analysis isn't equivalent to (and doesn't express) the actual amount of work that needs to be performed;
2. it is no measure of the quality of the translation;
3. it is no measure of the experience, the skills and commitment of the translator;


All true, yes. But that doesn't mean fuzzy match statistics are completely useless for price determination. It should simply not be the *only* variable.

And that goes for any fields of expertise unless you have 100% repetitions of certain words or numbers in a text that not only are the same in the original text but also in the target text.


Yes, yes, of course, and that is why you should not have one single rate for all types of translation and all types of clients and all types of document formats. However, that does not render the fuzzy match statistics useless for price determination.

Samuel


 

Bernhard Sulzer  Identity Verified
United States
Local time: 00:44
English to German
+ ...
Don't let others dictate discounts based on arbitrary fuzzy matches Sep 17, 2015

Samuel Murray wrote:

Bernhard Sulzer wrote:
These kinds of calculations and calculated results have no great impact on the price I charge.


One can calculate rates in many ways, and one method is to align the rate to the amount of time it will take to do the translation.


The amount of time it will take to do the translation is not directly (or in any way) proportional to the arbitrary value of a match as per a CAT tool. It's arbitrary. What the machine counts as a match isn't a match in the sense that it signifies that in the target text you will encounter the same "match", and it isn't a guarantee that you can use the same word/phrase again that has been used for an idea/concept in a previous TM.

The fuzzy match is NOT an expression of the actual work that goes into translating these "so-called" segment matches. And if a TM is involved, you're dealing with the good or poor quality of it (it could be poor if you use or acquiesce to the demands of an agency to use it). Using a TM without knowing what it's worth or without taking the time to evaluate it wouldn't be professional; I am sure you agree.


Bernhard Sulzer wrote:
Charging per word is really simply a way to express all that goes into delivering a flawless product.


Samuel Murray wrote:
That is true, and giving discounts for fuzzy matching is really a simple way to express the duration variable of "all that goes into" it.


I respectfully disagree. I hold that the fuzzy word match is no measure for the duration of a translation. That's the idea certain agencies use to demand discounts.

Arbitrary percentage values for fuzzy word repetitions by a machine are simply that: completely arbitrary and in no way an expression of how long it will take to translate the text, unless you are talking 100% matches that are 100% matches in the source as well as in the target text.


Bernhard Sulzer wrote:
Just because we have tools to perform certain (often very insignificant) word analyses doesn't mean we are now calculating our prices based on these analyses.


Samuel Murray wrote:
I think you will find that it does.


I don't.

Samuel Murray listing a few of Bernhard's points:
1. The analysis isn't equivalent to (and doesn't express) the actual amount of work that needs to be performed;
2. it is no measure of the quality of the translation;
3. it is no measure of the experience, the skills and commitment of the translator;


Samuel Murray wrote:
All true, yes. But that doesn't mean fuzzy match statistics are completely useless for price determination. It should simply not be the *only* variable.


If you want to use it as a variable, that's your business. And maybe you can, but even so, it should in no way mean right away that you have to give discounts. Speeding up your delivery time and improving consistency of terminology (not necessarily a result of any fuzzy match) should probably be sold at a higher price.

Bernhard Sulzer wrote:
And that goes for any fields of expertise unless you have 100% repetitions of certain words or numbers in a text that not only are the same in the original text but also in the target text.


Samuel Murray wrote:
Yes, yes, of course, and that is why you should not have one single rate for all types of translation and all types of clients and all types of document formats. However, that does not render the fuzzy match statistics useless for price determination.


Never said it had to. But how you use fuzzy matches is your business (not an agency's that will DEMAND it to pay a significantly lower price).

[Edited at 2015-09-17 15:21 GMT]


 

Kevin Dias
Local time: 13:44
SITE STAFF
I don't think fuzzy matches are arbitrary, just not standardized across tools Sep 17, 2015

Hi Bernhard,


Bernhard Sulzer wrote:
Arbitrary percentage values for fuzzy word repetitions by a machine are simply that: completely arbitrary and in no way an expression of how long it will take to translate the text, unless you are talking 100% matches that are 100% matches in the source as well as in the target text.


I don't think fuzzy match calculations are arbitrary (definition of arbitrary: based on random choice or personal whim, rather than any reason or system.). For the most part an 80% match in one tool will probably be +- 10% in other tools. My point is that the calculations are neither standardized nor published and I think they should be.

Regardless I agree with you that translators are in the best position to set their rates (taking into account all that goes into delivering a great final product). For what it is worth - from TM-Town's TOS:
The results of fuzzy match or repetition calculations when comparing your documents against a potential document to be translated are strictly YOUR DATA and will never be shared or disclosed to a client or 3rd party (unless you choose to do so yourself).

Kevin


 

Bernhard Sulzer  Identity Verified
United States
Local time: 00:44
English to German
+ ...
Why fuzzies are arbitrary business Sep 17, 2015

Kevin Dias wrote:

Hi Bernhard,


Bernhard Sulzer wrote:
Arbitrary percentage values for fuzzy word repetitions by a machine are simply that: completely arbitrary and in no way an expression of how long it will take to translate the text, unless you are talking 100% matches that are 100% matches in the source as well as in the target text.



I don't think fuzzy match calculations are arbitrary (definition of arbitrary: based on random choice or personal whim, rather than any reason or system.). For the most part an 80% match in one tool will probably be +- 10% in other tools. My point is that the calculations are neither standardized nor published and I think they should be.



No standardization required, no thank you!

They are arbitrary in two regards:

1) they are no measurement for "translation," they don't justify a certain (arbitrarily defined) discounted rate for the translation of those words - I explained that above. An 80% match of what? And if it is "similar" to something else in the text or a previous text, that still has nothing to do with the price for the translation. You can do analyses all day, I am not going to automatically discount my price to it or change the way I fair and square arrive at the prices I quote.
Statistical analysis = discounted price? That's no valid equation.


2) To call something a 55% match (as to either an overall occurrence within a/as a segment) and then assign to it a discounted percentage word price for translation is, to say the least, absolutely ignorant and in no way justified. In any case, it's certainly arbitrary, especially when it is DEMANDED by an agency, and assigning the same discounted percentages across whatever field, language, .... is a joke. It's unfortunate that so many fall for this.

I guess when something looks "mathematical," many think it must make ultimate sense. Well, this doesn't for me.

It just helps agencies to demand rock bottom prices. Professional agencies don't do that.

If you were a translator you might see things differently. But again, you might not.
Seems we have plenty of people that can't wait to be "discounted" by agencies.


Kevin Dias wrote:
Regardless I agree with you that translators are in the best position to set their rates (taking into account all that goes into delivering a great final product). ...
Kevin


Yes, that's right.


 

Samuel Murray  Identity Verified
Netherlands
Local time: 06:44
Member (2006)
English to Afrikaans
+ ...
TOPIC STARTER
@Bernhard Sep 17, 2015

Bernhard Sulzer wrote:
Kevin Dias wrote:
I don't think fuzzy match calculations are arbitrary (definition of arbitrary: based on random choice or personal whim, rather than any reason or system.).

To call something a 55% match ... and then assign to it a discounted percentage word price for translation is ... certainly arbitrary ...


The fuzzy match calculations are not arbitrary, but: there certainly is a degree of arbitrariness in the setting of discount percentages.

That is usually not a problem, though, because it is transparent (each translator can see exactly what the discount categories are, and decide for themselves if they regard it as fair). In rare cases, agencies apply discounts without explaining to translators how the discounts are calculated, but in general, translators know exactly what the categories are.

It is also unavoidable that setting discount percentages is done with a degree of arbitrariness, primarily because of some things you mentioned: there are so many variables that influence the speed of a translation that one often has to make a best guess to determine how much less time a translation will take, if only fuzzy match statistics are taken into account. It is for this reason that anyone who sets up a discount scheme will do so not after implementing some scientific process but by making a best guess about what is most fair to all parties.

Of course, if you know of a scientific way to determine discount percentages, you're welcome to tell us, but I don't think it exists, and if it does, I doubt if it would be incredibly useful or result in more reliable fuzzy match discount categories.


 

Bernhard Sulzer  Identity Verified
United States
Local time: 00:44
English to German
+ ...
What's an 85% match? Sep 17, 2015

Samuel Murray wrote:

Bernhard Sulzer wrote:
Kevin Dias wrote:
I don't think fuzzy match calculations are arbitrary (definition of arbitrary: based on random choice or personal whim, rather than any reason or system.).

To call something a 55% match ... and then assign to it a discounted percentage word price for translation is ... certainly arbitrary ...


The fuzzy match calculations are not arbitrary, but: there certainly is a degree of arbitrariness in the setting of discount percentages.


How are they not arbitrary? Show me why a certain match is an 85% match. What's the definition of an 85% match?



[Edited at 2015-09-17 20:08 GMT]


 

Samuel Murray  Identity Verified
Netherlands
Local time: 06:44
Member (2006)
English to Afrikaans
+ ...
TOPIC STARTER
A private definition of "arbitrary", perhaps? Sep 17, 2015

Bernhard Sulzer wrote:
How are [fuzzy match calculations] not arbitrary? Show me why a certain match is an 85% match. What's the definition of an 85% match?


Each CAT tool will calculate the matches according to its own proprietary algorithm. That is why the tool whose match algorithm was used should always be mentioned, unless the translator is happy to bank on averages.

But note this: each tool that performs a matching will always yield the same match percentage for the same match, so therefore it can't be arbitrary (unless you're applying a special, private definition of "arbitrary"?).

If it was arbitrary, then the tool would say that a given segment's match against a given translation memory unit is e.g. 85% on one day, and then say it's a 90% on another day. No, these tools do not decide that something is an 85% by whim, but by calculation.
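A small Python sketch of this point, with the caveat that the real algorithms are proprietary: the two scoring functions below are stand-ins built on the standard library's `difflib.SequenceMatcher`, one comparing characters and one comparing whole words. They are not what any CAT tool actually uses, and the two sentences are invented.

```python
from difflib import SequenceMatcher


def char_score(a, b):
    """Stand-in 'tool 1': similarity of the two strings, character by character."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def word_score(a, b):
    """Stand-in 'tool 2': similarity of the two strings, word by word."""
    return round(100 * SequenceMatcher(None, a.split(), b.split()).ratio())


segment = "Open the front cover"
tm_unit = "Open the rear cover"

# Each scorer returns the same value every time it is given the same pair...
print(char_score(segment, tm_unit), char_score(segment, tm_unit))
print(word_score(segment, tm_unit), word_score(segment, tm_unit))
# ...but the two scorers need not agree with each other.
print(char_score(segment, tm_unit) == word_score(segment, tm_unit))
```

Each function is repeatable for a given pair of segments, yet the two disagree with each other, which is why it matters to state which tool produced an analysis.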



[Edited at 2015-09-17 18:36 GMT]


 

Bernhard Sulzer  Identity Verified
United States
Local time: 00:44
English to German
+ ...
More on arbitrariness Sep 17, 2015

Samuel Murray wrote:

Bernhard Sulzer wrote:
How are [fuzzy match calculations] not arbitrary? Show me why a certain match is an 85% match. What's the definition of an 85% match?


Each CAT tool will calculate the matches according to its own proprietary algorithm. That is why the tool whose match algorithm was used should always be mentioned, unless the translator is happy to bank on averages.

But note this: each tool that performs a matching will always yield the same match percentage for the same match, so therefore it can't be arbitrary (unless you're applying a special, private definition of "arbitrary"?).

If it was arbitrary, then the tool would say that a given segment's match against a given translation memory unit is e.g. 85% on one day, and then say it's a 90% on another day. No, these tools do not decide that something is an 85% by whim, but by calculation.



[Edited at 2015-09-17 18:36 GMT]


Why would there always be the same results if the algorithm is different? It isn't as you rightly point out above.
And that one specific CAT tool will always find the same kind of match for exactly the same kind of text? Well, not so sure, but even if ...

Who decides about the algorithm and with what in mind? That an 85% match (of what? I ask again) equates to 15% less work writing this text in the target language? Or that an 85% match of whatever in the source language equates to the same 85% match in the target language? Define "match" - match for what? In terms of value or meaning for translating it or expressing it in the target language?

An 85% match as in both source and target language (I doubt it). Source text only meaning what? That in the target language, any of these occurrences (not words or segments per se but certain strings?!) will be treated the same way as they were found to be in the source language and thus should be assigned the exact same discounted rate - per word, per what word??!!!

Don't have much time right now, but Wladyslaw Janowski makes a few excellent points here - let me refresh your memory:


http://www.proz.com/forum/cat_tools_technical_help/258550-fuzzy_matching_and_consistency_devil’s_inventions_against_translators_translation_itself.html#2218219

And now we get “fuzzy matches”. If even 100% matches, CM’s or PM’s are not really reliable (and it is not rare, that customers wish, we “check” them too), what to think about a purely formal, statistical “match”, which is assessed by some algorithm - but this is not identical with the algorithm of human thinking. If 7 words of 10 in a sentence match another sentence, we get 70% fuzzy. Does this mean, the work, we have to execute on such sentence, is 30% of the work we would have when translating from scratch? If the real context would be negligible (of course is not), we are still not sure, if the rest 70% of the match is correct.

[Edited at 2015-09-17 21:00 GMT]


 