A CAT tool for translators only?
Thread poster: Selcuk Akyuz

Selcuk Akyuz  Identity Verified
Turkey
Local time: 15:26
Member (2006)
English to Turkish
+ ...
Jul 14, 2012

I want a CAT tool that analyses segments and calculates matches only if the segment has at least 10 words.

Not clear?!

Some years ago an agency sent me a TTX project with almost 70% exact matches and 20% fuzzies. But it was an InDesign project, badly segmented everywhere. Most segments were 2 or 3 words, split by a tag, mostly by carriage returns (hard or soft).

It was almost impossible to translate that file without joining segments (extra work, and sometimes you cannot join segments). But of course, in the end, the exact matches were no longer exact matches.

This is not specific to Trados, you can get such files in any CAT tool.


(segment 1) This is not specific
(segment 2) to
(segment 3) Trados,
(segment 4) you can get such files
(segment 5) in any
(segment 6) CAT
(segment 7) tool.



Possibly you have some exact matches for each of the above segments(?), but a segment is not always a sentence and it is a nightmare when you get such a file (not only extra work but also free work).

In localisation you can have 1 or 2-word segments, but CAT tools are not localisation tools (like Passolo or Catalyst). In each CAT tool matches should be calculated (as an option) only if the segment is longer than 10 (or 7) words.

Instead, CAT tools are now calculating even subsegment matches. No, I want a CAT tool with proper analysis and I will buy it even if its price is 1000 EUR.
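
A minimal sketch of what such an analysis option could look like, written in Python. It does not reflect how any existing CAT tool computes its analysis: the `Segment` class, the `analyse` function, the band names and the 75% fuzzy threshold are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Segment:
    source: str
    match_percent: int  # best TM match for this segment, 0-100 (hypothetical field)


def band(match_percent):
    """Map a match percentage to a report band (thresholds are assumed)."""
    if match_percent == 100:
        return "exact"
    if match_percent >= 75:
        return "fuzzy"
    return "no match"


def analyse(segments, min_words=10):
    """Count source words per band; short segments never count as matches."""
    report = Counter()
    for seg in segments:
        words = len(seg.source.split())
        if words < min_words:
            # Too short for a reliable match: reported separately, no discount.
            report["short (no discount)"] += words
        else:
            report[band(seg.match_percent)] += words
    return report
```

With `min_words=10`, a two-word segment such as "in any" would land in the "short (no discount)" row even if the TM returns a 100% match for it.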


 

MikeTrans
Germany
Local time: 14:26
Italian to German
+ ...
Not the perfect solution but... Jul 14, 2012

Hi Selcuk,

I'm still trying to build my perfect CAT tool myself; well, every so often I try, but I give up 10 minutes later... That is to say: the most useful features are spread across various CAT tools, which in the end doesn't give you any advantage.

That said, as I know you're also using DVX2, do the following, but I can't believe you don't know that :)

In DVX2, just filter the segments, use SQL if necessary to get only x words in a sentence; export to external view; open & select all source columns and import as a new document.
Then, analyze.
It's not a perfect solution, I know, it takes time, but if I see a better solution I'll tell you immediately.

As for *any* segmentation problem: get your docs imported as tmx files, this avoids any segmentation problems, it ensures all the codes will remain in the tmx files because DVX2 does a wonderful job with this filter.

I hope this will be at least a quick help for you.

Good luck,
Mike


 

Selcuk Akyuz  Identity Verified
Turkey
Local time: 15:26
Member (2006)
English to Turkish
+ ...
TOPIC STARTER
SQL filter is a solution in DVX but Jul 15, 2012

Hi Mike,

I know about the InStr VBA function which can be used in SQL filters in DVX.

For example (instr(1, source, " ", 1)= 0) displays segments with 1 word only.

To display segments with 10 or fewer words a longer statement is needed:

(instr(1, source, " ", 1)= 0)
OR (instr(instr(1, source, " ", 1)+1, source, " ",1)= 0)
OR (instr(instr(instr(1, source, " ", 1)+1, source, " ",1)+1, source, " ",1)= 0)
OR (instr(instr(instr(instr(1, source, " ", 1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)= 0)
OR (instr(instr(instr(instr(instr(1, source, " ", 1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)= 0)
OR (instr(instr(instr(instr(instr(instr(1, source, " ", 1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)= 0)
OR (instr(instr(instr(instr(instr(instr(instr(1, source, " ", 1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)= 0)
OR (instr(instr(instr(instr(instr(instr(instr(instr(1, source, " ", 1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)= 0)
OR (instr(instr(instr(instr(instr(instr(instr(instr(instr(1, source, " ", 1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)= 0)
OR (instr(instr(instr(instr(instr(instr(instr(instr(instr(instr(1, source, " ", 1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)= 0)
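
The logic of the nested InStr calls can be sketched outside DVX as well. The Python function below is only an illustration of the idea (look for the n-th space and treat its absence as "at most n words"); the name `has_at_most_n_words` is made up, and it is not something DVX itself provides.

```python
def has_at_most_n_words(source, n):
    """Return True when the text contains fewer than n spaces.

    This mirrors the nested InStr filter: each step searches for the next
    space after the previous hit, and a failed search means the segment
    ran out of spaces before the n-th one.
    """
    pos = 0  # 0-based index where the next search starts
    for _ in range(n):
        pos = source.find(" ", pos)
        if pos == -1:
            return True  # no further space found: at most n words
        pos += 1  # continue just after the space that was found
    return False  # an n-th space exists: more than n words
```

Like the filter, this counts spaces rather than words, so double spaces or leading and trailing spaces would be counted as word boundaries too.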


Then I can lock these segments, go to All Except Locked Segments view, change their status to Pending and export these Pending segments longer than 10 words to an External View file. Hide all except the Source column and create a new project for these segments.

Not so difficult for DVX users, but what about other CAT tools?

I normally do not receive projects which require use of a specific CAT tool, so I am lucky. But for those who receive CAT projects, this feature should be included in the analysis.

Matches for segments with fewer than 10 words are not reliable. See http://www.proz.com/forum/sdl_trados_support/225082-the_lord_of_the_rings.html

That is why CAT tool developers should implement a feature to exclude short segments from analysis reports. But then agencies will use another CAT tool :(

MikeTrans wrote:
As for *any* segmentation problem: get your docs imported as tmx files, this avoids any segmentation problems, it ensures all the codes will remain in the tmx files because DVX2 does a wonderful job with this filter.


How can I import a doc file as a tmx file?! Perhaps you can answer that in the DVX forum; it is better to discuss the short-segments issue here.

Selcuk


 

Samuel Murray  Identity Verified
Netherlands
Local time: 14:26
Member (2006)
English to Afrikaans
+ ...
Another idea Jul 15, 2012

Selcuk Akyuz wrote:
I want a CAT tool that analyses segments and calculates matches only if the segment has at least 10 words.


I don't think it would be easy to come up with a method of catching all such instances that you refer to. However, I think segment length is possibly not the best way to do it. How about something that excludes segments that do not start on a start-of-sentence indicator (e.g. a capital letter) and do not end on an end-of-sentence indicator (e.g. full stop, question mark or exclamation mark)?
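
A rough sketch of such a check in Python, just to make the idea concrete. The function names and the list of end-of-sentence marks are assumptions; the default follows the literal reading (a segment is excluded when it lacks both indicators), while `strict=True` excludes a segment that lacks either one.

```python
SENTENCE_END = (".", "?", "!")  # assumed end-of-sentence indicators


def starts_like_sentence(segment):
    """True when the first character is an uppercase letter."""
    text = segment.strip()
    return bool(text) and text[0].isupper()


def ends_like_sentence(segment):
    """True when the segment ends on one of the assumed end marks."""
    return segment.strip().endswith(SENTENCE_END)


def is_fragment(segment, strict=False):
    """Flag segments that do not look like complete sentences."""
    start_ok = starts_like_sentence(segment)
    end_ok = ends_like_sentence(segment)
    if strict:
        return not (start_ok and end_ok)  # missing either indicator
    return not start_ok and not end_ok  # missing both indicators
```

With the literal reading, "to" and "you can get such files" would be excluded, but "Trados," and "tool." would not, so the stricter variant may be closer to what is needed.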


 

Jaroslaw Michalak  Identity Verified
Poland
Local time: 14:26
Member (2004)
English to Polish
MemoQ Jul 15, 2012

In MemoQ you can check the segments against specified character length (not number of words), based on the comment field. However, as you cannot set several comments at the same time, this requires a roundtrip through rtf column export (and Excel to make the filling out easier)...

I am not sure if character length suits your purpose, though.

On the other hand, if you roundtrip through Excel, it should be easy enough to check the number of words there and then filter those segments in MemoQ.


 

Selcuk Akyuz  Identity Verified
Turkey
Local time: 15:26
Member (2006)
English to Turkish
+ ...
TOPIC STARTER
good segmentation is another issue Jul 15, 2012

Hi Samuel,

Are there punctuation marks in all languages, and what about capital letters? On the other hand, a word is not always a word in all languages. Japanese, I think, is a good example of both.

Words or characters?
The number of words in a sentence differs even in European languages, long words in German, plenty of 'la' and 'le' in a French text, and then there are some agglutinative languages. That is why character (or line) count is preferred in some countries.

Some CAT tools, e.g. MemoQ, can sort segments based on character length. I think any CAT tool can make a filter to display segments with fewer than n words.

But is it sufficient? No! This should also be implemented (as a standard) in analysis reports. A Trados, MemoQ or Deja Vu analysis should not consider short segments because they are not reliable!

A simple one-word segment, e.g. 'parts' can have several translations in French (and in all languages). How can we consider it an exact match?


 

Selcuk Akyuz  Identity Verified
Turkey
Local time: 15:26
Member (2006)
English to Turkish
+ ...
TOPIC STARTER
MemoQ Jul 15, 2012

Hi Jarosław,

I know how to sort segments based on number of words in MemoQ. I can lock them and make an analysis not including locked segments. But is there a setting in the analysis window to exclude segments shorter than n words?

There are solutions to exclude them in every CAT tool but are we translators aware of the useless exact and fuzzy short segments?


 

Heinrich Pesch  Identity Verified
Finland
Local time: 15:26
Member (2003)
Finnish to German
+ ...
Translate in Word Jul 15, 2012

If the document is full of tags you must request a text in Word and translate using Wordfast or the like. After translation someone else may reformat the text as they like. I would not count any matches at all but bill according to the Word wordcount.

[Edited at 2012-07-15 23:19 GMT]


 

Selcuk Akyuz  Identity Verified
Turkey
Local time: 15:26
Member (2006)
English to Turkish
+ ...
TOPIC STARTER
in real world there are discounts Jul 15, 2012

Heinrich Pesch wrote:

If the document is full of tags you must request a text in Word and translate using Wordfast or the like. After translation someone else may reformat the text as they like. I would not count any matches at all but bill according to the Word wordcount.


Hi Heinrich,

The problem is not tags or any specific CAT tool or document types here. And I generally work for direct clients who do not ask for discounts for repetitions.

But many translators get TMs from agencies and are asked for discounts, right?

Translators have failed to unite against such requests, but at least they should ask that short segments not be included in match analyses. And this could be achieved only if supported by CAT tool developers.

So simple, make a setting in CAT A or B so that (discount) analysis will exclude segments shorter than n words.


 

MikeTrans
Germany
Local time: 14:26
Italian to German
+ ...
answer about SQL + segmentation Jul 15, 2012

Selcuk Akyuz wrote:

To display segments with 10 or fewer words a longer statement is needed:
... (long SQL statement taken out)...


To express an SQL statement which contains x words, you don't focus on the words, but on the spaces separating the words, so it should look like this:

Sentence like "* * * *" etc...
This will give you 3 word sentences. If you want any numbers or non-words excluded, this should work:

((Sentence not like "[^a-z]*[0-9]*") OR (Sentence not like "[0-9]*[^a-z]*")) AND Sentence like "* * * *"

This will work for 3 words; for x words add asterisks followed by spaces in the last Like statement ending with an asterisk.


How can I import a doc file as a tmx file?! Perhaps you can answer that in the DVX forum; it is better to discuss the short-segments issue here.


With your consent I will copy my answer to the Yahoo group. I do the following, based on FarkasAndras's advice in LF Aligner:

- Import your doc in DVX2; select all rows and F5 (copy source to target)
- Create a new TM and send file to TM
- Export new TM as tmx
- Reimport tmx as file to translate in your project

This has several advantages, especially reducing codes, and you can handle ALL sorts of documents with it, being also able to join/split at will.
This is particularly helpful when you have to deliver in legacy CAT tool formats: use the same procedure, but instead of importing into DVX2 in the first place, import, copy source to target, and pretranslate in the CAT tool you need for delivery in order to build a tmx file.

Mike


[Edited at 2012-07-15 21:04 GMT]


 

Selcuk Akyuz  Identity Verified
Turkey
Local time: 15:26
Member (2006)
English to Turkish
+ ...
TOPIC STARTER
SQL filter Jul 15, 2012

MikeTrans wrote:

To express an SQL statement which contains x words, you don't focus on the words, but on the spaces separating the words, so it should look like this:

Sentence like "* * * *" etc...
This will give you 3 word sentences.


Hi Mike,

In fact that was the first SQL filter I tried, but there was a mistake in my filter: the NOT was forgotten. And in your example it should be "Source"; "Sentence" is used in TMs, not in project files.

So it should be Source NOT LIKE "* * * *"

Otherwise, if I don't add the NOT operator, even a segment with 20 words will be displayed because "* * *" exists in a long segment as well.
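
The effect of the wildcards can be checked outside DVX. The sketch below uses Python's `fnmatch` module as a rough stand-in for LIKE with `*` wildcards; that equivalence is an assumption, and the helper names `like` and `at_most_n_words_pattern` are made up for the example.

```python
from fnmatch import fnmatchcase


def like(text, pattern):
    """Rough stand-in for LIKE with '*' wildcards (assumed equivalent)."""
    return fnmatchcase(text, pattern)


def at_most_n_words_pattern(n):
    """Build a pattern containing n spaces, e.g. n=3 gives '* * * *'.

    Used with NOT LIKE, it keeps segments that have fewer than n spaces,
    i.e. at most n words.
    """
    return " ".join(["*"] * (n + 1))


long_segment = " ".join(["word"] * 20)

# Without NOT, any segment with at least three spaces matches the pattern.
print(like(long_segment, "* * * *"))

# With NOT, only segments with fewer than three spaces remain.
print(not like("in any", at_most_n_words_pattern(3)))
```

Because each `*` may also match nothing, the pattern really tests for the number of spaces, which is why a 20-word segment matches it as well.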

Enough SQL for today :)

MikeTrans wrote:
I do the following, based on FarkasAndras's advice in LF Aligner:

- Import your doc in DVX2; select all rows and F5 (copy source to target)
- Create a new TM and send file to TM
- Export new TM as tmx
- Reimport tmx as file to translate in your project

This has several advantages, especially reducing codes, and you can handle ALL sorts of documents with it, being also able to join/split at will.
This is particularly helpful when you have to deliver in legacy CAT tool formats: use the same procedure, but instead of importing into DVX2 in the first place, import, copy source to target, and pretranslate in the CAT tool you need for delivery in order to build a tmx file.



But you can't export as a doc file, right? Possibly, after finishing translation of the TMX file, you pretranslate the doc file with the dedicated TM.

The number of codes will not change with this method. You can join/split segments, but then when pretranslating the doc file you will not get exact matches for the joined/split segments.


It seems that CAT tool support personnel are not interested in this topic.


[Edited at 2012-07-15 21:55 GMT]


 

MikeTrans
Germany
Local time: 14:26
Italian to German
+ ...
And the winner, ehem, the right formula is: Jul 15, 2012

Sentence shorter than 10 words:
Equivalent to a sentence which doesn't have at least 10 words:

Source not like "*[a-z]* *[a-z]* *[a-z]* *[a-z]* *[a-z]* *[a-z]* *[a-z]* *[a-z]* *[a-z]* *[a-z]"

For segments ending with a space, just add a space after the last [a-z] , correct and repeat.
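
A pattern like this can be tried out on sample segments before it is used in a project. The sketch below emulates it with Python's `fnmatch`, assuming the filter's LIKE treats `*` and `[a-z]` the way shell-style wildcards do and ignores case; `shown_by_filter` is a made-up helper, not a DVX function.

```python
from fnmatch import fnmatchcase

# Ten letter-containing chunks separated by nine spaces, ending on a letter.
TEN_WORD_PATTERN = " ".join(["*[a-z]*"] * 9 + ["*[a-z]"])


def shown_by_filter(source):
    """True when 'Source NOT LIKE <pattern>' would display the segment."""
    # Lowercasing stands in for a case-insensitive LIKE (an assumption).
    return not fnmatchcase(source.lower(), TEN_WORD_PATTERN)


twelve_words = "one two three four five six seven eight nine ten eleven twelve"

print(shown_by_filter("in any"))            # short segment
print(shown_by_filter(twelve_words))        # long segment ending on a letter
print(shown_by_filter(twelve_words + "."))  # long segment ending on a full stop
```

Under these assumptions the pattern has to end on a letter, so a long segment that ends on punctuation would not match it and would still be displayed by the NOT LIKE filter.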

Hope this helps,
Mike

[Edited at 2012-07-15 23:16 GMT]


 

Heinrich Pesch  Identity Verified
Finland
Local time: 15:26
Member (2003)
Finnish to German
+ ...
You are right Jul 15, 2012

Selcuk Akyuz wrote:


The problem is not tags or any specific CAT tool or document types here. And I generally work for direct clients who do not ask for discounts for repetitions.

But many translators get TMs from agencies and are asked for discounts, right?

Translators have failed to unite against such requests, but at least they should ask that short segments not be included in match analyses. And this could be achieved only if supported by CAT tool developers.

So simple, make a setting in CAT A or B so that (discount) analysis will exclude segments shorter than n words.


Very short segments are questionable matches, but often even one-word segments are acceptable 100% matches. I doubt one can draw a reasonable line there.
On the other hand, I often get jobs where most of the less-than-99% matches are in fact 100% matches, only that the author has reformatted the segment or corrected a spelling mistake since the previous edition.
All work should be priced according to the effort it takes to complete it. For rough formatting there must be a penalty.

[Edited at 2012-07-16 05:29 GMT]


 

Selcuk Akyuz  Identity Verified
Turkey
Local time: 15:26
Member (2006)
English to Turkish
+ ...
TOPIC STARTER
I don't need SQL filters :) Jul 15, 2012

Mike, another good try but it fails :) Test it and see for yourself.

I can use the loooong filter with InStr function or the shorter one: Source NOT LIKE "* * * * * * * * * * *" They both work.

I like using or creating SQL filters but what I want here is a tool that can exclude shorter segments from analysis. That is all! And I want it for all CAT users, especially for those who receive CAT projects with exact and fuzzy matches and are expected to give discounts.

Discounts for exact and fuzzy matches are ok but not for short segments.

Selcuk


 

MikeTrans
Germany
Local time: 14:26
Italian to German
+ ...
I see your point now - a little late... Jul 15, 2012

Selcuk,

yes, I understand your argument now. Well, what I do is give my *good* clients a tolerance margin of +/- words in a text if they tell me "the text has 1344 words". But the problem with short segments when counting fuzzy matches is still a tricky one; I agree with Heinrich.

Personally I don't accept discounts for fuzzy matches and for any new client I make it very clear: they have to take out what's not to be translated. This will help the relationship in the future.

Also, where there are rules there are exceptions. I think I once ran into a translation of technical drafts with segments 1-2 words in length: very hard to translate, a lot of research to be done, and I needed to contact the author more than once. But that's part of my job; I cannot ask for higher rates just because I take more time to translate. The rate was appropriate for the subject, but it was a nightmare for me, I remember...

DVX2, Trados Studio, MemoQ etc.: I would not be surprised if all those CATs count the words and output their analyses differently. I think they have just added this feature to keep the translator busy.
I've heard that in Studio 2009 the Analyze function is broken due to a serious bug, I don't know about Studio 2011.

Mike


[Edited at 2012-07-16 00:04 GMT]

[Edited at 2012-07-16 00:11 GMT]


 