https://www.proz.com/forum/localization/38697-high_repetition_ratio_in_software_translations.html

High repetition ratio in software translations
Thread poster: John Moran
John Moran
John Moran  Identity Verified
Ireland
Local time: 21:36
German to English
+ ...
Nov 5, 2005

Hi all,

I am working on a job which has a very high ratio of repetitions and also the segments (sentences, fragments, strings, whatever you want to call them) are not contiguous so they do not need to be translated in sequence.

I normally work with Trados but I did not want to have to scroll through 100 repetitions of the same segment, so I wrote a program in Java to store the segments in a MySQL database and then reindexed the segments so that the files had no repetitions, just new words and fuzzy matches. The program is very rough and ready. I wrote it as a one-off (it took about 80 hours) and I am now wondering if I should package it into a tool, as it really did save me a lot of time (more than 80 hours). This is not trivial and might take a year or so of my time, as I can only afford to work a couple of hours a day on it. At some point I guess it would be nice to see a return on that investment, but first I need to see if it is worth it, which brings me to my question.

Are there any tools on the market which do what I just described (reindex resource string files so that there are no repetitions)?
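
To make "reindexing" concrete, here is a minimal Python sketch of the idea, not the Java/MySQL program described above. It assumes the resource strings have already been read into a plain list; the function name `reindex` and the sample strings are made up for illustration.

```python
def reindex(segments):
    """Map each distinct segment to the positions where it occurs.

    The keys, in insertion order, are the repetition-free list the
    translator sees; the positions allow the file to be rebuilt later.
    """
    index = {}
    for position, text in enumerate(segments):
        index.setdefault(text, []).append(position)
    return index


# Hypothetical sample strings, for illustration only.
strings = ["Open", "Save", "Open", "Cancel", "Save", "Open"]
index = reindex(strings)
unique = list(index)  # dicts keep insertion order in Python 3.7+
```

Here `unique` holds three segments instead of six, and `index` records where each translation has to be written back.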

I worked with Catalyst a few years ago and I don't remember it having this feature. I have a vague memory of LocStudio having a feature which put all the repeating segments into a single file so I maybe could have used that but I remember it being a very non-ergonomic environment to work in and I wanted to stick to using MSWord and Trados. I also remember it being expensive and it had a buggy interface to Trados but I think that has been improved.

I guess if I did decide to go ahead with it, this is what it would say on the box:

Features: Import all common resource string formats: .rc files, CSV, Excel, XML, tab-delimited etc.

Export into all common CAT tool formats, Trados, Logoport, whatever.

Benefits:

Time: Repetitions are reindexed so that the translator only sees each segment once.

QA: Arbitrary checks can be done on source and target strings, e.g. the number of & characters in the source and target should match, the target should not be longer than the source etc. etc.

Also, reports can be generated, for example how many segments were corrected by the proofreader for each translator working on the job.
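
The QA checks and the proofreader report listed above could look roughly like this in Python. It is only a sketch under assumptions: the function names, the wording of the problem messages and the `(translator, translated, proofread)` row layout are invented for the example and are not taken from the actual tool.

```python
from collections import Counter


def qa_check(source, target):
    """Return a list of problems found in one source/target pair."""
    problems = []
    # The number of & characters (hotkey markers) should match.
    if source.count("&") != target.count("&"):
        problems.append("ampersand count differs")
    # The target should not be longer than the source.
    if len(target) > len(source):
        problems.append("target longer than source")
    return problems


def corrections_per_translator(rows):
    """Count the segments the proofreader changed, per translator.

    rows: iterable of (translator, translated_text, proofread_text).
    """
    counts = Counter()
    for translator, translated, proofread in rows:
        if translated != proofread:
            counts[translator] += 1
    return dict(counts)
```

Because every segment appears only once after reindexing, each check and each correction is also counted only once.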

I don't really want to start on something that overlaps too much with an existing product in terms of functionality. I guess the main advantage of the tool is that it would let you work in whatever your favourite tool is (e.g. MSWord/Trados) but you still get the benefits of having the job managed by a database, reindexing, QA, reports etc.

Interestingly, getting rid of the repetitions let me calculate the exact return on investment down to the nearest cent.

I probably will not go ahead with it because I am busy with more immediate work but if there is interest in this I would like to hear it before I ditch the idea.

In particular I would be interested to hear from three groups of people.

A) People who have experience managing or engineering large software localization jobs and specifically jobs where the repetitions were a high ratio of total word count. If a job only has 10% reps this tool does not add much value but if it is 90% it does.

B) Anyone who thinks what I just described overlaps with an existing tool/package.

C) Anyone who thinks they would like to collaborate. I am open to talking about any ideas, even open source or partnership.
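
For anyone who wants to check where a job falls between the 10% and 90% cases mentioned under A), the repetition ratio can be estimated with a few lines of Python. This is a rough sketch that assumes whitespace-separated word counts and counts every occurrence after the first as a repetition; CAT tools may count words differently.

```python
def repetition_ratio(segments):
    """Share of the total word count that sits in repeated segments."""
    seen = set()
    total_words = 0
    repeated_words = 0
    for text in segments:
        words = len(text.split())
        total_words += words
        if text in seen:
            repeated_words += words  # second and later occurrences
        else:
            seen.add(text)
    return repeated_words / total_words if total_words else 0.0
```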


Sorry about the long post. Hope it wasn't too techie


John

p.s. If anyone wants to talk privately, my address is
[email protected]

(without the minus signs!)


 
John Moran
John Moran  Identity Verified
Ireland
Local time: 21:36
German to English
+ ...
TOPIC STARTER
RC WinTrans Nov 5, 2005

Hi again!

Replying to my own post. Classy. Just had a look at the latest version of RCWinTrans. It seems to have a lot of what I described.

I guess the main difference is that whoever is working with RCWinTrans has to work out how to use it, whereas with my tool the other translators on the job only had to use MSWord and Trados, but it definitely overlaps.

Cheers,

John


 
Jaroslaw Michalak
Jaroslaw Michalak  Identity Verified
Poland
Local time: 22:36
Member (2004)
English to Polish
SITE LOCALIZER
Trados? Nov 5, 2005

You can analyze the files to be translated with Trados and export only unknown segments to a txt or rtf file. Then you have a file with "clean" segments, without external untranslated segments and repetitions. I use that technique a lot, in fact.

 
pcovs
pcovs
Denmark
Local time: 22:36
English to Danish
What about the outcome? Nov 5, 2005

I'm sorry to ask such a stupid question, but when you "export" these non-translated segments and you then translate in this new document, what will the document for delivery look like?

How do you deliver a document that looks like the original document (only translated) containing all repetitions etc. translated to the client?


 
Harry Bornemann
Harry Bornemann  Identity Verified
Mexico
Local time: 14:36
English to German
+ ...
Access & Perl? Nov 5, 2005

To get rid of too many repetitions I usually apply an Access query with the "group by" function - very quick and simple.

To import/export between various formats I use Perl, which has excellent parsing functionality ("regular expressions"). I think it would be difficult to keep up to date with all of the formats of the newest versions of the most common CAT tools (have you considered e.g. POT files? - they have been used for my second largest project), so I write a new Perl script for every new project needing some text conversion.
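
As an illustration of the "group by" approach, the same kind of query can be run from Python. Note that this sketch uses SQLite (via the standard `sqlite3` module) as a stand-in for Access, and the table and column names are invented for the example.

```python
import sqlite3


def unique_segments(segments):
    """Return (segment, occurrence count) rows with repetitions collapsed."""
    con = sqlite3.connect(":memory:")  # throwaway in-memory database
    con.execute("CREATE TABLE segments (id INTEGER PRIMARY KEY, source TEXT)")
    con.executemany(
        "INSERT INTO segments (source) VALUES (?)",
        [(text,) for text in segments],
    )
    rows = con.execute(
        "SELECT source, COUNT(*) FROM segments "
        "GROUP BY source ORDER BY MIN(id)"  # keep first-seen order
    ).fetchall()
    con.close()
    return rows
```

The count column is a handy by-product: it shows at a glance which strings account for most of the repetitions.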


 
Jaroslaw Michalak
Jaroslaw Michalak  Identity Verified
Poland
Local time: 22:36
Member (2004)
English to Polish
SITE LOCALIZER
It's a good question... Nov 5, 2005

PCovs wrote:

I'm sorry to ask such a stupid question, but when you "export" these non-translated segments and you then translate in this new document, what will the document for delivery look like?

How do you deliver a document that looks like the original document (only translated) containing all repetitions etc. translated to the client?


I did not explain the procedure in detail...

1. I convert the original files with the appropriate tools to a format which can be processed by Trados. This still can be ugly - lots of strings not to be translated (but visible), repetitions, tags, etc.

2. I analyze and export the simple segments to rtf.

3. I translate the temporary file so that I have all the segments needed in the TM (the file itself is not used - it is just a "tool" to get the translation in TM).

4. I translate the converted source file automatically with Trados.

5. I convert the translated file back to the original format.
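
Steps 2 to 4 can be pictured with a small Python sketch. It is not how Trados works internally: the TM is modelled as a plain dictionary, only exact matches are handled, and the function names and sample strings are hypothetical.

```python
def export_unknown(segments, tm):
    """Step 2: list each segment that is missing from the TM, once."""
    seen = set()
    unknown = []
    for text in segments:
        if text not in tm and text not in seen:
            seen.add(text)
            unknown.append(text)
    return unknown


def pretranslate(segments, tm):
    """Step 4: fill the full file from the TM, repetitions included."""
    # Segments with no TM entry are left in the source language.
    return [tm.get(text, text) for text in segments]


# Hypothetical example data.
tm = {"Cancel": "Anuluj"}
segments = ["Open", "Cancel", "Open", "Save"]

todo = export_unknown(segments, tm)             # the temporary file
tm.update({"Open": "Otwórz", "Save": "Zapisz"}) # step 3: translate it
delivered = pretranslate(segments, tm)
```

The temporary file (`todo`) is thrown away; the delivered file is produced from the original structure, which is why it still contains every repetition in its original place.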


Trados is not perfect, unfortunately, so the automatic translation needs to be checked. I don't mind, as I still would check the final file anyway.

Of course, if you use SDLX etc. the result is quite similar, but I am most comfortable with Word.


 
PatriziaM.
PatriziaM.  Identity Verified
Italy
Local time: 22:36
English to Italian
+ ...
DVX populates automatically Nov 6, 2005

John Moran wrote:

I normally work with Trados but I did not want to have to scroll through 100 repetitions of the same segment so I wrote a program in Java to store the segments in a MySQL database and then I reindexed the segments so that the files had no repetitions, just new words and fuzzy matches.


Hi!
Do you know DejàVuX? It includes a function that automatically populates all repetitions (that is, DejàVu enters the translation automatically) once you have translated the first one manually. It seems very similar to what you describe, doesn't it?


 
Rodolfo Raya
Rodolfo Raya  Identity Verified
Local time: 17:36
English to Spanish
Heartsome XLIFF Editor also auto-propagates repetitions Nov 6, 2005

John Moran wrote:

I normally work with Trados but I did not want to have to scroll through 100 repetitions of the same segment...

Once you translate a segment in Heartsome XLIFF Editor (see http://www.heartsome.net ) your translation is automatically copied to all identical segments. Fuzzy matches are also automatically added to segments with a similarity above a user selected threshold.
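
The propagation behaviour described above can be sketched in Python as follows. This is not Heartsome's implementation: the similarity measure here is `difflib.SequenceMatcher` from the standard library, used as a stand-in, and the function name and default threshold are assumptions.

```python
from difflib import SequenceMatcher


def propagate(segments, source, translation, threshold=0.8):
    """Offer one translation to identical and to similar segments.

    Returns two dicts keyed by segment position: exact matches, which
    can be filled in directly, and fuzzy matches, which are only
    suggestions that the translator still has to edit.
    """
    exact, fuzzy = {}, {}
    for position, text in enumerate(segments):
        if text == source:
            exact[position] = translation
        elif SequenceMatcher(None, source, text).ratio() >= threshold:
            fuzzy[position] = translation
    return exact, fuzzy
```

Keeping exact and fuzzy matches apart matters for QA, since a fuzzy match copied over unedited would be wrong.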

The statistics provided by the XLIFF Editor already account for repetitions, so you don't need to alter the source document to prepare a quote.

It doesn't matter if you have 90% repetitions. CAT tools are designed to save time and reuse data from the TM, especially exact matches.

Regards,
Rodolfo


 
pcovs
pcovs
Denmark
Local time: 22:36
English to Danish
I see, but why not simply use 'translate to fuzzy'? Nov 6, 2005

That's what I usually do, but obviously if it's a very large file, it may take some time.

But the extracting etc. also takes time, so I guess it would be a tool to be used only with very large files with a lot of repetitions?


 
Samuel Murray
Samuel Murray  Identity Verified
Netherlands
Local time: 22:36
Member (2006)
English to Afrikaans
+ ...
I think Wordfast has it... Nov 6, 2005

John Moran wrote:
Are there any tools on the market which do what I just described (reindex resource string files so that there are no repetitions)?


Wordfast's extract tool extracts all non-unique segments and saves them in a single file (it also extracts all segments, but that's saved as another file). I'm not sure if the extraction count is also subject to the non-registration TM limit (but try it anyway).


I don't really want to start on something that overlaps too much with an existing product in terms of functionality.


Allow me to comment on the features, then.


Features: Import all common resource string formats: .rc files, CSV, Excel, XML, tab-delimited etc.


Wordfast can do all of this (although some documents need to be tagged). Except perhaps .rc files... I always get confused with .rc and .res -- Wordfast can import the non-binary one of the two.


Export into all common CAT tool formats, Trados, Logoport, whatever.


Wouldn't it be easier to simply export to TMX?
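
To show how little is needed for a basic TMX export, here is a minimal Python sketch using the standard `xml.etree.ElementTree` module. The header values (tool name, version, `segtype`, `datatype`) are placeholder assumptions, and a real export would also need to handle inline tags and add an XML declaration with the encoding.

```python
import xml.etree.ElementTree as ET

# ElementTree writes this qualified name out as xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def to_tmx(pairs, srclang="en", tgtlang="de"):
    """Serialise (source, target) pairs as a minimal TMX 1.4 document."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "reindex-sketch",  # placeholder tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "none",
        "adminlang": "en",
        "srclang": srclang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        tu = ET.SubElement(body, "tu")  # one translation unit per pair
        for lang, text in ((srclang, source), (tgtlang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")
```

A single exchange format like this avoids writing one exporter per CAT tool, provided the tools in question can import TMX.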


Repetitions are reindexed so that the translator only sees each segment once.


Wordfast can extract segments. Even if it couldn't, it can still perform a dummy auto translation with source=target, and then you just remove duplicates from the TM, and remove all columns except the source text column.


QA: Arbitrary checks can be done on source and target strings, e.g. the number of & characters in the source and target should match...


I *think* you can try to define the ampersand as a placeable in Wordfast, then Wordfast will QA check to see if the number of placeables match.


 

