High repetition ratio in software translations
Thread poster: John Moran

John Moran  Identity Verified
Ireland
Local time: 14:02
Member (2004)
German to English
Nov 5, 2005

Hi all,

I am working on a job with a very high ratio of repetitions. The segments (sentences, fragments, strings, whatever you want to call them) are also not contiguous, so they do not need to be translated in sequence.

I normally work with Trados, but I did not want to scroll through 100 repetitions of the same segment, so I wrote a program in Java that stores the segments in a MySQL database. I then reindexed the segments so that the files had no repetitions, just new words and fuzzy matches. The program is very rough and ready. I wrote it as a once-off (it took about 80 hours), and I am now wondering if I should package it into a tool, as it really did save me a lot of time (more than 80 hours). This is not trivial and might take a year or so of my time, as I can only afford to work on it a couple of hours a day. At some point I guess it would be nice to see a return on that investment, but first I need to see if it is worth it, which brings me to my question.
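To make the reindexing idea concrete, here is a minimal sketch in Python. It is not the Java/MySQL program described above; the function names `reindex` and `rebuild` and the sample strings are made up for illustration. It collapses the repetitions into a list of unique segments, remembers where each one came from, and puts the translations back in the original order afterwards.

```python
def reindex(segments):
    """Collapse repeated segments (hypothetical helper).

    Returns the unique segments in first-seen order, plus an index map
    that says which unique entry each original segment corresponds to.
    """
    unique = []      # each distinct source segment, once
    position = {}    # source text -> index into `unique`
    index_map = []   # one entry per original segment
    for seg in segments:
        if seg not in position:
            position[seg] = len(unique)
            unique.append(seg)
        index_map.append(position[seg])
    return unique, index_map


def rebuild(translated_unique, index_map):
    """Expand the translated unique segments back to the original order."""
    return [translated_unique[i] for i in index_map]


# Sample data (assumed): five source strings, three of them distinct.
segments = ["Datei öffnen", "Abbrechen", "Datei öffnen", "Speichern", "Abbrechen"]
unique, index_map = reindex(segments)

# The translator only sees the three unique segments.
translations = ["Open file", "Cancel", "Save"]
full = rebuild(translations, index_map)
print(full)
```

The index map is what makes it possible to deliver a file with every repetition filled in, even though each segment was translated only once.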

Are there any tools on the market which do what I just described (reindex resource string files so that there are no repetitions)?

I worked with Catalyst a few years ago and I don't remember it having this feature. I have a vague memory of LocStudio having a feature which put all the repeating segments into a single file, so maybe I could have used that, but I remember it being a very non-ergonomic environment to work in and I wanted to stick to using MSWord and Trados. I also remember it being expensive, and it had a buggy interface to Trados, but I think that has been improved.

I guess if I did decide to go ahead with it, this is what it would say on the box:

Features: Import all common resource string formats: .rc files, CSV, Excel, XML, tab-delimited, etc.

Export into all common CAT tool formats, Trados, Logoport, whatever.

Benefits:

Time: Repetitions are reindexed so that the translator only sees each segment once.

QA: Arbitrary checks can be done on source and target strings, e.g. the number of & characters in the source and target should match, the target should not be longer than the source, etc.

Also, reports can be generated, for example how many segments were corrected by the proofreader for each translator working on the job.
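As a rough illustration of the QA checks and the proofreading report, here is a small Python sketch. The function names, the record layout (translator, draft, proofread version) and the sample data are assumptions for the example, not part of any existing tool.

```python
from collections import Counter


def qa_check(source, target):
    """Return a list of problems found in one source/target pair."""
    problems = []
    # Accelerator keys: the number of & characters should match.
    if source.count("&") != target.count("&"):
        problems.append("ampersand count differs")
    # Length limit: the target should not be longer than the source.
    if len(target) > len(source):
        problems.append("target longer than source")
    return problems


def corrections_per_translator(records):
    """Count segments the proofreader changed, per translator.

    `records` is an iterable of (translator, draft, proofread) tuples.
    Translators with no corrections do not appear in the result.
    """
    counts = Counter()
    for translator, draft, proofread in records:
        if draft != proofread:
            counts[translator] += 1
    return dict(counts)


# Sample data (assumed) for the report.
records = [
    ("anna", "Open file", "Open file"),
    ("anna", "Cancle", "Cancel"),
    ("ben", "Safe", "Save"),
    ("ben", "Halp", "Help"),
]
print(qa_check("&Speichern", "Save"))
print(corrections_per_translator(records))
```

With the segments already in a database, checks like these amount to a loop over the rows, so new rules can be added without touching the translation environment.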

I don't really want to start on something that overlaps too much with an existing product in terms of functionality. I guess the main advantage of the tool is that it would let you work in whatever your favourite tool is (e.g. MSWord/Trados) but you still get the benefits of having the job managed by a database, reindexing, QA, reports etc.

Interestingly, getting rid of the repetitions let me calculate the exact return on investment down to the nearest cent.

I probably will not go ahead with it because I am busy with more immediate work but if there is interest in this I would like to hear it before I ditch the idea.

In particular I would be interested to hear from three groups of people.

A) People who have experience managing or engineering large software localization jobs and specifically jobs where the repetitions were a high ratio of total word count. If a job only has 10% reps this tool does not add much value but if it is 90% it does.

B) Anyone who thinks what I just described overlaps with an existing tool/package.

C) Anyone who thinks they would like to collaborate. I am open to talking about any ideas, even open source or partnership.


Sorry about the long post. Hope it wasn't too techie.


John

p.s. If anyone wants to talk privately, my address is
transpi--ral@--yahoo--.com

(without the minus signs!)



John Moran  Identity Verified
Ireland
Local time: 14:02
Member (2004)
German to English
TOPIC STARTER
RC WinTrans Nov 5, 2005

Hi again!

Replying to my own post. Classy. Just had a look at the latest version of RCWinTrans. It seems to have a lot of what I described.

I guess the main difference is that whoever is working with RCWinTrans has to work out how to use it, whereas with my tool the other translators on the job only had to use MSWord and Trados. But it definitely overlaps.

Cheers,

John



Jabberwock  Identity Verified
Poland
Local time: 15:02
Member (2004)
English to Polish
Trados? Nov 5, 2005

You can analyze the files to be translated with Trados and export only unknown segments to a txt or rtf file. Then you have a file with "clean" segments, without external untranslated segments and repetitions. I use that technique a lot, in fact.


PCovs
Denmark
Local time: 15:02
Member (2003)
English to Danish
What about the outcome? Nov 5, 2005

I'm sorry to ask such a stupid question, but when you "export" these non-translated segments and you then translate in this new document, what will the document for delivery look like?

How do you deliver a document that looks like the original document (only translated) containing all repetitions etc. translated to the client?



Harry Bornemann  Identity Verified
Mexico
English to German
Access & Perl? Nov 5, 2005

To get rid of too many repetitions I usually run an Access query with the "group by" function - very quick and simple.
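For anyone without Access, the same "group by" idea can be sketched in Python with the built-in sqlite3 module. The table and column names and the sample strings are assumptions for the example; SQLite stands in for Access here.

```python
import sqlite3

# In-memory database with one row per segment (table layout assumed).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE segments (id INTEGER PRIMARY KEY, source TEXT)")
sample = ["Open file", "Cancel", "Open file", "Save", "Cancel"]
conn.executemany("INSERT INTO segments (source) VALUES (?)", [(s,) for s in sample])

# GROUP BY collapses identical segments; COUNT(*) shows how often each occurs.
# ORDER BY MIN(id) keeps the segments in first-seen order.
rows = conn.execute(
    "SELECT source, COUNT(*) AS occurrences "
    "FROM segments GROUP BY source ORDER BY MIN(id)"
).fetchall()

for source, occurrences in rows:
    print(occurrences, source)
conn.close()
```

The occurrence count is a handy by-product: it shows at a glance which segments account for most of the repetitions.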

To import/export between various formats I use Perl, which has excellent parsing functionality ("regular expressions"). I think it would be difficult to keep up to date with all of the formats of the newest versions of the most common CAT tools (have you considered e.g. POT files? - they were used in my second-largest project), so I write a new Perl script for every new project needing some text conversion.
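To give a flavour of this kind of throwaway conversion script, here is a short sketch using regular expressions, written in Python rather than Perl. It is deliberately simplified and assumes single-line `msgid`/`msgstr` pairs; it does not handle multi-line strings, escapes or plural forms, so it is not a full POT/PO parser. The function name and the sample text are made up.

```python
import re

# One entry = a msgid line immediately followed by a msgstr line.
ENTRY = re.compile(
    r'^msgid "(?P<source>.*)"\nmsgstr "(?P<target>.*)"$',
    re.MULTILINE,
)


def parse_po(text):
    """Return (source, target) pairs, skipping entries with an empty msgid."""
    return [
        (m.group("source"), m.group("target"))
        for m in ENTRY.finditer(text)
        if m.group("source")
    ]


# Sample input (assumed), in the style of a gettext template.
sample = '''#: main.c:12
msgid "Open file"
msgstr ""

#: main.c:40
msgid "Cancel"
msgstr ""
'''

print(parse_po(sample))
```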



Jabberwock  Identity Verified
Poland
Local time: 15:02
Member (2004)
English to Polish
It's a good question... Nov 5, 2005

PCovs wrote:

I'm sorry to ask such a stupid question, but when you "export" these non-translated segments and you then translate in this new document, what will the document for delivery look like?

How do you deliver a document that looks like the original document (only translated) containing all repetitions etc. translated to the client?


I did not explain the procedure in detail...

1. I convert the original files with the appropriate tools to a format which can be processed by Trados. This still can be ugly - lots of strings not to be translated (but visible), repetitions, tags, etc.

2. I analyze and export the simple segments to rtf.

3. I translate the temporary file so that I have all the segments needed in the TM (the file itself is not used - it is just a "tool" to get the translation in TM).

4. I translate the converted source file automatically with Trados.

5. I convert the translated file back to the original format.
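Steps 3 and 4 can be sketched roughly as follows in Python. This only illustrates the principle (fill the TM from the unique segments, then pretranslate the full file by exact match); it is not how Trados works internally, and the function name, the dictionary used as a TM and the sample strings are assumptions.

```python
def pretranslate(segments, tm):
    """Translate every segment that has an exact match in the TM.

    Returns the pretranslated segments plus a list of the segments that
    had no match and were left in the source language.
    """
    output = []
    untranslated = []
    for seg in segments:
        if seg in tm:
            output.append(tm[seg])
        else:
            output.append(seg)        # leave the source text in place
            untranslated.append(seg)
    return output, untranslated


# Step 3: the TM after translating the temporary file (sample data assumed).
tm = {"Open file": "Otwórz plik", "Cancel": "Anuluj", "Save": "Zapisz"}

# Step 4: the converted source file, repetitions included.
segments = ["Open file", "Cancel", "Open file", "Help"]
output, untranslated = pretranslate(segments, tm)
print(output)
print(untranslated)
```

The list of untranslated segments is a reminder that the automatic pass still has to be checked afterwards.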


Trados is not perfect, unfortunately, so the automatic translation needs to be checked. I don't mind, as I still would check the final file anyway.

Of course, if you use SDLX etc. the result is quite similar, but I am most comfortable with Word.



PatriziaM.  Identity Verified
Italy
Local time: 15:02
English to Italian
DVX populates automatically Nov 6, 2005

John Moran wrote:

I normally work with Trados but I did not want to have to scroll through 100 repetitions of the same segment so I wrote a program in Java to store the segments in a MySQL database and then I reindexed the segments so that the files had no repetitions, just new words and fuzzy matches.


Hi!
Do you know Déjà Vu X? It includes a function that automatically populates all repetitions (that is, Déjà Vu inserts the translation automatically) once you have translated the first one manually. It seems very similar to what you describe, doesn't it?



Rodolfo Raya  Identity Verified
Local time: 10:02
English to Spanish
Heartsome XLIFF Editor also auto-propagates repetitions Nov 6, 2005

John Moran wrote:

I normally work with Trados but I did not want to have to scroll through 100 repetitions of the same segment...

Once you translate a segment in Heartsome XLIFF Editor (see http://www.heartsome.net ) your translation is automatically copied to all identical segments. Fuzzy matches are also automatically added to segments with a similarity above a user-selected threshold.
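As a rough illustration of matching above a threshold, here is a Python sketch using `difflib.SequenceMatcher` from the standard library. The similarity measure, the 0.75 default threshold, the function name and the sample strings are assumptions for the example; the similarity score a CAT tool actually computes will differ.

```python
from difflib import SequenceMatcher


def best_fuzzy_match(segment, tm, threshold=0.75):
    """Return (tm_source, tm_target, score) for the closest TM entry,
    or None if nothing reaches the threshold."""
    best_source, best_score = None, 0.0
    for source in tm:
        # ratio() is 1.0 for identical strings and 0.0 for no overlap.
        score = SequenceMatcher(None, segment, source).ratio()
        if score > best_score:
            best_source, best_score = source, score
    if best_source is not None and best_score >= threshold:
        return best_source, tm[best_source], best_score
    return None


# Sample TM (assumed).
tm = {"Open the file": "Abrir el archivo"}

print(best_fuzzy_match("Open the file", tm))    # exact match, score 1.0
print(best_fuzzy_match("Open the files", tm))   # close match above threshold
print(best_fuzzy_match("Cancel", tm))           # too different, no match
```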

The statistics provided by the XLIFF Editor already account for repetitions, so you don't need to alter the source document to prepare a quote.

It doesn't matter if you have 90% repetitions. CAT tools are designed to save time and reuse data from the TM, especially exact matches.

Regards,
Rodolfo



PCovs
Denmark
Local time: 15:02
Member (2003)
English to Danish
I see, but why not simply use 'translate to fuzzy'? Nov 6, 2005

That's what I usually do, but obviously if it's a very large file, it may take some time.

But the extracting etc. also takes time, so I guess it would be a tool to be used only with very large files with a lot of repetitions?



Samuel Murray  Identity Verified
Netherlands
Local time: 15:02
Member (2006)
English to Afrikaans
I think Wordfast has it... Nov 6, 2005

John Moran wrote:
Are there any tools on the market which do what I just described (reindex resource string files so that there are no repetitions)?


Wordfast's extract tool extracts all non-unique segments and saves them in a single file (it also extracts all segments, but those are saved in another file). I'm not sure whether the extraction count is also subject to the TM limit of the unregistered version (but try it anyway).


I don't really want to start on something that overlaps too much with an existing product in terms of functionality.


Allow me to comment on the features, then.


Features: Import all common resource strings formats, .rc files, CSV, Excel, XML, tabbed delimited etc.


Wordfast can do all of this (although some documents need to be tagged). Except perhaps .rc files... I always get confused with .rc and .res -- Wordfast can import the non-binary one of the two.


Export into all common CAT tool formats, Trados, Logoport, whatever.


Wouldn't it be easier to simply export to TMX?


Repetitions are reindexed so that the translator only sees each segment once.


Wordfast can extract segments. Even if it couldn't, it can still perform a dummy auto translation with source=target, and then you just remove duplicates from the TM, and remove all columns except the source text column.


QA: Arbitrary checks can be done on source and target strings, e.g. the number of & characters in the source and target should match...


I *think* you can try to define the ampersand as a placeable in Wordfast, then Wordfast will QA check to see if the number of placeables match.



