DOC->RTF->DOC workaround results in bad segments around accented characters
Thread poster: Alan Frankel

Alan Frankel  Identity Verified
United States
Local time: 10:47
German to English
Feb 3, 2009

I'm using a very recent version of Trados 2007 -- the Workbench version is 8.3.0.863. I'm also using Microsoft Office Word 2003 (11.8237.8221) SP3.

I received a set of five DOC files that were probably produced by using optical character recognition on PDF files (since they have some telltale artifacts). When I try to load one into TagEditor, I get a message: "80003: TagEditor is unable to open this document because the file type is not recognized." I have been able to load other DOC files in the past.

I tried all the following, which have been suggested in forums here:

(1) Enable the TradosTag workflow for doc files. (I was already doing this.)
(2) Within Word, search for every section break as ^b , replace it with a space, and save. (There were 6 section breaks.)

So I tried using Word to save the file as an RTF, and then saving it again as a DOC. This indeed made it possible for me to open it with TagEditor. However, there is a big problem: there are tons of "cf" tags. In particular, every accented character in German (ä, ü, ß, etc.) is now wrapped with tags, which, for example, turns the string "Schlagwörter" into this monstrosity:

< cf size="8" complexscriptssize="8" fontcolour="0x0" >Schlagw< /cf >< cf size="8" complexscriptsfont="Times New Roman" complexscriptssize="8" fontcolour="0x0" >ö< /cf >< cf size="8" complexscriptssize="8" fontcolour="0x0" >rter: < /cf >< cf size="8" complexscriptsfont="Times New Roman" complexscriptssize="8" fontcolour="0x0" >

(I had to put spaces around the angle brackets to make it display correctly here.)

All those superfluous tags make translation difficult. I can't come up with an automated way to remove them. TagEditor's Find/Replace is no good because I can't get it to use wildcards AND search in tags. (Furthermore, it doesn't let you do a "Replace this one", then move to the next instance and do another replace. You can either do a single replace or a "Replace all". And it won't let you keep the focus in the document and use F3 to move to the next instance of a phrase. Very poor. Do they actually have people do usability studies on this product?)

It looks like it's TagEditor that introduces those nasty tags. When I look at the RTF file, or the RTF-to-DOC file, in a text editor, there are no strange characters or other weirdness around the accented characters.

I don't have any wacky segmentation settings that I can find. But perhaps something is telling TagEditor that the documents are supposed to be in English, so it's treating the accented characters as punctuation? That would be pretty silly, but I suppose it could happen.

I could not find anything about this problem in the forums here or in the SDL knowledge base.

Any feedback would be very welcome. I was impressed by the quick response to my previous problem report.



Alan Frankel  Identity Verified
United States
Local time: 10:47
German to English
TOPIC STARTER
Failed at trying to "escape" the accented characters Feb 3, 2009

I had the idea that perhaps I could "escape" the accented characters by converting "ä" to "aumlaut", "ö" to "oumlaut", etc. in the original Word document. Then I could convert it to RTF, and then again to DOC, and then open it in the TagEditor. Finally, I could convert "aumlaut" back to "ä", etc.
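A minimal Python sketch of that placeholder round trip, applied to plain text rather than to a Word document. Only "aumlaut" and "oumlaut" come from the post; the other placeholder names and both function names are assumptions made for illustration. Note that any real word containing one of the placeholder strings would be corrupted on the way back.

```python
# Placeholder names: "aumlaut" and "oumlaut" follow the post,
# the remaining ones are invented for this sketch.
PLACEHOLDERS = {
    "ä": "aumlaut", "ö": "oumlaut", "ü": "uumlaut",
    "Ä": "Aumlaut", "Ö": "Oumlaut", "Ü": "Uumlaut",
    "ß": "eszett",
}


def escape_accents(text):
    """Replace each accented character with its ASCII placeholder."""
    for char, name in PLACEHOLDERS.items():
        text = text.replace(char, name)
    return text


def restore_accents(text):
    """Turn the placeholders back into the accented characters."""
    for char, name in PLACEHOLDERS.items():
        text = text.replace(name, char)
    return text


if __name__ == "__main__":
    escaped = escape_accents("Schlagwörter")
    print(escaped)
    print(restore_accents(escaped))
```

This only illustrates the text substitution itself; it says nothing about how Word stores the formatting of the replaced characters.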

But strangely enough, when I opened the file in TagEditor, I now saw those cf tags surrounding "aumlaut", "oumlaut", etc., just as they had surrounded "ä", "ö", etc., before. *#$^! I suppose Word must store changes in some more sophisticated way than I had expected. Rather than wiping out the string buffer and replacing it with new contents, it probably stores something like an index to the original and a bunch of "minibuffers" with the new strings.

I also thought of editing the TTX file in a text editor to handle just the cases where the cf tags wrap a single accented character, but that looks very tricky and risky.
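For what it's worth, here is a rough Python sketch of what such an edit could look like. It works on the tag string as displayed in the first post (with the spaces around the angle brackets removed), not on a real TTX file, whose actual layout may differ. It drops the cf pair around a single accented character only when the runs before and after it carry an identical opening tag. The function name and the set of characters are assumptions.

```python
import re

# <cf A>text</cf><cf B>ö</cf><cf A>  ->  <cf A>textö
# The backreference \1 requires the run after the accented character to
# open with exactly the same tag as the run before it.
ACCENT_RUN = re.compile(
    r"(<cf [^>]*>)([^<]*)</cf><cf [^>]*>([äöüÄÖÜß])</cf>\1"
)


def merge_accent_runs(markup):
    """Fold single accented characters back into the surrounding cf run."""
    while True:
        # One pass can miss a second accented character in the same word,
        # because the opening tag it needs was consumed by the first match,
        # so repeat until nothing changes.
        markup, count = ACCENT_RUN.subn(r"\1\2\3", markup)
        if count == 0:
            return markup


if __name__ == "__main__":
    sample = (
        '<cf size="8">Schlagw</cf>'
        '<cf size="8" complexscriptsfont="Times New Roman">ö</cf>'
        '<cf size="8">rter: </cf>'
    )
    print(merge_accent_runs(sample))
```

Runs whose surrounding formatting differs are left untouched, which keeps the edit conservative. Working on a copy of the file is still advisable.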

So I'm still looking for suggestions...



Vito Smolej
Germany
Local time: 16:47
Member (2004)
English to Slovenian
+ ...
why dont you cut and paste... Feb 3, 2009

... the contents into a simple straightforward (UTF8) text file? Seems like all the formatting characters and tags (which you may lose going this way) are more of a nuisance than of any real use.


Alan Frankel  Identity Verified
United States
Local time: 10:47
German to English
TOPIC STARTER
I want to preserve the tags. Feb 3, 2009

I realize that I could get rid of all the tags easily enough by converting to a text file. However, I want to preserve the formatting, particularly the tables, since it takes a lot of effort to insert a table and then move strings into the proper place. I'd also like to be able to preserve the paragraph styles (headings, etc.), though those are easier to reinsert.


ViktoriaG  Identity Verified
Canada
Local time: 10:47
English to French
+ ...
Can't help - but here's a thought Feb 3, 2009

I can't help you fix your problem, and I sympathize with your woes. However, I think there is something you need to be reminded of.

As translators, we are not supposed to be doing DTP, layout, etc. We are supposed to only translate. Of course, we want to make our clients happy, so we sometimes do touch the formatting and remove some bugs. But we can only go so far. If the source text you are working on is totally crappy, the client can't expect you to magically fix that at no extra charge. Think about it this way: by how much would you raise your per word rate if all jobs were like this one?

One way you can go about this is to ask the client whether the source document is indeed the result of OCR. If it is, ask them if they have the original file. I may be able to help create a healthy Word or RTF copy for you - I am a fierce OCR user. However, I recommend you also ask for a bit of monetary compensation for this. Otherwise, the client may expect you to go through hell again, at no extra charge.

I also would ask the client if it really matters that much that you keep all of the formatting (which is nonexistent, really, since your document seems to be pretty badly handicapped). If keeping the formatting wasn't explicitly included in your agreement, then I would advise the client that, due to very bad formatting, some of the formatting will be lost in the process, and then I would concentrate on translating and skip the formatting if it's in the way.

[Edited at 2009-02-03 18:10 GMT]



Alan Frankel  Identity Verified
United States
Local time: 10:47
German to English
TOPIC STARTER
Special circumstances explain why I want to keep the formatting Feb 3, 2009

Viktoria, you bring up very good points.

The circumstances are a little unusual in this case for the following reasons:

(1) I told my contact at the translation agency that I would charge more if I had to work directly from PDF files. So she got me some Word files from the client. I noticed a few OCR artifacts in those DOC files, but not many. In particular, they contained genuine tables. (I had found that when I did my own OCR conversion on the PDF files, the tables had tended to mess things up.) So, because the translation agency contact made a special effort to oblige me, I would like to oblige her if possible.

(2) I bought Trados recently and am investing a lot of time in learning how to use it, how to research the problems I encounter, etc.

(3) I have a bit of extra time this week and would like to get these kinks out of the way now. I figure that the time may come when someone hands me some DOC files in the future that cause TagEditor to choke as it's doing now. It would be nice to know how to get around the problem.

Also, the extra formatting in DOC files can make it easier for me to read, and thus work with, the source.

Having said all that, unless I find a solution soon, I do feel that I can tell my contact that the DOC files are too corrupt for my translation tool and I will have to hand over files where the formatting is missing.

Thanks for the feedback.



Alan Frankel  Identity Verified
United States
Local time: 10:47
German to English
TOPIC STARTER
Other ways I've tried to clean up Feb 3, 2009

I've tried to clean up the file by choosing the "repair" option from Word. Didn't work. Also tried "repairing" Word itself from the "Help" menu ("Detect and Repair"). But I didn't think Word was broken, and indeed selecting "Detect and Repair" didn't change its behavior.


ViktoriaG  Identity Verified
Canada
Local time: 10:47
English to French
+ ...
The dangers of charging different rates for different formats Feb 3, 2009

I totally see your point, Alan. There are two issues, however.

Firstly, I don't think your contact sent you Word files to make your life easier. I fear they only made the extra effort to save some money - and now, they are saving that money at your expense. In such cases, I don't feel that you should feel obliged to your contact.

I totally agree with you that charging more to translate PDF files (especially if the formatting is to be reproduced) is just and fair. I in fact charge more for this as well. However, the client took your word for it - and now it's backfiring. The next time you discuss rate differences between file formats, I suggest making it clear that poor OCR output is not part of what you call editable file formats. In other words, you should be charging more for PDFs AND for poorly formatted editable documents. Poor OCR is the same as translating a PDF. Oh wait! It is actually worse than translating a PDF.

This sure is turning into a great occasion to learn for you. Just make sure you don't overdo it. As you say, you may be training yourself now for eventual poorly formatted documents - but you shouldn't have to ever worry about poorly formatted documents as a translator. That would be helping your clients to save on DTP and get it done at no charge by you instead of contracting the work to a professional. Not to mention that it is a slippery slope - if your DTP is not top notch, you may end up having non payment issues.

Sorry if I could not be of much help. I hope you will get some sort of benefit out of fiddling with this problem.

All the best!

[Edited at 2009-02-03 06:10 GMT]



Ralf Lemster  Identity Verified
Germany
Local time: 16:47
English to German
+ ...
TagEditor reflects the formatting of the Word file Feb 3, 2009

Hi Alan,
It looks like it's TagEditor that introduces those nasty tags. When I look at the RTF file, or the RTF-to-DOC file, in a text editor, there are no strange characters or other weirdness around the accented characters.

Have a close look at the font and formatting settings for those characters - looks like the OCR software used a different font (or font setting) for the special characters. TagEditor reflects these as tags.


I also thought of editing the TTX file in a text editor to handle just the cases where the cf tags wrap a single accented character, but that looks very tricky and risky.

Not necessarily, provided you remove the right pair of tags in each case. TagEditor is XML-based, and XML does not tolerate tag errors - for instance, if you remove the opening tag, but not the corresponding closing tag (/cf), there will be an error. You could try removing the tag pairs in TagEditor (if you do, test the file by saving as target after having translated one or two segments).
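As a quick safety net after any manual edit, a short Python check can confirm that the file is still well-formed XML before it is opened in TagEditor again. This is only a sketch: the function name and the sample markup are assumptions, and a well-formed file is not necessarily one that TagEditor will accept.

```python
import xml.etree.ElementTree as ET


def is_well_formed(data):
    """Return (True, "") if the bytes parse as XML, else (False, reason)."""
    try:
        ET.fromstring(data)
    except ET.ParseError as err:
        # The message includes the line and column of the first problem.
        return False, str(err)
    return True, ""


if __name__ == "__main__":
    good = b"<doc><cf>Schlagw</cf></doc>"
    bad = b"<doc>Schlagw</cf></doc>"  # opening tag removed, closing tag left
    print(is_well_formed(good))
    print(is_well_formed(bad))
```

To check an edited file, read it in binary mode (for example `open(path, "rb").read()`) and pass the bytes in, so that the parser can work out the encoding itself.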

HTH, Ralf



KSL Berlin  Identity Verified
Portugal
Local time: 15:47
Member (2003)
German to English
+ ...
Try these too Feb 3, 2009

Look at the "How To" tab of my profile. You'll find a link for "post-OCR workflow" where you can download some instructions that may help clean things up. Or you might try downloading and installing OpenOffice, then opening the DOC files with it and resaving them as DOC. Another good tool, which often works better than the aforementioned suggestions, is Dave Turner's CodeZapper macro collection. It is freeware and can be downloaded from the appropriate area of the dejavu-l list of Yahoogroups. It was developed to cope with a lot of tag garbage in RTF/DOC files in Déjà Vu, but it works just as well for the same issues in TagEditor.


Alan Frankel  Identity Verified
United States
Local time: 10:47
German to English
TOPIC STARTER
Responses to post-OCR suggestions Feb 3, 2009

Wow, great responses!

Viktoria, your assessment is absolutely right. You helped me clarify what it is that the client was doing and what it is that I have a right to expect.

Ralf, I realize that I'm able to remove the tags manually in TagEditor, but I can't find a good way to automate it. (Is there?) I tried editing the TTX file in TextPad (a text editor), and did finally come up with a regular expression that was able to work on the first instance of an "ö" that was actually wrapped with two tags on both sides -- but it didn't work on the other ones, and I saw that there were "df" elements present in the latter that were not present in the first. Developing a set of more robust regular expressions that would do the job would be an interesting challenge, but I'm going to leave it for later. Your observation that the OCR software is using a different font for the special characters is a good insight. I hadn't been thinking about it from a top level like that. By the way, Ralf, if you ever communicate with SDL, tell them that Trados needs a much better search-and-replace function, and there's no excuse for their not having it. I'm sure they could buy the technology for a cheap price if they can't develop it in house. There is no excuse for a software package that costs $1000 to force you to have to jump into and out of the search box to move from one found instance of the search string to the next. Or not to allow you to use regular expressions (or even their watered-down "wildcards") in combination with searching through tag content. Or to have option checkboxes (such as the "search in tag content" checkbox) that disappear altogether (not simply being grayed out) and are hard to figure out how to restore. I would give this as feedback to SDL myself if I thought they would listen to me.

Kevin, there is a ton of useful how-to information on your profile! I already learned something important, which is that it is possible to keep table formatting while removing other formatting, which is what I'd like to do.

But for now, what I'm going to do is tell the client that I'm going to produce files without the extra formatting. If I have extra time afterwards, I'll experiment. But that will be for my benefit. If I happen to come up with a solution quickly, I'll share the formatted documents with the client.

This is quite a community! Thank you all!



Alan Frankel  Identity Verified
United States
Local time: 10:47
German to English
TOPIC STARTER
Found an acceptable workaround Feb 3, 2009

To recapitulate for those just entering the thread: I was having a problem with DOC files produced by an optical character recognition process that used a different font to represent accented characters. I wanted to preserve table formatting, but I was prepared to lose other forms of formatting (bold, font size, etc.) if keeping that information would have made editing within TagEditor impractical.

I found a workaround that was acceptable to me. Indeed, it loses the character formatting, but none of the table formatting. Most importantly, it doesn't cause the accented characters to end up as separate segments.

I open the DOC files in OpenOffice Writer (a free Word-like tool available from http://openoffice.org , for those who are not familiar with it). I apply default formatting (first item in the Format menu) to the document, optionally replacing each tab with a single space (to avoid segmentation problems at a later stage). Then I save it as "HTML document (OpenOffice Writer)", giving it an HTML extension. Then I open it in TagEditor. After I finish translating, I save the target as an HTML file. (There's no other choice.) Finally, I open the HTML file in Word and save it as a DOC file.

So what happens if I deviate from this procedure?

If I don't apply default formatting, the file can still be opened in TagEditor, but every line has font tags, which are a pain to work with. As far as I can tell, I need to copy them individually to every line, and then they interfere with terminology recognition in the translation memory.

If I try to save the file from OpenOffice Writer as an RTF file instead of an HTML file, I get this message: "This file contains more than one source and/or target language. The language appearing last in the file will be used. Would you still like to open this file?"

If I try to save the file from OpenOffice Writer as a DOC file instead of an HTML file, TagEditor will die when it tries to edit the file.

And if I skip the whole procedure by saving as a text file early on, then I lose all the table formatting.

So using the OpenOffice HTML route worked best for me.

It looks like the editor that Kevin mentioned (ABBYY FineReader) might have allowed me to selectively keep some more formatting information, but the editor wasn't free, I didn't want to buy the product just for this purpose, and I wasn't sure which version I should select if I did end up buying it.

[Edited at 2009-02-03 23:29 GMT]



Alan Frankel  Identity Verified
United States
Local time: 10:47
German to English
TOPIC STARTER
Examining the TTX file in the Eclipse XML reader Feb 3, 2009

By the way, if I were going to try to edit the TTX file in an automated way, it looks like the Eclipse XML reader (which comes along with the Java IDE version) would be the thing to use, since it lets you look at the elements in a clear way (although first you have to change the extension to "xml"). But I found that the regular expressions for the search-and-replace would have to be pretty complex.

Let's look at the string "Übersicht". The client's OCR tagged the "Ü" with a separate font. The XML looked like this (clearing away the attributes and the values of the df and ut elements and adding spaces around the angle brackets):

< df >
< ut >
< /ut >
Ü
< /df >
< df >
< ut >
< /ut >
bersicht
< /df >

It would have been possible to find a regular expression to search for a single "Ü" preceded by a ut element, both contained in a df element, followed by another string preceded by a ut element, both contained in a df element... and then to strip out everything but the values. But this seemed even riskier and more complicated than I had expected at first, when I thought that the tags were organized in a neater and simpler way.
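An XML parser may be a safer route than regular expressions here. The following Python sketch works on the simplified structure shown above (df elements containing a ut element followed by text), written compactly without the line breaks, and folds a df that holds exactly one accented character into the df that follows it. The wrapper element name "seg", the function names, and the character set are assumptions, and a real TTX file carries attributes and content this sketch ignores.

```python
import xml.etree.ElementTree as ET

ACCENTED = set("äöüÄÖÜß")


def visible_text(df):
    """Text directly inside df; the contents of its ut children are skipped."""
    return (df.text or "") + "".join(child.tail or "" for child in df)


def merge_leading_accent(root):
    """Merge single-accented-character df runs into the next df sibling."""
    merged = 0
    for parent in list(root.iter()):
        children = list(parent)
        for first, second in zip(children, children[1:]):
            if first.tag != "df" or second.tag != "df":
                continue
            char = visible_text(first)
            if char not in ACCENTED:
                continue  # empty, longer than one character, or not accented
            if first.tail:
                continue  # something sits between the two runs
            if len(second) and not second.text:
                # Text follows the ut child, so prepend to that child's tail.
                second[0].tail = char + (second[0].tail or "")
            else:
                second.text = char + (second.text or "")
            parent.remove(first)
            merged += 1
    return merged


if __name__ == "__main__":
    root = ET.fromstring(
        "<seg><df><ut>a</ut>Ü</df><df><ut>b</ut>bersicht</df></seg>"
    )
    print(merge_leading_accent(root))
    print(ET.tostring(root, encoding="unicode"))
```

The merged character takes on the formatting of the run it joins, which matches the goal of getting rid of the stray font change. An accented character in the middle of a word would still leave two runs behind, so this covers only the word-initial case described above.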



ViktoriaG  Identity Verified
Canada
Local time: 10:47
English to French
+ ...
Something good did come out of it Feb 4, 2009

Wow, nice! I guess we have just learned a bit from your misadventure, Alan! Nice job!

About the way you handled this with your client, that is more or less how I would have handled it. You basically let the client know that as a translator, you couldn't provide them with a miracle, and that you can only go so far in making them happy - but you didn't let them down either.

Thanks for keeping us posted on the developments.

All the best!

[Edited at 2009-02-04 16:49 GMT]



Tomás Cano Binder, CT  Identity Verified
Spain
Local time: 16:47
Member (2005)
English to Spanish
+ ...
DOCX files? Feb 4, 2009

I wonder: As this situation rings a bell, is it possible that the DOC files you received were DOCX files (Office 2007 files)? You might want to download Microsoft's compatibility pack for Office and avoid all the trouble you described:

http://www.microsoft.com/downloads/details.aspx?FamilyID=941b3470-3ae9-4aee-8f43-c6bb74cd1466&displaylang=en

