How to mend HTML files translated in MS Word
Thread poster: Abhinav_Hindi

Abhinav_Hindi
Local time: 20:24
English to Hindi
+ ...
Jul 20, 2010

I have translated about 100 HTML files using MS Word 2003 and saved them as web pages. Now the client says this has caused styling and formatting issues, since Word adds its own extra code. They want me to review and correct the formatting issues.

Please suggest what to do. I tried using Notepad++ but it seems I will have to do all the translation once again - there are 14000 words in 103 files!

The formatting looks fine to me, but the client says it is all scrambled on his end.
This is a very important client for me. I'm willing to help them, but doing everything from scratch again will be too much.

Please suggest. Thank you!!



Gerard de Noord  Identity Verified
France
Local time: 15:54
Member (2003)
German to Dutch
+ ...
Did you use CAT? Jul 20, 2010

I'm afraid your client is right. Word does add a lot of unnecessary code and, generally speaking, the Word/HTML files you have delivered are of little use to them. (I presume you have received HTML files in the first place and were asked to deliver translated HTML files.)

If you used a CAT tool to translate the HTML files, you'll be able to reapply your TUs after one or more conversions. If you used only Word, you'll have to retranslate everything from scratch, either with a (free) CAT tool or in Notepad(++).

Cheers,
Gerard



FarkasAndras
Local time: 15:54
English to Hungarian
+ ...
One idea: align+CAT Jul 20, 2010

I don't have much experience with HTML, but here's a suggestion that should (mostly) work.
Merge your 100+ source and target files into one source file and one target file. (Put all source files in a folder, navigate to that folder in Total Commander, type copy *.html all_source.html in the box at the bottom, hit Enter, then rinse and repeat with the target files.)
Once you have the two html files containing all your text, align them, e.g. using my aligner from here:
http://sourceforge.net/projects/aligner/

Using the HTML option in the aligner strips all HTML tags from the text, which is probably what you want given that the tags are all wrong.
If you're good with computers, you can probably learn how to use the aligner in 0.5-1 hour and do the job in another 0.5-1 hour.
At the end you'll have a TMX with all your text sans tags. Then you'll have to do the actual translation in your CAT of choice, using the TM. Hopefully you'll get TM hits for (almost) all segments, and you'll only have to insert the HTML tags. Even if you don't get straight TM hits, you can do a concordance search.
If you don't use a CAT, you're pretty much SOL. You could use Notepad++ in conjunction with Apsic Xbench, but that's less than ideal... possibly worse than just copy-pasting straight from your translations.
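For anyone who would rather script the merge-and-strip step than use Total Commander, here is a minimal Python sketch. It is not the aligner linked above and does no alignment; it only writes the visible text of every .html file in a folder to one text file, one segment per line. The names TextExtractor, html_to_segments and merge_folder are made up for this example, and the list of inline tags is an assumption you may need to extend.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collect visible text as segments, dropping all tags."""

    SKIP = {"script", "style"}  # contents are code, not translatable text
    # Assumed inline tags: they do not end a segment.
    INLINE = {"a", "b", "i", "u", "em", "strong", "span", "font"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.segments = []
        self._buffer = []
        self._skip_depth = 0

    def _flush(self):
        # Collapse whitespace and store the buffered text as one segment.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.segments.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        if tag not in self.INLINE:
            self._flush()

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        if tag not in self.INLINE:
            self._flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()


def html_to_segments(html):
    """Return the text segments found in one HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.segments


def merge_folder(folder, out_file, encoding="utf-8"):
    """Write the text of every .html file in `folder` to `out_file`."""
    lines = []
    for path in sorted(Path(folder).glob("*.html")):
        lines.extend(html_to_segments(path.read_text(encoding=encoding)))
    Path(out_file).write_text("\n".join(lines) + "\n", encoding=encoding)
    return len(lines)
```

Run it once on the source folder and once on the target folder (for example merge_folder("source", "all_source.txt")), then feed the two text files to an aligner. Files are read in sorted name order, so source and target files need matching names for the two outputs to line up. Word-saved files may not be UTF-8, in which case pass a different encoding.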


Any HTML-savvy colleagues see a flaw or have a better suggestion?



[Edited at 2010-07-20 21:01 GMT]



Tomás Cano Binder, BA, CT  Identity Verified
Spain
Local time: 15:54
Member (2005)
English to Spanish
+ ...
Indeed... Jul 20, 2010

This is a tricky situation. Indeed, Word converts HTML files into Word files as best it can (which is not very good, to be honest), and then from the Word file it generates new HTML code to try to emulate the HTML code needed to achieve the same visual results (which does not work very well either).

The result with Word (or with any other tool that interprets and generates HTML code from visual documents) will be a total destruction of the coding the customer had done in the original files. Quite basically, the files will not work in the customer's system.

In a normal situation, in order to translate HTML files you have two paths of action:
1. With a CAT tool, opening the HTML files and translating them with the translation memory, since CAT tools don't alter the HTML coding (they simply work around the HTML coding by marking it up with special colours and in a way you cannot --or should not be able to-- alter the code).

2. Without a CAT tool, opening the HTML files with any software that does not generate new code and does not show you the contents in a visual manner. There are editors out there that mark up the HTML code with special colours to help you avoid any damage and work around the text. I always use a CAT tool, so right now I cannot think of a hint about what software to use in this case.

Now, in your current situation, let me ask you whether you have used a CAT tool (a translation memory) to do this work. If you have one and know how to use it, you should be able to retranslate the HTML files (without the use of Word, of course), although it will mean plenty of work adding the HTML codes (the tags) you did not have when you translated the files in Word. In 14,000 words of HTML files, my quick estimate is that you will have to spend some 6-8 hours more just adding the destroyed tags, even if the translated text is all there. Your CAT tool will also show lower matches in sentences which now have tags.

If you haven't used a CAT tool and don't have one or don't know how to use one... my friend, I can only see three options:
1. Retranslate everything from scratch (with a plain-text editor, not a visual editor). I assume you don't want to do this.

2. Copy and paste your translations from the Word files to HTML files (opening the latter in a plain-text editor), which will mean a high risk of messing up, as well as all the additional work of adding the destroyed tags.

3. Hire some translator in your language pair who has a good CAT tool, can align your translation, and can use the CAT tool to speed up the process. Even in this case, for that many words my estimate is that a very proficient person will require at a very minimum 4-5 hours of work just to add the destroyed tags...

I am afraid I have to be blunt and say that... if you were not sure about how to handle HTML files or similar tagged formats, you should have asked an expert, asked for training, asked the customer, or rejected the job. Now any solution will cost you either A) lots of time and risk, or B) a big chunk of your income for this job, as well as the damage already suffered by your reputation with your customer.

Good luck!



Romeo Mlinar  Identity Verified
Portugal
Local time: 14:54
Member (2009)
English to Serbian
+ ...
Some sites, search & replace Jul 20, 2010

1. You could try this: http://www.wordhtmlcleaner.co.uk/ or http://www.algotech.dk/word-html-cleaner-input.htm

2. If you're familiar with HTML you could use Notepad++ or UltraEdit (recommended) to run search and replace on the code. In this case you would replace the unwanted Word markup with "".
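A rough Python sketch of that kind of search and replace, for anyone who would rather run it over all 103 files at once. The patterns are assumptions about what Word's "Save as Web Page" typically adds (conditional comments, o:p tags, Mso class names, mso- styles), and strip_word_markup is a made-up name. It only reduces the noise; it cannot restore the client's original markup.

```python
import re

# Assumed patterns for Word-specific markup; extend as needed.
WORD_JUNK = [
    # Conditional comments such as <!--[if gte mso 9]> ... <![endif]-->
    re.compile(r"<!--\[if.*?<!\[endif\]-->", re.S | re.I),
    # <o:p> and </o:p> tags
    re.compile(r"</?o:p[^>]*>", re.I),
    # class attributes with Mso... names, quoted or unquoted
    re.compile(r'''\s+class=("Mso[^"]*"|'Mso[^']*'|Mso\w+)''', re.I),
    # Quoted style attributes that mention an mso- property.
    # Note: this drops the whole style attribute, not just the mso- part.
    re.compile(r'''\s+style=("[^"]*mso-[^"]*"|'[^']*mso-[^']*')''', re.I),
]


def strip_word_markup(html):
    """Replace every match of the patterns above with an empty string."""
    for pattern in WORD_JUNK:
        html = pattern.sub("", html)
    return html
```

Regular expressions are a blunt instrument for HTML, so try this on copies and compare a few files by eye before trusting the result.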

R.



Soonthon LUPKITARO(Ph.D.)  Identity Verified
Thailand
Local time: 21:54
Member (2004)
English to Thai
+ ...
Alignment Jul 21, 2010

I understand that you translated many sentences and that the formatting problem has made the result useless.
If I were you, I would align the translated text to get a TM (using a CAT tool, e.g. Trados). Next, translate the source files with Notepad or a CAT tool, using the newly created TM.
Despite the formatting differences, you will save much time if you reuse your previous translation as a TM in a CAT tool.



Piotr Bienkowski  Identity Verified
Poland
Local time: 15:54
Member (2005)
English to Polish
+ ...
Use CAT tools with "tagged" source documents Jul 21, 2010

No good CAT tool should mess up the HTML code. Unfortunately MS Word will do that. OpenOffice Writer, too, but in different ways. Word processors should not be used for editing HTML, if the source code of HTML must not be changed.

A good CAT tool will extract the text intended for translation from the HTML file(s) and will protect the HTML code, or tags.
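To make the idea concrete, here is a toy Python sketch of tag protection. It is not how any particular CAT tool works internally. It re-emits every tag as it was found and only swaps text for which a translation is known. The class TagPreservingTranslator, the function translate_html and the plain dictionary standing in for a TM are all made up for this example.

```python
from html.parser import HTMLParser


class TagPreservingTranslator(HTMLParser):
    """Copy the HTML through, replacing only text found in `memory`."""

    def __init__(self, memory):
        super().__init__(convert_charrefs=False)
        self.memory = memory   # source text -> translated text
        self.out = []          # pieces of the output document
        self.missing = []      # text with no translation yet
        self._raw = False      # True inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        self._raw = tag in ("script", "style")
        self.out.append(self.get_starttag_text())  # tag exactly as written

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self._raw = False
        self.out.append(f"</{tag}>")  # note: end tags come out lowercased

    def handle_data(self, data):
        key = data.strip()
        if self._raw or not key:
            self.out.append(data)
        elif key in self.memory:
            # Keep the surrounding whitespace, swap only the text itself.
            self.out.append(data.replace(key, self.memory[key], 1))
        else:
            self.missing.append(key)
            self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def translate_html(html, memory):
    """Return (translated_html, list_of_untranslated_text)."""
    parser = TagPreservingTranslator(memory)
    parser.feed(html)
    parser.close()
    return "".join(parser.out), parser.missing
```

The sketch only matches a whole run of text between two tags, so a sentence containing inline tags or entities such as &amp; will land in the missing list. Real CAT tools handle those with placeholders and fuzzy matching, which is a good reason to use one.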

A free solution, if you don't have any CAT tool: use (free) Okapi Rainbow to prepare the HTML files for translation with OmegaT (free) [conversion to XLF], or convert them to RTF to translate with Wordfast Classic without a license, with a limit of 500 TUs in the TM.

Okapi Rainbow will also take care of reverse conversion of the translated file(s) to the source format.

It's a pity you had to learn about it the hard way. But you will know better next time.

HTH

Piotr



RashmiP
United States
Local time: 06:54
Try Google Jul 21, 2010

Quite a serious issue. I don't have much knowledge of HTML, but I am sure that you will find a solution. Search on Google and you should find some tool to fix your problem.


Samuel Murray  Identity Verified
Netherlands
Local time: 15:54
Member (2006)
English to Afrikaans
+ ...
The client is also at fault Jul 21, 2010

Tomás Cano Binder, CT wrote:
I am afraid I have to be blunt and say that... if you were not sure about how to handle HTML files or similar tagged formats, you should have asked an expert, asked for training, asked the customer, or rejected the job.


I think the client has a responsibility too.

MS Word can open HTML files and it can save HTML files. There is no way that someone who isn't familiar with the ins and outs of HTML would know instinctively that HTML is a format to be careful with. The same applies to many other formats that we haven't worked with before but which our tools do support. I'll wager that some translators who do HTML in their CAT tools don't know much about HTML itself, and do it simply because their tool can do it.

The same applies to the OP -- his tool can do it, so how was he supposed to know that his tool (arguably the most widely used text editing program in the world) will produce imperfect files?

The OP's dilemma here is that he doesn't want to lose this client, and therefore he has to accept the client's word that the files have formatting errors (even though he himself can't see the formatting problems). The option of hiring a seasoned HTML translator to align the text into a TM and redo the job is a good idea, IMO.



Soonthon LUPKITARO(Ph.D.)  Identity Verified
Thailand
Local time: 21:54
Member (2004)
English to Thai
+ ...
Client is ignorant? Jul 22, 2010


I think the client has a responsibility too.
MS Word can open HTML files and it can save HTML files. There is no way that someone who isn't familiar with the ins and outs of HTML would know instinctively that HTML is a format to be careful with. The same applies to many other formats that we haven't worked with before but which our tools do support. I'll wager that some translators who do HTML in their CAT tools don't know much about HTML itself, and do it simply because their tool can do it.

I totally agree. I was once not paid by a client for an EN>JP HTML file translation project. The client (who is a software expert) insisted on my using MS Word for HTML, but I used Notepad. I struggled a lot to prove that the client's instruction was wrong.



David Russi  Identity Verified
United States
Local time: 07:54
English to Spanish
+ ...
Use DreamWeaver Jul 22, 2010

Abhinav_Hindi wrote:

The formatting looks fine to me, but the client says it is all scrambled on his end.
This is a very important client for me. I'm willing to help them, but doing everything from scratch again will be too much.

Please suggest. Thank you!!


DreamWeaver has a function called Clean Up Word HTML... it is not perfect, but it can remove a lot of Word's junk code. At that point, the files may be clean enough to extract the translation from them with an alignment tool.

Good luck!



Tomás Cano Binder, BA, CT  Identity Verified
Spain
Local time: 15:54
Member (2005)
English to Spanish
+ ...
The translator's fault Jul 22, 2010

Samuel Murray wrote:
I think the client has a responsibility too.

I don't agree. The customer had assumed that the translator knew how to handle HTML files properly. The fact that the project consisted of HTML files was surely stated when the customer asked for a quotation.

Not knowing that HTML files cannot be processed with Word reveals insufficient knowledge of HTML, and the translator should have tried to gather information about how to handle these files properly before accepting the job.

BTW: This looks like a dead topic as the OP does not seem to care much about what we have to say...



FarkasAndras
Local time: 15:54
English to Hungarian
+ ...
No need Jul 22, 2010

David Russi wrote:

Abhinav_Hindi wrote:

The formatting looks fine to me, but the client says it is all scrambled on his end.
This is a very important client for me. I'm willing to help them, but doing everything from scratch again will be too much.

Please suggest. Thank you!!


DreamWeaver has a function called Clean Up Word HTML... it is not perfect, but it can remove a lot of Word's junk code. At that point, the files may be clean enough to extract the translation from them with an alignment tool.

Good luck!


The quantity of code (tags) is not a factor. My aligner (see link above) strips every tag, so this intermediate step wouldn't affect the end result at all.
In some cases, it might be better to leave tags in, but I'm not sure if you need to do something special for that to work, and that's another story anyway.

