Translating Web sites
Today, being able to translate HTML is crucial, for
obvious reasons, and just about every translator will accept HTML files. Yet,
although it's not politically correct to mention this here, the truth is that
many translators don't know enough about HTML and websites to do a professional
job. There are LOTS of good HTML tutorials around, but they
are all intended for would-be or even professional webmasters,
and skip important issues a translator should be aware of. I hope this
article fills the gap and helps you do a better job.
If you are already familiar with HTML, keyword
handling and style sheets, go straight to page
4 for more on preparing an HTML file for translation and doing the
translation.
(Basic and not so basic)
What is HTML and how does it work? HTML stands for HyperText
Markup Language. Hypertext is text characterized
by the presence of links. Take a book. You read from the beginning and
move toward the end. With hypertext, you can immediately access
the information you are looking for by clicking on links.
An HTML file is a simple text file with an “htm”
or “html” extension. Try the following experiment:
Take a simple text file, “whatever.txt”, and rename it to “whatever.htm”.
Double-click on it and it will open in your default web browser. Now,
you will note that there are no links. There is no bold, no underlining,
no tables, no pictures and not even paragraph breaks.
HTML is the "language" that you use to tell
the browser (Internet Explorer, Netscape, Mozilla, Opera...) how the page
should be displayed and what it should do in different situations (for
instance, when the user clicks on a link, the browser finds the page and
displays it). To do that, it uses “markups”. A markup, or tag,
is a small piece of code that provides this information. In HTML, tags
are made of a “<” sign, some code and a “>”
sign. Case is not important.
For instance “<b>” tells the browser
that whatever information follows that tag should be displayed in bold.
Now, unless you want everything to be displayed in bold, there must be
another tag to tell the browser where it should stop to display the text
in bold. That tag is “</b>”. Note the “/”
sign. The tag triggering the bold display (<b>) is called an opening
tag. The tag canceling the action of the opening tag (</b>)
is called a closing tag. There are tags for just about every
formatting option: italics, underline, color, size… You will find
lists of them very easily on the net.
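To make the opening/closing pair concrete, here is a small sketch. The sentence is invented for the example; only the words between an opening tag and its closing tag are affected.

```html
<!-- Each opening tag is matched by a closing tag with a "/" -->
<p>This word is <b>bold</b>, this one is <i>italic</i>,
and this one is <u>underlined</u>.</p>
```

When you translate such a sentence, the tag pairs have to end up around the corresponding words in the target text.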
There are other types of tags in an HTML document.
For instance, there are tags detailing the structure
of the page and its general behavior. An HTML page is usually structured as follows:
<HTML> (To tell the browser that this page is in HTML)
<HEAD> (Header. Contains information about the page that will not
be displayed, but can nevertheless influence the display.)
</HEAD> (Closes the “<head>” tag. Most tags should
be opened and closed.)
<BODY> (The actual page. This is what you see when you open the
page in the browser)
</BODY> (Closing tag for <body>)
</HTML> (Closing tag for <html>)
You need not change the structure tags when you translate.
Another type of tag is the Meta tag.
These are located in the header and give information on the page, used
mostly by search engines, like keywords, description of the page, author
and copyrights… You will need to translate the contents of some
of these tags. Bearing in mind that these tags are mostly intended for
search engines, you have to translate the keywords and
description using words that people will use to find
the web site. It’s not a matter of just translating those.
You have to think a little bit about which terms are
applicable to the page and will be the most popular. You are likely to
find misspellings in the Meta tags. They are there on purpose, so that
people who misspell their search terms in the search engine find the page
anyway. If so, misspell too. Google listed the misspellings it found for
“Britney Spears”. There are hundreds, and they have been searched
for by thousands of people, so misspellings of popular search terms can amount
to significant traffic.
If you find well-thought-out descriptions and several
typos in the Meta tags, be extra careful, for this is
evidence that your customer has attempted some search engine optimization,
and perhaps paid a lot of money to do so. Don’t ruin it.
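As an illustration, keyword and description Meta tags might look like the sketch below. The product, the wording and the deliberate misspelling (“wigdets”) are all invented for the example; only the content values are candidates for translation, never the name values.

```html
<head>
  <!-- Hypothetical values: translate what is inside content="...", leave name="..." alone -->
  <meta name="keywords" content="blue widgets, blue widget, blue wigdets, cheap widgets">
  <meta name="description" content="Hand-made blue widgets delivered worldwide.">
</head>
```

In the translation, the equivalent of “blue wigdets” would be a misspelling that speakers of the target language are likely to type, not a literal rendering of the English typo.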
There is one other important item in the Meta tags:
the charset. It tells the browser which character set
is used in the page. If you translate from a language whose character
encoding is different from that of your own language, you may have to change
the encoding for the page to display properly. Here is what that Meta tag looks like:
<meta http-equiv="Content-Type" content="text/html;
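Written out in full, such a tag might look like the sketch below. The charset value shown (utf-8) is only an assumption for illustration; the right value is whatever encoding the translated file is actually saved in.

```html
<!-- Illustrative only: the charset must match the encoding the file is saved in -->
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
```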
The TITLE tag, <title>, sits in the header, and its content shows
in the title bar of the web browser when you display the page.
THIS is the single most important piece of text in your web page.
Why? Because search engines value it above everything else when they
analyze the page. “Welcome to Whatever.inc” is probably the
most stupid title you can come up with. A title should contain the keywords
that will be used to find the page. If the page talks about blue widgets,
the title should have “blue widget” in it! Now, of course,
you are translating. That means you have to follow the original web page,
and if the original title is “Welcome to Whatever.inc”, then
keep it, but if you can see the author has put some thought into the title
to include keywords in a specific sequence, give it some thought yourself.
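To show the difference, here are two titles side by side. The second one is a made-up example of a title built around keywords; the wording is mine, not taken from a real site.

```html
<!-- Gives a search engine very little to work with -->
<title>Welcome to Whatever.inc</title>

<!-- Hypothetical keyword-bearing title: the keyword order is deliberate -->
<title>Blue widgets - hand-made blue widgets by Whatever.inc</title>
```

With a title like the second one, the translation should keep the keywords and, as far as the target language allows, their sequence.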
Links. In HTML, a link looks like this:
<a href="http://www.website.com" title="Good web site"> Web Site </a>
“a” stands for “Anchor”, and
“href” tells the browser where that “anchor” is
located (here, “http://www.website.com”). “Title”
gives a title for the link, so that when you pass the mouse over the link,
a small note will display, “Good web site”, in this example.
You have to translate it. “Web Site” is the text of the link.
You may or may not have to translate it. “</a>” is the
closing tag.
Images. Although you see images in
web pages, they are not really inside the HTML document. It’s a
simple text file, right? In fact, you have a tag that tells the web browser
where the picture is stored and how to display it (what size, with or
without a border, where on the screen…). The image tag is <img
… alt=“… of a blue widget”>. It has no closing tag. You should not change
the image tag except for the content of the “alt” attribute. “Alt”
stands for “Alternate text”.
In the early days of the Internet, many browsers were not
able to display pictures, or displaying them was too slow, so many users disabled
pictures to surf faster. To let those users understand what picture
should be there, the alt text is displayed instead. In some browsers, even if the image
is displayed, the alt text shows when you move the mouse over the image.
You have to translate it.
The “alt” and the “title” are
usually loaded with keywords for the search engines. If this is the case,
make sure that the translation carries its keywords in the same way.
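A before/after sketch may help here. The file names, the URL path and the wording are invented for the example; in this sketch only the title and alt values change, while href and src are left exactly as they are.

```html
<!-- Source (English) -->
<a href="http://www.website.com/widgets.htm" title="Blue widgets catalog">
  <img src="blue-widget.jpg" alt="Picture of a blue widget">
</a>

<!-- Target (French): only the title and alt values are translated -->
<a href="http://www.website.com/widgets.htm" title="Catalogue de widgets bleus">
  <img src="blue-widget.jpg" alt="Photo d'un widget bleu">
</a>
```

Note that the keyword (“blue widget”) appears in both attributes of the source, and the translation keeps it in both.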
HTML has evolved a lot from the first version. Nowadays,
a web designer can decide exactly the size of the text, create styles
(a concept similar to styles in a word processor – more on that
later), set the position and so on. But in the early days, HTML was much
simpler.
The web was used for text. You had a series of tags
to identify the document’s hierarchy, called the “heading
tags” <h1>, <h2>, <h3>… and their closing
tags, </h1>, </h2>, </h3>. H1 is the main heading. It's
big, bold, often too big, in fact. H2 is a secondary heading, slightly
smaller. H3 is smaller again... You get the idea.
Although there are much better ways in current HTML
to arrange the display, the H tags have remained and are used by search
engines when they analyze a page, the rationale being that if a word is
in a heading, it is more relevant to the page content. This is the main
reason why many web sites still use those tags even if that means a little
bit more work. As a translator, these tags tell you that you are translating
a heading, and its position in the document's hierarchy.
They are also a warning that you have to pay attention to
the words inside these tags. Exactly: keywords. Usually, you will see
the same keywords used in the H tags and in the “keywords”
Meta tag. Make sure that you use the same keywords. Search engines analyze,
amongst other things, the number of times a specific keyword appears compared
to the total number of words in the page, and where. Try to keep the same
proportion as the original document, and if a keyword is in a heading,
make sure your translation leaves a keyword in that same heading.
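Here is a small sketch of what that means in practice. The headings are invented for the example; the point is that the hierarchy stays the same and each translated heading still carries the keyword.

```html
<!-- Source (English) -->
<h1>Blue widgets</h1>
<h2>Why choose a blue widget?</h2>

<!-- Target (French): same heading levels, keyword kept in each heading -->
<h1>Widgets bleus</h1>
<h2>Pourquoi choisir un widget bleu ?</h2>
```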
For the same reason, HTML contains a number of redundant
tags, like <b> and <strong>, or old ones that you almost never
see anymore, like “<big>” (self-explanatory, I think).
Look for these. It is too easy to concentrate on the “standard”
<b>, <i>... and forget to handle those old things. You may
need to move them, too.
Next, styles and style sheets. A “style”
is a series of attributes defined in advance, either in the header
of the document, or in a separate file called a style sheet.
To understand styles, you need to understand what problems
they solve.
Suppose you want the big titles in your web site to
be bold, italic, blue, and centered. In good old HTML, you would write:
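Old-style markup of this kind might look like the sketch below. The exact tags and attributes are my assumption of how such a title was typically written, and the title texts are placeholders.

```html
<!-- Every single title repeats the full set of formatting tags -->
<h1 align="center"><font color="blue"><b><i>Title 1</i></b></font></h1>
<h1 align="center"><font color="blue"><b><i>Title 2</i></b></font></h1>
```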
Pretty clumsy, isn’t it? And that's just 4 simple
attributes. The solution is to define a style with all these specifications:
It’s bold, it’s blue, it's centered, and you give it a name,
e.g. bbc (for Bold Blue Centered; this is just an example, and a style is normally
named so that you can easily remember what it does). Then, you don't need to
write it every time. In the header of the page, you write:
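A style definition along these lines would do it. This is a sketch only: the class name “bbc” comes from the text, but the exact properties are my assumption of what “Bold Blue Centered” would translate to.

```html
<head>
  <style type="text/css">
    /* "bbc" is the example class name used in the text */
    .bbc {
      font-weight: bold;
      color: blue;
      text-align: center;
    }
  </style>
</head>
```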
Then, anytime you have a title, you write
<h1 class="bbc">Title 1</h1>
<h1 class="bbc">Title 2</h1>
<h1 class="bbc">Title 3</h1>
But the best part is that if, after all is done, you decide
that it would be nicer in red, or that italics would be cool, you don’t
have to look all over the document and change all the tags, one by one.
You simply change one word in the style definition and every instance changes
at once. This not only saves a lot of time when you design the page, but
also makes the page smaller, and thus faster to load.
Now, if you want to use a style in several pages, or
even the whole site, you have to copy the same styles in the header of
each page. Not too smart. The solution was to write all the styles in
a separate file, called a style sheet, then to link each
page to the style sheet. That way, you write the styles only one time,
and in each page, you have a link in the header that looks like this:
<link href="/stylesheet.css" rel="stylesheet"
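Written out in full and placed in the header, such a link might look like the sketch below. The href value is the one used above; the type attribute is an assumed completion, since it is commonly present but not required.

```html
<head>
  <!-- href as in the text; type="text/css" is an assumed, optional completion -->
  <link href="/stylesheet.css" rel="stylesheet" type="text/css">
</head>
```

There is nothing to translate in this tag, but if the page looks wrong when you open your translation locally, a style sheet that cannot be found at that path is a likely culprit.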
A style sheet file’s extension is “*.css”.
Now, as a translator, this is relatively important to know because it
determines how the text will be displayed and where. The same page can
look completely different with and without the style sheet. With experience,
you can look at the source code and “see” the page (No, this
ain’t the Matrix yet ;-). That helps a lot, because you don’t
need to check out the page in the browser every few minutes.
Anyway, this should cover the basic HTML you need to
translate. When you get a bit more time, pick one of the many HTML tutorials
on the Web and learn about tables and frames.
How to translate HTML
There are two reliable, proven methods and many wrong
methods. Amongst the wrong methods, the most popular are:
• Opening the HTML file in Word, working there and using “Save as
a web page”. This changes the code and turns it into a complete
mess that is twice the size of the original page, causes no end of display
issues and is about as popular with search engines as a dead cat at a wedding.
If you want to hear a knowledgeable customer scream, go ahead.
• Translating in other WYSIWYG editors (What You See Is What You
Get). They usually mess up the code as well, although I don’t know of
any as bad as Word in that respect, save perhaps FrontPage. Dreamweaver
is an exception to that rule, but a costly one if you are simply translating.
• Using translation software that hides the tags. That can be
very attractive for beginners, but if you understood the section above
properly, you will see why this is not a good solution at all. An example
of such software is Catscraddle. That software is very smooth but will
cause problems because you don't know what is what, and the sentences
are cut midway if the page uses formatting. If it did the job correctly,
I would be the first to use it because I love the interface and it's very
fast. Unfortunately, the basic concept is VERY flawed and if you want
to do a professional job, just don’t.
The correct methods include :
• Opening the page in an HTML editor, preferably one that supports color
coding of the tags. There are many free ones. I like AceHTML very much,
but that's far from the only one available. Either way, translate the
text and move the tags as needed. For example:
English: John’s <i>girlfriend</i> is quite cute.
French: La <i>petite amie</i> de John est plutôt mignonne.
As you can see, you have to decide where the tags should be in the target
sentence.
Working that way can be a pain, but if you know your code and are careful,
the output will be irreproachable. However, you must stay very alert not
to forget or erase tags by mistake.
• Preparing the file, then using a CAT tool like Wordfast or Trados to
translate it, then restoring the HTML format. Not all CAT tools work the same
way, but remember that professional handling of web site translation
*requires* quick access to the tags. The ability to move, edit or delete
tags is not optional, it’s a must. With Trados, you can also use
TagEditor, although you may miss the flexibility that comes with working
in Word. Moving or deleting tags can be quite clumsy in TagEditor.
Preparing the text for translation:
1. What are tagged files?
What do I mean by “Preparing the text for translation”?
For translation purposes, there are 2 types of tags:
• Tags that you may need to move or edit and that are/could be located
in the middle of a segment
• Tags that you will almost never change and that are not (or should not
be) in the middle of a segment
Overall, there are very few tags that you may need to
delete during the translation process.
"Preparing files" means modifying the files
so that they can be translated easily using a CAT tool. What follows is a description
of a file prepared for Wordfast/Trados, a “tagged file”, in
translator lingo. Since Trados is/was widely used, most professional
CAT tools can handle this type of file, with more or less success. However,
if you own and use another CAT tool (SDLX, DV,…), please check its
documentation. As you will use a CAT tool to work on the tagged file, I assume
that you are familiar with the basic concepts. (If not, please read the
following pages of this web site before going further: “What
are CATs?” and “First
A tagged file is an RTF file containing the source code
(meaning, tags + text) of the original HTML file. The tags are identified
using 2 styles: tw4winInternal and tw4winExternal. Without getting into
details, the tw4winInternal style is red, and the tw4winExternal is light
grey. Whenever you receive a file with tags in red and grey, it’s
almost a given that the file has been tagged. Although the handling is
very similar, beware that HTML files are not the only tagged files, and
many more exotic formats are tagged for use with CATs, like SGML, XML,
QuarkXpress, FrameMaker, etc.
All tags are protected against deletion by default,
so that you do not delete one by mistake. Tags that you may need to move,
like <b> (bold), are in tw4winInternal. “Internal” because
they will be included in the segment you have to translate. They are in
red. Tags that you don't need to change or to be concerned about during
the translation process, like <p> (paragraph mark) or <body>, are in
tw4winExternal and are in grey. A tag in tw4winExternal
style will end a segment automatically.
Here is an example:
Correct: You are learning to translate <b>Web
Sites</b></p>
By now, you should know that “Web sites”
is in bold, and that the </p> shows the end of a paragraph. When
you open that sentence with Wordfast (or Trados), the segment will end
just after the </b>, although there is no period, because </p>
is in tw4winExternal style.
Incorrect: You are learning to translate <b>Web
Sites</b></p>Bla bla bla
(The segment would stop right after “translate”).
Incorrect: You are learning to translate <b>Web
Sites</b></p>Bla bla bla
(The segment would include everything).
Incorrect: You are learning to translate <b>Web
Sites</b></p>bla bla bla
(The segment would include everything and the tags are
2. Tagging an HTML file?
If you open the source code of virtually any HTML file,
you will see there are a LOT of tags. So changing the styles manually
is just not workable. You need to use another program to tag (prepare)
the file. It’s rather easy to do for HTML and other relatively
common formats like XML and SGML. My personal preference goes to Rainbow
(freeware). There are other possibilities, like +Tools.
The process is rather simple and well explained in the documentation
of both programs, so I won’t go into detail. In Rainbow (once
installed), you click on “Add”, select the HTML files you
need to prepare, go to the Tools menu, select “Prepare for translation”,
fill out the needed options, and under the “Package” tab,
select where the tagged files should be created.
Some stuff may look complex, but frankly it’s
a no-brainer, when all you have to do is prepare an HTML file.
Find your files, open the rtf file in Word, and you
are ready to translate.
3. Translating a tagged file.
This depends on your CAT tool. In Wordfast, start the translation
as usual, with your TM and glossaries, the lock bolt on the door, gaffer
tape across the neighbor’s kid's mouth, Mozart playing (or AC/DC,
your call), … whatever your set-up usually is when you translate.
Tags in tw4winInternal are treated as placeables.
You can select them in the source segment using “Ctrl + Alt + Left/Right”,
and “Ctrl + Alt + Down” will copy the selected tag into the target segment,
at the insertion point. Type your translation in the target and bring
down the tags at the appropriate points in the target sentence.
Use the tags to work out what the text will look like, and
do not hesitate to refer to the original HTML file when in doubt. As
explained before, keep keywords in mind and balance the text to match
the original’s proportions as closely as possible. (Of course, if
the page is not meant for the general public but for an intranet, that becomes
much less important.)
Please refer to the “tagged files” section
of the Wordfast manual. In summary, you have to make sure that
you do not forget tags (Wordfast has settings to remind you), and that you
keep the internal tags in the tw4winInternal style and the translatable text
in whatever style was originally used.
You are translating an <b>HTML</b> file!
Vous êtes en train de traduire un fichier <b>HTML</b>
4. Done, now, what?
When your translation is done and the file cleaned (meaning
all source segments and segment delimiters have been deleted), you
have a nice… RTF file. If neither the source nor the target
language requires Unicode and you do not have special characters
in the file, save it as txt (or copy all the code into Notepad) and change
the extension to “*.htm” or “*.html”. If you use
a language that requires Unicode (Chinese, Japanese, Russian, Thai,...),
save the file with the appropriate encoding and modify the charset information
in the file header to reflect the new encoding (e.g. UTF-8). See the
HTML links to find out more about encodings and file formats.
If you have respected the tags, the file should look
about right in the browser. However, the translation is seldom the same
length as the original text, so you may have to make a few adjustments
to make it fit nicely. If you are lucky, everything can stay the same.
You are through. I hope this information will help
you tackle HTML files in a professional manner and feel confident with
them. As you can see, there is nothing really hard in HTML files, but
they do require some extra attention. If it's HTML, it's not just
At times the client wants you to translate the text
with no consideration for the HTML or a potential use on the net. That’s
all right. If so, skip everything and ask the client to provide a regular *.doc
file, or open the HTML in Word and save it as *.doc.
Good luck. ;-)
*This article is a courtesy of www.your-translations.com. You can find more articles there on CATs, Word, ...