How to extract text from a website
Thread poster: John Detre

John Detre  Identity Verified
French to English
Sep 7, 2011

I have a very basic question about translating websites that has no doubt been asked and answered many times before, but a cursory search of the forums hasn't turned up a solution.

I need to translate a website that is already live, including invisible text (e.g. keywords) and unselectable text, but not including html code. Is there a tool I can use to extract all the relevant text and import it into a word processor?

If anyone can help with this or point me towards a thread in which this has already been discussed, I would be most grateful. Thanks in advance and my apologies for asking such a rudimentary (to some) question.

MH TRADUCCIONES  Identity Verified
English to Spanish
Tool for websites Sep 7, 2011

Software: SharePoint Designer 2007 / TRADOS 2007 (TagEditor)

Joakim Braun  Identity Verified
German to Swedish
Dynamic HTML ps Sep 7, 2011

Be aware that pages may be modified on-the-fly with scripting. These days, the initial page as sent from the server is very likely NOT what you see.

You need to make sure that the HTML you save is the HTML of the page DOM as it stands at that point in the browser (not the HTML as initially served).

(But a dynamic website will hardly be translated that way, so on reflection never mind.)

[Edited at 2011-09-07 19:29 GMT]


English to Hungarian
Tricky Sep 7, 2011

This comes up pretty regularly here, and the only good answer is: arrange a three-way chat between the client, yourself and the webmaster who set the page up. There are a million ways for things to go wrong or get misunderstood.
You could use HTTrack or wget to download the website in question, but the local copy may well be incomplete or differ from what visitors actually see. Then you could translate the resulting HTML files with a CAT tool, but your client may not be able to use your HTML files directly.
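If all you want is the translatable text out of HTML files you have already downloaded, something along these lines might do as a starting point. It is only a rough sketch using Python's standard html.parser module; the names TextExtractor and extract_text are made up for this example, and the choice of what counts as "invisible text" (meta keywords and description, alt and title attributes) is my assumption, not a complete list.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text a translator would care about, skipping code."""

    SKIP = {"script", "style"}                 # contents are code, not text
    META_NAMES = {"keywords", "description"}   # assumed "invisible" metadata
    ATTRS = ("alt", "title")                   # assumed translatable attributes

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.segments = []      # extracted strings, in document order
        self._skip_depth = 0    # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.SKIP:
            self._skip_depth += 1
            return
        # <meta name="keywords" content="..."> and the like
        if tag == "meta" and (attrs.get("name") or "").lower() in self.META_NAMES:
            content = (attrs.get("content") or "").strip()
            if content:
                self.segments.append(content)
        # Attribute text such as image alt text and tooltips
        for name in self.ATTRS:
            value = (attrs.get(name) or "").strip()
            if value:
                self.segments.append(value)

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return
        text = " ".join(data.split())   # collapse whitespace and line breaks
        if text:
            self.segments.append(text)


def extract_text(html):
    """Return the list of text segments found in an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.segments
```

You would read each saved file into a string, pass it to extract_text, and write the segments out one per line for pasting into a word processor. Note that inline tags split a sentence into several segments ("Bonjour" and "le monde" come out separately for `<p>Bonjour <b>le monde</b></p>`), and nothing here helps you put the translation back into the page, which is one more reason a CAT tool with an HTML filter is the better route.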

For instance, you seem to be saying that you're expected to extract text and just deliver plain text to your client instead of HTML, but that doesn't sound like a good idea. If they want to use the translation on the website, they would need to manually copy-paste each phrase or paragraph to the right place... There has to be a better way.

If your client is sending you HTML files to translate, you could try and use a CAT to translate them. If you don't use a CAT, that's too bad...
