How to extract text from a website
Thread poster: John Detre

John Detre
Canada
French to English
Sep 7, 2011

I have a very basic question about translating websites that has no doubt been asked and answered many times before, but a cursory search of the forums hasn't turned up a solution.

I need to translate a website that is already live, including invisible text (e.g. keywords) and unselectable text, but not including the HTML code. Is there a tool I can use to extract all the relevant text and import it into a word processor?

If anyone can help with this or point me towards a thread in which this has already been discussed, I would be most grateful. Thanks in advance and my apologies for asking such a rudimentary (to some) question.


 

MH TRADUCCIONES
Argentina
English to Spanish
Tool for websites Sep 7, 2011

Software: SharePoint Designer 2007 / TRADOS 2007 (TagEditor)

 

Joakim Braun
Sweden
German to Swedish
Dynamic HTML ps Sep 7, 2011

Be aware that pages may be modified on the fly by scripts. These days, the initial page as sent from the server is very likely NOT what you see.

You need to make sure that the HTML you save is the HTML of the page DOM as it stands when you save it (not the HTML as initially served).

(But a dynamic website will hardly be translated that way, so on reflection never mind.)

[Edited at 2011-09-07 19:29 GMT]


 

FarkasAndras
English to Hungarian
Tricky Sep 7, 2011

This comes up pretty regularly here, and the only good answer is: arrange a three-way chat between the client, yourself and the webmaster who set the page up. There are a million ways for things to go wrong or get misunderstood.
You could use HTTrack or wget to download the website in question, but the download itself can fail or come out incomplete in more than a few ways. You could then translate the resulting HTML files with a CAT tool, but your client may not be able to use your HTML files directly.


For instance, you seem to be saying that you're expected to extract text and just deliver plain text to your client instead of HTML, but that doesn't sound like a good idea. If they want to use the translation on the website, they would need to manually copy-paste each phrase or paragraph to the right place... There has to be a better way.

If your client is sending you HTML files to translate, you could try and use a CAT to translate them. If you don't use a CAT, that's too bad...
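
For what it's worth, here is a rough sketch of what "extracting the relevant text" from a saved HTML file could look like, using only Python's standard html.parser module. It is an illustration under assumptions, not a recommended workflow: the names TextExtractor and extract_segments, the sample page, and the choice of which attributes and meta tags count as translatable are all made up for this example. It reads the HTML as saved, so it will miss anything a script adds later, and it produces plain text that someone still has to put back into the pages.

```python
from html.parser import HTMLParser

# Assumption: these attributes usually hold reader-facing text.
TEXT_ATTRS = {"alt", "title", "placeholder"}
# Assumption: these <meta name="..."> entries are the "invisible" text to translate.
META_NAMES = {"keywords", "description"}
# Content of these elements is code, not text for translation.
SKIP_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collects (source, text) pairs from one HTML document."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.segments = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
            return
        if tag == "meta":
            attr_map = dict(attrs)
            name = (attr_map.get("name") or "").lower()
            content = (attr_map.get("content") or "").strip()
            if name in META_NAMES and content:
                self.segments.append((f"meta:{name}", content))
            return
        # Pick up text hidden in attributes, e.g. image alt text.
        for name, value in attrs:
            if name in TEXT_ATTRS and value and value.strip():
                self.segments.append((f"{tag}@{name}", value.strip()))

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and not self._skip_depth:
            self.segments.append(("text", text))


def extract_segments(html):
    """Return the translatable strings found in an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.segments


# Hypothetical sample page, standing in for a file saved with HTTrack or wget.
sample = """<html><head><title>Accueil</title>
<meta name="keywords" content="traduction, site web">
<style>p { color: red; }</style></head>
<body><p>Bonjour <b>le monde</b></p>
<img src="logo.png" alt="Logo de la société">
<script>var x = "ne pas traduire";</script></body></html>"""

segments = extract_segments(sample)
for source, text in segments:
    print(f"{source}\t{text}")
```

Note that inline tags split a sentence into several segments ("Bonjour" and "le monde" above), which is one reason a CAT tool that keeps the tags in place is usually the safer route.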


 

