ProZ.com global directory of translation services
 The translation workplace
Thread poster: John Detre
How to extract text from a website

John Detre  Identity Verified
Canada
French to English
Sep 7, 2011

I have a very basic question about translating websites that has no doubt been asked and answered many times before, but a cursory search of the forums hasn't turned up a solution.

I need to translate a website that is already live, including invisible text (e.g. keywords) and unselectable text, but not including html code. Is there a tool I can use to extract all the relevant text and import it into a word processor?

If anyone can help with this or point me towards a thread in which this has already been discussed, I would be most grateful. Thanks in advance and my apologies for asking such a rudimentary (to some) question.


 

MH TRADUCCIONES  Identity Verified
Argentina
Local time: 03:18
Member (2011)
English to Spanish
+ ...
Tool for websites Sep 7, 2011

Software: SharePoint Designer 2007 / TRADOS 2007 (TagEditor)

 
Joakim Braun  Identity Verified
Sweden
Local time: 08:18
German to Swedish
+ ...
Dynamic HTML ps Sep 7, 2011

Be aware that pages may be modified on-the-fly with scripting. These days, the initial page as sent from the server is very likely NOT what you see.

You need to make sure that the HTML you save is the HTML of the page's DOM as it stands in the browser at that moment (not the HTML as initially served).

(But a dynamic website will hardly be translated that way, so on reflection never mind.)

[Edited at 2011-09-07 19:29 GMT]


 
FarkasAndras
Hungary
Local time: 08:18
English to Hungarian
+ ...
Tricky Sep 7, 2011

This comes up pretty regularly here, and the only good answer is: arrange a three-way chat between the client, yourself and the webmaster who set the page up. There are a million ways for things to go wrong or get misunderstood.
You could use HTTrack or wget to download the website in question, but there are more than a few ways this can go wrong. Then you could translate the resulting HTML files with a CAT tool, but your client may not be able to use your HTML files directly.


For instance, you seem to be saying that you're expected to extract text and just deliver plain text to your client instead of HTML, but that doesn't sound like a good idea. If they want to use the translation on the website, they would need to manually copy-paste each phrase or paragraph to the right place... There has to be a better way.

If your client is sending you HTML files to translate, you could try to use a CAT tool to translate them. If you don't use a CAT tool, that's too bad...
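
If you only need a rough look at the text (for a word count or a quote, say), here is a minimal sketch of how the translatable strings could be pulled out of a page that has already been saved locally, for example with httrack or wget. It uses only Python's standard library. The names `TextExtractor` and `extract_text`, and the choice of which attributes and meta tags to collect, are my own assumptions for illustration; they are not part of any CAT tool. It picks up invisible text such as meta keywords and image alt text, skips scripts and styles, and splits text at every tag, so a sentence containing bold or a link comes out in pieces. It also will not see anything that scripts add to the page after loading.

```python
from html.parser import HTMLParser

# Tags whose content is code, not translatable prose.
SKIP_TAGS = {"script", "style"}
# Attributes that often carry reader-facing text (assumed list, extend as needed).
TEXT_ATTRS = ("alt", "title", "placeholder")
# <meta name="..."> entries that are invisible but usually need translating.
META_NAMES = {"keywords", "description"}


class TextExtractor(HTMLParser):
    """Collects text nodes plus selected attribute and meta values."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.segments = []      # extracted strings, in document order
        self._skip_depth = 0    # greater than 0 while inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
            return
        attr_map = dict(attrs)
        if tag == "meta":
            name = (attr_map.get("name") or "").lower()
            content = attr_map.get("content")
            if name in META_NAMES and content and content.strip():
                self.segments.append(content.strip())
            return
        for key in TEXT_ATTRS:
            value = attr_map.get(key)
            if value and value.strip():
                self.segments.append(value.strip())

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Collapse runs of whitespace and ignore whitespace-only nodes.
        text = " ".join(data.split())
        if text and self._skip_depth == 0:
            self.segments.append(text)


def extract_text(html):
    """Return the list of translatable strings found in an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.segments
```

To try it on a saved page, read the file into a string (checking its encoding first) and pass it to `extract_text`, then write the resulting list to a text file, one segment per line. This only gets the text out; putting the translation back into the HTML is the hard part, which is why a CAT tool that handles HTML directly remains the better route for the actual job.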


 

