How to extract text from a website
Thread poster: John Detre

John Detre  Identity Verified
Canada
French to English
Sep 7, 2011

I have a very basic question about translating websites that has no doubt been asked and answered many times before, but a cursory search of the forums hasn't turned up a solution.

I need to translate a website that is already live, including invisible text (e.g. keywords) and unselectable text, but not including the HTML code. Is there a tool I can use to extract all the relevant text and import it into a word processor?

If anyone can help with this or point me towards a thread in which this has already been discussed, I would be most grateful. Thanks in advance and my apologies for asking such a rudimentary (to some) question.



MH TRADUCCIONES  Identity Verified
Argentina
English to Spanish
Tool for websites Sep 7, 2011

Software: SharePoint Designer 2007 / Trados 2007 (TagEditor)

Joakim Braun  Identity Verified
Sweden
German to Swedish
Dynamic HTML ps Sep 7, 2011

Be aware that pages may be modified on-the-fly with scripting. These days, the initial page as sent from the server is very likely NOT what you see.

You need to make sure that the HTML saved is the HTML of the page DOM at that point (not the HTML as initially served).

(But a dynamic website will hardly be translated that way, so on reflection never mind.)
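
The caveat above about scripting can be checked roughly before relying on a saved copy. The following is only a minimal sketch, assuming you have the HTML as served in a string: it counts inline and external script elements using Python's standard `html.parser`. The names `ScriptCounter` and `count_scripts` are hypothetical, and the check is a heuristic: scripts on a page do not prove that the visible text is changed, only that the rendered DOM may differ from the saved file.

```python
from html.parser import HTMLParser


class ScriptCounter(HTMLParser):
    """Counts <script> elements in an HTML document (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.inline = 0    # <script> ... </script> with code in the page
        self.external = 0  # <script src="..."> loading a separate file

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        if dict(attrs).get("src"):
            self.external += 1
        else:
            self.inline += 1


def count_scripts(html):
    """Return (inline, external) script counts for the given HTML string."""
    parser = ScriptCounter()
    parser.feed(html)
    parser.close()
    return parser.inline, parser.external


if __name__ == "__main__":
    served = (
        '<html><head><script src="app.js"></script></head>'
        '<body><script>document.title = "x";</script><p>Hi</p></body></html>'
    )
    inline, external = count_scripts(served)
    if inline or external:
        print("Page uses scripts; the rendered text may differ from the saved HTML.")
```

If the counts are non-zero, it is worth comparing the saved file against what the browser actually displays before translating from it.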

[Edited at 2011-09-07 19:29 GMT]


FarkasAndras
English to Hungarian
Tricky Sep 7, 2011

This comes up pretty regularly here, and the only good answer is: arrange a three-way chat between the client, yourself and the webmaster who set the page up. There are a million ways for things to go wrong or get misunderstood.
You could use HTTrack or wget to download the website in question, but the result may not match the files the webmaster actually works with (for example, if the pages are generated on the server). You could then translate the resulting HTML files with a CAT tool, but your client may not be able to use your HTML files directly.


For instance, you seem to be saying that you're expected to extract text and just deliver plain text to your client instead of HTML, but that doesn't sound like a good idea. If they want to use the translation on the website, they would need to manually copy-paste each phrase or paragraph to the right place... There has to be a better way.

If your client sends you HTML files to translate, you could try using a CAT tool to translate them. If you don't use a CAT tool, that's too bad...
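
For anyone who still wants plain text out of a downloaded HTML file, here is a minimal sketch of what such an extraction could look like, using only Python's standard `html.parser`. It collects visible text plus some text that is not shown on the page (the `keywords` and `description` meta tags, and `alt`, `title` and `placeholder` attributes) and skips script and style code. The names `TextExtractor` and `extract_text`, and the choice of which attributes count as translatable, are assumptions made for illustration, not a standard list.

```python
from html.parser import HTMLParser

# Assumption: attributes whose values are usually worth translating.
TRANSLATABLE_ATTRS = {"alt", "title", "placeholder"}
# Assumption: <meta name="..."> entries whose content is usually translated.
TRANSLATABLE_META = {"keywords", "description"}
# Elements whose content is code, not prose.
SKIPPED_ELEMENTS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collects translatable text segments in document order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.segments = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_ELEMENTS:
            self._skip_depth += 1
            return
        attr_map = dict(attrs)
        if tag == "meta":
            name = (attr_map.get("name") or "").lower()
            content = attr_map.get("content")
            if name in TRANSLATABLE_META and content:
                self.segments.append(content.strip())
            return
        for name, value in attrs:
            if name in TRANSLATABLE_ATTRS and value and value.strip():
                self.segments.append(value.strip())

    def handle_endtag(self, tag):
        if tag in SKIPPED_ELEMENTS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = " ".join(data.split())  # collapse whitespace
            if text:
                self.segments.append(text)


def extract_text(html):
    """Return a list of text segments found in the HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.segments


if __name__ == "__main__":
    page = (
        '<html><head><title>Accueil</title>'
        '<meta name="keywords" content="vin, fromage"></head>'
        '<body><p>Bonjour <b>le monde</b></p><img src="a.png" alt="Logo"></body></html>'
    )
    for segment in extract_text(page):
        print(segment)
```

Note that this splits a sentence wherever an inline tag such as `<b>` occurs, and it throws away the information about where each segment belongs. That is exactly the copy-paste problem described above, and it is why translating the HTML files themselves in a CAT tool is usually the safer route.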

