Looking for a "site hoover" to extract text from web pages
Thread poster: Eutychus

Eutychus  Identity Verified
Local time: 04:51
Member (2006)
French to English
+ ...
Sep 4, 2007

Does anyone have any experience of what, in French, is termed an "aspirateur de site"? I'm looking at a huge contract to translate a website and wondering what the best way to deal with it is, so I don't waste hours manually indexing and copy/pasting text.


KathyT  Identity Verified
Australia
Local time: 13:51
Japanese to English
HTTrack Website Copier 3.41 RC1 Sep 4, 2007

Try this one, downloadable free from: http://www.download.com/3000-12779_4-10634972.html.

It had pretty good reviews and has supposedly been tested as spyware-free.

Has been working well for me.
Even large-ish websites can be downloaded in entirety in around 5 mins.

P.S. I love your "site hoover" expression!



Oliver Walter  Identity Verified
United Kingdom
Local time: 03:51
Member (2005)
German to English
+ ...
I don't think HTTrack does what you want Sep 4, 2007

KathyT wrote:
Try this one, downloadable free from: http://www.download.com/3000-12779_4-10634972.html.
It had pretty good reviews and has supposedly been tested as spyware-free.
Has been working well for me.
Even large-ish websites can be downloaded in entirety in around 5 mins.

All this is true, and I use WinHTTrack (the Windows version) sometimes. But it doesn't extract the text into a text or word-processor file; it only copies the web site (the parts of it defined by the limits that you configure) onto your computer so that you can browse it offline.
HTTrack is free because it is an Open Source project.
Oliver
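
For anyone who does want plain text out of a mirrored copy, here is a minimal sketch, assuming the site was saved as ordinary .html/.htm files in a local folder. It uses only Python's standard library; the names `TextExtractor`, `extract_text` and `extract_site` are made up for this illustration and have nothing to do with HTTrack itself.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the text fragments of one HTML document, in page order."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


def extract_site(mirror_dir):
    """Map each .htm/.html file under mirror_dir to its text fragments."""
    root = Path(mirror_dir)
    result = {}
    for path in sorted(root.rglob("*.htm*")):
        # Assumption: pages are UTF-8; undecodable bytes are replaced.
        html = path.read_text(encoding="utf-8", errors="replace")
        result[path.relative_to(root).as_posix()] = extract_text(html)
    return result
```

This only pulls text out; it does not put a translation back into the pages, and it ignores text held in attributes such as alt or title, so a CAT tool that handles HTML is still the safer route for the real job.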



Eutychus  Identity Verified
Local time: 04:51
Member (2006)
French to English
+ ...
TOPIC STARTER
Maybe it does enough...? Sep 4, 2007

Hmm. If it downloads in html format I think maybe my CAT app (Déjà Vu X) can extract from there and hopefully spit it back into the same format.

When I get a moment I'll do a few trial runs.

Thanks for the speedy responses. Anyone else got any experience or suggestions?



Philippe Etienne  Identity Verified
Spain
Local time: 04:51
Member
English to French
Precisely Sep 4, 2007

Eutychus wrote:
...Hmm. If it downloads in html format I think maybe my CAT app (Déjà Vu X) can extract from there and hopefully spit it back into the same format.

WinHTTrack downloads all the files that make up the website, exactly as they are. So if you use a CAT tool that handles web files, the process should be a breeze.

Good luck,
Philippe



Heinrich Pesch  Identity Verified
Finland
Local time: 05:51
Member (2003)
Finnish to German
+ ...
Always ask for the files as a zip Sep 4, 2007

In the old days you could simply download all the files from a site and translate them, but nowadays most professional sites use databases, where the content is created on the fly. So translating what you get by downloading may not be the right procedure.
Just my 2 c.
Heinrich



megane_wang  Identity Verified
Spain
Local time: 04:51
English to Spanish
+ ...
I agree with Heinrich Sep 4, 2007

Clearly, you have no experience at that...

If it's such a big web site, you NEED to talk to the customer and make a detailed project analysis, and see how you will get, process and deliver the contents.

I've been both on the developer and translator side, and I can assure you that in a BIG site it's extremely rare that you can go ahead without that...

... at least if you want to do it right

Ruth @ MW

[Edited at 2007-09-04 13:10]



Eutychus  Identity Verified
Local time: 04:51
Member (2006)
French to English
+ ...
TOPIC STARTER
Thanks for replies so far Sep 4, 2007

I have translated a number of sites and no two have been the same. In most cases I have had access to the source files as explained by Heinrich, but in more than one case the text has had to be inputted online (for example in Flash-based sites for which the original copy is no longer available). In several other cases where I have had the files, these have not included the various menu items and headlines which are added afterwards and often get translated by some non-specialist after the project when they suddenly realise they forgot to ask for that part to be done, thus destroying the effect of the whole thing.

I hope to be able to go down the route suggested by Heinrich and Ruth, but I would like to cover my options.



Samuel Murray  Identity Verified
Netherlands
Local time: 04:51
Member (2006)
English to Afrikaans
+ ...
Get the text from the client Sep 5, 2007

Eutychus wrote:
I'm looking at a huge contract to translate a website and wondering what the best way to deal with it is, so I don't waste hours manually indexing and copy/pasting text.


The client should provide the text for you, either in HTML or in some word processing format. Otherwise you can't know for certain if you got all the pages, or if you perhaps got more pages than the client thought he had. By requiring the client to provide the files, you avoid both nasties.

I use Oleg Chernavin's Web Downloader 2.2, but it is abandonware and you may need to Google hard for it. What I like about Oleg's tool is that it recreates the folder tree for all the objects on the web site.
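
If you do end up working from a downloaded copy, one way to check for both nasties is to compare the pages in the downloaded folder tree against a page list supplied by the client. A rough sketch, assuming the client can give you such a list as relative paths; `list_pages` and `compare_pages` are hypothetical helper names, not part of any downloader:

```python
from pathlib import Path


def list_pages(mirror_dir):
    """Return the relative paths of all .html/.htm files under mirror_dir."""
    root = Path(mirror_dir)
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in {".html", ".htm"}
    )


def compare_pages(downloaded, client_list):
    """Report pages the client expects but the download lacks, and vice versa."""
    downloaded, client_list = set(downloaded), set(client_list)
    return {
        "missing": sorted(client_list - downloaded),
        "unexpected": sorted(downloaded - client_list),
    }
```

For example, `compare_pages(list_pages("mirror"), client_pages)` would flag anything to raise with the client before quoting, where `"mirror"` and `client_pages` stand in for your own folder and list.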


Charlotte Blank  Identity Verified
Local time: 04:51
Czech to German
+ ...
Cat's cradle Sep 10, 2007

Bonjour,

I'm not sure whether Cat's cradle is what you are looking for but at least it's a very handy little application (30 days free):

http://www.stormdance.net/software/catscradle/overview.htm

"CatsCradle grabs all the text that requires translating from a web page, puts it into a built in editor for you to translate alongside, then automatically integrates your translated localized text back into the web page - leaving all the sensitive HTML code untouched. ..."

Moreover, Julian Spencer always has an open ear for questions.

But Samuel is right, of course: the original files, provided by the client with exact specifications of what to translate and what not, are always the best solution...

Charlotte



