Looking for a "site hoover" to extract text from web pages
Thread poster: David BUICK
David BUICK
David BUICK  Identity Verified
Local time: 07:20
Member (2006)
French to English
+ ...
Sep 4, 2007

Does anyone have any experience of what, in French, is termed an "aspirateur de site"? I'm looking at a huge contract to translate a website and wondering what the best way to deal with it is, so I don't waste hours manually indexing and copy/pasting text.

 
KathyT
KathyT  Identity Verified
Australia
Local time: 15:20
Japanese to English
HTTrack Website Copier 3.41 RC1 Sep 4, 2007

Try this one, downloadable free from: http://www.download.com/3000-12779_4-10634972.html.

It had pretty good reviews and has supposedly been tested as spyware-free.

Has been working well for me.
Even large-ish websites can be downloaded in their entirety in around 5 mins.

P.S. I love your "site hoover" expression!


 
Oliver Walter
Oliver Walter  Identity Verified
United Kingdom
Local time: 06:20
German to English
+ ...
I don't think HTTrack does what you want Sep 4, 2007

KathyT wrote:
Try this one, downloadable free from: http://www.download.com/3000-12779_4-10634972.html.
It had pretty good reviews and has supposedly been tested as spyware-free.
Has been working well for me.
Even large-ish websites can be downloaded in entirety in around 5 mins.

All this is true, and I use WinHTTrack (the Windows version) sometimes. But it doesn't extract the text into a text or word-processor file; it only copies the web site (the parts of it defined by the limits that you configure) onto your computer so that you can browse it offline.
HTTrack is free because it is an Open Source project.
Oliver
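
For anyone who wants plain text out of a mirrored copy, here is a minimal sketch in Python (standard library only). It assumes the mirror is a folder of static .htm/.html files saved as UTF-8; the names VisibleText, extract_text and extract_site are made up for this illustration and are not part of HTTrack or any CAT tool.

```python
from html.parser import HTMLParser
from pathlib import Path


class VisibleText(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text chunks of one HTML document, in order."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return parser.chunks


def extract_site(mirror_dir):
    """Map each HTML file under mirror_dir (hypothetical path) to its text chunks."""
    result = {}
    for path in sorted(Path(mirror_dir).rglob("*.htm*")):
        html = path.read_text(encoding="utf-8", errors="replace")
        result[str(path.relative_to(mirror_dir))] = extract_text(html)
    return result
```

This only pulls text out; it does not put a translation back into the pages, and it ignores translatable attributes such as alt and title.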


 
David BUICK
David BUICK  Identity Verified
Local time: 07:20
Member (2006)
French to English
+ ...
TOPIC STARTER
Maybe it does enough...? Sep 4, 2007

Hmm. If it downloads in html format I think maybe my CAT app (Déjà Vu X) can extract from there and hopefully spit it back into the same format.

When I get a moment I'll do a few trial runs.

Thanks for the speedy responses. Anyone else got any experience or suggestions?


 
Philippe Etienne
Philippe Etienne  Identity Verified
Spain
Local time: 07:20
Member
English to French
Precisely Sep 4, 2007

Eutychus wrote:
...Hmm. If it downloads in html format I think maybe my CAT app (Déjà Vu X) can extract from there and hopefully spit it back into the same format.

WinHTTrack downloads all files as they are to make up the website. So if you use a CAT tool that handles web files, then the process should be a breeze.

Good luck,
Philippe


 
Heinrich Pesch
Heinrich Pesch  Identity Verified
Finland
Local time: 08:20
Member (2003)
Finnish to German
+ ...
Always ask for the files as zip Sep 4, 2007

In the old days you could simply download all files from a site and translate, but nowadays most professional sites use databases, where the content is created on the fly. So translating what you get by downloading may not be the right procedure.
Just my 2 c.
Heinrich


 
megane_wang
megane_wang  Identity Verified
Spain
Local time: 07:20
Member (2007)
English to Spanish
+ ...
I agree with Heinrich Sep 4, 2007

Clearly, you have no experience at that...

If it's such a big web site, you NEED to talk to the customer and make a detailed project analysis, and see how you will get, process and deliver the contents.

I've been both on the developer and translator side, and I can assure you that in a BIG site it's extremely rare that you can go ahead without that...

... at least if you want to do it right

Ruth @ MW

[Edited at 2007-09-04 13:10]


 
David BUICK
David BUICK  Identity Verified
Local time: 07:20
Member (2006)
French to English
+ ...
TOPIC STARTER
Thanks for replies so far Sep 4, 2007

I have translated a number of sites and no two have been the same. In most cases I have had access to the source files as explained by Heinrich, but in more than one case the text has had to be inputted online (for example in Flash-based sites for which the original copy is no longer available). In several other cases where I have had the files, these have not included the various menu items and headlines which are added afterwards and often get translated by some non-specialist after the project when they suddenly realise they forgot to ask for that part to be done, thus destroying the effect of the whole thing.

I hope to be able to go down the route suggested by Heinrich and Ruth, but I would like to cover my options.


 
Samuel Murray
Samuel Murray  Identity Verified
Netherlands
Local time: 07:20
Member (2006)
English to Afrikaans
+ ...
Get the text from the client Sep 5, 2007

Eutychus wrote:
I'm looking at a huge contract to translate a website and wondering what the best way to deal with it is, so I don't waste hours manually indexing and copy/pasting text.


The client should provide the text for you, either in HTML or in some word processing format. Otherwise you can't know for certain if you got all the pages, or if you perhaps got more pages than the client thought he had. By requiring the client to provide the files, you avoid both nasties.

I use Oleg Chernavin's Web Downloader 2.2, but it is abandonware and you may need to Google hard for it. What I like about Oleg's tool is that it recreates the folder tree for all the objects on the web site.
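
If you do end up downloading the site yourself, one way to catch both problems is to compare what you downloaded against a page list from the client. A rough sketch in Python follows; compare_pages is a made-up helper, and it assumes the client can supply such a list as site-relative paths.

```python
def compare_pages(downloaded, client_list):
    """Return (missing, unexpected) as sorted lists of page paths.

    missing: pages the client listed that are not in the download.
    unexpected: downloaded pages the client did not list.
    """
    got = {p.strip().lstrip("/") for p in downloaded if p.strip()}
    wanted = {p.strip().lstrip("/") for p in client_list if p.strip()}
    return sorted(wanted - got), sorted(got - wanted)


missing, unexpected = compare_pages(
    ["index.html", "about.html", "old/promo.html"],      # what the downloader fetched
    ["/index.html", "/about.html", "/contact.html"],     # what the client says is in scope
)
```

Anything in missing needs chasing up with the client; anything in unexpected should be confirmed before it is translated and invoiced.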


 
Charlotte Blank
Charlotte Blank  Identity Verified
Germany
Local time: 07:20
Czech to German
+ ...
Cat's cradle Sep 10, 2007

Bonjour,

I'm not sure whether Cat's cradle is what you are looking for but at least it's a very handy little application (30 days free):

http://www.stormdance.net/software/catscradle/overview.htm

"CatsCradle grabs all the text that requires translating from a web page, puts it into a built in editor for you to translate alongside, then automatically integrates your translated localized text back into the web page - leaving all the sensitive HTML code untouched. ..."

Moreover, Julian Spencer is always happy to answer questions.

But Samuel is right, of course: the original files, provided by the client with exact specifications of what to translate and what not, are always the best solution...

Charlotte
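
To illustrate the general idea of swapping text while leaving the markup alone, here is a small sketch in Python (standard library only). It is not how CatsCradle itself works, which I cannot vouch for; TextSwapper and reintegrate are hypothetical names, and the translations dictionary is assumed to map each source text segment to its translation.

```python
from html.parser import HTMLParser


class TextSwapper(HTMLParser):
    """Re-emit the HTML, replacing text nodes found in `translations`."""

    def __init__(self, translations):
        # convert_charrefs=False keeps entities such as &amp; as written
        super().__init__(convert_charrefs=False)
        self.translations = translations
        self.out = []

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # original tag, verbatim

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        key = data.strip()
        if key and key in self.translations:
            # keep the surrounding whitespace, swap only the text itself
            data = data.replace(key, self.translations[key])
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def reintegrate(html, translations):
    """Return html with translated text nodes and the original tags."""
    parser = TextSwapper(translations)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Known limits of this sketch: closing tags come back in lower case, text containing entities is seen as several separate pieces, and attribute text (alt, title) is not translated.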


 

