Looking for a "site hoover" to extract text from web pages Thread poster: David BUICK
| David BUICK Local time: 07:20 Member (2006) French to English + ...
Does anyone have any experience of what, in French, is termed an "aspirateur de site"? I'm looking at a huge contract to translate a website and wondering what the best way to deal with it is, so I don't waste hours manually indexing and copy/pasting text. | | | KathyT Australia Local time: 15:20 Japanese to English HTTrack Website Copier 3.41 RC1 | Sep 4, 2007 |
Try this one, downloadable free from: http://www.download.com/3000-12779_4-10634972.html. It had pretty good reviews and has supposedly been tested as spyware-free. Has been working well for me. Even large-ish websites can be downloaded in entirety in around 5 mins. P.S. I love your "site hoover" expression! | | | Oliver Walter United Kingdom Local time: 06:20 German to English + ... I don't think HTTrack does what you want | Sep 4, 2007 |
KathyT wrote: Try this one, downloadable free from: http://www.download.com/3000-12779_4-10634972.html. It had pretty good reviews and has supposedly been tested as spyware-free. Has been working well for me. Even large-ish websites can be downloaded in entirety in around 5 mins. All this is true, and I use WinHTTrack (the Windows version) sometimes. But it doesn't extract the text into a text or word-processor file; it only copies the web site (the parts of it defined by the limits that you configure) onto your computer so that you can browse it offline. HTTrack is free because it is an Open Source project. Oliver | | | David BUICK Local time: 07:20 Member (2006) French to English + ... TOPIC STARTER Maybe it does enough...? | Sep 4, 2007 |
Hmm. If it downloads in html format I think maybe my CAT app (Déjà Vu X) can extract from there and hopefully spit it back into the same format. When I get a moment I'll do a few trial runs. Thanks for the speedy responses. Anyone else got any experience or suggestions? | |
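For anyone without a CAT tool that reads HTML, the extraction step discussed above can be sketched in a few lines of Python using only the standard library. This is a rough illustration, not what Déjà Vu X or HTTrack does internally: the names `VisibleTextParser`, `extract_text` and `extract_from_mirror` are made up for the example, and it assumes the mirrored pages are saved as UTF-8 `.htm`/`.html` files.

```python
from html.parser import HTMLParser
from pathlib import Path


class VisibleTextParser(HTMLParser):
    """Collects text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.segments = []      # text snippets in document order
        self._skip_depth = 0    # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.segments.append(text)


def extract_text(html):
    """Return the visible text snippets of one HTML document as a list."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.segments


def extract_from_mirror(root):
    """Map each .htm/.html file under root (hypothetical mirror folder) to its text."""
    results = {}
    for path in sorted(Path(root).rglob("*.htm*")):
        if path.is_file():
            # Assumption: files are UTF-8; undecodable bytes are replaced.
            html = path.read_text(encoding="utf-8", errors="replace")
            results[str(path)] = extract_text(html)
    return results
```

Calling `extract_from_mirror("my_mirror")` on the folder created by the downloader would give one list of text snippets per page. It only sees text present in the saved HTML, so anything held in attributes, images, Flash or a database will not show up, and it does nothing to put translated text back into the pages.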
Eutychus wrote: ...Hmm. If it downloads in html format I think maybe my CAT app (Déjà Vu X) can extract from there and hopefully spit it back into the same format. WinHTTrack downloads all the files that make up the website exactly as they are. So if you use a CAT tool that handles web files, then the process should be a breeze. Good luck, Philippe | | | Heinrich Pesch Finland Local time: 08:20 Member (2003) Finnish to German + ... Ask always for the files as zip | Sep 4, 2007 |
In the old days you could simply download all files from a site and translate, but nowadays most professional sites use databases, where the content is created on the fly. So translating what you get by downloading may not be the right procedure. Just my 2 c. Heinrich | | | megane_wang Spain Local time: 07:20 Member (2007) English to Spanish + ... I agree with Heinrich | Sep 4, 2007 |
Clearly, you have no experience at that... If it's such a big web site, you NEED to talk to the customer and make a detailed project analysis, and see how you will get, process and deliver the contents. I've been both on the developer and translator side, and I can assure you that in a BIG site it's extremely rare that you can go ahead without that... ... at least if you want to do it right Ruth @ MW
[Edited at 2007-09-04 13:10] | | | David BUICK Local time: 07:20 Member (2006) French to English + ... TOPIC STARTER Thanks for replies so far | Sep 4, 2007 |
I have translated a number of sites and no two have been the same. In most cases I have had access to the source files as explained by Heinrich, but in more than one case the text has had to be inputted online (for example in Flash-based sites for which the original copy is no longer available). In several other cases where I have had the files, these have not included the various menu items and headlines which are added afterwards and often get translated by some non-specialist after the project when they suddenly realise they forgot to ask for that part to be done, thus destroying the effect of the whole thing. I hope to be able to go down the route suggested by Heinrich and Ruth, but I would like to cover my options. | |
Samuel Murray Netherlands Local time: 07:20 Member (2006) English to Afrikaans + ... Get the text from the client | Sep 5, 2007 |
Eutychus wrote: I'm looking at a huge contract to translate a website and wondering what the best way to deal with it is, so I don't waste hours manually indexing and copy/pasting text. The client should provide the text for you, either in HTML or in some word processing format. Otherwise you can't know for certain if you got all the pages, or if you perhaps got more pages than the client thought he had. By requiring the client to provide the files, you avoid both nasties. I use Oleg Chernavin's Web Downloader 2.2, but it is abandonware and you may need to Google hard for it. What I like about Oleg's tool is that it recreates the folder tree for all the objects on the web site. | | | Cat's cradle | Sep 10, 2007 |
Bonjour, I'm not sure whether Cat's cradle is what you are looking for but at least it's a very handy little application (30 days free): http://www.stormdance.net/software/catscradle/overview.htm "CatsCradle grabs all the text that requires translating from a web page, puts it into a built in editor for you to translate alongside, then automatically integrates your translated localized text back into the web page - leaving all the sensitive HTML code untouched. ..." Moreover, Julian Spencer always has an open ear for questions. But Samuel is right, of course: the original files provided by the client, with exact specifications of what to translate and what not, are always the best solution... Charlotte