Pages in topic:   [1 2] >
Software used to extract HTML files from websites
Thread poster: ViktoriaG

ViktoriaG  Identity Verified
Canada
Local time: 10:57
English to French
+ ...
Apr 5, 2009

I guess many of us translate, among other formats, HTML files. I personally don't like to translate content extracted from webpages and pasted into Excel, Word, etc., just as I don't like to work on text files. I prefer to work directly on HTML files using CAT tools. I also like it when the entire website directory is available, so that I can later test the translated website as though it were on the Web.

Unfortunately, not all clients are Web-savvy and they often don't even have the HTML files. In such cases, it is simpler, both for the client and for me, to just extract the HTML files directly from the website.

Here's my question: what software do you use to extract HTML files from websites? If you can, please provide some detail on what you find really useful about the software, what its drawbacks are and how it compares to other similar software you have tried.

Thanks in advance!



Lutz Molderings  Identity Verified
Germany
Local time: 16:57
Member (2007)
German to English
+ ...
not sure what you mean Apr 5, 2009

Hi Viktoria

What do you mean by "extract htmls from websites"?

I don't really understand what there is to extract. I would simply connect to the server and download the htmls, for example with Dreamweaver.



ViktoriaG  Identity Verified
Canada
Local time: 10:57
English to French
+ ...
TOPIC STARTER
Here's what I mean Apr 6, 2009

I think you are correct in your assumption that I mean to retrieve HTML files from a server. To be clearer, I would like to find out what software is used to download HTML files directly from a website, and possibly retain the directory structure as well.


Williamson  Identity Verified
United Kingdom
Local time: 15:57
Flemish to English
+ ...
Text Editor Apr 6, 2009

I attended a course on website building a couple of years ago. We had to write HTML code in a .txt file (Notepad) and import that code into Internet Explorer.
Vice versa, we had to choose a website and export that website into the text editor.
However, I have forgotten how to upload to Internet Explorer.


esperantisto  Identity Verified
Local time: 17:57
Member (2006)
English to Russian
+ ...
I think, what you need is an offline browser Apr 6, 2009

such as HTTrack. Such programs create a local copy of a web site preserving its structure.
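
As a rough illustration of what "preserving its structure" involves, the sketch below maps each page URL to a local file path with the same directory layout. It is not how HTTrack works internally; the function name `local_path_for` and the `mirror` root folder are assumptions made up for this example.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def local_path_for(url: str, root: str = "mirror") -> str:
    """Map a URL to a file path under `root`, mirroring the site's directories.

    `root` is a hypothetical folder name chosen for this example.
    """
    parts = urlsplit(url)
    path = parts.path
    # A URL with no path, or one ending in "/", refers to a directory index page.
    if path == "" or path.endswith("/"):
        path += "index.html"
    # Keep the host name as the top-level folder, then the site's own folders.
    return str(PurePosixPath(root, parts.netloc, path.lstrip("/")))


print(local_path_for("http://example.com/products/list.html"))
print(local_path_for("http://example.com/fr/"))
```

Because the local folders match the folders on the server, relative links between pages keep working in the local copy, which is what makes it possible to test the translated site offline.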


Lutz Molderings  Identity Verified
Germany
Local time: 16:57
Member (2007)
German to English
+ ...
Dreamweaver Apr 6, 2009

I see what you mean Viktoria.
I would use Dreamweaver.
It will preserve the directory structure. You can download the entire website to your local hard disk, then translate the HTML files with whatever tool you like and then simply upload the files again.

[Edited at 2009-04-06 07:38 GMT]



Samuel Murray  Identity Verified
Netherlands
Local time: 16:57
Member (2006)
English to Afrikaans
+ ...
Offline browsers or web site rippers Apr 6, 2009

ViktoriaG wrote:
To be clearer, I would like to find out what software is used to download HTML files directly from a website, and possibly retain the directory structure as well.


I know of no download software that would *not* retain the directory structure. HTTrack is an open-source one. Watch out: some offline browsers and web site rippers change the HTML of the pages so that they work better in your browser's offline mode, but that is not what you want.

The best web site ripper of all time was Oleg Chernavin's Web Downloader 2.2, but you'll have a hard time finding it these days (webdown.exe). The new version of it, called Offline Explorer 5.4, is not half bad either, but it's 30-day shareware. What I like about this tool is that it does not alter the HTML (it doesn't even add its own little signature to the bottom of the pages).

http://www.metaproducts.com/mp/mpProducts_Downloads_Current.asp


[Edited at 2009-04-06 08:53 GMT]



Jan Sundström  Identity Verified
Sweden
Local time: 16:57
English to Swedish
+ ...
Some important settings Apr 6, 2009

Samuel Murray wrote:

I know of no download software that would *not* retain the directory structure. HTTrack is an open-source one. Watch out: some offline browsers and web site rippers change the HTML of the pages so that they work better in your browser's offline mode, but that is not what you want.

The best web site ripper of all time was Oleg Chernavin's Web Downloader 2.2, but you'll have a hard time finding it these days (webdown.exe).


Thanks for the tip, Sam, I'll have to check it out!

Until now, I have sworn by WinHTTrack.

Viktoria, when using this kind of software, there are a few important settings that you have to look out for:

- downloading of external links. If the site links to other domains, do you want to fetch those pages too? How many levels?
- ignore robots.txt. This is a very important setting: some sites use robots.txt to block web spiders/rippers like this. You'd normally want to override it.
- enter password when requested. Some sites are password protected. If your client provides you with a password, the ripper must have a dialog box where you can enter it in advance; otherwise the request will be rejected.
- filter by file types. You should be able to ignore certain files, like images, zips, to save time when downloading.

There are a few more tweaks, but the above are the essential ones... Don't know if they are supported by Web Downloader 2.2, maybe Sam can elaborate?!
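
To make three of these settings concrete (external links, robots.txt and file-type filters), here is a minimal sketch of the decision a ripper has to make for every link it finds. It is not taken from WinHTTrack or any other tool; `should_fetch`, `SKIPPED_EXTENSIONS` and the example.com addresses are assumptions for illustration only. Link depth and password entry are left out.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# File types to skip in order to save download time (an example list).
SKIPPED_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".zip")


def should_fetch(url, start_host, robots=None, follow_external=False):
    """Decide whether a link found on a page should be downloaded.

    Pass robots=None to ignore robots.txt altogether.
    """
    parts = urlsplit(url)
    # External links: stay on the starting domain unless told otherwise.
    if not follow_external and parts.netloc != start_host:
        return False
    # File-type filter: skip images, archives and the like.
    if parts.path.lower().endswith(SKIPPED_EXTENSIONS):
        return False
    # robots.txt: honour the site's rules only when a parser is supplied.
    if robots is not None and not robots.can_fetch("*", url):
        return False
    return True


# A robots.txt that blocks the /private/ folder, parsed from text for the demo.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])

print(should_fetch("http://example.com/private/a.html", "example.com", robots))
print(should_fetch("http://example.com/private/a.html", "example.com", None))
```

The two calls at the end differ only in whether the robots.txt rules are passed in, which corresponds to switching the "ignore robots.txt" option off and on.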



Jan Sundström  Identity Verified
Sweden
Local time: 16:57
English to Swedish
+ ...
Download link Apr 6, 2009

Samuel Murray wrote:
The best web site ripper of all time was Oleg Chernavin's Web Downloader 2.2, but you'll have a hard time finding it these days (webdown.exe). The new version of it, called Offline Explorer 5.4, is not half bad either, but it's 30-day shareware.


Funny, looks like WD 2.2 is still available for download in all languages EXCEPT English!
http://www.geocities.com/siliconvalley/vista/2865/download.htm



Fernando Guimaraes  Identity Verified
Portugal
Local time: 15:57
German to Portuguese
+ ...
You can try.. Apr 6, 2009

You can try PageNest.

It saves a web page and all related links so that you can browse them offline.



José Henrique Lamensdorf  Identity Verified
Brazil
Local time: 11:57
English to Portuguese
+ ...
Why extract? Apr 6, 2009

The idea sounds to me as if the dentist had to pull out a tooth to fix it, and then put it back in place.

Whenever I have to translate HTML files, I use CatsCradle, from http://www.stormdance.net . It won't extract anything, but allows you to work with the text only, while you see what's going on visually. Of course, it won't touch the links that build the site's structure, but I don't think that's a translator's job. If the translator is a web site builder too, it's another thing.

The point here is that someone who is strictly a translator is not expected to know, while developing e.g. the Brazilian site of website-dot-com, whether it will be website-dot-com-dot-br, br-dot-website-dot-com, website-dot-com/ptbr or anything else.

CatsCradle has its own internal 'CAT tool' and TM management. It will even give you an itemized word count list of all web pages you select. As I am not a CAT tool fanatic, just a plain vanilla WordFast user, I don't know whether its TMs are compatible with any other CAT tools.
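
For readers curious how a word count of the translatable text in a page can be obtained, here is a minimal sketch using only Python's standard library. It is not CatsCradle's method, and its counts will not necessarily match those of any CAT tool; the names `TextExtractor` and `count_words` are made up for this example, and text in attributes such as `alt` or `title` is ignored.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text of a page, skipping scripts and style sheets."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside <script> and <style>.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def count_words(html):
    """Return the number of whitespace-separated words in the page text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return sum(len(chunk.split()) for chunk in parser.chunks)


page = "<html><head><title>My page</title></head><body><p>Hello <b>big</b> world</p></body></html>"
print(count_words(page))
```

Running `count_words` over every HTML file in a downloaded site folder gives a per-page list that can be used as a basis for a quote.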


Laurent KRAULAND  Identity Verified
France
Local time: 16:57
French to German
+ ...
The Cat's Whiskers Apr 6, 2009

José Henrique Lamensdorf wrote:

The idea sounds to me as if the dentist had to pull out a tooth to fix it, and then put it back in place.

Whenever I have to translate HTML files, I use CatsCradle, from http://www.stormdance.net . It won't extract anything, but allows you to work with the text only, while you see what's going on visually. Of course, it won't touch the links that build the site's structure, but I don't think that's a translator's job.


and although it is perfectible, it is just the cat's whiskers as far as I am concerned. And as I use the TMs for my internal purposes only, I don't really care that CatsCradle builds them as .CSV files (a conversion to other formats is certainly possible and I can always "align" ST and TT with other tools).
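
A conversion of that kind could look roughly like the sketch below, which turns a two-column CSV into a TMX file. The actual column layout of CatsCradle's CSV files is not described in this thread, so the "source text, target text" layout, the function name `csv_to_tmx` and the language codes are all assumptions for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_to_tmx(csv_text, src_lang="fr", tgt_lang="de"):
    """Convert CSV rows of (source, target) into a TMX 1.4 document string.

    Assumes column 1 holds the source segment and column 2 the target segment.
    """
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "csv_to_tmx",      # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "csv",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or incomplete rows
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, row[0]), (tgt_lang, row[1])):
            tuv = ET.SubElement(tu, "tuv", {"xml:lang": lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


sample = "Bonjour,Hallo\nBienvenue sur notre site,Willkommen auf unserer Website\n"
print(csv_to_tmx(sample))
```

If the real files have a header row or extra columns, the row handling would need to be adjusted accordingly.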

Curiously enough, the web content translations I have done so far were all based on Excel sheets, with a limit on the number of characters per cell. Do most outsourcers not know about CatsCradle?

Laurent K.


[Edited at 2009-04-06 11:59 GMT]


Laurent KRAULAND  Identity Verified
France
Local time: 16:57
French to German
+ ...
Another possibility... Apr 6, 2009

offered by Stormdance (this time for extraction and integration) is Caterpillar, allowing you to use whichever CAT tool you want.

Laurent K.

[Edited at 2009-04-06 12:14 GMT]



Balasubramaniam L.  Identity Verified
India
Local time: 20:27
Member (2006)
English to Hindi
+ ...
May not work for dynamically generated page content Apr 6, 2009

Most websites these days have a database as a back-end, which generates much of the content on the fly based on user preference, locale of the user, etc. For example, if I open the website in Ahmedabad, India, all the prices will be shown in Rs., whereas if you open it in Canada, it will show prices in Canadian dollars.

So although you can technically download the client-side pages using Dreamweaver, etc., you will miss what is in the database.

This is one aspect which you will have to keep in mind while taking on the cumbersome task of generating the source files for translation.

This is best left to the developers of the website.

Most web pages are actually templates which get filled by server side scripts which cannot be accessed by the client.

Also, if the website has plug-ins, like Flash code, this too would not be properly captured by Dreamweaver, etc.
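
The point about templates can be shown with a toy example. The sketch below imitates a server-side script filling one template from a small "database" depending on the visitor's locale; the template, the `CATALOGUE` data and the `render` function are invented for this illustration and do not describe any real site.

```python
from string import Template

# The template lives on the server; visitors never see it in this form.
PAGE_TEMPLATE = Template("<p>$label: $currency $price</p>")

# Stand-in for the back-end database, keyed by the visitor's locale.
CATALOGUE = {
    "en-IN": {"label": "Price", "currency": "Rs.", "price": "500"},
    "en-CA": {"label": "Price", "currency": "CAD", "price": "8"},
}


def render(locale):
    """Return the HTML a visitor with the given locale would receive."""
    return PAGE_TEMPLATE.substitute(CATALOGUE[locale])


# A download made from India captures only this one rendering:
print(render("en-IN"))
# The Canadian rendering, and the template itself, are never downloaded:
print(render("en-CA"))
```

A ripper only ever receives one of these finished pages, so the strings stored in the database for other locales or preferences never reach the translator unless the site's developers export them.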



Anna Sylvia Villegas Carvallo
Mexico
Local time: 09:57
English to Spanish
Have you tried Aquino's WebBudget? Apr 6, 2009

I've found this is the best one. It comes from the creators of Wordfast.

http://www.webbudget.com

There is a 15-day trial.

