MIGUEL JIMENEZ Local time: 23:13 English to Spanish + ...
Nov 29, 2005
Hi, I am trying to compile a big corpus of text extracted from web pages for some research. Does anybody know the best tool for extracting only the "translatable" text from HTML pages and saving it to a separate text file? This is for research purposes only; I would not need to convert back to HTML, XML or anything else. Thanks for your help.
Marc P (X) Local time: 05:13 German to English + ...
Html text extraction
Nov 29, 2005
Why not simply open the html file in a suitable word processor (such as OpenOffice.org) and "Save As" plain text? Are there too many files?
Marc
Marc P wrote: Why not simply open the html file in a suitable word processor (such as OpenOffice.org) and "Save As" plain text? Are there too many files?
Word processors usually ignore translatable attributes, such as "alt" in images.
Rodolfo
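A minimal Python sketch of the point about attributes, using only the standard library's html.parser. The set of attributes treated as translatable ("alt" and "title") and the names TranslatableTextExtractor and extract_translatable are assumptions made for illustration, not part of any tool mentioned in this thread.

```python
from html.parser import HTMLParser

# Assumption: these attributes carry translatable text; adjust for your corpus.
TRANSLATABLE_ATTRS = {"alt", "title"}
# Content of these elements is code or styling, not text for a corpus.
SKIPPED_TAGS = {"script", "style"}


class TranslatableTextExtractor(HTMLParser):
    """Collects visible text plus selected attribute values, in document order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.segments = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        for name, value in attrs:
            # value is None for attributes written without a value
            if name in TRANSLATABLE_ATTRS and value and value.strip():
                self.segments.append(value.strip())

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and self._skip_depth == 0:
            self.segments.append(text)


def extract_translatable(markup):
    """Return a list of text segments found in an HTML string."""
    parser = TranslatableTextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.segments


if __name__ == "__main__":
    sample = '<p>Hello <img src="a.png" alt="A cat"> world</p>'
    for segment in extract_translatable(sample):
        print(segment)
```

Whether "alt" and "title" text belongs in a research corpus is a judgment call; removing them from TRANSLATABLE_ATTRS leaves only the visible text.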
Sonja Tomaskovic (X) Germany Local time: 05:13 English to German + ...
..
Nov 29, 2005
Rodolfo Raya wrote: Word processors usually ignore translatable attributes, such as "alt" in images.
I doubt that someone who needs "to compile a big corpus of text extracted from webpages for some research" needs the alt attribute of an image.
Another solution would be to open the file with a text editor and remove all HTML tags. This should be possible with a regexp.
Sonja
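The regexp idea above could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not a robust HTML parser: the function name strip_tags and the three patterns are hypothetical, and regular expressions can mis-handle unusual markup (for example, a ">" inside an attribute value).

```python
import html
import re

# Remove whole <script>/<style> elements, since their content is not prose.
SCRIPT_STYLE_RE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
# Remove HTML comments.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
# Remove any remaining tag.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(markup):
    """Return the text of an HTML string with tags removed and entities decoded."""
    text = SCRIPT_STYLE_RE.sub(" ", markup)
    text = COMMENT_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)      # e.g. &eacute; becomes é
    return " ".join(text.split())   # collapse whitespace left behind by the tags


if __name__ == "__main__":
    print(strip_tags("<p>Caf&eacute; <b>open</b> today</p>"))
```

Replacing each tag with a space keeps words in adjacent elements apart, at the cost of splitting a word that has inline markup in the middle of it.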
Samuel Murray Netherlands Local time: 05:13 Member (2006) English to Afrikaans + ...
HTML2TXT and a DOS command
Nov 29, 2005
MIGUEL JIMENEZ wrote: I am trying to compile a big corpus of text extracted from webpages for some research. I was wondering if anybody would know what would be the best tool to extract "translatable" text only from html pages and create a separate text file.
In two steps.
1. Download Bobsoft's HTML2TXT and use it to convert all the html files into text files: http://www.bobsoft.com/h2t/
2. Merge all the text files into a single file, using the following DOS command: copy /b *.txt all.txt
Good luck!
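For anyone not on DOS/Windows, the merge in step 2 could be sketched in Python as below. The function name merge_text_files, the "*.txt" pattern, and the UTF-8 encoding are assumptions for illustration; the encoding in particular depends on how the text files were produced.

```python
from pathlib import Path


def merge_text_files(source_dir, output_path, encoding="utf-8"):
    """Concatenate every .txt file in source_dir into output_path.

    Returns the number of files merged. The output file itself is skipped,
    so it can safely live in the same folder as the inputs.
    """
    output = Path(output_path).resolve()
    files = sorted(
        p for p in Path(source_dir).glob("*.txt") if p.resolve() != output
    )
    with open(output, "w", encoding=encoding) as out:
        for path in files:
            # errors="replace" keeps going if a file has stray bytes
            out.write(path.read_text(encoding=encoding, errors="replace"))
            out.write("\n")  # keep the last line of one file apart from the next
    return len(files)


if __name__ == "__main__":
    count = merge_text_files(".", "all.txt")
    print(f"Merged {count} files into all.txt")
```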
Samuel Murray Netherlands Local time: 05:13 Member (2006) English to Afrikaans + ...
Try Caterpillar (shareware)
Nov 29, 2005
Rodolfo Raya wrote: Word processors usually ignore translatable attributes, such as "alt" in images.
Well, in that case, try Caterpillar by Stormdance. The web site says "Extracts all text requiring translation - including hidden text and text within tags etc."
The shareware version is limited to 8 HTML pages per project, though. Cost for full version is GBP 25.00. The author claims it is "Wordfast compatible".