Html text extraction
Thread poster: MIGUEL JIMENEZ

MIGUEL JIMENEZ  Identity Verified
Local time: 02:09
English to Spanish
Nov 29, 2005

Hi,
I am trying to compile a big corpus of text extracted from webpages for some research. I was wondering if anybody would know what would be the best tool to extract "translatable" text only from html pages and create a separate text file. This would be for research purposes only, I would not need to convert back to html, xml or anything.
Thanks for your help


xxxMarc P  Identity Verified
Local time: 08:09
German to English
Html text extraction Nov 29, 2005

Why not simply open the html file in a suitable word processor (such as OpenOffice.org) and "Save As" plain text? Are there too many files?

Marc


Rodolfo Raya  Identity Verified
Local time: 03:09
English to Spanish
Word processors miss text Nov 29, 2005

MarcPrior wrote:

Why not simply open the html file in a suitable word processor (such as OpenOffice.org) and "Save As" plain text? Are there too many files?

Marc


Word processors usually ignore translatable attributes, such as "alt" in images.

Rodolfo
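
To illustrate the point about attributes, here is a minimal sketch in Python that collects visible text together with attribute values. It uses only the standard library's html.parser module. The choice of "alt" and "title" as the translatable attributes, and the names TranslatableTextParser and extract_translatable, are assumptions made for this example and not part of any tool mentioned in the thread.

```python
from html.parser import HTMLParser

# Assumption for this sketch: these attributes carry translatable text.
TRANSLATABLE_ATTRS = {"alt", "title"}
# Content of these elements is code or styling, not text to translate.
SKIPPED_TAGS = {"script", "style"}


class TranslatableTextParser(HTMLParser):
    """Collects text nodes and selected attribute values in document order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.segments = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        for name, value in attrs:
            # value is None for attributes written without a value
            if name in TRANSLATABLE_ATTRS and value and value.strip():
                self.segments.append(value.strip())

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text outside script/style, with whitespace collapsed.
        if self._skip_depth == 0 and data.strip():
            self.segments.append(" ".join(data.split()))


def extract_translatable(markup):
    """Return a list of text segments found in an HTML string."""
    parser = TranslatableTextParser()
    parser.feed(markup)
    parser.close()
    return parser.segments
```

Each segment could then be written on its own line of a text file. Real pages may need more attributes (for example "placeholder" or "value" on form fields), so the attribute set should be adjusted to the corpus.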


Sonja Tomaskovic  Identity Verified
Germany
Local time: 08:09
English to German
.. Nov 29, 2005

Rodolfo Raya wrote:
Word processors usually ignore translatable attributes, such as "alt" in images.


I doubt that someone who needs "to compile a big corpus of text extracted from webpages for some research" needs the alt attribute of an image.

Another solution would be to open the file with a text editor and remove all html tags. This should be possible with a regular expression.


Sonja
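
The regular-expression approach could look roughly like the following Python sketch. It is a rough approximation rather than a full HTML parser: it assumes reasonably well-formed markup, and a ">" inside an attribute value would confuse it. The function name strip_tags is made up for this example.

```python
import html
import re

# Remove script/style elements together with their content.
SCRIPT_STYLE_RE = re.compile(r"<(script|style)\b.*?</\1\s*>",
                             re.IGNORECASE | re.DOTALL)
# Remove HTML comments.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
# Remove any remaining tag; assumes no ">" inside attribute values.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(markup):
    """Return the text of an HTML string with tags removed."""
    text = SCRIPT_STYLE_RE.sub(" ", markup)
    text = COMMENT_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    # Decode entities such as &amp; or &eacute; after the tags are gone.
    text = html.unescape(text)
    # Collapse runs of whitespace left behind by the removed markup.
    return " ".join(text.split())
```

Note that this drops attribute values such as "alt" along with the tags, which is acceptable only if that text is not wanted in the corpus.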


Samuel Murray  Identity Verified
Netherlands
Local time: 08:09
Member (2006)
English to Afrikaans
HTML2TXT and a DOS command Nov 29, 2005

MIGUEL JIMENEZ wrote:
I am trying to compile a big corpus of text extracted from webpages for some research. I was wondering if anybody would know what would be the best tool to extract "translatable" text only from html pages and create a separate text file.


In two steps.

1. Download Bobsoft's HTML2TXT and use it to convert all the html files into text files:
http://www.bobsoft.com/h2t/

2. Merge all the text files into a single file, using the following DOS command:
copy /b *.txt all.txt

Good luck!
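
For anyone not working at a DOS prompt, the merge in step 2 can also be sketched in Python. This is only an illustration: the function name merge_text_files and the UTF-8 encoding are assumptions, and the converted files may well use a different encoding.

```python
from pathlib import Path


def merge_text_files(source_dir, output_path, pattern="*.txt"):
    """Concatenate matching files into output_path; return how many were merged."""
    source_dir = Path(source_dir)
    output_path = Path(output_path)
    # List the inputs before creating the output, and never merge the
    # output file into itself if it lives in the same folder.
    files = sorted(
        p for p in source_dir.glob(pattern)
        if p.resolve() != output_path.resolve()
    )
    merged = 0
    with output_path.open("w", encoding="utf-8") as out:
        for path in files:
            # Assumption: files are UTF-8; undecodable bytes are replaced.
            text = path.read_text(encoding="utf-8", errors="replace")
            # Make sure each file ends with exactly one newline.
            out.write(text.rstrip("\n") + "\n")
            merged += 1
    return merged
```

Calling merge_text_files("converted", "converted/all.txt") would merge every .txt file in a hypothetical "converted" folder, in alphabetical order.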


Samuel Murray  Identity Verified
Netherlands
Local time: 08:09
Member (2006)
English to Afrikaans
Try Caterpillar (shareware) Nov 29, 2005

Rodolfo Raya wrote:
Word processors usually ignore translatable attributes, such as "alt" in images.


Well, in that case, try Caterpillar by Stormdance. The web site says "Extracts all text requiring translation - including hidden text and text within tags etc."

The shareware version is limited to 8 HTML pages per project, though. Cost for full version is GBP 25.00. The author claims it is "Wordfast compatible".

