Off topic: Website aligner
Thread poster: Pablo Bouvier

Pablo Bouvier  Identity Verified
Local time: 01:06
German to Spanish
+ ...
Nov 1, 2008

Does anyone know an easy way to align web pages?
The goal is to use the aligned text to extract terminology.

Maybe it will be easier to understand if I explain what I want to do with an example:

Take a German web page like this one: http://www.bdi-deutschland-liefert.de/cl/sid.php
This page is already translated on the web, let's say into Spanish: http://www.bdi-deutschland-liefert.de/cl/sid.php?PHPSESSID=0f5jj4kvidle1rv523bn42d9mg7agt9d&f_lang=esp.

Now I want to align these two pages (or others) in order to get a glossary like this:

Bergbau - Minería
Mineralölprodukte und Gase - Productos del petróleo y gases
...
...

Produktionsverbindungshandel - Comercio ligado a la producción

Thanks for your suggestions.



Samuel Murray  Identity Verified
Netherlands
Local time: 01:06
Member (2006)
English to Afrikaans
+ ...
Manually Nov 1, 2008

Pablo Bouvier wrote:
The goal is to use the aligned text to extract terminology.


Yep, been there, done that. Our local government has web pages in multiple languages and I was able to extract and align some of it. I also do that when I want to translate in a new field -- I find official documents in both languages and then align them.

The procedure is all very manual:

1. Download the HTML files (use a web site ripper if you can).
2. Extract the text (I find that most htm2txt programs are rubbish, so I wrote a script to do it better. Google for abbzz_HTM2TXT.)
3. Optionally check to see if your TXT files have the same lines (I wrote a script for that too, called Checklines, on the same page as above-mentioned script).
4. Use an aligner such as PlusTools to align each pair of files.
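
Steps 3 and 4 above can be sketched in a few lines of Python. This is not the Checklines script or PlusTools, just a minimal illustration that assumes the two extracted text files already correspond line by line. The function names `check_lines` and `pair_lines` are made up for this example.

```python
def nonblank_lines(text):
    """Return the stripped, non-empty lines of a text."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def check_lines(source_text, target_text):
    """Return (source_count, target_count, counts_match) for two texts."""
    src = nonblank_lines(source_text)
    tgt = nonblank_lines(target_text)
    return len(src), len(tgt), len(src) == len(tgt)


def pair_lines(source_text, target_text):
    """Pair lines purely by position; refuse if the line counts differ."""
    src_count, tgt_count, ok = check_lines(source_text, target_text)
    if not ok:
        raise ValueError(f"line counts differ: {src_count} vs {tgt_count}")
    return list(zip(nonblank_lines(source_text), nonblank_lines(target_text)))


if __name__ == "__main__":
    german = "Bergbau\nMineralölprodukte und Gase\n"
    spanish = "Minería\nProductos del petróleo y gases\n"
    for de, es in pair_lines(german, spanish):
        print(f"{de} - {es}")
```

Pairing by position only works when the two files really have the same lines in the same order, which is why the line check comes first. A real aligner does much more than this.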

Let me know if you struggle.

[Edited at 2008-11-01 14:29]



Samuel Murray  Identity Verified
Netherlands
Local time: 01:06
Member (2006)
English to Afrikaans
+ ...
I spoke a little too soon Nov 1, 2008

Pablo Bouvier wrote:
Take a German web page like this one: http://www.bdi-deutschland-liefert.de/cl/sid.php
This page is already translated on the web, let's say into Spanish: http://www.bdi-deutschland-liefert.de/cl/sid.php?PHPSESSID=0f5jj4kvidle1rv523bn42d9mg7agt9d&f_lang=esp .


Hmm, that darned web site has frames protection, so you can't download the frames individually. You're just gonna have to sit there and copy the text page by page.



Samuel Murray  Identity Verified
Netherlands
Local time: 01:06
Member (2006)
English to Afrikaans
+ ...
Try this little script Nov 1, 2008

Pablo Bouvier wrote:
Does anyone know an easy way to align web pages?


Extract the text. Try this little script:
http://leuce.com/tempfile/omtautoit/txtligner.zip



Marco Cevoli  Identity Verified
Spain
Local time: 01:06
Spanish to Italian
+ ...
Httrack should do the trick Nov 1, 2008

Hi,

You could download the whole site (limiting the download to the pages you are interested in) using an offline browser (best one: www.httrack.com), then proceed with the manual alignment.

Regards

Marco Cevoli
Qabiria
www.qabiria.com


Noe Tessmann  Identity Verified
Local time: 01:06
English to German
+ ...
On my wishlist Nov 1, 2008

Hello,

I have always dreamed of a program that could roughly align two given web pages without my having to download them, open WinAlign and so on.

Just enter two URLs and instantly get an aligned text, to use some chunks here and there.

Have to dream on

Noe



Harry Bornemann  Identity Verified
Mexico
English to German
+ ...
Dreaming in Perl.. Nov 1, 2008

Noe Tessmann wrote:

I have always dreamed of a program that could roughly align two given web pages without my having to download them, open WinAlign and so on.

Just enter two URLs and instantly get an aligned text, to use some chunks here and there.

Have to dream on

This sounds like 1 day of Perl programming for me, so it would probably take me 3. (HTML is actually a nice object oriented database structure like XML.)

The result could be an .exe file, or an online service (+ another week for a registration and payment system).

The question is who could guarantee me 900 EUR for 3 long days of hard work?
Maybe other programmers could do it cheaper; you can find them on sites similar to Proz.com.



KSL Berlin  Identity Verified
Portugal
Local time: 00:06
Member (2003)
German to English
+ ...
Get rich quick ;-) Nov 1, 2008

Harry Bornemann wrote:
The question is who could guarantee me 900 EUR for 3 long days of hard work?
Maybe other programmers could do it cheaper...


If you or someone else could put together a functioning tool that would work if fed two URLs, I think you could make more than 900 euros.

I wonder if the WinAlign API is accessible.



Pablo Bouvier  Identity Verified
Local time: 01:06
German to Spanish
+ ...
TOPIC STARTER
Don't worry about Nov 1, 2008

Samuel Murray wrote:

Pablo Bouvier wrote:
Take a German web page like this one: http://www.bdi-deutschland-liefert.de/cl/sid.php
This page is already translated on the web, let's say into Spanish: http://www.bdi-deutschland-liefert.de/cl/sid.php?PHPSESSID=0f5jj4kvidle1rv523bn42d9mg7agt9d&f_lang=esp .


Hmm, that darned web site has frames protection, so you can't download the frames individually. You're just gonna have to sit there and copy the text page by page.




Many thanks for your tips. I'll check them tomorrow morning.
Don't worry about frames. Firefox has an option to open frames as individual web pages,
and most web pages don't use frames anyway.


Daniel García
English to Spanish
+ ...
WinAlign API Nov 1, 2008


Kevin Lossner wrote:
I wonder if the WinAlign API is accessible.


Yes, it is accessible and very user-friendly but, at least until version 7, it requires a special licence, at least for the professional version of Trados. I don't know how it works with the Freelance version.

It can potentially be very useful but it works only locally.

You would still have to find a way to download the files from the web and copy them to your disk.

You could have a script or something to download the files and copy them locally. Then use the WinAlign API to create a project, do the alignment and then export the results.
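
The "script to download the files and copy them locally" could look roughly like the Python sketch below. It only covers the download step; the hand-off to the WinAlign API is not shown because nothing in this thread describes its calls. The function names and the `.src.html` / `.tgt.html` naming scheme are assumptions made for the example.

```python
import pathlib
import urllib.request


def download(url, dest_dir, filename):
    """Fetch one URL and save the raw bytes as dest_dir/filename."""
    dest = pathlib.Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / filename
    with urllib.request.urlopen(url, timeout=30) as response:
        target.write_bytes(response.read())
    return target


def download_pair(source_url, target_url, dest_dir, stem):
    """Save both language versions side by side under a shared stem."""
    return (
        download(source_url, dest_dir, f"{stem}.src.html"),
        download(target_url, dest_dir, f"{stem}.tgt.html"),
    )
```

Calling `download_pair(german_url, spanish_url, "pages", "page1")` would leave `pages/page1.src.html` and `pages/page1.tgt.html` on disk, ready for whatever aligner you use. Sites that depend on sessions or frames, like the one in the first post, may need more than a plain fetch.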

You could even have a quick review of the WinAlign projects before they are exported.

Daniel



Harry Bornemann  Identity Verified
Mexico
English to German
+ ...
Nice to meet you, investor. :-) Nov 1, 2008

Kevin Lossner wrote:

If you or someone else could put together a functioning tool that would work if fed two URLs, I think you could make more than 900 euros.

I wonder if the WinAlign API is accessible.

I don't think the WinAlign API is accessible. I only know that an old Trados Server version (4?) was accessible. Even if it is, I wonder whether I could integrate it, and I am not sure it would be helpful.

The tool I imagine would compare the 2 HTML & hypertext trees and where their structures start to differ, any further tags would be included as content of this last hypertext level. (Does anyone understand me?)
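
One possible reading of this idea, sketched in Python (not Perl, and not Harry's actual design): parse both pages into trees, walk them in parallel while the child tags agree, and as soon as the structures differ, treat everything below that point as one chunk of text on each side. The class and function names are invented for this sketch.

```python
from html.parser import HTMLParser

VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
SKIP = {"script", "style"}


class Node:
    def __init__(self, tag):
        self.tag = tag
        self.items = []  # mixed list of str (text) and Node (child elements)

    def elements(self):
        return [i for i in self.items if isinstance(i, Node)]

    def has_text(self):
        return any(isinstance(i, str) for i in self.items)

    def flat_text(self):
        """All text below this node, whitespace-normalised, tags dropped."""
        parts = [i.flat_text() if isinstance(i, Node) else i for i in self.items]
        return " ".join(" ".join(parts).split())


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].items.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, if there is one.
        for depth in range(len(self.stack) - 1, 0, -1):
            if self.stack[depth].tag == tag:
                del self.stack[depth:]
                break

    def handle_data(self, data):
        if data.strip() and self.stack[-1].tag not in SKIP:
            self.stack[-1].items.append(data)


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def align(a, b):
    """Pair the text of two trees, descending only while structures agree."""
    kids_a, kids_b = a.elements(), b.elements()
    same_shape = [k.tag for k in kids_a] == [k.tag for k in kids_b]
    if not kids_a or not same_shape or a.has_text() or b.has_text():
        # Leaf, mixed content, or differing structure: flatten to one segment.
        pair = (a.flat_text(), b.flat_text())
        return [pair] if any(pair) else []
    pairs = []
    for child_a, child_b in zip(kids_a, kids_b):
        pairs.extend(align(child_a, child_b))
    return pairs
```

With two pages that share a list of `<li>` items but use different inline markup inside a paragraph, `align(parse(german_html), parse(spanish_html))` pairs the list items one by one and returns each paragraph as a single segment. Real pages with different navigation or layout per language would diverge much earlier, so this is only a starting point.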

I think this would work quite well, but the .exe file would be difficult to protect. It would not take long until one of the customers would start to distribute copies, compromising your investment.

This could probably be prevented for a year or so by programming an 'online dongle'-system, monitoring and limiting the use of individual licences, which is another task to work out, and which would be adjacent to the mentioned automatic registration and payment system.
(As soon as you conceive any useful (i.e. sellable) software, for every satisfied need you will get 3 additional ones...)

BTW, some years ago I aligned help.sap.com for SAP versions R/3 - 64b, 64c, and 470 into a 52 MB MultiTerm database extracted from 72,806 web pages (imagine doing it manually!), and consulted a lawyer concerning any copyright issues for selling it. In short, he said that it would be OK if I sold only the tool and left the users to ponder any copyright issues, which is a pity because using a server to download pages from other servers runs much faster than using your local machine: SAP in 2 hours vs. SAP in 2 days.

dgmaga wrote:

I wonder if the WinAlign API is accessible.

Yes, it is accessible and very user-friendly but, at least until version 7, it requires a special licence, at least for the professional version of Trados.

Interesting - another item on the investment list.

[Edited at 2008-11-02 00:11]



Samuel Murray  Identity Verified
Netherlands
Local time: 01:06
Member (2006)
English to Afrikaans
+ ...
No need to protect the EXE Nov 2, 2008

Harry Bornemann wrote:
The tool I imagine would compare the 2 HTML & hypertext trees and where their structures start to differ, any further tags would be included as content of this last hypertext level. (Does anyone understand me?)


I think you're taking on too much. Keep it simple, I say, and add features if and when you have resources.

My opinion is that you should leave it up to the user to find bilingual texts, and not expect your program to figure out which page is the translation of which other page. Let the user find the pages. You can, however, make it easier for the user to specify two base URLs and give your program some clues about how each language's tree stems from that base, so that the user need not first create a list of *all* the URLs. Well, this could even be a separate module, in which the user supplies two URLs and the stem clues, and the program then creates a list of URLs for the user (a table with two columns, for example). The user then checks the URLs randomly (or all of them if he wants to) to see if the two trees do in fact match. If so, the program continues. If not, well, perhaps there should be a way for the user to "correct" the file list and upload it.
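
A minimal Python sketch of such a URL-list module follows. It assumes the simplest possible kind of stem clue, a prefix swap between two base URLs; the site in the first post uses a query parameter (`f_lang=esp`) instead, so real clues would need to be richer. The function names and the example.org URLs are hypothetical.

```python
import csv
import io


def pair_urls(source_urls, source_stem, target_stem):
    """Build (source, target) rows by swapping the leading stem."""
    rows = []
    for url in source_urls:
        if url.startswith(source_stem):
            rows.append((url, target_stem + url[len(source_stem):]))
        else:
            rows.append((url, ""))  # left blank for the user to fill in
    return rows


def to_tsv(rows):
    """Render the rows as a two-column, tab-separated table with a header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(("source_url", "target_url"))
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    urls = [
        "http://example.org/de/index.html",
        "http://example.org/de/produkte/liste.html",
        "http://example.org/impressum.html",
    ]
    print(to_tsv(pair_urls(urls, "http://example.org/de/", "http://example.org/es/")))
```

The tab-separated table is something the user can open in a spreadsheet, spot-check, correct and hand back to the program, as described above.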

I have to go to church now but I hope to write more comments when I come back.



Harry Bornemann  Identity Verified
Mexico
English to German
+ ...
It gets complicated by itself Nov 2, 2008

Samuel Murray wrote:

Keep it simple, I say, and add features if and when you have resources.

My opinion is that you should leave it up to the user to find bilingual texts, and not expect your program to figure out which page is the translation of which other page. Let the user find the pages. You can, however, make it easier for the user to specify two base URLs and give your program some clues about how each language's tree stems from that base, so that the user need not first create a list of *all* the URLs. Well, this could even be a separate module, in which the user supplies two URLs and the stem clues, and the program then creates a list of URLs for the user (a table with two columns, for example). The user then checks the URLs randomly (or all of them if he wants to) to see if the two trees do in fact match. If so, the program continues. If not, well, perhaps there should be a way for the user to "correct" the file list and upload it.


Such stem clues may be required for some websites, but here it gets complicated for the user who originally only wanted to enter 2 URLs. Otherwise this is another nice idea for an extension.

Can you tell me more about "No need to protect the EXE"?



Samuel Murray  Identity Verified
Netherlands
Local time: 01:06
Member (2006)
English to Afrikaans
+ ...
Quick reply about stem clues Nov 2, 2008

Harry Bornemann wrote:
Such stem clues may be required for some websites, but here it gets complicated for the user who originally only wanted to enter 2 URLs. Otherwise this is another nice idea for an extension.


True, but I think somewhere there is a line between a cute toy and a useful tool. I don't think the idea is that a translator should be able to align only two pages at a time (both of whose URLs are given to the program). And having done such alignments myself before, I can tell you that the naming scheme of such URLs is rarely straightforward. You'd have to build in a lot of intelligence just to get the two sets of pages right. Better ask the translator to lend a hand.

Sure, if a translator wants to align just two pages, he can enter both URLs and set the stem clue setting as "nothing" and the link follow depth setting as "0". This may even be the default setting, which will help the site act as its own demo.



Samuel Murray  Identity Verified
Netherlands
Local time: 01:06
Member (2006)
English to Afrikaans
+ ...
Continuing my previous**2 post Nov 2, 2008

Harry Bornemann wrote:
The tool I imagine would compare the 2 HTML & hypertext trees and where their structures start to differ, any further tags would be included as content of this last hypertext level. (Does anyone understand me?)


On second thought, no, I don't think I know what you mean.

I think this would work quite well, but the .exe file would be difficult to protect. It would not take long until one of the customers would start to distribute copies, compromising your investment.


If you offer an online service, there is no need to protect the EXE, is there? Anyway, protecting the EXE is a separate issue -- I'm sure there are ways of doing it. I don't think it should be the main point of discussion.

I'm sure it's fairly clear that the segmentation system would have to be designed from scratch. Your next question would be whether you want to support SRX or some other format, or just design your own format.
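
To make the segmentation question concrete, here is a small Python sketch of a rule-based segmenter in the spirit of SRX, where each rule has a "before break" pattern, an "after break" pattern and a break/no-break flag, and the first rule that matches at a position decides. It does not read SRX files, and the two rules shown are illustrative assumptions, not a usable rule set.

```python
import re

# (is_break, before_pattern, after_pattern); earlier rules take precedence.
RULES = [
    (False, r"\b(?:z\.B|Dr|Nr|etc)\.", r"\s"),  # no break after these abbreviations
    (True, r"[.!?]", r"\s+"),                   # break after sentence-final punctuation
]


def segment(text, rules=RULES):
    """Split text into segments using ordered break / no-break rules."""
    compiled = [
        (is_break, re.compile(f"(?:{before})\\Z"), re.compile(after))
        for is_break, before, after in rules
    ]
    segments, start = [], 0
    # Test every position inside the text; quadratic, but fine for a sketch.
    for pos in range(1, len(text)):
        for is_break, before, after in compiled:
            if before.search(text[:pos]) and after.match(text, pos):
                if is_break:
                    segments.append(text[start:pos].strip())
                    start = pos
                break  # the first matching rule decides
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


if __name__ == "__main__":
    for seg in segment("Siehe z.B. Nr. 5. Das ist gut! Ende."):
        print(seg)
```

Keeping the rules as plain data is what would make it possible to load them from SRX or from a home-grown format later, whichever way that decision goes.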

Another issue would be text extraction. Just how much of the formatting would you like to retain? I favour a minimalistic approach whereby the aligned text is formatless. This means that you need to create extractors for HTML (to start off with). I have yet to see an HTML2TXT module that does not break lines -- perhaps you can be the first to create one.
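
As a rough illustration of a formatless extractor that does not break lines, the Python sketch below emits one line per block-level element and collapses any line breaks inside it. It uses only the standard library's `html.parser`; the choice of which tags count as block-level is an assumption, and the class and function names are invented here.

```python
from html.parser import HTMLParser

# Tags that start or end a line of output (an assumed, incomplete list).
BLOCK_TAGS = {"p", "div", "li", "ul", "ol", "table", "tr", "td", "th", "br",
              "h1", "h2", "h3", "h4", "h5", "h6", "title", "blockquote",
              "dt", "dd"}
SKIP_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = []
        self.current = []
        self.skip_depth = 0

    def flush(self):
        """Finish the current line, collapsing all internal whitespace."""
        line = " ".join("".join(self.current).split())
        if line:
            self.lines.append(line)
        self.current = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if not self.skip_depth:
            self.current.append(data)


def html_to_text(html):
    """Return the visible text, one block-level element per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()
    return "\n".join(parser.lines)
```

Inline tags such as `<b>` or `<a>` never end a line, so a paragraph that is wrapped over several lines in the HTML source still comes out as a single line of text.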

