website alignment – is it possible, and what tools do you use?
Thread poster: TSDM

TSDM
Russian Federation
Local time: 09:09
Russian to English
Mar 13, 2008

Are there any tools that can batch-align bilingual websites (with the aim of creating themed or client-specific TMs)?

It's quite easy to download full websites, and this would be a great way to make translation memories.

I found a description of a Linux tool in development:
"bitextor – Builds parallel text corpora from webpages. Uses websites as the source of text. Analyzes webpage text for bitexts. Presently works with es, ca, gl, pt, and en languages. Can easily be extended to support new languages."

I assume a tool like this would produce two aligned documents that could be easily turned into a translation memory using the CAT tool of your choice.

Is there anything out there commercially available for Windows or Mac OS X that does this? I've spent easily half a day searching and trying different CAT products, but haven't found anything close to satisfactory.



Wolfgang Jörissen  Identity Verified
Belize
Member
Dutch to German
+ ...
Interesting approach, but I'm afraid not Mar 14, 2008

The problem with websites could be the different technologies used: think of CMSs, Java applets, Flash, etc. And if it is not the technology, it will certainly be the structure, which can differ from one website to the next. Creating a tool that handles all of that would be a _very_ tough challenge. However, you might want to use one of those grabbers that download all pages of a website to your hard disk (I used HTTP Weazel years ago, and it did a good job), and then check for alignable material. Not fully automated, but at least a step in the right direction.


Samuel Murray  Identity Verified
Netherlands
Local time: 08:09
Member (2006)
English to Afrikaans
+ ...
I wrote one, but it's geeky Mar 14, 2008

chacher wrote:
Are there any tools that can batch-align bilingual websites (with the aim of creating themed or client-specific TMs)?


You need two things:

* An extractor
* An aligner

For the aligner, you can use any alignment program. I suggest PlusTools from the Wordfast people.

For the extractor, take a look at my humble collection of scripts:
http://leuce.com/tempfile/omtautoit/
...and search the page for "large alignments". There are two versions -- the older version is less sophisticated and therefore less likely to fail you.

I used this when I aligned the text from a multilingual government web site. It's geeky, but it worked for me (you may have to watch it, though, so you can kill it the moment it misbehaves).

On that same page there is also a script named "Abbzz", which is used in conjunction with ABBYY FineReader to bulk-extract text from PDFs (sometimes you get web sites offering PDFs in many languages).



Samuel Murray  Identity Verified
Netherlands
Local time: 08:09
Member (2006)
English to Afrikaans
+ ...
Yes, of course, that too... Mar 14, 2008

Wolfgang Jörissen wrote:
However, you might want to use one of those grabbers that download all pages of a website to your hard disk (I used HTTP Weazel years ago, and it did a good job), and then check for alignable material.


Yes, of course... I took for granted that the OP would have the web pages on his hard disk already. For ripping a web site, you could also look at Oleg Arny Chernavin's "Web Downloader" (webdown.exe) (abandonware, but excellent). If you're into FLOSS, you could go with HTTrack.



TSDM
Russian Federation
Local time: 09:09
Russian to English
TOPIC STARTER
responses to Samuel and Wolfgang (My Mum almost named me Wolfgang...) Mar 14, 2008

Samuel – great info, and I'm looking forward to trying your scripts. One question: won't the extraction process remove valuable HTML tag information that would be helpful in alignment?

Wolfgang – downloading sites is not a problem (I'm on a Mac using a great program with a great name: SiteSucker). The trick is finding an alignment program. I'm not looking for something that would capture 100% of a site (Java applets, Flash, etc.), just the basic text content.

I found that MultiTrans has an alignment tool that handles HTML/PHP and does a good job, but only up to 10 documents at a time, and they have to be manually paired one by one.

What we're missing is a program that can batch files, and of course, handle subdirectories.

Any other suggestions out there?
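
For what it's worth, the "batch files and handle subdirectories" part can be sketched in a few lines of Python. This is only a rough illustration, not a feature of any of the tools named in this thread. It assumes the two language versions were downloaded into mirrored folders with identical relative paths (for example en/about/team.html and ru/about/team.html), which many sites do not guarantee. The function name `pair_files` and the list of extensions are made up for the example.

```python
from pathlib import Path


def pair_files(src_root, tgt_root, suffixes=(".html", ".htm", ".php")):
    """Pair files that share the same relative path under two language roots.

    Assumption: both site copies use a mirrored directory layout.
    Returns (pairs, unmatched), where pairs is a list of (source, target)
    paths and unmatched lists source files with no counterpart.
    """
    src_root, tgt_root = Path(src_root), Path(tgt_root)
    pairs, unmatched = [], []
    # rglob("*") walks all subdirectories recursively.
    for src in sorted(src_root.rglob("*")):
        if not src.is_file() or src.suffix.lower() not in suffixes:
            continue
        candidate = tgt_root / src.relative_to(src_root)
        if candidate.is_file():
            pairs.append((src, candidate))
        else:
            unmatched.append(src)
    return pairs, unmatched
```

Calling `pair_files("site/en", "site/ru")` would give a list of document pairs to feed to an aligner, plus a list of pages that need to be paired by hand.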



Samuel Murray  Identity Verified
Netherlands
Local time: 08:09
Member (2006)
English to Afrikaans
+ ...
Answers Mar 15, 2008

chacher wrote:
Samuel – great info, and I'm looking forward to trying your scripts. One question: won't the extraction process remove valuable HTML tag information that would be helpful in alignment?


There are two ways of looking at HTML tag information. You can see it as helpful, or you can see it as unhelpful. I take the latter view. Well, I suppose one could write a very fancy program that actually makes use of the HTML structure to improve the initial alignment, but if your alignment tool is good and if you know both languages well, then I don't think you should be concerned.

...and they have to be manually paired one by one.


Yes, well, that is what you do in an aligner. The aligner presents a table with two columns and you go through them to see that the segments on the left all match up with a segment on the right.

I have little faith in fully automated procedures. Alignment is only useful if you invest time in it.


David Turner  Identity Verified
Local time: 08:09
French to English
+ ...
Logiterm Mar 15, 2008


What we're missing is a program that can batch files, and of course, handle subdirectories.
Any other suggestions out there?


Logiterm or Alignment Factory must be among the best batch aligners.
http://www.terminotix.com/index.asp?name=Professional&content=item&brand=2&item=12&lang=en

David Turner



TSDM
Russian Federation
Local time: 09:09
Russian to English
TOPIC STARTER
comparing commercial alignment software Mar 17, 2008

David – Alignment Factory/Logiterm may be exactly what we're looking for.

Anyone else have recommendations or can comment on experiences with this or other batch/semi-automated alignment software?



TSDM
Russian Federation
Local time: 09:09
Russian to English
TOPIC STARTER
automated alignment – a farce? Mar 17, 2008

Samuel –
I'd like to hear more about your approach to alignment. It's pretty hard to justify going through years of accumulated documents line by line to align them perfectly.

Do you do all your alignment in advance, or do you use software that shows full-text TMs (rather than units) and allows alignment on the fly?



TSDM
Russian Federation
Local time: 09:09
Russian to English
TOPIC STARTER
extraction script not working? Mar 17, 2008

Samuel,
The script you have is listed as not working. Can you clarify? Is it possible to use?



TSDM
Russian Federation
Local time: 09:09
Russian to English
TOPIC STARTER
maybe you can recommend another extraction tool? Mar 17, 2008

I read the script's readme file, and it is just too complex for me.

Can anyone recommend a tool to extract text from websites?



Samuel Murray  Identity Verified
Netherlands
Local time: 08:09
Member (2006)
English to Afrikaans
+ ...
If you find any... Mar 17, 2008

chacher wrote:
Can anyone recommend a tool to extract text from websites?


If you find any, please let us know. All of the HTM2TXT software that I have seen so far puts line breaks in the middle of sentences when converting to text, thereby rendering the extraction useless.

But if you open the HTM file in a browser and then go Ctrl+A, Ctrl+C in it, and then Ctrl+V in a text editor, the sentences remain intact. However, doing that one file at a time will take a long time to complete (unless you pay a student to do it for you).
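
One possible way to avoid mid-sentence line breaks without copying and pasting by hand is sketched below. It uses Python's standard `html.parser` module and starts a new line only at block-level tags, collapsing all other whitespace (including line breaks in the HTML source) into single spaces. This is a rough sketch and not one of the scripts mentioned in this thread. The tag list in `BLOCK_TAGS` and the names `TextExtractor` and `html_to_text` are assumptions for the example, and real pages with menus, tables, or malformed markup would need more care.

```python
import re
from html.parser import HTMLParser

# Assumption: these tags mark paragraph-like boundaries; extend as needed.
BLOCK_TAGS = {"p", "div", "br", "li", "ul", "ol", "tr", "td", "th", "table",
              "h1", "h2", "h3", "h4", "h5", "h6", "title", "blockquote"}
SKIP_TAGS = {"script", "style"}  # content that is never visible text


class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []      # finished text blocks, one per output line
        self.current = []     # text pieces of the block being built
        self.skip_depth = 0   # greater than 0 while inside script/style

    def flush(self):
        # Collapse source line breaks and runs of spaces into one space.
        text = re.sub(r"\s+", " ", "".join(self.current)).strip()
        if text:
            self.blocks.append(text)
        self.current = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if not self.skip_depth:
            self.current.append(data)


def html_to_text(html):
    """Return visible text with one block-level element per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()
    return "\n".join(parser.blocks)
```

Inline tags such as `<b>` or `<a>` do not trigger a line break here, so a sentence that contains bold text or a link stays on one line. Sentence segmentation is left to the aligner.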



Samuel Murray  Identity Verified
Netherlands
Local time: 08:09
Member (2006)
English to Afrikaans
+ ...
Alignment must be 100% or zero Mar 17, 2008

chacher wrote:
It's pretty hard to justify going through years of accumulated documents line by line to align them perfectly.


Well, it depends on the level at which you align. If you align paragraphs, it is easier to do it in a semi-automated way (theoretically speaking). But if you want segment matching (fuzzy matching, etc.), then sentence segmentation is pretty much what you're looking for, right?

If your automated alignment tool mis-segments one sentence at the top of a file (e.g. by creating two sentences instead of one, or because the editor of one language version added a sentence to the file), then all subsequent segments in that file will be misaligned. Don't you agree?
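
The cascade described above can be illustrated with a toy example. The sketch below assumes the simplest possible aligner, one that pairs segments purely by position. Aligners that also look at sentence length or other cues may recover from such a shift, so this shows the worst case only. The function names and segment labels are invented for the illustration.

```python
def align_by_position(source, target):
    """Naive aligner: pair the i-th source segment with the i-th target segment."""
    return list(zip(source, target))


def count_misaligned(pairs, correct_pairs):
    """Count proposed pairs that are not among the known-correct pairs."""
    correct = set(correct_pairs)
    return sum(1 for pair in pairs if pair not in correct)


# Five source sentences; the target splits the first sentence into two.
source = ["S1", "S2", "S3", "S4", "S5"]
target = ["T1a", "T1b", "T2", "T3", "T4", "T5"]
correct_pairs = [("S1", "T1a"), ("S1", "T1b"), ("S2", "T2"),
                 ("S3", "T3"), ("S4", "T4"), ("S5", "T5")]

pairs = align_by_position(source, target)
wrong = count_misaligned(pairs, correct_pairs)
print(f"{wrong} of {len(pairs)} pairs are misaligned")
```

Only the first pair survives the split. Every segment after it is shifted by one, which is why a single segmentation error near the top of a file matters so much.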



Felipe Gútiez  Identity Verified
Germany
Local time: 08:09
German to Spanish
+ ...
Do you know MultiTrans or any other good new alignment tool? Oct 6, 2008

Samuel Murray wrote:

chacher wrote:
It's pretty hard to justify going through years of accumulated documents line by line to align them perfectly.


Well, it depends on the level at which you align. If you align paragraphs, it is easier to do it in a semi-automated way (theoretically speaking). But if you want segment matching (fuzzy matching, etc.), then sentence segmentation is pretty much what you're looking for, right?

If your automated alignment tool mis-segments one sentence at the top of a file (e.g. by creating two sentences instead of one, or because the editor of one language version added a sentence to the file), then all subsequent segments in that file will be misaligned. Don't you agree?


I am looking for a good alignment tool for InDesign and XML files.
Can Trados align several languages in one go? If so, how many?



Michael Joseph Wdowiak Beijer  Identity Verified
United Kingdom
Local time: 07:09
Member (2009)
Dutch to English
+ ...
Try this: May 20, 2010

http://www.youalign.com/
