New free & open source aligner (for Windows, OS X and linux)
Thread poster: FarkasAndras
FarkasAndras
Local time: 07:30
English to Hungarian
Nov 6, 2010

Hello everyone, this is just a message to let you know that I have made available a new, greatly improved version of my open source aligner.

Download (readme included in package):
http://sourceforge.net/projects/aligner/

Direct link to Windows version:
http://sourceforge.net/projects/aligner/files/LF_aligner_2.0_win.zip/download


Features include:
Autoalignment of docx, pdf, txt or html files and webpages.
Automatic downloading and alignment of various kinds of EU documents.
Review of the autoaligned material in formatted xls spreadsheets generated by the program.
Output: TMX, xls and tab delimited txt.
Batch alignment of any number of docx, pdf, txt or html files. Provide a file list, get a translation memory.
The aligner needs no installation and it doesn't require you to install any other software, either. It makes no changes to the registry or other system settings.

If you've been using aligner.bat, the main improvements you'll find apart from the stuff listed above are: drag & drop of the input files and extensive customization of the aligner's behaviour through a setup file.

Feedback, bug reports and feature requests are all welcome here or on the sourceforge page.

I know some of you dabble in programming, so here's some information for you: the aligner was written in Perl and then packaged into a self-contained executable for Windows. It's a modular system of sorts: I wrote the main script in Perl, which does pre- and postprocessing and TMX generation, among other things. It relies on other open source programs like hunalign, a sentence segmenter lifted from the Europarl corpus project, wget, pdftotext, docx2txt etc. for the various conversion/alignment tasks, so you can replace/tune things very easily. If you have a look under the hood and need some help finding your way, let me know. Also let me know if you have fiddled with the code and added features or made some other improvements.



Piotr Bienkowski  Identity Verified
Poland
Local time: 07:30
Member (2005)
English to Polish
Seg rules? Nov 6, 2010

Does it use any seg rules?

If yes, can they be specified?

Thanks.

Piotr


FarkasAndras
Local time: 07:30
English to Hungarian
TOPIC STARTER
Segmenter Nov 6, 2010

Piotr Bienkowski wrote:

Does it use any seg rules?

If yes, can they be specified?

Thanks.

Piotr


It uses the segmenter from the Europarl corpus project.
You can customize it to a certain extent by changing/adding language-specific "rule" files at scripts\sentence_splitter\nonbreaking_prefixes\
The segmenter's own readme is also included, so check that for details.

Alternatively, you can use your CAT tool of choice for segmentation if you want to make sure you get 100% hits from the TM. I never bother with that, but I put approximate instructions for Trados and WF Classic in the readme. Of course, if you go down this route, skip the built-in segmenter. Search for "Sentence splitter:" in the readme for details.


FarkasAndras
Local time: 07:30
English to Hungarian
TOPIC STARTER
Mac & linux users please report back Nov 6, 2010

I had limited opportunities for testing this on macs and on linux systems, so I'd like to hear back from you if you've tested the thing on either of those platforms.

Some possible errors:
If the program doesn't start on double clicking, right click and associate the program with the terminal.
If you're getting permission errors on hunalign, go to scripts/hunalign and make the hunalign binary executable using the right click context menu. The same applies to pdftotext.
If the hunalign binary fails to run, you probably need to recompile. Shouldn't be too hard: Unpack the source code, navigate to the src folder in terminal and issue the make command. The new hunalign executable should be generated in the hunalign folder. Rename it to hunalign_linux or hunalign_mac and move it to the appropriate folder.
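The permission fix described above can also be scripted. This is a minimal Python sketch, not part of the aligner: the helper name `make_executable` is hypothetical, and the path in the comment is only an illustration of where the binary might sit.

```python
import os
import stat


def make_executable(path):
    """Add execute permission for user, group and others, keeping other bits."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)


# Illustrative call only; adjust the path to wherever your binary lives:
# make_executable("scripts/hunalign/hunalign_linux")
```

This does the same thing as ticking the "executable" box in the file manager, so it only helps with permission errors, not with a binary that needs recompiling.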



Adam Bojan  Identity Verified
Poland
Local time: 07:30
Dutch to Polish
Great work! Thank you Nov 6, 2010

I have just aligned two docx files of 10 pages, not so long, for a test. Everything went smoothly. The instructions in the black window are very clear.
Then I corrected the file in Excel and it also was very nice, compared to WinAlign in my case. Thank you for the clear instructions in the second sheet. From Excel I know very well how to create the tmx file myself, but I tried and used the program further, copying the rows from excel to txt. In this way a tmx file appeared in my folder.
The problems began when I tried to import the tmx to Studio. First, I got an error like:
"unknown" is not allowed for "segtype".
So I opened the tmx in notepad and changed "unknown" next to "segtype" to "sentence".
Then once again to Studio and 2nd error:
[< '] is not allowed .... the expected symbol is [;]
WT***?
I gave up. Opened the SDLX TM maintain (s42tmman.exe) and imported the tmx file without any problems, then exported to tmx again. Then once again to Studio and it worked!
I will certainly test it more and let you know.
Greetings,
Adam

[Edited at 2010-11-06 21:45 GMT]


FarkasAndras
Local time: 07:30
English to Hungarian
TOPIC STARTER
fix Nov 6, 2010

That'll be an easy fix. I only tested it on earlier trados versions, and apparently studio is pickier. I'll fix the segtype; for the other issue, could you give me the error message verbatim and the header from a tmx as exported from studio? You can mail it to quca at freemail dot hu.

FarkasAndras
Local time: 07:30
English to Hungarian
TOPIC STARTER
2.01 up Nov 7, 2010

Thanks for the detailed bugreport, I uploaded an updated (windows) version that should make Studio happy.

I also got a TMX validator and checked the TMX files generated by the new aligner version, they seem to be fine. Could you check with Studio?


BTW, one of the things Studio complained about directly contradicts the official TMX spec, which clearly says that *all* is a valid srclang value. Either way, it'll hopefully be okay now.

The issue with < versus ; is basically a character encoding/representation issue. The & sign can be represented by &amp; (e.g. in HTML). I thought this was optional in TMX, i.e. both & and &amp; were valid representations of the ampersand. Apparently, Studio only accepts &amp; so it complained about the lack of the closing ; - somehow ignoring the fact that "amp" wasn't there, either. Older Trados versions accepted &, but the TMX validator also tells me I need to use &amp;.
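To illustrate the ampersand point, here is a small Python sketch (the aligner itself is written in Perl, so this is not its code). The helpers `make_tu` and `is_well_formed` are hypothetical names; `escape` and `quoteattr` are from Python's standard `xml.sax.saxutils` module. The sketch builds a single translation unit, not a complete TMX file with header.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def make_tu(src_lang, src_text, tgt_lang, tgt_text):
    """Build one TMX translation unit with XML-escaped segment text."""
    return (
        "<tu>"
        f"<tuv xml:lang={quoteattr(src_lang)}><seg>{escape(src_text)}</seg></tuv>"
        f"<tuv xml:lang={quoteattr(tgt_lang)}><seg>{escape(tgt_text)}</seg></tuv>"
        "</tu>"
    )


def is_well_formed(xml_text):
    """Return True if an XML parser accepts the text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


unit = make_tu("en", "Research & development", "hu", "Kutatás & fejlesztés")
print(unit)                                                  # & written as &amp;
print(is_well_formed(unit))                                  # escaped version
print(is_well_formed("<seg>Research & development</seg>"))   # raw & in a segment
```

A strict XML parser rejects the raw ampersand, which is consistent with what Studio and the TMX validator reported.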

Note about abbreviations re: your email: you probably just got a message that says there is no segmentation "rule" file (nonbreaking prefix file) for your language. In that case, the segmenter defaults to English, which is mostly fine. The only drawback is that the segmentation may be off in a couple of places due to this. You can add files for your languages; they contain a list of words like Mr., vs. and approx. that end in a period but don't end the sentence. So just make a copy of the English file, rename it and add things like "Mr" and "Hr" in your language.
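A rough Python sketch of why such a list matters. This is not the Europarl segmenter: the file layout (one abbreviation per line, lines starting with # treated as comments) is an assumption, so check the segmenter's readme for the real format, and the boundary rule (period, whitespace, ASCII capital letter) is deliberately simplistic. The function names are hypothetical.

```python
import re


def load_prefixes(file_content):
    """Collect abbreviations, one per line; '#' lines are treated as comments."""
    prefixes = set()
    for line in file_content.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            prefixes.add(line.rstrip("."))
    return prefixes


def split_sentences(text, prefixes):
    """Split at '. ' + capital letter unless the word before is a known prefix."""
    sentences, start = [], 0
    for match in re.finditer(r"\.\s+(?=[A-Z])", text):
        words = text[start:match.start()].split()
        last_word = words[-1] if words else ""
        if last_word in prefixes:
            continue  # known abbreviation, so this period does not end a sentence
        sentences.append(text[start:match.start() + 1])
        start = match.end()
    if start < len(text):
        sentences.append(text[start:])
    return sentences


prefixes = load_prefixes("# sample list\nMr\nvs\napprox\n")
print(split_sentences("Mr. Smith arrived. He spoke first.", prefixes))
print(split_sentences("Mr. Smith arrived. He spoke first.", set()))
```

With the list, "Mr." stays attached to "Smith arrived."; with an empty list it is split off as a segment of its own, which is the kind of error a missing language file can cause.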

The English-Hungarian thing is just a small quirk of Hunalign, pay no attention to it. I'm trying to get the author of hunalign to correct it.

[Edited at 2010-11-07 17:00 GMT]



Michael Joseph Wdowiak Beijer  Identity Verified
United Kingdom
Local time: 06:30
Member (2009)
Dutch to English
@FarkasAndras Nov 7, 2010

I have a question....

I finally had some time to try out your aligner, and so far it is pretty good. I am having a problem however, which I can't seem to figure out.

I have been testing:

1. the new memoQ 4.5 aligner in LiveDocs,
2. LF Aligner (LF_aligner_2.01.exe), and
3. AlignFactory Light (Terminotix)

And so far, AlignFactory is the only one getting almost everything in my EUconst alignments right. But yours isn't limited to 100 pairs, and I really want to figure out how to get it to achieve the same quality as AlignFactory.

The problem seems to be with Paragraphs. AlignFactory is getting almost all of the larger paragraphs that should be kept together, correct.
memoQ and LF Aligner are cutting them up at the wrong places.

Should I be tweaking a setting somewhere re. sentences/paragraphs/lines and segmentation? There must be a way to get these Europarl files aligned. In AlignFactory, you can select from various different segmentation rules types: "Paragraph-based", "Sentence-based", and "Line-to-line" ...

Here are the two files that I am using as a test pair in all of the different programs:

http://beijer.mx/storage/ep-00-10-05_nl.txt
http://beijer.mx/storage/ep-00-10-05_en.txt

If you have any time to spare, could you maybe look at them and tell me what I am doing wrong?

Thanks!

Michael

[Edited at 2010-11-08 00:06 GMT]


FarkasAndras
Local time: 07:30
English to Hungarian
TOPIC STARTER
Segmentation revisited Nov 8, 2010

Michael J.W. Beijer wrote:

I have a question....

The problem seems to be with Paragraphs. AlignFactory is getting almost all of the larger paragraphs that should be kept together, correct.
memoQ and LF Aligner are cutting them up at the wrong places.

Should I be tweaking a setting somewhere re. sentences/paragraphs/lines and segmentation? There must be a way to get these Europarl files aligned. In AlignFactory, you can select from various different segmentation rules types: "Paragraph-based", "Sentence-based", and "Line-to-line"


Well, I take it that you want paragraph-level segmentation, right? I.e. you want line breaks to become the segment boundaries and you don't want the aligner to chop up your text into sentences. Just pick "n" when the aligner asks if you want to "Segment text to sentences".

Here's how segmentation works in LF aligner: line breaks are always taken to be segment delimiters*. On top of that, if you do segmentation, the europarl segmenter is used to chop up the stuff between the line breaks into sentences. In 99% of cases, it's probably better to sentence-segment the text... you get more TM lookup hits (supposing you translate sentence by sentence) and you get more conveniently usable concordance hits - admittedly, at the price of more misaligned segments. How many more depends on the source text and your "nonbreaking prefix" file, but it's usually not bad at all. The autoaligner also subsequently corrects some of the segmenting errors by merging segments.
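The two stages can be sketched in Python as follows. This is an illustration only, not the aligner's Perl code: the function name `segment` is hypothetical, and the regular expression is a naive stand-in for the Europarl segmenter (it knows nothing about abbreviations).

```python
import re


def segment(text, split_sentences=True):
    """Stage 1: every line break is a boundary. Stage 2 (optional): sentences."""
    segments = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if split_sentences:
            # Naive stand-in: split after ., ! or ? followed by whitespace.
            segments.extend(re.split(r"(?<=[.!?])\s+", line))
        else:
            segments.append(line)
    return segments


sample = "First sentence. Second sentence.\nNew paragraph here."
print(segment(sample, split_sentences=True))
print(segment(sample, split_sentences=False))
```

Passing `split_sentences=False` corresponds to answering "n" at the "Segment text to sentences" prompt: only the line breaks remain as segment boundaries.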

I'm not sure how alignfactory differentiates between paragraph breaks and line breaks - the two are difficult to tell apart in most circumstances.

Michael J.W. Beijer wrote:
Here are the two files that I am using as a test pair in all of the different programs:

http://beijer.mx/storage/ep-00-10-05_nl.txt
http://beijer.mx/storage/ep-00-10-05_en.txt



Those EP plenary transcripts are available online in HTML, you might want to use the URLs directly in the aligner. For instance, yours are from 5 October 2000:
http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//TEXT%20CRE%2020001005%20ITEM-001%20DOC%20XML%20V0//EN&language=EN
and
http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//TEXT%20CRE%2020001005%20ITEM-001%20DOC%20XML%20V0//NL&language=EN
Note the language codes and the date in the URL. You can change the URL to get the transcript of any plenary in any language. Just feed the two URLs into LF aligner in "w" mode and you're set.
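If you need many such URL pairs, they can be generated. A minimal Python sketch, assuming the pattern of the two links above holds for other dates and languages (not every date has a plenary sitting); the names `URL_TEMPLATE` and `plenary_url` are made up for this example.

```python
import datetime

# Pattern copied from the two example links; only the date and the language
# code after "V0//" vary. The trailing "language=EN" is left as in the post.
URL_TEMPLATE = (
    "http://www.europarl.europa.eu/sides/getDoc.do"
    "?pubRef=-//EP//TEXT%20CRE%20{date}%20ITEM-001%20DOC%20XML%20V0//{lang}"
    "&language=EN"
)


def plenary_url(day, lang):
    """Build the transcript URL for a sitting date and a two-letter language code."""
    return URL_TEMPLATE.format(date=day.strftime("%Y%m%d"), lang=lang.upper())


for code in ("EN", "NL"):
    print(plenary_url(datetime.date(2000, 10, 5), code))
```

The two printed URLs are the ones to paste into the aligner in "w" mode.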

Your txt files seem to have some character corruption: "Mr President, Mrs PĂŠry, Commissioner," this should say "Mrs Péry". Using the aligner in "w" mode should help avoid these issues.




* With PDF files exported to txt using Acrobat Reader, there is an extra wrinkle: Acrobat Reader wraps lines with line breaks, and that means you end up with double line breaks at paragraph boundaries. So if you run the aligner on an exported pdf, single line breaks are ignored and double line breaks become segment boundaries. You could use the "p" filetype on txt files if you wanted this kind of "paragraph level segmenting" that ignores single line breaks.
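As an illustration of this paragraph-level mode, here is a short Python sketch (again not the aligner's own code; `paragraph_segments` is a hypothetical name). It treats blank lines as segment boundaries and joins the wrapped lines inside each paragraph.

```python
import re


def paragraph_segments(text):
    """Double line breaks end a segment; single line breaks become spaces."""
    normalized = text.replace("\r\n", "\n")
    paragraphs = re.split(r"\n\s*\n", normalized)
    # " ".join(p.split()) collapses the wrapped lines into one line of text.
    return [" ".join(p.split()) for p in paragraphs if p.strip()]


exported = "A line that\nwraps here.\n\nSecond paragraph\nalso wraps."
print(paragraph_segments(exported))
```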



Adam Bojan  Identity Verified
Poland
Local time: 07:30
Dutch to Polish
generated TMX accepted by Studio Nov 8, 2010

I have just done two test alignments and Studio accepted the TMX files without complaints. Thank you.
I also compared the alignment to SDLX align module and LF aligner 2.01 did it faster and in one case without any misalignments, better than SDLX align. It is also easier to use so now I have a very nice tool and I will use it quite a lot.

The issue with abbreviations is no problem to me, but thanks for the explanation.
greetings,
Adam


FarkasAndras
Local time: 07:30
English to Hungarian
TOPIC STARTER
SDLX Align Nov 8, 2010

Adam Bojan wrote:

I have just done two test alignments and Studio accepted the TMX files without complaints. Thank you.
I also compared the alignment to SDLX align module and LF aligner 2.01 did it faster and in one case without any misalignments, better than SDLX align.

Thanks, that's good to know. I'd appreciate a thumbs up or a review on the sourceforge page if you like it, I'm vain that way.

Does SDLX align have an autoaligner or does it just segment everything and expect you to do all the actual aligning like WinAlign? I'm not even sure where and how you can get SDLX Align, to be honest.



Michael Joseph Wdowiak Beijer  Identity Verified
United Kingdom
Local time: 06:30
Member (2009)
Dutch to English
I'll try "n" Nov 8, 2010

I'll try using "n", and see what happens. Thanks!

I looked, and the files I uploaded to my server look like they have character corruption, in Chrome and Firefox, but the files, when downloaded to the computer are fine. I don't really understand why this is.

Michael


FarkasAndras
Local time: 07:30
English to Hungarian
TOPIC STARTER
character encoding Nov 8, 2010

Michael J.W. Beijer wrote:

I'll try using "n", and see what happens. Thanks!

I looked, and the files I uploaded to my server look like they have character corruption, in Chrome and Firefox, but the files, when downloaded to the computer are fine. I don't really understand why this is.

Michael


That's definitely some character encoding issue. Encoding the files in utf-8 will probably fix it.
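Re-saving in UTF-8 can be scripted if there are many files. A minimal Python sketch with a hypothetical helper name, `resave_as_utf8`; you have to know (or guess) the encoding the file is currently in. The last lines show how UTF-8 bytes read under the wrong encoding turn "é" into two unrelated characters; ISO-8859-2 is picked here only as an assumption for the demonstration, not as a diagnosis of the files in question.

```python
def resave_as_utf8(src_path, dst_path, src_encoding):
    """Read a text file in a known encoding and write it back out as UTF-8."""
    with open(src_path, "r", encoding=src_encoding) as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)


# How the garbling arises: correct UTF-8 bytes decoded with the wrong codec.
original = "Mrs Péry"
garbled = original.encode("utf-8").decode("iso-8859-2")
print(garbled)
# Reversing the wrong step recovers the text, as long as nothing was lost.
print(garbled.encode("iso-8859-2").decode("utf-8"))
```

If a file looks fine locally but garbled in the browser, the bytes may be correct and only the encoding assumed by the browser wrong.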



Michael Joseph Wdowiak Beijer  Identity Verified
United Kingdom
Local time: 06:30
Member (2009)
Dutch to English
UTF-8 Nov 8, 2010

Fixed it by resaving the files as utf-8 and re-uploading them.

Strange, I downloaded them from the OPUS site (http://urd.let.rug.nl/tiedeman/OPUS/). I would have thought that they would be using utf-8.

Michael



Mette Melchior  Identity Verified
Sweden
Local time: 07:30
English to Danish
Nice web features! Nov 8, 2010

The idea with the web feature that allows you to automatically align EU documents based on the CELEX number is really nice. I will definitely try that out next time I need to align a reference document.

I also think the explanations in the readme files are great and very informative.

The general aligner readme file also includes links to the DGT TM and Europarl corpus. Just in case some of you aren't aware of it, the OPUS collection is another great resource where you can find bilingual material. (Please note that the Europarl corpus on this page is release 3. For the newest release you should download it from the page which is linked to in the readme file.)

