LF Aligner questions
Thread poster: MikeTrans
MikeTrans
Germany
Italian to German
Jun 29, 2012

Hi,

LF Aligner, the open-source aligner by FarkasAndras, is well suited to very large documents and gives very good alignment results.
As input, it can handle txt, doc, docx, xls, tmx and pdf files.
The output formats are tab-delimited text, tmx and xls (something most commercial products don't offer!).
It also has special filters for reviewing the alignment in Excel.

If you are interested, give it a try and download it at

http://sourceforge.net/projects/aligner/

My questions about LF Aligner:

I have two very large documents, source and target, in txt format. LF Aligner can build a tab-delimited text file from them, but I would like to tell LF Aligner that every line ending with a carriage return in the source document corresponds *exactly* to the line at the same position in the target document. In fact, I don't want to align at all, just merge these two huge documents into a single tab-delimited file. How can I do that, so that LF Aligner doesn't try to align and skip some sentences?

Note: if these docs were somewhat smaller, I would use UltraEdit, but with these giants I only get an "Out of Memory" error.

Thanks very much for your feedback,
Mike


[Edited at 2012-06-29 17:29 GMT]


FarkasAndras
English to Hungarian
generate tabbed Jun 29, 2012

Well, the aligner has no such feature. It's not a "normal" use case.
I already have the code necessary for this, so here you go:
https://dl.dropbox.com/u/16377950/maketabbed.exe

It's obviously pretty spartan, but it should work (UTF-8 only, and it doesn't do any preprocessing. If your input file has tab characters in it, the output file will include those as well.)
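For readers who cannot run the executable, here is a minimal Python sketch of the same idea: pair line N of the source file with line N of the target file and write them out separated by a tab. It is not the source code of maketabbed.exe; the function name `make_tabbed` and the file names in the usage comment are hypothetical. Like the tool, it assumes UTF-8 input and does no preprocessing, so any tab characters already in the input end up in the output. Both files are streamed line by line, so memory use should stay small even for very large inputs.

```python
from itertools import zip_longest


def make_tabbed(src_path, tgt_path, out_path):
    """Write 'source<TAB>target' for each pair of lines; return the pair count."""
    pairs = 0
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt, \
         open(out_path, "w", encoding="utf-8") as out:
        # zip_longest yields None for the shorter file once it runs out,
        # which lets us detect a line-count mismatch instead of hiding it.
        for s, t in zip_longest(src, tgt):
            if s is None or t is None:
                raise ValueError(
                    f"line counts differ: one file ended after {pairs} lines"
                )
            out.write(s.rstrip("\n") + "\t" + t.rstrip("\n") + "\n")
            pairs += 1
    return pairs


# Hypothetical usage:
# make_tabbed("source.txt", "target.txt", "tabbed.txt")
```

Raising an error on a length mismatch is a design choice made for this sketch: silently stopping at the shorter file would produce a file that looks complete but is not.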

By the way, I use Notepad++ for large text files. It has worked really well for me up to a couple hundred MB, which is as far as I needed it to go - although I don't know if it offers an easy way to merge two files into a single tabbed text file.


MikeTrans
Germany
Italian to German
TOPIC STARTER
Thanks! Jun 30, 2012

Hi Farkas,

Thanks very much. The problem with UltraEdit when creating tabbed documents is that I first have to create columns, and UE pads them with the necessary trailing spaces. This can blow a document up to 20 times its size or more, not to mention the RAM requirements. I had to kill everything with the Task Manager and purge all my temp directories, which had grown to several gigabytes...

The point of all this:
When I build a very large database (EMEA, DGT, EuroParl etc., or even chunks extracted from them), I always build two versions of it:

1) One purged of duplicates and other garbage, to be used by a CAT tool
2) One with the complete plain text, not necessarily well aligned, which I send to XBench.

XBench has the great feature of showing search hits in context (it displays +/- 10 TM segments around any hit). This helps a lot with DBs that are not well aligned or where the original text is broken because of conversion limits (the EMEA medical DB is a disaster in this regard, but still a big help for me).

So, to achieve point 2), it's enough to create a tab-delimited text file that can be sent to XBench. What's important is to check that both files have the same number of lines (in my case, both language versions of EMEA have 1,116,368 lines). Note that LF Aligner drops only about 150,000 segments if I align these files, and the result is very good for my purpose 1). Once purged, only 372,000 segments remain!
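A line-count check of this kind can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `count_lines` and the file names in the usage comment are hypothetical, and the files are assumed to be UTF-8 (or another encoding in which the byte 0x0A only ever means a line feed). The file is read in fixed-size binary chunks, so it is never loaded into memory as a whole.

```python
def count_lines(path, chunk_size=1 << 20):
    """Count lines by counting line-feed bytes, reading 1 MiB at a time."""
    count = 0
    last = b""
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            count += chunk.count(b"\n")
            last = chunk[-1:]
    # A final line without a trailing line feed still counts as a line.
    if last and last != b"\n":
        count += 1
    return count


# Hypothetical usage:
# if count_lines("emea.de.txt") != count_lines("emea.fr.txt"):
#     print("The two files do not have the same number of lines")
```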

Anyhow, I very much appreciate your file link and your time. LF will surely be a big help if used together with my Olifant TM manager or even when preparing TMX files to be used for CAT hopping. I still have to read your docs and do some experiments for such scenarios.

Thank you very much!
Mike



[Edited at 2012-06-30 10:34 GMT]


FarkasAndras
English to Hungarian
TMX -> tabbed Jun 30, 2012

I use xbench in a similar way, so I get the use case.
I'm not sure how you end up with two separate txt files, though. You can take a tmx file such as the ones from OPUS, the DGT-TM etc. and convert it to a tabbed txt file in one step with a tool from the grab bag on sourceforge.
I don't use the EMEA corpus but I'd assume they also provide either tabbed text or TMX...?
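The internals of the grab bag tool are not shown in this thread. As a rough sketch of what a TMX-to-tabbed conversion involves, assuming a well-formed TMX file with `<tu>`, `<tuv>` and `<seg>` elements, something like the following Python could be used. The function name `tmx_to_tabbed` and its parameters are hypothetical. Language codes are compared exactly after lowercasing, so "en" will not match "EN-GB". Whitespace inside a segment, including tabs and line breaks, is collapsed to single spaces so that each pair stays on one line.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_tabbed(tmx_path, out_path, src_lang, tgt_lang):
    """Write 'source<TAB>target' for every <tu> that has both languages."""
    src_lang, tgt_lang = src_lang.lower(), tgt_lang.lower()
    written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        # iterparse streams the file instead of building the whole tree first.
        for _event, elem in ET.iterparse(tmx_path, events=("end",)):
            if elem.tag != "tu":
                continue
            segs = {}
            for tuv in elem.iter("tuv"):
                # Older TMX versions use a plain "lang" attribute.
                lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
                seg = tuv.find("seg")
                if seg is not None:
                    # itertext() keeps the text inside inline tags.
                    text = "".join(seg.itertext())
                    segs[lang] = " ".join(text.split())
            if src_lang in segs and tgt_lang in segs:
                out.write(segs[src_lang] + "\t" + segs[tgt_lang] + "\n")
                written += 1
            elem.clear()  # drop the unit's children to limit memory use
    return written
```

Translation units that lack one of the two languages are skipped rather than written with an empty column; that is a choice made for this sketch, not necessarily what the sourceforge tool does.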


MikeTrans
Germany
Italian to German
TOPIC STARTER
@Farkas, Jun 30, 2012

I've just tried maketabbed.exe with about 40 lines in each UTF-8 txt file.
I'm getting the error:

Undefined subroutine &main::abort called at script/maketabbed.pl line 11, line 3.

Possible cause:
Does maketabbed have to be placed in a special directory? Maybe in the "other tools" folder of LF Aligner?
I had to take some paragraph bullets out of my txt, but there are still some slashes (web page addresses). Can maketabbed handle these, or special characters like \ @ (c) ö etc.?
If not, no problem: I can run the text through a macro of mine that replaces all of these and changes them back afterwards.

In LF Aligner, I think that if I choose "Revert to paragraph segments" in the dialog box, each carriage return should be treated as the end of a segment (giving exactly the tabbed text I want). Or is that unrelated?

Thanks,
Mike


FarkasAndras
English to Hungarian
error handling Jun 30, 2012

The problem is that the input file you specified can't be opened for some reason. Maybe it doesn't exist, or it has accented letters, special characters or spaces in the file name or path.
I coded this tool in 5 minutes and the error handling is... not very robust, that's why you didn't get a more sensible error message.


MikeTrans
Germany
Italian to German
TOPIC STARTER
Opus download: tmx or txt language files Jun 30, 2012

FarkasAndras wrote:

I use xbench in a similar way, so I get the use case.
I'm not sure how you end up with two separate txt files, though. You can take a tmx file such as the ones from OPUS, the DGT-TM etc. and convert it to a tabbed txt file in one step with a tool from the grab bag on sourceforge.
I don't use the EMEA corpus but I'd assume they also provide either tabbed text or TMX...?


Long ago I downloaded a TMX compilation of the EMEA (European Medicines Agency) corpus from Opus Corpora, for En-De and Fr-De. You can choose TMX, but it requires a lot of editing to correct line errors that keep the file from importing into anything. Fortunately, XBench displayed the errors with their line numbers, so I was able to correct them and get 303,000+ segments for Fr-De. I remember that it took me more than 4 hours, even with intelligent search/replace operations in the TMX.

After this experience, I would strongly recommend downloading the native txt files, which come separately for each language, in the so-called "Moses format". Those can then easily be aligned with LF Aligner after converting them to UTF-8.

Without LF Aligner, I just wouldn't know what to do with these two huge separate files! Splitting them into chunks of 65,000 segments that Excel can handle would still give me 40 files (20 per language), and without Unicode support. Not the best approach, but I don't see any other.
Once you have TMX or tab-text, XBench can import/export such files without problems.

[EDITED]
maketabbed.exe:
The files I was trying to open were in a path with a long name (> 8 characters); I will try putting them in a directory with a short name directly under C:, and also renaming the files.
It works now!

Mike


[Edited at 2012-06-30 18:05 GMT]


mikhailo
English to Russian
re May 4, 2015

FarkasAndras

Because different source files require slightly different segmentation, I prefer to segment text with sets of regexes in a text editor. When segmenting, I split paragraph numbering, bullets etc. off from the text but do not remove them, because they are good markers for manual alignment.

Is there any way not to segment such files (for example, by giving them a special extension such as stx(t), for "segmented text")?
And how can I continue working with files created by the aligner (i.e. open an existing project)?

Another good idea: add a similarity index to each segment in the Excel output.
This would make it possible to quickly extract only the good segments.


FarkasAndras
English to Hungarian
easy May 4, 2015

mikhailo wrote:

FarkasAndras

Because different source files require slightly different segmentation, I prefer to segment text with sets of regexes in a text editor. When segmenting, I split paragraph numbering, bullets etc. off from the text but do not remove them, because they are good markers for manual alignment.

Is there any way not to segment such files (for example, by giving them a special extension such as stx(t), for "segmented text")?
And how can I continue working with files created by the aligner (i.e. open an existing project)?

Another good idea: add a similarity index to each segment in the Excel output.
This would make it possible to quickly extract only the good segments.


1) Just reject sentence segmenting. You can also disable it in setup IIRC.
2) Depends on what you mean by continue to work with files. Launch other_tools/alignedit.exe to review/edit a tabbed txt.
3) Set 'Remove match confidence value' to n in the setup
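
Once the confidence value is kept in the output, the "good" segments can also be extracted outside Excel. The sketch below is only an illustration and makes assumptions about the file layout that are not confirmed in this thread: it assumes a UTF-8 tabbed file with the source in the first column, the target in the second and a numeric confidence value in the third (`score_col=2`). The function name `keep_confident` is hypothetical, and the threshold has to be chosen by looking at real output, since the scale of the value is not described here.

```python
def keep_confident(in_path, out_path, threshold, score_col=2):
    """Copy source/target pairs whose score is at least `threshold`."""
    kept = 0
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as out:
        for line in src:
            cols = line.rstrip("\n").split("\t")
            try:
                score = float(cols[score_col])
            except (IndexError, ValueError):
                continue  # skip rows without a usable numeric score
            if score >= threshold:
                # Assumed layout: column 0 is source, column 1 is target.
                out.write(cols[0] + "\t" + cols[1] + "\n")
                kept += 1
    return kept
```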


mikhailo
English to Russian
re May 5, 2015

FarkasAndras wrote:
1) Just reject sentence segmenting. You can also disable it in setup IIRC.
2) Depends on what you mean by continue to work with files. Launch other_tools/alignedit.exe to review/edit a tabbed txt.
3) Set 'Remove match confidence value' to n in the setup


Thanks a lot for your answers.

