Use parallel corpora to train TM?
Thread poster: DoItInSpanish

DoItInSpanish  Identity Verified
United States
Local time: 04:53
English to Spanish
Nov 22, 2012

Dear Wordfast users,

I'm new to Wordfast and to machine-assisted translation in general.
I've read that people use parallel corpora (pairs of already-translated texts in the source and target language) to train a translation memory.
Is there a way to import my [thousands of] previously translated documents, along with their originals, to create a translation memory, using Wordfast?

Any advice would be greatly appreciated. If I'm headed in the wrong direction, please advise.

Thank you, and Happy Thanksgiving!

A.



Dominique Pivard  Identity Verified
Local time: 11:53
Finnish to French
Alignment Nov 22, 2012

DoItInSpanish wrote:
Is there a way to import my [thousands of] previously translated documents, along with their originals, to create a translation memory, using Wordfast?

The process of creating a TM from original source documents and their corresponding translations is called alignment. You can use the Wordfast Online Aligner available in Wordfast Anywhere: just create an account for yourself at www.freetm.com

You can find a video about it here:

http://wordfast.net/wiki/Wordfast_Anywhere_Videos

How to use Wordfast Online Aligner to create translation memories from old translations. By Yasmin Moslem - [2:59]:
http://youtu.be/Y53DS5xWqQg?hd=1

There are many other aligners available out there, both free ones and commercial ones.



Dominique Pivard  Identity Verified
Local time: 11:53
Finnish to French
translating vs. aligning Nov 22, 2012

DoItInSpanish wrote:
I've read that people use parallel corpora (pairs of already-translated texts in the source and target language) to train a translation memory.

Actually, you don't really "train" a translation memory; rather, you feed it. You can feed it either by translating new documents in a CAT tool (e.g. Wordfast) or by aligning old (already translated) documents with an aligner (see my previous reply). The easiest way to do it is by translating. It's also more motivating, as you usually get paid for doing it. This is why, if you are new to CAT tools, I would recommend you start feeding your TM by translating new documents. You can always align old documents later on, when you have nothing more exciting to do.



DoItInSpanish  Identity Verified
United States
Local time: 04:53
English to Spanish
TOPIC STARTER
massive alignment? Nov 22, 2012

Dominique Pivard wrote:

DoItInSpanish wrote:
I've read that people use parallel corpora (pairs of already-translated texts in the source and target language) to train a translation memory.

Actually, you don't really "train" a translation memory; rather, you feed it. You can feed it either by translating new documents in a CAT tool (e.g. Wordfast) or by aligning old (already translated) documents with an aligner (see my previous reply). The easiest way to do it is by translating. It's also more motivating, as you usually get paid for doing it. This is why, if you are new to CAT tools, I would recommend you start feeding your TM by translating new documents. You can always align old documents later on, when you have nothing more exciting to do.


Hi Dominique,

First off, thank you for your detailed answers.
I just tried the alignment tool on freetm.com. However, it only lets me feed it up to 3 pairs of documents. Is there another tool for massive alignment (thousands of document pairs at a time)? Perhaps the offline Wordfast version?

Also, is it possible to merge the output of this alignment with a pre-existing TM? We already have a TM, and we'd like to add the product of the alignment to it.

Thanks again,

A.



Dominique Pivard  Identity Verified
Local time: 11:53
Finnish to French
Massive alignment = $$$ Nov 22, 2012

DoItInSpanish wrote:
I just tried the alignment tool on freetm.com. However, it only lets me feed it up to 3 pairs of documents. Is there another tool for massive alignment (thousands of document pairs at a time)? Perhaps the offline Wordfast version?

I'm afraid "massive alignment" is a case of you get what you pay for. You may want to have a look at AlignFactory, which is among the best commercial aligners on the market.

DoItInSpanish wrote:
Also, is it possible to merge the output of this alignment with a pre-existing TM? We already have a TM, and we'd like to add the product of the alignment to it.

The output of an alignment is (or can be) a TM (typically in the industry-standard TMX format), and you can normally merge TMs. At least Wordfast will let you do it.
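For readers who prefer to script the merge rather than do it inside a CAT tool, here is a minimal sketch using only Python's standard library. It is not how Wordfast merges TMs; it simply appends the translation units of one TMX file to another and skips exact duplicates. The function names (`tu_key`, `merge_tmx`) and file paths are hypothetical, it assumes plain TMX files with a `<body>` of `<tu>` elements, and any DOCTYPE line in the input is not carried over to the output.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tu_key(tu):
    """Return a hashable key built from the (language, text) pairs of one <tu>."""
    pairs = []
    for tuv in tu.findall("tuv"):
        # TMX 1.4 uses xml:lang; some older files use a plain lang attribute.
        lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
        seg = tuv.find("seg")
        text = "".join(seg.itertext()).strip() if seg is not None else ""
        pairs.append((lang, text))
    return tuple(sorted(pairs))


def merge_tmx(existing_path, aligned_path, out_path):
    """Append units from aligned_path to existing_path, skipping exact duplicates.

    Returns the number of translation units that were added.
    """
    tree = ET.parse(existing_path)
    body = tree.getroot().find("body")
    seen = {tu_key(tu) for tu in body.findall("tu")}
    added = 0
    for tu in ET.parse(aligned_path).getroot().find("body").findall("tu"):
        key = tu_key(tu)
        if key not in seen:
            seen.add(key)
            body.append(tu)
            added += 1
    tree.write(out_path, encoding="utf-8", xml_declaration=True)
    return added
```

A call such as `merge_tmx("existing.tmx", "aligned.tmx", "merged.tmx")` would leave both inputs untouched and write the combined memory to a third file, which is the safer choice while you are still checking the alignment quality.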



Richard Hill  Identity Verified
Mexico
Local time: 04:53
Member (2011)
Spanish to English
I recommend SuperAlign and LF Aligner Nov 23, 2012

Both can be found here:
http://sourceforge.net/projects/superalign/

http://sourceforge.net/projects/aligner/



Richard Hill  Identity Verified
Mexico
Local time: 04:53
Member (2011)
Spanish to English
EU TM Nov 23, 2012

You may already have it, but just in case, here's the DGT-TM:
http://ipsc.jrc.ec.europa.eu/index.php?id=197#c2744



Siegfried Armbruster  Identity Verified
Germany
Local time: 10:53
Member (2004)
English to German
Money won't always give you the best results Nov 23, 2012

Dominique Pivard wrote:
I'm afraid "massive alignment" is a case of you get what you pay for. You may want to have a look at AlignFactory, which is among the best commercial aligners on the market.


Yes, AlignFactory is good and easy to use. However, after massive testing of various aligners, the results are clear. The free LF Aligner http://sourceforge.net/projects/aligner/ gives better results in all scenarios.

I used LF Batch Aligner to produce massive TMs; it has a better recognition rate and produces more, and more reliable, aligned segments than the quite expensive AlignFactory.



Dominique Pivard  Identity Verified
Local time: 11:53
Finnish to French
LF Aligner vs. AlignFactory Nov 23, 2012

Siegfried Armbruster wrote:
Yes, AlignFactory is good and easy to use. However, after massive testing of various aligners, the results are clear. The free LF Aligner http://sourceforge.net/projects/aligner/ gives better results in all scenarios.

I used LF Batch Aligner to produce massive TMs; it has a better recognition rate and produces more, and more reliable, aligned segments than the quite expensive AlignFactory.

Good point: I have never made (nor seen) formal comparisons of LF Aligner vs. AlignFactory. Maybe I'll make one when I have some time. Have you tested whether there are differences in the maximum size of documents they can handle? I remember that past a certain threshold, AlignFactory would no longer be able to cope and make a proper alignment, so very large documents needed to be broken down into more manageable chunks.



Siegfried Armbruster  Identity Verified
Germany
Local time: 10:53
Member (2004)
English to German
Formal comparison Nov 23, 2012

Dominique Pivard wrote:
Good point: I have never made (nor seen) formal comparisons of LF Aligner vs. AlignFactory.


We compared various aligners using an identical batch of documents, including PDF files, converted PDF files and standard Word files. After running the alignment with each tool, we manually checked the resulting aligned segments and documented:

- the number of correctly aligned segments,
- the number of discarded segments and
- the number of segments containing errors.

We also tried to identify the source of the errors.
My current plan is to do some additional tests and to present the results in the form of a webinar in January 2013.
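For anyone who wants to run a similar comparison, the three counts listed above can be tallied with a few lines of Python once each segment has been reviewed by hand. This is only a sketch under assumptions: the label names (`correct`, `discarded`, `error`) and the `summarize` function are hypothetical, and it says nothing about how the original test was actually recorded.

```python
from collections import Counter

# Hypothetical labels a reviewer assigns to each segment of one aligner's output.
LABELS = ("correct", "discarded", "error")


def summarize(labels):
    """Return {label: (count, share of all reviewed segments)} for one aligner."""
    counts = Counter(labels)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError("unexpected labels: %s" % sorted(unknown))
    total = sum(counts.values())
    return {
        label: (counts[label], counts[label] / total if total else 0.0)
        for label in LABELS
    }


if __name__ == "__main__":
    # Tiny made-up review sheet, for illustration only.
    review = ["correct", "correct", "error", "discarded"]
    for label, (count, share) in summarize(review).items():
        print(f"{label}: {count} ({share:.0%})")
```

Running the same function on the review sheet of each aligner gives figures that can be compared side by side, provided every tool was fed the same batch of documents.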


FarkasAndras
Local time: 10:53
English to Hungarian
some info Nov 23, 2012

Dominique Pivard wrote:

Siegfried Armbruster wrote:
Yes, AlignFactory is good and easy to use. However, after massive testing of various aligners, the results are clear. The free LF Aligner http://sourceforge.net/projects/aligner/ gives better results in all scenarios.

I used LF Batch Aligner to produce massive TMs; it has a better recognition rate and produces more, and more reliable, aligned segments than the quite expensive AlignFactory.

Good point: I have never made (nor seen) formal comparisons of LF Aligner vs. AlignFactory. Maybe I'll make one when I have some time. Have you tested whether there are differences in the maximum size of documents they can handle? I remember that past a certain threshold, AlignFactory would no longer be able to cope and make a proper alignment, so very large documents needed to be broken down into more manageable chunks.

I designed LF Aligner specifically to handle very large files. In principle, there should be no upper limit. I tested it up to about 400,000 segments (in one pair of txt files). The key is that a) the files are not loaded into memory in full at any point and b) very large files (by default, 15,000 segments and up) are chopped up and aligned in pieces, then reassembled. Of course, if your source material is made up of many documents, you should align the files in batch mode instead of merging them and aligning them as one document pair.
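To illustrate the general idea of points a) and b), here is a minimal Python sketch. It is not LF Aligner's code. It streams segments lazily, cuts both sides into pieces of at most 15,000 segments, hands each pair of pieces to an aligner function, and yields the results in order. All names are hypothetical, `naive_align` is only a positional placeholder for a real alignment engine, and cutting both files at the same segment count is a simplification: a real tool has to pick split points that correspond in both texts.

```python
from itertools import islice, zip_longest


def iter_chunks(segments, chunk_size=15000):
    """Yield lists of at most chunk_size segments without reading everything first."""
    it = iter(segments)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def align_in_pieces(source, target, align_chunk, chunk_size=15000):
    """Align two segment streams piece by piece and yield the pairs in order."""
    src_chunks = iter_chunks(source, chunk_size)
    tgt_chunks = iter_chunks(target, chunk_size)
    # Only one chunk per side is held in memory at any time.
    for src, tgt in zip_longest(src_chunks, tgt_chunks, fillvalue=()):
        yield from align_chunk(src, tgt)


def naive_align(src, tgt):
    """Placeholder aligner: pairs segments by position, padding with ''."""
    return zip_longest(src, tgt, fillvalue="")


if __name__ == "__main__":
    source = ["One.", "Two.", "Three."]
    target = ["Uno.", "Dos."]
    for pair in align_in_pieces(source, target, naive_align, chunk_size=2):
        print(pair)
```

Because file objects are themselves lazy iterators over lines, passing `(line.rstrip("\n") for line in open_file)` for each side keeps memory use bounded by the chunk size rather than by the file size.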

BTW, LF Aligner doesn't skip any segments. One could program it to discard segments that the autoalignment engine scored as low-confidence matches, but I didn't implement that feature because I'm not sure how reliable that score is. Even the segments that are not paired with anything are left in. Currently, they are written even to TMX files; future versions will likely leave them out of TMX files because CAT tools just report them as errors.
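As an illustration of leaving unpaired segments out of a TMX export, here is a minimal Python sketch. It is not taken from LF Aligner. It assumes the aligned output is available as (source, target) string pairs in which an unpaired segment has an empty counterpart; the function name `pairs_to_tmx` and the header values such as `creationtool` are made up for the example.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def pairs_to_tmx(pairs, src_lang, tgt_lang, out_path):
    """Write (source, target) pairs to a TMX file, leaving out unpaired segments.

    Returns the number of pairs that were skipped.
    """
    root = ET.Element("tmx", version="1.4")
    ET.SubElement(root, "header", {
        "creationtool": "example-script",  # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(root, "body")
    skipped = 0
    for src, tgt in pairs:
        # A segment with no counterpart would be flagged as an error by CAT tools.
        if not src.strip() or not tgt.strip():
            skipped += 1
            continue
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    ET.ElementTree(root).write(out_path, encoding="utf-8", xml_declaration=True)
    return skipped
```

Returning the skipped count makes it easy to check how much material was dropped, so the unpaired segments can still be reviewed in a tab-delimited or spreadsheet export if needed.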

[Edited at 2012-11-23 20:01 GMT]

