Pages in topic:   [1 2] >
Convert paragraph TM to sentence TM
Thread poster: Samuel Murray

Samuel Murray  Identity Verified
Netherlands
Local time: 16:35
Member (2006)
English to Afrikaans
+ ...
Oct 28, 2013

Hello everyone

I received a TM from the client, but the segmentation in the TM is paragraph-based, and it would yield many more matches if the TM could be segmented by sentence instead. Do you know of free utilities that can convert a paragraph-based TM into a sentence-based TM? The conversion doesn't have to be perfect.

The only tool that I know of that can do this (and which I can't use right now) is posegment from the Translate Toolkit, which converts a paragraph-based PO file into a sentence-based PO file.

Does anyone know of such a tool?

Thanks
Samuel


FarkasAndras
Local time: 16:35
English to Hungarian
+ ...
Any aligner Oct 28, 2013

Samuel Murray wrote:

Hello everyone

I received a TM from the client, but the segmentation in the TM is paragraph-based, and it would yield many more matches if the TM could be segmented by sentence instead. Do you know of free utilities that can convert a paragraph-based TM into a sentence-based TM? The conversion doesn't have to be perfect.

The only tool that I know of that can do this (and which I can't use right now) is posegment from the Translate Toolkit, which converts a paragraph-based PO file into a sentence-based PO file.

Does anyone know of such a tool?

Thanks
Samuel


I don't think there's a tool that was designed specifically for this.
Workaround: Export to TMX, convert to tabbed txt, separate into two files, align them with the aligner of your choice. There's a tool for TMX->txt conversion in my grab bag of goodies on sourceforge if you need it.
BTW Hunalign has a feature that maintains the alignment of paragraphs. In principle, you could insert <p> tags between paragraphs after you convert the TM to txt and run hunalign on the texts that way. This would make sure that each paragraph is handled as a separate unit and no sentence is (mis)aligned with a sentence from a neighbouring paragraph. In practice it would be a bit of a hassle to do and more than likely not worth bothering with.
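
The first steps of that workaround (TMX to tabbed text) can be sketched in Python. This is only an illustration, not the SourceForge tool mentioned above: the function names `tmx_to_pairs` and `pairs_to_tabbed` are made up for this example, and it assumes a plain TMX file whose language codes match the ones you pass in exactly (for example "en", not "en-US").

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) text pairs from a TMX document given as a string."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        texts = {}
        for tuv in tu.findall("tuv"):
            # Newer TMX files use xml:lang; older ones use a plain "lang" attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also keeps the text found inside inline tags.
                texts[lang] = "".join(seg.itertext())
        src = texts.get(src_lang.lower())
        tgt = texts.get(tgt_lang.lower())
        if src and tgt:
            pairs.append((src, tgt))
    return pairs


def pairs_to_tabbed(pairs):
    """One unit per line, source and target separated by a tab."""
    def flat(text):
        # Collapse tabs and line breaks inside a unit so each unit stays on one line.
        return " ".join(text.split())
    return "\n".join(f"{flat(s)}\t{flat(t)}" for s, t in pairs)
```

Writing the first and second element of each pair to two separate files gives the two texts that an aligner expects.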



Samuel Murray  Identity Verified
Netherlands
Local time: 16:35
Member (2006)
English to Afrikaans
+ ...
TOPIC STARTER
POsegment Oct 28, 2013

Samuel Murray wrote:
The only tool that I know of that can do this (and which I can't use right now) is posegment from the Translate Toolkit, which converts a paragraph-based PO file into a sentence-based PO file.


Actually, I was able to use posegment after all. The latest version of posegment is only available as source code, and the web site does not link to old versions, but the previous version has binaries and is still available if you know where to look:

http://sourceforge.net/projects/translate/files/Translate%20Toolkit/1.9.0/


FarkasAndras
Local time: 16:35
English to Hungarian
+ ...
Does it autoalign? Oct 28, 2013

The question is whether it autoaligns the segments or just chops up the paragraphs and relies on the optimistic assumption that each paragraph will be segmented identically in the two languages. If it doesn't attempt to do any autoalignment and the paragraphs are long, there could be quite a few misalignments after a simple sentence segmentation. Of course the problem is not nearly as serious if paragraphs tend to be made up of 2 or 3 sentences.
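
To make the "optimistic assumption" concrete, here is a minimal Python sketch of the simple approach: cut both sides of a paragraph pair into sentences, pair them by position, and fall back to the whole paragraph when the counts differ. It is not a description of how POsegment works internally. The splitting rule is deliberately naive (it only looks for ASCII capitals after the full stop), and the function names are invented for this example.

```python
import re

# Naive rule: split after . ! or ? when followed by whitespace and an ASCII capital.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(paragraph):
    """Split a paragraph into sentences using the naive rule above."""
    return [s for s in _SENTENCE_END.split(paragraph.strip()) if s]


def segment_pair(source, target):
    """Pair sentences by position; keep the paragraph whole if the counts differ."""
    src = split_sentences(source)
    tgt = split_sentences(target)
    if len(src) == len(tgt):
        return list(zip(src, tgt))
    # Different sentence counts: pairing by position would misalign, so do not split.
    return [(source, target)]
```

The count check catches the obvious failures, but not the case where both sides happen to have the same number of sentences while one sentence was merged and another was split. That is where a real autoaligner would make the difference.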

xxxnrichy
France
Local time: 16:35
French to Dutch
+ ...
I would also convert into tab aligned txt Oct 28, 2013

That is, into a Wordfast TM. Then open it in Excel, copy the two columns into two separate Word files, convert each table to text, and align the two files with an aligner. I did it once; it was not a lot of work.


Samuel Murray  Identity Verified
Netherlands
Local time: 16:35
Member (2006)
English to Afrikaans
+ ...
TOPIC STARTER
@Farkas Oct 28, 2013

FarkasAndras wrote:
The question is whether it autoaligns the segments or just chops up the paragraphs and relies on the optimistic assumption that each paragraph will be segmented identically in the two languages.


If you're talking about POsegment, then it is my understanding that some linguistic differences are taken into account by the program. However, the purpose of POsegment is to increase the usefulness of a reference TM and not to produce a perfect working TM that can be trusted as-is.

http://docs.translatehouse.org/projects/translate-toolkit/en/latest/commands/posegment.html



Samuel Murray  Identity Verified
Netherlands
Local time: 16:35
Member (2006)
English to Afrikaans
+ ...
TOPIC STARTER
Realignment Oct 28, 2013

FarkasAndras wrote:
Workaround: Export to TMX, convert to tabbed txt, separate into two files, align them with the aligner of your choice. There's a tool for TMX->txt conversion in my grab bag of goodies on sourceforge if you need it.


nrichy wrote:
I would also convert into tab aligned txt, that is, into a Wordfast TM. Then open it in Excel, copy the two columns into two separate Word files, convert each table to text, and align the two files with an aligner. I did it once; it was not a lot of work.


The way I usually do this is indeed similar to nrichy's method, namely to use Wordfast's alignment tools. I just don't use Excel for it, but convert the Wordfast TM to a table in MS Word itself. However, in this case the TM was so long and complex that the task would have taken me hours. Wordfast's own aligner isn't very good -- it doesn't respect paragraphs, so if you have a missegmented chunk of text somewhere in the middle, your entire table becomes misaligned.

I assume Farkas' aligner would be better.


xxxnrichy
France
Local time: 16:35
French to Dutch
+ ...
... Oct 28, 2013

Samuel Murray wrote:

Wordfast's own aligner isn't very good -- it doesn't respect paragraphs, so if you have a missegmented chunk of text somewhere in the middle, your entire table becomes misaligned.



I still use Plustools, with which one can correct missegmentations (combine sentences or cut them in two parts).

[Edited at 2013-10-28 20:46 GMT]



Heartsome Support
Local time: 23:35
Reduce your TM match or realign Oct 29, 2013

You can lower your TM match threshold to 40%, which may help you get matches from your TM. In addition, you can search your TM as you translate each sentence.

Or you can convert your TMX to a Doc file in table form. Copy the source column and the target column into two files, then align them with ABBYY Aligner, which is very intelligent.



Didier Briel  Identity Verified
France
Local time: 16:35
Member (2007)
English to French
+ ...
OmegaT Oct 29, 2013

Samuel Murray wrote:
I received a TM from the client, but the segmentation in the TM is paragraph-based, and it would yield many more matches if the TM could be segmented by sentence instead. Do you know of free utilities that can convert a paragraph-based TM into a sentence-based TM?

You can use OmegaT:
- Create a paragraph-based project.
- Quit OmegaT.
- Put your TMX in the omegat folder, named project_save.tmx.
- Load the project, and change it to a sentence-based project.
- Change at least one translation (which means you must have at least one segment to translate in the project) and save.
- Your TMX will have been segmented.


The conversion doesn't have to be perfect.

The conversion uses the segmentation rules, so it can be fine tuned.

Didier



Samuel Murray  Identity Verified
Netherlands
Local time: 16:35
Member (2006)
English to Afrikaans
+ ...
TOPIC STARTER
@Didier Oct 29, 2013

Didier Briel wrote:
You can use OmegaT:
- Create a paragraph-based project.
- Quit OmegaT.
- Put your TMX in the omegat folder, named project_save.tmx.
- Load the project, and change it to a sentence-based project.
- Change at least one translation (which means you must have at least one segment to translate in the project) and save.
- Your TMX will have been segmented.


I did not know that OmegaT could do that. I had thought that if one changes from paragraph segmentation to sentence segmentation mid-project, one would have to retranslate all the sentences.

PS: No... I tried that trick, but the TMX is still paragraph-based after that procedure, even in segments that contain nothing fancier than plain sentences, and the same number of sentences too.



Samuel Murray  Identity Verified
Netherlands
Local time: 16:35
Member (2006)
English to Afrikaans
+ ...
TOPIC STARTER
@Heartsome Oct 29, 2013

Heartsome Support wrote:
You can lower your TM match threshold to 40%, which may help you get matches from your TM. In addition, you can search your TM as you translate each sentence.


Unfortunately my current CAT tool's lowest match value is 50% (a design decision made several years ago by the developer who thought that most people wanted more speed). Besides, my current CAT tool displays only one TM match (the highest one) unless I press a button to see the rest of them, and this makes the "low fuzzy match percentage" trick fairly useless.

For example:

TM contains: "AAA will not be synchronized. The BBB CCC application has not properly separated the Full Name into First Name and Last Name fields for DDD. Please open and resave the DDD."

If my source text contains "AAA will not be synchronized." (which is a 100% match for one of the sentences in the TM segment), then I probably won't get that TM entry as a fuzzy match. And if the TM additionally contains segments like "BBB will be synchronized." and "CCC will never be synchronized." etc, then the TM segment will practically never appear.
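
The effect can be illustrated with a rough Python sketch. It uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure; this is an assumption made for illustration only, not the fuzzy-match algorithm of any particular CAT tool, and the `similarity` helper is a made-up name.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity between 0 and 1 (a stand-in, not a CAT fuzzy score)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


query = "AAA will not be synchronized."
paragraph_unit = (
    "AAA will not be synchronized. The BBB CCC application has not properly "
    "separated the Full Name into First Name and Last Name fields for DDD. "
    "Please open and resave the DDD."
)
sentence_unit = "BBB will be synchronized."

# Print the score of the query against each TM unit.
for unit in (paragraph_unit, sentence_unit):
    print(f"{similarity(query, unit):.2f}  {unit[:40]}")
```

With this measure the paragraph-sized unit scores far below one half, because the query accounts for only a small fraction of the combined length, while the short near-miss sentence scores well above one half. A tool that shows only the best match above 50% would therefore offer the near miss and hide the unit that contains the exact sentence.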

Or you can convert your TMX to a Doc file in table form. Copy the source column and the target column into two files, then align them with ABBYY Aligner, which is very intelligent.


I can do the alignment trick if the aligner does not attempt to match sentences from paragraph B with sentences from paragraph C, simply because paragraph A is a snafu in one of the languages. PlusTools (from Wordfast Classic) unfortunately does that, which makes it practically useless, particularly if one of the languages has poor end-of-segment markings in some paragraphs.

Anyway, thanks everyone -- I'm quite happy that I was able to get POsegment working. For my language pair, it works fine. I just have to convert the TM to PO beforehand.
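
For the TM-to-PO step, a minimal Python sketch of writing source/target pairs as a PO file could look like the following. It is a hypothetical helper written for this example (the names `po_quote` and `pairs_to_po` are made up), not the converter that ships with the Translate Toolkit, and it only covers the basic `msgid`/`msgstr` layout and escaping.

```python
def po_quote(text):
    """Escape a string and wrap it in double quotes for a PO msgid or msgstr."""
    escaped = (
        text.replace("\\", "\\\\")   # backslashes first, so later escapes stay intact
        .replace('"', '\\"')
        .replace("\t", "\\t")
        .replace("\n", "\\n")
    )
    return f'"{escaped}"'


def pairs_to_po(pairs):
    """Build PO file content from (source, target) pairs."""
    # Minimal header entry declaring the character set.
    lines = [
        'msgid ""',
        'msgstr ""',
        '"Content-Type: text/plain; charset=UTF-8\\n"',
        "",
    ]
    seen = set()
    for source, target in pairs:
        if source in seen:
            # A PO file may not contain the same msgid twice; keep the first one.
            continue
        seen.add(source)
        lines += [f"msgid {po_quote(source)}", f"msgstr {po_quote(target)}", ""]
    return "\n".join(lines)
```

Note that duplicate source paragraphs with different translations are silently reduced to the first one here, which is acceptable for a reference TM but loses information.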


FarkasAndras
Local time: 16:35
English to Hungarian
+ ...
Misalignment Oct 29, 2013

Samuel Murray wrote:

I can do the alignment trick if the aligner does not attempt to match sentences from paragraph B with sentences from paragraph C, simply because paragraph A is a snafu in one of the languages. PlusTools (from Wordfast Classic) unfortunately does that, which makes it practically useless, particularly if one of the languages has poor end-of-segment markings in some paragraphs.

Yes, that's the crux of the matter. The solution is a system that keeps the paragraphs separated and thus 'localizes' any misalignments (which is what POsegment does) or a decent autoaligner (e.g. LF Aligner) - or a combination of the two (which can be achieved by adding <P> tags and running the texts through hunalign manually). Another option is a manual review of course, but that's usually a huge waste of time unless the text is very short.
I know you have a solution you're happy with, I'm just noting this for posterity as this is an issue that other people may face every now and then.
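
For anyone who wants to try the paragraph-marker route, the preparation step might be sketched like this in Python. It assumes the aligner expects one sentence per line with a `<p>` line between paragraphs, as described above; check the hunalign documentation for the exact convention of your version. The sentence splitter is deliberately naive and the function names are invented for this example.

```python
import re

# Naive rule: split after . ! or ? when followed by whitespace and an ASCII capital.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(paragraph):
    """Naive splitter; replace it with proper segmentation rules for real data."""
    return [s for s in _SENTENCE_END.split(paragraph.strip()) if s]


def to_aligner_lines(paragraphs, marker="<p>"):
    """One sentence per line, with a marker line between consecutive paragraphs."""
    lines = []
    for index, paragraph in enumerate(paragraphs):
        if index:
            lines.append(marker)  # boundary between the previous paragraph and this one
        lines.extend(split_sentences(paragraph))
    return lines
```

Run it once on the source column and once on the target column, write each result to its own file, and both files will contain the same number of marker lines in the same order, so any misalignment stays inside its own paragraph.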



Didier Briel  Identity Verified
France
Local time: 16:35
Member (2007)
English to French
+ ...
It can work Oct 29, 2013

Samuel Murray wrote:

Didier Briel wrote:
You can use OmegaT:
- Create a paragraph-based project.
- Quit OmegaT.
- Put your TMX in the omegat folder, named project_save.tmx.
- Load the project, and change it to a sentence-based project.
- Change at least one translation (which means you must have at least one segment to translate in the project) and save.
- Your TMX will have been segmented.


I did not know that OmegaT could do that. I had thought that if one changes from paragraph segmentation to sentence segmentation mid-project, one would have to retranslate all the sentences.

That's precisely why that feature was developed.

PS: No... I tried that trick, but the TMX is still paragraph-based after that procedure, even in segments that contain nothing fancier than plain sentences, and the same number of sentences too.

Are you sure you changed at least one translation in the project, and recorded it?

I'm pretty sure it can work, because I tested before posting my comment.

Didier

P.S.: you can send me your TMX, and I'll convert it.


xxxbenk
Gambia
Local time: 15:35
Abkhazian to Chamorro
Screencast Oct 29, 2013

Would it be possible for you to make a screencast of this useful procedure? No voice is necessary, just the steps.
