OmegaT segmentation behaviour
Thread poster: valerius
valerius  Identity Verified
Latvia
English to Latvian
Oct 14, 2008

Can anyone give advice on how to make OmegaT (version 1.7.3, update 2) ignore a full stop (.) after a numeral and not break a sentence into separate segments? Is it possible at all, and might it lead to any other complications?
In Latvian, ordinal numbers are followed by a full stop, and I am experiencing a lot of trouble when sentences are broken in half where they shouldn't be.
I tried reading the Help, but my brain does not seem to be that programming-oriented.
Many thanks in advance for any suggestions!

Valery


xxxMarc P  Identity Verified
German to English
OmegaT segmentation behaviour Oct 14, 2008

In the segmentation rules dialog:

Uncheck the "Break/Exception" option
Pattern before: [0-9]\.
Pattern after: \s

Move this rule so that it is above the generic break rule (e.g. pattern before: [\.\?\!]+, pattern after: \s[A-Z]).
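Why the order of the rules matters can be sketched in a few lines of Python. This is only a simplified model, assuming that at each position the first rule whose "before" and "after" patterns both match decides whether to break; it is not OmegaT's actual code, and the `segment` function and the sample sentence are made up for illustration.

```python
import re

# Each rule: (is_break, pattern_before, pattern_after).
# Assumption of this sketch: the first rule that matches at a position decides.
RULES = [
    (False, r"[0-9]\.", r"\s"),         # exception: digit + full stop, then whitespace
    (True, r"[\.\?\!]+", r"\s[A-Z]"),   # generic break rule quoted above
]

def segment(text, rules=RULES):
    """Split text wherever the first matching rule is a break rule."""
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in rules:
            # "before" must end exactly at pos, "after" must start exactly at pos.
            if re.search(f"(?:{before})\\Z", text[:pos]) and re.match(after, text[pos:]):
                if is_break:
                    segments.append(text[start:pos].strip())
                    start = pos
                break  # first matching rule wins, later rules are not consulted
    segments.append(text[start:].strip())
    return segments

sample = "Room on the 3. Floor is free. Next sentence."
print(segment(sample))                     # exception listed first
print(segment(sample, rules=[RULES[1]]))   # generic break rule only
```

With the exception listed first, the sample stays in two segments; with only the generic rule, it is also cut after "3.".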

This should work. You may however prefer to do it the other way around, i.e. modify your generic break rule so that it only breaks after [any letter] followed by [any punctuation symbol], e.g.:

Check the "Break/Exception" option
Pattern before: [a-z][\.\?\!]
Pattern after: \s

This is the "simple" variant, which matches any lower-case letter. You may want to experiment, e.g. by including characters with diacritics, and check that I have got the syntax right.
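To illustrate the point about diacritics, here is a small Python check of the "letter followed by punctuation" pattern. The list of Latvian lower-case letters and the helper name `ends_with_break_trigger` are assumptions made for this sketch. OmegaT uses Java regular expressions, where `\p{Ll}` (any lower-case letter) is another option; Python's `re` module does not support it, so the letters are listed explicitly.

```python
import re

# Pattern from the post: ASCII lower-case letter followed by punctuation.
ASCII_ONLY = re.compile(r"[a-z][\.\?\!]")

# Assumed extension: the same pattern plus Latvian lower-case letters with diacritics.
WITH_DIACRITICS = re.compile(r"[a-zāčēģīķļņšūž][\.\?\!]")

def ends_with_break_trigger(pattern, sentence):
    """True if the last two characters are 'letter + punctuation' for this pattern."""
    return pattern.fullmatch(sentence[-2:]) is not None

for text in ["beigas.", "maijā.", "3."]:
    print(text,
          ends_with_break_trigger(ASCII_ONLY, text),
          ends_with_break_trigger(WITH_DIACRITICS, text))
```

Neither pattern fires after "3.", which is the goal, but only the extended one fires after a word ending in "ā".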

HTH,
Marc


[Edited at 2008-10-14 12:29]



Susan Welsh  Identity Verified
United States
Member (2008)
Russian to English
It's in the user's manual under "segmentation" Oct 14, 2008

There's a section that tells how to specify exceptions to the default segmentation.

Susan



Didier Briel  Identity Verified
France
Member (2007)
English to French
Adding an exception Oct 14, 2008

valerius wrote:

Can anyone give advice on how to make OmegaT (version 1.7.3, update 2) ignore a full stop (.) after a numeral and not break a sentence into separate segments? Is it possible at all

Of course.
To quote the manual:
"Given the flexibility you may consider defining more exception rules for the language you translate from, to give you more meaningful and coherent segments."


and might it lead to any other complications?

No reason for that.


In Latvian, ordinal numbers are followed by a full stop and I am experiencing a lot of trouble when sentences are broken in half where they shouldn't be.
Tried reading the Help, but my brain seems to be not that programming-oriented.

Regular expressions are not only useful for "programming-oriented" people. They can be used in a lot of situations, e.g., in Word.

What you need is to add an exception in the segmentation rules (Options/Segmentation).

Add a new rule to an existing set of rules, or create a new set, for instance for Latvian.

Don't check Break/Exception.
In Before, enter \d\. (a digit followed by a dot)
In After, enter \s (space)

That's all.

Didier


valerius  Identity Verified
Latvia
English to Latvian
TOPIC STARTER
Thanks to the experts! Oct 14, 2008

Thank you for the advice! It actually was a matter of 30 seconds, once you know what to do, and it seems to be working so far.
I did it according to Didier's explanation, only because I had noticed in the help file previously that \d stands for numerals. I'm just curious: what difference would it make if I put [0-9]?
Also, for some reason, when I created a new set of rules with the same exception for Latvian and moved it above the default, it did not work. So I just added a new rule in the Default group. Any ideas?
And thank you Marc for the insight in the functionality of this feature. I, however, do not feel like experimenting at this moment because the converting of .doc into .odt and vice versa already seems dodgy enough to me with all the layout changes, etc. There have been cases when I have been unable to open the translated document because of some accidentally deleted tags in the OmegaT interface (the meaning of which I have no clue about, but that is probably worth another forum thread...).

Once again, your help is very much appreciated!



Didier Briel  Identity Verified
France
Member (2007)
English to French
There is more than one way to do it Oct 14, 2008

valerius wrote:

I did it according to Didier's explanation, only because I had noticed in the help file previously that \d stands for numerals. I'm just curious: what difference would it make if I put [0-9]?

None. Those are two different ways of saying the same thing.
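A quick way to convince yourself is to run both patterns over a few strings. The sketch below uses Python rather than the Java regex engine OmegaT relies on, and the sample strings are made up. One caveat that applies to the sketch only: Python's `\d` also matches non-ASCII digits unless `re.ASCII` is set, whereas Java's `\d` means `[0-9]` in its default mode, so the flag is added here to mirror that.

```python
import re

DIGIT_CLASS = re.compile(r"[0-9]\.")
# re.ASCII restricts \d to 0-9, mirroring Java's default behaviour.
DIGIT_SHORTHAND = re.compile(r"\d\.", re.ASCII)

samples = ["3.", "12.", "a.", "..", "7!"]

# Both patterns should agree on every sample.
agree = all(
    bool(DIGIT_CLASS.search(s)) == bool(DIGIT_SHORTHAND.search(s))
    for s in samples
)
print(agree)
```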


And also, for some reason, when I had created a new set of rules with the same exception for Latvian and moved it above the default, it did not work. So I just added a new rule in the Default group. Any ideas?

One possibility is that the language you entered for the new set of rules didn't match the source language of the project.
The default rules match any language, because their language pattern is ".*".
When you create a new set, its language pattern is "LN-CO" by default, which of course doesn't match anything.
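The matching can be pictured as a regular expression test of the set's language pattern against the project's source language. The following Python sketch rests on assumptions: that the whole language code has to match, that case is ignored, and that the project's source language is written as "LV" or "LV-LV". The function name `rule_set_applies` is hypothetical, not an OmegaT name.

```python
import re

def rule_set_applies(language_pattern, source_language):
    """True if the pattern matches the whole source language code (assumed behaviour)."""
    return re.fullmatch(language_pattern, source_language, re.IGNORECASE) is not None

for pattern in [".*", "LN-CO", "LV.*"]:
    print(pattern,
          rule_set_applies(pattern, "LV"),
          rule_set_applies(pattern, "LV-LV"))
```

Under these assumptions, ".*" applies to any project, the untouched "LN-CO" placeholder applies to none, and a pattern such as "LV.*" covers both forms of the Latvian code.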

Didier


xxxMarc P  Identity Verified
German to English
OmegaT segmentation behaviour Oct 14, 2008

valerius wrote:

I, however, do not feel like experimenting at this moment because the converting of .doc into .odt and vice versa already seems dodgy enough to me with all the layout changes, etc.


When you open a .doc file in OOo (and convert it to .odt and edit the text), the layout may *appear* to be different, but you are likely to find after converting back and re-opening it in Word that the layout has remained the same, or that changes are only minor.

There have been cases when I have been unable to open the translated document because of some accidentally deleted tags in the OmegaT interface (the meaning of which I have no clue about, but that is probably worth another forum thread...).


Press Ctrl+T in OmegaT to check for tag errors.

Marc


valerius  Identity Verified
Latvia
English to Latvian
TOPIC STARTER
Tags Nov 10, 2008

Indeed, I had not been aware of this Ctrl+T function. I had wondered what it was, but apparently the document I tried it on did not contain any tags, so, obviously, nothing happened.
Thank you all for your support and for the time you spent explaining these things in this forum. As I see now, the manual does contain this information, but I think it is difficult to read if one does not have some prior understanding of "what it's all about".

Valery

