OmegaT segmentation rules - splitting or merging segments in OmegaT
Thread poster: Souni
Souni
Local time: 10:41
German to English
+ ...
Jan 9, 2010

Hi,

I use OmegaT for Mac, and I would very much like to know if there is a way to split or merge segments in a source text once it has been imported/dragged and dropped into the source folder.

If there's a simple way to explain, I would also like to know what these segmentation rules refer to. I am using mostly German to English, and I activate the sentence segmentation. But every time I try to take a more active approach to my segmentation requirements, I retreat in baffled impotence! What are these exceptions? Do they mean that they will exceptionally segment, or that they will exceptionally not segment? Can anyone explain?

Thanks!

Souni



Dragomir Kovacevic  Identity Verified
Italy
Local time: 10:41
Italian to Serbian
+ ...
seg. rules refer to... Jan 10, 2010

... splitting a paragraph into sentences.

An exception means that you don't want a break in the middle of a sentence just because it contains an abbreviation such as Mr. or Prof. You naturally want the sentence to stay in one piece, abbreviations included, rather than being cut at that point.

The rules for the basic punctuation marks and abbreviations work on the principle: mark + space. A sentence like "Today the sun is shining.I'll practice some jogging" won't be split, since there is no space between the period and the next word.

The basic punctuation marks used for splitting a paragraph into segments are . | ! | ? | : | ; and you can even add a comma. You can examine them in the default set of segmentation rules in the Options menu.

For German, you will find many abbreviations already present. Leaving Break/Exception unticked makes the rule an exception, so no break is made at that point. Example: Abb\. as the pattern before, and a space, \s, as the pattern after.
If you tick the rule, the sentence will be broken after that word, which is what you don't want.
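
As a rough illustration of the mark + space principle, here is a minimal Python sketch. It is not OmegaT's code: break_positions is a made-up helper, and it only assumes that a break point needs one pattern to match just before a position and another just after it.

```python
import re

def break_positions(text, before, after):
    """Return positions where `before` matches the text ending there
    and `after` matches the text starting there (hypothetical helper)."""
    positions = []
    for i in range(1, len(text)):
        ends_with_before = re.search("(?:" + before + r")\Z", text[:i])
        starts_with_after = re.match(after, text[i:])
        if ends_with_before and starts_with_after:
            positions.append(i)
    return positions

spaced = "Today the sun is shining. I'll practice some jogging."
glued = "Today the sun is shining.I'll practice some jogging."

# Period followed by a space: one break point, right after the first period.
print(break_positions(spaced, r"\.", r"\s"))  # [25]
# Period followed directly by a letter: no break point at all.
print(break_positions(glued, r"\.", r"\s"))   # []
```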

Dragomir


[Edited at 2010-01-10 08:14 GMT]



Vito Smolej
Germany
Local time: 10:41
Member (2004)
English to Slovenian
+ ...
about segmentation rules in OmegaT Jan 10, 2010

Souni wrote:
If there's a simple way to explain, I would also like to know what these segmentation rules refer to.

As Dragomir already indicated, you have to deal with
a) a given situation found in the stream of characters forming the text, and
b) what to do at that point: split the text there, or make an exception to a more general rule (i.e. NOT split the text).

Example: in the case of "... Dr. ..." for a), we would normally NOT want to split after this period, i.e. we need an exception to the more general rule "split after a period and before a whitespace character" (blank, tab, etc.).

The Break/Exception check box in the segmentation rules window determines whether it is a break rule (check box set) or an exception rule (check box unset).

See more in the documentation (chapter Source segmentation *).

Please note that the rules apply to all segments: you cannot make a special rule that is valid for just one specific case in the source and not for the rest. Changing or expanding the rules thus changes the whole ballgame. After a change, the input text may be split up quite differently, and the segments you have already collected in the translation memory may no longer fit. This is one of the reasons for orphan segments: they are in the TM, but nowhere to be found in the source text under the new rules.
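
To make the orphan-segment effect concrete, here is a small Python sketch. It is only a toy model: split_sentences is a made-up helper that breaks after a given set of marks when whitespace follows, and the translation memory is a plain dictionary with placeholder translations, not a real TMX file.

```python
import re

def split_sentences(text, marks):
    # Break after any of the given marks when whitespace follows.
    return re.split(r"(?<=[" + re.escape(marks) + r"])\s+", text)

source = "Note: save the file first. Then close the project."

# Translation memory collected under the old rules (placeholder translations).
old_segments = split_sentences(source, ".!?")
tm = {seg: "<translation of: " + seg + ">" for seg in old_segments}

# The new rules also break after a colon.
new_segments = split_sentences(source, ".!?:")

# TM entries whose source text no longer exists as a segment are orphans.
orphans = [seg for seg in tm if seg not in new_segments]
print(orphans)  # ['Note: save the file first.']
```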

The default rules (language-specific, as you may have noticed) are an evolutionary product. This means that there's always room for improvement. If you are missing a specific case for German, tell us about it, either here or in the Yahoo OmegaT group.

HTH

Regards

smo

* I would appreciate hearing from you about this chapter: its contents, legibility, etc. See the PDF file in the Documentation section of Files in the OmegaT Yahoo group:

http://tech.groups.yahoo.com/group/OmegaT/files



traductorchile  Identity Verified
Chile
Local time: 06:41
English to Spanish
+ ...
Hope you can help Vito Feb 15, 2012

I have a text with lists of short sentences, e.g.:

Jack went up the hill
Jill didn't follow him,
Jack got in a fuss
Jill had a laugh
Jack came tumbling down.

The default segmentation rules treat these five lines as one segment. I tried to create a segmentation rule of the form /end of line...........[A-Z] and then disallow it (I understand end of line = \n or \r). But it doesn't work, probably because I have activated sentence segmentation.
What options do I have for getting each line as a separate segment? Putting a sentence breaker at the end of each line throughout the text, or is there some easier way?



traductorchile  Identity Verified
Chile
Local time: 06:41
English to Spanish
+ ...
Sorry to bother Feb 15, 2012

Sorry, I had saved the PDF as a text document, so all the line-end formatting had disappeared.

I copied the PDF into a .docx file and now the sentences are segmented correctly.

Sorry.



Didier Briel  Identity Verified
France
Local time: 10:41
Member (2007)
English to French
+ ...
There are specific options for text files Feb 15, 2012

traductorchile wrote:

I have a text with lists of short sentences, e.g.:

Jack went up the hill
Jill didn't follow him,
Jack got in a fuss
Jill had a laugh
Jack came tumbling down.

The default segmentation rules treat these five lines as one segment. I tried to create a segmentation rule of the form /end of line...........[A-Z] and then disallow it (I understand end of line = \n or \r). But it doesn't work, probably because I have activated sentence segmentation.
What options do I have for getting each line as a separate segment? Putting a sentence breaker at the end of each line throughout the text, or is there some easier way?

In Options > File Filters > Text Files > Options, you can set how line ends are processed for text files.
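
The effect of such a setting can be sketched in Python. This is only an illustration of the idea, not OmegaT's filter code: the function paragraphs and the mode names "line_breaks", "empty_lines" and "never" are placeholders, not the labels used in the OmegaT dialog.

```python
import re

def paragraphs(text, mode):
    """Split raw text into paragraphs before any sentence segmentation.

    mode is a hypothetical setting: "line_breaks", "empty_lines" or "never".
    """
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # normalise line ends
    if mode == "line_breaks":
        parts = text.split("\n")
    elif mode == "empty_lines":
        parts = re.split(r"\n\s*\n", text)
    elif mode == "never":
        parts = [text]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [p.strip() for p in parts if p.strip()]

rhyme = (
    "Jack went up the hill\n"
    "Jill didn't follow him,\n"
    "Jack got in a fuss\n"
    "Jill had a laugh\n"
    "Jack came tumbling down."
)

print(len(paragraphs(rhyme, "line_breaks")))  # 5: every line is its own paragraph
print(len(paragraphs(rhyme, "empty_lines")))  # 1: no empty line, so one paragraph
```

With each line treated as its own paragraph, sentence rules never get the chance to glue the five lines together.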

Didier


Paul Klassen
Canada
Local time: 05:41
French to English
Cannot get OmegaT customized segmentation to work Dec 14, 2013

I have a sentence that contains:
… Malthus’, 23 fev. 1816 …
OmegaT insists on segmenting this after fev. I tried creating an exception with:
fev\. in the "Pattern before"
\s in the "Pattern after"
Break/Exception checked
in FR-CA (which is the language I am using).

I have tried a large number of variations on this, but nothing suppresses the segmentation. Any thoughts?

I am running version 3.0.4 on Windows 7.

Thank you,

Paul



Didier Briel  Identity Verified
France
Local time: 10:41
Member (2007)
English to French
+ ...
Uncheck Dec 15, 2013

Paul Klassen wrote:

I have a sentence that contains:
… Malthus’, 23 fev. 1816 …
OmegaT insists on segmenting this after fev. I tried creating an exception with:
fev\. in the "Pattern before"
\s in the "Pattern after"
Break/Exception checked


If checked, it means you want to segment.

in FR-CA (which is the language I am using).

I have tried a large number of variations on this, but nothing suppresses the segmentation. Any thoughts?

Try unchecking.

I am running version 3.0.4 on Windows 7.

You should upgrade to 3.0.7.

Didier


Paul Klassen
Canada
Local time: 05:41
French to English
No luck Dec 16, 2013

I've upgraded to 3.0.7. I had tried both checked and unchecked, but didn't realize that the one I mentioned was not the right one.

I tried retyping the phrase
Ricardo to Malthus, 23 fev. 1816
into a document (as the only text), which I then saved and loaded. OmegaT still inserts a segment split. This was all using MS Word and .docx format.
Then I did the same thing using LibreOffice, same result.

Any more ideas?



Didier Briel  Identity Verified
France
Local time: 10:41
Member (2007)
English to French
+ ...
Rule position Dec 17, 2013

Paul Klassen wrote:

I've upgraded to 3.0.7. I had tried both checked and unchecked, but didn't realize that the one I mentioned was not the right one.

I tried retyping the phrase
Ricardo to Malthus, 23 fev. 1816
into a document (as the only text), which I then saved and loaded. OmegaT still inserts a segment split. This was all using MS Word and .docx format.
Then I did the same thing using LibreOffice, same result.
Any more ideas?

I just tried
Before: fev\.
After: \s
Unchecked, and it worked first time, all the other rules being the default rules.

When you created your FR-CA set of rules, did you leave it at the bottom?

Rules are executed in sequential order. So, your exceptions must be before the default rules, if you want them to do anything.
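
Why the position matters can be shown with a simplified Python model. It is a sketch under stated assumptions, not OmegaT's implementation: Rule and segment are made-up names, the rules are kept in one flat list, and the first rule that matches at a position is assumed to decide what happens there.

```python
import re
from typing import NamedTuple

class Rule(NamedTuple):
    is_break: bool  # True = break rule (box checked), False = exception (unchecked)
    before: str     # regex that must match just before the candidate position
    after: str      # regex that must match just after it

def segment(text, rules):
    """Split text, letting the first matching rule decide at each position."""
    segments, start = [], 0
    for i in range(1, len(text)):
        for rule in rules:
            if (re.search("(?:" + rule.before + r")\Z", text[:i])
                    and re.match(rule.after, text[i:])):
                if rule.is_break:
                    segments.append(text[start:i].strip())
                    start = i
                break  # first matching rule wins; later rules are ignored here
    segments.append(text[start:].strip())
    return segments

default_break = Rule(True, r"\.", r"\s")
fev_exception = Rule(False, r"fev\.", r"\s")
text = "Ricardo to Malthus, 23 fev. 1816"

# Exception listed before the default rule: the sentence stays whole.
print(segment(text, [fev_exception, default_break]))
# ['Ricardo to Malthus, 23 fev. 1816']

# Exception listed after the default rule: it never gets a chance to apply.
print(segment(text, [default_break, fev_exception]))
# ['Ricardo to Malthus, 23 fev.', '1816']
```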

Didier

