Help: Setup segmentation rule for Chinese
Thread poster: Jessicaliu

Jessicaliu
Hong Kong
Chinese to English
Jan 18, 2011

Hi! I found that OmegaT has no segmentation rule for Chinese, so my source text is segmented by paragraph.

I tried to set up my own segmentation rule as follows.

Intention: break the segment after "。" (Chinese period)
Before: 。
After: (empty)

But it's not working: all the sentences still stick together. Could anyone tell me what to do? Many thanks.


 

Didier Briel
France
Member (2007)
English to French
Did you check Break/Exception? Jan 18, 2011

Jessicaliu wrote:
Hi! I found that OmegaT has no segmentation rule for Chinese, so my source text is segmented by paragraph.

I tried to set up my own segmentation rule as follows.

Intention: break the segment after "。" (Chinese period)
Before: 。
After: (empty)

But it's not working: all the sentences still stick together. Could anyone tell me what to do? Many thanks.

Is the Break/Exception box checked in your rule?

Is your project set to sentence segmentation (check in Project Properties)?

Didier


 

Jessicaliu
Hong Kong
Chinese to English
TOPIC STARTER
box checked (break) Jan 19, 2011

Thank you for your reply. I checked the box.

The version I use is 2.2.0_2.

I also tried several source texts and moved the rule up to the top of the list, but it seems that OmegaT does not recognize my segmentation rule.


 

Didier Briel
France
Member (2007)
English to French
A simple test: set your source language to Japanese Jan 19, 2011

Jessicaliu wrote:

Thank you for your reply. I checked the box.

The version I use is 2.2.0_2.

I also tried several source texts and moved the rule up to the top of the list, but it seems that OmegaT does not recognize my segmentation rule.

You can do a simple test: set your source language to Japanese (just temporarily, for the test), as the end-of-sentence character is the same.

If it works, then there's something wrong in your rule (for instance, the end of sentence character is not the right one).

If it doesn't work, then there's another issue.

For instance, you did not answer my other question:
Is your project set to sentence segmentation (check in Project Properties)?

Didier


 

Jessicaliu
Hong Kong
Chinese to English
TOPIC STARTER
It works. Thank you a lot. Jan 19, 2011

Thank you Didier.

I checked the sentence-level segmentation setting.

But your reply reminded me that I had forgotten to change the "language pattern" from the default to a specific Chinese language code that OmegaT is able to recognize. I changed it to ZH-HK, and it works!
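
For anyone hitting the same problem, here is a rough sketch of why the language pattern matters: a rule set only applies when its pattern covers the project's source language code. The sketch is in Python and assumes the pattern is treated as a regular expression matched against the whole language code, ignoring case. The function name `rule_set_applies` and the sample patterns are illustrative, not OmegaT's own code.

```python
import re


def rule_set_applies(language_pattern: str, source_language: str) -> bool:
    """Return True if the (hypothetical) language pattern covers the language code.

    Assumption: the pattern must match the entire code, case-insensitively.
    """
    return re.fullmatch(language_pattern, source_language, flags=re.IGNORECASE) is not None


# An exact code only covers that one language variant.
print(rule_set_applies("ZH-HK", "ZH-HK"))
# A broader pattern covers every Chinese variant.
print(rule_set_applies("ZH.*", "zh-TW"))
# A pattern for another language leaves the Chinese project uncovered.
print(rule_set_applies("JA.*", "ZH-HK"))
```

If the pattern does not cover the source language, the rules in that set are simply never consulted, which would look exactly like "OmegaT does not recognize my rule".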


 

Didier Briel
France
Member (2007)
English to French
Glad it works Jan 19, 2011

Jessicaliu wrote:
I checked the sentence-level segmentation setting.

But your reply reminded me that I had forgotten to change the "language pattern" from the default to a specific Chinese language code that OmegaT is able to recognize. I changed it to ZH-HK, and it works!

Thank you for the feedback.
It might help another user find the same issue in the future.

Didier


 

Pierret Adrien
China
Chinese to French
Related question Mar 26, 2013

I am trying to set up some more segmentation rules for Chinese, and this is what I'd like to get:

No segmentation (an exception) when a period 。 is followed by a closing quotation mark ” (the period being set to break in all other cases).

Example: “我想此句不要分开。”译员说。 should not be segmented.
Under the current rules, it gets segmented as follows:
“我想此句不要分开。[segment]”译员说。[segment]

My setup:
Break/Exception: unchecked
Pattern Before: [。?!]
Pattern After: [’”"]

But it does not seem to work, even though all the other rules set up for this language work fine. Any clue?

Thank you!

[Edited at 2013-03-26 05:47 GMT]


 

Didier Briel
France
Member (2007)
English to French
Move your non-breaking rules to the top Mar 26, 2013

Pierret Adrien wrote:

I am trying to set up some more segmentation rules for Chinese, and this is what I'd like to get:

No segmentation (an exception) when a period 。 is followed by a closing quotation mark ” (the period being set to break in all other cases).

Example: “我想此句不要分开。”译员说。 should not be segmented.
Under the current rules, it gets segmented as follows:
“我想此句不要分开。[segment]”译员说。[segment]

My setup:
Break/Exception: unchecked
Pattern Before: [。?!]
Pattern After: [’”"]

But it does not seem to work, even though all the other rules set up for this language work fine. Any clue?

The only thing I can think of right now (unless you're not using the right quotation marks) is that perhaps your non-breaking rule is below the breaking rule.
You have to move your non-breaking rule above all breaking rules.
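
To illustrate why the order matters, here is a minimal sketch in Python. It is not OmegaT's implementation: it assumes that rules are tried from top to bottom at each position and that the first rule whose Before and After patterns both match decides the outcome. The names `segment`, `EXCEPTION_FIRST` and `BREAK_FIRST` are made up for the example, and the patterns are the ones quoted above.

```python
import re

# Each rule is (is_break, pattern_before, pattern_after). Illustrative only.
EXCEPTION_FIRST = [
    (False, r'[。?!]', r'[’”"]'),  # exception: no break before a closing quote
    (True, r'[。?!]', r''),        # break after sentence-final punctuation
]
# Same two rules, but with the breaking rule on top.
BREAK_FIRST = list(reversed(EXCEPTION_FIRST))


def segment(text, rules):
    """Split text wherever the first matching rule at a position is a break rule."""
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in rules:
            # Before must match the text ending at pos, After the text starting at pos.
            before_ok = re.search(rf'(?:{before})\Z', text[:pos])
            after_ok = re.match(after, text[pos:])
            if before_ok and after_ok:
                if is_break:
                    segments.append(text[start:pos])
                    start = pos
                break  # the first matching rule decides; lower rules are ignored
    segments.append(text[start:])
    return segments


TEXT = "“我想此句不要分开。”译员说。他笑了。"
print(segment(TEXT, EXCEPTION_FIRST))  # quoted sentence stays in one segment
print(segment(TEXT, BREAK_FIRST))      # the break rule wins and splits before ”
```

With the breaking rule on top, the exception is never reached at the position before ”, which reproduces the unwanted split described in the question.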

If it still doesn't work, I recommend asking the question in the Yahoo support group:
http://tech.groups.yahoo.com/group/OmegaT/
where there are knowledgeable people in your time zone (so you would get faster answers), and where you could express yourself in Chinese or French if needed.

Didier

[Edited at 2013-03-26 15:35 GMT]


 

Pierret Adrien
China
Chinese to French
Look no more Mar 27, 2013

Yes, my non-breaking rule was placed under my breaking rule. I didn't know it mattered. Problem solved, thank you.

By the way, is there a place where we can consult a comprehensive list of the symbols used in writing segmentation rules?

I understand that [] means "any one of these characters", but what about + or {}? I couldn't grasp them, and the documentation doesn't mention them as far as I could see.

Anyway, I'll be sure to check out that Yahoo group, thank you Didier.


 

Didier Briel
France
Member (2007)
English to French
The documentation is a good starting point Mar 27, 2013

Pierret Adrien wrote:
By the way, is there a place where we can consult a comprehensive list of the symbols used in writing segmentation rules?

I understand that [] means "any one of these characters", but what about + or {}? I couldn't grasp them, and the documentation doesn't mention them as far as I could see.

Chapter 16, "Regular expressions", of the documentation is a good starting point, although it doesn't cover everything. The same chapter gives a link to http://docs.oracle.com/javase/1.5.0/docs/api/java/util/regex/Pattern.html which contains a comprehensive list of all expressions used. For instance, for '+', look at 'Greedy quantifiers'.

For a beginner's approach to regular expressions, searching for 'regular expressions tutorial' in a search engine gives plenty of links. Note that OmegaT uses Java regular expressions (as mentioned above), whose syntax may differ slightly from other dialects.
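
As a quick illustration of the three constructs asked about, here is a small sketch. It uses Python's `re` module rather than Java, on the understanding that character classes, `+` and `{n,m}` behave the same way in both dialects; the sample strings are made up for the example.

```python
import re

# [] is a character class: it matches exactly ONE of the listed characters.
singles = re.findall(r"[。?!]", "好。行吗?行!")
print(singles)  # each punctuation mark is found separately

# + is a greedy quantifier: one or more of the preceding item, as many as possible.
runs = re.findall(r"[。?!]+", "什么?!好。")
print(runs)  # consecutive punctuation marks are matched as a single run

# {n,m} is a bounded repetition: at least n and at most m occurrences.
two_periods = re.fullmatch(r"。{2,3}", "。。") is not None
one_period = re.fullmatch(r"。{2,3}", "。") is not None
print(two_periods, one_period)  # two periods are in range, one is too few
```

In a segmentation rule, a Before pattern such as `[。?!]+` would therefore treat a run like ?! as one sentence ending instead of breaking inside it.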

Didier


 

Pierret Adrien
China
Chinese to French
Very helpful Mar 29, 2013

Thank you, your link is helpful enough that I can improve Chinese segmentation a bit (I'll send you my feedback, should you be interested in including it in a future release).

 

Weedy Tan
Taiwan
Chinese to English
Chinese sentence segmentation rules Jan 7, 2014

Pierret Adrien wrote:

Thank you, your link is helpful enough that I can improve Chinese segmentation a bit (I'll send you my feedback, should you be interested in including it in a future release).



Hi Pierret,

I am trying to create my own Chinese sentence segmentation rules in OmegaT and just read about your efforts here.

Would it be too much to ask you to share the more comprehensive and detailed segmentation rules you have written so far?

Thanks in advance,

Weedy


 

