Help: Setup segmentation rule for Chinese
Thread poster: Jessicaliu
Jessicaliu
Hong Kong
Chinese to English
Jan 18, 2011

Hi! I find there is no segmentation rule for Chinese in OmegaT, so my source text is segmented by paragraph.

I tried to set up my own segmentation rule as follows.

Intention: break the segment after "。" (Chinese period)
Before: 。
After:

But it's not working. All the sentences still stick together. Could anyone tell me what to do? Many thanks.
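
For readers unfamiliar with Before/After rules, here is a rough sketch of what such a rule is meant to do. It is written in Python purely for illustration and is not OmegaT's implementation (OmegaT uses Java regular expressions); the function name `split_rule` and the sample sentence are made up for this example. A break happens at every position preceded by the Before pattern and, if one is given, followed by the After pattern.

```python
import re

def split_rule(text, before, after=None):
    """Split text at every position preceded by `before`
    and, if `after` is given, followed by `after`."""
    # Lookbehind/lookahead match a position without consuming characters,
    # so the period stays attached to the sentence it ends.
    # Note: Python requires the lookbehind pattern to have a fixed width.
    pattern = f"(?<={before})"
    if after:
        pattern += f"(?={after})"
    # Splitting on an empty match needs Python 3.7 or later.
    return [piece for piece in re.split(pattern, text) if piece]

# Hypothetical sample text: two sentences ending in U+3002.
for piece in split_rule("今天下雨。我在家。", "。"):
    print(piece)
```

With Before set to 。 and After left empty, the sample is cut into two segments, each ending in its own period.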


Didier Briel
France
Member (2007)
English to French
Did you check Break/Exception? Jan 18, 2011

Jessicaliu wrote:
Hi! I find there is no segmentation rule for Chinese in OmegaT, so my source text is segmented by paragraph.

I tried to set up my own segmentation rule as follows.

Intention: break the segment after "。" (Chinese period)
Before: 。
After:

But it's not working. All the sentences still stick together. Could anyone tell me what to do? Many thanks.

Is the box Break/Exception in your rule checked?

Is your project set to sentence segmentation (check in Project Properties)?

Didier


Jessicaliu
Hong Kong
Chinese to English
TOPIC STARTER
box checked (break) Jan 19, 2011

Thank you for your reply. I checked the box.

The version I use is 2.2.0_2.

I also tried several source texts and moved the rule up to the top of the list. But it seems that OmegaT does not recognize my segmentation rule.


Didier Briel
France
Member (2007)
English to French
A simple test: set your source language to Japanese Jan 19, 2011

Jessicaliu wrote:

Thank you for your reply. I checked the box.

The version I use is 2.2.0_2.

I also tried several source texts and moved the rule up to the top of the list. But it seems that OmegaT does not recognize my segmentation rule.

You can do a simple test: set your source language to Japanese (just temporarily for the test), as the end of sentence character is the same.

If it works, then there's something wrong in your rule (for instance, the end of sentence character is not the right one).

If it doesn't work, then there's another issue.

For instance, you did not answer my other question:
Is your project set to sentence segmentation (check in Project Properties)?
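
To rule out the "not the right character" case, one option is to look at the code point of whatever was pasted into the Before field. The snippet below is only an illustration in Python (the helper name `describe` is made up); it prints the Unicode name of several characters that can look like a sentence-ending period on screen. The Chinese and Japanese period is U+3002.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# Characters that can be mistaken for one another in a rule or a source text.
CANDIDATES = ["\u3002", "\uFF61", "\uFF0E", "\u002E"]

for ch in CANDIDATES:
    print(ch, describe(ch))
```

If the character in the rule is not U+3002 while the source text uses U+3002 (or the other way round), the Before pattern will never match.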

Didier


Jessicaliu
Hong Kong
Chinese to English
TOPIC STARTER
It works. Thank you a lot. Jan 19, 2011

Thank you Didier.

I had sentence-level segmentation checked.

But your reply reminded me that I forgot to change the "language pattern" from the default to a specific Chinese language code that OmegaT is able to recognize. I changed it to ZH-HK, and it works!
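
A small sketch of why the language pattern matters, under the assumption that each rule set's language pattern is a regular expression matched against the whole source language code of the project. The rule-set list and the function name `matching_rule_sets` are hypothetical and written in Python for illustration only; this is not OmegaT's code.

```python
import re

# Hypothetical rule sets as (language pattern, name), listed top to bottom.
RULE_SETS = [
    ("JA.*", "Japanese"),
    ("ZH-HK", "Chinese (Hong Kong)"),
    (".*", "Default"),
]

def matching_rule_sets(lang_code):
    """Return the names of the rule sets whose pattern matches the whole code."""
    return [name for pattern, name in RULE_SETS
            if re.fullmatch(pattern, lang_code)]

print(matching_rule_sets("ZH-HK"))
print(matching_rule_sets("ZH-CN"))
```

Under this assumption, a pattern of ZH-HK applies only to projects whose source language is exactly ZH-HK; a pattern such as ZH.* would cover every Chinese variant.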


Didier Briel
France
Member (2007)
English to French
Glad it works Jan 19, 2011

Jessicaliu wrote:
I had sentence-level segmentation checked.

But your reply reminded me that I forgot to change the "language pattern" from the default to a specific Chinese language code that OmegaT is able to recognize. I changed it to ZH-HK, and it works!

Thank you for the feedback.
It might help another user find the same issue in the future.

Didier


Pierret Adrien
China
Chinese to French
Related question Mar 26, 2013

I am trying to set up some more segmentation rules for Chinese, and this is what I'd like to get:

An exception so that no break occurs when a period 。 is followed by a closing quotation mark ” (the period is set up to break the segment in all other instances).

Example: “我想此句不要分开。”译员说。 should not be segmented between 。 and ”.
Under the current rules, it gets segmented as follows:
“我想此句不要分开。[segment]”译员说。[segment]

My setup:
Break/Exception: unchecked
Pattern Before: [。?!]
Pattern After: [’”"]

But it does not seem to work, although all the other rules set up for this language work fine. Any clue?

Thank you!

[Edited at 2013-03-26 05:47 GMT]


Didier Briel
France
Member (2007)
English to French
Move your non-breaking rules to the top Mar 26, 2013

Pierret Adrien wrote:

I am trying to set up some more segmentation rules for Chinese, and this is what I'd like to get:

An exception so that no break occurs when a period 。 is followed by a closing quotation mark ” (the period is set up to break the segment in all other instances).

Example: “我想此句不要分开。”译员说。 should not be segmented between 。 and ”.
Under the current rules, it gets segmented as follows:
“我想此句不要分开。[segment]”译员说。[segment]

My setup:
Break/Exception: unchecked
Pattern Before: [。?!]
Pattern After: [’”"]

But it does not seem to work, although all the other rules set up for this language work fine. Any clue?

The only thing I can think of right now (unless you're not using the right quotation marks) is that perhaps your non-breaking rule is below the breaking rule.
You have to move your non-breaking rule above all breaking rules.
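
The effect of rule order can be sketched as follows. This is an illustration in Python, not OmegaT's implementation, and it assumes that at each position the first rule in the list whose Before and After patterns both match decides the outcome. The function name `segment` and the tuple layout are made up for the example; the patterns and the sentence are the ones quoted above.

```python
import re

# Each rule is (is_break, before_pattern, after_pattern); list order matters.
EXCEPTION = (False, "[。?!]", '[’”"]')  # Break/Exception unchecked
BREAK = (True, "[。?!]", "")             # Break/Exception checked

def segment(text, rules):
    """Split text, letting the first matching rule decide at each position."""
    pieces, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in rules:
            # A rule applies when `before` matches just before the position
            # and `after` matches just after it.
            if re.search(f"(?:{before})$", text[:pos]) and re.match(after, text[pos:]):
                if is_break:
                    pieces.append(text[start:pos])
                    start = pos
                break  # rules further down the list are not consulted
    pieces.append(text[start:])
    return pieces

TEXT = "“我想此句不要分开。”译员说。"
print(segment(TEXT, [EXCEPTION, BREAK]))  # exception listed first
print(segment(TEXT, [BREAK, EXCEPTION]))  # breaking rule listed first
```

With the exception listed first, the sentence stays in one piece; with the breaking rule first, it is cut between 。 and ”, which is the behaviour described in the question.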

If it still doesn't work, I recommend asking the question in the Yahoo support group:
http://tech.groups.yahoo.com/group/OmegaT/
where there are knowledgeable people in your time zone (so you would get faster answers), and where you could express yourself in Chinese or French if needed.

Didier

[Edited at 2013-03-26 15:35 GMT]


Pierret Adrien
China
Chinese to French
Look no more Mar 27, 2013

Yes, my non-breaking rule was placed under my breaking rule. I didn't know it mattered. Problem solved, thank you.

By the way, is there a place where we can consult a comprehensive list of the signs used in writing segmentation rules?

I understood that [] means "any one of those signs", but what about + or {}? I couldn't get the hang of them, and the documentation doesn't mention them as far as I could see.

Anyway, I'll be sure to check out that Yahoo group, thank you Didier.


Didier Briel
France
Member (2007)
English to French
The documentation is a good starting point Mar 27, 2013

Pierret Adrien wrote:
By the way, is there a place where we can consult a comprehensive list of the signs used in writing segmentation rules?

I understood that [] means "any one of those signs", but what about + or {}? I couldn't get the hang of them, and the documentation doesn't mention them as far as I could see.

Chapter 16 of the documentation, "Regular expressions", is a good starting point, although it doesn't cover everything. The same chapter gives a link to http://docs.oracle.com/javase/1.5.0/docs/api/java/util/regex/Pattern.html which contains a comprehensive list of all the expressions that can be used. For instance, for '+', look at 'Greedy quantifiers'.

For a beginner's approach to regular expressions, searching for 'regular expressions tutorial' in a search engine gives plenty of links. Note that OmegaT uses Java regular expressions (as mentioned above), whose syntax may vary slightly from other dialects.
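
As a quick illustration of the three signs asked about, here is a small sketch. It uses Python's `re` module rather than Java, on the assumption that these basic constructs behave the same in both dialects; the helper name `matches` is made up for the example.

```python
import re

def matches(pattern, text):
    """True if the pattern matches the whole text."""
    return re.fullmatch(pattern, text) is not None

# []  character class: exactly ONE of the listed characters.
print(matches("[。?!]", "!"))     # one character from the class
print(matches("[。?!]", "。!"))   # two characters, so no match

# +   greedy quantifier: one or more of the preceding item.
print(matches("[。?!]+", "。!"))

# {n} and {n,m}  repeat the preceding item n times, or between n and m times.
print(matches("。{2}", "。。"))
print(matches("。{1,3}", "。。。。"))
```

Inside a character class, ? and ! are ordinary characters; outside one, ? is itself a quantifier and would need to be escaped as \? to be matched literally.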

Didier


Pierret Adrien
China
Chinese to French
Very helpful Mar 29, 2013

Thank you. Your link is helpful enough that I can improve Chinese segmentation a bit (I'll give you my feedback, should you be interested in including it in a future release).

Weedy Tan
Taiwan
Chinese to English
Chinese sentence segmentation rules Jan 7, 2014

Pierret Adrien wrote:

Thank you. Your link is helpful enough that I can improve Chinese segmentation a bit (I'll give you my feedback, should you be interested in including it in a future release).



Hi Pierret,

I am trying to create my own Chinese sentence segmentation rules in OmegaT and just read about your efforts here.

Would it be too much to ask you to share the more comprehensive and detailed segmentation rules you have written so far?

Thanks in advance,

Weedy

