What Is the Full Stop Rule; Codes for creating segmentation rules
Thread poster: JW Narins

JW Narins  Identity Verified
United States
Local time: 11:28
Member (2015)
Russian to English
+ ...
Nov 20, 2015

Hello! I'm trying to make segmentation do what I want. I'm new to this.

I see that segmentation is set in each separate TM, not in Studio overall.

For a chosen TM, when editing the Full Stop Rule in advanced view, I see that there's a code for each thing that will be recognized as a segment ending. (Why those codes are better than something immediately understandable I can't imagine, but leave that aside.)

Questions:
1) Is there a list of these codes somewhere I can refer to so I can see what I'm doing here?

2) Is there a list of available rules, like the Full Stop Rule, that actually tells me exactly what they consist of? That would be nice!

3) Can I take something like hard line breaks out of my segmentation rules to avoid the problem where you can't unite segments over a paragraph break? Or, in other words, can I take their idea of a paragraph break out of segmentation and just have end-of-sentence markers? I can always split a segment, after all.

Thanks in advance!


 

Erik Freitag  Identity Verified
Germany
Local time: 17:28
Member (2006)
Dutch to German
+ ...
Some answers and hints Nov 21, 2015

Dear JW,

JW Narins wrote:

Hello! I'm trying to make segmentation do what I want. I'm new to this.

I see that segmentation is set in each separate TM, not in Studio overall.

For a chosen TM, when editing the Full Stop Rule in advanced view, I see that there's a code for each thing that will be recognized as a segment ending. (Why those codes are better than something immediately understandable I can't imagine, but leave that aside.)


Sorry, I can't leave that aside, because it's an easy one. It's a whole lot better than something immediately understandable because it means a huge shortcut for programmers. As a software user, I would expect more or less something like a window where I can click options like: "Select segmentation characters: [ ] full stop [ ] colon [ ] soft return [ ] hard return [ ] (list of other characters)" etc., possibly with further options like "not before: " or "only after: " etc. I think you get my drift. What SDL programmers have done here is: Oh, we need the user to be able to create his own segmentation rules! - Ok, but let them do the actual programming themselves, I mean, come on, everybody knows how to do that! If those noobs find that too complicated, we can always sell it as a "powerful feature".

Anticipating possible reactions: Yes, I know regexes are indeed very powerful, but I still think it's pathetic to rely on them completely instead of offering a usable function to translators (which most of us are) for the relatively straightforward cases we usually have, and maybe adding the option of regexes for those who want to dig deeper. Or at least offering some preprogrammed regexes for the most frequent use cases.

Sorry, rant over, I know this doesn't help you.

To answer your questions:

JW Narins wrote:
Questions:
1) Is there a list of these codes somewhere I can refer to so I can see what I'm doing here?


A simple list won't do, I'm afraid. These codes are called regular expressions, or regex for short. It's basically a kind of programming language. Some basic info can be found on Wikipedia (https://en.wikipedia.org/wiki/Regular_expression). If you want to dig deeper, you could start here: http://www.regular-expressions.info/tutorial.html

To make things worse: There seem to be different flavours of regex, and Studio seems to use a rather peculiar one. However, this isn't really documented anywhere, so there's a lot of trial and error involved, even for SDL staff.

JW Narins wrote:
2) Is there a list of available rules, like the Full Stop Rule, that actually tells me exactly what they consist of? That would be nice!


I doubt it. If you know your regex, you don't need a list, and if you don't know regex, such a list isn't going to help you much. Again: http://www.regular-expressions.info/tutorial.html

JW Narins wrote:
3) Can I take something like hard line breaks out of my segmentation rules to avoid the problem where you can't unite segments over a paragraph break?


No, you can't. Users have been begging for this to be made possible for years, across several major releases of the software, but to no avail. In my opinion, this is the single biggest productivity killer in Studio.



[Edited at 2015-11-21 00:50 GMT]


 

JW Narins  Identity Verified
United States
Local time: 11:28
Member (2015)
Russian to English
+ ...
TOPIC STARTER
Awesome Nov 21, 2015

Well, that was an awesome response.

Yes, Studio 2015 does, in fact, have exactly that feature, a window where I can click options like: "Select segmentation characters: [ ] full stop [ ] colon [ ] soft return [ ] hard return [ ] (list of other characters)" etc., possibly with further options like "not before: " or "only after: " etc.

BUT it only allows us to use 5 (I believe it's 5) options as segment endings. Would it really be so hard to let us choose any damned character(s) we want from a chart?

So - my only remaining question (that occurs to me right now) is this: is there a way for me to indicate characters to take as segment endings OTHER than the mere 5 options they give us?

I never understood why programs prohibit users from making manipulations they want to make - if users make them, fine; if it turns out to be a bad idea, they could undo them (here, simply split the segment again). Once Studio has the text for two segments, why would they have any opinion at all as to whether or not they should be merged? Once it's in the editor, it's just split-up text, after all. Baffling.


 

Nora Diaz  Identity Verified
Mexico
Local time: 09:28
Member (2002)
English to Spanish
+ ...
Some examples of what you'd like to do? Nov 21, 2015

Hi JW,

What kind of source document are you dealing with?

Hard returns/paragraph breaks cannot be eliminated from Studio's segmentation rules, but if you have a Word file, for example, you could replace your hard line breaks with soft line breaks and then you'd be in a better position to achieve what I think you want to do.


A few notes about segmentation and regex in Studio:

- Studio uses the .NET flavor of regex, in case you want to look up some info about regex.

- The full stop rule in Studio simply says that a period is interpreted as a segment break when it is not followed by a lowercase letter or by any of the Unicode characters listed in the rule. For example, U+002C is a comma. You can check a list of Unicode characters here: http://unicode-table.com/en/#control-character, although that's probably beyond what you need at this point.
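
To make the idea behind the full stop rule concrete, here is a rough sketch in Python. It is only an approximation for illustration: Python's `re` module is not the .NET flavor Studio uses, the pattern is not copied from Studio, and the names `BREAK_AFTER_PERIOD` and `split_sentences` are made up for this example. It breaks after a period followed by whitespace, unless the next character is a lowercase letter.

```python
import re

# Illustrative approximation only, NOT Studio's actual rule.
# Break point: whitespace that comes right after a period,
# unless the text continues with a lowercase letter (e.g. "approx. five").
# The \s in the lookahead stops the regex from backtracking into a
# shorter run of whitespace to sneak past the lowercase check.
BREAK_AFTER_PERIOD = re.compile(r"(?<=\.)\s+(?![a-z\s])")

def split_sentences(text):
    """Split text into segments at the break points defined above."""
    return [part for part in BREAK_AFTER_PERIOD.split(text) if part]

segments = split_sentences("It costs approx. five dollars. Buy it now.")
for segment in segments:
    print(segment)
```

The period in "approx." does not end a segment because a lowercase letter follows it, while the period after "dollars" does.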

But let me say that you would be spinning your wheels if you try to solve your issue with a segmentation rule, unless you can first make some changes to your source file to remove unnecessary hard breaks.

I think if you could share some examples with the forum, there are several people here who would be able to give you some specific ideas.

To answer your last question, yes, you can definitely set any character you want as a segment break point, via the Advanced window. For example, I have a rule to segment on bullet points for some files I get often that have bulleted lists presented horizontally (on the same line).
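
As a hedged illustration of the bullet-point case, the sketch below uses Python's `re` module rather than Studio's Advanced window, so the syntax may differ from what you would type into Studio. It assumes the bullet is the Unicode character U+2022 and that each bullet after the first is preceded by whitespace; the names `BREAK_BEFORE_BULLET` and `split_on_bullets` are hypothetical.

```python
import re

# Assumption: the bullet is U+2022 and whitespace precedes every
# bullet except possibly the first one on the line.
# The break is placed just before each bullet; the whitespace in
# front of the bullet is consumed by the split.
BREAK_BEFORE_BULLET = re.compile(r"\s+(?=\u2022)")

def split_on_bullets(line):
    """Split a horizontally laid out bulleted list into one item per segment."""
    return BREAK_BEFORE_BULLET.split(line)

items = split_on_bullets("\u2022 Apples \u2022 Pears \u2022 Plums")
for item in items:
    print(item)
```

Each bullet stays attached to the item that follows it, which is usually what you want in the translation editor.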


 

JW Narins  Identity Verified
United States
Local time: 11:28
Member (2015)
Russian to English
+ ...
TOPIC STARTER
segmentation issue Nov 21, 2015

Thanks, as always!

The issue is, of course, with the formatting of the original document. The problem, though, is that I have to reproduce that formatting. If I change the original, then Studio's output will reproduce only my changed version, right?

As for rule creation - I understand what you're saying.
I need to think about segmentation rules that might give me phrases more than sentences. Has anyone successfully done that?


 

Nora Diaz  Identity Verified
Mexico
Local time: 09:28
Member (2002)
English to Spanish
+ ...
Soft returns Nov 21, 2015

In terms of formatting, in Word for example, a soft return does the same thing as a hard return, so replacing hard returns with soft returns wouldn't have a visible effect on the format of your document, but it would definitely be helpful for what you're trying to do, as Studio doesn't segment on soft returns out of the box.

To your question, if you have a specific word or character at the end of a phrase that you want to segment on, then yes, you can segment on that. You would need to add a segmentation rule that tells Studio that word or phrase X comes right before the break, or right after the break, so for example, you could add a segmentation rule to create a new segment whenever the word "and" appears, or any other word or phrase, for that matter.
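
Here is a small sketch of the "before the break" and "after the break" idea, using the word "and" from the example above. It is written with Python's `re` module purely for illustration; Studio's rule editor takes .NET regex and its own before/after fields, so treat the patterns and the function names `break_before_and` and `break_after_and` as assumptions rather than something to paste into Studio.

```python
import re

# Break BEFORE the word "and": split at whitespace that is
# immediately followed by the whole word "and".
BEFORE_AND = re.compile(r"\s+(?=and\b)")

# Break AFTER the word "and": split at whitespace that is
# immediately preceded by the whole word "and".
AFTER_AND = re.compile(r"(?<=\band)\s+")

def break_before_and(text):
    return BEFORE_AND.split(text)

def break_after_and(text):
    return AFTER_AND.split(text)

before = break_before_and("salt and pepper and sandwiches")
after = break_after_and("salt and pepper")
print(before)
print(after)
```

The `\b` word boundaries keep words such as "sandwiches", "android" or "band" from triggering a break.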


 

JW Narins  Identity Verified
United States
Local time: 11:28
Member (2015)
Russian to English
+ ...
TOPIC STARTER
Understood - one last question Nov 21, 2015

That all makes sense.

Last question: I'm about 100 pages into a 600-page document.
If I reformat the word file... well, I guess I could do that, keep the part I've already translated, and translate the rest from that point on.

If I change segmentation rules, is there any way to apply that to a file in progress - or better yet, to the as-yet-untranslated portion of a file in progress?


 

Walter Blaser  Identity Verified
Switzerland
Local time: 17:28
French to German
+ ...
Segmentation occurs when the document is prepared Nov 21, 2015

JW Narins wrote:
If I change segmentation rules, is there any way to apply that to a file in progress - or better yet, to the as-yet-untranslated portion of a file in progress?


JW

The segmentation rules are applied when the document is converted from the source format to the bilingual SDLXLIFF format. This means that changing the segmentation rules will not have any effect on an existing project, because your document is already segmented. In order to change the segmentation, you need to convert the source file again, which will generate an SDLXLIFF file with different segmentation.

Walter


 

Nora Diaz  Identity Verified
Mexico
Local time: 09:28
Member (2002)
English to Spanish
+ ...
A copy of the file Nov 21, 2015

Like Walter says, only newly added documents will be affected, so what I would do is make a copy of the source file, add a suffix to the name, something like "New Segmentation", and add it to the project once you've created the new segmentation rules, then process this new file without touching the file you had originally added. That way you can get the benefit for the remainder of the document and keep what you've done so far.

In addition to this, I would also create a separate TM only for the new segmentation rules. I have a TM called simply Segmentation that I use for this. I would apply the new segmentation rules to this TM and add it to the project, making sure it appears before the other TMs to ensure that the rules of this new TM are applied.



[Edited at 2015-11-21 19:52 GMT]


 

Nora Diaz  Identity Verified
Mexico
Local time: 09:28
Member (2002)
English to Spanish
+ ...
A sample segmentation rule Nov 25, 2015

For those who may be interested, here's an example of how to add custom segmentation rules:

http://noradiaz.blogspot.mx/2015/11/beyond-punctuation-creating-custom.html


 

