Useful segmentation rules for Trados Studio and memoQ
Thread poster: Bogdan Dusa

Bogdan Dusa  Identity Verified
Romania
Local time: 19:02
English to Romanian
+ ...
Jun 6, 2014

Hi all,

After doing some research on regular expressions (regex), and with help from a good friend who is an expert in this field, I managed to create some useful segmentation rules that give a cleaner import of documents into projects, and I’d like to share them.

Let's say we have:
(a) This is a sentence.
1) This is a sentence.
B) This is a sentence
1.1 This is a sentence.
1.1This is a sentence

Normally, these would be imported as five segments, because there is no automatic numbering there. But it would be much more convenient to have them segmented as follows:
(a)
This is a sentence.
1)
This is a sentence.
etc. etc.

To do that, in Trados Studio:

Go to Project settings - Language Pairs - select your TM - Settings - Language Resources - Segmentation Rules - Edit - Add - Advanced view, and add each of the two expressions below as a separate rule in the Before break field, leaving the After break field empty:

^\(?[a-zA-Z0-9]+\)[\s\t]*

It means: Match, at the start of a segment, an optional left parenthesis, then one or more letters (lowercase or uppercase) or digits, then a right parenthesis, then zero or more whitespace characters. Note that \s already covers the tab character, so [\s\t] behaves the same as \s.

^\d{1,}\.\d{1,}[\s\t]*

It means: Match, at the start of a segment, one or more digits, then a dot, then one or more digits, then zero or more whitespace characters. (\d{1,} is equivalent to \d+.)
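
To see what these two expressions actually capture, here is a minimal sketch in Python. It assumes Python's re module behaves like the regex engine in Trados Studio for these simple patterns, which I have not verified beyond plain ASCII input; the helper name split_numbering is made up for the illustration.

```python
import re

# The two "Before break" expressions from the post, unchanged.
PAREN_NUMBERING = re.compile(r"^\(?[a-zA-Z0-9]+\)[\s\t]*")
DOTTED_NUMBERING = re.compile(r"^\d{1,}\.\d{1,}[\s\t]*")


def split_numbering(line):
    """Return (numbering, rest) if a rule matches at the start, else ("", line)."""
    for rule in (PAREN_NUMBERING, DOTTED_NUMBERING):
        match = rule.match(line)
        if match:
            # The break goes right after the matched prefix.
            return line[:match.end()].rstrip(), line[match.end():]
    return "", line


samples = [
    "(a) This is a sentence.",
    "1) This is a sentence.",
    "B) This is a sentence",
    "1.1 This is a sentence.",
    "1.1This is a sentence",
]
results = [split_numbering(s) for s in samples]
for numbering, rest in results:
    print(repr(numbering), "|", repr(rest))
```

Each sample line ends up as a numbering part and a text part, which is the split the rules are meant to produce on import.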

In memoQ:

Under the Resource console - Segmentation rule, create your own rule set or Clone and then Edit a default one, and add the two expressions below in the Rule field:

^\(?[a-zA-Z0-9]+\)[\s\t]*#!#

It means: Match, at the start of a segment, an optional left parenthesis, then one or more letters (lowercase or uppercase) or digits, then a right parenthesis, then zero or more whitespace characters, and apply a segment break there.

^\d{1,}\.\d{1,}[\s\t]*#!##cap#

It means: Match, at the start of a segment, one or more digits, then a dot, then one or more digits, then zero or more whitespace characters, then a capital letter (defined under the #cap# group), and apply the segment break before the capital letter.
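
A rough way to picture how the #!# marker and the #cap# condition work together is the Python sketch below. It is only an emulation, not how memoQ processes rules internally: it assumes the part before #!# must match, the part after #!# must follow, and #cap# can be approximated by [A-Z]. The function name break_position is hypothetical.

```python
import re

# ASSUMPTION: "#cap#" is approximated by [A-Z]; in memoQ it refers to a
# configurable list, so the real set of capital letters may differ.
CAP_CLASS = "[A-Z]"

RULE_PAREN = r"^\(?[a-zA-Z0-9]+\)[\s\t]*#!#"
RULE_DOTTED = r"^\d{1,}\.\d{1,}[\s\t]*#!##cap#"


def break_position(rule, text):
    """Return the offset where the rule would break the text, or None."""
    before, after = rule.split("#!#", 1)
    after = after.replace("#cap#", CAP_CLASS)
    # What follows the break marker is checked but not consumed (lookahead).
    pattern = before + ("(?=" + after + ")" if after else "")
    match = re.match(pattern, text)
    return match.end() if match else None


print(break_position(RULE_PAREN, "(a) This is a sentence."))
print(break_position(RULE_DOTTED, "1.1 This is a sentence."))
print(break_position(RULE_DOTTED, "1.1 million"))
```

Under these assumptions the second rule breaks after "1.1 " when a capital letter follows, but leaves "1.1 million" alone, because the text after the number starts with a lowercase letter.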

To make things even easier, in memoQ you have in addition the option of simply excluding any numbering a segment starts with.
From the original:
(a) This is a sentence.
You can simply import only
This is a sentence.

To do that, import the file using Import with options – Change filter and configuration – Add cascading filter. From the Filter drop-down menu, select Regex text filter, go to the Include/Exclude tab and, in the Rule field, add the same expression(s) as above, stopping before the “segment break here” marker (#!#):

^\(?[a-zA-Z0-9]+\)[\s\t]*
and/or
^\d{1,}\.\d{1,}[\s\t]*

Whether to add the second expression, which excludes segment starts such as 1.1 or 11.2, is up to you and depends on the context. Such a prefix may be paragraph numbering, or it may be an actual number (such as 1.1 million), in which case you don’t want to exclude it, as it has to be localized.
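
The trade-off with the second expression can be tried out in a few lines of Python. This is only a sketch of the exclusion idea using re.subn, not of memoQ's Regex text filter itself; strip_numbering is a made-up helper name.

```python
import re

PAREN_NUMBERING = r"^\(?[a-zA-Z0-9]+\)[\s\t]*"
DOTTED_NUMBERING = r"^\d{1,}\.\d{1,}[\s\t]*"


def strip_numbering(line, patterns):
    """Remove the first leading prefix matched by any of the patterns."""
    for pattern in patterns:
        stripped, count = re.subn(pattern, "", line, count=1)
        if count:
            return stripped
    return line


# Only the first expression: "(a)" is dropped, "1.1 million" is kept.
print(strip_numbering("(a) This is a sentence.", [PAREN_NUMBERING]))
print(strip_numbering("1.1 million people were affected.", [PAREN_NUMBERING]))

# Both expressions: the amount "1.1" is dropped too, which is usually unwanted.
both = [PAREN_NUMBERING, DOTTED_NUMBERING]
print(strip_numbering("1.1 million people were affected.", both))
```

With both expressions active, a sentence that begins with an amount loses its number, so it is worth checking the source file before adding the second one.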

Hope you find it useful!

Bogdan



Philippe Etienne  Identity Verified
Spain
Local time: 18:02
Member
English to French
Thanks for sharing Jun 6, 2014

I've been hearing about regular expressions for years and I have an idea of their usefulness, but I still haven't got round to sitting down and looking at them. This is a good incentive to try and understand what they actually are.

Thank you,

Philippe

[Edited at 2014-06-06 08:47 GMT]



Bogdan Dusa  Identity Verified
Romania
Local time: 19:02
English to Romanian
+ ...
TOPIC STARTER
You're very welcome! Jun 6, 2014

There is always a beginning.


Laura Harrison  Identity Verified
United Kingdom
Local time: 17:02
French to English
+ ...
Thank you for sharing! Jun 6, 2014

Will see if I can work it into my MemoQ codes


Chunyi Chen
United States
Local time: 09:02
English to Chinese
How to modify these segmentation rules so that... Jun 7, 2014

the numbers for bullet items can be excluded from importing to MemoQ?

Hi Bogdan,

I added the two rules you provided to the resource console and was able to exclude most of the number items in the file. The ones that failed to be excluded are:

1.(tab)text
2.(tab)text
3.(space space)text
...

As you can see, the source file format is not good, with some items using tab and others using manual space. Can you tell me how to modify one of your rules to exclude the numbers (plus the dots) above and only import the text part to MemoQ?

Thanks a lot for your help!

Chun-yi


Chunyi Chen
United States
Local time: 09:02
English to Chinese
problem solved Jun 7, 2014

I found out that by setting Tab to "Start new segment" in Document import settings, the item numbers can be separated from the main text. It looks like I don't need special segmentation rules to achieve this goal, so the problem is solved!

Chun-yi

[Edited at 2014-06-07 20:23 GMT]

[Edited at 2014-06-07 20:24 GMT]



Bogdan Dusa  Identity Verified
Romania
Local time: 19:02
English to Romanian
+ ...
TOPIC STARTER
Better solution Jun 7, 2014

Hi Chun-yi,

I do hope this is your first name.

Setting Tab to "Start new segment" is usually a good solution, but it can be tricky. Think of documents that contain intentional tabs used to align text from a single phrase across two or more rows, or unintentional tabs that are simply typing errors. You would have to check the file first.

There could be a better solution, i.e. you could change the Regex codes I indicated:

^\(?[a-zA-Z0-9]+\)*[\s\t]{1,}
Modified: * inserted after \)
Meaning: the numbering may or may not be followed by a right parenthesis (* allows zero or more of them; ? would allow zero or one)
Modified: {1,} after [\s\t]
Meaning: the space character or tab character repeats at least one time

^\d{1,}\.\d{0,}[\s\t]{1,}
Modified: {0,} after \.\d (instead of {1,})
Meaning: the dot character may or may not be followed by a digit
Modified: {1,} after [\s\t]
Meaning: the space character or tab character repeats at least one time

It should work for contexts like this:

(a)(tab)Text
(1)(tab)Text
1(tab)Text
1.(tab)Text
1(space space)Text
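
A quick way to check the listed contexts is the Python sketch below. It assumes Python's re module is close enough to memoQ's engine for these patterns and writes the (tab) and (space space) placeholders as real characters; matched_prefix is a made-up helper name.

```python
import re

# The two modified expressions, unchanged.
RELAXED_PAREN = re.compile(r"^\(?[a-zA-Z0-9]+\)*[\s\t]{1,}")
RELAXED_DOTTED = re.compile(r"^\d{1,}\.\d{0,}[\s\t]{1,}")

contexts = ["(a)\tText", "(1)\tText", "1\tText", "1.\tText", "1  Text"]


def matched_prefix(line):
    """Return the leading text matched by either expression, or None."""
    for rule in (RELAXED_PAREN, RELAXED_DOTTED):
        match = rule.match(line)
        if match:
            return match.group(0)
    return None


prefixes = [matched_prefix(c) for c in contexts]
for context, prefix in zip(contexts, prefixes):
    print(repr(context), "->", repr(prefix))
```

All five contexts are matched. Only "1.(tab)Text" needs the second expression; the other four are caught by the first one.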

[Edited at 2014-06-07 19:09 GMT]



Chunyi Chen
United States
Local time: 09:02
English to Chinese
Thank you Jun 7, 2014

Hi Bogdan,

Yes, that's my first name:)

Thank you so much for the modified regex. I will add them to MemoQ segmentation rules and see how the files turn out in the MemoQ grid. I did think of another question to ask: how would these files look when they are passed to the editor who does not have such segmentation rules in his/her MemoQ program? Once the MemoQ XLIFF files are imported to his or her MemoQ program, would the numbers still be separated from the main text as they were sent out?

Chun-yi


Bogdan Dusa  Identity Verified
Romania
Local time: 19:02
English to Romanian
+ ...
TOPIC STARTER
Yes, they are Jun 8, 2014

Hi Chun-yi,

You're welcome! Yes, the numbers are still separated, because you send him your imported file. Regardless of whether the editor has defined the same segmentation rules, he will only see what you send him.

Just to be on the safe side, I ran two tests, one with Export bilingual as memoQ XLIFF and another one with Export bilingual as Two-column RTF. The result was the same. The exported file contained only the plain text, excluding any numbering from the original file.

Bogdan



Chunyi Chen
United States
Local time: 09:02
English to Chinese
Hi Bogdan, Jun 8, 2014

Thank you so much for the additional information! I have decided to add these rules to the resource console in MemoQ. I was just adding the new rules but MemoQ told me it's invalid. Can you tell me if these are the ones I should add?

^\(?[a-zA-Z0-9]+\)*[\s\t]{1,}#!#
^\d{1,}\.\d{0,}[\s\t]{1,}#!##cap#

These rules didn't seem to do what they were supposed to do. I must have messed these up but don't know how to fix it.

Thank you again!

Chun-yi


Bogdan Dusa  Identity Verified
Romania
Local time: 19:02
English to Romanian
+ ...
TOPIC STARTER
Segmentation rules or Regex text filter? Jun 8, 2014

Hi Chun-yi,

I ran a test with your modified codes, and there is apparently nothing wrong with them as long as you use them under Segmentation rules. #!# means "segment break here".

Otherwise, if you want to use them under Regex text filter in order to exclude any numbering, delete the final part of each code (#!# and #!##cap# respectively), as it is not needed there.

Bogdan



Chunyi Chen
United States
Local time: 09:02
English to Chinese
segmentation rules Jun 8, 2014

Hi Bogdan,

Thank you for not giving up on me. I was adding the regex rules under segmentation rules.
The modified rules chopped up sentences, such as material[seg]sensitivity reactions, infection[seg] or allergic reaction.
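
A plausible cause, at least for the first break, is that once the right parenthesis becomes optional, the first modified expression matches any word followed by whitespace at the start of a line. The Python sketch below illustrates this under the assumption that "material" stood at the start of a line or segment in the file, which is a guess. The STRICTER pattern is a hypothetical variant for comparison, not something tested in memoQ.

```python
import re

ORIGINAL_PAREN = re.compile(r"^\(?[a-zA-Z0-9]+\)[\s\t]*")
RELAXED_PAREN = re.compile(r"^\(?[a-zA-Z0-9]+\)*[\s\t]{1,}")

# HYPOTHETICAL variant: require either a closing parenthesis or a trailing
# dot after digits, so that a bare word no longer counts as numbering.
STRICTER = re.compile(r"^(?:\(?[a-zA-Z0-9]+\)|\d+\.)[\s\t]+")

line = "material sensitivity reactions"
relaxed = RELAXED_PAREN.match(line)
original = ORIGINAL_PAREN.match(line)
stricter = STRICTER.match(line)

print(relaxed.group(0) if relaxed else None)   # relaxed rule takes "material "
print(original)                                # original rule does not match
print(stricter)                                # stricter variant does not match
print(repr(STRICTER.match("1.\ttext").group(0)))
```

The stricter variant still catches "1.(tab)text" and "(a)(tab)text", but it no longer matches a bare "1(tab)text", so it trades some coverage for fewer false breaks.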

Since I can separate numbers from text with MemoQ's existing feature (start as new segment), I will just use it for this project and come back to try these ones when I am more familiar with regex.

Chun-yi


Bogdan Dusa  Identity Verified
Romania
Local time: 19:02
English to Romanian
+ ...
TOPIC STARTER
Everybody learns Jun 8, 2014

Hi Chun-yi,

No problem, we all learn from each other here. How do you think I started with regex myself?

Anyway, if you want to dig further, you can take a look at these sites (among many others):

http://www.regular-expressions.info/

http://www.jedit.org/users-guide/regexps.html

http://www.dreambank.net/regex.html

Good luck!

Bogdan

