Segment-based and sentence-based segmentation?
Thread poster: Richard Hill

Richard Hill  Identity Verified
Mexico
Local time: 11:21
Member (2011)
Spanish to English
Jul 18, 2011

Hi.

My main point in this post is to find out people’s preferences (and the corresponding reasons) between paragraph-based and sentence-based segmentation. I would love to hear your opinions.
What prompted the question is that I was using sentence-based segmentation and my personalized abbreviations were getting segmented where they shouldn’t have been. So I went in to edit the segmentation, but the “Basic View” in “Edit Segmentation Rules” is grayed out (any ideas why?), and I don’t understand the advanced view, so I changed to paragraph-based segmentation to check out the results.

One thing that occurs to me is that sentence-based segmentation, which stores smaller TUs in the TMs, may throw up more results in future translations. This question arises out of reading, in the Trados guide, “If you intend to populate the TM by importing TUs, the rule should match the rule used to create the imported TUs”. I’m not sure I understand, but does this mean that if my imported TMs were created with the “sentence-based segmentation” rule and my project is set up with the “paragraph-based” rule, I won’t get such good results?

I’m trying to get some understanding of how this works because as far as I can see, having the right setup will throw up better results. Maybe I shouldn’t worry about it so much or maybe it is really important?

Any comments would be much appreciated.

Thanks all.

rich



Riccardo Schiaffino  Identity Verified
United States
Local time: 09:21
Member (2003)
English to Italian
Smaller segments result in more matches. Jul 19, 2011

Smaller segments (i.e. segmentation at the sentence level) will definitely result in many more matches.

Say that you have the following two paragraphs (taken, for simplicity's sake, from the posting rules here):

When posting about a CAT tool or other software program, include the product name, version, your operating system, and any other relevant details. This will avoid confusion and allow you to get or give better assistance. This forum is dedicated to providing and receiving support on software. If you are seeking information on training relating to software, please submit a support request.


and

If you are seeking information on training relating to software, please submit a support request. This forum is dedicated to providing and receiving support on software. When posting about a TM tool or other software program, include the product name, version, your operating system, and any other relevant details. This will avoid confusion and allow you to get or give better assistance.


If you are segmenting at the paragraph level, you will probably get no match at all, since the order of the sentences is so different. On the other hand, if you are segmenting at the sentence level, you'll get three sentences suggested as 100% matches, and one as a high fuzzy match (only one word changed).
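
The comparison above can be tried out with a rough sketch. It assumes a naive sentence splitter and uses Python's `difflib` ratio as a stand-in for a fuzzy-match score; neither is how Trados actually segments or scores, and the helper names are made up for illustration.

```python
import re
from difflib import SequenceMatcher

# The two example paragraphs from the post above.
PARA_A = (
    "When posting about a CAT tool or other software program, include the "
    "product name, version, your operating system, and any other relevant "
    "details. This will avoid confusion and allow you to get or give better "
    "assistance. This forum is dedicated to providing and receiving support "
    "on software. If you are seeking information on training relating to "
    "software, please submit a support request."
)
PARA_B = (
    "If you are seeking information on training relating to software, please "
    "submit a support request. This forum is dedicated to providing and "
    "receiving support on software. When posting about a TM tool or other "
    "software program, include the product name, version, your operating "
    "system, and any other relevant details. This will avoid confusion and "
    "allow you to get or give better assistance."
)


def split_sentences(text):
    """Naive rule (an assumption): break after . ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def similarity(a, b):
    """Character-level similarity in [0, 1]; a stand-in for a TM match score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


# Paragraph-level: the whole of paragraph B is looked up against paragraph A.
paragraph_score = similarity(PARA_B, PARA_A)

# Sentence-level: each sentence of B is looked up against the sentences of A.
memory = split_sentences(PARA_A)
new_sentences = split_sentences(PARA_B)
sentence_scores = [max(similarity(s, stored) for stored in memory)
                   for s in new_sentences]

print(f"paragraph-level score: {paragraph_score:.2f}")
for sentence, score in zip(new_sentences, sentence_scores):
    print(f"{score:.2f}  {sentence[:45]}...")
```

With sentence-level lookup, three of the four sentences score exactly 1.0 and the fourth is a near match, while the paragraph as a whole scores much lower.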



Richard Hill  Identity Verified
Mexico
Local time: 11:21
Member (2011)
Spanish to English
TOPIC STARTER
Problem with abbreviations in sentence based segmentation Jul 19, 2011

Riccardo Schiaffino wrote:
If you are segmenting at the paragraph level, you will probably get no match at all, since the order of the sentences is so different. On the other hand, if you are segmenting at the sentence level, you'll get three sentences suggested as 100% matches, and one as a high fuzzy match (only one word changed).


Makes PFS or should I say perfect sense. Thanks for your clear example.

I had applied sentence-based segmentation and changed to PBS because my personalized abbreviations were getting segmented where they shouldn’t have been. I had a long list of company names ending in S.A. de C.V., and they were all getting segmented after S.A., even though both "S.A." and "S.A. de C.V." are in my Abbreviation List in the TM settings. Any ideas as to why this is?

Thanks

rich



Richard Hill  Identity Verified
Mexico
Local time: 11:21
Member (2011)
Spanish to English
TOPIC STARTER
addendum Jul 19, 2011

I should add that s.a. de c.v. does not get segmented at "s.a." but S.A. DE C.V does get split at "S.A.", which seems totally illogical to me!
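
One possible explanation for the upper/lower case difference, sketched under a loud assumption: many sentence-break rules only fire when a full stop is followed by whitespace and then an uppercase letter. The rule, the `segment` function and the company name below are all hypothetical; the thread does not show Studio's actual rules.

```python
import re

# Hypothetical break rule: a full stop, then whitespace, then an uppercase letter.
BREAK = re.compile(r"(?<=\.)\s+(?=[A-ZÁÉÍÓÚÑ])")


def segment(text, abbreviations=()):
    """Split text at BREAK unless the token before the break is a listed abbreviation."""
    segments, start = [], 0
    for match in BREAK.finditer(text):
        preceding_token = text[start:match.start()].split()[-1]
        if preceding_token in abbreviations:
            continue  # listed abbreviation: suppress the break
        segments.append(text[start:match.start()])
        start = match.end()
    segments.append(text[start:])
    return segments


# Lowercase: "s.a." is followed by lowercase "de", so the rule never fires.
print(segment("Grupo Ejemplo s.a. de c.v."))
# Uppercase: "S.A." is followed by uppercase "DE", so the rule fires.
print(segment("GRUPO EJEMPLO S.A. DE C.V."))
# With "S.A." in the abbreviation list, the break is suppressed.
print(segment("GRUPO EJEMPLO S.A. DE C.V.", {"S.A."}))
```

Under that assumption the lowercase form is never a break candidate in the first place, so only the uppercase form depends on the abbreviation list being applied correctly.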


SDL Community  Identity Verified
United Kingdom
Local time: 17:21
English
Segmentation Jul 19, 2011

Hi Rich,

I took your text and opened it in a text file using the default segmentation rules in a TM created for es(MX) to en(US).



This doesn't seem to match your description. So, can you confirm the language pairs you are using and also give us some real examples of the actual strings you are translating?

Regards

Paul



Richard Hill  Identity Verified
Mexico
Local time: 11:21
Member (2011)
Spanish to English
TOPIC STARTER
confirmation Jul 19, 2011

SDL Support wrote:
Hi Rich,

I took your text and opened it in a text file using the default segmentation rules in a TM created for es(MX) to en(US).
This doesn't seem to match your description. So, can you confirm the language pairs you are using and also give us some real examples of the actual strings you are translating?
Paul


Hi Paul

My project and TMs are set up as es-MX to en-US. I hope these screenshots cover the info you need to check it out.

translated strings:


segmentation rules:


abbreviation list


thanks a lot for looking into this Paul

rich



SDL Community  Identity Verified
United Kingdom
Local time: 17:21
English
One thing missing... Jul 19, 2011

... an example of the source text before it's been segmented in Studio. Maybe also confirm the filetype you are using for this.

And finally, and possibly more importantly, what changes did you make to the segmentation rules? I can see they are in bold so this means you have edited something.

Thanks

Paul



Richard Hill  Identity Verified
Mexico
Local time: 11:21
Member (2011)
Spanish to English
TOPIC STARTER
source text .docx Jul 19, 2011

SDL Support wrote:

... an example of the source text before it's been segmented in Studio. Maybe also confirm the filetype you are using for this.


...ok Paul

It's just a list of companies in a .docx file.



SDL Support wrote:
And finally, and possibly more importantly, what changes did you make to the segmentation rules? I can see they are in bold so this means you have edited something.


As far as the changes go, precisely because of this problem I went in to try to edit the segmentation rules, but as the "basic view" is grayed out, I didn't understand how to edit them. I can't remember what I did exactly, but I have gone back and forth trying to sort out the problem, and resenting the rules to default. Since your last message I "Reset to Defaults" again, with exactly the same results.




Richard Hill  Identity Verified
Mexico
Local time: 11:21
Member (2011)
Spanish to English
TOPIC STARTER
hahaha Jul 20, 2011

rich. wrote:
"resenting the rules"

big time Freudian slip

[Edited at 2011-07-20 17:21 GMT]



Richard Hill  Identity Verified
Mexico
Local time: 11:21
Member (2011)
Spanish to English
TOPIC STARTER
any ideas?... Jul 21, 2011

..on this segmentation problem, Paul, anybody?


SDL Community  Identity Verified
United Kingdom
Local time: 17:21
English
Check your email? Jul 21, 2011

Hi Rich,

It was getting time consuming to do this through here so I emailed you yesterday first. If we can sort it then you can post the final resolution afterwards.

In a nutshell, I still don't have a problem here. The only part that failed to segment correctly for me was R.L. so I added this to the abbreviations and then all was well.

Let me know if you use a different address to the hotmail account you gave us.

Thanks

Paul
pfilkin@sdl.com



Richard Hill  Identity Verified
Mexico
Local time: 11:21
Member (2011)
Spanish to English
TOPIC STARTER
Thanks for trying... Jul 21, 2011

Hi Paul
I appreciate your efforts.
My email is the same.
thanks again
rich



SDL Community  Identity Verified
United Kingdom
Local time: 17:21
English
So my suggestion would be... Jul 21, 2011

... open the same document and create a new empty TM on the way. Check and see if you get the same results. If you have to make changes, only change the abbreviation list by adding any that are missing and leave the segmentation rules alone for this.

Regards

Paul



Richard Hill  Identity Verified
Mexico
Local time: 11:21
Member (2011)
Spanish to English
TOPIC STARTER
success at last! Jul 21, 2011

SDL Support wrote:
... open the same document and create a new empty TM on the way. Check and see if you get the same results. If you have to make changes, only change the abbreviation list by adding any that are missing and leave the segmentation rules alone for this.


Hey Paul.
Great work. This was bugging the hell out of me. I'm sure it was U2. I had actually already done what you suggest, "creating a new TM", but hadn't disabled the other TMs until I read your suggestion this morning, and hey presto! Not only does it work, but we can also see where the problem lies.

Basically, it is that abbreviations with spaces are not allowed, or their entry into the list may be allowed but the spaces somehow mess up other abbreviations.
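
If that finding holds, a quick check of an abbreviation list before pasting it into the TM settings might look like the sketch below. It is only an illustration: the function names are invented, the list is assumed to be available as plain strings, and nothing here touches Studio itself.

```python
def check_abbreviations(entries):
    """Separate space-free entries from entries that contain whitespace."""
    usable, suspect = [], []
    for entry in entries:
        if any(ch.isspace() for ch in entry):
            suspect.append(entry)  # e.g. "S.A. de C.V."
        else:
            usable.append(entry)
    return usable, suspect


def space_free_parts(entry):
    """Suggest replacements: the dotted pieces of an entry that contains spaces."""
    return [part for part in entry.split() if part.endswith(".")]


usable, suspect = check_abbreviations(["S.A.", "S.A. de C.V.", "R.L."])
print("keep:", usable)
for entry in suspect:
    print("replace", repr(entry), "with", space_free_parts(entry))
```

Here "S.A. de C.V." is flagged and the suggestion is to list "S.A." and "C.V." separately instead.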


I really appreciate the time you put into this.

rich

