How does segmentation affect AutoSuggest?
Thread poster: Richard Hill

Richard Hill  Identity Verified
Mexico
Local time: 21:22
Member (2011)
Spanish to English
Nov 10, 2011

Having asked for advice re segmentation, here:

http://www.proz.com/forum/sdl_trados_support/211651-adding_new_segmentation_rule_to_tm_in_studio_2011.html

this led me to wonder how AutoSuggest works. Is it similar to a TM? I am about halfway to having the 25,000 segments required to create a better AutoSuggest dictionary. I currently have one created from the massive EU TM (I guess I could duplicate my personal TM [13,000 segments] and create a dictionary from the resulting 26,000 segments), but anyway, my point is: how does AutoSuggest "decide" whether to suggest short phrases or just individual words? Does this depend on the segmentation of the TMs from which the dictionary is created?

The thing is, in the legal work I do, there are a lot of common short phrases separated by semicolons. So I wonder: if I took the trouble to create a certain number of TM segments at semicolon level and used these to create an AutoSuggest dictionary, would this kind of segmentation result in the dictionary throwing up these phrases (rather than just single words) when they appear in the target segment?

I need to understand how Autosuggest works to decide if it might be worth the work involved.

Any advice on this would be greatly appreciated.

BTW, I'm using Studio 2011

Thanks,

Rich

[Edited at 2011-11-10 21:54 GMT]


 

SDL Community  Identity Verified
United Kingdom
Local time: 04:22
English
Lots of effort... Nov 10, 2011

Hi Rich,

The questions we see on ProZ are amazing... from one end of the spectrum to the other... and over time sometimes from the same user ;-)

I had to take a little advice here because I don't know enough about this myself. Autosuggest isn’t really affected by segmentation, but the generation of ASDs and the quality of phrases is to some degree improved with shorter segments because the alignment is better.

Long phrases can be suggested if the source/target phrase pair could be extracted (using statistical means) from the input TM, and if the source part of the phrase pair occurs in the document segment. The target phrase can then be pulled from the ASD and displayed as a proposal. Usually AutoSuggest displays the longest phrase first, i.e. the phrase pair which has the longest match between document source and phrase source AND the longest associated target phrase.
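
As a rough illustration of the "longest phrase first" ordering described above, here is a minimal Python sketch. It is not how Studio is implemented: `PhrasePair`, `rank_proposals` and the sample phrase pairs are all made up for the example, and matching is done on whole lowercase words only.

```python
from typing import NamedTuple


class PhrasePair(NamedTuple):
    source: str  # source-language phrase (hypothetical dictionary entry)
    target: str  # target-language phrase paired with it


def occurs_in(phrase: list[str], segment: list[str]) -> bool:
    """True if the phrase appears as a contiguous run of words in the segment."""
    n = len(phrase)
    return any(segment[i:i + n] == phrase for i in range(len(segment) - n + 1))


def rank_proposals(segment: str, pairs: list[PhrasePair]) -> list[str]:
    """Return target phrases whose source side occurs in the segment, longest first."""
    seg_words = segment.lower().split()
    hits = [p for p in pairs if occurs_in(p.source.lower().split(), seg_words)]
    # Longest source match first, then longest target phrase (both counted in words).
    hits.sort(key=lambda p: (len(p.source.split()), len(p.target.split())), reverse=True)
    return [p.target for p in hits]


pairs = [
    PhrasePair("contrato", "agreement"),
    PhrasePair("el presente contrato", "this agreement"),
    PhrasePair("partes", "parties"),
]
print(rank_proposals("Las partes celebran el presente contrato", pairs))
```

The three-word source phrase is offered ahead of the single words because both its source match and its target are longer.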

The larger the input TM, the longer phrases can be extracted. Smaller TMs tend to result in shorter phrase pairs as it’s harder to align the words and phrases across the segments, as many don’t occur often enough.

Probably your best option is where you are already heading: switch to semicolon segmentation altogether from now on (at least for legal texts and the like, where you get a lot of long sentences with semicolons so you can take a breath ;-)) and then regenerate the ASD from time to time. So if you have example texts that you could reasonably easily turn into a TM to add to the one you use for your ASD, then it may indeed help.
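
To make the effect of a semicolon rule concrete, here is a small Python sketch of what breaking at semicolons does to a long legal sentence. This is a plain regular expression for illustration only, not Studio's own segmentation rule syntax; `split_segments` and the sample text are invented for the example.

```python
import re


def split_segments(text: str) -> list[str]:
    """Break after '.', '?', '!' or ';' when the mark is followed by whitespace."""
    parts = re.split(r"(?<=[.?!;])\s+", text.strip())
    return [p for p in parts if p]


text = (
    "The Lessee shall pay the rent; keep the premises in good repair; "
    "and return the keys. Notices shall be in writing."
)
for segment in split_segments(text):
    print(segment)
```

With the semicolon included in the break characters, the first sentence yields three short segments instead of one long one, which is the kind of unit that would then be stored in the TM.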

I hope that's a useful answer for you anyway Rich... and I'd be interested to know what you experience if you do give it a try..!

Regards

Paul


 

Richard Hill  Identity Verified
Mexico
Local time: 21:22
Member (2011)
Spanish to English
TOPIC STARTER
AutoSuggest leverage Nov 11, 2011

Hi Paul,

Thanks for your reply, which does help me understand a little better how AS works and you probably won't be surprised to see that it leads to more questions.

SDL Support wrote:
the generation of ASDs and the quality of phrases is to some degree improved with shorter segments because the alignment is better.


This certainly supports the use of semicolon segmentation.

SDL Support wrote:
The larger the input TM, the longer phrases can be extracted. Smaller TMs tend to result in shorter phrase pairs as it’s harder to align the words and phrases across the segments, as many don’t occur often enough.


Does this mean that I'd get more leverage from an ASD created from a TM of, say, 25,000 segments copied (say 50 times) and merged, than from an ASD created from a single instance of the same 25,000 units?

Also, is there any detriment to ASD or TM leverage from having a TM and/or ASD that was partly created under different rules? For example, say 10,000 segments were created using paragraph segmentation, the next 10,000 using sentence-based segmentation, and another 10,000 after adding the semicolon segmentation rule.

Also, in my limited understanding of how all this works, I'm still not 100% sure I should change my segmentation rules, because if I start opening my new documents with the new semicolon rule, I'm not sure how much TM leverage I will lose from the last few months' work done before I added the semicolon rule.

Would leaving my current TM (without the semicolon rule) and adding a copy of that TM with the new rule make any difference? I'm guessing not, as only the rules of the first TM in the list will apply.

Cheers!

Rich


 

SDL Community  Identity Verified
United Kingdom
Local time: 04:22
English
On your questions Nov 11, 2011

Hi Rich,

From the horse's mouth... I’m not sure about the first one. Just because you copy the TM you don’t get more information. It’s just more of the same. “Real” TUs should be better, but I’m not an expert in statistics.
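
The "more of the same" point can be shown with a toy calculation. The sketch below assumes, purely for illustration, that word association is scored with a Dice coefficient over co-occurrence counts; the statistics the real ASD extractor uses are not documented in this thread, and `dice_scores` and the sample TM are made up.

```python
from collections import Counter
from itertools import product


def dice_scores(tm: list[tuple[str, str]]) -> dict[tuple[str, str], float]:
    """Dice association between each source word and each target word seen together."""
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in tm:
        s_words = set(src.lower().split())
        t_words = set(tgt.lower().split())
        src_count.update(s_words)
        tgt_count.update(t_words)
        pair_count.update(product(s_words, t_words))
    return {
        (s, t): 2 * c / (src_count[s] + tgt_count[t])
        for (s, t), c in pair_count.items()
    }


tm = [
    ("el contrato", "the agreement"),
    ("el arrendador", "the lessor"),
    ("dicho contrato", "said agreement"),
]
# Copying the TM 50 times multiplies every count by 50, so the ratios do not move.
print(dice_scores(tm) == dice_scores(tm * 50))  # True
```

Under this assumption every count grows by the same factor, so the scores are unchanged. An extractor with absolute frequency thresholds could behave differently, but it would still learn nothing new from the copies.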

Re the second, if you change the segmentation rules, particularly if the segments get longer (paragraph), the ASD phrase extractor will get more “confused” as the segments get longer and it needs more material to associate source with target words, i.e. bigger TMs. Also performance of the ASD extraction will drop, so not such a good idea this way. If, however, you have smaller segments, then the ASD extraction has “less confusion” to cope with, which should lead to clearer phrases being extracted for small TMs.
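
One way to picture the "confusion" is to count how many source/target word pairings an extractor would have to weigh up. The Python sketch below assumes the simplest possible model, in which every source word in a unit is a candidate partner for every target word in the same unit. That model is an assumption for illustration, not a description of the actual extractor, and `candidate_pairs` and the placeholder words are invented.

```python
def candidate_pairs(tm: list[tuple[str, str]]) -> int:
    """Total source-word/target-word pairings when each TU is considered on its own."""
    return sum(len(src.split()) * len(tgt.split()) for src, tgt in tm)


# One long unit versus the same words split into three shorter units.
long_tu = [("a b c d e f", "u v w x y z")]
short_tus = [("a b", "u v"), ("c d", "w x"), ("e f", "y z")]

print(candidate_pairs(long_tu))    # 36
print(candidate_pairs(short_tus))  # 12
```

The same text split three ways leaves a third of the pairings to sort through, which fits the idea that shorter segments need less material to give clear phrases.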

I’m not sure whether I’d bother, though. For very small segments a termbase may be better, and if the legal texts you're working on can be split at semicolons, changing the segmentation rules may be the best approach.

Regards

Paul


 

Aude Sylvain  Identity Verified
France
Local time: 04:22
English to French
+ ...
using other AS providers (termbases, autotext) Nov 11, 2011

Hi Rich,

I can't comment on the ASD/segmentation issue, but I can share my experience of using the AutoSuggest feature. I also translate a lot of legal material, and make extensive use of the AS feature.

While I was very interested in the discussion above, I wouldn't bother too much on the segmentation of the TM either, as far as the ASD is concerned. Any long match (more than 5-6 words for instance) would normally appear as a match in the TM window provided you linked the appropriate TM with your project.

Speaking for myself, I generated my ASD from a TM with 80,000+ TUs, which includes a number of legal boilerplate phrases, and even then it's rather rare for me to get full sentences proposed as ASD hits.

What I do to make sure these boilerplate phrases are proposed by the AutoSuggest feature (i.e. so as not to miss them even if the TM concerned is not linked to the project I am working on) is to enter them on the fly, either in my termbase or as an AutoText item, when I come across them in texts.
Since AutoSuggest is based not only on the ASD but also on termbases and AutoText(*), this lets me get these hits through AutoSuggest with no need to spend time formatting my TM in a specific way solely in order to generate the ASD.
Admittedly you need to type them at least once, which is not the case when they are included in the ASD, but I think that overall this method is more time-efficient.

My post doesn't answer your initial query, but I hope it will help you get more precise AS suggestions.


(*) For this to work you must make sure that "Termbases" and "AutoText" are selected as AutoSuggest providers under Tools > Options > AutoSuggest.


[Edited at 2011-11-11 21:11 GMT]


 

Richard Hill  Identity Verified
Mexico
Local time: 21:22
Member (2011)
Spanish to English
TOPIC STARTER
Segmentation Nov 12, 2011

Hi Aude

Aude Sylvain wrote:
I wouldn't bother too much on the segmentation of the TM either, as far as the ASD is concerned.


Paul seems to think there may be some greater leverage from the ASD from having smaller segments in the TM:

SDL Support wrote:
If, however, you have smaller segments, then the ASD extraction has “less confusion” to cope with, which should lead to clearer phrases being extracted for small TMs.
Paul


So although I probably won't bother re-segmenting my previous work, I think I will apply a couple of new segmentation rules from now on (e.g. semicolon and dot dash), not only for reasons of ASD leverage, but also because, according to some tests I did, smaller segments seem to result in more TM hits and fewer words to translate, as I mentioned today in the other post: http://www.proz.com/forum/sdl_trados_support/211651-adding_new_segmentation_rule_to_tm_in_studio_2011.html

Cheers!

Rich


 

Aude Sylvain  Identity Verified
France
Local time: 04:22
English to French
+ ...
segmentation Nov 12, 2011

Hi Rich,

Indeed I read Paul's post too quickly, sorry.

My (mistaken) remark applied only to the ASD though, since this was the main issue you were addressing.
I fully agree that having shorter segments should result in more and better-quality TM hits. In fact I have verified this very often in my projects (e.g. by applying the semicolon rule "manually" when I align texts in order to generate a specific TM). If this is also likely to improve the quality of the ASD suggestions, as I now understand, that's good news!

Have a good evening!


 

Richard Hill  Identity Verified
Mexico
Local time: 21:22
Member (2011)
Spanish to English
TOPIC STARTER
Good news! Nov 12, 2011

Hi Aude,

Yeah, it is good news that with some tweaking and fine tuning we can improve TM and AutoSuggest leverage, and it's kind of interesting to see where and how to apply this fine tuning.

Have a good one too!

Rich


 

