"Stemmer"/fuzzy fuctionality in SDL MultiTerm?
Thread poster: Mpoma

Mpoma  Identity Verified
United Kingdom
Local time: 05:56
French to English
Apr 29, 2015

Dear all,

I have just switched over to SDL Studio 2014 after many years just using Trados 6.5 Workbench.

I have just set up SDL MultiTerm and have a question about how clever it is.

A year or so ago I wrote a Java (or in fact Jython) app which takes all the entries in my vocab database (25,000 entries and counting) and indexes them using Lucene. This is basically the kind of technology that search engines use (Lucene itself is the open-source library underlying Solr and Elasticsearch): in one type of index, words are identified and indexed on the basis of their grammatical "stems" (Lucene has a "stemmer module" for English and a different one for French); things like accents don't prevent identification either...

So if I enter "refere audience" (French is my source language) into the search box of my app it lists multiple entries in my vocab database where either "refere" or "référés" or "audience" or some similar term comes up. Lucene does a good job "scoring" hits in its indices: hits with both terms ("refere" and "audience") come at the top of the listing. I have made it so such keywords are also highlighted in different colours.
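As a rough illustration of the behaviour described above, here is a minimal Python 3 sketch. It is not Lucene and not the actual app: the accent folding uses the standard `unicodedata` module, `crude_stem` is a toy stand-in for a real stemmer, the score is simply the number of query terms an entry matches, and the three sample entries are made up.

```python
import unicodedata
from collections import defaultdict

def fold(text):
    """Lowercase and strip accents, so 'référés' becomes 'referes'."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def crude_stem(word):
    """Toy stemmer (assumption): drops a final plural 's' or 'x'."""
    if len(word) > 3 and word[-1] in "sx":
        return word[:-1]
    return word

def tokens(text):
    """Fold, split on non-alphanumeric characters, then stem each word."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in fold(text))
    return [crude_stem(word) for word in cleaned.split()]

def build_index(entries):
    """Map each stem to the set of entry numbers that contain it."""
    index = defaultdict(set)
    for entry_id, text in enumerate(entries):
        for tok in tokens(text):
            index[tok].add(entry_id)
    return index

def search(index, query):
    """Return entry numbers, entries matching the most query terms first."""
    scores = defaultdict(int)
    for tok in set(tokens(query)):
        for entry_id in index.get(tok, ()):
            scores[entry_id] += 1
    return sorted(scores, key=lambda i: (-scores[i], i))

# Made-up sample entries (French = English), purely for illustration.
entries = [
    "audience publique = public hearing",
    "audience de référé = hearing of an application for interim relief",
    "ordonnance de référés = interim order",
]
index = build_index(entries)
print(search(index, "refere audience"))  # entry 1 matches both terms, so it comes first
```

A real Lucene index scores hits far more subtly (term frequency, rarity of the term, and so on); the point here is only that folding and stemming happen both at indexing time and at query time.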

The problem is that to get such "stemmer index results" I have to copy a term to clipboard and then go to my app and paste. Not such a big deal, as I use autohotkey extensively, but obviously not quite as good as having a term base where entries just pop up in an SDL Studio window.

However, if the Studio-MultiTerm setup isn't capable of this sort of fuzzy matching it massively undermines its usefulness. I've done some googling and seemed to see one page which suggested that it can do this ... but only with the "Server edition".

Does anyone know what the situation is?


 

Dominique Pivard  Identity Verified
Local time: 07:56
Finnish to French
Tuning fuzziness in Studio Apr 29, 2015

Mpoma wrote:
I have just set up SDL MultiTerm and have a question about how clever it is.

Basically, all the "cleverness" of Studio with regard to terminology recognition and fuzziness boils down to this setting:

[Screenshot: 2015-04-29_1858.png]

Explanation from the help file:
The match value is the similarity, expressed as a percentage, between terminology in the source segment and the matching terminology found in a termbase. This value specifies the minimum acceptable similarity required for SDL Trados Studio to consider the terms to be a match. SDL recommends that you specify a value between 65% and 75%.
Everything else is a black box out of your control.
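To make the idea of a "minimum match value" concrete, here is a small Python 3 sketch. How Studio actually computes its similarity percentage is not documented in this thread, so the sketch substitutes the generic `difflib.SequenceMatcher` ratio from the Python standard library; the function names and the sample words are assumptions for illustration only.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Character-level similarity between 0.0 and 1.0 (not Studio's own measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def recognised_terms(segment_words, termbase, minimum=0.70):
    """Return (word, term, score) for every pair scoring at least `minimum`."""
    hits = []
    for word in segment_words:
        for term in termbase:
            score = similarity(word, term)
            if score >= minimum:
                hits.append((word, term, round(score, 2)))
    return hits

# 70% sits inside the 65%-75% range recommended in the help text.
hits = recognised_terms(["référés", "audience"], ["référé", "audience"])
for word, term, score in hits:
    print(word, "->", term, score)
```

With a pure character-similarity threshold like this, an inflected form is only recognised if it happens to stay close enough to the termbase entry, which is quite different from indexing on stems.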
Mpoma wrote:
A year or so ago I wrote a Java (or in fact Jython) app which takes all the entries in my vocab database (25,000 entries and counting) and indexes them using Lucene. This is basically the kind of technology that search engines use (Lucene itself is the open-source library underlying Solr and Elasticsearch): in one type of index, words are identified and indexed on the basis of their grammatical "stems" (Lucene has a "stemmer module" for English and a different one for French); things like accents don't prevent identification either...

So if I enter "refere audience" (French is my source language) into the search box of my app it lists multiple entries in my vocab database where either "refere" or "référés" or "audience" or some similar term comes up. Lucene does a good job "scoring" hits in its indices: hits with both terms ("refere" and "audience") come at the top of the listing. I have made it so such keywords are also highlighted in different colours.

The problem is that to get such "stemmer index results" I have to copy a term to clipboard and then go to my app and paste. Not such a big deal, as I use autohotkey extensively, but obviously not quite as good as having a term base where entries just pop up in an SDL Studio window.

I may be wrong, but I don't think you can integrate your home-grown terminology app and your termbase optimized for that app with Studio and MultiTerm.
Mpoma wrote:
However, if the Studio-MultiTerm setup isn't capable of this sort of fuzzy matching it massively undermines its usefulness. I've done some googling and seemed to see one page which suggested that it can do this ... but only with the "Server edition".

I don't think any "Server edition" of MultiTerm would integrate any differently with Studio (regarding terminology recognition) than the current edition you are using.

You may want to have a look at OmegaT, as it uses Lucene tokenizers for its terminology recognition. I think the approach is similar to that of your home-grown app. I made a video on it a couple of years ago. The tokenizer for Finnish didn't work too well, but the one for French may be much better.


 

Mpoma  Identity Verified
United Kingdom
Local time: 05:56
French to English
TOPIC STARTER
Kiitos / Merci! Apr 29, 2015

Thanks for your input: you are very knowledgeable about all this stuff... and yet you also appear to have time to go rowing... well done. Suggests a good work/life balance.

I looked at your video and think I shall have to do so again when I'm feeling slightly less frazzled. Stepping from Trados 6.5 and MS Office 2000 to SDL Studio 2014 and MS Office 2013 is quite a leap.

I've looked a couple of times at OmegaT and may do so again. But, last time I looked, it wasn't able to cope with marked-up text (bold, etc.), which to me seems pretty necessary. Secondly, on the couple of occasions when I've tried to get to grips with it, it seemed simply incapable of importing my mammoth "general" TM, which probably has about 400,000 segments in it.

I love the idea of open source, however, and I applaud the fact that they've decided to use Lucene indexing for their termbase module. In fact I yearn for the day when this (or another) open source app will supersede all the SDL Studio and other commercial stuff out there.

Term recognition using a percentage match, eh? Simple as that! Says it all, I'd say! I.e. about the design quality underpinning a certain "market-leading" TM app, although, should any lawyers ever read this, this is merely a personal opinion.

For the record, I wasn't expecting to be able to integrate my Lucene-based app with SDL Studio... it's enough of a sweat getting SS to work with the components designed specifically to work with it.



[Edited at 2015-04-29 17:30 GMT]


 

Meta Arkadia
Local time: 11:56
English to Indonesian
Please, tell me more Apr 30, 2015

Mpoma wrote:
A year or so ago I wrote a Java (or in fact Jython) app which takes all my entries in my vocab database (25,000 entries and counting) and indexes them using Lucene indices

I downloaded Lucene some time ago, mainly because I was fascinated by OmegaT's tokeniser (as demonstrated by our CATGuru). As far as I know, OT is the only CAT tool that offers tokenisation based on Lucene. I tried OT several times, and like you, I decided against it. I gave up on Lucene, because - with my limited skills - I couldn't see a way to make use of it. Your Java app may very well change that.

Cheers,

Hans


 

Meta Arkadia
Local time: 11:56
English to Indonesian
Short additional question(s) Apr 30, 2015

What does an entry created by your Java app look like? An example, please. And what's the file format of the Lucene-indexed database?

Cheers,

Hans


 

Mpoma  Identity Verified
United Kingdom
Local time: 05:56
French to English
TOPIC STARTER
thanks for the interest Apr 30, 2015

Hi Hans,

I have been a hobby programmer for decades, but despite this, understanding Lucene took a bit of research.

I'm happy to send you my app and see if it can be useful for you.

A few caveats, however:

My app is functional but I am not an expert in Lucene indexing in any way! In fact each entry (consisting of a database table row, with a French column and an English column in my case) is tokenised without any regard as to which words are English and which are French, so it's quite a blunt instrument... but it works!

In fact I have a book called "Elasticsearch: The Definitive Guide" which I bought earlier this year. Usually people don't use the Lucene Java classes directly: instead they use ES (which itself uses the Lucene jar files). An app along these lines would probably be my next step, although I think ES has quite a steep learning curve.

Each time you start it up, it checks the database table in question for any newly added rows... This is not quite ideal: you really want a mechanism which can detect not only rows which have been newly added, but also rows which have been changed since the last run. Having said this, you can always recreate both indices (stemmer and simple) from scratch with a single command, and this takes only a couple of seconds: 25,000 entries is hardly "Big Data".
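One possible way to detect changed rows as well as new ones is to keep a fingerprint of each row from the previous run and compare. The Python 3 sketch below is only an assumption about how this could be done, not how the app works: the function names, the use of SHA-1 from `hashlib`, and the sample rows are all hypothetical, and it also reports deleted rows, which the post does not ask for.

```python
import hashlib

def row_fingerprint(french, english):
    """Hash of a row's content; the separator keeps the two columns distinct."""
    data = (french + "\x1f" + english).encode("utf-8")
    return hashlib.sha1(data).hexdigest()

def diff_rows(rows, known):
    """Compare current rows with the fingerprints stored on the last run.

    rows:  {row_id: (french, english)} as read from the database now
    known: {row_id: fingerprint} saved by the previous run
    Returns (added, changed, deleted, fingerprints_to_save).
    """
    current = {rid: row_fingerprint(*cols) for rid, cols in rows.items()}
    added = sorted(rid for rid in current if rid not in known)
    changed = sorted(rid for rid in current
                     if rid in known and current[rid] != known[rid])
    deleted = sorted(rid for rid in known if rid not in current)
    return added, changed, deleted, current

# First run: nothing is known yet, so every row counts as added.
rows = {1: ("audience", "hearing"), 2: ("référé", "interim relief")}
added, changed, deleted, known = diff_rows(rows, {})

# Later run: row 2 edited, row 3 new, row 1 removed.
rows[2] = ("référé", "summary proceedings")
rows[3] = ("ordonnance", "order")
del rows[1]
added2, changed2, deleted2, known = diff_rows(rows, known)
print(added2, changed2, deleted2)
```

Only the rows listed as added or changed would then need re-indexing; with 25,000 entries a full rebuild is cheap anyway, so this matters more as the database grows.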

My app is written in Jython rather than Java. You may have to look up Jython: it is an implementation of the Python language that runs on the Java virtual machine, so it can use all the functionality available in Java. The syntax is far simpler and more economical than Java's, but it is less "typed": like Python it uses "duck typing" (if it quacks like a duck...). Difficult to explain unless you've tinkered with it a bit. I could perhaps think of rewriting it in Java. It would be nice to let other people make use of it... although it seems that OmegaT might already have the component you need ... (?)

My app links into a table in an Access database. You may need to access another database. Other things may go wrong, like which JRE (Java 7? Java 8?) your system uses. There may be "encoding" problems (I am not an expert in that, or anything!)... so any success you have getting my app to work may depend on your own IT skills!!!

If still interested, can you send me a message? I.e. some personal message to my Trados name, Mpoma? Is this possible?


 

Meta Arkadia
Local time: 11:56
English to Indonesian
Still interested, but... Apr 30, 2015

... I don't use Trados, nor Windows, and therefore no Access either (though I can open and convert Access/SDLTB files). I don't know how flexible Trados is, but their SDLTBs are Access files.

Cheers,

Hans


 

Mpoma  Identity Verified
United Kingdom
Local time: 05:56
French to English
TOPIC STARTER
Shouldn't be a problem Apr 30, 2015

Hi,

Are you a Linux user? Or a Mac user?

Either way something written in Java/Jython should work OK on your system.

Equally, connecting to one RDBMS or another (Access, MySQL or whatever) and extracting data from and editing certain tables shouldn't be a huge headache compared to many other technical problems I've had to overcome. Java's handling of the RDBMS connection (see the Java interface "java.sql.Connection"), once successfully made, should be fairly "agnostic"...

In short, if you are using MySQL (which I've experimented with), or are able to write a simple Java program where you manage to make a java.sql.Connection from your program to your RDBMS, I'd say that's the biggest headache overcome... after that I could potentially send you, bit by bit, the .java files as I start to rewrite the thing as a "pure Java" app. Initially therefore the first app I'd send you would hardly do anything, just be a test to make sure that stuff ran as expected on your system. Little by little I could then construct the Java app and each time send you a slightly more functional version...
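The kind of "does the connection work on your system" test described above might look like the following. This is a Python 3 sketch rather than Java: Python's DB-API plays roughly the role that java.sql.Connection plays in Java, and the standard `sqlite3` module is used so that it runs without any server. The table name `vocab`, its column names, and the sample rows are assumptions, not the layout of the actual database.

```python
import sqlite3

def fetch_entries(connection, table="vocab"):
    """Read (id, french, english) rows through a DB-API connection.

    The table name is interpolated directly, so it must come from trusted
    code, never from user input.
    """
    cursor = connection.cursor()
    cursor.execute("SELECT id, french, english FROM " + table + " ORDER BY id")
    return cursor.fetchall()

# In-memory database standing in for the real one (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vocab (id INTEGER PRIMARY KEY, french TEXT, english TEXT)"
)
conn.executemany(
    "INSERT INTO vocab (french, english) VALUES (?, ?)",
    [("audience", "hearing"), ("référé", "interim relief")],
)
conn.commit()

rows = fetch_entries(conn)
for row in rows:
    print(row)
conn.close()
```

Once the rows come back correctly (accents included), the rest of the app only needs the list of rows and does not care which RDBMS produced them.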

As you wish...


 

Meta Arkadia
Local time: 11:56
English to Indonesian
SQLite Apr 30, 2015

Mpoma wrote:
Are you a Linux user? Or a Mac user?

Mac.

Mpoma wrote:
In short, if you are using MySQL (which I've experimented with)

Me too, but the CAT tool I use isn't ready for it yet. I can now search (from within the tool) H2, HSQLDB, and SQLite, and I concentrate on the last of these.

Mpoma wrote:
As you wish...

I wish. Will send you a PM.

Thanks,

Hans


 

