Pages in topic:   [1 2] >
Fuzzy glossary matches in OmegaT (and other feature ideas)
Thread poster: Michael Mestre

Michael Mestre
France
Local time: 07:37
English to French
+ ...
Sep 29, 2009

(I know that some of OmegaT's developers hang out on this forum, so I am secretly hoping that they will also read this post.)

OmegaT's glossary is quite basic; in particular, there are a few features that people would like to be added (fuzzy matches being the first one that comes to my mind, but I can also think of keyboard-driven copy/paste features from the glossary into the editor).

Since there is most probably already a fuzzy matching engine behind the TM, I was wondering how complicated it would be to implement a fuzzy matching feature for the glossary.
I know that a plug-in has been released recently that enables the user to write instructions for OmegaT into the glossary file, but it is currently not possible to make the program understand (from an unmodified glossary file) that "houses" is the plural of "house" even though the spelling is very close and would have produced a very good match in the TM.
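To make the "houses" versus "house" idea concrete, here is a minimal sketch of character-level fuzzy matching against glossary terms. It is not OmegaT code: the function names, the sample glossary, and the 0.8 threshold are all assumptions chosen for illustration, and the similarity measure is simply the one from Python's standard `difflib` module.

```python
from difflib import SequenceMatcher


def char_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio based on matching characters."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_glossary_lookup(word, glossary, threshold=0.8):
    """Return (term, translation, score) for glossary terms close to `word`.

    `threshold` is an arbitrary illustrative cut-off, not an OmegaT setting.
    """
    hits = []
    for term, translation in glossary.items():
        score = char_similarity(word, term)
        if score >= threshold:
            hits.append((term, translation, score))
    # Best match first.
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


# Hypothetical two-entry glossary.
glossary = {"house": "maison", "horse": "cheval"}
print(fuzzy_glossary_lookup("houses", glossary))
```

With this measure "houses" matches "house" closely while "horse" falls under the threshold. A short word with a long ending would score much lower, which is one reason character-level matching alone is not a complete answer.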

I was wondering if there are other users here who would love to see improvements in OmegaT's glossary features, and who might donate some money to the project with this objective in mind.


 

Laurent KRAULAND  Identity Verified
France
Local time: 07:37
French to German
+ ...
I can always donate... Sep 29, 2009

provided this thread gets some replies from authorised parties.

 

Michael Mestre
France
Local time: 07:37
English to French
+ ...
TOPIC STARTER
Cayman Islands Sep 29, 2009

Laurent, thanks a lot for your idea - it hadn't crossed my mind, but I am now considering collecting all the donations from the very rich OmegaT lobby members and then fleeing to the Cayman Islands with suitcases full of dollars. No need to translate anymore, not even from the wifi-enabled sunny hotel pool.

No, seriously, I had not even suggested that I would volunteer to collect any money. I don't want trouble with the tax authorities, no thank you.


 

Susan Welsh  Identity Verified
United States
Local time: 01:37
Member (2008)
Russian to English
+ ...
?? Already exists Sep 29, 2009

I am very far from being a developer, but I can tell you that the fuzzy match feature already exists for TMs. For glossaries, the OmegaT-tokenizers plug-in (which can be downloaded from SourceForge, along with OmegaT itself) does "stemming," which means it can recognize inflected forms of words (house = houses = housing). It also has a "stop word" function, which means it ignores very common words like "and" and "the" when figuring out what is a match for the TM.
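As a rough illustration of stemming plus stop words, here is a small sketch. It does not use Lucene or the actual tokenizers plug-in: `toy_stem`, the stop word list, and the sample glossary are hypothetical stand-ins, and a real stemmer is far more careful than this suffix stripping.

```python
import re

# Tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}


def toy_stem(word: str) -> str:
    """Crude suffix stripping, for illustration only."""
    w = word.lower()
    for suffix in ("ing", "es", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            w = w[: -len(suffix)]
            break
    # Drop a final "e" so that "house" and "housing" share the stem "hous".
    if w.endswith("e") and len(w) > 3:
        w = w[:-1]
    return w


def content_stems(text: str) -> list:
    """Split text into words, drop stop words, and stem what is left."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    return [toy_stem(w) for w in words if w not in STOP_WORDS]


def glossary_hits(segment: str, glossary: dict) -> dict:
    """Return glossary entries whose stems all occur in the segment.

    Multi-word terms are matched as a bag of stems, not as a phrase.
    """
    segment_stems = set(content_stems(segment))
    hits = {}
    for term, translation in glossary.items():
        term_stems = content_stems(term)
        if term_stems and all(s in segment_stems for s in term_stems):
            hits[term] = translation
    return hits


glossary = {"house": "maison", "garden": "jardin"}
print(glossary_hits("The houses and the housing market", glossary))
```

Here "house", "houses" and "housing" all reduce to the same stem, so the glossary entry for "house" is found, while "the" and "and" never take part in the comparison.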

I suggest you use the latest, beta, version of OmegaT, which is 2.0.4. It has some great new features, including access within your document to dictionaries and to (gasp) Google Translate. And other stuff that I don't understand.

I also suggest you join the OmegaT group on Yahoo, which is where these issues get thrashed out. See omegat.org for the links to all of the above.

I'm sure the pros will see your post and have more to say, but this is for starters.

Have fun,
Susan


 

Michael Mestre
France
Local time: 07:37
English to French
+ ...
TOPIC STARTER
Tokenizers Sep 29, 2009

Susan, thank you for listing all of the tokenizers' features, because quite frankly I hadn't tried the plug-in.
I have now installed it. Here are my comments:

I like: It recognizes plurals and some conjugated verbs in French (I haven't tested extensively, but it seemed to work).

I don't like:
1) You have to choose your source language by modifying the script that runs OmegaT-tokenizers. Not very hard, but not exactly user-friendly either - it could guess the language itself from the project description, right?
2) It doesn't have Turkish, my second source language (which would not be a problem with fuzzy matches)

In conclusion, while the tokenizers do help me to some extent (for half of my source languages at least), I still think that the glossary should/could be improved.

PS: this is in no way a criticism of the work of the OmegaT team! Not at all! On the contrary, they have done a great job - I love this software so much that I would not consider using another CAT tool, and I am ready to donate money.. yes, really..

PPS : Susan, thanks for your answer to my other question about OmegaT


 

Samuel Murray  Identity Verified
Netherlands
Local time: 07:37
Member (2006)
English to Afrikaans
+ ...
Glosspaste (requires Windows) Sep 29, 2009

Michael Mestre wrote:
OmegaT's glossary is quite basic; in particular, there are a few features that people would like to be added (fuzzy matches being the first one that comes to my mind, but I can also think of keyboard-driven copy/paste features from the glossary into the editor).


Occasionally some OmegaT users write little programs to do what OmegaT currently can't do. One of these things is the ability to insert glossary matches into the target field using a keyboard shortcut. A program that does this, called Glosspaste, was recently updated to work with the latest version of OmegaT. It requires Microsoft Windows. Read the message on the OmegaT mailing list here:

http://tech.groups.yahoo.com/group/OmegaT/message/15567


 

Laurent KRAULAND  Identity Verified
France
Local time: 07:37
French to German
+ ...
My suggestion Sep 29, 2009

Allowing some kind of "auto-suggest", akin to the one which is implemented in OO.org!

 

Andrei Eduard Kukucska
Romania
Local time: 08:37
Romanian to Slovak
+ ...
OmegaT feature suggestions: Sep 29, 2009

Some nice features would be:

1. better TM compatibility with other CAT tools (ability to create TMs that can be easily used by other CAT tools and the ability to use more TM types created by other CAT tools)

2. easier way of adding new words to the glossary (like from inside the project / program)

3. integration of machine translation tools (the Google translator would be a nice addition)

4. ability to recognise inflected forms of words from the glossary

5. support of the industry standard glossary format TBX proposed by LISA


 

Didier Briel  Identity Verified
France
Local time: 07:37
Member (2007)
English to French
+ ...
Tokenizers Sep 29, 2009

Michael Mestre wrote:
I don't like :
1) You have to choose your source language by modifying the script that runs OmegaT-tokenizers. Not very hard, but not exactly user-friendly either - it could guess the language itself from the project description, right?


It could; (nearly) everything is possible, given sufficient time and resources.
Actually, tokenizer usage is complicated quite a bit by legal matters, since Lucene's license (Lucene being the engine behind the tokenizers) is not directly compatible with OmegaT's.

In addition, since there can also be several tokenizers for a given language, a user interface would be necessary to allow choosing the preferred one.


2) It doesn't have Turkish, my second source language

There might be new Lucene developments on Turkish.
We'll check.


(which would not be a problem with fuzzy matches)

For glossaries, fuzzy matches (at least as they are used by OmegaT, working on full words) are next to useless. They don't allow finding plurals, etc.
Stemming was the best solution.
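A small sketch may help show why matching on full words behaves so differently for sentences and for single glossary terms. The code below is an assumption about how a word-level comparison could work, not OmegaT's actual algorithm; `token_similarity` and `crude_stem` are hypothetical names, and `crude_stem` only strips a final "s".

```python
def token_edit_distance(a, b):
    """Levenshtein distance where the units are whole tokens, not characters."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete a token
                            curr[j - 1] + 1,      # insert a token
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


def token_similarity(a, b):
    """Return a 0..1 score: 1 minus the distance over the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - token_edit_distance(a, b) / longest


def crude_stem(word):
    """Strip a final 's' only; a stand-in for a real stemmer."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def stems(tokens):
    return [crude_stem(t) for t in tokens]


# A sentence with one differing word still scores reasonably well...
print(token_similarity("the red houses".split(), "the red house".split()))
# ...but a one-word glossary term either matches fully or not at all.
print(token_similarity(["house"], ["houses"]))
# After stemming, both sides become identical tokens.
print(token_similarity(stems(["house"]), stems(["houses"])))
```

In a three-word segment a single changed word costs one third of the score, but for a one-word glossary term the same change costs everything, so stemming both sides before comparing is what lets the plural be found.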

Didier


 

Michael Mestre
France
Local time: 07:37
English to French
+ ...
TOPIC STARTER
Andrei : some of these features exist Sep 29, 2009

Hi Andrei,

Some comments on your suggestions:

1) OmegaT already creates TMs in the ultra standard TMX format. No need for anything else, I guess.

2) Yes, this would be very nice.

3) This has been integrated in the latest version (2.0.4)

4) This exists already (tokenizers plugin), but it is still not very user-friendly (see the first posts of this thread)

5) Yes, why not.. is it a plain-text format (XML or something)?


 

Michael Mestre
France
Local time: 07:37
English to French
+ ...
TOPIC STARTER
Thanks for your explanations, Didier Sep 29, 2009

Hi Didier, thanks for your explanations. I didn't know that fuzzy matching was useless for glossary terms.

By the way, how do search engines (Google & friends) work? In general, they are quite good at finding words with typos and plurals/suffixes.


 

Didier Briel  Identity Verified
France
Local time: 07:37
Member (2007)
English to French
+ ...
Quite a number of things are already done Sep 29, 2009

Andrei Eduard Kukucska wrote:

Some nice features would be:

1. better TM compatibility with other CAT tools (ability to create TMs that can be easily used by other CAT tools and the ability to use more TM types created by other CAT tools)

Since the TMX format works with pretty much every CAT tool, we currently have no plans to support other, proprietary formats.


2. easier way of adding new words to the glossary (like from inside the project / program)

A partial improvement, since 2.0.3, is that changes in glossaries are detected automatically, so there is no need to reload the project. If you leave the glossary file open, add a word and save, the change is taken into account automatically.
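For readers curious how this kind of automatic detection can work in general, here is a minimal sketch that reloads a glossary when the file's modification time or size changes. It is not how OmegaT implements it: `GlossaryWatcher` and `load_glossary` are hypothetical names, and the tab-separated layout (source term, then target term) is assumed for the example.

```python
import os


def load_glossary(path):
    """Read tab-separated 'source<TAB>target' lines into a dict."""
    entries = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip("\n").split("\t")
            if len(parts) >= 2 and parts[0]:
                entries[parts[0]] = parts[1]
    return entries


class GlossaryWatcher:
    """Reload a glossary file whenever it appears to have changed on disk."""

    def __init__(self, path):
        self.path = path
        self.entries = {}
        self._signature = None  # (modification time, size) at last load

    def _current_signature(self):
        info = os.stat(self.path)
        return (info.st_mtime_ns, info.st_size)

    def refresh(self):
        """Reload if the file changed; return True when a reload happened."""
        signature = self._current_signature()
        if signature == self._signature:
            return False
        self.entries = load_glossary(self.path)
        self._signature = signature
        return True
```

An application would call `refresh()` periodically, or just before looking up terms for a new segment, so that a word added and saved in a text editor shows up without reopening the project.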


3. integration of machine translation tools (the Google translator would be a nice addition)

Available since 2.0.4.


4. ability to recognise inflected forms of words from the glossary

Available since 2.0.1, see discussion above on tokenizers.

Didier


 

Didier Briel  Identity Verified
France
Local time: 07:37
Member (2007)
English to French
+ ...
Lucene Sep 29, 2009

Michael Mestre wrote:
By the way, how do search engines (Google & friends) work? In general, they are quite good at finding words with typos and plurals/suffixes.


While I don't know exactly what Google is using, Lucene is rather widely used.

Didier


 

Michael Mestre
France
Local time: 07:37
English to French
+ ...
TOPIC STARTER
Apache licence Sep 29, 2009

Hmm, so it seems that Lucene is using the Apache license.

I guess that you have already investigated the issue - I will always be puzzled by these issues of license incompatibility between open source software.
The Apache license's Wikipedia page says that it is compatible with GPL 3; is there a problem with OmegaT's own license, or is it the modules that it uses?

Anyway, all this seems very complicated - I also saw that there are some issues with a fork of OmegaT and that people are arguing on forums about GPL licenses. I will try not to get involved in these debates.

In any case, I hope that you will be able to use Lucene's functionalities inside OmegaT eventually.
Michael


 

Didier Briel  Identity Verified
France
Local time: 07:37
Member (2007)
English to French
+ ...
OmegaT is GPL v2 Sep 30, 2009

Michael Mestre wrote:
The Apache license's Wikipedia page says that it is compatible with GPL 3; is there a problem with OmegaT's own license, or is it the modules that it uses?

OmegaT is GPL v2, which is not supposed to be compatible with the Apache license.
To quote the Apache Software Foundation:
"Despite our best efforts, the FSF has never considered the Apache License to be compatible with GPL version 2"

In any case, I hope that you will be able to use Lucene's functionalities inside OmegaT eventually.
Michael

That's what we already do with the tokenizers (based on Lucene), albeit a little indirectly. We have re-licensed a very small part (a few hundred lines) of OmegaT under GPL v3, thus providing legal compatibility. That's why we distribute this part of OmegaT separately.

Didier


 