Hunspell extractor
Thread poster: Samuel Murray

Samuel Murray  Identity Verified
Netherlands
Local time: 20:31
Member (2006)
English to Afrikaans
Apr 3, 2011

G'day everyone

Does anyone know of a tool that can convert a Hunspell spelling dictionary to a list of words? In other words, parse the AFF file along with the DIC file to get a list of words that don't have those affix codes attached to them.

Thanks
Samuel



[Edited at 2011-04-03 17:11 GMT]


 

FarkasAndras
Local time: 20:31
English to Hungarian
xml? Apr 4, 2011

Did you try opening the files with a text editor? It should be trivial to extract the list from the .dic, it seems.

From http://pwet.fr/man/linux/fichiers_speciaux/hunspell :

Hunspell(1) requires two files to define the language that it is spellchecking. The first file is a dictionary containing words for the language, and the second is an "affix" file that defines the meaning of special flags in the dictionary.

A dictionary file (*.dic) contains a list of words, one per line. The first line of the dictionaries (except personal dictionaries) contains the word count. Each word may optionally be followed by a slash ("/") and one or more flags, which represent affixes or special attributes. Default flag format is a single (usually alphabetic) character. In a Hunspell dictionary file, there is also an optional morphological field separated by tabulator. Morphological descriptions have custom format.
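
For illustration, the line format described in that excerpt could be read in Python roughly as follows. This is only a sketch based on the man page text quoted above: the function names parse_dic_line and read_dic are made up here, it assumes the morphological field is separated by a tab, and it does not handle any escaping of slashes inside words.

```python
def parse_dic_line(line):
    """Split one .dic line into (word, flags, morphology)."""
    # Assumption: the optional morphological field follows a tab.
    entry, _, morph = line.rstrip("\n").partition("\t")
    # Everything after the first slash is treated as the flag string.
    word, _, flags = entry.partition("/")
    return word, flags, morph


def read_dic(lines):
    """Yield parsed entries, skipping the leading word-count line."""
    it = iter(lines)
    first = next(it, None)
    # Personal dictionaries have no count line, so keep a non-numeric first line.
    if first is not None and not first.strip().isdigit():
        yield parse_dic_line(first)
    for line in it:
        if line.strip():
            yield parse_dic_line(line)


sample = ["3\n", "takeaway/S\n", "take/RSHZGJ\n", "hello\n"]
for word, flags, morph in read_dic(sample):
    print(word, flags)
```

This only separates the word from its flags; it does not interpret the flags, which is what the AFF file is for.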



If you can use perl or sed, s/\/.*// should strip the extraneous stuff.
Perl one-liner: perl -p -e "s/\/.*//" dictionary.dic > dictionary_cleaned.txt


 

Samuel Murray  Identity Verified
Netherlands
Local time: 20:31
Member (2006)
English to Afrikaans
TOPIC STARTER
It's not XML, Farkas Apr 4, 2011

FarkasAndras wrote:
If you can use perl or sed, s/\/.*// should strip the extraneous stuff.
Perl one-liner: perl -p -e "s/\/.*//" dictionary.dic > dictionary_cleaned.txt


It's not about stripping the stuff after the slash, but about combining the stuff after the slash with the rules in the AFF file. For example, the DIC file might contain "tak/aBcFg" (I'm making up this code for illustration), which means "taking, taken, take, takes, taker, takers". In other words, the combination of the DIC and the AFF file is a kind of compression format. Merely stripping the slash part will result in "tak", which is not what you'd want.

If you open an actual DIC and AFF file, you'll see what I mean (try one in your own language).


 

FarkasAndras
Local time: 20:31
English to Hungarian
Doesn't look like that Apr 4, 2011

I opened the Hungarian dictionary, and a brief search turned up no truncated words. I then opened en_US.dic and got this under tak*:

takeaway/S
taken/A
takeoff/SM
takeout/S
takeover/SM
taker/M
take/RSHZGJ
takes/IA
taking/IA


To test my code, I ran
perl -p -e "s/\/.*//" en_US.dic > en_US_cleaned.txt
on it and got what I expected:

takeaway
taken
takeoff
takeout
takeover
taker
take
takes
taking


It looks like stripping everything from the / on (and deleting the first line) would work fine. I may very well be wrong, though. If the way it works follows some reasonably intelligible rule, it should be fairly easy to do this in a perl script.


 

Samuel Murray  Identity Verified
Netherlands
Local time: 20:31
Member (2006)
English to Afrikaans
TOPIC STARTER
Sorry, my mistake, but still Apr 4, 2011

FarkasAndras wrote:
takeaway/S


Ah, I just remembered incorrectly, but the point is still the same.

If you simply remove the slash part then you don't get all the words. The line "takeaway/S" really means "takeaway, takeaways" because the "S" code in the AFF file is for plural-"s".

In the line "take/RSHZGJ", the following AFF rules apply to the "R" code:

SFX R 0 r e
SFX R y ier [^aeiou]y
SFX R 0 er [aeiou]y
SFX R 0 er [^ey]


... of which the first "R" rule means that if the word ends in an "e", then you can expand it by adding "r" (giving "taker"). It's quite complicated and I don't know all the ins and outs of it.

But essentially the slash part expands the word into more words. If you strip the slash parts, then you only get a small part of the dictionary.
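
As a rough illustration of how such an expansion works, here is a minimal Python sketch that applies only the four "R" rules quoted above. It assumes each rule reads as strip / add / condition, with "0" meaning that nothing is stripped and the condition being matched against the end of the word. The names RULES_R and apply_suffix_rules are hypothetical. It ignores prefixes, the other flags and everything else an AFF file can contain, so it is not a full extractor.

```python
import re

# The four rules quoted above, as (strip, add, condition) triples.
# Assumption: "0" in the strip column means "strip nothing".
RULES_R = [
    ("0", "r", "e"),
    ("y", "ier", "[^aeiou]y"),
    ("0", "er", "[aeiou]y"),
    ("0", "er", "[^ey]"),
]


def apply_suffix_rules(word, rules):
    """Return the forms produced by every rule whose condition matches the end of `word`."""
    forms = []
    for strip, add, condition in rules:
        # The condition is treated as a regular expression anchored at the word end.
        if not re.search("(?:" + condition + ")$", word):
            continue
        stem = word
        if strip != "0":
            stem = word[: len(word) - len(strip)]
        forms.append(stem + add)
    return forms


print(apply_suffix_rules("take", RULES_R))   # ['taker']
print(apply_suffix_rules("carry", RULES_R))  # ['carrier']
print(apply_suffix_rules("play", RULES_R))   # ['player']
print(apply_suffix_rules("walk", RULES_R))   # ['walker']
```

A complete word list would need this kind of expansion for every flag on every DIC line, plus the base word itself.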



[Edited at 2011-04-04 22:45 GMT]


 

Michael Beijer  Identity Verified
United Kingdom
Local time: 19:31
Member (2009)
Dutch to English
morphological aliases Apr 4, 2011

Wow, interesting stuff. A little off topic, but I started looking around and found the Hunspell man page quite interesting:

http://www.manpagez.com/man/4/hunspell/

Sure seems to be a lot of trouble to go through, rather than simply writing them all out in full. But I'm sure they must have their reasons (other than saving space).

Michael


 

Michael Beijer  Identity Verified
United Kingdom
Local time: 19:31
Member (2009)
Dutch to English
Hmm Apr 4, 2011

There seems to be a way to convert them to .bdic (Chrome dictionary format).

// This tool converts Hunspell .aff/.dic pairs to a combined binary dictionary
// format (.bdic). This format is more compact, and can be more efficiently
// read by the client application.

See e.g., http://src.chromium.org/svn/trunk/src/chrome/tools/convert_dict/convert_dict.cc

No idea if that would be of any use though.

Michael


 

