Hunspell extractor
Thread poster: Samuel Murray
Samuel Murray
Samuel Murray  Identity Verified
Netherlands
Local time: 14:05
Member (2006)
English to Afrikaans
+ ...
Apr 3, 2011

G'day everyone

Does anyone know of a tool that can convert a Hunspell spelling dictionary to a list of words? In other words, parse the AFF file along with the DIC file to get a list of words that don't have those affix codes attached to them.

Thanks
Samuel



[Edited at 2011-04-03 17:11 GMT]


 
FarkasAndras
FarkasAndras  Identity Verified
Local time: 14:05
English to Hungarian
+ ...
xml? Apr 4, 2011

Did you try opening the files with a text editor? It should be trivial to extract the list from the .dic, it seems.

From http://pwet.fr/man/linux/fichiers_speciaux/hunspell :

Hunspell(1) requires two files to define the language that it is spellchecking. The first file is a dictionary containing words for the language, and the second is an "affix" file that defines the meaning of special flags in the dictionary.

A dictionary file (*.dic) contains a list of words, one per line. The first line of the dictionaries (except personal dictionaries) contains the word count. Each word may optionally be followed by a slash ("/") and one or more flags, which represent affixes or special attributes. Default flag format is a single (usually alphabetic) character. In a Hunspell dictionary file, there is also an optional morphological field separated by tabulator. Morphological descriptions have custom format.



If you can use perl or sed, s/\/.*// should strip the extraneous stuff.
Perl one-liner: perl -p -e "s/\/.*//" dictionary.dic > dictionary_cleaned.txt
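For anyone who prefers Python, the .dic layout quoted above can be read in roughly the following way. This is only a minimal sketch: the function name parse_dic_lines is made up for illustration, and it assumes single-character flags after the slash and words that do not themselves contain a slash. Like the one-liner, it does not look at the AFF file at all.

```python
def parse_dic_lines(lines):
    """Yield (word, flags) pairs from the lines of a .dic file."""
    for number, line in enumerate(lines):
        line = line.rstrip("\n")
        if number == 0 and line.strip().isdigit():
            continue  # the first line holds the word count
        if not line.strip():
            continue  # ignore blank lines
        # Drop the optional morphological field, which follows a tab.
        entry = line.split("\t", 1)[0]
        # Everything after the first slash is the flag string (may be empty).
        word, _, flags = entry.partition("/")
        yield word, flags


# Sample lines in the style of en_US.dic, preceded by a word count.
sample = ["3", "takeaway/S", "take/RSHZGJ", "taker/M"]
for word, flags in parse_dic_lines(sample):
    print(word, flags)
```

Keeping the flags next to each word, rather than discarding them, leaves the door open to looking them up in the AFF file later.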


 
Samuel Murray
Samuel Murray  Identity Verified
Netherlands
Local time: 14:05
Member (2006)
English to Afrikaans
+ ...
TOPIC STARTER
It's not XML, Farkas Apr 4, 2011

FarkasAndras wrote:
If you can use perl or sed, s/\/.*// should strip the extraneous stuff.
Perl one-liner: perl -p -e "s/\/.*//" dictionary.dic > dictionary_cleaned.txt


It's not about stripping the stuff after the slash, but about merging the stuff after the slash with the stuff in the AFF file. For example, the DIC file might contain "tak/aBcFg" (I'm making up this code for illustration), which means "taking, taken, take, takes, taker, takers". The combination of the DIC and the AFF file is a type of compression format, in other words. Merely stripping the slash part will result in "tak", which is not what you'd want.

If you open an actual DIC and AFF file, you'll see what I mean (try one in your own language).


 
FarkasAndras
FarkasAndras  Identity Verified
Local time: 14:05
English to Hungarian
+ ...
Doesn't look like that Apr 4, 2011

I opened the Hungarian dictionary, and a brief search turned up no truncated words. I then opened en_US.dic and got this under tak*:

takeaway/S
taken/A
takeoff/SM
takeout/S
takeover/SM
taker/M
take/RSHZGJ
takes/IA
taking/IA


To test my code, I ran
perl -p -e "s/\/.*//" en_US.dic > en_US_cleaned.txt
on it and got what I expected:

takeaway
taken
takeoff
takeout
takeover
taker
take
takes
taking


It looks like stripping everything from the / on (and deleting the first line) would work fine. I may very well be wrong, though. If the way it works follows some reasonably intelligible rule, it should be fairly easy to do this in a perl script.


 
Samuel Murray
Samuel Murray  Identity Verified
Netherlands
Local time: 14:05
Member (2006)
English to Afrikaans
+ ...
TOPIC STARTER
Sorry, my mistake, but still Apr 4, 2011

FarkasAndras wrote:
takeaway/S


Ah, I just remembered incorrectly, but the point is still the same.

If you simply remove the slash part then you don't get all the words. The line "takeaway/S" really means "takeaway, takeaways" because the "S" code in the AFF file is for plural-"s".

In the line "take/RSHZGJ", the following AFF rule applies

SFX R 0 r e
SFX R y ier [^aeiou]y
SFX R 0 er [aeiou]y
SFX R 0 er [^ey]


... of which the first "R" rule means that if the word ends in "e", you can extend it with "r" (giving "taker"). It's quite complicated and I don't know all the ins and outs of it.

But essentially the slash part expands the word into more words. If you strip the slash parts, you only get a small part of the dictionary.
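To make the expansion concrete, here is a rough Python sketch that applies just the four "R" rules quoted above to a word. It is an illustration of the mechanics, not a full implementation: the names R_RULES and apply_suffix_rules are invented here, the columns are read as strip / add / condition with "0" meaning "strip nothing", and prefixes, cross-products, the other flags and everything else in the AFF file are ignored. The Hunspell source distribution also ships a small unmunch utility meant for this kind of expansion, which may be a better starting point than rolling your own.

```python
import re

# The four "R" suffix rules quoted above, as (strip, add, condition) tuples.
# "0" in the strip column means "remove nothing".
R_RULES = [
    ("0", "r", "e"),
    ("y", "ier", "[^aeiou]y"),
    ("0", "er", "[aeiou]y"),
    ("0", "er", "[^ey]"),
]


def apply_suffix_rules(word, rules):
    """Return the forms produced by every rule whose condition matches the end of word."""
    forms = []
    for strip, add, condition in rules:
        # The condition is a pattern that the end of the word must match.
        if not re.search("(?:" + condition + ")$", word):
            continue
        stem = word
        if strip != "0":
            stem = word[: -len(strip)]  # remove the characters to strip
        forms.append(stem + add)
    return forms


print(apply_suffix_rules("take", R_RULES))   # ['taker']
print(apply_suffix_rules("happy", R_RULES))  # ['happier']
```

A real extractor would have to do this for every flag on every DIC line, which is why the word list ends up so much longer than the DIC file itself.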



[Edited at 2011-04-04 22:45 GMT]


 
Michael Beijer
Michael Beijer  Identity Verified
United Kingdom
Local time: 13:05
Member (2009)
Dutch to English
+ ...
morphological aliases Apr 4, 2011

Wow, interesting stuff. A little off topic, but, I started looking around and found the Hunspell man page quite interesting:

http://www.manpagez.com/man/4/hunspell/

Sure seems to be a lot of trouble to go through, rather than simply write them all out in full. But I'm sure they must have their reasons (other than saving space).

Michael


 
Michael Beijer
Michael Beijer  Identity Verified
United Kingdom
Local time: 13:05
Member (2009)
Dutch to English
+ ...
Hmm Apr 4, 2011

There seems to be a way to convert them to .bdic (Chrome dictionary format).

// This tool converts Hunspell .aff/.dic pairs to a combined binary dictionary
// format (.bdic). This format is more compact, and can be more efficiently
// read by the client application.

See e.g., http://src.chromium.org/svn/trunk/src/chrome/tools/convert_dict/convert_dict.cc

No idea if that would be of any use though.

Michael