Hunspell extractor
Thread poster: Samuel Murray

Samuel Murray  Identity Verified
Netherlands
Local time: 18:53
Member (2006)
English to Afrikaans
+ ...
Apr 3, 2011

G'day everyone

Does anyone know of a tool that can convert a Hunspell spelling dictionary to a list of words? In other words, parse the AFF file along with the DIC file to get a list of words that don't have those affix codes attached to them.

Thanks
Samuel



[Edited at 2011-04-03 17:11 GMT]



FarkasAndras
Local time: 18:53
English to Hungarian
+ ...
xml? Apr 4, 2011

Did you try opening the files with a text editor? It should be trivial to extract the list from the .dic, it seems.

From http://pwet.fr/man/linux/fichiers_speciaux/hunspell :

Hunspell(1) requires two files to define the language that it is spellchecking. The first file is a dictionary containing words for the language, and the second is an "affix" file that defines the meaning of special flags in the dictionary.

A dictionary file (*.dic) contains a list of words, one per line. The first line of the dictionaries (except personal dictionaries) contains the word count. Each word may optionally be followed by a slash ("/") and one or more flags, which represent affixes or special attributes. Default flag format is a single (usually alphabetic) character. In a Hunspell dictionary file, there is also an optional morphological field separated by tabulator. Morphological descriptions have custom format.



If you can use perl or sed, s/\/.*// should strip the extraneous stuff.
Perl one-liner: perl -p -e "s/\/.*//" dictionary.dic > dictionary_cleaned.txt



Samuel Murray  Identity Verified
Netherlands
Local time: 18:53
Member (2006)
English to Afrikaans
+ ...
TOPIC STARTER
It's not XML, Farkas Apr 4, 2011

FarkasAndras wrote:
If you can use perl or sed, s/\/.*// should strip the extraneous stuff.
Perl one-liner: perl -p -e "s/\/.*//" dictionary.dic > dictionary_cleaned.txt


It's not about stripping the stuff after the slash, but about merging the stuff after the slash with the stuff in the AFF file. For example, the DIC file might contain "tak/aBcFg" (I'm making up this code for illustration), which means "taking, taken, take, takes, taker, takers". The combination of the DIC and the AFF file is a type of compression format, in other words. Merely stripping the slash part will result in "tak", which is not what you'd want.

If you open an actual DIC and AFF file, you'll see what I mean (try one in your own language).



FarkasAndras
Local time: 18:53
English to Hungarian
+ ...
Doesn't look like that Apr 4, 2011

I opened the Hungarian dictionary, and a brief search turned up no truncated words. I then opened en_US.dic and got this under tak*:

takeaway/S
taken/A
takeoff/SM
takeout/S
takeover/SM
taker/M
take/RSHZGJ
takes/IA
taking/IA


To test my code, I ran
perl -p -e "s/\/.*//" en_US.dic > en_US_cleaned.txt
on it and got what I expected:

takeaway
taken
takeoff
takeout
takeover
taker
take
takes
taking


It looks like stripping everything from the / on (and deleting the first line) would work fine. I may very well be wrong, though. If the way it works follows some reasonably intelligible rule, it should be fairly easy to do this in a perl script.
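For anyone who prefers Python to Perl, here is a minimal sketch of the same stripping step. The function name strip_flags is made up for illustration, and the sketch assumes the default format described in the man page excerpt above: a word count on the first line, flags after a slash, and an optional tab-separated morphological field. It does not handle escaped slashes inside words, and it does not expand any affixes.

```python
def strip_flags(dic_lines):
    """Return the bare entries of a .dic file, without flags.

    Assumes the first non-empty line is the word count and that
    flags follow the first '/' on each line (hypothetical helper).
    """
    words = []
    for index, line in enumerate(dic_lines):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # ignore blank lines
        if index == 0 and line.strip().isdigit():
            continue  # the first line holds the word count, not a word
        # Drop the optional morphological field, then the flags.
        entry = line.split("\t", 1)[0].split("/", 1)[0]
        words.append(entry)
    return words


if __name__ == "__main__":
    sample = ["3", "takeaway/S", "take/RSHZGJ", "taken/A"]
    for word in strip_flags(sample):
        print(word)
```

This only reproduces what the one-liner does plus the removal of the count line, so it gives the entries as stored in the .dic file and nothing more.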



Samuel Murray  Identity Verified
Netherlands
Local time: 18:53
Member (2006)
English to Afrikaans
+ ...
TOPIC STARTER
Sorry, my mistake, but still Apr 4, 2011

FarkasAndras wrote:
takeaway/S


Ah, I just remembered incorrectly, but the point is still the same.

If you simply remove the slash part then you don't get all the words. The line "takeaway/S" really means "takeaway, takeaways" because the "S" code in the AFF file is for plural-"s".

In the line "take/RSHZGJ", the following AFF rules apply for the "R" flag:

SFX R 0 r e
SFX R y ier [^aeiou]y
SFX R 0 er [aeiou]y
SFX R 0 er [^ey]


... of which the first line of the "R" section means that if the word ends in "e", you can extend it with "r" (giving "taker"). It's quite complicated and I don't know all the ins and outs of it.

But essentially the slash part expands the word into more words. If you strip the slash parts, you get only a small part of the dictionary.
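To make the expansion idea concrete, here is a rough Python sketch that applies suffix rules like the four "R" lines quoted above. It reads each rule as "SFX flag strip add condition", where "0" means nothing is stripped. The names parse_sfx and expand are hypothetical, and the sketch makes several simplifying assumptions: single-character flags, suffix (SFX) rules only, no prefixes, no flags attached to the added suffix, and the condition treated as a regular expression anchored at the end of the word. A real AFF file has more features than this, so treat it as an illustration, not a complete converter.

```python
import re


def parse_sfx(aff_lines):
    """Collect suffix rules as {flag: [(strip, add, condition_regex), ...]}."""
    rules = {}
    for line in aff_lines:
        parts = line.split()
        # Rule lines have at least 5 fields: SFX flag strip add condition.
        # Header lines such as "SFX R Y 4" have only 4 and are skipped.
        if len(parts) < 5 or parts[0] != "SFX":
            continue
        flag, strip, add, cond = parts[1:5]
        strip = "" if strip == "0" else strip
        add = "" if add == "0" else add.split("/", 1)[0]
        rules.setdefault(flag, []).append((strip, add, re.compile(cond + "$")))
    return rules


def expand(entry, rules):
    """Return the stem plus every suffixed form its flags allow."""
    stem, _, flags = entry.partition("/")
    forms = [stem]
    for flag in flags:  # assumes one character per flag
        for strip, add, cond in rules.get(flag, []):
            if cond.search(stem) and stem.endswith(strip):
                forms.append(stem[: len(stem) - len(strip)] + add)
    return forms


if __name__ == "__main__":
    aff = [
        "SFX R 0 r e",
        "SFX R y ier [^aeiou]y",
        "SFX R 0 er [aeiou]y",
        "SFX R 0 er [^ey]",
    ]
    print(expand("take/RSHZGJ", parse_sfx(aff)))
```

With only the "R" rules loaded, the other flags on "take" are simply ignored, so the output contains just "take" and "taker". As far as I know, the Hunspell source distribution also ships a small command-line tool called unmunch that is meant for this kind of expansion, so it may be worth trying that before writing anything by hand.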



[Edited at 2011-04-04 22:45 GMT]



Michael Beijer  Identity Verified
United Kingdom
Local time: 17:53
Member (2009)
Dutch to English
+ ...
morphological aliases Apr 4, 2011

Wow, interesting stuff. A little off topic, but, I started looking around and found the Hunspell man page quite interesting:

http://www.manpagez.com/man/4/hunspell/

Sure seems to be a lot of trouble to go through, rather than simply write them all out in full. But I'm sure they must have their reasons (other than saving space).

Michael



Michael Beijer  Identity Verified
United Kingdom
Local time: 17:53
Member (2009)
Dutch to English
+ ...
Hmm Apr 4, 2011

There seems to be a way to convert them to .bdic (Chrome dictionary format).

// This tool converts Hunspell .aff/.dic pairs to a combined binary dictionary
// format (.bdic). This format is more compact, and can be more efficiently
// read by the client application.

See e.g., http://src.chromium.org/svn/trunk/src/chrome/tools/convert_dict/convert_dict.cc

No idea if that would be of any use though.

Michael



