Hunspell extractor (Software applications)


Thread poster: Samuel Murray

Samuel Murray

Netherlands
Local time: 14:05
Member (2006)
English to Afrikaans
+ ...

Apr 3, 2011

G'day everyone

Does anyone know of a tool that can convert a Hunspell spelling dictionary to a list of words? In other words, parse the AFF file along with the DIC file to get a list of words that don't have those affix codes attached to them.

Thanks
Samuel

[Edited at 2011-04-03 17:11 GMT]

FarkasAndras

Local time: 14:05
English to Hungarian
+ ...

xml?

Apr 4, 2011

Did you try opening the files with a text editor? It should be trivial to extract the list from the .dic, it seems.

From http://pwet.fr/man/linux/fichiers_speciaux/hunspell :

Hunspell(1) requires two files to define the language that it is spellchecking. The first file is a dictionary containing words for the language, and the second is an "affix" file that defines the meaning of special flags in the dictionary.

A dictionary file (*.dic) contains a list of words, one per line. The first line of the dictionaries (except personal dictionaries) contains the word count. Each word may optionally be followed by a slash ("/") and one or more flags, which represent affixes or special attributes. Default flag format is a single (usually alphabetic) character. In a Hunspell dictionary file, there is also an optional morphological field separated by tabulator. Morphological descriptions have custom format.

If you can use perl or sed, s/\/.*// should strip the extraneous stuff.
Perl one-liner: perl -p -e "s/\/.*//" dictionary.dic > dictionary_cleaned.txt

Samuel Murray

Netherlands
Local time: 14:05
Member (2006)
English to Afrikaans
+ ...

TOPIC STARTER

It's not XML, Farkas

Apr 4, 2011

FarkasAndras wrote:
If you can use perl or sed, s/\/.*// should strip the extraneous stuff.
Perl one-liner: perl -p -e "s/\/.*//" dictionary.dic > dictionary_cleaned.txt

It's not about stripping the stuff after the slash, but about merging the stuff after the slash with the stuff in the AFF file. For example, the DIC file might contain "tak/aBcFg" (I'm making up this code for illustration), which means "taking, taken, take, takes, taker, takers". The combination of the DIC and the AFF file is a type of compression format, in other words. Merely stripping the slash part will result in "tak", which is not what you'd want.

If you open an actual DIC and AFF file, you'll see what I mean (try one in your own language).

FarkasAndras

Local time: 14:05
English to Hungarian
+ ...

Doesn't look like that

Apr 4, 2011

I opened the Hungarian dictionary, and a brief search turned up no truncated words. I then opened en_US.dic and got this under tak*:

takeaway/S
taken/A
takeoff/SM
takeout/S
takeover/SM
taker/M
take/RSHZGJ
takes/IA
taking/IA

To test my code, I ran
perl -p -e "s/\/.*//" en_US.dic > en_US_cleaned.txt
on it and got what I expected:

takeaway
taken
takeoff
takeout
takeover
taker
take
takes
taking

It looks like stripping everything from the / on (and deleting the first line) would work fine. I may very well be wrong, though. If the way it works follows some reasonably intelligible rule, it should be fairly easy to do this in a perl script.
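For comparison, here is a minimal Python sketch of the same "strip everything from the slash on and drop the first line" idea. The function name `strip_flags` and the sample lines are made up for illustration, and the sketch assumes the default .dic layout quoted above (count on the first line, flags after "/", morphological field after a tab). It does not handle escaped slashes inside words.

```python
def strip_flags(dic_lines):
    """Return the bare stems from .dic lines, without flags or morphology."""
    lines = list(dic_lines)
    # The first line of a non-personal dictionary is the word count;
    # skip it when it is purely numeric.
    if lines and lines[0].strip().isdigit():
        lines = lines[1:]
    stems = []
    for line in lines:
        # Drop the optional tab-separated morphological field.
        entry = line.rstrip("\n").split("\t", 1)[0]
        # Drop the slash and the flags that follow it.
        stem = entry.split("/", 1)[0].strip()
        if stem:
            stems.append(stem)
    return stems


# Hypothetical sample in the same shape as the en_US.dic lines above.
sample = ["3\n", "takeaway/S\n", "take/RSHZGJ\n", "taking/IA\n"]
print(strip_flags(sample))  # ['takeaway', 'take', 'taking']
```

Note that this only yields the stems stored in the .dic file; it does not generate any of the forms that the flags stand for.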

Samuel Murray

Netherlands
Local time: 14:05
Member (2006)
English to Afrikaans
+ ...

TOPIC STARTER

Sorry, my mistake, but still

Apr 4, 2011

FarkasAndras wrote:
takeaway/S

Ah, I just remembered incorrectly, but the point is still the same.

If you simply remove the slash part then you don't get all the words. The line "takeaway/S" really means "takeaway, takeaways" because the "S" code in the AFF file is for plural-"s".

In the line "take/RSHZGJ", the following AFF rules apply for the "R" flag:

SFX R 0 r e
SFX R y ier [^aeiou]y
SFX R 0 er [aeiou]y
SFX R 0 er [^ey]

... of which the first line means that if the word ends in "e", you can expand it by adding "r" (giving "taker"). It's quite complicated and I don't know all the ins and outs of it.

But essentially the slash part expands the word into more words. If you strip the slash parts, you only get a small part of the dictionary.
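To make the expansion concrete, here is a rough Python sketch that applies suffix rules of the form "SFX flag strip add condition" to a stem. It is only an illustration under several assumptions: flags are single characters (the default format), only SFX rules are handled (no prefixes, no cross products, no continuation flags), and the names `parse_sfx_rules` and `expand` are invented here. The "R" rules are the ones quoted above; the single "S" rule is an assumed plural rule added so the "takeaway/S" case can be shown. As far as I know, the Hunspell source distribution also ships a small `unmunch` utility meant for this kind of expansion, which is probably worth trying before writing anything by hand.

```python
import re


def parse_sfx_rules(aff_lines):
    """Collect suffix rules as {flag: [(strip, add, condition), ...]}."""
    rules = {}
    for line in aff_lines:
        parts = line.split()
        # Rule lines have at least five fields: SFX flag strip add condition.
        # Header lines such as "SFX R Y 4" have only four and are skipped.
        if len(parts) >= 5 and parts[0] == "SFX":
            _, flag, strip, add, cond = parts[:5]
            rules.setdefault(flag, []).append((strip, add, cond))
    return rules


def expand(stem, flags, rules):
    """Return the stem plus every form produced by its suffix flags."""
    forms = [stem]
    for flag in flags:  # assumes one character per flag
        for strip, add, cond in rules.get(flag, []):
            strip = "" if strip == "0" else strip  # "0" means strip nothing
            add = "" if add == "0" else add        # "0" means add nothing
            # The condition is a small pattern that must match the end of the stem.
            if not re.search("(?:" + cond + ")$", stem):
                continue
            if not stem.endswith(strip):
                continue
            base = stem[: len(stem) - len(strip)]
            forms.append(base + add)
    return forms


aff = [
    "SFX R Y 4",
    "SFX R 0 r e",
    "SFX R y ier [^aeiou]y",
    "SFX R 0 er [aeiou]y",
    "SFX R 0 er [^ey]",
    "SFX S 0 s [aeiou]y",  # assumed plural rule, for illustration only
]
rules = parse_sfx_rules(aff)
print(expand("take", "R", rules))      # ['take', 'taker']
print(expand("takeaway", "S", rules))  # ['takeaway', 'takeaways']
```

Running `expand` over every line of the .dic file with the rules from the matching .aff file would give the kind of full word list asked for in the first post, at least for dictionaries that stick to simple suffix rules.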

[Edited at 2011-04-04 22:45 GMT]

Michael Beijer

United Kingdom
Local time: 13:05
Member (2009)
Dutch to English
+ ...

morphological aliases

Apr 4, 2011

Wow, interesting stuff. A little off topic, but I started looking around and found the Hunspell man page quite interesting:

http://www.manpagez.com/man/4/hunspell/

Sure seems to be a lot of trouble to go through, rather than simply write them all out in full. But I'm sure they must have their reasons (other than saving space).

Michael

Michael Beijer

United Kingdom
Local time: 13:05
Member (2009)
Dutch to English
+ ...

Hmm

Apr 4, 2011

There seems to be a way to convert them to .bdic (Chrome dictionary format).

// This tool converts Hunspell .aff/.dic pairs to a combined binary dictionary
// format (.bdic). This format is more compact, and can be more efficiently
// read by the client application.

See e.g., http://src.chromium.org/svn/trunk/src/chrome/tools/convert_dict/convert_dict.cc

No idea if that would be of any use though.

Michael
