Hunspell extractor Thread poster: Samuel Murray
| Samuel Murray Netherlands Local time: 14:05 Member (2006) English to Afrikaans + ...
G'day everyone Does anyone know of a tool that can convert a Hunspell spelling dictionary to a list of words? In other words, parse the AFF file along with the DIC file to get a list of words that don't have those affix codes attached to them. Thanks Samuel
[Edited at 2011-04-03 17:11 GMT]
Did you try opening the files with a text editor? It should be trivial to extract the list from the .dic, it seems.

From http://pwet.fr/man/linux/fichiers_speciaux/hunspell :

Hunspell(1) requires two files to define the language that it is spellchecking. The first file is a dictionary containing words for the language, and the second is an "affix" file that defines the meaning of special flags in the dictionary. A dictionary file (*.dic) contains a list of words, one per line. The first line of the dictionaries (except personal dictionaries) contains the word count. Each word may optionally be followed by a slash ("/") and one or more flags, which represents affixes or special attributes. Default flag format is a single (usually alphabetic) character. In a Hunspell dictionary file, there is also an optional morphological field separated by tabulator. Morphological descriptions have custom format.

If you can use perl or sed, s/\/.*// should strip the extraneous stuff. Perl one-liner:

perl -p -e "s/\/.*//" dictionary.dic > dictionary_cleaned.txt

Samuel Murray Netherlands Local time: 14:05 Member (2006) English to Afrikaans + ... TOPIC STARTER It's not XML, Farkas | Apr 4, 2011 |
FarkasAndras wrote: If you can use perl or sed, s/\/.*// should strip the extraneous stuff. Perl one-liner: perl -p -e "s/\/.*//" dictionary.dic > dictionary_cleaned.txt

It's not about stripping the stuff after the slash, but about merging the stuff after the slash with the stuff in the AFF file. For example, the DIC file might contain "tak/aBcFg" (I'm making up this code for illustration), which means "taking, taken, take, takes, taker, takers". The combination of the DIC and the AFF file is a type of compression format, in other words. Merely stripping the slash part will result in "tak", which is not what you'd want. If you open an actual DIC and AFF file, you'll see what I mean (try one in your own language).

Doesn't look like that | Apr 4, 2011 |
I opened the Hungarian dictionary, and a brief search turned up no truncated words. I then opened en_US.dic and got this under tak*:

takeaway/S
taken/A
takeoff/SM
takeout/S
takeover/SM
taker/M
take/RSHZGJ
takes/IA
taking/IA

To test my code, I ran

perl -p -e "s/\/.*//" en_US.dic > en_US_cleaned.txt

on it and got what I expected:

takeaway
taken
takeoff
takeout
takeover
taker
take
takes
taking

It looks like stripping everything from the / on (and deleting the first line) would work fine. I may very well be wrong, though. If the way it works follows some reasonably intelligible rule, it should be fairly easy to do this in a perl script.
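For readers who would rather not use Perl, the flag-stripping step tested above can be sketched in Python. This is a minimal sketch and not part of the original thread: the function name `strip_flags` is hypothetical, and it assumes the default layout described in the quoted man page (a word count on the first line, flags after a slash, an optional tab-separated morphological field). It only yields the base words as written in the .dic file; it does not expand any affix flags.

```python
def strip_flags(dic_lines):
    """Return the base words from .dic lines, dropping the count line and flags.

    Hypothetical helper for illustration; assumes the default .dic layout.
    """
    words = []
    for i, line in enumerate(dic_lines):
        line = line.strip()
        if not line:
            continue
        # The first line of a .dic file normally holds the word count.
        if i == 0 and line.isdigit():
            continue
        # Drop an optional tab-separated morphological field, then the /flags.
        word = line.split("\t", 1)[0].split("/", 1)[0]
        words.append(word)
    return words


sample = ["4", "takeaway/S", "take/RSHZGJ", "taking/IA", "taken/A"]
print(strip_flags(sample))  # ['takeaway', 'take', 'taking', 'taken']
```

To run it on a real file, something like `strip_flags(open("en_US.dic", encoding="utf-8"))` should work, assuming the dictionary is UTF-8 encoded (the AFF file's SET line states the actual encoding).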
Samuel Murray Netherlands Local time: 14:05 Member (2006) English to Afrikaans + ... TOPIC STARTER Sorry, my mistake, but still | Apr 4, 2011 |
FarkasAndras wrote: takeaway/S

Ah, I just remembered incorrectly, but the point is still the same. If you simply remove the slash part then you don't get all the words. The line "takeaway/S" really means "takeaway, takeaways" because the "S" code in the AFF file is for plural-"s". In the line "take/RSHZGJ", the following AFF rule applies:

SFX R 0 r e
SFX R y ier [^aeiou]y
SFX R 0 er [aeiou]y
SFX R 0 er [^ey]

... of which the "R" section means that if the word ends in an "e", then you can expand it with "r" (to "taker"). It's quite complicated and I don't know all the ins and outs of it. But essentially the slash part expands the word to more words. If you strip the slash parts, then you only get a small part of the dictionary.
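The suffix expansion described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not a full Hunspell implementation: the names `parse_suffix_rules` and `expand_entry` are hypothetical, flags are assumed to be single characters (the default format), and only SFX rule lines of the form `SFX flag strip add condition` are handled. Prefix rules (PFX), prefix/suffix cross products, and flags attached to the affixes themselves are ignored. The `SFX R Y 4` header line in the sample is an assumption; the post shows only the four rule lines.

```python
import re
from collections import defaultdict

# The four R rules quoted in the post; the header line is assumed.
AFF_SAMPLE = """\
SFX R Y 4
SFX R 0 r e
SFX R y ier [^aeiou]y
SFX R 0 er [aeiou]y
SFX R 0 er [^ey]
"""


def parse_suffix_rules(aff_text):
    """Collect SFX rules per flag as (strip, add, condition_regex) tuples."""
    rules = defaultdict(list)
    for line in aff_text.splitlines():
        parts = line.split()
        # Rule lines have at least 5 fields; 4-field header lines are skipped.
        if len(parts) < 5 or parts[0] != "SFX":
            continue
        _, flag, strip, add, cond = parts[:5]
        strip = "" if strip == "0" else strip  # "0" means strip nothing
        add = add.split("/", 1)[0]             # ignore flags on the affix
        add = "" if add == "0" else add
        # The condition describes the end of the word, so anchor it there.
        rules[flag].append((strip, add, re.compile(cond + "$")))
    return rules


def expand_entry(entry, rules):
    """Return the base word plus every suffixed form its flags allow."""
    word, _, flags = entry.partition("/")
    forms = [word]
    for flag in flags:  # assumes one character per flag
        for strip, add, cond in rules.get(flag, []):
            if cond.search(word) and word.endswith(strip):
                stem = word[: len(word) - len(strip)]
                forms.append(stem + add)
    return forms


rules = parse_suffix_rules(AFF_SAMPLE)
print(expand_entry("take/R", rules))   # ['take', 'taker']
print(expand_entry("carry/R", rules))  # ['carry', 'carrier']
```

With only the R rules loaded, `take/RSHZGJ` expands to just `take` and `taker`; each of the other flags needs its own rules from the AFF file. The Hunspell source distribution also ships an `unmunch` utility intended for this kind of expansion, which may be worth trying before writing a parser.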
[Edited at 2011-04-04 22:45 GMT]

Michael Beijer United Kingdom Local time: 13:05 Member (2009) Dutch to English + ... morphological aliases | Apr 4, 2011 |
Wow, interesting stuff. A little off topic, but I started looking around and found the Hunspell man page quite interesting: http://www.manpagez.com/man/4/hunspell/

Sure seems to be a lot of trouble to go through, rather than simply write them all out in full. But I'm sure they must have their reasons (other than saving space).

Michael

Michael Beijer United Kingdom Local time: 13:05 Member (2009) Dutch to English + ...
There seems to be a way to convert them to .bdic (Chrome dictionary format).

// This tool converts Hunspell .aff/.dic pairs to a combined binary dictionary
// format (.bdic). This format is more compact, and can be more efficiently
// read by the client application.

See e.g., http://src.chromium.org/svn/trunk/src/chrome/tools/convert_dict/convert_dict.cc

No idea if that would be of any use though.

Michael