Secondary affix flag for spellchecking

Motivation:

From existing words new words can be built using affixes, for example beauty-beutyness and beauty-beautify. Both beauty and beautyness have plural, and beautify has an 's' affix in third person. In order to keep the word list simple and short, we can set up the affix table to handle the above mentioned word building in a one level depth. So we let the affix table handle the flectioning of the newly built word.

The impact of an additional affix flag can be a dramatical word count reduction for flecting or agglutinating languages, but also a remarkable size reduction for other dictionaries.

Let's show this in an example.


dictionary:
2
beauty
nice

affix file:
1 S SFX Y 4
2 S SFX 0 s [^ys]
3 S SFX y ies y
4 S SFX 0 es ss
5 S SFX 0 ness y S <------ This is the secondary flag
(The line numbers are just for the sake of clarity)

Interesting is the last entry in the SFX S group. It states, that words, that end with an 'y', can get the ending ness (beauty->beautyness). The so created new word can have the affixes of the affix group SFX S, with other words, it must be handled, as if it were entered into the dic file as beautyness/S.

The algorithm to handle the secondary affix flag is as follows:


All words will be first checked, as usual

  • Is the word in the dictionary? - if yes, found

  • Restore the word by:


    • Stripping the affix

    • Adding the stripped affix

    • Checking the conditions - if restored word in dictionary, found


  • Take all affixes, that are in secondary affix column


    • Restore the word with each of the secondary affixes

    • Restore the original word from the restored word above using all primary affixes, that belong to the used secondary affix. If restored original word in dictionary, found


Using the algorithm for the above example: if the word to check is beautynesses, first restore the word beautyness, using the line 4 of S SFX 0. Since beautyness is still not in the dictionary, apply the flag S to beautyness. Line 5 of S SFX then restores the word beauty from beautyness, which can be found in the dictionary.

Remark: The secondary flag algorithm works much faster as follows:

  • When read in the affices, remember all characters, that ever occure in any secondary flag
  • When read in the affices, also remember the start and stop index of each affix group.
  • When read in the affices, read every line containing a secondary affix also in an extra list.


  • Take each character, that appears in any secondary flag


    • Walk thru all lines of the affix group, marked with the current character

    • Try to restore the original word according to the current line in the affix group; if restored:


      • Walk thru all lines containing secondary affix flags, that have the current character in their secondary affix


        • Try to restore the original word from the restored word in the previous restore step

        • If a valid dictionary word could be restored, word found - stop the search


      • Continue for all secondary affix lines


    • Continue for all lines in the affix group


  • Continue for each secondary flag character
  • Let's take a Hungarian example with two searched words.


    dictionary:
    1
    hz

    affix file:
    SFX A Y 3
    1 SFX A 0 ak [a]z
    2 SFX A 0 ok [ou]z
    3 SFX A 0 as [a]z BC <--- secondary flag

    SFX B Y 3
    1 SFX B 0 nak [a]z
    2 SFX B 0 tl [ou]z
    3 SFX B 0 nek [ei]z

    SFX C Y 3
    1 SFX C 0 a [aou]z
    2 SFX C 0 ai [aou]z
    3 SFX C 0 ai [aou]z
    (The line numbers are just for the sake of clarity)

    First example:
    The word to be cheked is hzasnak.
    The basic dictionary and affix search is unsuccessful.
    The secondary search builds first the word hzas using SFX B, line 1. It then searches all lines that have a secondary flag 'B', and SFX A, line 3 recreates the original word, hz, that is in the dictionary.

    Second example:
    The word to be checked is hzasai,
    The basic dictionary and affix search is unsuccessful.
    The secondary search builds first the word hzas using SFX C, line 3. It then searches all lines that have a secondary flag 'C', and SFX A, line 3 recreates the original word, hz, that is in the dictionary.

    SFX A third line states, that all words, that fulfill the conditions listed in that line create new words, that need to be treated, as if they were entered into the dictionary as word/BC, for example hzas/BC, mzas/BC, vzas/BC, etc...

    You can find a reference implementation of a spell checker including secondary flags in perl on ./spell.tar.gz

    The secondary flag method and algorithm was invented and created by Nmeth Lszl, and it is protected by the GNU LGPL. The c++ implementation can be found on http://magyarispell.sf.net, hunspell, version >= 1.0. This description was created by Eleonora and is protected by the GNU Document protection schema.