Daniel Naber's language tool for languages other than English.

(http://www.danielnaber.de)

Contents:
What had to be changed
   Code changes in Tagger.py
   Code changes in Rules.py
   The Wfinder.py module
   The Wfhun.py and Wfdeu.py modules
   Modification in TextChecker.py
Language discrimination
Still to be done
Timing consideration
Adding new languages
How to use

-------------------------------------------------------------

To handle languages other than English (here: German and Hungarian), the following changes were needed:

  • Set up the word types that cover the language's peculiarities. It is important here to look into the target language's typical errors and formal rules. The German and Hungarian word types are described in the file TagInfo.py.
  • In German, for example, each verb must agree with its personal pronoun. For example, ich gehe is OK, ich gehst is not OK, etc.
  • In German the ending of an adjective must also follow the grammatical gender of the noun. For example, das schöne Haus is OK, das schönem Haus is not OK, der gute alter Mann is not OK, etc.
  • Since a word's grammatical gender is so important in German, it must be declared as part of the word type and stored in the dictionary (see next item).
  • Prepare a rule set in the rules path, called yourlanguagegrammar.xml, e.g. degrammar.xml, that handles the typical errors of the target language. This is the soul of the language checking, and the rules must be written and tested very carefully. Be careful if you use non-ASCII characters; Python is unfortunately quite awkward about them.
  • Prepare a dictionary for the given language that contains the following information for each word:

    word/affixes/wordtype1/probability1/wordtype2/probability2/...

    where the affixes are the affixes the myspell dictionary uses.

    For example, for German:
    Abendgymnasium/Sr/NNS/1
    Abendhauch/EPST/NMS/1
    Collecting the dictionary is clearly the hardest part of the job!

  • Luckily only the module Tagger.py (and, to a lesser extent, Rules.py, TextChecker.py and TagInfo.py) had to be modified to make the language tool multilingual, and a word finder and tag preparation module, Wfinder.py, had to be added.
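The dictionary line format described above can be parsed with a few lines of Python. This is only a minimal sketch to illustrate the format; the function name parse_dict_line is invented and does not appear in the actual sources.

```python
def parse_dict_line(line):
    """Split one dictionary line of the form
    word/affixes/type1/prob1/type2/prob2/... into its parts."""
    fields = line.strip().split('/')
    word, affixes = fields[0], fields[1]
    # the remaining fields alternate between a word type and its probability
    types = [(fields[i], float(fields[i + 1]))
             for i in range(2, len(fields) - 1, 2)]
    return word, affixes, types

word, affixes, types = parse_dict_line("Abendgymnasium/Sr/NNS/1")
print(word, affixes, types)   # Abendgymnasium Sr [('NNS', 1.0)]
```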

Code changes in Tagger.py:

Tagger.py must import Wfinder and, at the very beginning, create an object, wfinder, that reads the dictionary file described above and the affix file, which is identical to the affix file used by the myspell program. The files are data/deutsch.aff and data/deutsch.txt.

Tagger.py got some global variables: the language, the aff file name and the dict file name. These variables are only of interest to Tagger.py and Wfinder.py.

The method bindData must use the ReadData method to read in the data. I am still evaluating whether this read-in can be eliminated altogether, since Wfinder.py now handles the dictionaries. Since Tagger.py still contains some logic to handle tag probabilities, ReadData is still needed. BindData also no longer tries to read the additional tag probability files.

DeleteData becomes an empty method, since the program no longer modifies the dictionary.

CommitData no longer pickles the structures to files; it just prints some information.

ReadData is now only an empty routine that sets some structures to empty ones.

guessTags had to be modified: the rules that apply only to English texts must be guarded with if textlanguage == 'en':.

Some English-only logic in the tag function also had to be guarded this way. In the tag function I also had to add several lines:

if len(word) >= 1 and word[-1] == '.':
    word = word[0:-1]
    r = Text.tagWord(self, word, data_table)
    tagged_list.extend(r)
    word = '.'
r = Text.tagWord(self, word, data_table)
tagged_list.extend(r)

These lines cut off the trailing dot. Without them, valid words were not found because of the trailing dot, probably as a side effect of removing several English-only functions.


The Wfinder.py module:

The main modification is in the tagWord method. Rather than using data_table.has_key(word) to find the word, it uses rc = wfinder.test_it(word) for non-English text. Test_it is located in Wfinder.py. It checks the dictionary, using the Dömölki algorithm (also used by myspell), to see whether the word can be found. This is necessary because words in agglutinating languages, which have 1000-2000 variations of each word, cannot be handled by simple hash table searches like the has_key method. Using affixes also reduces the size of an English word collection by a factor of 2, and of a German one by a factor of 6.

Test_it finally calls getTyp, which adjusts the word type according to the word's suffixes or prefixes.


The Wfhun.py and Wfdeu.py modules:

These modules each contain one function, getTyp, that adjusts the word type according to the word's suffixes or prefixes. If a German verb is found, test_it determines, according to the verb's ending, which person the verb belongs to (I, you, he, we, they, etc.) and refines the tag V to V11...V15. It also checks adjective endings and refines the ADJ tag to ADJE...ADJEM. It performs similar functions for all supported languages.
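The kind of refinement getTyp performs can be sketched as a chain of ending checks. The mapping of endings to V11...V15 below is a simplified guess for illustration; the real tables in Wfdeu.py are more elaborate, and the function signature is reduced here to two arguments.

```python
# Simplified sketch of German verb-tag refinement by word ending.
# The ending-to-tag mapping here is illustrative, not the real table.
def getTyp(typ, word):
    """Refine a coarse verb tag V using the word's ending."""
    if typ == 'V':
        if word.endswith('st'):
            return 'V12'      # 2nd person singular: du gehst
        if word.endswith('e'):
            return 'V11'      # 1st person singular: ich gehe
        if word.endswith('en'):
            return 'V14'      # e.g. plural: wir gehen
        if word.endswith('t'):
            return 'V13'      # 3rd person singular: er geht
    return typ                # non-verbs pass through unchanged

print(getTyp('V', 'gehst'))   # V12
print(getTyp('V', 'gehe'))    # V11
```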


Code changes in Rules.py:

The file names and class names under the python_rules path were modified. AvsAnRule was renamed to enAvsAnRule, since it applies only to English texts. allWordRepeat had to be split into enWordRepeat and deWordRepeat, since the word repetition rules are different in German and Hungarian. The allSentenceLength rule is general enough for any language. Rules.py now checks, for each rule, whether it applies to the active language or to all languages; otherwise the rule is not applied to the checked text. For English, the applied rule files and classes must have 'en' or 'all' as their first characters; for German, 'de' or 'all'.
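The prefix convention above amounts to a simple filter over rule names. This is a sketch of the idea, not the actual Rules.py code; the helper name rule_applies is invented.

```python
# A rule is applied if its name starts with the active language code
# or with 'all'; everything else is skipped.
def rule_applies(rule_name, textlanguage):
    return rule_name.startswith(textlanguage) or rule_name.startswith('all')

rules = ['enAvsAnRule', 'enWordRepeat', 'deWordRepeat', 'allSentenceLength']
active = [r for r in rules if rule_applies(r, 'de')]
print(active)   # ['deWordRepeat', 'allSentenceLength']
```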


Modification in TextChecker.py

The language is set in TextChecker using the variable textlanguage, which is a global variable in Tagger. TextChecker reads the TextChecker.ini file, which is in the same directory as TextChecker.py, and sets up the right file names for the aff and dictionary files in the data path and the grammar file in the rules path.

I have also documented the German and Hungarian word types in TagInfo.py.


Language discrimination:

The language is selected with the -l flag followed by the language identifier, which is 'en' for English texts, 'de' for German ones and 'hu' for Hungarian ones. TextChecker uses an initialization file, TextChecker.ini, located in the same directory as TextChecker.py. It contains a section for each supported language and currently looks like this:

[de]
dicFile=deutsch.txt
affFile=deutsch.aff
grammarFile=degrammar.xml
maxSentenceLength=90

[en]
dicFile=words
affFile=english.aff
grammarFile=engrammar.xml
maxSentenceLength=30

[hu]
dicFile=hungarian.txt
affFile=hungarian.aff
grammarFile=hugrammar.xml
maxSentenceLength=80

The file names are used in Tagger.py or Rules.py.
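Reading such an ini file is straightforward with the standard library. The sketch below uses Python 3's configparser for illustration (the project itself may read the file differently) and embeds the [de] section from above as a string so the example is self-contained.

```python
import configparser

# The [de] section of TextChecker.ini, inlined for the example.
INI = """
[de]
dicFile=deutsch.txt
affFile=deutsch.aff
grammarFile=degrammar.xml
maxSentenceLength=90
"""

config = configparser.ConfigParser()
config.read_string(INI)

lang = "de"                                   # value of the -l flag
dic_file = config[lang]["dicFile"]
aff_file = config[lang]["affFile"]
grammar_file = config[lang]["grammarFile"]
max_len = config[lang].getint("maxSentenceLength")
print(dic_file, aff_file, grammar_file, max_len)
# deutsch.txt deutsch.aff degrammar.xml 90
```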


Still to be done:

There are several files in the data directory that are unnecessary for languages other than English: det_an.txt, det_a.txt, c7toc5.txt, abbr.txt, chunks.txt.

Chunk handling is not yet implemented for German or Hungarian; strictly speaking it is not necessary, but it would be a nice feature.

In German the grammatical gender of a compound word is determined by its last component. It would probably make sense to build such a determination into Wfinder.py, so that all German words would be covered by the dictionary. The disadvantages of this approach are, however, that determining the last component cannot be done absolutely error free, and that this check would be quite time consuming.
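The determination discussed above could look roughly like this: try ever shorter suffixes of the compound until a known noun is found, and take its gender. The mini-dictionary and gender tags here are invented for illustration, and as noted, the heuristic is not error free (the longest matching suffix is not always the true last component).

```python
# Guess a compound's grammatical gender from its longest known
# trailing component. Toy data; tags M/F/N are illustrative.
GENDER = {'haus': 'N', 'hauch': 'M', 'schule': 'F'}   # noun -> gender

def compound_gender(word):
    w = word.lower()
    # suffixes shrink as i grows, so the first hit is the longest one
    for i in range(1, len(w)):
        if w[i:] in GENDER:
            return GENDER[w[i:]]
    return GENDER.get(w)          # the word itself may be a known noun

print(compound_gender('Abendhauch'))   # M (last component: Hauch)
```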

The groups adj, int and ind are not very well sorted in German, i.e. these word types are intermixed. Since the present rules don't use them, this is not a problem for now, but the dictionary should be tidied up later.


Timing consideration:

Reading in the dictionary takes about 20-30 seconds. This is due to the large size of the dictionaries and cannot be reduced. The program's timing is otherwise unchanged.


Adding new languages

If you want to add a new language, the following actions are needed:

  • Check and understand the grammar rules of the target language. Check existing texts for typical errors, and try to set up rules for them. Read the original English, German and Hungarian rules and check whether they are applicable to the target language. The English rules in particular contain lots of useful tips.
  • Once the rules are ready, you also have the target language's word types. Now get a word list for myspell and add the word types to it. See the German word list for how this must be done, and the short example above.
  • Set up a small word list that contains all the word types your rules handle. Check the rules using this small word list and modify the word types and/or make refinements if necessary.
  • If the word types need to be identified on the fly (which is very likely), enter your rules into the marked part of the module Wfyourlan.py. Do not forget to import Wfyourlan into Wfinder.py and initialize it with self.wfyourlan = Wfyourlan.Wfyourlan() in the __init__ function.
    # in the module Wfinder.py:
    #
    elif Tagger.textlanguage == 'YourLanguage':
        typ = self.wfyourlan.getTyp(typ, word, oword)

    # in the module Wfyourlan.py
    def getTyp(self, typ, word, oword):
        # Here are the language specific rules of each language
        return typ
    # end of language specific rules for new languages.
    #
  • Test your rules using TextChecker.py, refine and enrich them step by step.
  • Document your word types in the file TagInfo.py.

How to use

Simply unzip the languagetool file somewhere on your system. Open a command window in the top-level directory (where most of the Python files, such as TextChecker.py, are stored), and enter:

python TextChecker.py -l de tests/detest1.txt
or
python TextChecker.py -l hu tests/hutest1.txt
or
python TextChecker.py -l en tests/entest7.txt

This implies that you must have a Python interpreter installed and the python executable on your path. After a short time (0.5-2 minutes) a bunch of XML-coded error messages will appear on your screen.

The output of the sample test files (entest.txt...entest7.txt and detest1.txt, detest2.txt, hutest1.txt, hutest2.txt) is stored in entest.out, detest.out and hutest.out in the test directory. If you make changes, please check that the output of these tests remains the same.

Regards tr, transam45@nospam.gmx.net
Please remove the nospam if you wish to email me.
http://tkltrans.sf.net