Germanic Lexicon Project
Message Board
Author: Keith Briggs
Date: 2004-11-08 05:07:19
Subject: Re: Probabilistic correction
I tried training the system on all corrected (c) and uncorrected (u) BT files.
I'm using dbacl (http://dbacl.sourceforge.net/) as the Bayesian engine.
Then I tried classifying some short lines. Some results look promising, e.g.
c 420.91 u 193.02 tð-hlooen. v. to-hlecan.
c 420.91 u 424.34 tð-hlocen. v. to-hlecan
c 432.46 u 435.39 tó-hlocen. v. to-hlecan
c 454.49 u 556.86 tó-hlocen. v. tó-hlecan
A lower score means a better match. "c 420.91 u 193.02" means roughly:
the probability that the line was generated by the model estimated from the c data is about exp(-421), and the probability that it was generated by the model estimated from the u data is about exp(-193). So as I correct the errors one by one, the c score eventually drops below the u score, and dbacl comes to estimate that the line is more likely correct than uncorrected.
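The comparison can be made mechanical. Here is a minimal sketch (my own illustration, not part of dbacl) that takes scores of the kind shown above, treats them as negative log-probabilities as described, and picks the winning category with a posterior weight under equal priors:

```python
import math

def classify(scores):
    """Given {category: negative log-probability}, return the best
    category and its posterior weight, assuming equal priors."""
    # Smaller score = higher probability, since score ~ -log P(line | model).
    best = min(scores, key=scores.get)
    # Softmax over -score; shift by the minimum to avoid underflow in exp().
    m = min(scores.values())
    weights = {k: math.exp(-(v - m)) for k, v in scores.items()}
    return best, weights[best] / sum(weights.values())

# The first and last example lines from above:
print(classify({"c": 420.91, "u": 193.02}))  # u wins decisively
print(classify({"c": 454.49, "u": 556.86}))  # c wins
```

Note that score differences this large (hundreds of nats) make the posterior essentially 0 or 1, so in practice only the sign of the difference matters.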
But I believe this is actually deceptive. It starts well, recognizing that "oo" is unlikely to be correct. But after that I'm just adding the string "acute", which the model knows to be very frequent, so the score improves. dbacl's default model discards punctuation, and all the letter transitions a->c->u->t->e contribute to a good score. We don't want this: we want to treat á as a single character, and give it the same weight as any other single character. That means first preprocessing the files (not so hard), and then using dbacl's regular expressions to include characters like á in words. I tried this, but the learning phase becomes very slow. And we're still left with the problem that we're mixing several languages in the same model, which makes it hard to fit any one of them well.
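The preprocessing step could look something like the sketch below. The entity table is a hypothetical example: I'm assuming SGML-style encodings like &aacute; here, so the mapping has to be adjusted to whatever scheme the BT files actually use. The point is just that each multi-character encoding collapses to one Unicode character before training, so it no longer contributes a run of frequent letter transitions:

```python
import re

# Assumed entity scheme ('&aacute;' -> 'á' etc.); adjust to the
# encoding actually found in the BT transcription files.
ENTITIES = {
    "&aacute;": "á", "&eacute;": "é", "&iacute;": "í",
    "&oacute;": "ó", "&uacute;": "ú", "&yacute;": "ý",
    "&aelig;": "æ", "&thorn;": "þ", "&eth;": "ð",
}
_pat = re.compile("|".join(re.escape(k) for k in ENTITIES))

def preprocess(line):
    """Replace multi-character entity encodings with single Unicode
    characters, so an accented letter weighs as much as any other
    single character instead of as a frequent substring."""
    return _pat.sub(lambda m: ENTITIES[m.group(0)], line)

print(preprocess("t&oacute;-hlocen. v. t&oacute;-hlecan."))
# → tó-hlocen. v. tó-hlecan.
```

After this pass, dbacl's token regular expression still has to be told that characters like á are word-internal, which is the part that slowed the learning phase down for me.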
How do I proceed?
Keith