Germanic Lexicon Project
Message Board
Author: Keith Briggs
Date: 2004-11-11 06:57:20
Subject: Re: Probabilistic correction
I'm trying automatic correction on Latin. Suppose somehow we've figured out that the group "rn" in "arrnilla" on page 71 of BT is an error (not so hard - "rrn" never occurs in Latin, and also we know that "rn" for "m" is just about the most common OCR error).
According to my own data, tabulated here:
http://research.btexact.com/teralab/documents/english_latin.pdf,
"arm" is by far the most likely triplet starting with "ar", so the triplet table alone guesses the right correction, "armilla".
But using dbacl, trained on the same data set, we get the scores below for the most likely corrections (smaller=better). dbacl's model *should* be much more sophisticated than my crude triplets - it also considers what follows the third letter. But it goes badly wrong, and I know why - it has already seen "argilla" in Caesar's Gallic Wars.
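For comparison, the crude triplet lookup described above can be sketched roughly as follows. This is only an illustration: the word list is a made-up toy sample (not the data in the PDF), and the function names are hypothetical.

```python
from collections import Counter

def triplet_counts(words):
    """Count every three-letter sequence occurring inside the given words."""
    counts = Counter()
    for w in words:
        for i in range(len(w) - 2):
            counts[w[i:i + 3]] += 1
    return counts

def best_third_letter(counts, prefix):
    """Return the letter that most often follows a two-letter prefix, or None."""
    candidates = {t: n for t, n in counts.items() if t.startswith(prefix)}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)[2]

# Toy sample standing in for a real Latin corpus (an assumption for illustration).
sample_words = ["arma", "armis", "armilla", "argilla", "ars", "arbor", "armatus"]

counts = triplet_counts(sample_words)
guess = best_third_letter(counts, "ar")

# Replace the suspect group "rn" in "arrnilla" with the guessed letter.
corrected = "ar" + guess + "illa"
print(guess, corrected)
```

Note that this looks only at the two letters before the gap and ignores everything after it, which is exactly why it is crude, but also why it cannot be swayed by having seen a whole competing word.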
What's the right way to do this?
argilla 17.70
artilla 31.95
aruilla 32.49
ariilla 32.98
arvilla 33.12
arrilla 33.17
areilla 33.48
arbilla 33.66
ardilla 33.84
armilla 34.02
arsilla 34.20
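To make the size of the problem concrete, here is a small sketch that ranks the scores listed above and reports where the expected correction lands. The numbers are copied from the list; the variable names are mine, and nothing here calls dbacl itself.

```python
# Scores as listed above (smaller = better).
scores = {
    "argilla": 17.70,
    "artilla": 31.95,
    "aruilla": 32.49,
    "ariilla": 32.98,
    "arvilla": 33.12,
    "arrilla": 33.17,
    "areilla": 33.48,
    "arbilla": 33.66,
    "ardilla": 33.84,
    "armilla": 34.02,
    "arsilla": 34.20,
}

# Sort candidates from best (lowest score) to worst.
ranked = sorted(scores, key=scores.get)

# 1-based position of the correction we actually want.
rank_of_armilla = ranked.index("armilla") + 1

# How far the wanted correction trails the top-ranked candidate.
gap = scores["armilla"] - scores["argilla"]

print(ranked[0], rank_of_armilla, round(gap, 2))
```

So "armilla" comes tenth out of the eleven candidates shown, while "argilla" is alone at the top and every other candidate sits within a narrow band between about 32 and 34.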