Germanic Lexicon Project
Message Board


Author: Keith Briggs
Date: 2005-01-03 10:46:05
Subject: Re: Probabilistic correction

I'm still trying to make probabilistic correction work in a useful way. I've been concentrating on Latin, as there are fewer character-encoding issues. My working principles are these:

1. The system should be purely probabilistic, using no dictionaries or conventional grammar rules. It learns everything from a training corpus of texts known to be correct.

2. The idea is to have a script which reads uncorrected B-T files, identifies words containing likely OCR errors, and adds a list of possible corrections. These would still need human checking, but it's quicker to delete wrong corrections than to think up the right one and add it.
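The two principles above can be illustrated with a minimal sketch. It is not the system described in this post, whose internals are not given here; it assumes a character trigram model with add-one smoothing, a hand-written table of OCR confusions (`CONFUSIONS`, hypothetical), and a toy training list. Candidates are ranked by negative log-probability, so a lower cost means a more plausible word.

```python
import math
from collections import Counter

# Hypothetical OCR confusion pairs (misread -> possible intended characters).
# These are illustrative guesses, not taken from the post.
CONFUSIONS = {"n": ["u", "m"], "l": ["t", "i"], "i": ["l"], "a": ["u"], "u": ["n"]}

def train(corpus_words, n=3):
    """Count character n-grams and their (n-1)-character contexts."""
    ngrams, contexts, alphabet = Counter(), Counter(), set()
    for word in corpus_words:
        padded = "^" * (n - 1) + word + "$"  # mark word start and end
        alphabet.update(padded)
        for i in range(len(padded) - n + 1):
            gram = padded[i:i + n]
            ngrams[gram] += 1
            contexts[gram[:-1]] += 1
    return ngrams, contexts, len(alphabet), n

def cost(word, model):
    """Negative log-probability of the word under the model (lower is better)."""
    ngrams, contexts, vocab, n = model
    padded = "^" * (n - 1) + word + "$"
    total = 0.0
    for i in range(len(padded) - n + 1):
        gram = padded[i:i + n]
        # Add-one smoothing gives unseen n-grams a small nonzero probability.
        p = (ngrams[gram] + 1) / (contexts[gram[:-1]] + vocab)
        total -= math.log(p)
    return total

def candidates(word):
    """The word itself plus every single-character confusion substitution."""
    out = {word}
    for i, ch in enumerate(word):
        for sub in CONFUSIONS.get(ch, []):
            out.add(word[:i] + sub + word[i + 1:])
    return out

def suggest(word, model, k=5):
    """Return the k lowest-cost candidates for a possibly misread word."""
    return sorted(candidates(word), key=lambda w: cost(w, model))[:k]

# Toy training list standing in for a corpus of texts known to be correct.
model = train(["terrae", "terra", "terram", "tantum", "dominus", "montium"])
print(suggest("lerrae", model))
print(suggest("dominns", model))
```

A real corpus would be far larger, and the confusion table could itself be estimated from pairs of uncorrected and corrected text rather than written by hand. The sketch also tries only one substitution at a time, which would miss words containing two or more misread characters.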

Perhaps B-T will be finished before I get all this working, but what I learn about these methods might still be useful in future projects. Here is the output of the current system on some real words from B-T. Each uncorrected word is followed by up to five candidate corrections, listed in ascending order of score. Ideally, the right correction will be top of the list, but I'm happy if it comes out in the top 5 which I show. The right answer is starred.

patibuium
50.34 patibulum *
50.88 patibutum
50.97 psubulum
51.51 psubutum
51.97 putibulum
arnilla
43.44 armilla *
43.98 aruilla
44.12 armillu
44.21 armitta
44.57 arnilla
lerrae
38.48 terrae *
38.52 terrat
39.42 terrac
39.83 lerrae
39.87 lerrat
montinm
42.35 monumm
42.85 montium *
44.07 monuum
44.39 montimm
44.39 montinm
tt
17.59 u
26.38 et *
27.87 tt
28.64 ef
30.75 ff
dominns
44.34 dominus *
45.06 dominua
45.38 dominna
45.96 domimus
45.96 domiuma
disperiient
54.23 disperitent
55.58 disperilent
55.76 disperticut
55.76 dispertient *
56.03 dispertitut
qaod
36.67 quod *
54.84 qsod
57.86 qaod
sa/tus
40.05 sultus
40.87 saltus *
41.00 aultus
41.32 sultua
42.13 saltua
inlerposili
54.77 interpositi *
55.49 interposili
56.08 interpositt
56.93 interposill
57.02 tulerpositi
man's
35.41 maria
35.50 maris *
36.67 muria
36.76 muris
50.92 msria
bnllasque
50.07 bultusque
50.70 bullusque *
51.25 bultasque
51.56 bullasque
52.37 buttusque
lantum
38.03 tantum *
38.61 lantum
39.11 tuntum
40.55 luntum
41.00 lautum
inter
34.01 inter
36.22 infer *
36.31 tuter
37.08 luter
37.75 iuter
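The goal stated above (right answer at the top, or at least in the top 5) can be checked with a short tally. The ranks below are read off the lists as printed, counting tied scores in the order shown; the names `STARRED_RANK` and `hit_rate` are made up for this sketch.

```python
# Position of the starred correction in each list above (1 = top of the list).
STARRED_RANK = {
    "patibuium": 1, "arnilla": 1, "lerrae": 1, "montinm": 2, "tt": 2,
    "dominns": 1, "disperiient": 4, "qaod": 1, "sa/tus": 2,
    "inlerposili": 1, "man's": 2, "bnllasque": 2, "lantum": 1, "inter": 2,
}

def hit_rate(ranks, k):
    """Fraction of words whose starred correction appears in the top k."""
    return sum(1 for r in ranks.values() if r <= k) / len(ranks)

print(hit_rate(STARRED_RANK, 1))  # 0.5
print(hit_rate(STARRED_RANK, 5))  # 1.0
```

On these 14 words the starred answer is first in half the cases and always within the five shown. The sample is small, so it says little about performance on B-T as a whole.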
