Germanic Lexicon Project
Message Board


Author: Sean Crist (Swarthmore College)
Email: kurisuto at unagi dot cis dot upenn dot edu
Date: 2004-11-04 22:42:53
Subject: Re: Probabilistic correction

> > I already tested the idea using about 20 c and u files. It works some of the time, which tells me I need a bigger training set.
>
> I just did some more tests, still only with 20 c and u files. I can now
> classify an unknown file with 100% accuracy. So it looks like this idea
> could at least be useful to check a submitted corrected file for any missed errors.

Keith,

I am really glad you're thinking about how we could do the global corrections better. I'm very open to ideas.

It sounds like what you have so far is a way to distinguish a corrected file from an uncorrected one. That alone could definitely be useful. There's the checking system I wrote, but there are some common kinds of error which it can't catch. It could happen, for example, that somebody misunderstands what they're supposed to do and corrects only the errors which the checking system tells them to correct. Your program might be able to catch a case like that.
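[For readers following the thread: Keith's actual method isn't described here, but one hypothetical way such a corrected-vs-uncorrected classifier could work is to train character-bigram models on known corrected and uncorrected files, then score a new file under each model. The training strings below are made up for illustration; real training data would be the c and u files Keith mentions.]

```python
import math
from collections import Counter

def bigram_model(texts):
    """Build character-bigram counts from a list of training texts."""
    counts = Counter()
    for t in texts:
        counts.update(zip(t, t[1:]))
    total = sum(counts.values())
    vocab = len(counts) + 1  # for add-one smoothing
    return counts, total, vocab

def score(text, model):
    """Log-likelihood of text under a bigram model (add-one smoothing)."""
    counts, total, vocab = model
    return sum(math.log((counts[bg] + 1) / (total + vocab))
               for bg in zip(text, text[1:]))

# Tiny made-up samples: OCR errors like '5' for 'e-macron' are invented
# stand-ins for the kinds of scanning errors a real u file would contain.
corrected   = ["sēcan to seek; pret. sōhte", "bōc book, beech"]
uncorrected = ["s5can to seek; pret. s6hte", "b6c bool<, beech"]

model_c = bigram_model(corrected)
model_u = bigram_model(uncorrected)

def classify(text):
    """Label text by whichever training set makes it more probable."""
    return "corrected" if score(text, model_c) > score(text, model_u) \
           else "uncorrected"
```

With enough training files, a file that scores much closer to the uncorrected model than expected would flag a submission with missed errors.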

The thing that would really save us the most work is something that actually makes corrections in individual words automatically. Even if it corrected just, say, 10% of the errors, that would translate into a big savings of effort with the hand-corrections.

I've written probabilistic spell checkers before, but I haven't figured out a way to use one in this context. The kind I've written is a simple, widely used type which just multiplies two probabilities: the probability of a word (as estimated by counting frequencies in a big corpus) times the probability that the observed unknown word could be a misspelling of a known word (for which you need to make up some model; but I've gotten very good results even with a simple, fairly stupid model).

I haven't figured out a way to use that approach on these dictionaries. For one thing, I don't know what we'd use as the corpus for the frequency counts. We could use the hand-corrected part of the dictionary as the training corpus; but a dictionary is a very funny kind of text, where the words are not distributed at all as they are in a normal text. There might well be some words which occur only within the entry where they are defined.

That's not the only probabilistic approach that one could imagine, however. I've been scratching my head over this for a while, so if we bounce ideas off each other, maybe we can figure out something clever to do.

--Sean

Messages in this thread (Name, College/University, Date):
Probabilistic correction Keith Briggs 2004-11-04 05:41:11
Re: Probabilistic correction Keith Briggs 2004-11-04 07:49:10
Re: Probabilistic correction Sean Crist Swarthmore College 2004-11-04 22:42:53
Re: Probabilistic correction Keith Briggs 2004-11-05 05:31:16
Re: Probabilistic correction Keith Briggs 2004-11-05 06:59:54
Re: Probabilistic correction Keith Briggs 2004-11-05 07:29:53
Re: Probabilistic correction Sean Crist Swarthmore College 2004-11-05 09:32:30
Re: Probabilistic correction Sean Crist Swarthmore College 2004-11-05 09:48:16
Re: Probabilistic correction Keith Briggs 2004-11-08 05:07:19
Re: Probabilistic correction Sean Crist Swarthmore College 2004-11-08 09:12:45
Re: Probabilistic correction Keith Briggs 2004-11-08 09:46:59
Re: Probabilistic correction Keith Briggs 2004-11-08 10:02:13
Re: Probabilistic correction Keith Briggs 2004-11-08 12:10:56
Re: Probabilistic correction Sean Crist Swarthmore College 2004-11-08 15:26:04
Re: Probabilistic correction Keith Briggs 2004-11-09 06:47:45
Re: Probabilistic correction Keith Briggs 2004-11-09 08:50:46
Re: Probabilistic correction Keith Briggs 2004-11-09 09:43:19
Re: Probabilistic correction Keith Briggs 2004-11-09 10:59:49
Italics (was: Probabilistic correction) Sean Crist Swarthmore College 2004-11-09 13:39:13
Re: Probabilistic correction Keith Briggs 2004-11-11 06:57:20