Germanic Lexicon Project
Message Board


Author: Sean Crist (Swarthmore College)
Email: kurisuto at unagi dot cis dot upenn dot edu
Date: 2004-11-08 09:12:45
Subject: Re: Probabilistic correction


> But after that, I'm just adding the string "acute",
> which is known by the model to be very frequently occurring, so the score gets better.
> dbacl's default model rejects punctuation, and all the letter
> transitions a->c->u->t->e contribute to a good score.

I wonder if that is one reason why you were able to get such a high level of accuracy in distinguishing corrected from uncorrected pages. I wonder what we'd find if we simply counted the number of times that a-c-u-t-e occurs per page, comparing corrected with uncorrected pages; would that count alone have the same power to predict which pages are corrected? Not that this makes your technique any less potentially useful, of course.
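
A rough sketch of that count, in Python. The toy pages and the function names are invented for illustration, and it assumes corrected pages spell accented letters as entities like &aacute; (which is how I read the remark about the string "acute"); real pages would be read from the files.

```python
import re

def acute_count(page_text: str) -> int:
    """Count non-overlapping occurrences of the string 'acute' in one page."""
    return len(re.findall("acute", page_text))

def mean_count(pages: list[str]) -> float:
    """Average number of 'acute' occurrences per page (0.0 for no pages)."""
    return sum(acute_count(p) for p in pages) / len(pages) if pages else 0.0

# Hypothetical toy data, not taken from the project's files.
corrected = ["b&aacute;l, n. fire; &aacute;r, n. year", "f&eacute;, n. cattle"]
uncorrected = ["bal, n. fire; ar, n. year", "f&eacute;, n. cattle"]

print("corrected:  ", mean_count(corrected))
print("uncorrected:", mean_count(uncorrected))
```

If the two averages are far apart on the real pages, then a plain threshold on this count might separate the two sets about as well as the classifier does.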


> How do I proceed?

I guess it's some reassurance to me that I'm not the only one scratching my head over this. It seems like there has to be some way to do some sort of useful correction using a probabilistic model.

If I'm understanding correctly, the way you did this looks only at the probability of bigrams (where the atoms are characters, not words), right? I can imagine that if XAZ is almost always an error for XYZ, and XAZ itself is almost never correct, then a bigram model could correct that case.
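
To make the XAZ/XYZ case concrete, here is a minimal character-bigram scorer in Python. This is only a sketch with add-one smoothing, not a description of what dbacl does internally; the training string and the names train_bigrams and log_score are made up for the example.

```python
import math
from collections import Counter

def train_bigrams(text: str):
    """Collect character-bigram counts from known-good text."""
    pairs = Counter(zip(text, text[1:]))   # (a, b) -> count of a followed by b
    firsts = Counter(text[:-1])            # a -> count of a as a first element
    vocab = len(set(text))                 # number of distinct characters seen
    return pairs, firsts, vocab

def log_score(s: str, model) -> float:
    """Sum of log P(b | a) over the bigrams of s, with add-one smoothing."""
    pairs, firsts, vocab = model
    total = 0.0
    for a, b in zip(s, s[1:]):
        total += math.log((pairs[(a, b)] + 1) / (firsts[a] + vocab))
    return total

# Hypothetical training text standing in for the corrected pages.
model = train_bigrams("the then them there " * 5)

for candidate in ("the", "tbe"):
    print(candidate, log_score(candidate, model))
```

A candidate correction would be accepted only if it scores better than the original string, which is exactly the situation where "tbe" is almost never right and "the" is common.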

On the other hand, you might not even need a probabilistic model for that. Since we're starting to have a substantial corpus of pairs of corrected/uncorrected pages, one of us could probably write a program to automatically count how frequent the substitution X -> Y is, and also whether there are lots of tokens of X which are correct as X and shouldn't become Y. If there aren't, then X -> Y would be a good candidate for a global substitution. This would be a really useful tool to tell me where I should concentrate my energies with the global corrections. The only part of this I haven't entirely figured out is how to align the words in the corrected and uncorrected page so that you know what corresponds to what. Maybe I could make a hash and look for words which appear only once in each version, use those as anchor points, and then somehow expand the associations from there.
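
Here is one way the anchor idea might look in Python. It is a sketch under simplifying assumptions: tokens are whitespace-separated words, anchors are kept only if they preserve word order, and a gap between two anchors is paired up only when it has the same number of words on both sides. The function names and the sample sentence are invented.

```python
from collections import Counter

def align_words(raw, fixed):
    """Align two word lists using words that occur exactly once in each."""
    raw_n, fixed_n = Counter(raw), Counter(fixed)
    fixed_at = {w: j for j, w in enumerate(fixed) if fixed_n[w] == 1}
    anchors = [(-1, -1)]                      # sentinel before the first word
    for i, w in enumerate(raw):
        j = fixed_at.get(w)
        # Keep an anchor only if it preserves word order in both versions.
        if raw_n[w] == 1 and j is not None and j > anchors[-1][1]:
            anchors.append((i, j))
    anchors.append((len(raw), len(fixed)))    # sentinel after the last word

    pairs = []
    for (i0, j0), (i1, j1) in zip(anchors, anchors[1:]):
        # Expand from the anchors: if the gap holds the same number of words
        # on both sides, assume they correspond one-to-one.
        if i1 - i0 == j1 - j0:
            pairs.extend(zip(raw[i0 + 1:i1], fixed[j0 + 1:j1]))
        if i1 < len(raw):                     # the anchor word itself
            pairs.append((raw[i1], fixed[j1]))
    return pairs

def substitution_stats(page_pairs):
    """Tally X -> Y corrections and how often X was left alone."""
    changed = Counter()   # (X, Y) -> times X was corrected to Y
    kept = Counter()      # X -> times X stayed X
    for raw_text, fixed_text in page_pairs:
        for x, y in align_words(raw_text.split(), fixed_text.split()):
            if x == y:
                kept[x] += 1
            else:
                changed[(x, y)] += 1
    return changed, kept

# Hypothetical page pair: (uncorrected, corrected).
pages = [("tbe king and tbe queen saw tbe ship",
          "the king and the queen saw the ship")]
changed, kept = substitution_stats(pages)
for (x, y), n in changed.most_common():
    print(f"{x} -> {y}: changed {n} times, kept as-is {kept[x]} times")
```

A substitution with a high "changed" count and a "kept" count near zero would be a candidate for a global replacement. Gaps where words were split or merged are simply skipped here, so the counts are conservative.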

I guess one of the reasons why this whole thing is so hard is that with probabilistic models, you've usually got something much bigger (e.g. ten years of Wall Street Journal text) which you're comparing something against. Here, that other something is much more nebulous; as you mention, the text contains words from multiple languages. There's no other text like this one.

--Sean
