Germanic Lexicon Project
Message Board


Author: Keith Briggs
Date: 2004-11-04 05:41:11
Subject: Probabilistic correction

Sean is doing a great job with some clever automatic corrections. But I'm wondering if there is a role for probabilistic ideas in this project. Until a year ago, I would not have thought that this would work, but having witnessed the spectacular success of Bayesian spam filters (typically 99.5% correct on very short texts), I've changed my mind.

What I'm thinking is: we train a Bayesian text classifier on the corrected (c) and uncorrected (u) Bosworth-Toller .txt files. We can then classify an unseen .txt file as c or u with high probability. With a bit more work, we can identify features of a u file causing that classification. Those features could be automatically marked as needing human intervention. As the set of c files gets bigger, we can retrain the system and thus classify new files with higher confidence.
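The idea above could be sketched roughly as follows. This is only an illustration, not the code used for the test described in this post: the feature choice (character trigrams), the add-one smoothing, the class name CorrectionClassifier, and the toy training strings are all assumptions made for the example. It trains a naive Bayes classifier on c and u text, classifies an unseen text, and ranks the features that push a text towards u so they can be flagged for a human.

```python
import math
from collections import Counter


def ngrams(text, n=3):
    """Split text into overlapping character n-grams (an assumed feature choice)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class CorrectionClassifier:
    """Hypothetical naive Bayes classifier: 'c' = corrected, 'u' = uncorrected."""

    LABELS = ("c", "u")

    def __init__(self):
        self.counts = {label: Counter() for label in self.LABELS}
        self.docs = {label: 0 for label in self.LABELS}

    def train(self, text, label):
        """Add one training text with a known label."""
        self.counts[label].update(ngrams(text))
        self.docs[label] += 1

    def _log_prob(self, feat, label):
        """log P(feature | label) with add-one (Laplace) smoothing."""
        vocab = len(self.counts["c"].keys() | self.counts["u"].keys())
        total = sum(self.counts[label].values())
        return math.log((self.counts[label][feat] + 1) / (total + vocab))

    def log_odds_u(self, feat):
        """Positive when the feature is more typical of uncorrected text."""
        return self._log_prob(feat, "u") - self._log_prob(feat, "c")

    def classify(self, text):
        """Return 'u' or 'c'. Needs at least one training text per label."""
        n_docs = sum(self.docs.values())
        score = math.log(self.docs["u"] / n_docs) - math.log(self.docs["c"] / n_docs)
        score += sum(self.log_odds_u(f) for f in ngrams(text))
        return "u" if score > 0 else "c"

    def suspicious_features(self, text, top=5):
        """Features of the text that most strongly suggest 'u', strongest first."""
        ranked = sorted(set(ngrams(text)), key=self.log_odds_u, reverse=True)
        return [f for f in ranked[:top] if self.log_odds_u(f) > 0]


# Toy data (invented for illustration): the 'u' lines lack accents and special letters.
clf = CorrectionClassifier()
for line in ["þá cwæð hé", "hé wæs gód"]:
    clf.train(line, "c")
for line in ["pa cwaed he", "he waes god"]:
    clf.train(line, "u")

print(clf.classify("he cwaed"))
print(clf.suspicious_features("he cwaed"))
```

Retraining as the set of c files grows just means calling train on the new files. With real Bosworth-Toller files a different feature set (whole words, or trigrams around accented vowels) may well work better than the plain trigrams assumed here.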

I already tested the idea using about 20 c and u files. It works some of the time, which tells me I need a bigger training set (I'm not seeing enough instances of typical errors, like missing accents), and maybe a smarter choice of features to classify against. Sean - if you like the idea, could you arrange for me to get all the B-T files, and I will do some more tests.

Keith

Messages in this thread (Name, College/University, Date):
Probabilistic correction Keith Briggs 2004-11-04 05:41:11
Re: Probabilistic correction Keith Briggs 2004-11-04 07:49:10
Re: Probabilistic correction Sean Crist Swarthmore College 2004-11-04 22:42:53
Re: Probabilistic correction Keith Briggs 2004-11-05 05:31:16
Re: Probabilistic correction Keith Briggs 2004-11-05 06:59:54
Re: Probabilistic correction Keith Briggs 2004-11-05 07:29:53
Re: Probabilistic correction Sean Crist Swarthmore College 2004-11-05 09:32:30
Re: Probabilistic correction Sean Crist Swarthmore College 2004-11-05 09:48:16
Re: Probabilistic correction Keith Briggs 2004-11-08 05:07:19
Re: Probabilistic correction Sean Crist Swarthmore College 2004-11-08 09:12:45
Re: Probabilistic correction Keith Briggs 2004-11-08 09:46:59
Re: Probabilistic correction Keith Briggs 2004-11-08 10:02:13
Re: Probabilistic correction Keith Briggs 2004-11-08 12:10:56
Re: Probabilistic correction Sean Crist Swarthmore College 2004-11-08 15:26:04
Re: Probabilistic correction Keith Briggs 2004-11-09 06:47:45
Re: Probabilistic correction Keith Briggs 2004-11-09 08:50:46
Re: Probabilistic correction Keith Briggs 2004-11-09 09:43:19
Re: Probabilistic correction Keith Briggs 2004-11-09 10:59:49
Italics (was: Probabilistic correction) Sean Crist Swarthmore College 2004-11-09 13:39:13
Re: Probabilistic correction Keith Briggs 2004-11-11 06:57:20