Germanic Lexicon Project
Message Board
Author: Keith Briggs
Date: 2004-11-04 05:41:11
Subject: Probabilistic correction
Sean is doing a great job with some clever automatic corrections. But I'm wondering if there is a role for probabilistic ideas in this project. Until a year ago, I would not have thought that this would work, but having witnessed the spectacular success of Bayesian spam filters (typically 99.5% correct on very short texts), I've changed my mind.
What I'm thinking is: we train a Bayesian text classifier on the corrected (c) and uncorrected (u) Bosworth-Toller .txt files. We can then classify an unseen .txt file as c or u with high probability. With a bit more work, we can identify features of a u file causing that classification. Those features could be automatically marked as needing human intervention. As the set of c files gets bigger, we can retrain the system and thus classify new files with higher confidence.
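To make the proposal concrete, here is a minimal sketch of such a classifier in Python. It is not the code used in the tests mentioned in this post. It assumes character trigrams as features and a multinomial naive Bayes model with add-one smoothing, and the training strings are invented stand-ins for real Bosworth-Toller lines. The names `char_ngrams`, `NaiveBayes` and `suspicious_features` are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of text (assumed feature choice)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NaiveBayes:
    """Multinomial naive Bayes over character n-grams, labels 'c' and 'u'."""

    def __init__(self, labels=("c", "u"), alpha=1.0):
        self.labels = labels
        self.alpha = alpha  # additive (Laplace) smoothing constant
        self.counts = {lab: Counter() for lab in labels}
        self.docs = Counter()

    def train(self, text, label):
        self.counts[label].update(char_ngrams(text))
        self.docs[label] += 1

    def _vocab_size(self):
        return len(set().union(*(self.counts[lab] for lab in self.labels)))

    def _log_prob(self, feature, label, vocab_size):
        total = sum(self.counts[label].values())
        count = self.counts[label][feature]
        return math.log((count + self.alpha) / (total + self.alpha * vocab_size))

    def scores(self, text):
        """Log prior plus summed log likelihoods for each label."""
        v = self._vocab_size()
        n_docs = sum(self.docs.values())
        result = {}
        for lab in self.labels:
            score = math.log(self.docs[lab] / n_docs)
            for feat in char_ngrams(text):
                score += self._log_prob(feat, lab, v)
            result[lab] = score
        return result

    def classify(self, text):
        s = self.scores(text)
        return max(s, key=s.get)

    def suspicious_features(self, text, top=5):
        """Features of text that most favour 'u' over 'c' (log likelihood ratio)."""
        v = self._vocab_size()
        ratio = {
            f: self._log_prob(f, "u", v) - self._log_prob(f, "c", v)
            for f in set(char_ngrams(text))
        }
        return sorted(ratio, key=ratio.get, reverse=True)[:top]


# Invented toy data: 'c' lines keep accents and special letters, 'u' lines lose them.
nb = NaiveBayes()
for line in ["þæt wæs gód cyning", "se cyning wæs gód"]:
    nb.train(line, "c")
for line in ["paet waes god cyning", "se cyning waes god"]:
    nb.train(line, "u")

print(nb.classify("wæs gód"))
print(nb.classify("waes god"))
print(nb.suspicious_features("waes god"))
```

The trigrams returned by `suspicious_features` are the ones that could be marked for human attention. Retraining is just a matter of calling `train` again as more c files become available.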
I have already tested the idea using about 20 c and u files. It works only some of the time, which tells me I need a bigger training set (I'm not seeing enough instances of typical errors, such as missing accents) and perhaps a smarter choice of features to classify on. Sean, if you like the idea, could you arrange for me to get all the B-T files? I will then do some more tests.
Keith
Messages in this thread (subject, name and college/university, date):
Probabilistic correction, Keith Briggs, 2004-11-04 05:41:11
Re: Probabilistic correction, Keith Briggs, 2004-11-04 07:49:10
Re: Probabilistic correction, Sean Crist (Swarthmore College), 2004-11-04 22:42:53
Re: Probabilistic correction, Keith Briggs, 2004-11-05 05:31:16
Re: Probabilistic correction, Keith Briggs, 2004-11-05 06:59:54
Re: Probabilistic correction, Keith Briggs, 2004-11-05 07:29:53
Re: Probabilistic correction, Sean Crist (Swarthmore College), 2004-11-05 09:32:30
Re: Probabilistic correction, Sean Crist (Swarthmore College), 2004-11-05 09:48:16
Re: Probabilistic correction, Keith Briggs, 2004-11-08 05:07:19
Re: Probabilistic correction, Sean Crist (Swarthmore College), 2004-11-08 09:12:45
Re: Probabilistic correction, Keith Briggs, 2004-11-08 09:46:59
Re: Probabilistic correction, Keith Briggs, 2004-11-08 10:02:13
Re: Probabilistic correction, Keith Briggs, 2004-11-08 12:10:56
Re: Probabilistic correction, Sean Crist (Swarthmore College), 2004-11-08 15:26:04
Re: Probabilistic correction, Keith Briggs, 2004-11-09 06:47:45
Re: Probabilistic correction, Keith Briggs, 2004-11-09 08:50:46
Re: Probabilistic correction, Keith Briggs, 2004-11-09 09:43:19
Re: Probabilistic correction, Keith Briggs, 2004-11-09 10:59:49
Italics (was: Probabilistic correction), Sean Crist (Swarthmore College), 2004-11-09 13:39:13
Re: Probabilistic correction, Keith Briggs, 2004-11-11 06:57:20