Germanic Lexicon Project
Message Board
Author: Sean Crist (Swarthmore College)
Email: kurisuto at unagi dot cis dot upenn dot edu
Date: 2004-11-08 15:26:04
Subject: Re: Probabilistic correction
> What about something like this:
>
> 0. We build dbacl models of English, OE (correctly accented - are such available?), and Latin (plenty available e.g. http://penelope.uchicago.edu/Thayer/E/Roman/home.html) from corpora of those languages, *not* from the corrected BT files.
The Toronto corpus of Old English, which includes essentially the entire body of extant Old English text, regrettably does not mark the distinction between short and long vowels. This doesn't make it useless, just less than ideal. In more than one of the earlier rounds of automated correction, I had to write scripts in a careful way that made the best use of the information that's there.
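For example, one careful-script trick is to strip the length marks from the B/T side before matching against the unaccented Toronto corpus: the letters still get verified even if vowel length doesn't. A rough sketch, assuming the length marks are precomposed or combining acute accents:

```python
import unicodedata

def strip_length_marks(word):
    """Drop combining acute accents (the long-vowel marks) so a word
    can be matched against an unaccented corpus; the letters still
    verify, the vowel length does not."""
    decomposed = unicodedata.normalize("NFD", word)
    return unicodedata.normalize(
        "NFC", "".join(c for c in decomposed if c != "\u0301"))
```
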
I sure wish I knew where to find a big corpus of Latin text which marks the short and long vowels. This would be a huge help in automatically correcting the Latin in B/T, for those portions of B/T where length on Latin words is indicated.
> 1. We strip all punctuation and html markup from the BT files we want to check and replace &aacute; by á etc.
>
> 2. Using a language guesser (e.g. the trigram code here http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/326576), we identify maximal groups of adjacent words all in the same language.
That's a good idea. There would need to be some way to handle the problem that the text itself has a lot of errors (so a trigram of Latin words might contain two good words and one bad one, but you could probably still use that noisy information to make a good guess).
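Something along the lines of that recipe could be sketched like this; the toy training strings, window size, and floor probability are all placeholders (real runs would train on the full corpora, or hand the scoring off to dbacl), but it shows how a word window smooths over isolated bad words:

```python
import math
from collections import Counter
from itertools import groupby

def trigram_model(text):
    """Normalized character-trigram frequencies for one language's corpus."""
    text = " ".join(text.lower().split())
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}

def log_score(model, text, floor=1e-6):
    """Log-likelihood of text's trigrams; unseen trigrams get a small floor."""
    text = " ".join(text.lower().split())
    return sum(math.log(model.get(text[i:i + 3], floor))
               for i in range(len(text) - 2))

def label_words(models, words, window=2):
    """Label each word by the best-scoring language over the word plus a few
    neighbors, so one corrupted word among good ones doesn't flip the guess."""
    labels = []
    for i in range(len(words)):
        chunk = " ".join(words[max(0, i - window):i + window + 1])
        labels.append(max(models, key=lambda lang: log_score(models[lang], chunk)))
    return labels

def maximal_runs(models, words):
    """Group adjacent same-language words into maximal runs (step 2)."""
    labels = label_words(models, words)
    runs, i = [], 0
    for lang, grp in groupby(labels):
        n = len(list(grp))
        runs.append((lang, words[i:i + n]))
        i += n
    return runs
```
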
> 3. We check each of these groups using the appropriate model built in step 0.
>
> ?
> Keith
The general approach seems to make a lot of sense. It's certainly worth a try. Sometimes I try things like this and they work beautifully, and other times they are a disappointment. I don't know which one this will be. There's the non-trivial groundwork of getting a Latin corpus assembled from the stuff on the web.
There is also the issue of the abbreviations. You get a one-, two-, or three-word abbreviation, followed by a number, a comma, and a number (then optionally a space and a letter). These could be treated as another "language". The difference in this case is that we have a complete list of the abbreviations, since the authors give them to us. There's a nice XML file I made containing the B/T abbreviations; it is on the B/T page under "Texts".
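Since the list is closed, spotting these citations could start from a regex built over it. A sketch of that shape, where the three sample abbreviations are just stand-ins for the real list in the XML file:

```python
import re

# Hypothetical sample entries; the real set comes from the B/T
# abbreviations XML file on the project page.
abbrevs = ["Beo. Th.", "Exon. Th.", "Ors."]

# The shape described above: abbreviation, then number comma number,
# then an optional space and single letter.
citation = re.compile(
    "(?:" + "|".join(re.escape(a) for a in abbrevs) + ")"
    r"\s+\d+,\s*\d+(?:\s[a-z])?"
)
```
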
This is fun, trying to figure this out.
--Sean