Germanic Lexicon Project
Message Board

Author: Keith Briggs
Date: 2004-11-09 06:47:45
Subject: Re: Probabilistic correction

> That's a good idea. There would need to be some way to handle the problem that the text has a lot of errors (so a trigram of Latin words might contain two good words and one bad word, but you could probably still use that noisy information to make a good guess).
> ...
> The general approach seems to make a lot of sense. It's certainly worth a try. Sometimes I try things like this and they work beautifully, and other times they are a disappointment. I don't know which one this will be. There's the non-trivial groundwork of getting a Latin corpus assembled from the stuff on the web.
> ...
> There is also the issue of the abbreviations. You get a one, two, or three word abbreviation, followed by number comma number (then optional: space, letter). These could be treated as another "language". The difference in this case is that we have a complete list of the abbreviations, since the authors give them to us. There's a nice XML file I made containing the B/T abbreviations; it is on the B/T page under "Texts".
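
The citation shape described in the quote (one to three abbreviation words, then number comma number, then an optional space and letter) can be sketched as a regular expression. This is only an illustration: the names `CITATION` and `find_citations` are made up here, each abbreviation word is assumed to end in a full stop, and nothing is checked against the B/T abbreviation list, since the layout of that XML file is not shown in this thread.

```python
import re

# Assumed shape: 1-3 abbreviation words, each ending in a full stop,
# then "number, number", then optionally a space and a single letter.
CITATION = re.compile(
    r"(?P<abbrev>(?:[^\W\d_]+\.\s+){1,3})"   # e.g. "Ps. Spl. " or "Ps. Spl. T. "
    r"(?P<first>\d+),\s*(?P<second>\d+)"      # e.g. "5, 15"
    r"(?:\s(?P<suffix>[A-Za-z])\b)?"          # optional trailing letter
)


def find_citations(line):
    """Return (abbreviation, first number, second number, suffix) tuples."""
    found = []
    for match in CITATION.finditer(line):
        # Collapse any run of whitespace inside the abbreviation.
        abbrev = " ".join(match.group("abbrev").split())
        found.append(
            (abbrev, int(match.group("first")), int(match.group("second")),
             match.group("suffix"))
        )
    return found


if __name__ == "__main__":
    for line in ["Ps. Spl. 5, 15. Of wuldre and weorþmynt ðú",
                 "Ps. Spl. T. 8, 6. Gehelmod",
                 "Ælfc. Gr. 43; Som. 45, II."]:
        print(line, "->", find_citations(line))
```

A strict pattern like this finds nothing in the last line, because "43;" and "45, II" do not fit number comma number. Such near misses could be worth flagging for checking by hand rather than silently skipping.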

Lots to think about here! At the moment I have time for only a few simple tests, but I'm thinking more and more that we need to initiate a proper research project on this. I might be able to organize something next year through a Master's program I am involved with - if this works, we could have a student work on it for 3 months next summer. But would this be too late?

Meanwhile, I trained dbacl on a Latin text: I picked Pliny, Liber II, at random (it's about 2200 words). Then I split the BT gehelmian entry into 11 lines. Three lines are Latin, all with errors. Two of these are the only lines for which dbacl gives a Latin score clearly above zero (6.68% and 0.21%), although they still score higher as c (= corrected BT). It fails to recognize "corSnasti nos", which is not surprising, as the Pliny text contains no instances of an "asti" termination. Still, this is quite promising: I think that by training on bigger corpora we should be able to get very reliable language recognition. Remember that category c contains mixed languages, so it's really a bad model.

> db="dbacl -c c -c latin -vN"
> echo "ge-helmian ; p. ode, ede; pp. od, ed" | ${db}
c 100.00% latin 0.00%
> echo "To cover with a helmet, crown;" | ${db}
c 100.00% latin 0.00%
> echo "galeSre, coronare" | ${db}
c 93.32% latin 6.68%
> echo ":-- ÐÚ gehelmodest us" | ${db}
c 100.00% latin 0.00%
> echo "corSnasti nos," | ${db}
c 100.00% latin 0.00%
> echo "Ps. Spl. 5, 15. Of wuldre and weorþmynt ðú" | ${db}
c 100.00% latin 0.00%
> echo "gehelmedest hine" | ${db}
c 99.99% latin 0.01%
> echo "de gloria et hondre coronasti eum," | ${db}
c 99.79% latin 0.21%
> echo "Ps. Spl. T. 8, 6. Gehelmod " | ${db}
c 100.00% latin 0.00%
> echo "gáleátus, Ælfc. Gr. 43; Som. 45, II." | ${db}
c 100.00% latin 0.00%
> echo "[Laym, i-helmed : O. H. Ger. gehelmot.]" | ${db}
c 100.00% latin 0.00%
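
To make the per-line scoring idea concrete, here is a minimal sketch of a character-trigram classifier with add-one smoothing. It is not dbacl's algorithm and does not reproduce the numbers above. The function names and the two tiny training samples (the opening of Caesar's Gallic War for Latin, a few made-up English sentences) are stand-ins for a real corpus such as the Pliny text.

```python
import math
from collections import Counter

# Toy training samples, for illustration only (not the corpora used above).
SAMPLES = {
    "latin": ("Gallia est omnis divisa in partes tres, quarum unam incolunt "
              "Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, "
              "nostra Galli appellantur. Hi omnes lingua, institutis, "
              "legibus inter se differunt."),
    "english": ("To cover with a helmet, to crown. The entry gives the "
                "headword, its principal parts, a definition, and then "
                "quotations with references to the source texts. Each "
                "quotation is followed by an abbreviated citation."),
}


def trigrams(text):
    """Character trigrams of the lower-cased text, padded with spaces."""
    padded = f"  {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train(samples):
    """Count trigrams separately for each category."""
    return {name: Counter(trigrams(text)) for name, text in samples.items()}


def classify(models, text):
    """Return normalized posterior probabilities, assuming equal priors."""
    # Shared vocabulary size (+1 for unseen trigrams) for add-one smoothing.
    vocab = len(set().union(*models.values())) + 1
    scores = {}
    for name, counts in models.items():
        total = sum(counts.values())
        scores[name] = sum(
            math.log((counts[gram] + 1) / (total + vocab))
            for gram in trigrams(text)
        )
    # Subtract the best score before exponentiating to avoid underflow.
    top = max(scores.values())
    weights = {name: math.exp(score - top) for name, score in scores.items()}
    norm = sum(weights.values())
    return {name: weight / norm for name, weight in weights.items()}


if __name__ == "__main__":
    models = train(SAMPLES)
    for line in ["galeSre, coronare", "corSnasti nos,",
                 "To cover with a helmet, crown;"]:
        result = classify(models, line)
        print(line, {name: f"{100 * p:.2f}%" for name, p in result.items()})
```

Because the units are character trigrams rather than words, a single misread letter spoils at most three trigrams in a line, so a line with errors can still lean towards the right category. With samples this small the percentages mean little; the point is only the shape of the computation.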



