Germanic Lexicon Project
Message Board
Author: Sean Crist (Swarthmore College)
Email: scrist1 at swarthmore dot edu
Date: 2004-11-05 09:48:16
Subject: Re: Probabilistic correction
> Sean: would you be able to send me a gzipped tar file of these? (it's too slow
> to get them separately by hand).
There are a few ways we could do this:
1. If it would be adequate for your purposes, you could get the whole dictionary by clicking the "Texts" tab, going to the Bosworth/Toller page, and clicking the link to download the huge joined text file of the entire dictionary. This file is joined together once a week from the most recent version of each page, or the uncorrected page if there is no corrected version. You could edit out the pages which have already been corrected.
2. There is a directory on the server which contains a separate text file for each page. When you upload a file to the server, it gets saved to this directory with a name like bt_b0025_20041015; the last part is the date when the file was submitted (so if a page has been submitted more than once, there is a separate copy for each date). In the same directory, there is bt_b0025_00000000, which is the uncorrected file. This way we can assemble the most up-to-date version of the dictionary by picking the version of each page with the most recent date. This is what happens every week.
3. There is also another file even further upstream. It is the entire dictionary, and it has had no hand corrections done on it. When I do global corrections, I do them on this file. Then I re-explode this file into all the 00000000 pages mentioned above, overwriting the old 00000000 files. Then the PDF files are regenerated. This way, the hand-corrected versions of pages automatically take precedence and aren't affected by global corrections which might inadvertently introduce new errors.
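If you take the per-page files from option 2, the "pick the most recent date for each page" step could be sketched roughly like this. This is only an illustration in Python, not the script that actually runs on the server; the function name latest_versions is made up, and it assumes every file name ends in an underscore followed by an 8-digit date, as in bt_b0025_20041015.

```python
def latest_versions(filenames):
    """Map each page id to the file name carrying the most recent date.

    Assumption: names look like <page>_<YYYYMMDD>, where the date
    00000000 marks the uncorrected page. Other names are skipped.
    """
    latest = {}
    for name in filenames:
        # Split on the last underscore: everything before it is the page id.
        page, sep, date = name.rpartition("_")
        if not sep or len(date) != 8 or not date.isdigit():
            continue  # not a page file in the assumed naming scheme
        # Fixed-width digit strings compare correctly as plain strings,
        # and 00000000 sorts before any real submission date.
        if page not in latest or date > latest[page][0]:
            latest[page] = (date, name)
    return {page: name for page, (date, name) in latest.items()}
```

You would pass it the directory listing (for example from os.listdir) and then join the chosen files in page order. Because 00000000 is the lowest possible date, an uncorrected page is used only when no corrected copy exists.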
You can get 1 yourself, or I can make a tar file of 2 or 3. #2 would allow you to compare corrected and uncorrected pages, which might be of some use. Let me know which you'd prefer.
--Sean