Germanic Lexicon Project
Message Board

Author: Sean Crist (Nuance Communications)
Email: kurisuto1 at yahoo dot com
Date: 2007-07-09 12:58:27
Subject: Re: Character database made public

> Hi Sean,
>
> thanks for the update & the new entity list. I'll check the runes
> tomorrow. The list is now missing only the Hebrew set, I can send you the list
> we have devised so far.
>
> The only thing that is troubling me is the list of Greek entities. Does the
> new list come from the currently used entities as introduced mainly by Bekkie,
> or did you just create them upon a Greek-Unicode table? We have devised a list
> to correspond to BT's use, so that we don't need to collapse any
> diacritics etc. and the list is a bit different. Shall we do some conversion,
> or include all the entities?

Yikes-- posting that file does seem to have opened a can of worms. :-/

The list isn't actually new; it's been a project-internal file for a long time. Back in the early days of the project, I used to just make up an entity when I needed one. I couldn't always remember what I had done from one text to another, so the entities didn't agree perfectly across texts. Later, I went through and made a central list, and then made all of the texts agree with that list, so that everything was uniform across the project. After that, I added more columns to the character table for use by the search system, the web-based correction system, etc.

The character database is driven by what's actually been found in the texts. If &omicron-dasia-oxia; is in the character database, it's because there's an instance of that character in one of the texts somewhere. The only case where I included any characters which aren't necessarily in the texts was that I included all of the standard entities for the Latin 1 (ISO-8859-1) characters which aren't in ASCII. The ones which aren't found in the project texts are marked as such in one of the columns.
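To illustrate the idea of a table driven by what is actually found in the texts, here is a minimal sketch in Python. It is not the project's actual code or file format: the column names `entity` and `in_texts`, the function `build_table`, and the entity regex are all assumptions made for the example. The Latin 1 entity names come from Python's standard `html.entities` module.

```python
import re
from html.entities import codepoint2name

# Assumed entity syntax: &name; where name may contain hyphens,
# as in &omicron-dasia-oxia;
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9-]*);")

def build_table(texts):
    """Build a hypothetical character table from a list of text strings.

    Every entity seen in the texts gets a row. The standard Latin 1
    entities (code points 0xA0 to 0xFF) are always included, and the
    'in_texts' column records whether each one was actually found.
    """
    used = set()
    for text in texts:
        used.update(ENTITY_RE.findall(text))

    latin1 = {codepoint2name[cp]
              for cp in range(0xA0, 0x100)
              if cp in codepoint2name}

    return [{"entity": name, "in_texts": name in used}
            for name in sorted(used | latin1)]
```

Calling `build_table(["&omicron-dasia-oxia; caf&eacute;"])` would give rows for both of those entities marked as found, plus rows for the remaining Latin 1 entities marked as not found.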

Bekie's text was encoded in UTF-8; she didn't devise any entities. When I imported her text, I converted the UTF-8 characters to the project's entities. (There was a good bit of cleanup involved; I remember, for example, one case where a Greek word used the Cyrillic "A" instead of the Greek "A".) Since Bekie had represented all of the classical Greek diacritics with the modern tonos, I used temporary entities such as &omicron-tonos; within her pages of Bosworth/Toller, but I didn't add those to the project character list, since we want to replace those entities in the long run.
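As a rough illustration of that kind of import step, here is a short Python sketch; it is not the script that was actually used. It derives an entity name from the Unicode character name, which happens to reproduce &omicron-tonos; and &omicron-dasia-oxia;, but whether the project's other entities follow the same pattern is an assumption. The function names `greek_entity`, `to_entities`, and `stray_cyrillic` are made up for the example, and only Greek small letters are handled.

```python
import unicodedata

PREFIX = "GREEK SMALL LETTER "

def greek_entity(ch):
    """Return a hypothetical entity for a Greek small letter, else None.

    The name is built from the Unicode character name, e.g.
    'GREEK SMALL LETTER OMICRON WITH TONOS' -> '&omicron-tonos;'.
    """
    name = unicodedata.name(ch, "")
    if not name.startswith(PREFIX):
        return None
    rest = name[len(PREFIX):].replace(" WITH ", " ").replace(" AND ", " ")
    return "&" + "-".join(part.lower() for part in rest.split()) + ";"

def to_entities(text):
    """Replace Greek small letters with entities; leave everything else."""
    return "".join(greek_entity(ch) or ch for ch in text)

def stray_cyrillic(word):
    """Return the positions of Cyrillic characters in a word that also
    contains Greek characters (e.g. Cyrillic 'A' typed for Greek 'A')."""
    scripts = [unicodedata.name(ch, "").split(" ")[0] for ch in word]
    if "GREEK" not in scripts:
        return []
    return [i for i, script in enumerate(scripts) if script == "CYRILLIC"]
```

A check like `stray_cyrillic` only flags mixed-script words for a human to look at; deciding what the correct Greek character should be is still manual cleanup.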

Could you email me your list? Let's see how bad the differences are, and figure out the best course of action.

--Sean

Messages in this thread:

Subject                              Name          College/University                             Date
Character database made public       Sean Crist    Nuance Communications                          2007-07-01 15:51:51
Re: Character database made public   Keith Briggs  BT Research                                    2007-07-03 10:56:46
Re: Character database made public   Ondrej Tichy  Faculty of Arts, Charles University, Prague    2007-07-08 01:59:54
Re: Character database made public   Sean Crist    Nuance Communications                          2007-07-09 04:17:58
Re: Character database made public   Sean Crist    Nuance Communications                          2007-07-09 12:58:27