Germanic Lexicon Project
Message Board
Author: Boaz Yaniv (Hebrew University of Jerusalem)
Date: 2009-06-08 17:59:32
Subject: Converting Bosworth/Toller into Unicode and XML
Hi,
First of all, I'd like to thank everybody here for your efforts in working on this dictionary. I really find it useful.
Right now, I'm trying to generate a dictionary for offline use from the Bosworth/Toller master file (the most recent version, if I judge correctly, is bt_canon_3.txt). The file format is far from trivial to process (it is not proper XML, for one thing), but I have more or less managed to get through it, and I think I can write a script that will convert it into proper XML almost automatically.
As the work on the dictionary seems to be ongoing, the question is - should I really make a fork? Wouldn't it be more productive if the master file itself were converted into a new, well-formed XML format, conveniently organized by entry (and not by page)? If this is considered a goal, I'd be glad to help, since I'm already doing this anyway.
The second major change I'll have to make is converting the generated dictionary to Unicode. The master file uses a slew of confusing, non-standard SGML/HTML/XML entities; some are redundant and some are just plain wrong (lacking the terminating ";", or misspelled, like "&actue;" instead of "&acute;"). I've already written a simple script that fixes all of that and normalizes everything to proper Unicode, using combining diacritics where necessary. I still haven't put the Hebrew characters into logical order (they're in visual order right now, although the diacritics are not), but I think that will be simple.
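To give an idea of the kind of entity cleanup described above, here is a minimal Python sketch. It is not the actual script, and the tables in it are assumptions for illustration only: MISSPELLINGS holds just the one misspelling mentioned above, and the "ymacr" entry in CUSTOM_ENTITIES is a hypothetical non-standard entity, not necessarily one that occurs in bt_canon_3.txt. The standard names come from Python's html.entities table (HTML 4 names only).

```python
import re
import unicodedata
from html.entities import name2codepoint

# Assumed repair table: misspelled entity name -> intended name.
MISSPELLINGS = {"actue": "acute"}

# Hypothetical non-standard entities: base letter plus combining mark(s).
CUSTOM_ENTITIES = {"ymacr": "y\u0304"}  # y + COMBINING MACRON

# Entities left alone so that the output remains safe to embed in XML.
KEEP = {"amp", "lt", "gt"}

# An entity name, with the terminating ";" optional (it is sometimes missing).
ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);?")


def entities_to_unicode(text: str) -> str:
    """Replace known named entities with Unicode characters, then apply NFC."""

    def replace(match: re.Match) -> str:
        name = MISSPELLINGS.get(match.group(1), match.group(1))
        if name in KEEP:
            return match.group(0)
        if name in CUSTOM_ENTITIES:
            return CUSTOM_ENTITIES[name]
        if name in name2codepoint:
            return chr(name2codepoint[name])
        # Unknown name: leave it untouched so it can be reviewed by hand.
        return match.group(0)

    # NFC composes base + combining mark where a precomposed character exists
    # and keeps the combining sequence otherwise.
    return unicodedata.normalize("NFC", ENTITY.sub(replace, text))


if __name__ == "__main__":
    print(entities_to_unicode("&aelig;&thorn;el &actue; &eth "))
```

Unknown entity names are deliberately left as they are rather than guessed at, so anything the tables do not cover stays visible for manual correction.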
Would any of this work be useful to the project? This can easily solve the problem of not being able to view Greek or Hebrew entries.
Cheers,
Boaz
Messages in this thread | Name | College/University | Date |
Converting Bosworth/Toller into Unicode and XML | Boaz Yaniv | Hebrew University of Jerusalem | 2009-06-08 17:59:32 |
Re: Converting Bosworth/Toller into Unicode and XML | Ondrej Tichy | Charles University, Prague | 2009-06-13 23:07:30 |