Pleiades Content Methods
What we need to do to get content (especially legacy content) into Pleiades, and the tools we use to make that happen.
CAP Map-by-Map Directory Listings
Let's call the XHTML data "hDir" and the XML data "xDir".
- Convert from MSWord to well-formed XHTML, minus MSWord tags, and put somewhere safe
  - done 18 November 2008; it's on atlantides.org under /var/pleiades/batlas/dirs/xhtml/ (accomplished with code in dir2xml.py in BADataMunger; see README.txt)
- Convert from MSWord to some kind of relatively clean, semantic XML and put somewhere safe
  - first stab 18 November 2008; it's on atlantides.org under /var/pleiades/batlas/dirs/xml/ (accomplished with code in dir2xml.py in BADataMunger; see README.txt)
  - the result is basically batlasid XML munged together with the old-style "frankenformat" XML
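As a rough illustration of the "minus MSWord tags" step, here is a minimal sketch using only the Python standard library. It is not the code in dir2xml.py. It assumes the input is already well-formed, that Word's own elements live in the Office namespaces listed below, and that Word-specific styling shows up as mso-* style declarations and Mso* class names. The function name strip_word_markup and the sample markup are made up for this example.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: Word's own elements live in these namespaces.
OFFICE_NAMESPACES = {
    "urn:schemas-microsoft-com:office:office",
    "urn:schemas-microsoft-com:office:word",
    "urn:schemas-microsoft-com:vml",
}
# Matches one Word-specific CSS declaration, e.g. "mso-bidi-font-size:10.0pt;"
MSO_DECLARATION = re.compile(r"\s*mso-[^:;]+:[^;]*;?", re.IGNORECASE)


def _namespace(tag):
    """Return the namespace URI of an ElementTree tag, or '' if it has none."""
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else ""


def strip_word_markup(root):
    """Remove Office-namespaced elements and Word styling, in place.

    The content of a removed element is dropped, but the text that
    follows it (its tail) is kept.
    """
    for element in list(root.iter()):
        previous = None
        for child in list(element):
            if _namespace(child.tag) in OFFICE_NAMESPACES:
                tail = child.tail or ""
                if previous is None:
                    element.text = (element.text or "") + tail
                else:
                    previous.tail = (previous.tail or "") + tail
                element.remove(child)
            else:
                previous = child
        style = element.attrib.get("style")
        if style is not None:
            cleaned = MSO_DECLARATION.sub("", style).strip()
            if cleaned:
                element.set("style", cleaned)
            else:
                del element.attrib["style"]
        if element.attrib.get("class", "").startswith("Mso"):
            del element.attrib["class"]
    return root


# Invented sample input, not taken from the real directory files.
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml" '
    'xmlns:o="urn:schemas-microsoft-com:office:office"><body>'
    '<p class="MsoNormal" style="mso-bidi-font-size:10.0pt; color:red">'
    'Invented place<o:p>&#160;</o:p> (grid A1)</p></body></html>'
)
cleaned_root = strip_word_markup(ET.fromstring(SAMPLE))
print(ET.tostring(cleaned_root, encoding="unicode"))
```

Real Word output is often not well-formed to begin with, so a tidy-up pass would have to come before anything like this.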
CAP Bibliography Listings
As above, call the XHTML data "hBib" and the XML data "xBib".
- Extract from directory XHTML (see above) and convert to XHTML term lists instead of just paragraphs
- Convert from XHTML term lists to MODS by parsing and matching, where possible, against the AWMC bibliographic database content, which is already in MODS
- Submit book records to Open Library
- Keep existing bibliographic service
  - prep and upload tooling needs to be streamlined and made easier to use (BibIt, plus I think some BADataMunger dependencies)
  - make Zotero interop better (less lossy) than the current COinS support
- Current bibliographic data is in the pleiades-bibliography module
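The "term lists instead of just paragraphs" step could look something like the sketch below. It assumes, purely for illustration, that each bibliography paragraph has already been reduced to plain text of the form "short form<TAB>full citation"; the real directory files may use a different separator or layout. The function name paragraphs_to_term_list is hypothetical. Paragraphs that do not fit the pattern are returned separately rather than guessed at.

```python
import xml.etree.ElementTree as ET


def paragraphs_to_term_list(paragraphs, separator="\t"):
    """Build a <dl> with one <dt>/<dd> pair per parsable paragraph.

    Returns (dl_element, unparsed), where unparsed lists the paragraphs
    that did not split into a short form and a citation.
    """
    dl = ET.Element("dl")
    unparsed = []
    for text in paragraphs:
        short, sep, full = text.partition(separator)
        if not sep or not short.strip() or not full.strip():
            unparsed.append(text)
            continue
        ET.SubElement(dl, "dt").text = short.strip()
        ET.SubElement(dl, "dd").text = full.strip()
    return dl, unparsed


# Invented entries, not real bibliography data.
entries = [
    "Smith 1990\tJ. Smith, An Invented Title, Nowhere 1990",
    "a paragraph with no separator",
]
term_list, leftovers = paragraphs_to_term_list(entries)
print(ET.tostring(term_list, encoding="unicode"))
print(leftovers)
```

Keeping the leftovers visible means the odd entries get cleaned up by hand instead of being silently mangled.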
Matching CAP-list-extracted data with map-digitized data
- extract/retool the existing logic that's used to match digitized map data with the directory data so it can be cleanly packaged for use in a tool
  - maybe subclass concordia.matchtool to do the job?
Fuzziness
Resolving ambiguity between map sheet features and map directory records is the community's work. Development should focus on indicating ambiguity and showing possibilities.
Example:
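Map grid A1 has two unlabeled bridge features, and the directory has two records for "unnamed bridges" in grid A1. Nothing in the data says which record goes with which feature, so a tool should show both pairings as possibilities rather than pick one.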
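To make "indicating ambiguity and showing possibilities" concrete, here is a small sketch of a candidate lister. It is not the existing matching logic and does not use concordia.matchtool. The MapFeature and DirectoryRecord classes, their fields, and the rule that candidates share a grid square and feature type are all assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MapFeature:
    feature_id: str
    grid: str
    feature_type: str
    label: str = ""  # empty when the feature is unlabeled on the map


@dataclass(frozen=True)
class DirectoryRecord:
    record_id: str
    grid: str
    feature_type: str
    name: str = ""  # empty for "unnamed" directory entries


def candidate_matches(features, records):
    """For each feature, list plausible directory records and flag ambiguity.

    Candidates share the feature's grid square and type. If any of them
    also carry the feature's label as their name, only those are kept.
    A feature is ambiguous unless it has exactly one candidate.
    """
    by_key = defaultdict(list)
    for record in records:
        by_key[(record.grid, record.feature_type)].append(record)

    report = {}
    for feature in features:
        pool = by_key.get((feature.grid, feature.feature_type), [])
        same_name = [r for r in pool if r.name == feature.label]
        candidates = same_name or pool
        report[feature.feature_id] = {
            "candidates": [r.record_id for r in candidates],
            "ambiguous": len(candidates) != 1,
        }
    return report


# The example above: two unlabeled bridges, two "unnamed bridge" records.
features = [
    MapFeature("f1", "A1", "bridge"),
    MapFeature("f2", "A1", "bridge"),
]
records = [
    DirectoryRecord("r1", "A1", "bridge"),
    DirectoryRecord("r2", "A1", "bridge"),
]
for feature_id, result in candidate_matches(features, records).items():
    print(feature_id, result)
```

Both bridges come back with both records as candidates and are flagged as ambiguous, which leaves the actual choice to the community. A fuller version would probably also flag records claimed by more than one feature.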
