Barrington Atlas Data Munger

We're building Python and XSLT code to create a scalable, flexible, and reliable toolset for moving the legacy Barrington Atlas data into a format ready for upload to Pleiades.

Code base

Data

  • in AWMC mass storage at UNC

Pseudocode

For each directory file:

  • Preparatory work:
    • Manually inspect for idiosyncrasies in structure or format (e.g., Sardinia/Corsica or Map 87 vs. Map 87 inset)
      • Modify code or develop other procedural workarounds as necessary
    • Determine default country for this directory (note Sardinia/Corsica problem)
  • Save from MSWord as "web page"; "Web Options" must first be set as follows:
    • Target browser: MSIE 5.0 or later
    • Disable features not supported by this browser = yes
    • Rely on CSS for font formatting = yes
    • Save new web pages as Web archives = no
    • Encoding: save this document as Unicode (UTF-8)
    • (others irrelevant)
  • Make the saved HTML well-formed and valid
  • Strip unneeded formatting inherited from MSWord
  • Extract and parse bibliography from the bibliography division and any abbreviation(s) table
  • Save bibliographic data to file in MODS format
  • Isolate the directory listing tables, determine the title of each, and store each title with its table content in a structure more amenable to further parsing
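
The last step above (isolating the directory tables together with their titles) might be sketched in Python roughly as follows. This is only a sketch under stated assumptions, not the project's actual code. It assumes the file has already been made well-formed, and, since these notes do not say how titles are determined, it assumes that a table's title is the nearest non-empty heading or paragraph preceding it. The names isolate_tables, TITLE_TAGS, local_name, and text_of are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: table titles appear in headings or plain paragraphs.
TITLE_TAGS = {"h1", "h2", "h3", "h4", "p"}


def local_name(tag):
    """Drop any XML namespace prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def text_of(elem):
    """Flatten an element's text content and normalize whitespace."""
    return " ".join("".join(elem.itertext()).split())


def isolate_tables(xhtml):
    """Return [{'title': str, 'rows': [[cell_element, ...], ...]}, ...].

    Cells are kept as elements (not flattened to text) so that inline
    markup such as <i> survives for later parsing.
    """
    root = ET.fromstring(xhtml)
    tables = []
    last_title = ""

    def walk(elem):
        nonlocal last_title
        name = local_name(elem.tag)
        if name == "table":
            rows = [
                [c for c in row if local_name(c.tag) in ("td", "th")]
                for row in elem.iter()
                if local_name(row.tag) == "tr"
            ]
            tables.append({"title": last_title, "rows": rows})
            return  # do not look for titles inside table cells
        if name in TITLE_TAGS:
            text = text_of(elem)
            if text:
                last_title = text
        for child in elem:
            walk(child)

    walk(root)
    return tables
```

Nested tables are not handled separately here: their rows would simply be folded into the enclosing table. ElementTree also rejects undefined entities such as &nbsp;, so those would need to be resolved during the clean-up step.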

For each table:

  • Prep:
    • Determine relevance of table and eliminate from further processing if irrelevant
    • Define the mapping from columns to desired data items
    • Eliminate irrelevant rows from each table:
      • blanks
      • cross-references (both internal and external)
  • For each row: read each column's content into an appropriate variable, preparatory to subsequent processing (one row = one place)
    • Make sure to preserve essential HTML formatting (like <i> in names, which signals use of a modern name vs. an ancient one)
      • Note that a few maps can have mixed modern and ancient names in the names column (esp. Roaf's maps, IIRC).
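
The per-table steps above could be sketched along these lines. Again, this is a hedged illustration rather than the real implementation. The column mapping COLUMNS, the function names, and especially the cross-reference test (a cell beginning with "see ") are placeholders invented for the example; the actual rules would have to come from inspecting each directory. Names are kept as a list of (text, is_italic) runs rather than a single flag, so that a cell mixing modern and ancient names is not flattened.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping for one table type: column index -> data item.
COLUMNS = {0: "grid", 1: "name", 2: "period"}


def local_name(tag):
    """Drop any XML namespace prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def cell_text(cell):
    """Flatten a cell to whitespace-normalized plain text."""
    return " ".join("".join(cell.itertext()).split())


def name_segments(cell):
    """Split a cell into (text, is_italic) runs, preserving <i> markup."""
    segments = []

    def walk(elem, italic):
        italic = italic or local_name(elem.tag) == "i"
        if elem.text and elem.text.strip():
            segments.append((" ".join(elem.text.split()), italic))
        for child in elem:
            walk(child, italic)
            # Text after a child belongs to this element, not the child.
            if child.tail and child.tail.strip():
                segments.append((" ".join(child.tail.split()), italic))

    walk(cell, False)
    return segments


def looks_like_cross_reference(texts):
    """Placeholder rule (an assumption): any cell starting with 'see '."""
    return any(t.lower().startswith("see ") for t in texts)


def parse_rows(table, columns=COLUMNS, is_xref=looks_like_cross_reference):
    """Return one dict per place, skipping blank and cross-reference rows."""
    places = []
    for row in table.iter():
        if local_name(row.tag) != "tr":
            continue
        cells = [c for c in row if local_name(c.tag) in ("td", "th")]
        texts = [cell_text(c) for c in cells]
        if not any(texts) or is_xref(texts):
            continue
        place = {}
        for index, field in columns.items():
            if index >= len(cells):
                place[field] = [] if field == "name" else ""
            elif field == "name":
                place[field] = name_segments(cells[index])
            else:
                place[field] = texts[index]
        places.append(place)
    return places
```

Passing the mapping and the cross-reference test in as parameters is a deliberate choice in this sketch: both are expected to vary from table to table, so they are kept out of the row loop itself.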