Barrington Atlas Data Munger
We're building Python and XSLT code to create a scalable, flexible, and reliable toolset for moving the legacy Barrington Atlas data into a format ready for upload to Pleiades.
Code base
Data
- in AWMC mass storage at UNC
Pseudocode
For each directory file:
- Preparatory work:
- Manually inspect for idiosyncrasies in structure or format (e.g., Sardinia/Corsica or Map 87 vs. Map 87 inset)
- Modify code or develop other procedural workarounds as necessary
- Determine default country for this directory (note Sardinia/Corsica problem)
- Note: a Python program strings all the bits and pieces together: source:BADataMunger/trunk/pipeline.py
- Save from MSWord as "web page"; "Web Options" must first be set as follows:
- Target browser: MSIE 5.0 or later
- Disable features not supported by this browser = yes
- Rely on CSS for font formatting = yes
- Save new web pages as Web archives = no
- Encoding: save this document as Unicode (UTF-8)
- (others irrelevant)
- Make the saved web page well-formed and valid
- the cycle() method on the Pipe class in pipeline.py invokes code in wordhtml2xml.py to accomplish this task
- Strip unneeded formatting inherited from MSWord
- the cycle() method on the Pipe class in pipeline.py invokes code in wordstripper.py to accomplish this task
- Extract and parse bibliography from the bibliography division and any abbreviation(s) table
- the cycle() method on the Pipe class in pipeline.py invokes code in biblioextractor.py to accomplish this task
- Save bibliographic data to file in MODS format
- the cycle() method on the Pipe class in pipeline.py invokes code in bibliosaver.py to accomplish this task
- Isolate the directory listing tables, determine the title of each and store the title and content in a structure more amenable to further parsing
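The table-isolation step can be sketched roughly as follows. This is an illustration only, not the code in the repository: the function names `inner_markup` and `isolate_tables` are hypothetical, and the sketch assumes that the cleaned file is namespace-free, well-formed XHTML, that each table is a direct child of `<body>`, and that a table's title is the nearest preceding element with text in it. The real directory files may well need different rules.

```python
import xml.etree.ElementTree as ET


def inner_markup(elem):
    """Serialize an element's content, keeping child tags such as <i>."""
    parts = [elem.text or ""]
    # ET.tostring() includes each child's tail text, so nothing is lost.
    parts.extend(ET.tostring(child, encoding="unicode") for child in elem)
    return "".join(parts).strip()


def isolate_tables(xhtml):
    """Return a list of {"title": ..., "rows": ...} dicts, one per table.

    Assumption: the title of a table is the text of the closest preceding
    sibling that has any text; a table with no such sibling gets "".
    """
    body = ET.fromstring(xhtml).find("body")
    tables = []
    title = ""
    for elem in body:
        if elem.tag == "table":
            rows = [[inner_markup(td) for td in tr.findall("td")]
                    for tr in elem.iter("tr")]
            tables.append({"title": title, "rows": rows})
            title = ""  # a title belongs to one table only
        else:
            text = "".join(elem.itertext()).strip()
            if text:
                title = text
    return tables
```

Cell content is kept as markup rather than flattened to text, so that formatting such as `<i>` is still available to later steps.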
For each table:
- Prep:
- Determine relevance of table and eliminate from further processing if irrelevant
- Define the mapping from columns to desired data items
- Eliminate irrelevant rows from each table:
- blanks
- cross-references (both internal and external)
- For each row: read each column's content into an appropriate variable, preparatory to subsequent processing (one row = one place)
- Make sure to preserve essential html formatting (like <i> in names, which signals use of a modern name vs. an ancient one)
- Note that a few maps can have mixed modern and ancient names in the names column (esp. Roaf's maps, IIRC).
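The per-table steps above might look something like the sketch below. Everything named here is hypothetical and chosen for illustration: the `COLUMNS` mapping, the rule that a cross-reference row has a cell starting with "see ", and the separators stripped between names are all assumptions, not the conventions actually used in the directory files. Each row is assumed to be a list of cell strings that still carry their inline markup.

```python
import re

# Hypothetical mapping from column position to data item; the real mapping
# would be defined separately for each table.
COLUMNS = {0: "grid", 1: "name", 2: "period"}

ITALIC = re.compile(r"<i>(.*?)</i>", re.IGNORECASE | re.DOTALL)
TAG = re.compile(r"<[^>]+>")


def is_blank(row):
    """True if no cell contains text once markup is removed."""
    return all(not TAG.sub("", cell).strip() for cell in row)


def is_cross_reference(row):
    # Assumption: a cross-reference row has a cell whose text starts with "see ".
    return any(TAG.sub("", cell).strip().lower().startswith("see ")
               for cell in row)


def classify_names(cell):
    """Split a names cell into (text, kind) pairs; italic runs are 'modern'."""
    parts = []
    pos = 0
    for match in ITALIC.finditer(cell):
        # Text before the italic run; " /," are assumed separators.
        before = TAG.sub("", cell[pos:match.start()]).strip(" /,")
        if before:
            parts.append((before, "ancient"))
        inside = TAG.sub("", match.group(1)).strip()
        if inside:
            parts.append((inside, "modern"))
        pos = match.end()
    rest = TAG.sub("", cell[pos:]).strip(" /,")
    if rest:
        parts.append((rest, "ancient"))
    return parts


def parse_table(rows, columns=COLUMNS):
    """Return one dict per place, skipping blank and cross-reference rows."""
    places = []
    for row in rows:
        if is_blank(row) or is_cross_reference(row):
            continue
        place = {field: row[i].strip()
                 for i, field in columns.items() if i < len(row)}
        # The raw "name" markup is kept; "names" is derived from it.
        place["names"] = classify_names(place.get("name", ""))
        places.append(place)
    return places
```

Because the raw cell markup is kept alongside the classified names, a map with mixed modern and ancient names in one column can be re-examined later without going back to the Word export.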
