- An intermediate data layer to take what we have (a collection of PDFs and freeform CMS text) and normalize it into structured data.
- Not a final, permanent, format.
- Data that can be easily parsed, massaged, and then loaded into a final format to seed the new website's database.
- Minimal and flexible.
- XML. Easy to hand-edit. Easy to start with a template. Easy to copy/paste from website into editor.
- Data format validation. `test.sh` wraps calls to `xmllint` to look for bad XML, divergence from the DTD, or missing linked files.
- Wasting time arguing over the data structure. I had to start with something, then revised it a couple of times as I found nonconforming content. It's fairly solid now. Unless we run across more content that doesn't conform to the existing data structure, consider the DTD final for this task. Save the arguing for the final website's data structure.
- This is a “file based” database.
- Each month is in a `YYYY/MM` directory.
- Each month has a `month.xml` describing the month and its puzzles.
- There are also the various support files: puzzle PDFs, answer sheets, solutions, etc.
- Upon import, I was anal-retentive and renamed the puzzle PDFs to `{number}-{title}-{puzzle | solution}.pdf`. That let me better see which files mapped to which puzzle. Some of the original names were pretty silly and embarrassing (“mypuzzle.pdf” and “final, version 3” and such). We don't have to stick to this naming.
- The test app doesn't specifically test for this (yet?), but there should be no files in these folders that are not referenced in the XMLs. Also, the XML always points to files that have no local dependencies of their own (such as images, or links to other local files). It's possible that some things link out to external resources such as crossword solvers, anagram solvers, etc.
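The unreferenced-file check that the test app does not perform yet could look roughly like the sketch below. It is an illustration, not part of `test.sh`: the function name `unreferenced_files` is hypothetical, and it assumes that linked files are named in `href` attributes relative to the month folder.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def unreferenced_files(month_dir: Path) -> list[str]:
    """Return names of files in month_dir that month.xml never links to."""
    xml_path = month_dir / "month.xml"
    # month.xml itself is always expected to be present.
    referenced = {xml_path.resolve()}
    for el in ET.parse(str(xml_path)).iter():
        href = el.get("href")
        # Assumption: anything containing "://" is an external link.
        if href and "://" not in href:
            referenced.add((month_dir / href).resolve())
    return sorted(
        p.name
        for p in month_dir.iterdir()
        if p.is_file() and p.resolve() not in referenced
    )
```

Calling `unreferenced_files(Path("2015/06"))` would return the sorted names of any stray files in that month's folder.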
Each month folder holds a month.xml file. This describes the following:
- The month and its description.
- Informational text about the month (theme, authors, etc.)
- All puzzles, including location.
- Location puzzle answer word.
- Hints, if available.
- Answer sheet, if available.
They should all conform to the top level month.dtd. Look at the comments there for more info. A sample_month.xml file lives at the top level as an example to copy when adding new months.
Run the top-level `test.sh` script. It walks each year/month folder and will:
- Validate the XML against the DTD using xmllint.
- Validate that any `href` attributes point to real files.
- Warn of any `href` attributes that point to external domains.
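For readers who want to reuse the `href` checks outside the shell script, here is a minimal Python sketch of the same two rules. It is not how `test.sh` is implemented. The function name `check_hrefs` is hypothetical, and the sketch assumes every linked file is named in an `href` attribute relative to its `month.xml`.

```python
from pathlib import Path
from urllib.parse import urlparse
import xml.etree.ElementTree as ET

MONTH_GLOB = "[0-9][0-9][0-9][0-9]/[0-9][0-9]/month.xml"


def check_hrefs(root: Path) -> tuple[list[str], list[str]]:
    """Return (missing, external) reports for every YYYY/MM/month.xml."""
    missing, external = [], []
    for xml_path in sorted(root.glob(MONTH_GLOB)):
        for el in ET.parse(str(xml_path)).iter():
            href = el.get("href")
            if not href:
                continue
            if urlparse(href).netloc:
                # Points at another domain: worth a warning, not an error.
                external.append(f"{xml_path}: {href}")
            elif not (xml_path.parent / href).is_file():
                missing.append(f"{xml_path}: {href}")
    return missing, external
```

DTD validation itself is left to `xmllint`; this sketch only covers the link checks.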
All freeform text in the XML is Markdown. We don't have to stick to this (for example, we can render it as HTML upon database import), but it's a start. Treat it as GitHub-flavored Markdown. It occasionally makes use of links, bold, italics, tables, and code blocks.
Linked files are typically PDF, but could technically be anything. We have a few media files here and there (MP3, MP4).
The hint file ./2015/06/00-location-hint1.html hotlinks images on snout.org.
Missing solutions:
- January 2012
- February 2012
- March 2012 is missing both the LOCATION PUZZLE and the solution
- April 2012
For months with two location puzzles (Portland + Seattle), I only captured the Portland variant.
- The smallest “interesting” unit of data is the puzzle.
- The puzzle has, at minimum:
- A title.
- Content: freeform text and/or a linked [typically PDF] file.
- A solution: freeform text and/or a linked [typically PDF] file.
- It can also have any number of hints (including zero). Any of these can also be freeform text and/or a linked file. In theory, the website could render these as progressive clues.
The Location Puzzle builds on the normal puzzle by including an official answer word.
The Month is the largest unit of data in this format. It includes:
- Metadata — year, month, title, icon image.
- Freeform text, typically used to introduce the theme, include an author bio, and so on.
- A location puzzle.
- One or more puzzles.
- Optionally, an `allpuzzles.pdf`.
- Optionally, an answer sheet.
- Optionally, an answer sheet with solutions filled in.
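The data model described above can be summarized as a few Python dataclasses. This is a sketch for orientation only: the class and field names are hypothetical and are not taken from `month.dtd` or from the generator's code.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Content:
    """Freeform Markdown text and/or a linked file (typically a PDF)."""
    text: Optional[str] = None
    href: Optional[str] = None


@dataclass
class Puzzle:
    title: str
    content: Content
    solution: Content
    # Zero or more hints, each of which is text and/or a linked file.
    hints: list[Content] = field(default_factory=list)


@dataclass
class LocationPuzzle(Puzzle):
    answer: str = ""  # the official answer word


@dataclass
class Month:
    year: int
    month: int
    title: str
    icon: Optional[str]
    notes: str  # freeform text: theme, author bio, and so on
    location: LocationPuzzle
    puzzles: list[Puzzle]
    all_puzzles: Optional[str] = None  # e.g. allpuzzles.pdf
    answer_sheet: Optional[str] = None
    answer_sheet_solved: Optional[str] = None
```

Making `LocationPuzzle` a subclass mirrors the statement that it builds on the normal puzzle by adding an answer word.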
As an example application to demonstrate loading and processing the monthly puzzle data, we've included a static site generator. This Python application loads up the monthly XMLs, copies the puzzles over, then generates a website of static HTML files, suitable for browsing locally or hosting on any web server.
Run the following command to generate a puzzle archive website. It assumes you have Python 3 and virtualenv installed. Virtualenv is used to install dependencies, such as the Markdown generator.
./gen/sh
The root of the static website will then be at static/index.html. You can upload that folder to a web server or host it on an S3 bucket.
Highlights of the object model include:
- `Years`: Effectively a dictionary that maps a numeric year to a `Months` object.
- `Months`: Effectively a dictionary of all (available) months in a given year, each mapping to a `Month` object.
- `Month`: This is the equivalent of a `month.xml`, with title, notes, location puzzle, and array of puzzles.
- `Puzzle`: This is a puzzle object.
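To make the year-to-months mapping concrete, the sketch below builds a plain nested dictionary from the `YYYY/MM` folder layout. It only approximates the idea behind `Years` and `Months`; the function name `index_months` is hypothetical, and the real classes hold parsed `Month` objects rather than file paths.

```python
from pathlib import Path

MONTH_GLOB = "[0-9][0-9][0-9][0-9]/[0-9][0-9]/month.xml"


def index_months(root: Path) -> dict[int, dict[int, Path]]:
    """Map year -> month -> path of that month's month.xml."""
    years: dict[int, dict[int, Path]] = {}
    for xml_path in sorted(root.glob(MONTH_GLOB)):
        year = int(xml_path.parent.parent.name)
        month = int(xml_path.parent.name)
        # Only months that actually exist on disk get an entry.
        years.setdefault(year, {})[month] = xml_path
    return years
```

With this shape, `index_months(Path("."))[2015][6]` would be the path to `2015/06/month.xml`, if that month exists.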
The templating happens in the Template object. This is a home-grown templating system that replaces {{keywords}} with snippets of HTML. The template files are:
- `years.html`: The root page, listing all years and months.
- `month.html`: The month page, listing location puzzle and the full puzzle set.
- `location.html`: The location puzzle page, with progressive hints.
- `location-solution.html`: Rarely used, but some location puzzles list a solution directly in the HTML vs. as a separate file.
- `hints.html`: A page for displaying progressive hints for a single puzzle.
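The `{{keywords}}` substitution can be illustrated with a few lines of Python. This is a minimal sketch of the general technique, not the actual `Template` object: the `render` function and its choice to leave unknown keywords untouched are assumptions made for the example.

```python
import re

# Matches {{keyword}}, allowing optional spaces inside the braces.
_KEYWORD = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, values: dict[str, str]) -> str:
    """Replace each {{keyword}} with its HTML snippet.

    Keywords with no entry in `values` are left in place so that
    missing substitutions are easy to spot in the output.
    """
    return _KEYWORD.sub(lambda m: values.get(m.group(1), m.group(0)), template)
```

For example, `render("<h1>{{title}}</h1>", {"title": "June"})` yields `<h1>June</h1>`.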
Everything in the assets folder also gets copied over. At this time, it is just the CSS.
Known Issues / TODO:
- The templating was thrown together quickly. It works, but isn't elegant. It could use something more akin to Ruby's ERB.
- There's a UTF-8 encoding problem with either reading `notes` from the XML or writing the HTML.
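The cause of the encoding problem has not been tracked down. One common source of this kind of bug is relying on the platform's default encoding, so a first step might be to make the encoding explicit on both sides, as in the sketch below. The helper names are hypothetical, and treating `notes` as an element name is an assumption about the DTD.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def read_element_text(xml_path: Path, tag: str) -> str:
    """Return the text of the first descendant element named `tag`.

    Parsing from bytes lets the XML parser honor the encoding named
    in the XML declaration instead of a platform default.
    """
    root = ET.fromstring(xml_path.read_bytes())
    el = root.find(f".//{tag}")
    return (el.text or "") if el is not None else ""


def write_html(out_path: Path, html: str) -> None:
    """Write HTML with an explicit encoding, never the platform default."""
    out_path.write_text(html, encoding="utf-8")
```

If the files turn out to be written correctly, the remaining suspect would be the page itself, which may need a `<meta charset="utf-8">` tag so browsers decode it as UTF-8.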



