I initially wanted to do validation and display together, with a real parser, possibly bound to RelaxNG (its "annotation" feature could be used for this purpose). But that is a lot of work. Furthermore, we need to generate a useful display even when texts are not valid: so many files are invalid that being too strict would leave us with very little, and in any case not being able to display a text at all because of a single error would be super annoying.
Initially, the display was generated by converting TEI files to a sequence of instructions, together with their operands, stored in an array. This was quick and easy, but it turned out to be insufficient for new requirements, in particular the need to produce three displays of an edition (physical, logical, full). In practice, it is necessary to parse the TEI into some kind of tree and to modify the structure of this tree to make it amenable to conversion to HTML and other formats. We thus now use an intermediary XML representation, called "internal", for this.
We have transforms for various tasks:
tei2internal
internal2internal
internal2search
internal2html
tei2internal must be run first. It converts TEI-encoded XML files to a simpler
XML representation from which we derive everything else. We call this
representation "internal". After this, internal2internal must be called. It
performs various operations to validate or fix the internal XML encoding. In
particular, it generates three encodings of the div[@type='edition']: physical,
logical, full.
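As a sketch, the transform chain could be driven like this. The function names match the transforms listed above, but the bodies are hypothetical placeholders; the real transforms do much more:

```python
import xml.etree.ElementTree as ET

def tei2internal(tree):
    # hypothetical placeholder: map TEI elements to the simpler
    # "internal" vocabulary (document, edition, para, ...)
    root = ET.Element("document")
    root.append(ET.Element("edition"))
    return ET.ElementTree(root)

def internal2internal(tree):
    # hypothetical placeholder: validate/fix the internal tree and derive
    # the three encodings of div[@type='edition'] (physical, logical, full)
    return tree

def run_pipeline(tei_tree):
    # tei2internal must run first, then internal2internal
    return internal2internal(tei2internal(tei_tree))
```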
At some point we will want to produce PDF or Word files or plain text files. For
this, we should use pandoc, but we need to use pandoc's data model. See
https://boisgera.github.io/pandoc/document and more importantly:
https://hackage.haskell.org/package/pandoc-types-1.23.1/docs/Text-Pandoc-Definition.html;
our internal representation must be close enough to this one to allow us to use
pandoc at some point.
==
Note that link/@href is not mandatory; when it is not given, we mark the link as invalid.
document: metadata_fields main_divisions
metadata_fields: identifier? repository? title? editor? summary? hand?
main_divisions: edition? apparatus? translation* commentary? bibliography?
edition: division
apparatus: division
translation: division
commentary: division
bibliography: division
div: division
division: head division_contents
para_like: para | verse | quote | dlist | list
milestone: npage | nline | ncell
inline: span | link | note | TEXT
note: para+
data fields: title, author, editor (these elements should not contain paragraphs)
divisions: summary, hand, edition, apparatus, translation, commentary, bibliography, div (all their child strings should be removed)
para-like: para, verse (>verse-line), head, quote
para containers: dlist (>(key, value)), list (>item), note
sub-paragraph divisions: item, key, value, verse-line (can only contain inline)
== structural errors in internal2internal
recursively iterate over block-only elements and wrap stray inlines in paras.
we have structural errors caused by inlines used in places where blocks are expected; this is already fixed for milestones.
if there is no milestone-accepting element, we should end up with just milestones, so put them in a newly created para.
but cases remain where wrapping is necessary; basically:
within divisions: wrap the inline children in a para.
within inlines (span, link, but not note): unwrap the offending block (and its child blocks).
within blocks: other blocks are forbidden. we need rules for nesting; basically, only divisions should hold block elements. otherwise, unwrap the outermost block element.
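The wrapping pass for divisions could look like the sketch below, using stdlib ElementTree. The tag sets are assumptions (the real lookup tables will be larger), and text nodes are ignored for brevity:

```python
import xml.etree.ElementTree as ET

BLOCK_ONLY = {"div", "edition", "summary"}            # assumption: divisions
INLINE = {"span", "link", "note", "npage", "nline", "ncell"}

def wrap_inlines(elem):
    """Recursively wrap runs of stray inline children of block-only
    elements in a newly created <para>."""
    if elem.tag in BLOCK_ONLY:
        new_children = []
        run = None  # current <para> collecting consecutive inlines
        for child in list(elem):
            if child.tag in INLINE:
                if run is None:
                    run = ET.Element("para")
                    new_children.append(run)
                run.append(child)
            else:
                run = None
                new_children.append(child)
        elem[:] = new_children
    for child in elem:
        wrap_inlines(child)
```

Consecutive stray inlines land in a single para, so a run of milestones with no milestone-accepting element ends up in one newly created para, as described above.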
div: head (block | div)+
head: inline+
block: para | verse | quote | list | dlist
verse: verse-head? verse-line+
verse-head: inline+
verse-line: inline+
para: inline+
quote: inline+
list: item+
dlist: (key value)+
inline: span | link | milestone | note | TEXT
milestone: npage | nline | ncell
note: restricted_block+
restricted_block: like a normal block but disallowing nested notes
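The content model above can be captured as a lookup table and checked with a simple walk. A sketch, with simplifications: no ordering or cardinality checks, and note is reduced to para children rather than the full restricted_block rule:

```python
import xml.etree.ElementTree as ET

MILESTONE = {"npage", "nline", "ncell"}
INLINE = {"span", "link", "note"} | MILESTONE

# Allowed children per element (simplified: no ordering/cardinality).
ALLOWED = {
    "div": {"head", "div", "para", "verse", "quote", "list", "dlist"},
    "head": INLINE,
    "verse": {"verse-head", "verse-line"},
    "verse-head": INLINE,
    "verse-line": INLINE,
    "para": INLINE,
    "quote": INLINE,
    "list": {"item"},
    "dlist": {"key", "value"},
    "note": {"para"},
    "item": INLINE, "key": INLINE, "value": INLINE,
    "span": INLINE, "link": INLINE,
}

def check(elem, errors=None):
    """Return a list of (parent, child) tag pairs violating the model."""
    errors = [] if errors is None else errors
    for child in elem:
        if child.tag not in ALLOWED.get(elem.tag, set()):
            errors.append((elem.tag, child.tag))
        check(child, errors)
    return errors
```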
we will need to review the metadata fields; not sure which ones to keep, it depends on how we do the search stuff.
=== TODO
also we might want to store db query results in the same "internal" structure, because the search system will need that info and we assume it doesn't know about the database structure. it should be possible to make the document searchable without accessing the db at all. but it's kind of stupid to have to parse/unparse the data just for display.
no. better to use two passes: in the second one, fetch stuff from the db, but try to avoid cascading.
===
find a coherent solution for spaces around lb, pb and milestone (and in fact all elements): delete them when appropriate, add them when needed. if we want to be able to fix whitespace-related issues in the original XML (like stray spaces between adjacent elements), we need to sort them out directly on the XML tree. but do not modify trees in the pipeline; this should not be part of the update process. still, do fix whitespace-related issues in the generated tree. element-nesting issues must also be fixed on the generated tree.
to simplify processing, build the tree first, then fix it. once the tree is constructed, validate it against a (strict) rng schema, for error checking. or maybe a peg grammar. generated documents should always validate, otherwise there's a bug, or we're not covering every use case.
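as an illustration of what the strict schema could look like, a RelaxNG compact fragment for the block grammar above (the restricted_block refinement forbidding nested notes is omitted for brevity, and note is simplified to para+):

```rnc
start = div
div = element div { head, (block | div)+ }
head = element head { inline+ }
block = para | verse | quote | list | dlist
para = element para { inline+ }
verse = element verse { verse-head?, verse-line+ }
verse-head = element verse-head { inline+ }
verse-line = element verse-line { inline+ }
quote = element quote { inline+ }
list = element list { item+ }
dlist = element dlist { (key, value)+ }
item = element item { inline+ }
key = element key { inline+ }
value = element value { inline+ }
inline = span | link | milestone | note | text
span = element span { inline+ }
link = element link { attribute href { text }?, inline+ }
milestone = element npage { empty }
          | element nline { empty }
          | element ncell { empty }
note = element note { para+ }
```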
===
the search system should be able to run completely independently from the update system, for testing. must have a language-agnostic interface (through SQL?). also maybe run as a separate process, but in this case it should have its own db, or have a predictable update pattern.
the mapping between search offsets and display offsets must be handled by the search system. this means that the search system should be able to decode the display representation, or at least understand what's relevant. firstly, it must understand the distinction between block-level elements and inline elements (all the others); this is necessary for highlighting to work.
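A sketch of the block/inline distinction at work: flatten the document into searchable text while recording, for each block element, its span of search offsets. A search hit at offset N can then be mapped back to its enclosing block for highlighting. The BLOCKS set is an assumption:

```python
import xml.etree.ElementTree as ET

BLOCKS = {"para", "head", "verse-line", "item", "key", "value"}  # assumption

def extract(elem, out=None, spans=None):
    """Flatten tree text; record (tag, start, end) search-offset
    spans for block elements. Inlines contribute text only."""
    if out is None:
        out, spans = [], []
    start = sum(len(s) for s in out)
    if elem.text:
        out.append(elem.text)
    for child in elem:
        extract(child, out, spans)
        if child.tail:
            out.append(child.tail)
    if elem.tag in BLOCKS:
        spans.append((elem.tag, start, sum(len(s) for s in out)))
        out.append("\n")  # block boundary in the search text
    return "".join(out), spans
```

The newline at each block boundary keeps matches from spanning two blocks; the spans list is what the search system would need in order to highlight within the display without understanding the full display representation.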
which set of elements/attributes should we have? the minimum necessary to generate both the display and the search representation. we should use a single tag for inline spans; that's probably simpler and more readable than nesting presentation-related tags (bold, italics, etc.). it's also simpler because, to produce the search representation, we will have to examine each element to see whether it's relevant, so better not to add too many. we must have a lookup table for inline elements and block elements.
to make copy-paste work for manu, a simpler method would be to describe in some file which CSS classes should produce bold, italics, etc., and use that info to insert extra presentational tags when generating the html. we could also have an "export to .docx" functionality and use pandoc for the conversion.
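The class-to-tag mapping could be as simple as the sketch below; the class names are hypothetical, and the real table would live in the config file mentioned above:

```python
# Hypothetical mapping from our CSS classes to presentational tags,
# so copy-pasted HTML keeps bold/italics without the stylesheet.
CLASS_TO_TAGS = {
    "title": ("b",),
    "foreign": ("i",),
    "emph": ("b", "i"),
}

def wrap_for_copy(css_class, html):
    """Wrap rendered inline HTML in the extra tags its class implies."""
    for tag in reversed(CLASS_TO_TAGS.get(css_class, ())):
        html = f"<{tag}>{html}</{tag}>"
    return html
```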