# TODO

scripts not assigned correctly in DHARMA_INSSIIv17p0i0227.

check dharmalekha.info/languages; we have "english", etc. that should not appear

handle /home/michael/dharma/texts/DHARMA_INSSIIv02p0i0001.xml (language display)

17

## search

REPR Dharma TG p. 17 (PDF 18)

and show or hide fields depending on whether they contain a match

ignore spaces+hyphens
add support for Levenshtein search

make the biblio searchable
only display fields with matches
I think we don't handle enjamb; to be checked

## other stuff

tfa-tamil-outside-TN-epigraphy: ~DHARMA_INSTamilOutsideTN10002.xml ; a bibliographic ref. to Pillai is missing
KanapathiPillai1962_01

hi rend=grantha should be unified with rendition, and bold should be applied only if the parent is in Tamil.


XXX In enrich.py, when we have a passage in grantha, it must be put in bold. This requires wrapping all such passages in <span>.

* biblio: DHARMA_INSTamilNadu01003 ; DHARMA_INSTamilNadu01009; GopinathaRao_1921-1922_01 ;

clicking on snippets in the results window must jump to the corresponding
passages in the document.

* add to the searched fields: bibliography (and others)

Languages and scripts are assigned recursively. For assigning
	languages, we ignore the div[@type='apparatus'], to
	keep things simple, because it does not explicitly indicate languages
	and scripts on <lem> and <rdg>. Languages should only be extracted from
	the other divs. We follow the basic inheritance rule: if an element does
	not have an @xml:lang, it is assigned the language assigned to its
	parent. But there is an exception: if the tag is "foreign" or
	div[@type='edition'] and does not have a @lang, then its lang is set to
	"und".
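
A minimal sketch of the inheritance rule described above, assuming the document is
walked with the standard `xml.etree.ElementTree` module. The names `local_name` and
`assign_languages` are hypothetical and are not taken from the app's code; scripts
are left out to keep it short.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def local_name(elem):
    # Strip a namespace prefix such as "{http://www.tei-c.org/ns/1.0}div".
    return elem.tag.rsplit("}", 1)[-1]

def assign_languages(elem, inherited="und", out=None):
    """Return a dict mapping each visited element to its language (hypothetical helper)."""
    if out is None:
        out = {}
    name = local_name(elem)
    if name == "div" and elem.get("type") == "apparatus":
        return out  # the apparatus is ignored entirely
    lang = elem.get(XML_LANG)
    if lang is None:
        # Exception to plain inheritance: these elements fall back to "und".
        resets = name == "foreign" or (name == "div" and elem.get("type") == "edition")
        lang = "und" if resets else inherited
    out[elem] = lang
    for child in elem:
        assign_languages(child, lang, out)
    return out

# Made-up sample document, only for illustration.
doc = ET.fromstring(
    '<body xml:lang="eng">'
    '<div type="edition"><p>a <foreign>b</foreign></p>'
    '<p xml:lang="tam-Latn">c</p></div>'
    '<div type="apparatus"><p>skipped</p></div>'
    '<div type="translation"><p>d <foreign>e</foreign></p></div>'
    '</body>'
)
langs = assign_languages(doc)
```

In this sample, the edition div and its first `<p>` end up as "und", the second `<p>`
keeps "tam-Latn", the translation inherits "eng" from the root, and nothing under the
apparatus is recorded.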


* deal with @lang
* deal with @editorial

!!! This is breaking http://localhost:8023/editorial-conventions

XXX
¶ If the tag is "foreign", "lem" or "rdg" and does not have an
	@xml:lang, we assume it is in some undetermined source language (as per
	the guide) and assign it a generic language named "source" and a script
	named "source" as well. These "source" values just mean "any source
	language" or "any source script", in contrast with the languages used
	in translations.

example: dharma_test.xml.hid

===

search: enjamb
search: snippets selection

===


put all @rendition=grantha in bold

for e.g. INSIIv04p0i0433; do not put in bold the parentheses, angle brackets,
braces, etc., nor the question and exclamation marks; in short, none of the
characters that we add.

arie: public files
arie-corpus: private files
put everything in the same repository and display everything

for bestow, for each translation, add the author of the translation (support
<div type="translation" source="bib:Hultzsch1895_01">).

support assignment of languages in DHARMA_CritEd* and DHARMA_DiplEd*.

---

need to deal properly with creation/deletion of repos. the only input parameter
should be the repos file. add a routine for this in change.py

should add extra commands to change.py to do more granular updates. because, at
startup, we start with project-documentation and thus reindex the whole catalog only to then update each repo again. we should skip the whole catalog reindexing at startup (and only in this circumstance.)

---

All main sections must be covered with divs, to avoid having a mix of para and div at the same level, and likewise recursively. We should have phantom divs for all divs of type edition, translation, etc., so that all these main sections contain at least one div (this will make it easier to compute the size of a div, etc.)

In the internal repr, we should avoid hardcoding the names of the divs (edition, translation, etc.), all the more so if we do not need to know what they contain. Because it is a pain in the code that parses them, and because we must also support other types of div for bestow. It would be better to have only div as an element.


Adjust the @n mechanism of milestones depending on how para and verse-line elements are filled.

Allow nested divs, and check that the result is correct. Allow all types of div, not only text part.

We should have phantom lines for the physical too, in search.


---

XXX do commit several times (or maybe savepoint+release) when rebuilding the
catalog.

abbreviate the summary in the results table to a single <p> (add [...] in place of the remaining ones)

## internal repr

must allow (at least) <verse> within a <quote>. but note that for milestones we
assume there is no overlap.

DHARMA_INSCIC00137

<quote>
            <verse>
                <verse-line break="true">āsādya <!--space-->
                    <span class="bold">śaktiṁ</span><!--space--> vivudhopanītāṁ māheśvarīṁ jñānamayīm amoghām</verse-line>
                <verse-line break="true">
                    <span class="bold">kumāra</span>bhāve vijit<span class="bold">āri</span>varggo yo dīpayām āsa mahendralakṣmīm ||</verse-line>
            </verse>After attaining the Power (or: weapon) of Maheśvara (Śiva) that consists in Knowledge, that is never failing (viz. after attaining initiation) [and that has been] transmitted by the gods, being in youth (or: as crown-prince, or: as Kumāra, i.e. Skanda) one whose enemies (or: passions) were conquered, he caused the glory of Mahendravarman to shine.</quote>


## Misc

ultimately, we should remove the bs4 dependency, but for this we need an HTML
parser _and also_ a serialization method that does the appropriate thing for
self-closing tags.

## XML Schema

should do an autoreplace from my code to the xml schema; for at least:

- prosody @met
- people; the "part:xxxx" stuff
- language codes @lang
- script age + maturity @rendition
- and for bib:xxx, we could perform http requests in schematron, but might be too slow

for this we need to access the app's repo. can either do the transform within the app's repo; or within project-documentation.


## Database

Need to have some locking logic when reading from a github repo. For when we
want to do manual maintenance while the app is running.

Need to make sure that all the files we need for display, etc., are stored in
the db. Currently, this is not the case. Not easy to guarantee. unless we have a
reliable access method. Should have a unique mechanism for storing files.

[after we're done with the catalog]. in the db, stop using "und" as default for languages that don't have one, we should have an empty array in such cases. the "empty" value for everything should be null. and also hide "Language: Undetermined" when no language found in div type edition. be careful that we are using languages in search.

¶ need a protocol for bootstrapping the db.

¶ support adding/deleting/renaming repos; should have a single entry point for
this. and should document it somewhere.

	delete from repos where repo='repo-test';
	delete from files where repo='repo-test';
	delete from documents where repo='repo-test';
	delete from documents_index where repo='repo-test'; **but lowercasing!**

should probably use triggers for dealing with this. need to remove all files related to a repo.
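
A possible shape for the single entry point, as a sketch only: it assumes an SQLite
database reached through the standard `sqlite3` module, and it reads the lowercasing
note above as meaning that `documents_index` stores the repo name lowercased. Both
points are assumptions, and `delete_repo` is a hypothetical name.

```python
import sqlite3

def delete_repo(db: sqlite3.Connection, repo: str) -> None:
    """Remove every row that belongs to `repo` in one transaction (hypothetical helper)."""
    with db:  # commits on success, rolls back if any statement fails
        db.execute("delete from repos where repo = ?", (repo,))
        db.execute("delete from files where repo = ?", (repo,))
        db.execute("delete from documents where repo = ?", (repo,))
        # Assumption: the index table holds the lowercased repo name.
        db.execute("delete from documents_index where repo = ?", (repo.lower(),))
```

Running the four statements in one transaction avoids leaving a half-deleted repo
behind if one of them fails; triggers on `repos` would achieve the same thing inside
the database itself.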

find a way to merge the parallels db with the main one; it should be updated
every week or so


## Arlo

transformation of critical editions


## Manu

For units in the bibliography, explicitly use the singular and the plural? `<citedRange unit="page|pages">`. Manu and I are in favor, Arlo and Daniel not really.

For the metadata display, Manu would like this. metadata: short display: language, script, date, summary; long display: have a button to display the full metadata.

Transliteration button to be added for Tamil

Copy/paste into Word does not work very well for Manu (rather than using classes, we must use `<b>` elements, etc.)

in biblio, move sharedocs links to notes; requires to have
* a mechanism for updating the biblio
* a mechanism for adding notes and linking them to an existing entry

attr to `<p>` for marking up blessings/curses? @ana? find something generic for
all custom stuff (additions to the egd).

prosodic patterns; be careful about placement of guillemets and footnote nums.

manu: for the metadata display, have an expand/unexpand button as for the apparatus
start thinking about faceted search.


## ODD schema

need to fix the datatype mess in rng schema; distinguish between attrs that
accept a single value and the others.

deal with div with @rendition class:grantha, should be put in bold. need first
to deal with

=================

elements that can be ignored and should be removed eventually:

prefixDef, listPrefixDef, schemaRef
but first verify that nothing depends on them in xslt files
and also make sure they do not appear in templates

//TEI/teiHeader/fileDesc/publicationStmt
//TEI/teiHeader/fileDesc/sourceDesc/msDesc/msIdentifier
//TEI/teiHeader/encodingDesc
//TEI/teiHeader/revisionDesc

revisionDesc can always be ignored

=================

for languages in display, all those of div type=edition + script
scripts to be handled
"Tamil in Tamil Script; Sanskrit in Grantha script":


## Parsing

We must do NFC normalization at some point; when? not before storing the
file in the db (might need the original later on for e.g. hashing); simplest
would be to do that just before parsing the file, but this will mess up
column numbers. still should do it, because we don't refer to columns for
now and because all string comparisons will be messed up otherwise. do the
final normalization step (new lines, etc.) when outputting documents.
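
A small sketch of that ordering, assuming files are stored as UTF-8 bytes and hashed
with SHA-256 (the hash choice and the name `prepare_for_parsing` are illustrative
assumptions): the digest is taken from the untouched bytes, and only the text handed
to the parser is normalized.

```python
import hashlib
import unicodedata

def prepare_for_parsing(raw: bytes):
    """Return (digest of the stored bytes, NFC-normalized text to parse)."""
    digest = hashlib.sha256(raw).hexdigest()  # computed on the original file
    text = unicodedata.normalize("NFC", raw.decode("utf-8"))
    return digest, text

# "a" followed by U+0304 COMBINING MACRON composes to U+0101 under NFC,
# which is why column numbers can shift after normalization.
digest, text = prepare_for_parsing("a\u0304".encode("utf-8"))
```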

OTOH, we shouldn't do anything with column numbers. This is too inconvenient.
Idem for byte offsets.


## Refactoring

shouldn't store separately the app from all the data files, because we need the
data files to be present to do anything with the db. the app code is useless
on its own. add projdoc as a submodule? can we still do a git pull in the app
repo without having git complain that the repo has been modified?
everything should be in the same repo. maybe use a reload command that reloads
data files _but not the code_?

in fact we have 2 build levels: fetch files from the fs, and parse the
documents. should do the minimum whenever possible.


## Validation

Script maturity is for use only with the class "Brahmi and derivatives" (and
its subcategories); for any other script classes it is not optional but
"forbidden". For Brahmi, it is mandatory. Amend rules accordingly.

deal with uniqueness of phys elements:
* pb and pagelike milestones must have unique @n in the whole div/edition

Check for multiple uses of the same bib entry as in https://dharman.in/display/DHARMA_INSBengalCharters00050#bibliography (disallow?)

actually use dharma.rng, only in /texts for now, afterwards distribute it;


## Display

Parse urls and make them clickable.

Some stuff to implement in the display of the apparatus: https://github.com/erc-dharma/project-documentation/issues/282#issuecomment-2444650894

for invalid inscriptions, show the xml, but without formatting tags, etc.
need to convert the error (line,column) to an offset
use xmlparser.ErrorByteIndex instead of doing a manual conversion
https://docs.python.org/3/library/pyexpat.html
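
A sketch of how the offset could be obtained, using the `ErrorByteIndex` attribute
that the pyexpat parser object exposes. The function name `first_error_offset` is
hypothetical.

```python
from xml.parsers import expat

def first_error_offset(data: bytes):
    """Return the byte offset of the first well-formedness error, or None."""
    parser = expat.ParserCreate()
    try:
        parser.Parse(data, True)  # True: this is the final chunk of input
    except expat.ExpatError:
        # Byte index into `data`, so no (line, column) conversion is needed.
        return parser.ErrorByteIndex
    return None
```

The returned offset can be used to split the raw bytes into the part before and after
the error when showing the unformatted XML.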

deal with rendition and xml:lang, which must cover the whole text in div type edition.
must be dealt with in tree.py

manu: In physical display, do not display editorial hyphens, but do show them
in logical display. For this to work, need to tag languages. XXX
hyphens between words? or at the end of a line?

manu: grantha translit with button several states translit methods

* Sort out languages tagging; assign language categories (lang of
	the ed. or of the rest, main or secondary lang; probably not
	useful to keep track of `<foreign>`)

add tooltip for expan in `<abbr><expan>` in phys disp; but need to know how to do
that

div rendition="class:38768 maturity:83213" (grantha) must be put in bold, not only hi rend=grantha; for SII0501358
likewise for `<lg rendition=...>` in Tiruvavatuturai01

don't think we are formatting abbr/expan as intended

should use the lang attribute in html to tag appropriately xml elements
with an @xml:lang.


## Problems with @n.

The repetitive scheme is unclear and unpredictable. Should have a clearer
convention.


## XML source display

Put the tab button in the sidebar, call it "view source". Should allow resizing
the sidebar, too. when it is completely closed, what to display?

* when displaying the sidebar, add toc headings for navigating the xml: header,
	edition, translation, etc.
* Need to have a pretty-print func that preserves space and
	doesn't add unnecessary space.
* Also add line numbers
* Style the thing with a color for comments and tags, maybe different colors
	for milestones and logical elements.
* Add error messages with popups in the XML.


## Website

in the "texts errors" page, have a column with the severity level

for https://github.com/erc-dharma/project-documentation/issues/266#issue-2207593274
don't use href in in-page links, it's confusing; use data-href instead; and this
would allow us to distinguish page-internal links from the others.

when a file is completely invalid, show the raw xml in the display (not pretty-printed).

do a redirect /foo/ -> /foo in nginx _but_ watch out with the /zotero-proxy
stuff.

deal with elem flashing:
cumulative timeout for flashing https://developer.mozilla.org/fr/docs/Web/API/setTimeout
repr here https://stackoverflow.com/questions/29017379/how-to-make-fadeout-effect-with-pure-javascript

Make the top menu sticky on pc? no. Add a button to show/hide the left sidebar (on
pc); where? left of the top menu downward-pointing > thing. The left sidebar
shouldn't pop when we arrive at the page footer, how? The left sidebar should
be resizable, but then dimensions need to be saved as a cookie because reloading
the page will mess up the size.

Generate a site map (wget?).

Use the w3c validator API https://validator.w3.org/docs/api.html with random
urls to detect issues; submit URLs like so:

	https://validator.w3.org/nu/?out=json&doc=$URL
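
A sketch of building such request URLs for a random sample of pages, using only the
standard library. The helper names (`validator_url`, `pick_urls`) and the sample page
list are hypothetical; actually fetching the URLs and reading the JSON is left out.

```python
import random
from urllib.parse import urlencode

VALIDATOR = "https://validator.w3.org/nu/"

def validator_url(page_url: str) -> str:
    """Build the validator request URL, percent-encoding the page URL."""
    return VALIDATOR + "?" + urlencode({"out": "json", "doc": page_url})

def pick_urls(urls, count, seed=None):
    """Pick up to `count` distinct URLs at random (seed only for reproducibility)."""
    rng = random.Random(seed)
    urls = list(urls)
    return rng.sample(urls, min(count, len(urls)))

# Illustrative page list; a real run would take it from the site map.
pages = ["https://dharmalekha.info/languages", "https://dharmalekha.info/"]
requests_to_send = [validator_url(u) for u in pick_urls(pages, 1, seed=0)]
```

Encoding the `doc` parameter matters because page URLs with their own query strings
would otherwise be cut at the first `&`.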

Add a "status" search field to catalog to filter by error status.

add global table of gaiji symbols actually found in inscriptions.
we will add links to inscriptions within this table so that we can find which
inscriptions, etc. contain which symbols.


## Parallels

Allow quoting part of the input with "..." to force an exact substring match.
Still keep using the same similarity measure. When there are several quoted
passages, allow overlaps viz. "foo"f"fo" match "foo". Or not? require the
matched strings to occur in the same order? In fact having a second field for
filtering seems better.

This should be linked to the catalog search features, but we must first
integrate with the main db.


## Duplicate file idents and zotero idents

would be convenient to have position-independent files, viz. assume file
basenames are unique AND also extension-independent files, to allow people to
move files around.

find some way to report non unique files ; could use an intermediate table that
stores duplicates, like for zotero; for duplicate files in the same repo, we are
sure there is a problem and we can complain early on while processing the repo
itself (to whom, however?), but we don't need to do it early; but if the files
are in distinct repos, we cannot tell whether the file is being moved or
anything, because there is no global commit across all repos and the order of
operations is not guaranteed.

in any case, we must preserve the fact that a given ident always corresponds to
exactly one elem; so if we have a duplicate ident, do not use this
duplicate ident, instead generate new ones and delete these when appropriate.