
MVP of better management of Dictiōnāria + Cōdex automated update (necessary for future schedule/cron) #29

Open
fititnt opened this issue Apr 16, 2022 · 7 comments
Labels
praeparatio-ex-codex praeparātiō ex cōdex; (related to) preparation of book (of a collection of dictionaries)

Comments


fititnt commented Apr 16, 2022

Related


Since we already have several dictionaries (some more complete than others), it's getting complicated to call them manually. The fetching of the Wikidata Q labels in particular is prone to timeouts (it varies by hour of the day), so the thing needs to deal with remote calls failing and still retry after some delay, or eventually give up and try again hours later.

For the sake of this Minimal Viable Product, the idea is to at least start using 1603:1:1 as the starting point to know all available dictionaries, then invoke them one by one instead of adding them directly to shell scripts.
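
Just to illustrate the retry behaviour described above (a minimal sketch, not the actual implementation; the function name and delays are made up):

import time
import urllib.error
import urllib.request


def fetch_with_retry(url, attempts=3, delay_seconds=60):
    """Try a remote call a few times with a delay between attempts.

    Returns the response body as bytes, or None so the caller can give up
    for now and reschedule the job some hours later.
    """
    for attempt in range(1, attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=120) as response:
                return response.read()
        except (urllib.error.URLError, TimeoutError):
            if attempt == attempts:
                return None  # give up; try again hours later
            time.sleep(delay_seconds * attempt)  # increasing delay between retries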


fititnt commented Apr 16, 2022

A few weeks ago a column for tags was added to 1603:1:1. Since the cron jobs could evolve over time and also allow more complex rules, the CLI that outputs the results could also allow pre-filtering based on such tags.

This would somewhat require more options on the CLI, but at the same time allow it to be more flexible. For example, we have different tags for what is considered "public" (something can be both for internal and public use, like 1603:1:51) and for what uses Wikidata Q. The first case is a candidate for updating a release on publishing channels (like the CDN, but it could be CKAN or something else). The second tag is a hint that such a group of dictionaries actually requires updating translations from Wikidata Q (which means more downloads).
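
A rough sketch of what such a tag pre-filter could look like, assuming each 1603:1:1 row exposes its ix_n1603ia tags as a list of strings (the field and tag names here are only examples):

def filter_by_tags(dictionaria, required_tags):
    """Keep only the groups of dictionaries that carry all the required tags."""
    return [
        item for item in dictionaria
        if set(required_tags).issubset(item.get('tags', []))
    ]

# Hypothetical usage: candidates for publishing channels vs. groups that
# need Wikidata Q label updates.
# publicum = filter_by_tags(rows_1603_1_1, ['publicum'])
# wikiq = filter_by_tags(rows_1603_1_1, ['wikiq'])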

Something to obviously implement is a way to check if the dictionaries were recently published (maybe defaulting to around 7 days?) based on the previously saved status, and to consider this when deciding if a re-run is needed. This could distribute the jobs. However, such a feature obviously would need to be ignored when running as a local test or when humans require an update despite the default rules.
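
A minimal sketch of that recency check, assuming the previously saved status keeps an ISO 8601 timestamp (with timezone) per group; the key names and the force flag are hypothetical:

from datetime import datetime, timedelta, timezone


def needs_rerun(saved_status, group_code, max_age_days=7, force=False):
    """Decide if a group of dictionaries should be re-processed.

    force=True covers local testing or a human asking for an update
    despite the default rules.
    """
    if force:
        return True
    last_published = saved_status.get(group_code, {}).get('last_published')
    if last_published is None:
        return True  # never published before
    age = datetime.now(timezone.utc) - datetime.fromisoformat(last_published)
    return age > timedelta(days=max_age_days)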

Note: a hard-coded filter, for now, only tries to process what on 1603:1:1 has at least one tag. At the moment not all dictionaries which already have content have such tags.

Screenshot from 2022-04-16 19-30-34

fititnt added a commit that referenced this issue Apr 16, 2022
fititnt added a commit that referenced this issue Apr 17, 2022

fititnt commented Apr 17, 2022

Ok. The first full drill was tested yesterday. Beyond loading groups of dictionaries from the main table [1603:1:1], the only filter implemented was --quaero-ix_n1603ia='({publicum}>=11)'.

Some small bugs. Some of them are related to the old strategy (only act if the file is missing on disk), which was too primitive to scale at this point.

Notable to do's

1 configurable limit of cron jobs to return

Depending on the context, all (or a very large number of) dictionaries could be marked for immediate update. This is a recipe for things to go wrong. For example:

  • Increased risk of taking more than one hour (which would likely be the ideal interval to set up the cron jobs)
  • It could definitely upset the Wikidata SPARQL backends
  • Any other request would likely be rate limited (maybe even GSheets, if not downloading a single XLSX)

One way to help with this is to at least make the maximum number of jobs to return configurable with --ex-opus-tempora(...)

2 configurable sorting of cron jobs (generic)

The reasoning is the same as in the previous point.

Currently the default sorting uses the Numerordinatio order of the Cōdex, but it makes sense to expose the ordering through additional CLI parameters.
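
A sketch of how points 1 and 2 could fit together (the parameter names here are illustrative, not the real CLI flags):

def select_jobs(candidates, max_jobs=None, sort_key=None):
    """Sort candidate jobs (default: Numerordinatio order of the Cōdex)
    and cap how many are returned for a single cron run."""
    ordered = sorted(candidates, key=sort_key or (lambda item: item['numerordinatio']))
    return ordered if max_jobs is None else ordered[:max_jobs]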

3 opinionated sorting of cron jobs to deal with failures

The previous topic may be too complicated to generalize, so this point tends to be better handled on the programming side.

All the other orderings are likely to keep suggesting jobs that failed recently, so the queue could get stuck.

We also need to consider cases where the failure is not the servers, but a misconfiguration of the reference tables (edited by humans), which can cause weird bugs (like generating an invalid SPARQL query); the cron job manager could then give up on all the other work thinking it's a busy hour, while the issue is specific to one place.

This doesn't mean pushing failed jobs to the very end of the list, but they certainly shouldn't be the first ones.
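
One possible opinionated ordering, sketched here only to make the idea concrete: penalize jobs by their recent consecutive failures (with a cap, so they are pushed back but not to the very end forever), keeping the Numerordinatio order as a tie-breaker:

def opinionated_order(candidates):
    """Recently failing jobs are retried later in the run, not first,
    but a cap on the penalty keeps them from sinking to the very end."""
    return sorted(
        candidates,
        key=lambda item: (
            min(item.get('consecutive_failures', 0), 3),
            item['numerordinatio'],
        ),
    )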

4 requisites to decide if Wikidata Q needs an update: CRC of Wikidata Q codes per focused group of dictionaries + CRC of concepts from 1603:1:51 + default time after which Wikidata Q is assumed stale

The Wikidata Q fetching is already quite intensive. We didn't implement more than labels, so it is realistic to think it could easily get much, much more intense with other data (such as, at least, the alternative labels).

Beyond the time of the last successful run, there are additional reasons to consider the data immediately stale (which is relevant when running locally): the concepts of the focused group of dictionaries changed their Wikidata Q codes or... the number of languages on 1603:1:51 changed.
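
A sketch of that staleness decision, using SHA-256 digests in place of CRCs (the saved field names are hypothetical):

import hashlib


def _digest(items):
    """Stable digest of a collection of codes (order-independent)."""
    return hashlib.sha256('\n'.join(sorted(items)).encode('utf-8')).hexdigest()


def wikidata_is_stale(saved, group_qcodes, linguae_1603_1_51, too_old=False):
    """Stale if the group's Wikidata Q codes changed, the languages on
    1603:1:51 changed, or the default maximum age was already exceeded."""
    return (
        too_old
        or saved.get('crc_qcodes') != _digest(group_qcodes)
        or saved.get('crc_linguae') != _digest(linguae_1603_1_51)
    )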

5 define some defaults to allow weekly updates

This can't be exactly 7 days, because the previous runs take a while to finish (from 1 min to 6 min; likely more, especially if we add more PDF versions), so an exact 7-day threshold would always push the work to the next run.

However, the bigger reasons would be either the network (like timeout errors on Wikidata) or that the run a week ago was so full it would need to run again.

Maybe 6 days?

6 define some default value after which 100% of the data is assumed stale (then, after downloads, maybe check hashes); maybe 14 days? 30 days?

This actually should depend on several other factors, but it could be used as a last resort (like if all the other checks failed).
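
Putting points 5 and 6 together as a sketch (the exact numbers of days are the open question above):

from datetime import timedelta

# Assumed defaults; both values are still being discussed above.
REFRESH_AFTER = timedelta(days=6)      # "weekly" update, with slack for slow runs
HARD_STALE_AFTER = timedelta(days=30)  # last resort: assume 100% of the data is stale


def refresh_level(age):
    """Return 'fresh', 'refresh' (normal weekly update) or 'stale'
    (re-download everything, then maybe compare hashes of the downloads)."""
    if age >= HARD_STALE_AFTER:
        return 'stale'
    if age >= REFRESH_AFTER:
        return 'refresh'
    return 'fresh'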

Potential to do's (may be micro optimization)

1 Predictable affinity for days of the week to run

From time to time, all groups of dictionaries could end up being regenerated on nearly the same day, but the ideal would be for the natural tendency to push them toward a more predictable, deterministic schedule. Without this, one day of the week would always be prone to overloading the servers with too many requests.

Maybe this could use some pattern on the Numerordinatio itself to avoid a human decision.
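
A sketch of how a pattern on the Numerordinatio could pick a deterministic weekday without any human decision (purely illustrative):

import zlib


def preferred_weekday(numerordinatio):
    """Map a code such as '1603:63:101' to a stable weekday (0=Monday .. 6=Sunday),
    spreading the groups of dictionaries across the week."""
    return zlib.crc32(numerordinatio.encode('utf-8')) % 7

# Hypothetical usage: only enqueue a group when
# preferred_weekday(code) == datetime.date.today().weekday()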

2 notify if failed after many retry attempts

Things are expected to fail quite often, which means notifying humans without trying a few more times would produce several near false positives. This needs tuning (likely over time).


fititnt commented Apr 17, 2022

Okay. We have at least 17 groups of dictionaries at the moment (already not counting the ones which are not labeled with ix_n1603ia). Some are still using the old syntax (which still works) but miss some features of newer Cōdex versions.

Some features of this issue may be left for later (when it becomes more viable to do full automation on some remote worker), but most of the hard parts will already have been done earlier.

The temporary alternative to a smarter scheduler: full list in random order + a limit on the number of results

Full command
1603_1.py --ex-opere-temporibus='cdn' --quaero-ix_n1603ia='({publicum}>=1)' --in-ordinem=chaos --in-limitem=2

Both because I'm testing bugs and to take the chance to start updating dictionaries that are not currently being worked on, we're just doing what that title says.

Another reason to keep this strategy for some time is that the library is so big (and we need to process all items to get a full picture) that we're adding more metadata to the 1603.cdn.statum.yml. If a new feature is implemented too soon, it would require waiting every time for the full thing to work (which would take hours).

The need to have a better general view (off topic for this issue)

The areas of the dictionaries are so different from each other that it is unlikely someone looking at the full index would get a focused page on what interests them. But this is essential complexity, because HXLTM was designed to allow totally different types of information to be compiled, including with annexes (such as images), which don't make a lot of sense in most dictionaries people are aware of (but do exist in medical atlases).

However, even without the intent of creating dedicated pages for some topics, we can mitigate a bit the chaos of people simply not going to the current index page at HXL-CPLP-Vocab_Auxilium-Humanitarium-API/1603_1_1 https://docs.google.com/spreadsheets/d/1ih3ouvx_n8W5ntNcYBqoyZ2NRMdaA0LRg5F9mGriZm4/edit#gid=2095477004


fititnt commented Apr 18, 2022

We're having issues with zeroes on XLSX files, the same issue as this comment: wireservice/csvkit#639 (comment).

We're using in2csv to extract the HXLated versions from a single XLSX file with everything, instead of downloading them one by one with hxltmcli (which uses libhxl-python and works perfectly with remote GSheets and local CSVs; XLSX, however, likely shares a common issue with other tools because of package dependencies, and has some edge cases).

For some time the HXLTM CSV files may have integers like "10" showing up as values such as "10.0".
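
Until that is fixed upstream, one possible post-processing workaround (a sketch, not part of the current pipeline) is to rewrite float-looking integers such as "10.0" back to "10":

import csv


def normalize_integer_floats(path_in, path_out):
    """Rewrite cells like '10.0' as '10'; leave real decimals untouched."""
    with open(path_in, newline='', encoding='utf-8') as source, \
            open(path_out, 'w', newline='', encoding='utf-8') as target:
        writer = csv.writer(target)
        for row in csv.reader(source):
            writer.writerow([
                cell[:-2] if cell.endswith('.0') and cell[:-2].isdigit() else cell
                for cell in row
            ])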


fititnt commented Apr 28, 2022

Hmm... I think after some refactoring, the empty columns stopped being removed from the final .no11.tm.hxl.csv files. We need some post-processing just in case.

fititnt@bravo:/workspace/git/EticaAI/multilingual-lexicography-automation/officinam$ frictionless validate 1603/63/101/1603_63_101.no11.tm.hxl.csv
# -------
# invalid: 1603/63/101/1603_63_101.no11.tm.hxl.csv
# -------

====  =====  ===============  =================================================================================================================
row   field  code             message                                                                                                          
====  =====  ===============  =================================================================================================================
None     10  blank-label      Label in the header in field at position "10" is blank                                                           
None     11  blank-label      Label in the header in field at position "11" is blank                                                           
None     12  blank-label      Label in the header in field at position "12" is blank                                                           
None     13  blank-label      Label in the header in field at position "13" is blank                                                           
None     14  blank-label      Label in the header in field at position "14" is blank                                                           
None     15  blank-label      Label in the header in field at position "15" is blank                                                           
None     16  blank-label      Label in the header in field at position "16" is blank                                                           
None     17  blank-label      Label in the header in field at position "17" is blank                                                           
None     18  blank-label      Label in the header in field at position "18" is blank                                                           
None     19  blank-label      Label in the header in field at position "19" is blank                                                           
None     20  blank-label      Label in the header in field at position "20" is blank                                                           
None     21  blank-label      Label in the header in field at position "21" is blank                                                           
None     22  duplicate-label  Label "#item+rem+i_qcc+is_zxxx+ix_wikiq" in the header at position "22" is duplicated to a label: at position "5"
====  =====  ===============  =================================================================================================================

The None 22 duplicate-label is human error. But I think with #35 we can start adding more decent data validation instead of just checking whether the CSV is valid with csvkit.


fititnt commented Apr 28, 2022

Hmmm... the reason for the issue (empty columns) is actually quite curious: when automatically extracting data from the XLSX, more columns than necessary are extracted.

In theory this could be edited by the user (by deleting the extra columns), but this would be too annoying to document, so let's automate it.
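
A sketch of the kind of post-processing meant here, dropping columns that are completely empty (blank label and no values); it is a simplified illustration, not the exact code that was added:

import csv


def drop_empty_columns(path_in, path_out):
    """Remove columns that have a blank label and no data at all."""
    with open(path_in, newline='', encoding='utf-8') as source:
        rows = list(csv.reader(source))
    if not rows:
        return
    keep = [
        index for index in range(len(rows[0]))
        if any(index < len(row) and row[index].strip() for row in rows)
    ]
    with open(path_out, 'w', newline='', encoding='utf-8') as target:
        csv.writer(target).writerows(
            [[row[index] for index in keep if index < len(row)] for row in rows]
        )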

Image with context on why it happens

Screenshot from 2022-04-27 23-09-40


fititnt commented Apr 28, 2022

Okay. Almost there. The last validation failure is a duplicated key column used to merge the Wikidata Q terms into the index of the dictionaries.

frictionless validate 1603/63/101/1603_63_101.no11.tm.hxl.csv
# -------
# invalid: 1603/63/101/1603_63_101.no11.tm.hxl.csv
# -------

====  =====  ===============  =================================================================================================================
row   field  code             message                                                                                                          
====  =====  ===============  =================================================================================================================
None     10  duplicate-label  Label "#item+rem+i_qcc+is_zxxx+ix_wikiq" in the header at position "10" is duplicated to a label: at position "5"
====  =====  ===============  =================================================================================================================

The problem (which needs a hotfix)

The hxlmerge CLI is already being pushed far beyond its tested strategies, so depending on the number of columns (I don't remember now exactly at which point, but after 100 languages it certainly starts to happen) it will discard one column and raise an error like this

ERROR (hxl.io): Skipping column(s) with malformed hashtag specs: #item+

This error is deterministic (always the same, as if some strategy decides how many columns to check). But the way the merging works, it is all in memory, so the merge operations may already have more bugs because of this earlier issue.

In this case the documented hxlmerge --replace (documentation: "Replace empty values in existing columns (when available) instead of adding new ones."), which is supposed to do exactly what we want, will actually not add any new language at all after so many columns, because of the ERROR (hxl.io) above.

Potential hotfix

A potential hotfix here is to create another temporary file, use a different name for the key column, and, after merging on the temporary file, discard the duplicated column. Not ideal, but considering the number of files and rewrites we're doing, pretty okay. It also would not break if this is fixed in the library (or, if it breaks, we would know and simply use --replace again).
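
A sketch of the clean-up half of that hotfix: after merging against a temporary file whose key column was given a different name, drop the now-duplicated key column from the result (the default hashtag below is the one from the validation output above; everything else is illustrative):

import csv


def drop_duplicated_key_column(path_in, path_out,
                               key_hashtag='#item+rem+i_qcc+is_zxxx+ix_wikiq'):
    """Keep only the first occurrence of the merge key column."""
    with open(path_in, newline='', encoding='utf-8') as source:
        rows = list(csv.reader(source))
    if not rows:
        return
    header = rows[0]
    if key_hashtag not in header:
        return  # nothing to do
    first = header.index(key_hashtag)
    keep = [i for i, label in enumerate(header) if label != key_hashtag or i == first]
    with open(path_out, 'w', newline='', encoding='utf-8') as target:
        csv.writer(target).writerows(
            [[row[i] for i in keep if i < len(row)] for row in rows]
        )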
