Description
A very natural need when developing iteratively against a large corpus is to experiment with convenient subsets of the collection, selected by certain metrics of interest.
I have become quite fond of the term "stratify", as employed by ML tools such as sklearn, so I would like to start calling these "stratified sandboxes".
In essence, cortex should provide a mechanism on each (corpus, service) report page to request a "stratified sandbox" from a finished service run. That mechanism is likely just a simple HTML form that lets you select among various types of stratification:
- by path fragments: try to provide an equal number of entries from each corpus segment (be it subdirectory, or filename globs, etc)
- by messages: try to provide an equal number of entries for each message `severity:category:what` class
- by count/frequency: both lower/upper bounds for the main stratification criteria. For example:
  - only 1 entry for each message registered
  - not more than 1,000 per class, stratified by path fragments
  - unlimited, for a specific message class
  - and so on
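To make the stratification options above concrete, here is a minimal in-memory sketch. It is only an illustration, not cortex code: `stratified_sample`, its parameters, and the sample entries are all hypothetical, and treating the lower bound as "skip strata with fewer entries than this" is one possible reading of the count/frequency option.

```python
import random
from collections import defaultdict

def stratified_sample(entries, key, min_per_stratum=0, max_per_stratum=None, seed=0):
    """Group entries by key(entry) and cap each group at max_per_stratum.

    Strata with fewer than min_per_stratum entries are skipped (an assumed
    interpretation of the lower bound). max_per_stratum=None means unlimited.
    """
    rng = random.Random(seed)  # fixed seed so a sandbox can be reproduced
    strata = defaultdict(list)
    for entry in entries:
        strata[key(entry)].append(entry)

    sample = []
    for stratum in sorted(strata, key=repr):  # stable iteration order
        members = strata[stratum]
        if len(members) < min_per_stratum:
            continue
        if max_per_stratum is not None and len(members) > max_per_stratum:
            members = rng.sample(members, max_per_stratum)
        sample.extend(members)
    return sample

# Hypothetical report rows: (entry path, "severity:category:what" message).
entries = [
    ("0801/a.tex", "error:missing_file:foo.sty"),
    ("0801/b.tex", "error:missing_file:foo.sty"),
    ("0802/c.tex", "error:missing_file:foo.sty"),
    ("0802/d.tex", "warning:undefined:bar"),
]

# "only 1 entry for each message registered"
one_per_message = stratified_sample(entries, key=lambda e: e[1], max_per_stratum=1)

# "not more than N per class, stratified by path fragments" (here N = 1)
one_per_segment = stratified_sample(
    entries, key=lambda e: e[0].split("/")[0], max_per_stratum=1
)
```

In a real deployment the same grouping and capping would more likely happen inside the database query, so that only the selected entries are ever loaded.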
The implementation is fairly direct. A new corpus is instantiated, named after the source corpus, service, stratification nickname, and timestamp. A metadata structure is then constructed via a specialized PostgreSQL query. Each reported entry is then added to the new corpus, and finally the service in question is registered on the new sandbox, preparing it for a conversion run.
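As a small illustration of the naming step, a sandbox name could be assembled from those four parts as below. The function name, the underscore separator, and the timestamp format are assumptions made for this sketch; the issue does not fix any of them.

```python
from datetime import datetime, timezone

def sandbox_name(corpus, service, nickname, when=None):
    """Build a sandbox corpus name from corpus, service, nickname, and timestamp.

    Hypothetical helper: the separator and timestamp layout are illustrative.
    """
    when = when or datetime.now(timezone.utc)
    stamp = when.strftime("%Y%m%dT%H%M%SZ")  # compact UTC timestamp
    return "_".join([corpus, service, nickname, stamp])

# Example with made-up corpus and service names.
name = sandbox_name(
    "arxiv", "tex_to_html", "one_per_message",
    when=datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
)
```

Including the timestamp keeps repeated sandbox requests against the same (corpus, service) pair from colliding.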
Add to this feature the ability to download custom corpora, and you get a full lifecycle UX for sandboxes. The download would work by handing you a dangling link while a background task builds the archive; once the task finishes, the link activates and, optionally, you get a notification.
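The dangling-link idea can be sketched with a background worker pool: the request returns a token immediately, and the token only resolves once the archive has been built. Everything here (`DownloadLinks`, `request`, `resolve`, the stand-in archive builder) is hypothetical and only shows the shape of the flow.

```python
import secrets
from concurrent.futures import ThreadPoolExecutor

class DownloadLinks:
    """Hand out link tokens that become active once a background build finishes."""

    def __init__(self, build_archive, max_workers=2):
        self._build = build_archive  # callable: sandbox name -> archive path
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._jobs = {}

    def request(self, sandbox):
        """Start building in the background and return a token right away."""
        token = secrets.token_urlsafe(8)
        self._jobs[token] = self._pool.submit(self._build, sandbox)
        return token

    def resolve(self, token):
        """Return the archive path if the build is done, else None (still dangling)."""
        job = self._jobs[token]
        return job.result() if job.done() else None

    def wait(self, token, timeout=None):
        """Block until the build finishes; mainly useful for scripts and tests."""
        return self._jobs[token].result(timeout=timeout)

# Stand-in builder: a real one would write the archive to disk.
links = DownloadLinks(lambda sandbox: f"/downloads/{sandbox}.zip")
token = links.request("arxiv_tex_to_html_one_per_message")
archive = links.wait(token, timeout=5)
```

The optional notification could hook into the same job, for example through `Future.add_done_callback`, so the user is told when the link becomes active.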