[Feature] Stratified sandboxes

When developing iteratively against a large corpus, a very natural need is to experiment with convenient subsets of the collection, selected by certain metrics of interest.

I have become quite fond of the term "stratify", as employed by ML tools such as sklearn, so I would like to start calling these "stratified sandboxes".

In essence, cortex should provide a mechanism on each (corpus, service) report page to request a "stratified sandbox" from a finished service run. That mechanism is likely just a simple HTML form that lets the user select among several types of stratification:
 - by path fragments: try to provide an equal number of entries from each corpus segment (be it subdirectories, filename globs, etc.)
 - by messages: try to provide an equal number of entries for each message `severity:category:what` class
 - by count/frequency: lower and upper bounds for the main stratification criterion. For example:
    - only 1 entry for each message registered
    - not more than 1,000 per class, stratified by path fragments
    - unlimited, for a specific message class
    - and so on
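
To make the options above concrete, here is a minimal sketch of per-class capped sampling. It is not cortex code: the entry layout (dicts with `path` and `message` keys), the function name `stratified_sample`, and the sample data are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(entries, key, max_per_class=None, seed=0):
    """Group entries by key(entry) and keep at most max_per_class per group.

    max_per_class=None means "unlimited" for every class.
    """
    groups = defaultdict(list)
    for entry in entries:
        groups[key(entry)].append(entry)

    rng = random.Random(seed)  # fixed seed so a sandbox can be reproduced
    sample = {}
    for cls, members in groups.items():
        if max_per_class is None or len(members) <= max_per_class:
            sample[cls] = list(members)
        else:
            sample[cls] = rng.sample(members, max_per_class)
    return sample


# Hypothetical entries: a corpus path plus one `severity:category:what` class.
entries = [
    {"path": "seg1/a.zip", "message": "error:undefined:foo"},
    {"path": "seg1/b.zip", "message": "error:undefined:foo"},
    {"path": "seg2/c.zip", "message": "warning:missing_file:bar"},
    {"path": "seg3/d.zip", "message": "error:undefined:foo"},
]

# "only 1 entry for each message registered"
by_message = stratified_sample(entries, key=lambda e: e["message"], max_per_class=1)

# "not more than 1,000 per class, stratified by path fragments"
by_segment = stratified_sample(
    entries, key=lambda e: e["path"].split("/")[0], max_per_class=1000
)
```

Stratifying by a different criterion only means passing a different `key` function, which is roughly what each option in the form would map to.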

The implementation is fairly straightforward. A new corpus is instantiated, named after the source corpus, service, stratification nickname, and timestamp. A metadata structure is then built via a specialized PostgreSQL query. Each entry that query reports is then added to the new corpus, and finally the service in question is registered on the new sandbox, preparing it for a conversion run.
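
The naming and query steps could look roughly like the sketch below. Everything schema-related is an assumption: the table `log_messages`, its columns, the separator used in the sandbox name, and the helper `sandbox_name` are hypothetical, and the placeholders use the named `%(name)s` style accepted by psycopg. The query relies on the standard `ROW_NUMBER() OVER (PARTITION BY ...)` window function to cap the number of entries per message class.

```python
from datetime import datetime


def sandbox_name(corpus, service, nickname, timestamp):
    """Name the sandbox after corpus, service, stratification nickname, timestamp."""
    return f"{corpus}-{service}-{nickname}-{timestamp:%Y%m%dT%H%M%S}"


# Hypothetical schema: one row per message reported for an entry.
# Rows are numbered in random order within each message class, and
# only the first max_per_class rows of each class are kept.
STRATIFY_BY_MESSAGE_SQL = """
SELECT entry, severity, category, what
FROM (
    SELECT entry, severity, category, what,
           ROW_NUMBER() OVER (
               PARTITION BY severity, category, what
               ORDER BY random()
           ) AS rn
    FROM log_messages
    WHERE corpus_id = %(corpus_id)s AND service_id = %(service_id)s
) ranked
WHERE rn <= %(max_per_class)s
"""

name = sandbox_name("mycorpus", "myservice", "by_message", datetime(2024, 1, 2, 3, 4, 5))
```

Here `name` evaluates to `mycorpus-myservice-by_message-20240102T030405`. Stratifying by path fragment would swap the `PARTITION BY` columns for an expression over the entry path.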

Add to this feature the ability to download custom corpora: the user receives a dangling link while a background task creates the download, at which point the link activates and, optionally, a notification is sent. Together these give a full lifecycle UX for sandboxes.
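
The dangling-link idea can be sketched with a small in-memory registry. This is an illustration only: the class `DownloadRegistry`, its methods, and the token scheme are hypothetical, and a real deployment would need persistent job state and the notification hook, both omitted here.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


class DownloadRegistry:
    """Hands out link tokens immediately; archives are built in the background."""

    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._jobs = {}

    def request(self, build_archive):
        """Schedule build_archive() and return the token for the dangling link."""
        token = uuid.uuid4().hex
        self._jobs[token] = self._pool.submit(build_archive)
        return token

    def resolve(self, token):
        """Return the archive path once it is ready, or None while still dangling.

        If the build failed, the exception is re-raised here.
        """
        job = self._jobs[token]
        return job.result() if job.done() else None
```

The report page would call `request(...)` when the form is submitted and show the link right away. The download handler would call `resolve(token)` and serve the file only once it returns a path.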

[Feature] Stratified sandboxes #46
