[Feature] Stratified sandboxes #46

@dginev

When developing iteratively against a large corpus, a very natural need is to experiment with convenient subsets of the collection, selected by certain metrics of interest.

I have become quite fond of the term "stratify", as employed by ML tools such as sklearn, so I would like to start calling these "stratified sandboxes".

In essence, cortex should provide a mechanism on each (corpus, service) report page to request a "stratified sandbox" from a finished service run. That mechanism is likely just a simple HTML form that lets the user choose among several types of stratification:

  • by path fragments: try to provide an equal number of entries from each corpus segment (a subdirectory, a filename glob, etc.)
  • by messages: try to provide an equal number of entries for each message class (severity:category:what)
  • by count/frequency: lower and upper bounds on the main stratification criterion. For example:
    • only 1 entry for each message registered
    • not more than 1,000 per class, stratified by path fragments
    • unlimited, for a specific message class
    • and so on
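
A minimal sketch of the selection logic described above, assuming entries are available in memory as (path, message class) pairs. The function name `stratified_sample`, its parameters, and the entry shape are hypothetical and are not part of cortex; a real implementation would more likely push this work into the database query.

```python
import random
from collections import defaultdict


def stratified_sample(entries, key, max_per_class=None, min_per_class=0, seed=0):
    """Group entries by key(entry) and bound the size of each group.

    Classes with fewer than min_per_class entries are dropped.
    Classes with more than max_per_class entries are randomly reduced
    to max_per_class (no upper bound when max_per_class is None).
    """
    strata = defaultdict(list)
    for entry in entries:
        strata[key(entry)].append(entry)

    rng = random.Random(seed)  # fixed seed so a sandbox can be reproduced
    sample = {}
    for label in sorted(strata, key=str):
        members = strata[label]
        if len(members) < min_per_class:
            continue
        if max_per_class is not None and len(members) > max_per_class:
            members = rng.sample(members, max_per_class)
        sample[label] = members
    return sample


# Hypothetical entries: (path inside the corpus, message class).
entries = [
    ("1801/a.zip", "error:undefined:macro"),
    ("1801/b.zip", "warning:missing_file:style"),
    ("1802/c.zip", "error:undefined:macro"),
    ("1802/d.zip", "error:undefined:macro"),
]

# Stratify by path fragment (top-level directory), one entry per segment.
by_segment = stratified_sample(
    entries, key=lambda e: e[0].split("/")[0], max_per_class=1
)

# Stratify by message class, at most 1,000 entries per class.
by_message = stratified_sample(entries, key=lambda e: e[1], max_per_class=1000)
```

The `key` callable is what distinguishes the stratification types: a path-based key gives the "by path fragments" variant, a message-based key gives the "by messages" variant, and the two bounds cover the count/frequency options.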

The implementation is fairly direct. A new corpus is instantiated, named after the source corpus, the service, the stratification nickname, and a timestamp. A metadata structure is then constructed via a specialized PostgreSQL query. Each entry the query reports is then added to the new corpus, and finally the service in question is registered on the new sandbox, preparing it for a conversion run.
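
The steps above could be sketched roughly as follows, with in-memory stand-ins for the database. The `Sandbox` class, the `build_sandbox` function, and the naming scheme in `sandbox_name` are all assumptions made for illustration; the issue does not specify a name format, and `selected_entries` stands in for the result of the PostgreSQL query.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


def sandbox_name(corpus, service, nickname, when):
    """Build a name from corpus, service, nickname, and a UTC timestamp.

    The double-underscore separator and timestamp format are hypothetical.
    """
    stamp = when.strftime("%Y%m%dT%H%M%SZ")
    return f"{corpus}__{service}__{nickname}__{stamp}"


@dataclass
class Sandbox:
    """In-memory stand-in for a newly instantiated corpus."""
    name: str
    entries: list = field(default_factory=list)
    services: list = field(default_factory=list)


def build_sandbox(corpus, service, nickname, selected_entries, when=None):
    when = when or datetime.now(timezone.utc)
    # 1. Instantiate the new corpus under its derived name.
    sandbox = Sandbox(name=sandbox_name(corpus, service, nickname, when))
    # 2. Add each entry reported by the (here precomputed) selection.
    for entry in selected_entries:
        sandbox.entries.append(entry)
    # 3. Register the service so the sandbox is ready for a conversion run.
    sandbox.services.append(service)
    return sandbox
```

Keeping the timestamp in the name means repeated requests with the same stratification nickname would not collide.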

Add to this feature the ability to download custom corpora (you receive a dangling link while a background task creates the download; once the task finishes, the link activates and, optionally, you get a notification), and you get a full lifecycle UX for sandboxes.
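
One way the dangling-link idea might look, sketched with a thread pool standing in for the background task system. `DownloadRegistry` and its methods are hypothetical names, and the archive is a plain zip built from in-memory contents rather than real corpus files.

```python
import tempfile
import uuid
import zipfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


class DownloadRegistry:
    """Hands out link tokens that only resolve once the archive exists."""

    def __init__(self, out_dir=None):
        self.out_dir = Path(out_dir or tempfile.mkdtemp())
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._jobs = {}

    def request(self, files, notify=None):
        """Start building an archive; return the (still dangling) token."""
        token = uuid.uuid4().hex
        target = self.out_dir / f"{token}.zip"

        def build():
            with zipfile.ZipFile(target, "w") as archive:
                for name, content in files.items():
                    archive.writestr(name, content)
            if notify is not None:
                notify(token)  # optional notification once the build is done
            return target

        self._jobs[token] = self._pool.submit(build)
        return token

    def wait(self, token, timeout=None):
        """Block until the background build for this token has finished."""
        return self._jobs[token].result(timeout=timeout)

    def resolve(self, token):
        """Return the archive path, or None while the link is dangling."""
        job = self._jobs.get(token)
        if job is None or not job.done():
            return None
        return job.result()
```

A request handler would return the token immediately and answer with "not ready" for as long as `resolve` returns `None`.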
