[ARCHIVED] Registry + leaderboard design v1 #139
-
Some (raw) notes from my brainstorm session with @andyk on 9/10/2021:
-
In the last couple of days I cloned and explored the aos_web repo w/ the prototype web app, looked at the current prototype port of the runtime to work with it, and started thinking about what changes we might need to make to update the registry + leaderboard architecture to support the new component system (#148). I started drafting some ideas for a v2 design of the registry in this Google Doc.
-
@nickjalbert I renamed this design discussion to include "v1". I figure that we can version our design discussions as we iterate on core project elements to keep the conversation threads easier to manage.
-
This thread will track ideas related to the registry and leaderboard web services.
Executive Summary
We built a prototype web service that allows us to publish the performance of an agent and then download and reproduce those results, all from the command line. The web service integrates the AgentOS registry and allows us to associate agent components with agent performance in particular environments.
Top of Mind and Next Steps
Desired functionality
From Andy's slides, the vision of the AgentOS web service is to combine ideas from PyPI, MLflow, Hugging Face, Kaggle, and Papers with Code into a website that will build a community around the AgentOS platform. Specific high-level features of these services that we find compelling:
Some lower-level features that we wanted to investigate in this first prototype:
AgentOS Web Service Prototype
We built out a prototype web service to get a better sense of the design challenges we face in this line of work. Some artifacts:
Mentions of the AgentOS CLI in the following sections refer to the CLI developed in tandem with the AgentOS web service (on the `nj_leaderboard` branch in Nick's AgentOS fork) unless otherwise noted.
Registry
The AgentOS registry stores information about the various components that can compose an AgentOS agent. This information allows the CLI to download these components and incorporate them into agents.
In the prototype, the registry of components (Agents, Environments, Policies, Trainers, Datasets, etc.) was extracted from `registry.yaml` in the AgentOS master branch and placed into the web service's database across two tables: `Component` and `ComponentRelease`.
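As a rough sketch (the field names below are assumptions, not the actual prototype schema), the two tables might look something like this:

```python
from dataclasses import dataclass
from datetime import datetime

# Illustrative sketch only: these field names are assumptions, not the
# prototype's actual database schema.

@dataclass
class Component:
    id: int
    name: str              # the component's registry name
    component_type: str    # Agent, Environment, Policy, Trainer, Dataset, ...
    description: str

@dataclass
class ComponentRelease:
    id: int
    component_id: int      # points back to the Component this release belongs to
    version: str           # version/tag of this release
    source_url: str        # where the CLI can download this release from
    created_at: datetime
```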
When `agentos install [component]` is called from the command line, the registry is pulled down from the web service, the information about the requested component is extracted from the registry, and the install proceeds as in AgentOS master.
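Conceptually, that install flow looks something like the following sketch; the endpoint URL and response format are assumptions, not the prototype's actual API:

```python
# Conceptual sketch of the `agentos install [component]` flow against the web
# service. The endpoint URL and response format are assumptions.
import requests
import yaml

REGISTRY_URL = "https://example-agentos-service.test/registry.yaml"  # hypothetical

def fetch_component_entry(component_name: str) -> dict:
    """Pull the registry down from the web service and return one component's entry."""
    registry = yaml.safe_load(requests.get(REGISTRY_URL, timeout=10).text)
    if component_name not in registry:
        raise KeyError(f"{component_name!r} not found in registry")
    return registry[component_name]

# The CLI would hand this entry to the existing install logic from AgentOS
# master, which downloads the component and wires it into the agent.
```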
Registry future work
Benchmark Runs
AgentOS web service is designed to make it easy to share benchmarks and the agents that run against those benchmarks. The idea is to combine the model hosting and sharing enabled by Hugging Face with the competitive aspects of Kaggle. To that end, the prototype introduces the concept of a benchmark run which captures the performance of a particular trained agent against a particular environment.
The AgentOS CLI now distinguishes between training runs (i.e. those runs initiated by executing `agentos learn ...`) and benchmark runs (i.e. those runs initiated by executing `agentos run ...`). A training run is one in which the agent's experience is recorded and the Trainer is called on to improve the agent's policy. A benchmark run, on the other hand, does not involve any policy improvement; it is simply a way to assess the agent's current competence in its environment, so experience is not recorded and the agent's policy is not updated, but metrics related to the agent's performance are recorded.
The envisioned use case is that an agent developer will alternate between training sessions and benchmarking sessions to track the improvement in their agent. After a while, the history of a given agent will look similar to the following:
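For illustration (all values and field names below are made up), that alternating history might look like:

```python
# Hypothetical example of an agent's run history alternating between training
# runs (`agentos learn`) and benchmark runs (`agentos run`). All values are
# illustrative.
history = [
    {"run_type": "train",     "episodes": 100, "mean_reward": None},
    {"run_type": "benchmark", "episodes": 10,  "mean_reward": 21.5},
    {"run_type": "train",     "episodes": 100, "mean_reward": None},
    {"run_type": "benchmark", "episodes": 10,  "mean_reward": 57.0},
    {"run_type": "train",     "episodes": 100, "mean_reward": None},
    {"run_type": "benchmark", "episodes": 10,  "mean_reward": 112.3},
]
```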
The AgentOS CLI offers a new command, `agentos publish`, that shares a benchmark run and makes it reproducible. Among other things, it collects the agent's `agentos.ini` file; all of this gets packaged up (including artifacts used by the agent during the benchmark run) and shipped to the AgentOS web service via two API calls. The first API call creates a record of the benchmark run by inserting a row into the `Run` table.
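As a rough sketch (field names below are assumptions, not the actual schema), a `Run` row might carry something like:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional

# Illustrative sketch only: field names are assumptions, not the prototype's
# actual Run table schema. The key idea is that a Run ties benchmark metrics
# to the specific components that constituted the agent.

@dataclass
class Run:
    id: int
    component_release_ids: List[int]        # the components that made up the agent
    environment: str                        # the environment benchmarked against
    metrics: dict                           # e.g. {"mean_reward": ..., "episodes": ...}
    backing_data_url: Optional[str] = None  # filled in by the second API call
    created_at: datetime = field(default_factory=datetime.utcnow)
```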
The `Run` row associates the benchmark results with each component that constituted the agent and thus allows us to associate benchmark runs with particular components.
The second API call uploads a zipped tarball of all the backing data (e.g. trained neural nets) required to reproduce the benchmark run and associates that data with the previously created `Run` object.
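A minimal sketch of that second call, assuming a hypothetical endpoint path and field name:

```python
# Sketch of the second API call: upload the tarball of backing data and
# associate it with the Run created by the first call. The endpoint path and
# field names are assumptions.
import requests

def upload_run_artifacts(run_id: int, tarball_path: str) -> None:
    with open(tarball_path, "rb") as f:
        resp = requests.post(
            f"https://example-agentos-service.test/api/runs/{run_id}/data",
            files={"tarball": f},
            timeout=120,
        )
    resp.raise_for_status()
```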
Making runs reproducible
Our prototype also implements an `agentos get [benchmark run number] [local directory]` command that pulls down the data for the chosen benchmark run from the AgentOS web service and recreates, in the local directory, the agent that generated that benchmark run.
This command downloads the zipped tarball that is associated with the chosen benchmark run. The tarball contains the following files:
- an `agentos.ini` file specifying the components constituting the agent
- a `parameters.yaml` file that captures the hyperparameters that were passed to the agent
- a `run_data.yaml` file that contains info on the performance of the agent
- a `data/` directory that contains all the backing artifacts required to run the agent (e.g. a trained neural net)

The tarball is untarred and then:
- the `agentos.ini` file is copied into the local directory
- the components specified in the `agentos.ini` file are installed
- the hyperparameters are updated from the `parameters.yaml` file (however, currently we don't respect these updated hyperparameters; see future work)
- the `data/` directory is copied into the new agent's data directory

The local agent is now in an identical state to the agent that generated the benchmark run results and can be run with `agentos run`.
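A minimal sketch of that local reconstitution step, with assumed paths and helper names (the real logic lives on the `nj_leaderboard` branch):

```python
# Sketch of what `agentos get` does locally after downloading the run's
# tarball. Paths and helper names are assumptions.
import shutil
import tarfile
from pathlib import Path

def reconstitute(tarball_path: str, target_dir: str) -> None:
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    staging = target / "_run_download"
    with tarfile.open(tarball_path) as tar:
        tar.extractall(staging)  # agentos.ini, parameters.yaml, run_data.yaml, data/
    shutil.copy(staging / "agentos.ini", target / "agentos.ini")
    shutil.copytree(staging / "data", target / "data", dirs_exist_ok=True)
    # ...then install the components listed in agentos.ini and apply the
    # hyperparameters from parameters.yaml (not yet respected; see future work).
```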
Benchmark runs future work
- If you `get` an agent, train it for one episode, and then re-upload it, it seems like the system should somehow track that you didn't train your agent from scratch. Alternatively, you may want private and public agents.
- Runs should be associated with a `ComponentRelease` and not a `Component`.
- Rename `Run` to `BenchmarkRun` once we've settled on terminology.
- `parameters.yaml` should override all hyperparameters in a run.
Command-line interface
We've discussed the changes to the AgentOS CLI in the previous sections, but in summary:
- `agentos publish` publishes a benchmark run to the AgentOS web service.
- `agentos get ...` reconstitutes an agent used to generate a benchmark run listed on the AgentOS web service.
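Tying these commands together, a typical session might look like the following sketch (the run number, directory name, and any omitted arguments are placeholders):

```python
# End-to-end sketch of the envisioned workflow using the CLI commands described
# above. The run number and directory are placeholders.
import subprocess

subprocess.run(["agentos", "learn"], check=True)    # training run: improve the policy
subprocess.run(["agentos", "run"], check=True)      # benchmark run: measure performance
subprocess.run(["agentos", "publish"], check=True)  # share the benchmark run

# Later, anyone can reconstitute and re-run the published agent:
subprocess.run(["agentos", "get", "42", "reproduced_agent"], check=True)
subprocess.run(["agentos", "run"], cwd="reproduced_agent", check=True)
```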
MLflow
The current prototype only partially commits to using MLflow Tracking and does not use MLflow Projects.
We only half use MLflow Tracking because we rolled our own system for managing agent backing data (e.g. neural nets) when we first prototyped the componentized version of AgentOS, and it seemed low priority to port everything over to MLflow Tracking for this new web service prototyping exercise. Current MLflow Tracking usage is concentrated in the `run_agent()` function. However, I think a full port to MLflow Tracking will be helpful for both correctness and consistency in the runtime.
The usage of MLflow Projects requires further investigation, but it may help us handle dependencies in a cleaner way.
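As a sketch of what a fuller port of `run_agent()` to MLflow Tracking could look like (the agent API, paths, and metric names here are placeholders, not the current implementation):

```python
# Sketch of fuller MLflow Tracking usage inside run_agent(); the agent API and
# metric names are placeholders.
import mlflow

def run_agent(agent, num_episodes: int = 10):
    with mlflow.start_run():
        mlflow.log_param("num_episodes", num_episodes)
        total_reward = 0.0
        for _ in range(num_episodes):
            total_reward += agent.run_episode()  # hypothetical agent API
        mlflow.log_metric("mean_reward", total_reward / num_episodes)
        mlflow.log_artifacts("data")             # backing data (e.g. trained nets)
```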
MLflow future work
High-level Thoughts