-
Thanks for kindly mentioning cherry in #90. I agree with the sentiment about monolithic designs; instead, I'm still looking for a library that focuses on two aspects of RL infra.
I'd love for some of those ideas to find their way into mainstream RL libraries, so feel free to borrow as much as you'd like from cherry. Unfortunately I don't get as much time as I'd wish to work on it nowadays.

*Side note: I wonder if one could write RL algorithms by "chaining" smaller operations, the way optax does for optimization.*
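To make the side note concrete, here is a minimal sketch of what optax-style chaining could look like for RL: small, composable operations over a batch. Everything here (`chain`, `compute_gae`, `normalize`) is a hypothetical name with a stand-in body, not cherry's or TorchRL's API.

```python
from typing import Any, Callable, Dict

import torch

Batch = Dict[str, Any]
Op = Callable[[Batch], Batch]

def chain(*ops: Op) -> Op:
    """Compose small batch-processing ops into one callable, optax.chain-style."""
    def composed(batch: Batch) -> Batch:
        for op in ops:
            batch = op(batch)
        return batch
    return composed

def compute_gae(gamma: float = 0.99, lam: float = 0.95) -> Op:
    """Hypothetical op that would add an 'advantage' entry to the batch."""
    def op(batch: Batch) -> Batch:
        batch = dict(batch)
        batch["advantage"] = batch["reward"]  # stand-in for an actual GAE computation
        return batch
    return op

def normalize(key: str) -> Op:
    """Hypothetical op that normalizes batch[key] to zero mean / unit std."""
    def op(batch: Batch) -> Batch:
        batch = dict(batch)
        x = batch[key]
        batch[key] = (x - x.mean()) / (x.std() + 1e-8)
        return batch
    return op

# An algorithm's data pipeline then reads as a declarative chain of small pieces:
ppo_pipeline = chain(compute_gae(), normalize("advantage"))
batch = ppo_pipeline({"reward": torch.randn(64)})
```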
-
Thanks for initiating this discussion @vmoens. Relying on external training libraries has several benefits.
I think an even better case would be to have examples showing integration with multiple training frameworks, not just one. That would go a long way towards establishing the flexibility of torch-rl and will ensure more adoption. The obvious downside of relying on one or more external libraries is the added dependency; however, we can limit this dependence if we take a lightweight integration approach, as sketched below.
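As a hypothetical illustration of what a lightweight integration could mean: keep the RL-specific pieces (data collection and loss computation) behind a tiny interface, so that any trainer framework (plain PyTorch, Lightning, Ignite, ...) only has to drive two calls. `RLStep`, `collect` and `loss` are made-up names and the bodies are dummies, not torch-rl code.

```python
import torch
from torch import nn

class RLStep:
    """The minimal surface a trainer framework needs: get a batch, get a loss."""

    def __init__(self, policy: nn.Module):
        self.policy = policy

    def collect(self) -> dict:
        # Stand-in for a real collector stepping through an environment.
        return {"obs": torch.randn(32, 4), "target": torch.randn(32, 2)}

    def loss(self, batch: dict) -> torch.Tensor:
        # Stand-in for a real RL loss (DQN, PPO, ...).
        return ((self.policy(batch["obs"]) - batch["target"]) ** 2).mean()

# A plain-PyTorch trainer is then a few lines; a Lightning or Ignite wrapper
# would call the same two methods from its own training_step / update function.
policy = nn.Linear(4, 2)
step = RLStep(policy)
optim = torch.optim.Adam(policy.parameters(), lr=1e-3)
for _ in range(10):
    loss = step.loss(step.collect())
    optim.zero_grad()
    loss.backward()
    optim.step()
```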
-
Yes, the design of this part is a challenge and a tradeoff, and may determine how far torchrl can go. A design serves its goals and positioning: what are the goals of torchrl? Is it a framework or a library? Is the goal to support model-free RL only, or to support model-free/model-based, single-agent/multi-agent, on-policy/off-policy, online/offline and meta-RL simultaneously?

Do we want an Agent class? My answer tends to no, because we don't have to create a trainer class to guide the training process. Regarding the training process of deep learning, there are maybe three modes (informal naming):

- a. native mode
- b.1. pytorch-lightning mode (event-driven, limited to a trainer class)
- b.2. pytorch-ignite mode (event-driven, not limited to a trainer class)
- c. trainer class mode

According to intuition, readability and extensibility, my preference is: a > b.2 >> b.1 > c. Whatever mode we choose is fine, as long as we keep a unified principle in the library rather than provide too many alternatives to the user. We should avoid the same mistakes that the tensorflow/keras mode made. So "Having a class where stuff is already aligned is an effective way to start! Then you may just say (after you gather some experience): 'ok dude, this is useless, I'll code it from scratch myself'" is not a good idea. (A minimal sketch contrasting the native and trainer-class modes follows below.)
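A tiny, hypothetical contrast between the first and last of those modes; the `Trainer` below is a toy written for this comment, not pytorch-lightning, pytorch-ignite or torchrl code.

```python
import torch
from torch import nn

model = nn.Linear(4, 2)
optim = torch.optim.SGD(model.parameters(), lr=0.1)

# a. native mode: the loop is plain code the user reads and edits directly.
for _ in range(5):
    x, y = torch.randn(8, 4), torch.randn(8, 2)
    loss = ((model(x) - y) ** 2).mean()
    optim.zero_grad()
    loss.backward()
    optim.step()

# c. trainer class mode: the loop lives in a class; behaviour is injected via callbacks.
class Trainer:
    def __init__(self, model, optim, callbacks=()):
        self.model, self.optim, self.callbacks = model, optim, callbacks

    def fit(self, steps: int):
        for step in range(steps):
            x, y = torch.randn(8, 4), torch.randn(8, 2)
            loss = ((self.model(x) - y) ** 2).mean()
            self.optim.zero_grad()
            loss.backward()
            self.optim.step()
            for cb in self.callbacks:
                cb(step, loss.item())

Trainer(model, optim, callbacks=[lambda step, loss: print(step, loss)]).fit(5)
```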
-
As pointed out in #90, the agent class is the least TorchRL thing in TorchRL.
It's a magic one-size-fits-all class that is highly customized for TorchRL internals, and as such it requires folks who wish to use it to understand torchrl well and adopt it in all its depth and breadth... Not what we want!
This raises a series of questions:
Do we want an `Agent` class?
In my mind there are 2 advantages to having an agent class. First, examples are highly repetitive. In most cases, training an (online) RL algorithm can be sketched as the loop below.
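A rough, runnable stand-in for that loop; the toy collector, buffer and loss are placeholders, not TorchRL components.

```python
import random

import torch
from torch import nn

policy = nn.Linear(4, 2)
optim = torch.optim.Adam(policy.parameters(), lr=1e-3)
replay_buffer: list = []

def collect_batch(policy, batch_size=32):
    # Stand-in for a collector stepping the policy through an environment.
    return {"obs": torch.randn(batch_size, 4), "target": torch.randn(batch_size, 2)}

for collection_step in range(10):              # 1. collect a batch of data
    batch = collect_batch(policy)
    replay_buffer.append(batch)                # 2. store it
    for _ in range(4):                         # 3. a few optimization steps per batch
        sample = random.choice(replay_buffer)  #    sample training data
        loss = ((policy(sample["obs"]) - sample["target"]) ** 2).mean()
        optim.zero_grad()
        loss.backward()
        optim.step()
    # 4. periodically: update target networks, sync collector weights, log, evaluate...
```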
Obviously, each of these steps may be decorated by some other things that need to be incorporated (e.g. resetting noise of noisy layers, updating target networks weights, updating priority of the replay buffer, etc.)
The second reason an agent class is a nice-to-have is that it gives directions to newcomers. Imagine this is the first RL algorithm you're coding, and you don't really know where to start. Having a class where stuff is already aligned is an effective way to start! Then you may just say (after you gather some experience): "ok dude, this is useless, I'll code it from scratch myself", which is probably what I'd do as an external user.
Do we want multiple agent classes, with highly hierarchical inheritance schemes?
One might say: this sketch of an agent class you drew above is all well and good until I have a really convoluted algorithm to implement.
So we may code up a parent class and inherit from it, the same way we do with losses for instance. The problem is that this solution means we'll end up with some nasty spaghetti code from time to time: multiple inheritance, overriding perfectly legitimate methods with others, etc. IMO it makes the code lengthy and very specific (read: not very elastic). The toy hierarchy below illustrates the kind of override cascade this leads to.
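A hypothetical toy hierarchy (not real TorchRL classes) showing where that tends to go: each subclass overrides or reaches into its parent's internals, and what actually runs ends up spread across several classes.

```python
class Agent:
    def train_step(self, batch):
        batch = self.process_batch(batch)
        return self.optimize(batch)

    def process_batch(self, batch):
        return batch

    def optimize(self, batch):
        return sum(batch)  # stand-in for a loss computation / optimizer step

class OffPolicyAgent(Agent):
    def __init__(self):
        self.buffer = []

    def process_batch(self, batch):
        self.buffer.extend(batch)           # store the incoming data...
        return self.buffer[-len(batch):]    # ...then "re-sample" from the buffer (toy)

class PrioritizedOffPolicyAgent(OffPolicyAgent):
    def optimize(self, batch):
        # has to know about the parent's buffer/process_batch to update priorities
        self.priorities = [abs(x) for x in batch]
        return super().optimize(batch)

PrioritizedOffPolicyAgent().train_step([1.0, -2.0, 3.0])
```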
Does `Agent` belong to torchrl core, or is it something that should live somewhere else (other repo / examples directory)?
We could indeed build another repo on top of torchrl that would provide all the higher-level primitives to train RL agents. But IMO it's going to be hard to get traction for that: if the community is receptive to it (say we get 80% of the clones/stars we get with TorchRL), then it should probably belong to TorchRL. If the repo has no success, well, why is it there?
We may perhaps move this class to `examples`, but that would redefine the implicit role of `examples`, where we only wish to have scripts to train specific benchmarks at the moment (not re-usable classes that one may want to use in other contexts).

I have created a PR where I propose a refactored version of the Agent class.
The Agent is now very schematic and generic; at a high level it looks like the sketch below.
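This is a reconstruction based on the hook names listed further down, not the exact code from the PR; `register_op` and the internal helpers are illustrative.

```python
class Agent:
    def __init__(self, collector, loss_module, optimizer, optim_steps_per_batch):
        self.collector = collector
        self.loss_module = loss_module
        self.optimizer = optimizer
        self.optim_steps_per_batch = optim_steps_per_batch
        # each hook point holds a list of callables registered by the user
        self._hooks = {name: [] for name in (
            "process_batch", "post_steps", "process_optim_batch",
            "post_loss", "post_optim", "pre_steps_log", "post_steps_log")}

    def register_op(self, dest: str, op):
        self._hooks[dest].append(op)

    def _apply(self, dest, batch):
        # transforming hooks: each op takes the batch and returns a (new) batch
        for op in self._hooks[dest]:
            batch = op(batch)
        return batch

    def _call(self, dest, *args):
        # side-effect hooks: logging, target updates, collector weight sync, ...
        for op in self._hooks[dest]:
            op(*args)

    def train(self):
        for batch in self.collector:
            batch = self._apply("process_batch", batch)
            self._call("pre_steps_log", batch)
            for _ in range(self.optim_steps_per_batch):
                sub_batch = self._apply("process_optim_batch", batch)
                loss = self.loss_module(sub_batch)
                self._call("post_loss", sub_batch)
                loss.backward()
                self.optimizer.step()
                self.optimizer.zero_grad()
                self._call("post_optim")
            self._call("post_steps")
            self._call("post_steps_log", batch)
```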
I have pointed out all the methods that can easily be modified by "hooking" stuff onto them.
As it is broken down into little pieces, we can test each of them independently. One can also add their own if necessary. There are still some constraints on what those methods can do, but I guess it should not be too difficult to overcome those limitations (either in the design of the hooks or in the design of the agent class).
For now, the `make_agent` function still takes care of building this lego castle, but we can also make very specific changes in training scripts that won't force the whole class to comply with a single example.

Examples of hooks that can be registered are:

- `process_batch`: extend the replay buffer with the current data (if there is one), update the normalizing statistics of the reward (if normalized), etc.
- `post_steps`: update the weights of the policy in the collector (if on another process / cuda device / worker etc.), make an annealing step of epsilon-greedy
- `process_optim_batch`: sample from the replay buffer (e.g. DQN) OR sample from the batch of data collected (e.g. PPO)
- `post_loss`: update the priority of the replay buffer with `td_error` or similar
- `post_optim`: a step of the optimizer scheduler or a step of update of the target network
- `pre_steps_log`: log the reward in the collected batch, or log something else from that batch
- `post_steps_log`: execute the recorder (i.e. execute the policy in eval mode to display results without exploration)

Naming -- and everything else! -- is subject to change and suggestion! A small example of registering a few of these hooks is sketched below.
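Using the sketched `Agent` above (hypothetical names and toy bodies, not the PR's exact API), registration could look like this, with `agent` an instance built as in the earlier snippet.

```python
import torch

replay_buffer = []

def extend_replay_buffer(batch):
    # process_batch-style hook: store the freshly collected data
    replay_buffer.append(batch)
    return batch

def sample_minibatch(batch):
    # process_optim_batch-style hook: sub-sample the collected batch (PPO-style)
    idx = torch.randperm(batch["obs"].shape[0])[:32]
    return {key: value[idx] for key, value in batch.items()}

# `agent` is assumed to be an instance of the Agent class sketched earlier.
agent.register_op("process_batch", extend_replay_buffer)
agent.register_op("process_optim_batch", sample_minibatch)
agent.register_op("pre_steps_log", lambda batch: print("mean reward:", batch["reward"].mean().item()))
```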
Also if I missed something let me know.
Anyhow: please, share feedback!
@shagunsodhani @walkacross