-
Thanks for kindly mentioning cherry in #90. I agree with the sentiment about monolithic designs; instead, I'm still looking for a library that focuses on two aspects of RL infra.
I'd love for some of those ideas to find their way into mainstream RL libraries, so feel free to borrow as much as you'd like from cherry. Unfortunately I don't get as much time as I'd wish to work on it nowadays.

*Side note: I wonder if one could write RL algorithms by "chaining" smaller operations, the way optax does for optimization.*
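To make the side note concrete, here is a minimal sketch of what optax-style chaining could look like for RL: small, composable operations over a batch. Everything here (`chain`, `compute_gae`, `normalize`) is a hypothetical name with a stand-in body, not cherry's or TorchRL's API.

```python
from typing import Any, Callable, Dict

import torch

Batch = Dict[str, Any]
Op = Callable[[Batch], Batch]

def chain(*ops: Op) -> Op:
    """Compose small batch-processing ops into one callable, optax.chain-style."""
    def composed(batch: Batch) -> Batch:
        for op in ops:
            batch = op(batch)
        return batch
    return composed

def compute_gae(gamma: float = 0.99, lam: float = 0.95) -> Op:
    """Hypothetical op that would add an 'advantage' entry to the batch."""
    def op(batch: Batch) -> Batch:
        batch = dict(batch)
        batch["advantage"] = batch["reward"]  # stand-in for an actual GAE computation
        return batch
    return op

def normalize(key: str) -> Op:
    """Hypothetical op that normalizes batch[key] to zero mean / unit std."""
    def op(batch: Batch) -> Batch:
        batch = dict(batch)
        x = batch[key]
        batch[key] = (x - x.mean()) / (x.std() + 1e-8)
        return batch
    return op

# An algorithm's data pipeline then reads as a declarative chain of small pieces:
ppo_pipeline = chain(compute_gae(), normalize("advantage"))
batch = ppo_pipeline({"reward": torch.randn(64)})
```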
-
Thanks for initiating this discussion @vmoens. Relying on external training libraries has several benefits.
I think an even better case would be to have examples showing integration with multiple training frameworks, not just one. That would go a long way towards establishing the flexibility of torch-rl and will ensure more adoption. The obvious downside of relying on one or more external libraries is the added dependency; however, we can limit this dependence if we take a lightweight integration approach, as sketched below.
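As a hypothetical illustration of what a lightweight integration could mean: keep the RL-specific pieces (data collection and loss computation) behind a tiny interface, so that any trainer framework (plain PyTorch, Lightning, Ignite, ...) only has to drive two calls. `RLStep`, `collect` and `loss` are made-up names and the bodies are dummies, not torch-rl code.

```python
import torch
from torch import nn

class RLStep:
    """The minimal surface a trainer framework needs: get a batch, get a loss."""

    def __init__(self, policy: nn.Module):
        self.policy = policy

    def collect(self) -> dict:
        # Stand-in for a real collector stepping through an environment.
        return {"obs": torch.randn(32, 4), "target": torch.randn(32, 2)}

    def loss(self, batch: dict) -> torch.Tensor:
        # Stand-in for a real RL loss (DQN, PPO, ...).
        return ((self.policy(batch["obs"]) - batch["target"]) ** 2).mean()

# A plain-PyTorch trainer is then a few lines; a Lightning or Ignite wrapper
# would call the same two methods from its own training_step / update function.
policy = nn.Linear(4, 2)
step = RLStep(policy)
optim = torch.optim.Adam(policy.parameters(), lr=1e-3)
for _ in range(10):
    loss = step.loss(step.collect())
    optim.zero_grad()
    loss.backward()
    optim.step()
```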
-
Yes, the design of this part is a challenge and a tradeoff, and may determine how far torchrl can go. A design serves its goals and positioning: what are the goals of torchrl? Is it a framework or a library? Is the goal to support model-free RL only, or to support model-free/model-based, single-agent/multi-agent, on-policy/off-policy, online/offline and meta-RL simultaneously?

Do we want an Agent class? My answer tends to no, because we don't have to create a trainer class to guide the training process. Regarding the training process of deep learning, there are maybe three modes (informal naming):

- a. native mode
- b.1. pytorch-lightning mode (event-driven, limited to a trainer class)
- b.2. pytorch-ignite mode (event-driven, not limited to a trainer class)
- c. trainer class mode

According to intuition, readability and extensibility, my preference is: a > b.2 >> b.1 > c. Whatever mode we choose is fine, as long as we keep a unified principle in the library rather than provide too many alternatives to the user. We should avoid the same mistakes that the tensorflow/keras mode made. So "Having a class where stuff is already aligned is an effective way to start! Then you may just say (after you gather some experience): 'ok dude, this is useless, I'll code it from scratch myself'" is not a good idea. (A minimal sketch contrasting the native and trainer-class modes follows below.)
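A tiny, hypothetical contrast between the first and last of those modes; the `Trainer` below is a toy written for this comment, not pytorch-lightning, pytorch-ignite or torchrl code.

```python
import torch
from torch import nn

model = nn.Linear(4, 2)
optim = torch.optim.SGD(model.parameters(), lr=0.1)

# a. native mode: the loop is plain code the user reads and edits directly.
for _ in range(5):
    x, y = torch.randn(8, 4), torch.randn(8, 2)
    loss = ((model(x) - y) ** 2).mean()
    optim.zero_grad()
    loss.backward()
    optim.step()

# c. trainer class mode: the loop lives in a class; behaviour is injected via callbacks.
class Trainer:
    def __init__(self, model, optim, callbacks=()):
        self.model, self.optim, self.callbacks = model, optim, callbacks

    def fit(self, steps: int):
        for step in range(steps):
            x, y = torch.randn(8, 4), torch.randn(8, 2)
            loss = ((self.model(x) - y) ** 2).mean()
            self.optim.zero_grad()
            loss.backward()
            self.optim.step()
            for cb in self.callbacks:
                cb(step, loss.item())

Trainer(model, optim, callbacks=[lambda step, loss: print(step, loss)]).fit(5)
```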
-
As pointed out in #90, the agent class is the least TorchRL thing in TorchRL.
It's a magic one-size-fits-all class that is highly customized for TorchRL internals, and as such it requires folks who wish to use it to understand torchrl well and adopt it in all its depth and breadth... Not what we want!
This raises a series of questions:
Do we want an `Agent` class?
In my mind there are 2 advantages to having an agent class. First, examples are highly repetitive. In most cases, training an (online) RL algorithm can be sketched as the loop below.
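A rough, runnable stand-in for that loop; the toy collector, buffer and loss are placeholders, not TorchRL components.

```python
import random

import torch
from torch import nn

policy = nn.Linear(4, 2)
optim = torch.optim.Adam(policy.parameters(), lr=1e-3)
replay_buffer: list = []

def collect_batch(policy, batch_size=32):
    # Stand-in for a collector stepping the policy through an environment.
    return {"obs": torch.randn(batch_size, 4), "target": torch.randn(batch_size, 2)}

for collection_step in range(10):              # 1. collect a batch of data
    batch = collect_batch(policy)
    replay_buffer.append(batch)                # 2. store it
    for _ in range(4):                         # 3. a few optimization steps per batch
        sample = random.choice(replay_buffer)  #    sample training data
        loss = ((policy(sample["obs"]) - sample["target"]) ** 2).mean()
        optim.zero_grad()
        loss.backward()
        optim.step()
    # 4. periodically: update target networks, sync collector weights, log, evaluate...
```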
Obviously, each of these steps may be decorated by some other things that need to be incorporated (e.g. resetting noise of noisy layers, updating target networks weights, updating priority of the replay buffer, etc.)
The second reason an agent class is a nice-to-have is that it gives directions to newcomers. Imagine this is the first RL algorithm you're coding, and you don't really know where to start. Having a class where stuff is already aligned is an effective way to start! Then you may just say (after you gather some experience): "ok dude, this is useless, I'll code it from scratch myself", which is probably what I'd do as an external user.
Do we want multiple agent classes, with highly hierarchical inheritance schemes?
One might say: this sketch of an agent class you drew above is all well and good until I have a really convoluted algorithm to implement.
So we may code up a parent class and inherit from it, the same way we do with losses for instance. The problem is that this solution means we'll end up with some nasty spaghetti code from time to time: multiple inheritance, overriding perfectly legitimate methods with others, etc. IMO it makes the code lengthy and very specific (read: not very elastic). The toy hierarchy below illustrates the kind of override cascade this leads to.
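A hypothetical toy hierarchy (not real TorchRL classes) showing where that tends to go: each subclass overrides or reaches into its parent's internals, and what actually runs ends up spread across several classes.

```python
class Agent:
    def train_step(self, batch):
        batch = self.process_batch(batch)
        return self.optimize(batch)

    def process_batch(self, batch):
        return batch

    def optimize(self, batch):
        return sum(batch)  # stand-in for a loss computation / optimizer step

class OffPolicyAgent(Agent):
    def __init__(self):
        self.buffer = []

    def process_batch(self, batch):
        self.buffer.extend(batch)           # store the incoming data...
        return self.buffer[-len(batch):]    # ...then "re-sample" from the buffer (toy)

class PrioritizedOffPolicyAgent(OffPolicyAgent):
    def optimize(self, batch):
        # has to know about the parent's buffer/process_batch to update priorities
        self.priorities = [abs(x) for x in batch]
        return super().optimize(batch)

PrioritizedOffPolicyAgent().train_step([1.0, -2.0, 3.0])
```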
Does `Agent` belong to torchrl core, or is it something that should live somewhere else (other repo / examples directory)?
We could indeed build another repo on top of torchrl that would provide all the higher-level primitives to train RL agents. But IMO it's going to be hard to get traction for that: if the community is receptive to it (say we get 80% of the clones/stars we get with TorchRL), then it should probably belong to TorchRL. If the repo has no success, well, why is it there?
We may perhaps move this class to `examples`, but that would redefine the implicit role of `examples`, where we only wish to have scripts to train specific benchmarks at the moment (not re-usable classes that one may want to use in other contexts).

I have created a PR where I propose a refactored version of the Agent class.
The Agent is now very schematic and generic; at a high level it looks like the sketch below.
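This is a reconstruction based on the hook names listed further down, not the exact code from the PR; `register_op` and the internal helpers are illustrative.

```python
class Agent:
    def __init__(self, collector, loss_module, optimizer, optim_steps_per_batch):
        self.collector = collector
        self.loss_module = loss_module
        self.optimizer = optimizer
        self.optim_steps_per_batch = optim_steps_per_batch
        # each hook point holds a list of callables registered by the user
        self._hooks = {name: [] for name in (
            "process_batch", "post_steps", "process_optim_batch",
            "post_loss", "post_optim", "pre_steps_log", "post_steps_log")}

    def register_op(self, dest: str, op):
        self._hooks[dest].append(op)

    def _apply(self, dest, batch):
        # transforming hooks: each op takes the batch and returns a (new) batch
        for op in self._hooks[dest]:
            batch = op(batch)
        return batch

    def _call(self, dest, *args):
        # side-effect hooks: logging, target updates, collector weight sync, ...
        for op in self._hooks[dest]:
            op(*args)

    def train(self):
        for batch in self.collector:
            batch = self._apply("process_batch", batch)
            self._call("pre_steps_log", batch)
            for _ in range(self.optim_steps_per_batch):
                sub_batch = self._apply("process_optim_batch", batch)
                loss = self.loss_module(sub_batch)
                self._call("post_loss", sub_batch)
                loss.backward()
                self.optimizer.step()
                self.optimizer.zero_grad()
                self._call("post_optim")
            self._call("post_steps")
            self._call("post_steps_log", batch)
```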
I have pointed out all the methods that can easily be modified by "hooking" stuff onto them.
As it is broken down into little pieces, we can test each of them independently. One can also add their own if necessary. There are still some constraints on what those methods can do, but I guess it should not be too difficult to overcome those limitations (either in the design of the hooks or in the design of the agent class).
For now, the `make_agent` function still takes care of building this lego castle, but we can also make very specific changes in training scripts that won't force the whole class to comply with a single example.

Examples of hooks that can be registered are:

- `process_batch`: extend the replay buffer with the current data (if there is one), update the normalizing statistics of the reward (if normalized), etc.
- `post_steps`: update the weights of the policy in the collector (if on another process / cuda device / worker etc.), make an annealing step of epsilon-greedy
- `process_optim_batch`: sample from the replay buffer (e.g. DQN) OR sample from the batch of data collected (e.g. PPO)
- `post_loss`: update the priority of the replay buffer with `td_error` or similar
- `post_optim`: a step of the optimizer scheduler or a step of update of the target network
- `pre_steps_log`: log the reward in the collected batch, or log something else from that batch
- `post_steps_log`: execute the recorder (i.e. execute the policy in eval mode to display results without exploration)

Naming -- and everything else! -- is subject to change and suggestion! A small example of registering a few of these hooks is sketched below.
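Using the sketched `Agent` above (hypothetical names and toy bodies, not the PR's exact API), registration could look like this, with `agent` an instance built as in the earlier snippet.

```python
import torch

replay_buffer = []

def extend_replay_buffer(batch):
    # process_batch-style hook: store the freshly collected data
    replay_buffer.append(batch)
    return batch

def sample_minibatch(batch):
    # process_optim_batch-style hook: sub-sample the collected batch (PPO-style)
    idx = torch.randperm(batch["obs"].shape[0])[:32]
    return {key: value[idx] for key, value in batch.items()}

# `agent` is assumed to be an instance of the Agent class sketched earlier.
agent.register_op("process_batch", extend_replay_buffer)
agent.register_op("process_optim_batch", sample_minibatch)
agent.register_op("pre_steps_log", lambda batch: print("mean reward:", batch["reward"].mean().item()))
```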
Also if I missed something let me know.
Anyhow: please, share feedback!
@shagunsodhani @walkacross