QValueActor potential new features and improvements. #2012

dtsaras · 2024-03-16T16:46:51Z

I have been going through the code of QValueActor and would like the option of an exploration mode with a temperature parameter. Rather than always selecting the max-value action, in exploration mode we could sample the action from a distribution weighted by the Q-Values.
Another point: if a user provides an action_mask, the action values returned for the illegal actions are set to the minimum of the given data type. Maybe the user should be able to choose whether the log-probabilities are altered or not. It is easy to mask the logits afterwards, but impossible to recover the original logits from the masked ones.
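
A minimal sketch of the two points above in plain PyTorch, not the QValueActor API: the helper name `sample_masked_softmax` is hypothetical, it assumes 2D inputs of shape (batch, n_actions) and at least one legal action per row. The mask is applied to a temporary tensor, so the caller keeps the original action values.

```python
import torch


def sample_masked_softmax(action_values, action_mask, temperature=1.0):
    """Sample one-hot actions from softmax(Q / temperature) over legal actions.

    Hypothetical helper: ``action_values`` is never modified, so the
    unmasked values stay available to the caller.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Illegal actions get -inf logits, i.e. zero probability after softmax.
    masked_logits = (action_values / temperature).masked_fill(
        ~action_mask, float("-inf")
    )
    probs = torch.softmax(masked_logits, dim=-1)
    # torch.multinomial returns sampled indices of shape (batch, 1).
    index = torch.multinomial(probs, num_samples=1).squeeze(-1)
    action = torch.nn.functional.one_hot(index, action_values.shape[-1])
    return action, probs


torch.manual_seed(0)
q_values = torch.tensor([[1.0, 2.0, 3.0], [0.5, 0.1, 0.2]])
mask = torch.tensor([[True, True, False], [False, True, True]])
action, probs = sample_masked_softmax(q_values, mask, temperature=0.5)
```

Masking on a copy is what keeps the operation reversible: the sampled action never lands on an illegal entry, while `q_values` still holds the raw values.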

I can help with the implementation if needed.

albertbou92 · 2024-03-18T15:32:55Z

I was toying with DQN a while ago and also needed to select the actions based on the weight of the Q-Values. This is what I put together based on https://github.com/pytorch/rl/blob/main/torchrl/modules/tensordict_module/exploration.py#L31:

from typing import Optional

import torch

from tensordict import TensorDictBase
from tensordict.nn import TensorDictModuleBase
from tensordict.utils import NestedKey

from torchrl.data.tensor_specs import CompositeSpec, TensorSpec
from torchrl.envs.utils import exploration_type, ExplorationType


class SoftmaxSamplingModule(TensorDictModuleBase):
    """Softmax exploration module.

    This module randomly selects the action(s) from a distribution created by applying a softmax transformation to the
    tensor stored under ``logits_key``, whose last dimension must have a size equal to the number of actions.

    Args:
        action_spec (TensorSpec, optional): the spec of the action to sample. Default is ``None``.
        logits_key (NestedKey, optional): the key where the action values can be found in the input tensordict.
            Default is ``"action_value"``.
        action_key (NestedKey, optional): the key where the sampled action is written in the tensordict.
            Default is ``"action"``.
    """

    def __init__(
            self,
            action_spec: Optional[TensorSpec] = None,
            logits_key: Optional[NestedKey] = "action_value",
            action_key: Optional[NestedKey] = "action",
    ):
        self.action_key = action_key
        self.logits_key = logits_key
        self.in_keys = [self.logits_key]
        self.out_keys = [self.action_key, "chosen_action_value"]
        super().__init__()

        if action_spec is not None:
            if not isinstance(action_spec, CompositeSpec) and len(self.out_keys) >= 1:
                action_spec = CompositeSpec(
                    {action_key: action_spec}, shape=action_spec.shape[:-1]
                )
        self._spec = action_spec

    @property
    def spec(self):
        return self._spec

    def forward(self, tensordict: TensorDictBase, temperature=1.0) -> TensorDictBase:
        """Computes the softmax distribution and samples an action from it."""
        if exploration_type() == ExplorationType.RANDOM or exploration_type() is None:

            # Ensure numeric stability by subtracting the maximum Q-value
            logits = tensordict.get(self.logits_key)
            max_logits, _ = torch.max(logits, dim=-1, keepdim=True)
            exp_values = torch.exp((logits - max_logits) / temperature)

            # Calculate probabilities using the softmax function
            probabilities = exp_values / torch.sum(exp_values, dim=-1, keepdim=True)

            # Sample an action according to the probabilities
            dist = torch.distributions.one_hot_categorical.OneHotCategorical(
                probs=probabilities
            )
            out = dist.sample()

            tensordict.set(self.action_key, out)
            tensordict.set("chosen_action_value", torch.sum(out * logits, dim=-1))

        return tensordict

It needs to be polished and does not implement the action_mask feature, but maybe it can serve as a base. If you want, I can create a PR, or you can give it a shot yourself!
The idea is that it can simply replace the EGreedyModule in https://github.com/pytorch/rl/blob/main/examples/dqn/dqn_cartpole.py
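
To get a feel for the `temperature` argument of the module above, here is a small illustration in plain PyTorch. The function name `softmax_policy` and the example values are made up for this sketch; it uses `torch.softmax`, which computes the same distribution as the manual max-subtraction in `forward`.

```python
import torch


def softmax_policy(values, temperature):
    # Same distribution as exp((q - max q) / T) normalised over the last dim.
    return torch.softmax(values / temperature, dim=-1)


q_values = torch.tensor([1.0, 2.0, 4.0])

cold = softmax_policy(q_values, 0.05)    # low temperature: close to argmax
neutral = softmax_policy(q_values, 1.0)  # plain softmax over the Q-values
hot = softmax_policy(q_values, 100.0)    # high temperature: close to uniform
```

A low temperature concentrates almost all probability on the highest-valued action, while a high temperature flattens the distribution towards uniform sampling.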

3 replies

dtsaras Mar 19, 2024
Author

Thanks for sharing. I have a similar implementation, but I think it would be a nice improvement to the QValueActor class if we could wrap the policy_module in a QValueActor and optionally pass some exploration functions, which would be called according to the exploration type, rather than wrapping the module in a different module every time.

vmoens Mar 19, 2024
Collaborator

I think there are different things going on here:

Using Q-values as log-probabilities (softmax idea) is essentially another map from value space to action space. We could have an arg in the constructor like action_map=softmax.

actor = QValueActor(..., action_map="softmax", temperature=1.0)

Now the problem with this is that it's rather "static" and indeed ExplorationType would be the right tool to handle the change in strategy from training to test time:

actor = QValueActor(...)
with set_exploration_type(ExplorationType.SOFTMAX):
    actor(data)

Problems to address: what if people add an exploration module after that? E.g. EGreedy, which expects a RANDOM or MEAN exploration type? Maybe we want to discourage that, but that seems a bit too opinionated for my taste.
An intermediate approach would be:

actor = QValueActor(..., action_map="softmax", temperature=1.0)
with set_exploration_type(ExplorationType.RANDOM):
    actor(data) # Uses softmax
with set_exploration_type(ExplorationType.MODE):
    actor(data) # Uses argmax

actor = QValueActor(..., action_map="argmax")  # This is also the default behaviour
with set_exploration_type(ExplorationType.RANDOM):
    actor(data) # Uses argmax
with set_exploration_type(ExplorationType.MODE):
    actor(data) # Uses argmax
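
The behaviour of this intermediate approach can be sketched as a standalone function. This is not an existing QValueActor option: `select_action` and the `Exploration` enum are hypothetical stand-ins (the enum replaces torchrl's `ExplorationType` to keep the sketch self-contained), and 2D inputs of shape (batch, n_actions) are assumed.

```python
from enum import Enum

import torch


class Exploration(Enum):
    # Stand-in for torchrl's ExplorationType (assumption for this sketch).
    RANDOM = "random"
    MODE = "mode"


def select_action(
    action_values, action_map="argmax", exploration=Exploration.MODE, temperature=1.0
):
    """Return action indices; only softmax + RANDOM actually samples."""
    if action_map not in ("argmax", "softmax"):
        raise ValueError(f"unknown action_map: {action_map!r}")
    if action_map == "softmax" and exploration is Exploration.RANDOM:
        probs = torch.softmax(action_values / temperature, dim=-1)
        return torch.multinomial(probs, num_samples=1).squeeze(-1)
    # Every other combination falls back to the greedy action.
    return action_values.argmax(dim=-1)
```

With this split, `action_map` fixes what the actor may do in exploration mode, and the exploration type decides whether it does so on a given call.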

You mentioned in a chat the possibility of adding noise to the action values. I think this is easily implemented as a separate TDModule, so IMO we should not make it an option of QValueActor but add it to the stack of tools available. Maybe at a later stage we could build a recipe to combine things together for a given algo (e.g. alpha_zero_qvalue(value_net, **kwargs) would give you the actor packaged as required).
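
As a rough sketch of such a separate noise module, assuming zero-mean Gaussian noise is what is wanted: the class name `ActionValueNoise` and the `sigma` parameter are hypothetical, and the module works on raw tensors rather than tensordicts.

```python
import torch
from torch import nn


class ActionValueNoise(nn.Module):
    """Adds zero-mean Gaussian noise to action values (hypothetical helper)."""

    def __init__(self, sigma=0.1):
        super().__init__()
        self.sigma = sigma

    def forward(self, action_values):
        # Only perturb in training mode; evaluation sees the raw values.
        if not self.training or self.sigma == 0:
            return action_values
        return action_values + self.sigma * torch.randn_like(action_values)


noise = ActionValueNoise(sigma=0.5)
```

One way it might be plugged into a tensordict pipeline is `TensorDictModule(noise, in_keys=["action_value"], out_keys=["action_value"])`, placed before the action is selected; that wiring is an assumption and is not tested here.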

dtsaras Mar 19, 2024
Author

@vmoens What I had in mind is a method where a user can register a custom exploration type:

register_exploration_type(ExplorationType.CUSTOM)
actor = QValueActor(...)
action_strategy = torch.argmax
actor.register_action_strategy(ExplorationType.CUSTOM, action_strategy)

I find this approach somewhat intuitive.
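
A minimal sketch of what such a registry could look like, independent of torchrl. Everything here is hypothetical: `StrategyRegistry`, the `Exploration` enum and the two strategy functions only illustrate the lookup, and plain Python sequences are used instead of tensors.

```python
from enum import Enum
from typing import Callable, Dict, Sequence

Strategy = Callable[[Sequence[float]], int]


class Exploration(Enum):
    # Stand-in for an extensible ExplorationType (assumption for this sketch).
    MODE = "mode"
    RANDOM = "random"
    CUSTOM = "custom"


class StrategyRegistry:
    """Maps exploration types to action-selection callables."""

    def __init__(self, default: Strategy) -> None:
        self._default = default
        self._strategies: Dict[Exploration, Strategy] = {}

    def register_action_strategy(self, exploration: Exploration, strategy: Strategy) -> None:
        self._strategies[exploration] = strategy

    def select(self, action_values: Sequence[float], exploration: Exploration) -> int:
        # Unregistered exploration types fall back to the default strategy.
        strategy = self._strategies.get(exploration, self._default)
        return strategy(action_values)


def greedy(values: Sequence[float]) -> int:
    return max(range(len(values)), key=values.__getitem__)


def least_valued(values: Sequence[float]) -> int:
    return min(range(len(values)), key=values.__getitem__)


registry = StrategyRegistry(default=greedy)
registry.register_action_strategy(Exploration.CUSTOM, least_valued)
```

The fallback to a default strategy keeps existing exploration types working when only a custom one has been registered.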
