Dear maintainers, I have found that the softmax temperature used in the master branch when expanding the next state differs from what is stated in your publication and from the original code.
Instead of setting temperature = 0 to use the deterministic (argmax) version of the softmax, the parameter is set to the same value as the one used by the planner. In my experiments this led to the policy not learning in the Atari environment, since random actions were inserted into the replay buffer.
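
To illustrate the difference, here is a minimal, self-contained sketch of temperature-scaled softmax action selection (not the repository's code): with a positive temperature the action is sampled from the softmax, while temperature = 0 reduces to a deterministic argmax, so no random actions are produced.

```python
import numpy as np

def softmax_policy(logits, temperature):
    """Action probabilities for the given logits.

    temperature > 0  -> temperature-scaled softmax (stochastic policy)
    temperature == 0 -> deterministic limit: all probability on the argmax
    """
    logits = np.asarray(logits, dtype=np.float64)
    if temperature == 0:
        probs = np.zeros_like(logits)
        probs[np.argmax(logits)] = 1.0
        return probs
    scaled = logits / temperature
    scaled -= scaled.max()          # shift for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

rng = np.random.default_rng(0)
logits = [2.0, 1.0, 0.5]

# temperature = 0: the same (best) action is chosen every time
greedy = [rng.choice(3, p=softmax_policy(logits, 0)) for _ in range(5)]
# temperature > 0: non-greedy actions are sampled occasionally,
# which is how random actions end up in the replay buffer
sampled = [rng.choice(3, p=softmax_policy(logits, 1.0)) for _ in range(5)]
print(greedy, sampled)
```
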
The issue can be fixed by setting the temp parameter on line 94 of piIW_alphazero.py to 0.
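
For clarity, a hypothetical sketch of the change; the names below are assumptions for illustration and do not reflect the actual contents of line 94:

```python
# Hypothetical illustration only -- the real code on line 94 looks different.
# before: the planner's sampling temperature is reused when expanding the next state
#     expand_temp = tree_policy_temp   # > 0, so the next state is chosen stochastically
# after: the deterministic setting described in the publication
#     expand_temp = 0                  # argmax, no random actions reach the replay buffer
```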