Dear maintainers, I have found that the softmax temperature used in the master branch when expanding the next state differs from what is stated in your publication and from the original code.
Instead of setting temperature = 0 to use the deterministic (argmax) version of the softmax, the parameter is set to the same value as the one used by the planner. In my experiments this led to the policy not learning in the Atari environment, since random actions were inserted into the replay buffer.
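
To illustrate the difference, here is a minimal, self-contained sketch of temperature-scaled softmax action selection (not the repository's code): with a positive temperature the action is sampled from the softmax, while temperature = 0 reduces to a deterministic argmax, so no random actions are produced.

```python
import numpy as np

def softmax_policy(logits, temperature):
    """Action probabilities for the given logits.

    temperature > 0  -> temperature-scaled softmax (stochastic policy)
    temperature == 0 -> deterministic limit: all probability on the argmax
    """
    logits = np.asarray(logits, dtype=np.float64)
    if temperature == 0:
        probs = np.zeros_like(logits)
        probs[np.argmax(logits)] = 1.0
        return probs
    scaled = logits / temperature
    scaled -= scaled.max()          # shift for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

rng = np.random.default_rng(0)
logits = [2.0, 1.0, 0.5]

# temperature = 0: the same (best) action is chosen every time
greedy = [rng.choice(3, p=softmax_policy(logits, 0)) for _ in range(5)]
# temperature > 0: non-greedy actions are sampled occasionally,
# which is how random actions end up in the replay buffer
sampled = [rng.choice(3, p=softmax_policy(logits, 1.0)) for _ in range(5)]
print(greedy, sampled)
```
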
The issue can be fixed by setting the temp parameter on line 94 of piIW_alphazero.py to 0.
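
For clarity, a hypothetical sketch of the change; the names below are assumptions for illustration and do not reflect the actual contents of line 94:

```python
# Hypothetical illustration only -- the real code on line 94 looks different.
# before: the planner's sampling temperature is reused when expanding the next state
#     expand_temp = tree_policy_temp   # > 0, so the next state is chosen stochastically
# after: the deterministic setting described in the publication
#     expand_temp = 0                  # argmax, no random actions reach the replay buffer
```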