-
Notifications
You must be signed in to change notification settings - Fork 16
Description
Hello Authors of SmartPlay,
Thank you for providing this nice testbed.
I am trying to replicate the scores for Table 2, follow your env setting on git repo.
eg. For RockPaperScissorBasic (RPS) game
challenges:
all:
Error/Mistake Handling: 1
Generalization: 2
Instruction Following: 3
Learning from Interactions: 3
Long Text Understanding: 2
Planning: 1
Understanding the Odds: 3
Reasoning: 1
Spatial Reasoning: 1
recorded settings:
RockPaperScissorBasic:
iter: 20
steps: 50
human score: 43
min score: 0
I run with GPT-4, but the score i get for
RPS and Hanoi is 0.70 and 0.30
which is different from Table 2
GPT-4-0613 0.91 0.83
GPT-4-0314 0.98 0.90
Could you please share more details regarding LLM inference parameters. The temperature, top_p, frequency_penalty.
Hopefully I could use them to replicate your score on the paper Table2.
Thank you.