Replicate the scores for Table 2

Hello Authors of SmartPlay,

Thank you for providing this nice testbed.
I am trying to replicate the scores in Table 2, following the environment settings in your Git repo.
For example, for the RockPaperScissorBasic (RPS) game:
challenges:
  all:
    Error/Mistake Handling: 1
    Generalization: 2
    Instruction Following: 3
    Learning from Interactions: 3
    Long Text Understanding: 2
    Planning: 1
    Understanding the Odds: 3
    Reasoning: 1
    Spatial Reasoning: 1
recorded settings:
  RockPaperScissorBasic: 
    iter: 20
    steps: 50
    human score: 43
    min score: 0
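
To make clear how I read these settings, here is a minimal sketch of the normalization I am assuming: the raw score is rescaled so that `min score` maps to 0 and `human score` maps to 1. This formula is my assumption about how Table 2 is computed, and `normalize_score` is a hypothetical helper name, not a function from the repo.

```python
def normalize_score(raw: float, human: float, min_score: float) -> float:
    """Rescale a raw score so min_score maps to 0 and the human baseline maps to 1.

    Assumed normalization; the paper or repo may compute this differently.
    """
    if human == min_score:
        raise ValueError("human score and min score must differ")
    return (raw - min_score) / (human - min_score)


# Values taken from the recorded settings for RockPaperScissorBasic above.
RPS_HUMAN, RPS_MIN = 43, 0

# Raw score that a normalized 0.70 would correspond to under this assumption.
implied_raw = 0.70 * (RPS_HUMAN - RPS_MIN) + RPS_MIN
print(round(implied_raw, 1))
print(round(normalize_score(implied_raw, RPS_HUMAN, RPS_MIN), 2))
```

If the normalization in the repo differs from this, that alone could explain part of the gap.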

I ran with GPT-4, but the scores I get for RPS and Hanoi are 0.70 and 0.30, respectively, which differ from the values reported in Table 2:

Model          RPS     Hanoi
GPT-4-0613     0.91    0.83
GPT-4-0314     0.98    0.90


Could you please share more details about the LLM inference parameters, such as temperature, top_p, and frequency_penalty?
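
As a sketch of what I would like to pin down, the snippet below records the sampling parameters for a run so they can be compared directly. The values are placeholders only, not the settings used in the paper, and `describe_run` is a hypothetical helper of mine.

```python
import json

# Placeholder values only: these are NOT the settings used in the paper.
sampling_params = {
    "model": "gpt-4-0613",
    "temperature": 0.0,
    "top_p": 1.0,
    "frequency_penalty": 0.0,
}


def describe_run(params: dict) -> str:
    """Serialize the sampling parameters in a stable key order for logging."""
    return json.dumps(params, sort_keys=True)


print(describe_run(sampling_params))
```

Logging this string alongside each score would make it easy to see which parameter differs between my runs and yours.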

Hopefully I can use them to replicate the scores in Table 2 of the paper.

Thank you.

Replicate the scores for Table 2 #29
