This repo contains the evaluation code for the paper "SciCode: A Research Coding Benchmark Curated by Scientists".

## 🔔News

**[2025-02-17]: The SciCode benchmark is available at [HuggingFace Datasets](https://huggingface.co/datasets/Zilinghan/scicode)!**

**[2025-02-01]: Results for DeepSeek-R1, DeepSeek-V3, and OpenAI o3-mini have been added.**

**[2025-01-24]: SciCode has been integrated with [`inspect_ai`](https://inspect.ai-safety-institute.org.uk/) for easier and faster model evaluations.**
## Instructions to evaluate a new model using `inspect_ai` (recommended)

SciCode has been integrated with `inspect_ai` for easier and faster model evaluation. To evaluate a new model, run the following steps:

1. Clone this repository: `git clone git@github.com:scicode-bench/SciCode.git`
2. Install the `scicode` package with `pip install -e .`
3. Download the [numeric test results](https://drive.google.com/drive/folders/1W5GZW6_bdiDAiipuFMqdUhvUaHIj6-pR?usp=drive_link) and save them as `./eval/data/test_data.h5`
4. Go to the `eval/inspect_ai` directory, set up the corresponding API key, and run the evaluation command, as sketched below
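For example (the task file name, model identifier, and flags here are illustrative assumptions, not the repo's exact command; see the `eval/inspect_ai` readme for the authoritative version):

```bash
cd eval/inspect_ai
export OPENAI_API_KEY=<your-openai-api-key>   # or the key variable for your chosen provider

# Illustrative sketch: task file, model identifier, and flags are assumptions
inspect eval scicode.py --model openai/gpt-4o --temperature 0
```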
💡 For more detailed information on using `inspect_ai`, see the [`eval/inspect_ai` readme](eval/inspect_ai/).

## Instructions to evaluate a new model in two steps (deprecated)

Note that this is a deprecated way of evaluating models; using `inspect_ai` is the recommended approach. Please use this method only if `inspect_ai` does not work for your needs. Run the first three steps in the section above, then run the following two commands:
4. Run `eval/scripts/gencode.py` to generate new model outputs (see the [`eval/scripts` readme](eval/scripts/) for more information)
5. Run `eval/scripts/test_generated_code.py` to evaluate the generated code against the unit tests, as sketched below
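For example (the model name is a placeholder and the flag names follow the options documented in the `eval/scripts` readme; treat this as a sketch rather than the exact commands):

```bash
# Step 4: generate model outputs for the chosen split (placeholder model name)
python eval/scripts/gencode.py --model gpt-4o --split test

# Step 5: run the unit tests against the generated code
python eval/scripts/test_generated_code.py --model gpt-4o
```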
Please note that we do not plan to release the ground truth code for each problem to the public. However, we have made a dev set available that includes the ground truth code in `eval/data/problems_dev.jsonl`.
### Command-Line Arguments

When running the `gencode.py` script, you can use the following options:

- `--model`: Specifies the model name to be used for generating code (e.g., `gpt-4o` or `litellm/together_ai/meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo`).
- `--split`: Specifies which problem split (either `validation` or `test`) to run on.
- `--output-dir`: Directory where the generated code outputs will be saved. Default is `eval_results/generated_code`.
- `--prompt-dir`: Directory where prompt files are saved. Default is `eval_results/prompt`.
- `--with-background`: If enabled, includes the scientist-annotated problem background when generating code.
- `--temperature`: Controls the randomness of the output. Default is 0.
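For illustration, a typical invocation from the repository root might look like this (the model name is a placeholder; the flags are the ones documented above, and any defaults you omit are filled in automatically):

```bash
# Sketch only: placeholder model name, flags as documented above
python eval/scripts/gencode.py \
    --model gpt-4o \
    --split validation \
    --output-dir eval_results/generated_code \
    --prompt-dir eval_results/prompt \
    --with-background \
    --temperature 0
```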
To evaluate the generated code using a specific model, go to the root of this repo and use the following command:
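A sketch of that command, assuming the script takes the model via a `--model` flag as described in the note below:

```bash
# "model_name" is a literal placeholder; see the note below
python eval/scripts/test_generated_code.py --model "model_name"
```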
Replace `"model_name"` with the appropriate model name, and include `--with-background` if the code is generated with **scientist-annotated background**.