TestGenEval consists of 1,210 code–test file pairs from 11 large, well-maintained repositories (3,523 to 78,287 stars). We use these file pairs to construct two testing tasks: 1) unit test completion for the first, last, and additional tests and 2) full-file unit test generation. The benchmark is easy to run and extend: we provide docker containers for each version of each repository with coverage and mutation testing dependencies installed. For both tasks we use execution-based metrics, including pass@1 and pass@5, along with code coverage improvement and mutation score improvement relative to the gold (human-written) tests. Code and test files in TestGenEval are long (on average 782 LOC per code file and 677 LOC per test file) and have high coverage (median coverage of 60.4%).
We measure the following metrics for the test completion task:
- pass@k (k = 1, 5); see the estimator sketch after this list
- coverage improvement (how much the generated test improves the existing suite's coverage)
- coverage improvement@pass (coverage improvement averaged only over passing tests)
- average pass@5
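pass@k is typically computed with the standard unbiased estimator from Chen et al. (2021); the sketch below illustrates that estimator, though whether TestGenEval aggregates pass@1/pass@5 in exactly this way is an assumption here.

```python
# Sketch of the standard unbiased pass@k estimator (Chen et al., 2021).
# n = number of generated test completions for a problem, c = number that pass.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 5 generations for one completion problem, 2 of which pass.
print(pass_at_k(n=5, c=2, k=1))  # 0.4
print(pass_at_k(n=5, c=2, k=5))  # 1.0
```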
We measure the following metrics for the test generation task:
- pass@1
- all pass@1 (all tests generated in the suite pass)
- coverage (coverage of generated tests)
- coverage@pass (coverage of generated tests, for passing examples only)
- mutation score (mutation score of generated tests)
- mutation score@pass (mutation score of generated tests, for passing examples only; see the aggregation sketch after this list)
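The @pass variants restrict the average to examples whose generated test suite passes. A minimal sketch of that aggregation is shown below; the field names are illustrative, not the benchmark's actual report schema.

```python
# Sketch: aggregating coverage / mutation score and their @pass variants.
# Field names ("passed", "coverage", "mutation_score") are illustrative only.
def aggregate(results: list[dict]) -> dict[str, float]:
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    passing = [r for r in results if r["passed"]]
    return {
        "coverage": mean([r["coverage"] for r in results]),
        "coverage@pass": mean([r["coverage"] for r in passing]),
        "mutation score": mean([r["mutation_score"] for r in results]),
        "mutation score@pass": mean([r["mutation_score"] for r in passing]),
    }
```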
Docker images for the testbeds used in the TestGenEvalLite and TestGenEval datasets have been built and tested.
To set up the repository, run:

```bash
git clone git@github.com:facebookresearch/testgeneval.git
cd testgeneval
conda env create -f testgeneval.yaml
conda activate testgeneval
```
Modify the .env_template file with the appropriate values and rename it to .env. In particular, make sure SWEBENCH_DOCKER_FORK_DIR is set to the directory where you cloned this repository.

This environment setup is important; make sure you complete it before building images or running the pipeline.
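As a quick sanity check (a sketch only, assuming python-dotenv is available in the conda environment), you can verify the variable is picked up before building or running anything:

```python
# Sketch: verify .env is in place before building images or running the pipeline.
# SWEBENCH_DOCKER_FORK_DIR is the only variable this README names explicitly.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
fork_dir = os.environ.get("SWEBENCH_DOCKER_FORK_DIR")
assert fork_dir and os.path.isdir(fork_dir), (
    "Set SWEBENCH_DOCKER_FORK_DIR in .env to the directory where testgeneval was cloned"
)
print(f"SWEBENCH_DOCKER_FORK_DIR = {fork_dir}")
```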
To build the docker images locally (adapted from SWEBench Docker), run one of these commands:

TestGenEvalLite (for faster evaluation):

```bash
make -f Makefile.testgenevallite
```

TestGenEval (the full benchmark; building takes several hours to a full day):

```bash
make -f Makefile.testgeneval
```
Alternatively, you can simply run with the prebuilt images pushed to Dockerhub.

To pull all images (TestGenEval), run:

```bash
python scripts/pull_images.py --makefile Makefile.testgeneval
```

To pull the lite images (TestGenEvalLite), run:

```bash
python scripts/pull_images.py --makefile Makefile.testgenevallite
```
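For intuition, the pull step roughly amounts to the sketch below (pull every image a Makefile references from a Dockerhub namespace); this is not the actual implementation of scripts/pull_images.py, and the Makefile-parsing assumption is illustrative only.

```python
# Illustrative sketch only: pull the docker images referenced in a Makefile
# from a Dockerhub namespace. The real scripts/pull_images.py may parse the
# Makefile and name images differently.
import re
import subprocess

def pull_images(makefile: str, namespace: str = "kdjain") -> None:
    text = open(makefile).read()
    # Assumption: image tags appear on `docker build ... -t <image>` lines.
    images = sorted(set(re.findall(r"-t\s+(\S+)", text)))
    for image in images:
        remote = f"{namespace}/{image.split('/')[-1]}"
        subprocess.run(["docker", "pull", remote], check=True)

pull_images("Makefile.testgenevallite")
```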
The TestGenEval datasets (full and lite) are available on Hugging Face; the examples below use kjain14/testgenevallite.
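For instance, the lite dataset used in the commands below can be inspected directly with the Hugging Face datasets library:

```python
# Load TestGenEvalLite from Hugging Face to inspect its splits and columns.
from datasets import load_dataset

ds = load_dataset("kjain14/testgenevallite")
print(ds)
```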
Running TestGenEval is relatively simple: a single Python script, run_pipeline.py, runs both inference (generating tests) and evaluation.
If you built docker images locally:

```bash
python run_pipeline.py \
    --results_dir results \
    --dataset_name_or_path kjain14/testgenevallite \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct
```
Otherwise, to pull images from Dockerhub:

```bash
python run_pipeline.py \
    --results_dir results \
    --dataset_name_or_path kjain14/testgenevallite \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
    --namespace kdjain
```
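To sweep several models, the same command can be scripted; the flags below are exactly those documented above, and the model list is only an example.

```python
# Run the pipeline for several models in sequence, reusing the flags shown
# above. Drop --namespace if you built the docker images locally.
import subprocess

MODELS = [
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    # add more models here
]

for model in MODELS:
    subprocess.run(
        [
            "python", "run_pipeline.py",
            "--results_dir", "results",
            "--dataset_name_or_path", "kjain14/testgenevallite",
            "--model", model,
            "--namespace", "kdjain",
        ],
        check=True,
    )
```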
Adding a new model is quite simple. Under inference/configs, create a new file with the system prompts and the function that adds prompts to the dataset.

add_prompts_to_dataset should output a prompt for each of the four settings: full, first, last, and extra. The preds_context attribute of each datapoint contains the preamble of the file, the first test, the file without the last test, and the full file (including the last test).

Once this file is in place, the standard evaluation flow will work.
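As a rough illustration only (the function signature, preds_context key names, and prompt wording below are assumptions; follow an existing file in inference/configs for the real interface):

```python
# Hypothetical config under inference/configs/ for a new model.
# The add_prompts_to_dataset signature, preds_context key names, and prompt
# text are illustrative assumptions, not the repository's actual interface.
SYSTEM_PROMPT = "You are an expert Python test engineer."

def add_prompts_to_dataset(dataset):
    def build_prompts(datum):
        ctx = datum["preds_context"]  # preamble, first test, file w/o last test, full file
        datum["prompt_full"] = f"{SYSTEM_PROMPT}\nWrite a complete test suite for the code under test."
        datum["prompt_first"] = f"{SYSTEM_PROMPT}\nComplete the first test:\n{ctx['preamble']}"
        datum["prompt_last"] = f"{SYSTEM_PROMPT}\nAdd the final test:\n{ctx['last_minus_one']}"
        datum["prompt_extra"] = f"{SYSTEM_PROMPT}\nAdd an additional test:\n{ctx['full']}"
        return datum

    return dataset.map(build_prompts)  # assumes a Hugging Face datasets Dataset
```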
All creation scripts are housed in the creation subdirectory.

transform_swebench.py is the main script; it takes the SWEBench dataset and converts it for test generation.

filter_unittests.py takes the baseline results and filters out datapoints whose gold tests have no coverage (the gold tests must cover the code under test).
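The filtering criterion amounts to the check sketched below (field names are illustrative):

```python
# Sketch of the filtering idea in filter_unittests.py: keep only datapoints
# whose gold (human-written) tests achieve nonzero coverage of the code under
# test. Field names ("id", plus the gold_coverage mapping) are illustrative.
def filter_uncovered(datapoints: list[dict], gold_coverage: dict[str, float]) -> list[dict]:
    return [d for d in datapoints if gold_coverage.get(d["id"], 0.0) > 0.0]
```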
The majority of the code in this repository is licensed under CC-BY-NC; however, third-party code and files may be subject to different licenses.