# How to Create a Benchmark

To create a benchmark, you need to write a function that returns a `BenchmarkResult` instance.
This object is dictionary-like and holds information about the model's results on the benchmark.

For example, if you submitted an EfficientNet model to an ImageNet benchmark, the instance would contain
information on its performance (Top 1/5 Accuracy), the model name, the name of the dataset and task, and so on.
The object also contains methods for serialising the results to JSON, and server-checking methods that call
the sotabench.com API to check whether the results can be accepted.
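
For instance, a populated instance for that submission might look something like the sketch below - the accuracy values are purely illustrative placeholders, not real results:

```python
from sotabenchapi.core import BenchmarkResult

result = BenchmarkResult(
    task='Image Classification',
    dataset='ImageNet',
    model='EfficientNet-B0',
    results={'Top 1 Accuracy': 0.77, 'Top 5 Accuracy': 0.93},  # placeholder values
)
```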

If you want to see the full API for `BenchmarkResult`, skip to the end of this section.
Otherwise, we will go through a step-by-step example in PyTorch for creating a benchmark.

## The Bare Necessities

Start a new project and make a benchmark file, e.g. `mnist.py`. Begin by writing a skeleton function
as follows:

```python
from sotabenchapi.core import BenchmarkResult

def evaluate_mnist(...) -> BenchmarkResult:

    # your evaluation logic here
    results = {...}  # dict with keys as metric names, values as metric results

    return BenchmarkResult(results=results)
```

This is the core structure of an evaluation method for sotabench: a function that takes in user inputs,
performs some evaluation, and passes the results and other outputs to a `BenchmarkResult` instance. You can build
any benchmark around this format, and take in any inputs you want for your evaluation. It is designed to be flexible.

For example, it could be as simple as taking a JSON file of predictions as an input, if that's all you need, as sketched
below. Or, if you want more information about the model, you could request a model function or class as an input and
pass the data to the model yourself. It is up to you how you want to design your benchmark.
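
As a rough sketch of the first option, a predictions-only benchmark could look something like this (the JSON layout - a dict mapping example ids to class labels - is just an assumption for illustration):

```python
import json

from sotabenchapi.core import BenchmarkResult

def evaluate_from_predictions(predictions_path: str, labels_path: str,
                              model_name: str = None) -> BenchmarkResult:
    # Both files are assumed to be JSON dicts mapping example ids to class labels
    with open(predictions_path) as f:
        predictions = json.load(f)
    with open(labels_path) as f:
        labels = json.load(f)

    # Simple accuracy: the fraction of examples where the prediction matches the label
    correct = sum(1 for key, label in labels.items() if predictions.get(key) == label)
    results = {'Accuracy': correct / len(labels)}

    return BenchmarkResult(results=results, model=model_name)
```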

## Sotabench Metadata

So that benchmark results can be displayed on sotabench.com, your submissions will need metadata about the model name,
the dataset name and the task - for example, "EfficientNet", "ImageNet", "Image Classification".

In the context of your benchmark function:

```python
from sotabenchapi.core import BenchmarkResult

DATASET_NAME = 'MNIST'
TASK = 'Image Classification'

def evaluate_mnist(model_name, ...) -> BenchmarkResult:

    # your evaluation logic here
    results = {...}  # dict with keys as metric names, values as metric results

    return BenchmarkResult(results=results, model=model_name, dataset=DATASET_NAME, task=TASK)
```

Here the dataset name and task name will be fixed for the benchmark, but the model name
can be specified as an input. You can add additional metadata to connect things like the
ArXiv paper id - see the API documentation at the end of this section for more information.
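
For example, to link submissions to their paper, the function above could take an `arxiv_id` argument and pass it through - a sketch in the same skeleton style (the `arxiv_id` keyword also appears in the full example at the end of this section):

```python
def evaluate_mnist(model_name, arxiv_id=None, ...) -> BenchmarkResult:

    # your evaluation logic here
    results = {...}  # dict with keys as metric names, values as metric results

    return BenchmarkResult(results=results, model=model_name, dataset=DATASET_NAME,
                           task=TASK, arxiv_id=arxiv_id)
```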

## Example: An MNIST Benchmark in PyTorch

Let's see how we might make a PyTorch-friendly benchmark that adheres to the framework's abstractions.

The first thing we need for evaluation is a dataset! Let's use the MNIST dataset from the `torchvision` library,
along with a `DataLoader`:

```python
from sotabenchapi.core import BenchmarkResult
from torch.utils.data import DataLoader
import torchvision.datasets as datasets

def evaluate_mnist(data_root: str, batch_size: int = 32, num_workers: int = 4) -> BenchmarkResult:

    dataset = datasets.MNIST(data_root, train=False, download=True)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False, num_workers=num_workers, pin_memory=True)

    return BenchmarkResult(dataset=type(dataset).__name__)
```

We've set `train=False` since we want to use the test split for evaluation. We've also added a `data_root` parameter
so the user can specify where they want the data downloaded.

We should also allow the user to pass in their own transforms, since this is a vision dataset, so
let's modify further:

```python
from sotabenchapi.core import BenchmarkResult
from torch.utils.data import DataLoader
import torchvision.datasets as datasets

def evaluate_mnist(data_root: str, batch_size: int = 32, num_workers: int = 4,
                   input_transform=None, target_transform=None) -> BenchmarkResult:

    dataset = datasets.MNIST(data_root, transform=input_transform, target_transform=target_transform,
                             train=False, download=True)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False, num_workers=num_workers, pin_memory=True)

    return BenchmarkResult(dataset=type(dataset).__name__)
```
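
As a usage sketch, a caller could now pass standard `torchvision` transforms - for example `ToTensor()`, which converts the PIL images that `MNIST` returns by default into tensors the `DataLoader` can batch:

```python
import torchvision.transforms as transforms

evaluate_mnist(data_root='./data', input_transform=transforms.ToTensor())
```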

Great, so we have a dataset set up. Let's now take in a model. We could do this in a number of ways - for example,
we could accept a model function as an input (one that takes in data and outputs predictions). Since we are using PyTorch,
where most models are subclasses of `nn.Module`, let's do it in an object-oriented way by accepting a model object as input:

```python
from sotabenchapi.core import BenchmarkResult
from torch.utils.data import DataLoader
import torchvision.datasets as datasets
from torchbench.utils import send_model_to_device

def evaluate_mnist(model, data_root: str, batch_size: int = 32, num_workers: int = 4,
                   input_transform=None, target_transform=None) -> BenchmarkResult:

    model, device = send_model_to_device(model, device='cuda')
    model.eval()

    dataset = datasets.MNIST(data_root, transform=input_transform, target_transform=target_transform,
                             train=False, download=True)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False, num_workers=num_workers, pin_memory=True)

    return BenchmarkResult(dataset=type(dataset).__name__)
```

Here we have reused a function from `torchbench` for sending the model to a CUDA device, but this is optional - you can
decide how models are processed in your own benchmark however you see fit.
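
If you would rather not depend on `torchbench` for this step, the plain PyTorch equivalent is only a few lines - a sketch of the same idea:

```python
import torch

# Use a GPU if one is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
model.eval()
```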

Now that we have a model and a dataset, let's loop through the data and evaluate the model:

```python
from sotabenchapi.core import BenchmarkResult
from torch.utils.data import DataLoader
import torchvision.datasets as datasets
from torchbench.utils import send_model_to_device, default_data_to_device, AverageMeter, accuracy
import tqdm
import torch

def evaluate_mnist(model, data_root: str, batch_size: int = 32, num_workers: int = 4,
                   input_transform=None, target_transform=None) -> BenchmarkResult:

    model, device = send_model_to_device(model, device='cuda')
    model.eval()

    dataset = datasets.MNIST(data_root, transform=input_transform, target_transform=target_transform,
                             train=False, download=True)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False, num_workers=num_workers, pin_memory=True)

    top1 = AverageMeter()
    top5 = AverageMeter()

    with torch.no_grad():
        for i, (input, target) in enumerate(tqdm.tqdm(loader)):

            input, target = default_data_to_device(input, target, device=device)
            output = model(input)
            prec1, prec5 = accuracy(output, target, topk=(1, 5))
            top1.update(prec1.item(), input.size(0))
            top5.update(prec5.item(), input.size(0))

    results = {'Top 1 Accuracy': top1.avg, 'Top 5 Accuracy': top5.avg}

    return BenchmarkResult(dataset=type(dataset).__name__, results=results)
```

We've used some more utility functions from `torchbench`, but again, you can use whatever you want to do evaluation.
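
For instance, plain PyTorch is enough to compute top-1 accuracy without those helpers - a sketch of just the evaluation loop, assuming the same `model`, `loader` and `device` as above:

```python
correct, total = 0, 0

with torch.no_grad():
    for input, target in loader:
        input, target = input.to(device), target.to(device)
        output = model(input)
        correct += (output.argmax(dim=1) == target).sum().item()
        total += target.size(0)

results = {'Top 1 Accuracy': correct / total}
```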

Either way, we pass a results dictionary into the `BenchmarkResult` object. Great! So we have a function that
takes in a model and evaluates it on a dataset. But how do we connect to sotabench? We need the user to pass
in some metadata about the model name and paper id, and we also need to specify a bit more about our benchmark -
e.g. the task, which in this case is "Image Classification":

```python
from sotabenchapi.core import BenchmarkResult
from torch.utils.data import DataLoader
import torchvision.datasets as datasets
from torchbench.utils import send_model_to_device, default_data_to_device, AverageMeter, accuracy
import tqdm
import torch

def evaluate_mnist(model, data_root: str, batch_size: int = 32, num_workers: int = 4,
                   input_transform=None, target_transform=None, model_name: str = None,
                   arxiv_id: str = None) -> BenchmarkResult:

    model, device = send_model_to_device(model, device='cuda')
    model.eval()

    dataset = datasets.MNIST(data_root, transform=input_transform, target_transform=target_transform,
                             train=False, download=True)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False, num_workers=num_workers, pin_memory=True)

    top1 = AverageMeter()
    top5 = AverageMeter()

    with torch.no_grad():
        for i, (input, target) in enumerate(tqdm.tqdm(loader)):

            input, target = default_data_to_device(input, target, device=device)
            output = model(input)
            prec1, prec5 = accuracy(output, target, topk=(1, 5))
            top1.update(prec1.item(), input.size(0))
            top5.update(prec5.item(), input.size(0))

    results = {'Top 1 Accuracy': top1.avg, 'Top 5 Accuracy': top5.avg}

    return BenchmarkResult(task='Image Classification', dataset=type(dataset).__name__, results=results,
                           model=model_name, arxiv_id=arxiv_id)
```

And you're set! The task string connects to the taxonomy on sotabench, and the rest gives context to the
result - for example the model's name and the paper it comes from.

The final step is to publish this as a PyPI library. This will enable your users to write a `sotabench.py` file
that imports your benchmark and passes their model and other parameters into it. When they connect their repository
to sotabench.com, the site will download your library, evaluate their model with it, and then publish the results
to your benchmark page.
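
For example, if you published the benchmark above as a package, a user's `sotabench.py` might look something like this (a sketch - `mnistbench` and `MyModel` are hypothetical names standing in for your published package and the user's model):

```python
import torchvision.transforms as transforms

from mnistbench import evaluate_mnist  # hypothetical package name
from mymodels import MyModel           # hypothetical user model

evaluate_mnist(
    MyModel(),
    data_root='./data',
    input_transform=transforms.ToTensor(),
    model_name='MyModel',
)
```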

## Other Examples

The [torchbench](https://www.github.com/paperswithcode/torchbench) library is a good reference for benchmark implementations,
which you can base your own benchmarks on.

## API for BenchmarkResult

```eval_rst

.. automodule:: sotabenchapi.core.results
    :members:
```