
Commit 524d93c

Update docs

1 parent f87a65d commit 524d93c

File tree

5 files changed: +251 -28 lines changed


docs/01.create-benchmark.md

+227
@@ -0,0 +1,227 @@
# How to Create a Benchmark

To create a benchmark, you need to write a function that returns a `BenchmarkResult` instance.
This object is dictionary-like and holds information about the model's results on the benchmark.

For example, if you submitted an EfficientNet model to an ImageNet benchmark, the instance would contain information on its performance (Top 1/5 Accuracy), the model name, the name of the dataset and task, and so on. The object also contains methods for serialising the results to JSON, and server-checking methods that call the sotabench.com API to check whether the results can be accepted.

If you want to see the full API for `BenchmarkResult`, skip to the end of this section.
Otherwise, we will go through a step-by-step example in PyTorch for creating a benchmark.

## The Bare Necessities

Start a new project and make a benchmark file, e.g. `mnist.py`. Begin by writing a skeleton function
as follows:

```python
from sotabenchapi.core import BenchmarkResult

def evaluate_mnist(...) -> BenchmarkResult:

    # your evaluation logic here
    results = {...}  # dict with keys as metric names, values as metric results

    return BenchmarkResult(results=results)
```

This is the core structure of an evaluation method for sotabench: we have a function that takes in user inputs,
we do some evaluation, and we pass the results and other outputs to a `BenchmarkResult` instance. You can write essentially any benchmark around this format,
and take in any input that you want for your evaluation. It is designed to be flexible.

For example, your benchmark could be as simple as taking a JSON of predictions as an input, if that's all you need. Or, if you
want more information about the model, you could request a model function or class as an input and pass the data to the
model yourself. It is up to you how you want to design your benchmark.
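For instance, a predictions-based benchmark might look something like the following (a minimal sketch; the `load_expected_labels` helper and the accuracy metric are hypothetical placeholders, not part of the `sotabenchapi` API):

```python
import json

from sotabenchapi.core import BenchmarkResult


def evaluate_from_predictions(predictions_path: str, model_name: str = None) -> BenchmarkResult:
    # Load a JSON file mapping example ids to predicted labels
    with open(predictions_path) as f:
        predictions = json.load(f)

    # Compare against ground-truth labels, loaded however your benchmark needs
    # (`load_expected_labels` is a hypothetical helper you would write yourself)
    expected = load_expected_labels()
    correct = sum(1 for key, label in predictions.items() if expected.get(key) == label)
    results = {'Accuracy': correct / len(expected)}

    return BenchmarkResult(results=results, model=model_name)
```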
## Sotabench Metadata

So that benchmark results can be displayed on sotabench.com, your submissions will need metadata about the model name,
the dataset name and the task. For example, "EfficientNet", "ImageNet", "Image Classification".

In the context of your benchmark function:

```python
from sotabenchapi.core import BenchmarkResult

DATASET_NAME = 'ImageNet'
TASK = 'Image Classification'

def evaluate_mnist(model_name, ...) -> BenchmarkResult:

    # your evaluation logic here
    results = {...}  # dict with keys as metric names, values as metric results

    return BenchmarkResult(results=results, model=model_name, dataset=DATASET_NAME, task=TASK)
```

Here the dataset name and task name are fixed for the benchmark, but the model name
can be specified as an input. You can add additional metadata to connect things like the
ArXiv paper id - see the API documentation at the end of this section for more information.
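For instance, a benchmark could also accept a paper id as an optional argument and pass it along (a sketch using only the `arxiv_id` argument that appears in the full example later in this guide):

```python
from sotabenchapi.core import BenchmarkResult

DATASET_NAME = 'ImageNet'
TASK = 'Image Classification'


def evaluate_imagenet(model_name: str = None, arxiv_id: str = None) -> BenchmarkResult:
    # your evaluation logic here
    results = {'Top 1 Accuracy': 0.0}  # placeholder metrics for illustration

    return BenchmarkResult(results=results, model=model_name, dataset=DATASET_NAME,
                           task=TASK, arxiv_id=arxiv_id)
```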
61+
62+
## Example: An MNIST benchmark in PyTorch
63+
64+
Let's see how we might make a PyTorch friendly benchmark which adheres to the framework's abstractions.
65+
66+
The first thing we need for evaluation is a dataset! Let's use the MNIST dataset from the `torchvision` library,
67+
along with a `DataLoader`:
68+
69+
```python
70+
from sotabenchapi.core import BenchmarkResult
71+
from torch.utils.data import DataLoader
72+
import torchvision.datasets as datasets
73+
74+
def evaluate_mnist(data_root: str, batch_size: int = 32, num_workers: int = 4) -> BenchmarkResult:
75+
76+
dataset = datasets.MNIST(data_root, train=False, download=True)
77+
loader = DataLoader(dataset, batch_size=batch_size, shuffle=False, num_workers=num_workers, pin_memory=True)
78+
79+
return BenchmarkResult(dataset=dataset.__name__)
80+
```
81+
82+
We've set `train=false` since we want to use the testing split for evaluation. We've also added a `data_root` parameter
83+
just so the use can specify where they want the data downloaded.
84+
85+
We should also probably allow for the user to put in their own transforms since this is a vision dataset, so
86+
let's modify further:
87+
88+
```python
89+
from sotabenchapi.core import BenchmarkResult
90+
from torch.utils.data import DataLoader
91+
import torchvision.datasets as datasets
92+
93+
def evaluate_mnist(data_root: str, batch_size: int = 32, num_workers: int = 4,
94+
input_transform=None, target_transform=None) -> BenchmarkResult:
95+
96+
dataset = datasets.MNIST(data_root, transform=input_transform, target_transform=target_transform,
97+
train=False, download=True)
98+
loader = DataLoader(dataset, batch_size=batch_size, shuffle=False, num_workers=num_workers, pin_memory=True)
99+
100+
return BenchmarkResult(dataset=dataset.__name__)
101+
```
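As a usage example, a caller could pass standard `torchvision` transforms into the benchmark (a sketch; the normalisation constants are the commonly quoted MNIST statistics, not something this benchmark requires):

```python
import torchvision.transforms as transforms

# Convert PIL images to tensors and normalise with commonly used MNIST statistics
input_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.1307,), std=(0.3081,)),
])

evaluate_mnist(data_root='./data', input_transform=input_transform)
```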
Great, so we have a dataset set up. Let's now take in a model. We could do this in a number of ways - for example,
we could accept a model function as an input (that takes in data and outputs predictions). Since we are using PyTorch,
where most modules are subclasses of `nn.Module`, let's do it in an object-oriented way by accepting a model object as an input:

```python
from sotabenchapi.core import BenchmarkResult
from torch.utils.data import DataLoader
import torchvision.datasets as datasets
from torchbench.utils import send_model_to_device

def evaluate_mnist(model, data_root: str, batch_size: int = 32, num_workers: int = 4,
                   input_transform=None, target_transform=None) -> BenchmarkResult:

    model, device = send_model_to_device(model, device='cuda')
    model.eval()

    dataset = datasets.MNIST(data_root, transform=input_transform, target_transform=target_transform,
                             train=False, download=True)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False, num_workers=num_workers, pin_memory=True)

    return BenchmarkResult(dataset=type(dataset).__name__)
```

Here we have reused a function from `torchbench` for sending the model to a CUDA device, but this is optional - you can
decide how models are processed in your own benchmark however you see fit.
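For instance, if you would rather not depend on `torchbench` for this step, the equivalent device handling in plain PyTorch might look like this (a sketch that falls back to CPU when CUDA is unavailable):

```python
import torch

# Pick a device and move the model to it before evaluation
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
model.eval()
```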
Now that we have a model and a dataset, let's loop through the data and evaluate the model:

```python
from sotabenchapi.core import BenchmarkResult
from torch.utils.data import DataLoader
import torchvision.datasets as datasets
from torchbench.utils import send_model_to_device, default_data_to_device, AverageMeter, accuracy
import tqdm
import torch

def evaluate_mnist(model, data_root: str, batch_size: int = 32, num_workers: int = 4,
                   input_transform=None, target_transform=None) -> BenchmarkResult:

    model, device = send_model_to_device(model, device='cuda')
    model.eval()

    dataset = datasets.MNIST(data_root, transform=input_transform, target_transform=target_transform,
                             train=False, download=True)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False, num_workers=num_workers, pin_memory=True)

    top1 = AverageMeter()
    top5 = AverageMeter()

    with torch.no_grad():
        for i, (input, target) in enumerate(tqdm.tqdm(loader)):

            input, target = default_data_to_device(input, target, device=device)
            output = model(input)
            prec1, prec5 = accuracy(output, target, topk=(1, 5))
            top1.update(prec1.item(), input.size(0))
            top5.update(prec5.item(), input.size(0))

    results = {'Top 1 Accuracy': top1.avg, 'Top 5 Accuracy': top5.avg}

    return BenchmarkResult(dataset=type(dataset).__name__, results=results)
```
We've used a few more utility functions from `torchbench` here, but again, you can use whatever you want to do the evaluation.
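For reference, if you would rather not depend on `torchbench` for the metric either, a minimal top-k accuracy helper might look like this (a sketch of the standard PyTorch pattern, not the `torchbench` implementation):

```python
import torch

def topk_accuracy(output, target, topk=(1,)):
    """Return the percentage of targets found within the top-k predictions."""
    maxk = max(topk)
    batch_size = target.size(0)

    # Indices of the k largest logits per example, transposed to shape (maxk, batch_size)
    _, pred = output.topk(maxk, dim=1, largest=True, sorted=True)
    pred = pred.t()
    correct = pred.eq(target.view(1, -1).expand_as(pred))

    res = []
    for k in topk:
        correct_k = correct[:k].reshape(-1).float().sum(0)
        res.append(correct_k.mul_(100.0 / batch_size))
    return res
```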
Coming back to our benchmark: you can see that we've passed a results dictionary into the `BenchmarkResult` object. Great! So we have a function that
takes in a model and evaluates it on a dataset. But how do we connect to sotabench? We need the user to pass
in some metadata about the model name and paper id, and we also need to specify a bit more about our benchmark -
e.g. the task, which in this case is "Image Classification":

```python
from sotabenchapi.core import BenchmarkResult
from torch.utils.data import DataLoader
import torchvision.datasets as datasets
from torchbench.utils import send_model_to_device, default_data_to_device, AverageMeter, accuracy
import tqdm
import torch

def evaluate_mnist(model, data_root: str, batch_size: int = 32, num_workers: int = 4,
                   input_transform=None, target_transform=None, model_name: str = None,
                   arxiv_id: str = None) -> BenchmarkResult:

    model, device = send_model_to_device(model, device='cuda')
    model.eval()

    dataset = datasets.MNIST(data_root, transform=input_transform, target_transform=target_transform,
                             train=False, download=True)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False, num_workers=num_workers, pin_memory=True)

    top1 = AverageMeter()
    top5 = AverageMeter()

    with torch.no_grad():
        for i, (input, target) in enumerate(tqdm.tqdm(loader)):

            input, target = default_data_to_device(input, target, device=device)
            output = model(input)
            prec1, prec5 = accuracy(output, target, topk=(1, 5))
            top1.update(prec1.item(), input.size(0))
            top5.update(prec5.item(), input.size(0))

    results = {'Top 1 Accuracy': top1.avg, 'Top 5 Accuracy': top5.avg}

    return BenchmarkResult(task='Image Classification', dataset=type(dataset).__name__, results=results,
                           model=model_name, arxiv_id=arxiv_id)
```

And you're set! The task string connects the result to the taxonomy on sotabench, and the rest gives context to the
result - for example the model's name and the paper it is from.

The final step is to publish this as a PyPI library. This will enable your users to write a `sotabench.py` file
that imports your benchmark and passes their model and other parameters into it. When they connect their repository to sotabench.com,
sotabench.com will download your library, evaluate their model with it, and then publish the results to your
benchmark page.
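For example, a user's `sotabench.py` might look something like the following (a sketch: the package name `mnistbench` is hypothetical, and the `torchvision` ResNet-18 is just a stand-in for whatever model they actually submit):

```python
# sotabench.py in the root of the user's repository
import torchvision.transforms as transforms
from torchvision.models import resnet18

from mnistbench import evaluate_mnist  # hypothetical name for your published benchmark package

# Stand-in model; a real submission would load its own trained weights
model = resnet18(num_classes=10)

evaluate_mnist(
    model,
    data_root='./.data',
    input_transform=transforms.Compose([
        transforms.Grayscale(num_output_channels=3),  # MNIST is single-channel; ResNet expects 3 channels
        transforms.ToTensor(),
    ]),
    model_name='ResNet-18 (example)',
    arxiv_id='1512.03385',
)
```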
## Other Examples

The [torchbench](https://www.github.com/paperswithcode/torchbench) library is a good reference for benchmark implementations,
which you can base your own benchmarks on.

## API for BenchmarkResult

```eval_rst
.. automodule:: sotabenchapi.core.results
   :members:
```

docs/04.create-benchmark.md

-1
This file was deleted.

docs/conf.py

+3-3
@@ -27,10 +27,10 @@
 from sotabenchapi.version import __version__
 
 
-project = "torchbench"
-author = "Robert Stojnic <[email protected]>"
+project = "Sotabench API"
+author = "Sotabench Team <[email protected]>"
 description = (
-    "Easily benchmark Machine Learning models on selected tasks and datasets."
+    "Easily benchmark deep learning models"
 )
 copyright = f"{datetime.now():%Y}, {author}"
 

docs/index.md

+13-11
@@ -1,21 +1,24 @@
-# Welcome to torchbench documentation!
+# Welcome to the Sotabench API Documentation
 
+This documentation details how to use the `sotabenchapi` library to connect
+with [sotabench](http://www.sotabench.com).
 
-## Contents
+Using this library you will be able to create your own public research benchmarks, allowing the community to submit and
+evaluate models on them, and have the results submitted to the [sotabench](http://www.sotabench.com) resource.
 
-```eval_rst
-.. toctree::
-   :maxdepth: 2
+## Installation
 
-   01.overview.md
-   02.getting-started.md
-   03.benchmark-model.md
-   04.create-benchmark.md
+The library requires Python 3.6+. You can install via pip:
 
+    pip install sotabenchapi
+
+## Contents
+
+```eval_rst
 .. toctree::
    :maxdepth: 2
 
-   api/index.md
+   01.create-benchmark.md
 ```
 
 
@@ -24,5 +27,4 @@
 ```eval_rst
 * :ref:`genindex`
 * :ref:`modindex`
-* :ref:`search`
 ```

sotabenchapi/core/results.py

+8-13
@@ -8,13 +8,9 @@
 
 
 class BenchmarkResult:
-    """BenchmarkResult represents the results of a benchmark.
-
-    It takes in inputs from a benchmark evaluation and stores them to a JSON at
-    ``evaluation.json``.
-
-    This file is then processed to store and show results on the sotabench
-    platform.
+    """BenchmarkResult encapsulates the results of a model on a benchmark,
+    along with methods for serialising that data and checking the parameters
+    with the sotabench.com resource.
 
     Most of the inputs are optional - so when you create a benchmark, you can
     choose which subset of arguments you want to store (that are relevant for
@@ -86,15 +82,14 @@ def __init__(
         self.to_dict()
 
     def to_dict(self) -> dict:
-        """Performs evaluation and return build results.
+        """Serialises the benchmark result data.
 
-        Performs evaluation using a benchmark function and returns a
-        dictionary of the build results.
-
-        If an environmental variable is set
-        (``SOTABENCH_STORE_RESULTS == True``) then will also save a JSON called
+        If an environment variable is set, e.g.
+        (``SOTABENCH_STORE_FILENAME == 'evaluation.json'``), it will also save a JSON called
         ``evaluation.json``
 
+        The method also checks for errors with the sotabench.com server if in check mode.
+
         Returns:
             dict: A dictionary containing results
         """