(sec:clustbench-usage)=
# Using *clustbench*
 
The Python version of the *clustering-benchmarks* package
can be installed from [PyPI](https://pypi.org/project/clustering-benchmarks/),
e.g., via a call to:

```
pip3 install clustering-benchmarks
```

from the command line. Alternatively, please use your favourite Python
package manager.

Once installed, we can import it by calling:

```python
import clustbench
```

Below we discuss its basic features/usage.

::::{note}
*To learn more about Python,
check out Marek's open-access (free!) textbook*
[Minimalist Data Wrangling in Python](https://datawranglingpy.gagolewski.com/)
{cite}`datawranglingpy`.
::::
 
## Fetching Benchmark Data

The datasets from the {ref}`sec:suite-v1` can be accessed easily.
It is best to [download](https://github.com/gagolews/clustering-data-v1)
the whole repository onto our disk first.
Let us assume they are available in the following directory:

```python
# load from a local library (download the suite manually)
import os.path
data_path = os.path.join("~", "Projects", "clustering-data-v1")
```

Here is the list of the currently available benchmark batteries
(dataset collections):

```python
print(clustbench.get_battery_names(path=data_path))
## ['fcps', 'g2mg', 'graves', 'h2mg', 'mnist', 'other', 'sipu', 'uci', 'wut']
```
 
We can list the datasets in an example battery by calling:

```python
battery = "wut"
print(clustbench.get_dataset_names(battery, path=data_path))
## ['circles', 'cross', 'graph', 'isolation', 'labirynth', 'mk1', 'mk2', 'mk3', 'mk4', 'olympic', 'smile', 'stripes', 'trajectories', 'trapped_lovers', 'twosplashes', 'windows', 'x1', 'x2', 'x3', 'z1', 'z2', 'z3']
```
 
For instance, let us load the `wut/x2` dataset:

```python
dataset = "x2"
b = clustbench.load_dataset(battery, dataset, path=data_path)
```

The above call returned a named tuple.
For instance, the README file can be accessed via the `description`
field:

```python
print(b.description)
## Author: Eliza Kaczorek (Warsaw University of Technology)
##
## `labels0` come from the Author herself.
## `labels1` were generated by Marek Gagolewski.
## `0` denotes the noise class (if present).
```

What is more, the `data` field is the data matrix, `labels` gives the list
of all ground truth partitions (encoded as label vectors),
and `n_clusters` gives the corresponding numbers of subsets.
In case of any doubt, we can always consult the official documentation
of the {any}`clustbench.load_dataset` function.

For instance, here is the shape (*n* and *d*) of the data matrix,
the number of reference partitions, and their cardinalities *k*,
respectively:

```python
print(b.data.shape, len(b.labels), b.n_clusters)
## (120, 2) 2 [3 4]
```
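
Since a reference partition is just a label vector in which `0` marks noise points (if present), its basic properties can be inspected with a few lines of plain Python. The sketch below is only an illustration: `summarise_labels` is a hypothetical helper (not part of *clustbench*) and the toy label vector is made up. Assuming the dataset loaded above, a call such as `summarise_labels(b.labels[0])` should work in the same way.

```python
from collections import Counter


def summarise_labels(labels):
    """Return (cluster sizes keyed by label, number of noise points)."""
    counts = Counter(int(label) for label in labels)
    n_noise = counts.pop(0, 0)  # 0 marks noise points (if present)
    return dict(sorted(counts.items())), n_noise


# a made-up label vector (not taken from the benchmark suite)
toy_labels = [1, 1, 2, 0, 3, 3, 3, 2, 0, 1]
sizes, n_noise = summarise_labels(toy_labels)
print(sizes, n_noise)
```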

The following figure (generated via a call to
[`genieclust`](https://genieclust.gagolewski.com/)`.plots.plot_scatter`)
illustrates the benchmark dataset at hand.

```python
import matplotlib.pyplot as plt
import genieclust
for i in range(len(b.labels)):
    plt.subplot(1, len(b.labels), i+1)
    genieclust.plots.plot_scatter(
        b.data, labels=b.labels[i]-1, axis="equal", title=f"labels{i}"
    )
plt.show()
```

(fig:using-clustbench-example1)=
```{figure} clustbench-usage-figures/using-clustbench-example1-1.*
An example benchmark dataset and the corresponding ground truth labels.
```

## Fetching Precomputed Results

Let us study one of the sets of
[precomputed clustering results](https://github.com/gagolews/clustering-results-v1)
stored in the following directory:

```python
results_path = os.path.join("~", "Projects", "clustering-results-v1", "original")
```

They can be fetched by calling:

```python
method_group = "Genie"  # or "*" for everything
res = clustbench.load_results(
    method_group, b.battery, b.dataset, b.n_clusters, path=results_path
)
print(list(res.keys()))
## ['Genie_G0.1', 'Genie_G0.3', 'Genie_G0.5', 'Genie_G0.7', 'Genie_G1.0']
```

We thus have got access to precomputed data
generated by the [*Genie*](https://genieclust.gagolewski.com)
algorithm with different `gini_threshold` parameter settings.

## Computing External Cluster Validity Measures

Different
{ref}`external cluster validity measures <sec:external-validity-measures>`
can be computed by calling {any}`clustbench.get_score`:

```python
import pandas as pd
pd.Series({  # for aesthetics
    method: clustbench.get_score(b.labels, res[method])
    for method in res.keys()
})
## Genie_G0.1    0.870000
## Genie_G0.3    0.870000
## Genie_G0.5    0.590909
## Genie_G0.7    0.666667
## Genie_G1.0    0.010000
## dtype: float64
```

By default, normalised clustering accuracy is applied.
As explained in the tutorial, we compare the predicted clusterings against
{ref}`all <sec:many-partitions>` the reference partitions
({ref}`ignoring noise points <sec:noise-points>`)
and report the maximal score.

Let us depict the results for the `"Genie_G0.3"` method:

```python
method = "Genie_G0.3"
for i, k in enumerate(res[method].keys()):
    plt.subplot(1, len(res[method]), i+1)
    genieclust.plots.plot_scatter(
        b.data, labels=res[method][k]-1, axis="equal", title=f"{method}; k={k}"
    )
plt.show()
```

(fig:using-clustbench-example2)=
```{figure} clustbench-usage-figures/using-clustbench-example2-3.*
Results generated by Genie.
```
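
The scoring rule described in this section (compare against every reference partition, ignore the noise points, report the maximum) can be sketched in plain Python. This is only an illustration under stated assumptions, not the package's implementation: `matched_accuracy` and `max_score` are hypothetical helpers, the labels are assumed to be 1, ..., *k* with `0` marking noise, and a simple best-matching accuracy is used as a stand-in for normalised clustering accuracy. The toy data are made up.

```python
from itertools import permutations


def matched_accuracy(y_true, y_pred, k):
    """Fraction of points that agree with the reference labels under the
    best one-to-one relabelling of the predicted clusters
    (brute force over all permutations; suitable for small k only)."""
    best = 0
    for perm in permutations(range(1, k + 1)):
        # perm[j-1] is the reference label matched with predicted label j
        hits = sum(1 for t, p in zip(y_true, y_pred) if perm[p - 1] == t)
        best = max(best, hits)
    return best / len(y_true)


def max_score(references, results):
    """Compare results[k] with each k-cluster reference; return the maximum."""
    scores = []
    for y_true in references:
        k = max(y_true)  # assumes the reference labels are 1, ..., k
        y_pred = results[k]
        # drop the points marked as noise (label 0) in the reference
        kept = [(t, p) for t, p in zip(y_true, y_pred) if t != 0]
        scores.append(
            matched_accuracy([t for t, _ in kept], [p for _, p in kept], k)
        )
    return max(scores)


# made-up reference partitions and predictions keyed by the number of clusters
references = [[1, 1, 2, 2, 0, 2], [1, 2, 3, 3, 0, 1]]
results = {2: [2, 2, 1, 1, 1, 2], 3: [1, 2, 3, 3, 3, 1]}
print(max_score(references, results))
```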

## Applying Clustering Methods Manually

Naturally, the aim of this benchmark framework is also to test new methods.
We can use {any}`clustbench.fit_predict_many` to generate
all the partitions required to compare ourselves against the reference labels.

For instance, let us investigate the behaviour of the k-means algorithm:

```python
import sklearn.cluster
m = sklearn.cluster.KMeans(n_init=10)
res["KMeans"] = clustbench.fit_predict_many(m, b.data, b.n_clusters)
clustbench.get_score(b.labels, res["KMeans"])
## 0.9848484848484849
```

We see that k-means (which specialises in detecting symmetric Gaussian-like blobs)
performs better than *Genie* on this particular dataset.

```python
method = "KMeans"
for i, k in enumerate(res[method].keys()):
    plt.subplot(1, len(res[method]), i+1)
    genieclust.plots.plot_scatter(
        b.data, labels=res[method][k]-1, axis="equal", title=f"{method}; k={k}"
    )
plt.show()
```

(fig:using-clustbench-example3)=
```{figure} clustbench-usage-figures/using-clustbench-example3-5.*
Results generated by K-Means.
```
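
To give an idea of what generating "all the partitions required" involves, here is a minimal sketch of such a loop. It is not the package's code: `fit_predict_for_each_k` and the toy `EqualChunks` class are hypothetical and introduced only for illustration. The sketch assumes a scikit-learn-like object that exposes `set_params` and `fit_predict`, and it shifts the labels so that they start at 1, in line with the `-1` applied before plotting above.

```python
def fit_predict_for_each_k(model, data, n_clusters):
    """Hypothetical helper: run `model` once per requested number of clusters.

    Returns a dict {k: label vector}, with the labels shifted to start at 1.
    """
    results = {}
    for k in sorted(set(int(k) for k in n_clusters)):
        model.set_params(n_clusters=k)
        labels = model.fit_predict(data)  # scikit-learn style: 0, ..., k-1
        results[k] = [int(label) + 1 for label in labels]
    return results


class EqualChunks:
    """Toy 1-D 'clusterer' (illustration only): sorts the points and
    splits them into k chunks of (nearly) equal sizes."""

    def __init__(self, n_clusters=2):
        self.n_clusters = n_clusters

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit_predict(self, data):
        order = sorted(range(len(data)), key=lambda i: data[i])
        labels = [0] * len(data)
        for rank, i in enumerate(order):
            labels[i] = rank * self.n_clusters // len(data)
        return labels


toy_data = [0.1, 5.0, 0.2, 9.7, 5.1, 9.9]  # made-up 1-D points
toy_res = fit_predict_for_each_k(EqualChunks(), toy_data, [3, 2])
print(toy_res)
```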

For more functions, please refer to the package's documentation (in the next section).
Moreover, {ref}`sec:colouriser` describes a standalone application
that can be used to prepare our own two-dimensional datasets.

Note that you do not have to use the *clustering-benchmarks* package
to access the benchmark datasets from our repository.
As the {ref}`sec:how-to-access` section mentions, most tasks
involve simple operations on files and directories which you can
implement manually. The package was developed merely for the users'
convenience.
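
As a rough illustration of such manual access, the sketch below reads a data matrix and a label vector from gzipped text files using only the standard library. The file layout (`<path>/<battery>/<dataset>.data.gz` and `<dataset>.labels0.gz`, whitespace-separated numbers, one label per line) is an assumption made here for illustration; please verify it against the repository. The helper names and the tiny stand-in dataset, written to a temporary directory, are hypothetical.

```python
import gzip
import os
import tempfile


def read_matrix(fname):
    """Read a gzipped text file of whitespace-separated numbers (one row per line)."""
    with gzip.open(fname, "rt") as f:
        return [[float(x) for x in line.split()] for line in f if line.strip()]


def read_labels(fname):
    """Read a gzipped text file with one integer label per line."""
    with gzip.open(fname, "rt") as f:
        return [int(line) for line in f if line.strip()]


# build a tiny stand-in "battery" so that the sketch is runnable anywhere
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "toy"))
base = os.path.join(root, "toy", "example")  # assumed <path>/<battery>/<dataset>
with gzip.open(base + ".data.gz", "wt") as f:
    f.write("0.0 1.0\n2.0 3.0\n4.0 5.0\n")
with gzip.open(base + ".labels0.gz", "wt") as f:
    f.write("1\n1\n2\n")

X = read_matrix(base + ".data.gz")
y = read_labels(base + ".labels0.gz")
print(len(X), len(X[0]), y)
```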