|
241 | 241 | "cell_type": "markdown",
|
242 | 242 | "metadata": {},
|
243 | 243 | "source": [
|
244 |
| - "## 2.2 Apply data validation to cells in a spreadsheet\n", |
| 244 | + "## 2.2 Methods for data anonymisation\n", |
| 245 | + "\n", |
| 246 | + "A wide range of techniques is available to support anonymisation. Broadly, they fall into two types:\n", |
| 247 | + "\n", |
| 248 | + "- __Redaction__: in which we remove fields or line-item information while maintaining sufficient integrity to permit semantic analysis;\n", |
| 249 | + "- __Aggregation__: in which we deliberately aggregate data to ensure outlier anonymity.\n", |
| 250 | + "\n", |
| 251 | + "### 2.2.1 Redaction methods\n", |
| 252 | + "\n", |
| 253 | + "Before we start doing anything, we need to understand our dataset, and understand how we intend to redact it _while maintaining its internal integrity so that we can continue to conduct analysis_.\n", |
| 254 | + "\n", |
| 255 | + "#### Attribute suppression\n", |
| 256 | + "\n", |
| 257 | + "An `attribute` is also known as a `field`. This method requires that we delete an entire field. It is one of the first, and easiest, steps we can take.\n", |
| 258 | + "\n", |
| 259 | + "- Remove data we do not need\n", |
| 260 | + "- Remove data we cannot easily redact\n", |
| 261 | + "\n", |
| 262 | + "This is a destructive step since suppression deletes the original data.\n", |
| 263 | + "\n", |
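As a minimal sketch of attribute suppression, assuming the records are held as a list of dictionaries (the field names and values below are hypothetical, not taken from the course data):

```python
# Hypothetical patient records; the field names are illustrative assumptions.
records = [
    {"name": "Ana Silva", "phone": "555-0101", "age": 34, "illness": "influenza"},
    {"name": "Ben Okoro", "phone": "555-0102", "age": 61, "illness": "malaria"},
]

# Fields we do not need, or cannot easily redact.
SUPPRESS = {"name", "phone"}


def suppress_attributes(rows, fields):
    """Return copies of the rows with the named fields removed."""
    return [{k: v for k, v in row.items() if k not in fields} for row in rows]


redacted = suppress_attributes(records, SUPPRESS)
print(redacted)
```

The function returns new rows rather than editing in place, so the source list stays untouched until you decide to discard it.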
| 264 | + "#### Record suppression\n", |
| 265 | + "\n", |
| 266 | + "Some records are outliers: sufficiently rare that, in and of themselves, they can't be anonymised. With record suppression we remove all data related to these individuals. However, tread carefully.\n", |
| 267 | + "\n", |
| 268 | + "Outliers may be of significant interest if their status is part of the study. If a person's illness is unusual for the area where they live, for their ethnicity, gender or sexual orientation, then that would make them an outlier. However, that would also be important for understanding the disease.\n", |
| 269 | + "\n", |
| 270 | + "On the other hand, if their location, ethnicity, gender or sexual orientation have no bearing on the disease, then these could be safely redacted.\n", |
| 271 | + "\n", |
| 272 | + "#### Pseudonymisation\n", |
| 273 | + "\n", |
| 274 | + "Pseudonymisation is the replacement of identifying data with randomised values. This can be reversible, if you keep a key that maps the original data to the generated values, or irreversible, if you deliberately throw away the key. Persistent pseudonyms support linkage of the same individual across different datasets.\n", |
| 275 | + "\n", |
| 276 | + "- `strings`: pseudonymise through replacement;\n", |
| 277 | + "\n", |
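One possible way to pseudonymise strings is sketched below, using only the Python standard library. The function name `pseudonymise`, the `P-` prefix and the sample names are assumptions made for illustration.

```python
import secrets


def pseudonymise(values, key=None):
    """Replace each distinct value with a random token.

    `key` maps original values to tokens. Keep it to make the process
    reversible or to reuse pseudonyms across datasets; discard it to
    make the replacement irreversible.
    """
    key = {} if key is None else key
    out = []
    for value in values:
        if value not in key:
            # 8 random bytes, rendered as 16 hexadecimal characters.
            key[value] = "P-" + secrets.token_hex(8)
        out.append(key[value])
    return out, key


names = ["Ana Silva", "Ben Okoro", "Ana Silva"]
pseudonyms, key = pseudonymise(names)

# Reusing the same key gives persistent pseudonyms in a second dataset.
other, key = pseudonymise(["Ben Okoro"], key)

# Reversal is only possible while the key exists.
reverse = {token: original for original, token in key.items()}
```

Deleting `key` and `reverse` after processing is what makes the replacement irreversible.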
| 278 | + "#### Generalisation\n", |
| 279 | + "\n", |
| 280 | + "Generalisation is a deliberate reduction in the precision of data, such as converting a person's age into a range, or a precise location into a less precise location.\n", |
| 281 | + "\n", |
| 282 | + "- `range`: conversion of precise numbers into quantiles or statistical ranges;\n", |
| 283 | + "- `cluster`: aggregation of geospatial data into statistically less significant clusters - this can also be used to mask outliers;\n", |
| 284 | + "\n", |
| 285 | + "Design the data ranges with appropriate sizes. Sometimes quantiles are the most appropriate; sometimes we use statistical definitions (such as geospatial ranges that are designed to include sufficient numbers of people so as to reduce the risk of deanonymisation).\n", |
| 286 | + "\n", |
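A small sketch of generalisation follows, assuming fixed-width age bands and rounding as the way to coarsen a coordinate. The band width and number of decimals are illustrative choices, not recommendations.

```python
def generalise_age(age, width=10):
    """Convert an exact age into a range label such as '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def generalise_coordinate(value, decimals=1):
    """Reduce the precision of a latitude or longitude by rounding."""
    return round(value, decimals)


ages = [34, 61, 39, 7]
age_ranges = [generalise_age(a) for a in ages]
print(age_ranges)  # ['30-39', '60-69', '30-39', '0-9']

coarse_lat = generalise_coordinate(51.5074)
print(coarse_lat)  # 51.5
```

Fixed-width bands are the simplest option; quantile-based bands would need the whole dataset to compute the boundaries first.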
| 287 | + "#### Shuffling\n", |
| 288 | + "\n", |
| 289 | + "Shuffling is where data are rearranged such that the individual attribute values are still represented in the dataset but, generally, no longer correspond to the original records. This is not appropriate for all data. Swapping diseases amongst different patients will certainly render the data anonymous, but will also confuse any epidemiological analysis.\n", |
| 290 | + "\n", |
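Shuffling a single field can be sketched as follows, assuming records held as a list of dictionaries. The helper `shuffle_field` and the sample rows are hypothetical.

```python
import random


def shuffle_field(rows, field, seed=None):
    """Return copies of the rows with one field's values shuffled across records."""
    rng = random.Random(seed)
    values = [row[field] for row in rows]
    rng.shuffle(values)
    # Every other field stays with its original record.
    return [{**row, field: value} for row, value in zip(rows, values)]


rows = [
    {"id": 1, "postcode": "AB1", "illness": "influenza"},
    {"id": 2, "postcode": "CD2", "illness": "malaria"},
    {"id": 3, "postcode": "EF3", "illness": "measles"},
]
shuffled = shuffle_field(rows, "postcode", seed=1)
```

The set of postcodes in the dataset is unchanged, but a postcode is no longer guaranteed to belong to the record it appears in.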
| 291 | + "#### Data perturbation\n", |
| 292 | + "\n", |
| 293 | + "Perturbation involves adding random noise to data to \"blur\" it. This can include rounding, shifting dates, or adding geospatial displacement (jitter) to coordinate data. This means artificially moving values within a small range to obscure the exact details of the person.\n", |
| 294 | + "\n", |
| 295 | + "- `dates`: shift exact dates by days or months;\n", |
| 296 | + "- `rounding`: round off to the nearest decile or whole number, depending on the precision of the data;\n", |
| 297 | + "- `coordinates`: perturb the data through geospatial displacement (jitter);\n", |
| 298 | + "\n", |
| 299 | + "Care must be taken not to add too little or too much perturbation: too little leaves individuals identifiable, while too much distorts the analysis.\n", |
| 300 | + "\n", |
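The date shifting and jitter described above might look like this in outline. The helper names, the 14-day window and the 0.01-degree offset are assumptions chosen for illustration only.

```python
import random
from datetime import date, timedelta


def shift_date(d, max_days=14, rng=random):
    """Shift a date by a random whole number of days within +/- max_days."""
    return d + timedelta(days=rng.randint(-max_days, max_days))


def jitter_coordinate(lat, lon, max_offset=0.01, rng=random):
    """Displace a coordinate by a random offset of up to max_offset degrees."""
    return (
        lat + rng.uniform(-max_offset, max_offset),
        lon + rng.uniform(-max_offset, max_offset),
    )


# A seeded generator makes this sketch repeatable; omit the seed in practice.
rng = random.Random(42)
admitted = date(2020, 3, 15)
shifted = shift_date(admitted, max_days=14, rng=rng)
lat, lon = jitter_coordinate(51.5074, -0.1278, max_offset=0.01, rng=rng)
```

The size of `max_days` and `max_offset` is the judgement call: it sets the trade-off between protecting the individual and preserving the analysis.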
| 301 | + "### 2.2.2 Aggregation methods\n", |
| 302 | + "\n", |
| 303 | + "Aggregation is far more destructive than redaction. We will lose resolution on patient morphology, and we will lose the direct relationships between data in exchange for summaries of those data. But we will gain security for the individuals concerned.\n", |
| 304 | + "\n", |
| 305 | + "Where redaction is guided by the data almost exclusively, aggregation is guided by the research objectives for the data. Any form of aggregation will limit what can be done and awareness of these limitations is critical.\n", |
| 306 | + "\n", |
| 307 | + "Census data are usually aggregated in this way, with the individual microdata (responses from each household) only made available to accredited researchers, while the aggregated versions are made available to the public.\n", |
| 308 | + "\n", |
| 309 | + "Our objective will be to create groups of data and then perform aggregations on each group. The range of aggregations we can form includes:\n", |
| 310 | + "\n", |
| 311 | + "- `count`: count of the individual members of the group;\n", |
| 312 | + "- `totals`: sums of values, and sums of sub-groups within the values (e.g. total duration of illness, and duration of each type of illness);\n", |
| 313 | + "- `averages`: including `mean`, `median` and `mode` of data sequences;\n", |
| 314 | + "- `distributions`: including `quantiles`, `normals` or other types of distribution.\n", |
| 315 | + "\n", |
| 316 | + "The groups can be by specific `categories` or `geospatial` ranges.\n", |
| 317 | + "\n", |
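A minimal sketch of grouping by a category and computing `count`, `totals` and `averages` is shown below, using only the standard library. The field names `region` and `days_ill` are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median

rows = [
    {"region": "North", "days_ill": 4},
    {"region": "North", "days_ill": 10},
    {"region": "South", "days_ill": 7},
    {"region": "North", "days_ill": 7},
]


def aggregate(rows, group_field, value_field):
    """Group rows by one field and summarise another field within each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_field]].append(row[value_field])
    return {
        group: {
            "count": len(values),
            "total": sum(values),
            "mean": mean(values),
            "median": median(values),
        }
        for group, values in groups.items()
    }


summary = aggregate(rows, "region", "days_ill")
```

A group with a single member, such as `South` here, still describes one individual, so groups like this may need to be widened or merged before release.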
| 318 | + "In many ways, an entire course of statistics is required to perform aggregations well.\n", |
| 319 | + "\n", |
| 320 | + "<div class=\"alert alert-block alert-warning\">\n", |
| 321 | + " <p><b>Aggregations require familiarity and experience with the data being aggregated.</b> It's very difficult to simply pick up a random dataset and know how to aggregate it\n", |
| 322 | + " in a way that supports analysis and extracts meaning from it. You are unlikely to be responsible for aggregating data you don't have experience with, and when you have that\n", |
| 323 | + " experience, knowing how to aggregate it will become clearer.</p>\n", |
| 324 | + "</div>\n", |
| 325 | + "\n", |
| 326 | + "\n", |
| 327 | + "---" |
| 328 | + ] |
| 329 | + }, |
| 330 | + { |
| 331 | + "cell_type": "markdown", |
| 332 | + "metadata": {}, |
| 333 | + "source": [ |
| 334 | + "## 2.3 Apply data validation to cells in a spreadsheet\n", |
245 | 335 | "\n",
|
246 | 336 | "Your `types` - at this stage - are only a guide. You will have no feedback, or error messages like you get when running Python code, if any of the data types in your field columns are wrong. There are a few ways to get that feedback so you can correct things, but we'll start with data validation in spreadsheet cells.\n",
|
247 | 337 | "\n",
|
248 | 338 | "The following is adapted from a [Microsoft Office tutorial](https://support.office.com/en-gb/article/apply-data-validation-to-cells-29fecbcc-d1b9-42c1-9d76-eff3ce5f7249). This approach will work in OpenOffice as well as Google Sheets, although the specific steps are different.\n",
|
249 | 339 | "\n",
|
250 | 340 | "Microsoft has an example file you can [download](http://download.microsoft.com/download/9/6/8/968A9140-2E13-4FDC-B62C-C1D98D2B0FE6/Data%20Validation%20Examples.xlsx).\n",
|
251 | 341 | "\n",
|
252 |
| - "### 2.2.1 Specify validation for data types\n", |
| 342 | + "### 2.3.1 Specify validation for data types\n", |
253 | 343 | "\n",
|
254 | 344 | "The process is straightforward:\n",
|
255 | 345 | "\n",
|
|
287 | 377 | "\n",
|
288 | 378 | "Now - only for new data - if a user tries to enter a value that is not valid, a pop-up appears with the message, \"This value doesn’t match the data validation restrictions for this cell.\" We'll run validation on your existing data shortly, but first a detour into `lists`.\n",
|
289 | 379 | "\n",
|
290 |
| - "### 2.2.2 Lists are a special type\n", |
| 380 | + "### 2.3.2 Lists are a special type\n", |
291 | 381 | "\n",
|
292 | 382 | "Before you can validate a `list` type, you need to specify valid terms. In Excel, this requires an [extra set of steps](https://support.office.com/en-us/article/create-a-drop-down-list-7693307a-59ef-400a-b769-c5402dce407b).\n",
|
293 | 383 | "\n",
|
|
306 | 396 | " - Convert your list to a table with __Ctrl+T__, then from the __Table Design__ tab give your table a name, permitting you to reference the table name and column (e.g. `=CityTable[City]`)\n",
|
307 | 397 | " - From the __Formulas__ tab select __Name Manager__, create a __New__ item with an appropriate name (e.g. `CityList`), and reference the cells (e.g. `=Sheet1!A4:A10`), which then lets you reference your list anywhere (e.g. `=CityList`)\n",
|
308 | 398 | "\n",
|
309 |
| - "### 2.2.3 Validate and get error messages for your existing data\n", |
| 399 | + "### 2.3.3 Validate and get error messages for your existing data\n", |
310 | 400 | "\n",
|
311 | 401 | "After you've specified validation rules on your existing data, you might be disappointed. Excel does not automatically notify you whether these cells contain invalid data. Here's a quick way to [highlight existing invalid cells](https://support.office.com/en-us/article/more-on-data-validation-f38dee73-9900-4ca6-9301-8a5f6e1f0c4c) by circling the values:\n",
|
312 | 402 | "\n",
|
|
348 | 438 | "cell_type": "markdown",
|
349 | 439 | "metadata": {},
|
350 | 440 | "source": [
|
351 |
| - "## 2.3 Saving your validated file as a comma-separated-value\n", |
| 441 | + "## 2.4 Saving your validated file as a comma-separated-value\n", |
352 | 442 | "\n",
|
353 | 443 | "Comma separated value files (`.csv`) are text files in which the comma character `,` separates each field of text. Where a comma appears in the value - whether a `string` or `number` - the value is then surrounded by quotation marks, e.g. `100, 200, \"20,000\"` indicates three values in three separate fields.\n",
|
354 | 444 | "\n",
|
|
372 | 462 | "cell_type": "markdown",
|
373 | 463 | "metadata": {},
|
374 | 464 | "source": [
|
375 |
| - "## 2.4 Validating your data and JSON schema using CSVLint\n", |
| 465 | + "## 2.5 Validating your data and JSON schema using CSVLint [optional]\n", |
376 | 466 | "\n",
|
377 | 467 | "In the next lesson, we'll learn how to validate your data using Python directly in a Jupyter Notebook. For now, we'll use an online resource provided by the Open Data Institute called [CSVLint](https://csvlint.io/).\n",
|
378 | 468 | "\n",
|
|
416 | 506 | "cell_type": "markdown",
|
417 | 507 | "metadata": {},
|
418 | 508 | "source": [
|
419 |
| - "## 2.5 Lesson tutorial\n", |
| 509 | + "## 2.6 Lesson tutorial\n", |
420 | 510 | "\n",
|
421 | 511 | "<div class=\"alert alert-block alert-success\">\n",
|
422 | 512 | " <p><b>Tutorial:</b></p>\n",
|
|
429 | 519 | " </ul>\n",
|
430 | 520 | "</div>\n",
|
431 | 521 | "\n",
|
432 |
| - "Please complete the tutorial before continuing with this series." |
| 522 | + "Please complete the tutorial before continuing with this series. If you are participating in a taught class, please send your tutorial submission via the required process (email or online)." |
433 | 523 | ]
|
434 | 524 | }
|
435 | 525 | ],
|
|