
Commit 606fcf7

Fixing Issue on page /data-visualise.html #41 and Issue on page /data-tidy.html #43
1 parent dd9f387 commit 606fcf7

5 files changed: +144 -28 lines

data-tidy.ipynb (+140 -22)
@@ -74,15 +74,26 @@
 "\n",
 "Tidy data aren't going to be appropriate *every* time and in every case, but they're a really, really good default for tabular data. Once you use it as your default, it's easier to think about how to perform subsequent operations.\n",
 "\n",
-"Having said that tidy data are great, they are, but one of **pandas**' advantages relative to other data analysis libraries is that it isn't *too* tied to tidy data and can navigate awkward non-tidy data manipulation tasks happily too."
+"Having said that tidy data are great, they are, but one of **pandas**' advantages relative to other data analysis libraries is that it isn't *too* tied to tidy data and can navigate awkward non-tidy data manipulation tasks happily too.\n",
+"\n",
+"There are two common problems you find in ingested data that make them not tidy:\n",
+"\n",
+"1. A variable might be spread across multiple columns.\n",
+"2. An observation might be scattered across multiple rows.\n",
+"\n",
+"For the former, we need to \"melt\" the wide data, with multiple columns, into long data.\n",
+"\n",
+"For the latter, we need to unstack or pivot the multiple rows into columns (i.e., go from long to wide).\n",
+"\n",
+"We'll see both below."
 ]
 },
 {
 "cell_type": "markdown",
 "id": "deb8cf13",
 "metadata": {},
 "source": [
-"## Make Data Tidy with **pandas**"
+"## Tools to Make Data Tidy with **pandas**"
 ]
 },
 {
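As a quick aside on the two problems listed in the cell above, here is a minimal sketch of both fixes on a toy dataframe (the country names and counts are purely illustrative, not taken from the repository's data):

```python
import pandas as pd

# Problem 1: one variable ("year") spread across multiple columns.
wide = pd.DataFrame(
    {"country": ["AFG", "BRA"], "1999": [745, 37737], "2000": [2666, 80488]}
)
# melt() gathers the year columns into a single tidy "year" column.
long = wide.melt(id_vars="country", var_name="year", value_name="cases")
print(long)

# Problem 2 is the inverse: values scattered over rows go back to
# one row per observation with pivot().
print(long.pivot(index="country", columns="year", values="cases"))
```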
@@ -92,7 +103,7 @@
 "source": [
 "### Melt\n",
 "\n",
-"`melt()` can help you go from untidy to tidy data (from wide data to long data), and is a *really* good one to remember.\n",
+"`melt()` can help you go from \"wider\" data to \"longer\" data, and is a *really* good one to remember.\n",
 "\n",
 "![](https://pandas.pydata.org/docs/_images/reshaping_melt.png)\n",
 "\n",
@@ -132,7 +143,61 @@
 "Perform a `melt()` that uses `job` as the id instead of `first` and `last`.\n",
 "```\n",
 "\n",
-"### Wide to Long\n",
+"How does this relate to tidy data? Sometimes you'll have a variable spread over multiple columns that you want to turn tidy. Let's look at this example that uses cases of [tuberculosis from the World Health Organisation](https://www.who.int/teams/global-tuberculosis-programme/data).\n",
+"\n",
+"First let's open the data and look at the top of the file."
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "bfa121cf",
+"metadata": {},
+"outputs": [],
+"source": [
+"from pathlib import Path\n",
+"\n",
+"df_tb = pd.read_parquet(Path(\"data/who_tb_cases.parquet\"))\n",
+"df_tb.head()"
+]
+},
+{
+"cell_type": "markdown",
+"id": "583e419d",
+"metadata": {},
+"source": [
+"You can see that we have two columns for a single variable, year. Let's now melt this."
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "dc03ccd9",
+"metadata": {},
+"outputs": [],
+"source": [
+"df_tb.melt(\n",
+"    id_vars=[\"country\"],\n",
+"    var_name=\"year\",\n",
+"    value_vars=[\"1999\", \"2000\"],\n",
+"    value_name=\"cases\",\n",
+")"
+]
+},
+{
+"cell_type": "markdown",
+"id": "74d81b30",
+"metadata": {},
+"source": [
+"We now have one observation per row, and one variable per column: tidy!"
+]
+},
+{
+"cell_type": "markdown",
+"id": "488f5f10",
+"metadata": {},
+"source": [
+"### A simpler wide to long\n",
 "\n",
 "If you don't want the headscratching of `melt()`, there's also `wide_to_long()`, which is really useful for typical data cleaning cases where you have data like this:"
 ]
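The `wide_to_long()` example itself falls outside this diff's hunks; as a rough sketch of the call shape it handles (the `income`/`id` column names are invented for illustration):

```python
import pandas as pd

# wide_to_long() expects "stub + suffix" column names, e.g. income1970, income1980.
df_w = pd.DataFrame({"id": [1, 2], "income1970": [12, 15], "income1980": [19, 24]})
# stubnames names the repeated prefix; i is the row id; j names the new suffix column.
pd.wide_to_long(df_w, stubnames="income", i="id", j="year")
```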
@@ -279,19 +344,73 @@
 "id": "39a210bb",
 "metadata": {},
 "source": [
-"### Pivoting data from tidy to, err, untidy data\n",
+"### Pivoting data from long to wide\n",
 "\n",
-"At the start of this chapter, we said you should use tidy data--one row per observation, one column per variable--whenever you can. But there are times when you will want to take your lovingly prepared tidy data and pivot it into a wider format. `pivot()` and `pivot_table()` help you to do that.\n",
+"`pivot()` and `pivot_table()` help you to sort out data in which a single observation is scattered over multiple rows.\n",
 "\n",
-"![](https://pandas.pydata.org/docs/_images/reshaping_pivot.png)\n",
-"\n",
-"This can be especially useful for time series data, where operations like `shift()` or `diff()` are typically applied assuming that an entry in one row follows (in time) from the one above. Here's an example:"
+"![](https://pandas.pydata.org/docs/_images/reshaping_pivot.png)\n"
+]
+},
+{
+"cell_type": "markdown",
+"id": "8acbcedf",
+"metadata": {},
+"source": [
+"Here's an example dataframe where observations are spread over multiple rows:"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "405da609",
+"id": "fa612456",
+"metadata": {},
+"outputs": [],
+"source": [
+"df_tb_cp = pd.read_parquet(Path(\"data/who_tb_case_and_pop.parquet\"))\n",
+"df_tb_cp.head()"
+]
+},
+{
+"cell_type": "markdown",
+"id": "0d4a4077",
+"metadata": {},
+"source": [
+"You see that we have, for each year-country pair, \"case\" and \"population\" in different rows."
+]
+},
+{
+"cell_type": "markdown",
+"id": "e7c9ed1b",
+"metadata": {},
+"source": [
+"Now let's pivot this to see the difference:"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "e584cf37",
+"metadata": {},
+"outputs": [],
+"source": [
+"pivoted = df_tb_cp.pivot(\n",
+"    index=[\"country\", \"year\"], columns=[\"type\"], values=\"count\"\n",
+").reset_index()\n",
+"pivoted"
+]
+},
+{
+"cell_type": "markdown",
+"id": "d82307c0",
+"metadata": {},
+"source": [
+"Pivots are especially useful for time series data, where operations like `shift()` or `diff()` are typically applied assuming that an entry in one row follows (in time) from the one above. When we do `shift()` we often want to shift a single variable in time, but if a single observation (in this case a date) is spread over multiple rows, the timing is going to go awry. Let's see an example."
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "97c6d139",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -300,13 +419,12 @@
 "data = {\n",
 "    \"value\": np.random.randn(20),\n",
 "    \"variable\": [\"A\"] * 10 + [\"B\"] * 10,\n",
-"    \"category\": np.random.choice([\"type1\", \"type2\", \"type3\", \"type4\"], 20),\n",
 "    \"date\": (\n",
-"        list(pd.date_range(\"1/1/2000\", periods=10, freq=\"M\"))\n",
-"        + list(pd.date_range(\"1/1/2000\", periods=10, freq=\"M\"))\n",
+"        list(pd.date_range(\"1/1/2000\", periods=10, freq=\"ME\"))\n",
+"        + list(pd.date_range(\"1/1/2000\", periods=10, freq=\"ME\"))\n",
 "    ),\n",
 "}\n",
-"df = pd.DataFrame(data, columns=[\"date\", \"variable\", \"category\", \"value\"])\n",
+"df = pd.DataFrame(data, columns=[\"date\", \"variable\", \"value\"])\n",
 "df.sample(5)"
 ]
 },
@@ -315,7 +433,7 @@
 "id": "90a5b930",
 "metadata": {},
 "source": [
-"If we just run `shift()` on this, it's going to shift variable B's and A's together even though these overlap in time. So we pivot to a wider format (and then we can shift safely)."
+"If we just run `shift()` on the above, it's going to shift variable B's and A's together even though these overlap in time and are different variables. So we pivot to a wider format (and then we can shift in time safely)."
 ]
 },
 {
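The pivot-then-shift cell itself sits outside the changed hunks; a minimal sketch of that step, assuming the `df` built in the cell above:

```python
# One column per variable, date as the index, then shift each series down one period.
wide_df = df.pivot(index="date", columns="variable", values="value")
shifted = wide_df.shift(1)
# The first row of `shifted` is NaN: there is no earlier period to pull values from.
shifted.head()
```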
@@ -333,26 +451,26 @@
 "id": "841c4821",
 "metadata": {},
 "source": [
-"To go back to the original structure, albeit without the `category` columns, apply `.unstack().reset_index()`.\n",
+"```{admonition} Exercise\n",
+"Why is the first entry NaN?\n",
+"```\n",
+"\n",
 "\n",
 "```{admonition} Exercise\n",
-"Perform a `pivot()` that applies to both the `variable` and `category` columns. (Hint: remember that you will need to pass multiple objects via a list.)\n",
+"Perform a `pivot()` that applies to both the `variable` and `category` columns in the example from above, where category is defined such that `df[\"category\"] = np.random.choice([\"type1\", \"type2\", \"type3\", \"type4\"], 20)`. (Hint: remember that you will need to pass multiple objects via a list.)\n",
 "```"
 ]
 }
 ],
 "metadata": {
-"interpreter": {
-"hash": "9d7534ecd9fbc7d385378f8400cf4d6cb9c6175408a574f1c99c5269f08771cc"
-},
 "jupytext": {
 "cell_metadata_filter": "-all",
 "encoding": "# -*- coding: utf-8 -*-",
 "formats": "md:myst",
 "main_language": "python"
 },
 "kernelspec": {
-"display_name": "Python 3 (ipykernel)",
+"display_name": ".venv",
 "language": "python",
 "name": "python3"
 },
@@ -366,7 +484,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.10.12"
+"version": "3.10.0"
 },
 "toc-showtags": true
 },

data-visualise.ipynb (+3 -6)
@@ -445,7 +445,7 @@
 "    How many columns?\n",
 "\n",
 "2. What does the `bill_depth_mm` variable in the `penguins` data frame describe?\n",
-"    Read the help for `?penguins` to find out.\n",
+"    Read the help for `load_penguins()` to find out, e.g. run `help(load_penguins)`.\n",
 "\n",
 "3. Make a scatterplot of `bill_depth_mm` vs. `bill_length_mm`.\n",
 "    That is, make a scatterplot with `bill_depth_mm` on the y-axis and `bill_length_mm` on the x-axis.\n",
@@ -1101,17 +1101,14 @@
 }
 ],
 "metadata": {
-"interpreter": {
-"hash": "9d7534ecd9fbc7d385378f8400cf4d6cb9c6175408a574f1c99c5269f08771cc"
-},
 "jupytext": {
 "cell_metadata_filter": "-all",
 "encoding": "# -*- coding: utf-8 -*-",
 "formats": "md:myst",
 "main_language": "python"
 },
 "kernelspec": {
-"display_name": "Python 3 (ipykernel)",
+"display_name": ".venv",
 "language": "python",
 "name": "python3"
 },
@@ -1125,7 +1122,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.10.12"
+"version": "3.10.0"
 },
 "toc-showtags": true
 },

data/who_tb_case_and_pop.parquet (3.81 KB, binary file not shown)

data/who_tb_cases.parquet (1.29 KB, binary file not shown)

welcome.md (+1)
@@ -24,3 +24,4 @@ We thank the following contributors:
 - [William Chiu](https://github.com/crossxwill)
 - [udurraniAtPresage](https://github.com/udurraniAtPresage)
 - [Josh Holman](https://github.com/TheJolman)
+- [Kenytt Avery](https://github.com/ProfAvery)
