|
74 | 74 | "\n",
|
75 | 75 | "Tidy data aren't going to be appropriate *every* time and in every case, but they're a really, really good default for tabular data. Once you use it as your default, it's easier to think about how to perform subsequent operations.\n",
|
76 | 76 | "\n",
|
77 |
| - "Having said that tidy data are great, they are, but one of **pandas**' advantages relative to other data analysis libraries is that it isn't *too* tied to tidy data and can navigate awkward non-tidy data manipulation tasks happily too." |
| 77 | + "Having said that tidy data are great, they are, but one of **pandas**' advantages relative to other data analysis libraries is that it isn't *too* tied to tidy data and can navigate awkward non-tidy data manipulation tasks happily too.\n", |
| 78 | + "\n", |
| 79 | + "There are two common problems you find in data that are ingested that make them not tidy:\n", |
| 80 | + "\n", |
| 81 | + "1. A variable might be spread across multiple columns.\n", |
| 82 | + "2. An observation might be scattered across multiple rows.\n", |
| 83 | + "\n", |
| 84 | + "For the former, we need to \"melt\" the wide data, with multiple columns, into long data.\n", |
| 85 | + "\n", |
| 86 | + "For the latter, we need to unstack or pivot the multiple rows into columns (ie go from long to wide.)\n", |
| 87 | + "\n", |
| 88 | + "We'll see both below." |
78 | 89 | ]
|
79 | 90 | },
|
80 | 91 | {
|
81 | 92 | "cell_type": "markdown",
|
82 | 93 | "id": "deb8cf13",
|
83 | 94 | "metadata": {},
|
84 | 95 | "source": [
|
85 |
| - "## Make Data Tidy with **pandas**" |
| 96 | + "## Tools to Make Data Tidy with **pandas**" |
86 | 97 | ]
|
87 | 98 | },
|
88 | 99 | {
|
|
92 | 103 | "source": [
|
93 | 104 | "### Melt\n",
|
94 | 105 | "\n",
|
95 |
| - "`melt()` can help you go from untidy to tidy data (from wide data to long data), and is a *really* good one to remember.\n", |
| 106 | + "`melt()` can help you go from \"wider\" data to \"longer\" data, and is a *really* good one to remember.\n", |
96 | 107 | "\n",
|
97 | 108 | "\n",
|
98 | 109 | "\n",
|
|
132 | 143 | "Perform a `melt()` that uses `job` as the id instead of `first` and `last`.\n",
|
133 | 144 | "```\n",
|
134 | 145 | "\n",
|
135 |
| - "### Wide to Long\n", |
| 146 | + "How does this relate to tidy data? Sometimes you'll have a variable spread over multiple columns that you want to turn tidy. Let's look at this example that uses cases of [tuburculosis from the World Health Organisation](https://www.who.int/teams/global-tuberculosis-programme/data).\n", |
| 147 | + "\n", |
| 148 | + "First let's open the data and look at the top of the file." |
| 149 | + ] |
| 150 | + }, |
| 151 | + { |
| 152 | + "cell_type": "code", |
| 153 | + "execution_count": null, |
| 154 | + "id": "bfa121cf", |
| 155 | + "metadata": {}, |
| 156 | + "outputs": [], |
| 157 | + "source": [ |
| 158 | + "from pathlib import Path\n", |
| 159 | + "\n", |
| 160 | + "df_tb = pd.read_parquet(Path(\"data/who_tb_cases.parquet\"))\n", |
| 161 | + "df_tb.head()" |
| 162 | + ] |
| 163 | + }, |
| 164 | + { |
| 165 | + "cell_type": "markdown", |
| 166 | + "id": "583e419d", |
| 167 | + "metadata": {}, |
| 168 | + "source": [ |
| 169 | + "You can see that we have two columns for a single variable, year. Let's now melt this." |
| 170 | + ] |
| 171 | + }, |
| 172 | + { |
| 173 | + "cell_type": "code", |
| 174 | + "execution_count": null, |
| 175 | + "id": "dc03ccd9", |
| 176 | + "metadata": {}, |
| 177 | + "outputs": [], |
| 178 | + "source": [ |
| 179 | + "df_tb.melt(\n", |
| 180 | + " id_vars=[\"country\"],\n", |
| 181 | + " var_name=\"year\",\n", |
| 182 | + " value_vars=[\"1999\", \"2000\"],\n", |
| 183 | + " value_name=\"cases\",\n", |
| 184 | + ")" |
| 185 | + ] |
| 186 | + }, |
| 187 | + { |
| 188 | + "cell_type": "markdown", |
| 189 | + "id": "74d81b30", |
| 190 | + "metadata": {}, |
| 191 | + "source": [ |
| 192 | + "We now have one observation per row, and one variable per column: tidy!" |
| 193 | + ] |
| 194 | + }, |
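| | + { |
| | + "cell_type": "markdown", |
| | + "id": "4f8a2c1e", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "One small wrinkle worth knowing about: because the year values started life as column *names*, `melt()` delivers them as strings. Here's a quick clean-up sketch, assuming we assign the melted result to a variable of our choosing, `df_tb_long`:" |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "id": "9b3d7e2a", |
| | + "metadata": {}, |
| | + "outputs": [], |
| | + "source": [ |
| | + "# keep the melted result this time (df_tb_long is a name of our choosing)\n", |
| | + "df_tb_long = df_tb.melt(\n", |
| | + "    id_vars=[\"country\"],\n", |
| | + "    var_name=\"year\",\n", |
| | + "    value_vars=[\"1999\", \"2000\"],\n", |
| | + "    value_name=\"cases\",\n", |
| | + ")\n", |
| | + "# the year values came from column names, so they arrive as strings\n", |
| | + "df_tb_long[\"year\"] = df_tb_long[\"year\"].astype(int)\n", |
| | + "df_tb_long.dtypes" |
| | + ] |
| | + }, |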
| 195 | + { |
| 196 | + "cell_type": "markdown", |
| 197 | + "id": "488f5f10", |
| 198 | + "metadata": {}, |
| 199 | + "source": [ |
| 200 | + "### A simpler wide to long\n", |
136 | 201 | "\n",
|
137 | 202 | "If you don't want the headscratching of `melt()`, there's also `wide_to_long()`, which is really useful for typical data cleaning cases where you have data like this:"
|
138 | 203 | ]
|
|
279 | 344 | "id": "39a210bb",
|
280 | 345 | "metadata": {},
|
281 | 346 | "source": [
|
282 |
| - "### Pivoting data from tidy to, err, untidy data\n", |
| 347 | + "### Pivoting data from long to wide\n", |
283 | 348 | "\n",
|
284 |
| - "At the start of this chapter, we said you should use tidy data--one row per observation, one column per variable--whenever you can. But there are times when you will want to take your lovingly prepared tidy data and pivot it into a wider format. `pivot()` and `pivot_table()` help you to do that.\n", |
| 349 | + "`pivot()` and `pivot_table()` help you to sort out data in which a single observation is scattered over multiple rows.\n", |
285 | 350 | "\n",
|
286 |
| - "\n", |
287 |
| - "\n", |
288 |
| - "This can be especially useful for time series data, where operations like `shift()` or `diff()` are typically applied assuming that an entry in one row follows (in time) from the one above. Here's an example:" |
| 351 | + "\n" |
| 352 | + ] |
| 353 | + }, |
| 354 | + { |
| 355 | + "cell_type": "markdown", |
| 356 | + "id": "8acbcedf", |
| 357 | + "metadata": {}, |
| 358 | + "source": [ |
| 359 | + "Here's an example dataframe where observations are spread over multiple rows:" |
289 | 360 | ]
|
290 | 361 | },
|
291 | 362 | {
|
292 | 363 | "cell_type": "code",
|
293 | 364 | "execution_count": null,
|
294 |
| - "id": "405da609", |
| 365 | + "id": "fa612456", |
| 366 | + "metadata": {}, |
| 367 | + "outputs": [], |
| 368 | + "source": [ |
| 369 | + "df_tb_cp = pd.read_parquet(Path(\"data/who_tb_case_and_pop.parquet\"))\n", |
| 370 | + "df_tb_cp.head()" |
| 371 | + ] |
| 372 | + }, |
| 373 | + { |
| 374 | + "cell_type": "markdown", |
| 375 | + "id": "0d4a4077", |
| 376 | + "metadata": {}, |
| 377 | + "source": [ |
| 378 | + "You see that we have, for each year-country, \"case\" and \"population\" in different rows." |
| 379 | + ] |
| 380 | + }, |
| 381 | + { |
| 382 | + "cell_type": "markdown", |
| 383 | + "id": "e7c9ed1b", |
| 384 | + "metadata": {}, |
| 385 | + "source": [ |
| 386 | + "Now let's pivot this to see the difference:" |
| 387 | + ] |
| 388 | + }, |
| 389 | + { |
| 390 | + "cell_type": "code", |
| 391 | + "execution_count": null, |
| 392 | + "id": "e584cf37", |
| 393 | + "metadata": {}, |
| 394 | + "outputs": [], |
| 395 | + "source": [ |
| 396 | + "pivoted = df_tb_cp.pivot(\n", |
| 397 | + " index=[\"country\", \"year\"], columns=[\"type\"], values=\"count\"\n", |
| 398 | + ").reset_index()\n", |
| 399 | + "pivoted" |
| 400 | + ] |
| 401 | + }, |
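| | + { |
| | + "cell_type": "markdown", |
| | + "id": "1d5c8f3b", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "A closely related tool is `pivot_table()`, which performs the same reshape but can also *aggregate* when several rows land in the same cell. Here's a sketch on the same data; each cell here holds exactly one value, so the `aggfunc` has nothing real to do:" |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "id": "6a4e9c7d", |
| | + "metadata": {}, |
| | + "outputs": [], |
| | + "source": [ |
| | + "# pivot_table() reshapes like pivot(), but aggregates any duplicate cells\n", |
| | + "df_tb_cp.pivot_table(\n", |
| | + "    index=[\"country\", \"year\"], columns=\"type\", values=\"count\", aggfunc=\"sum\"\n", |
| | + ").reset_index()\n", |
| | + "\n", |
| | + "# an equivalent route via a MultiIndex and the unstack() mentioned earlier:\n", |
| | + "# df_tb_cp.set_index([\"country\", \"year\", \"type\"])[\"count\"].unstack(\"type\").reset_index()" |
| | + ] |
| | + }, |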
| 402 | + { |
| 403 | + "cell_type": "markdown", |
| 404 | + "id": "d82307c0", |
| 405 | + "metadata": {}, |
| 406 | + "source": [ |
| 407 | + "Pivots are especially useful for time series data, where operations like `shift()` or `diff()` are typically applied assuming that an entry in one row follows (in time) from the one above. When we do `shift()` we often want to shift a single variable in time, but if a single observation (in this case a date) is over multiple rows, the timing is going go awry. Let's see an example." |
| 408 | + ] |
| 409 | + }, |
| 410 | + { |
| 411 | + "cell_type": "code", |
| 412 | + "execution_count": null, |
| 413 | + "id": "97c6d139", |
295 | 414 | "metadata": {},
|
296 | 415 | "outputs": [],
|
297 | 416 | "source": [
|
|
300 | 419 | "data = {\n",
|
301 | 420 | " \"value\": np.random.randn(20),\n",
|
302 | 421 | " \"variable\": [\"A\"] * 10 + [\"B\"] * 10,\n",
|
303 |
| - " \"category\": np.random.choice([\"type1\", \"type2\", \"type3\", \"type4\"], 20),\n", |
304 | 422 | " \"date\": (\n",
|
305 |
| - " list(pd.date_range(\"1/1/2000\", periods=10, freq=\"M\"))\n", |
306 |
| - " + list(pd.date_range(\"1/1/2000\", periods=10, freq=\"M\"))\n", |
| 423 | + " list(pd.date_range(\"1/1/2000\", periods=10, freq=\"ME\"))\n", |
| 424 | + " + list(pd.date_range(\"1/1/2000\", periods=10, freq=\"ME\"))\n", |
307 | 425 | " ),\n",
|
308 | 426 | "}\n",
|
309 |
| - "df = pd.DataFrame(data, columns=[\"date\", \"variable\", \"category\", \"value\"])\n", |
| 427 | + "df = pd.DataFrame(data, columns=[\"date\", \"variable\", \"value\"])\n", |
310 | 428 | "df.sample(5)"
|
311 | 429 | ]
|
312 | 430 | },
|
|
315 | 433 | "id": "90a5b930",
|
316 | 434 | "metadata": {},
|
317 | 435 | "source": [
|
318 |
| - "If we just run `shift()` on this, it's going to shift variable B's and A's together even though these overlap in time. So we pivot to a wider format (and then we can shift safely)." |
| 436 | + "If we just run `shift()` on the above, it's going to shift variable B's and A's together even though these overlap in time and are different variables. So we pivot to a wider format (and then we can shift in time safely)." |
319 | 437 | ]
|
320 | 438 | },
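| | + { |
| | + "cell_type": "markdown", |
| | + "id": "2e7b5a9f", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "Before pivoting, here's a quick sketch of the problem itself: `shift()` only looks at row order, so shifting the stacked column lets the last \"A\" value spill into the first \"B\" row." |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "id": "8c1f6d4e", |
| | + "metadata": {}, |
| | + "outputs": [], |
| | + "source": [ |
| | + "# naive shift on the stacked column: the last A value lands in the first B row\n", |
| | + "df.assign(naive_shift=df[\"value\"].shift(1)).iloc[8:12]" |
| | + ] |
| | + }, |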
|
321 | 439 | {
|
|
333 | 451 | "id": "841c4821",
|
334 | 452 | "metadata": {},
|
335 | 453 | "source": [
|
336 |
| - "To go back to the original structure, albeit without the `category` columns, apply `.unstack().reset_index()`.\n", |
| 454 | + "```{admonition} Exercise\n", |
| 455 | + "Why is the first entry NaN?\n", |
| 456 | + "```\n", |
| 457 | + "\n", |
337 | 458 | "\n",
|
338 | 459 | "```{admonition} Exercise\n",
|
339 |
| - "Perform a `pivot()` that applies to both the `variable` and `category` columns. (Hint: remember that you will need to pass multiple objects via a list.)\n", |
| 460 | + "Perform a `pivot()` that applies to both the `variable` and `category` columns in the example from above where category is defined such that `df[\"category\"] = np.random.choice([\"type1\", \"type2\", \"type3\", \"type4\"], 20). (Hint: remember that you will need to pass multiple objects via a list.)\n", |
340 | 461 | "```"
|
341 | 462 | ]
|
342 | 463 | }
|
343 | 464 | ],
|
344 | 465 | "metadata": {
|
345 |
| - "interpreter": { |
346 |
| - "hash": "9d7534ecd9fbc7d385378f8400cf4d6cb9c6175408a574f1c99c5269f08771cc" |
347 |
| - }, |
348 | 466 | "jupytext": {
|
349 | 467 | "cell_metadata_filter": "-all",
|
350 | 468 | "encoding": "# -*- coding: utf-8 -*-",
|
351 | 469 | "formats": "md:myst",
|
352 | 470 | "main_language": "python"
|
353 | 471 | },
|
354 | 472 | "kernelspec": {
|
355 |
| - "display_name": "Python 3 (ipykernel)", |
| 473 | + "display_name": ".venv", |
356 | 474 | "language": "python",
|
357 | 475 | "name": "python3"
|
358 | 476 | },
|
|
366 | 484 | "name": "python",
|
367 | 485 | "nbconvert_exporter": "python",
|
368 | 486 | "pygments_lexer": "ipython3",
|
369 |
| - "version": "3.10.12" |
| 487 | + "version": "3.10.0" |
370 | 488 | },
|
371 | 489 | "toc-showtags": true
|
372 | 490 | },
|
|