ProjectPythia · ktyle · Nov 20, 2023 · Nov 20, 2023 · clyne · Nov 21, 2023
@@ -56,6 +56,7 @@ parts:
       - file: core/pandas
         sections:
           - file: core/pandas/pandas
+          - file: core/pandas/pandas_json
       - file: core/data-formats
         sections:
           - file: core/data-formats/netcdf-cf

@@ -0,0 +1,392 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "6e124235-3846-4fcb-b533-10fa5856b4b4",
+   "metadata": {},
+   "source": [
+    "\n",
+    "# Pandas: Working with a JSON file"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6e75bb80-da84-47a9-ae2d-210eb06d492e",
+   "metadata": {},
+   "source": [
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9d226e98-85a1-4197-a757-448bd2bf4563",
+   "metadata": {},
+   "source": [
+    "## Overview\n",
+    "In this notebook, we will create a [Pandas Dataframe](https://pandas.pydata.org/docs/user_guide/dsintro.html#dataframe) from a remotely-served [JSON](https://www.json.org/) file. This particular file contains forecasted [solar wind](https://www.swpc.noaa.gov/phenomena/solar-wind) parameters from NOAA's [Space Weather Prediction Center](https://www.swpc.noaa.gov).\n",
+    "\n",
+    "1. Read in a JSON file\n",
+    "1. Reformat the `Dataframe`\n",
+    "1. Visualize the dataset"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "daeabc7d-d4f6-4f9f-aad7-679123b9d2fb",
+   "metadata": {},
+   "source": [
+    "## Prerequisites\n",
+    "\n",
+    "| Concepts | Importance | Notes |\n",
+    "| --- | --- | --- |\n",
+    "| [Pandas](https://foundations.projectpythia.org/core/pandas/pandas.html) | Necessary | |\n",
+    "\n",
+    "- **Time to learn**: 10 minutes\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7b875611-44ef-4b1f-9453-f2fa84bb4d82",
+   "metadata": {},
+   "source": [
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c732c9d1-0e00-4d9b-8e29-73ce71c99499",
+   "metadata": {},
+   "source": [
+    "## Imports"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4d48491f-4332-4eff-af72-0382c6c5794a",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "import pandas as pd"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "690e69eb-5ec0-458d-962f-a25cdf8983dc",
+   "metadata": {},
+   "source": [
+    "## Read in a JSON file"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "630ace3c-7e81-45a8-a884-a045d7afbae6",
+   "metadata": {},
+   "source": [
+    "### NOAA's SWPC has a variety of forecast output in JSON format. Here, we create a `Dataframe` Pandas' `read_json` method from the current 1-day plasma forecast."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3be45f7e-8f1b-4645-8eff-6d7d0c6976a4",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "df = pd.read_json(\n",
+    "    \"https://services.swpc.noaa.gov/products/solar-wind/plasma-1-day.json\"\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2128b21f-7d34-45f6-b08f-23287a7761ea",
+   "metadata": {},
+   "source": [
+    "Examine the `Dataframe`"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "deea1ad0-78ef-4b47-8ed4-951b2f8e2a5b",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "df"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1b34e2ce-19a9-46b6-a346-da9cc814a8a6",
+   "metadata": {},
+   "source": [
+    "## Reformat the `Dataframe`"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "61ff9856-2a64-4606-8c13-4cc1bbe6f384",
+   "metadata": {},
+   "source": [
+    "Notice that the column headers look to be in the `Dataframe`'s first row. Let's modify it."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5d198bce-c48e-4c77-9785-ab26bfaac669",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "# Set the columns to be the values of the first row. Then drop that first row.\n",
+    "df = df.rename(columns=df.iloc[0]).drop(df.index[0])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e3c8f341-a5f6-46f0-ac1d-2d1feeb4631d",
+   "metadata": {},
+   "source": [
+    "Examine the reformatted `Dataframe`"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7a457cb3-6203-45d5-9467-c8c5ffc07e52",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "df"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "43bfed63-0fa2-4d3a-8a0b-162386dace6f",
+   "metadata": {},
+   "source": [
+    "### Set the `Dataframe`'s index to the timestamped column."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5e1f2933-6193-4f2d-b012-7c5f18a3999b",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "df.index"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4b3ade9e-eebd-49d0-a321-624939d42d7d",
+   "metadata": {},
+   "source": [
+    "Currently, the `Dataframe` has a *default index* (i.e., a range of integers). For time series data (i.e., time is the independent variable), it is [good practice](https://pandas.pydata.org/docs/user_guide/timeseries.html) to use a time-based column as the index."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "83cdc00a-2a38-4ef7-ac33-c2f6875c2dc3",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "df = df.set_index('time_tag')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9abc6605-9676-48ef-a9c5-57ef13b21eed",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "df"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7d4e6843-638b-4801-af35-9ddbc191378a",
+   "metadata": {},
+   "source": [
+    "### Check and edit the `dtypes` of the independent and dependent variables"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7fc15d05-b3cb-4575-a4c5-c25123613bf7",
+   "metadata": {},
+   "source": [
+    "In this case, the `Dataframe`'s index corresponds to the independent variable, and the columns correspond to the dependent variables."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b8341c2d-7d4b-4960-93e6-b90670aab9a3",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "df.index"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0e165dc3-1a51-406d-9b31-b92a16721a1d",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "df.dtypes"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "90c69195-8111-4c0f-bb6f-17aa036b7147",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "They are all `object`s ... and as a result won't be amenable to typical time-series visualization methods. Change them to more appropriate `dtype`s ... `float32` for the dependent variables, and `datetime64` for the time-based one."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "32beea01-3390-4eb6-9d62-090ad70ba2a9",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "for col in df.columns:\n",
+    "    df[col] = df[col].astype(\"float32\")\n",
+    "df.index = pd.to_datetime(df.index)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "eba917fd-6167-4726-ae0f-79c064d4ef72",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "df.index = pd.to_datetime(df.index)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "38611e12-3f66-4e39-ac83-2021df2bf63e",
+   "metadata": {},
+   "source": [
+    "## Visualize the dataset"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "634749f1-f613-4e90-aff8-6efdd038d251",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "df.temperature.plot(figsize=(10, 8));"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5d085493-0e98-4e6f-a190-a948bd44f53a",
+   "metadata": {},
+   "source": [
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5a43f367-1bdf-46b6-b253-f40ff965ff02",
+   "metadata": {},
+   "source": [
+    "## Summary\n",
+    "Pandas has several reader functions for differently-formatted tabular datasets. In this notebook, we created a `Dataframe` via Pandas `read_json` function, and then manipulated the `Dataframe` to allow for a useful time-series visualization."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "419af5fe-4cfc-466a-a4b0-1df66ddae8f8",
+   "metadata": {},
+   "source": [
+    "<div class=\"admonition alert alert-warning\">\n",
+    "    <p class=\"admonition-title\" style=\"font-weight:bold\">Note:</p>\n",
+    "    There is no strict format specification for JSON files. The strategy we followed to create and reformat the <code>Dataframe</code> in this notebook will likely need to change for other JSON-formatted datasets you may encounter!\n",
+    "</div>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "62c4be7e-84a8-497d-a799-a74e5201c567",
+   "metadata": {},
+   "source": [
+    "### What's next?\n",
+    "Future [Project Pythia Foundations](https://foundations.projectpythia.org) Pandas notebooks will explore additional file format-specific reader methods."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "dd8c3c60-d703-4b22-b796-8c385130cfa2",
+   "metadata": {},
+   "source": [
+    "## Resources and references\n",
+    "1. [pandas](https://pandas.pydata.org)\n",
+    "1. [JSON](https://json.io)\n",
+    "1. [NOAA SWPC](https://www.swpc.noaa.gov)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.4"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}