|
5 | 5 | "colab": { |
6 | 6 | "name": "Feature Engineering.ipynb", |
7 | 7 | "provenance": [], |
8 | | - "collapsed_sections": [], |
9 | 8 | "include_colab_link": true |
10 | 9 | }, |
11 | 10 | "kernelspec": { |
|
24 | 23 | "colab_type": "text" |
25 | 24 | }, |
26 | 25 | "source": [ |
27 | | - "<a href=\"https://colab.research.google.com/github/datacommonsorg/api-python/blob/master/notebooks/intro_data_science/Feature_Engineering.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" |
| 26 | + "<a href=\"https://colab.research.google.com/github/datacommonsorg/api-python/blob/tutorials/notebooks/intro_data_science/Feature_Engineering.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" |
28 | 27 | ] |
29 | 28 | }, |
30 | 29 | { |
|
45 | 44 | "source": [ |
46 | 45 | "# Exploring Feature Engineering\n", |
47 | 46 | "\n", |
48 | | - "Welcome! In this lesson, we'll be exploring various techniques for feature engineering. We'll be walking through the steps one takes to set up your data for your machine learning models, starting with acquiring and exploring the data, working through different transformations and feature representation choices, and analyzing how those design decisions affect our model's results. \n", |
| 47 | + "Welcome! In this lesson, we'll explore various techniques for feature engineering. We'll walk through the steps of setting up your data for machine learning models: acquiring and exploring the data, working through different transformations and feature representation choices, and analyzing how those design decisions affect our model's results.\n",
49 | 48 | "\n", |
50 | 49 | "## Learning Objectives:\n", |
51 | 50 | "In this lesson, we'll be covering\n", |
|
62 | 61 | "\n", |
63 | 62 | "And for help with Pandas and manipulating data frames, take a look at the [Pandas Documentation](https://pandas.pydata.org/docs/reference/index.html).\n", |
64 | 63 | "\n", |
65 | | - "We'll be using the scikit-learn library for implementing our models today. Documentation can be found [here](https://scikit-learn.org/stable/modules/classes.html). \n", |
| 64 | + "We'll be using the scikit-learn library for implementing our models today. Documentation can be found [here](https://scikit-learn.org/stable/modules/classes.html).\n", |
66 | 65 | "\n", |
67 | 66 | "As usual, if you have any other questions, please reach out to your course staff!\n" |
68 | 67 | ] |
|
87 | 86 | { |
88 | 87 | "cell_type": "code", |
89 | 88 | "metadata": { |
90 | | - "id": "gUETYfc0EuGg", |
91 | | - "colab": { |
92 | | - "base_uri": "https://localhost:8080/" |
93 | | - }, |
94 | | - "outputId": "cd5c105a-79ab-4959-b511-d11b4ff99a3e" |
| 89 | + "id": "gUETYfc0EuGg" |
95 | 90 | }, |
96 | 91 | "source": [ |
97 | 92 | "# We need to install the Data Commons API, since they don't ship natively with\n", |
|
106 | 101 | "# Import the two methods from heatmap library to make pretty correlation plots\n", |
107 | 102 | "!pip install heatmapz --upgrade --quiet" |
108 | 103 | ], |
109 | | - "execution_count": null, |
110 | | - "outputs": [ |
111 | | - { |
112 | | - "output_type": "stream", |
113 | | - "text": [ |
114 | | - "Collecting heatmapz\n", |
115 | | - " Downloading heatmapz-0.0.4-py3-none-any.whl (5.8 kB)\n", |
116 | | - "Requirement already satisfied: seaborn in /usr/local/lib/python3.7/dist-packages (from heatmapz) (0.11.1)\n", |
117 | | - "Requirement already satisfied: matplotlib>=3.0.3 in /usr/local/lib/python3.7/dist-packages (from heatmapz) (3.2.2)\n", |
118 | | - "Requirement already satisfied: pandas in /usr/local/lib/python3.7/dist-packages (from heatmapz) (1.1.5)\n", |
119 | | - "Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=3.0.3->heatmapz) (0.10.0)\n", |
120 | | - "Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=3.0.3->heatmapz) (2.4.7)\n", |
121 | | - "Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=3.0.3->heatmapz) (2.8.2)\n", |
122 | | - "Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=3.0.3->heatmapz) (1.3.1)\n", |
123 | | - "Requirement already satisfied: numpy>=1.11 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=3.0.3->heatmapz) (1.19.5)\n", |
124 | | - "Requirement already satisfied: six in /usr/local/lib/python3.7/dist-packages (from cycler>=0.10->matplotlib>=3.0.3->heatmapz) (1.15.0)\n", |
125 | | - "Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/dist-packages (from pandas->heatmapz) (2018.9)\n", |
126 | | - "Requirement already satisfied: scipy>=1.0 in /usr/local/lib/python3.7/dist-packages (from seaborn->heatmapz) (1.4.1)\n", |
127 | | - "Installing collected packages: heatmapz\n", |
128 | | - "Successfully installed heatmapz-0.0.4\n" |
129 | | - ], |
130 | | - "name": "stdout" |
131 | | - } |
132 | | - ] |
| 104 | + "execution_count": 11, |
| 105 | + "outputs": [] |
133 | 106 | }, |
134 | 107 | { |
135 | 108 | "cell_type": "code", |
|
155 | 128 | "import matplotlib.pyplot as plt\n", |
156 | 129 | "from heatmap import heatmap, corrplot" |
157 | 130 | ], |
158 | | - "execution_count": null, |
| 131 | + "execution_count": 12, |
159 | 132 | "outputs": [] |
160 | 133 | }, |
161 | 134 | { |
|
189 | 162 | "colab": { |
190 | 163 | "base_uri": "https://localhost:8080/" |
191 | 164 | }, |
192 | | - "outputId": "996adc68-c555-4d2c-8a35-95b76d5f6bcc" |
| 165 | + "outputId": "09c53b77-c1e7-4732-b400-6e5be7cb1b79" |
193 | 166 | }, |
194 | 167 | "source": [ |
195 | 168 | "# Choose your state:\n", |
|
199 | 172 | "county_dcids = dc.get_places_in([your_state_dcid], \"County\")[your_state_dcid]\n", |
200 | 173 | "print(county_dcids)" |
201 | 174 | ], |
202 | | - "execution_count": null, |
| 175 | + "execution_count": 13, |
203 | 176 | "outputs": [ |
204 | 177 | { |
205 | 178 | "output_type": "stream", |
| 179 | + "name": "stdout", |
206 | 180 | "text": [ |
207 | 181 | "['geoId/06001', 'geoId/06003', 'geoId/06005', 'geoId/06007', 'geoId/06009', 'geoId/06011', 'geoId/06013', 'geoId/06015', 'geoId/06017', 'geoId/06019', 'geoId/06021', 'geoId/06023', 'geoId/06025', 'geoId/06027', 'geoId/06029', 'geoId/06031', 'geoId/06033', 'geoId/06035', 'geoId/06037', 'geoId/06039', 'geoId/06041', 'geoId/06043', 'geoId/06045', 'geoId/06047', 'geoId/06049', 'geoId/06051', 'geoId/06053', 'geoId/06055', 'geoId/06057', 'geoId/06059', 'geoId/06061', 'geoId/06063', 'geoId/06065', 'geoId/06067', 'geoId/06069', 'geoId/06071', 'geoId/06073', 'geoId/06075', 'geoId/06077', 'geoId/06079', 'geoId/06081', 'geoId/06083', 'geoId/06085', 'geoId/06087', 'geoId/06089', 'geoId/06091', 'geoId/06093', 'geoId/06095', 'geoId/06097', 'geoId/06099', 'geoId/06101', 'geoId/06103', 'geoId/06105', 'geoId/06107', 'geoId/06109', 'geoId/06111', 'geoId/06113', 'geoId/06115']\n" |
208 | | - ], |
209 | | - "name": "stdout" |
| 182 | + ] |
210 | 183 | } |
211 | 184 | ] |
212 | 185 | }, |
|
261 | 234 | "# with one column per feature.\n", |
262 | 235 | "\n", |
263 | 236 | "stat_vars_to_query = [\n", |
264 | | - " \"CumulativeCount_MedicalTest_ConditionCOVID_19_Positive\",\n", |
| 237 | + " \"CumulativeCount_MedicalConditionIncident_COVID_19_ConfirmedOrProbableCase\",\n", |
265 | 238 | " \"Count_Person\",\n", |
266 | 239 | " \"Count_Person_MarriedAndNotSeparated\",\n", |
267 | 240 | " \"Median_Income_Person\",\n", |
268 | 241 | " \"Count_Household_With4OrMorePerson\"\n", |
269 | | - " \n", |
| 242 | + "\n", |
270 | 243 | "]\n", |
271 | 244 | "\n", |
272 | 245 | "raw_df = dcp.build_multivariate_dataframe(county_dcids, stat_vars_to_query)\n", |
|
934 | 907 | " \"Count_Person_MarriedAndNotSeparated\",\n", |
935 | 908 | " \"Median_Income_Person\",\n", |
936 | 909 | " \"Count_Household_With4OrMorePerson\"\n", |
937 | | - " \n", |
| 910 | + "\n", |
938 | 911 | "]\n", |
939 | 912 | "\n", |
940 | 913 | "# Get data from Data Commons\n", |
|
2499 | 2472 | "\n", |
2500 | 2473 | "\n", |
2501 | 2474 | "**1.3B)** How would you approach handling any NaN or empty values in a dataframe? Should we remove that row? Remove the feature? Or should we replace NaNs with a particular value (and if so, how do you decide what value that should be)?\n", |
2502 | | - " \n", |
| 2475 | + "\n", |
2503 | 2476 | "\n", |
2504 | 2477 | "**1.3C)** Take a look at the dataframe outputted by the code box above from section 1.2. Are there any values that need to be cleaned? If so, write code to implement the answers to the above questions using the code box below.\n", |
2505 | 2478 | "\n", |
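As a starting point for 1.3B, the three options — drop rows, drop features, or impute — can each be sketched in one line of pandas. The dataframe below is hypothetical (the column names echo the statistical variables queried above, but the values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe with missing values (not the actual Data Commons query result)
df = pd.DataFrame({
    "Count_Person": [100.0, np.nan, 300.0, 400.0],
    "Median_Income_Person": [50000.0, 60000.0, np.nan, 55000.0],
})

dropped_rows = df.dropna()            # option 1: drop any row containing a NaN
dropped_cols = df.dropna(axis=1)      # option 2: drop any feature containing a NaN
imputed = df.fillna(df.median())      # option 3: impute with a summary statistic

print(dropped_rows.shape, dropped_cols.shape, imputed.isna().sum().sum())
```

Which option is right depends on how much data each one throws away and whether the imputed value (median, mean, zero) is plausible for the feature.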
|
2540 | 2513 | "\n", |
2541 | 2514 | "Sometimes transforming the data can reveal interesting combinations, or better scale our data. Here are some things to look out for:\n", |
2542 | 2515 | "\n", |
2543 | | - "* If your data has a skewed distribution or large changes in magnitude, it may be helpful to take the $log()$ of your data to bring it closer to normal. \n", |
| 2516 | + "* If your data has a skewed distribution or large changes in magnitude, it may be helpful to take the $\\log$ of your data to bring it closer to normal.\n",
2544 | 2517 | "* Other times it may be helpful to bin close values together (for example, creating groupings by ages 0-10, 11-20, 21-30, etc.)\n",
2545 | 2518 | "* When working with population or demographic data, it's often also prudent to consider whether the features you are using should be scaled by population.\n", |
2546 | 2519 | "\n", |
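The three transformations above can be sketched with numpy and pandas. The county-level values below are invented for illustration, not pulled from Data Commons:

```python
import numpy as np
import pandas as pd

# Hypothetical county-level values (illustrative only)
df = pd.DataFrame({
    "Count_Person": [15000, 800000, 2500000, 42000],
    "CumulativeCases": [300, 90000, 310000, 1200],
    "MedianAge": [8, 15, 27, 64],
})

# Log transform to tame large changes in magnitude
df["log_population"] = np.log(df["Count_Person"])

# Bin close values together (ages 0-10, 11-20, 21-30, ...)
df["age_bin"] = pd.cut(df["MedianAge"], bins=[0, 10, 20, 30, 40, 50, 60, 70])

# Scale a raw count by population to get a per-capita rate
df["cases_per_capita"] = df["CumulativeCases"] / df["Count_Person"]

print(df[["log_population", "age_bin", "cases_per_capita"]])
```

Note that `pd.cut` returns interval labels like `(0, 10]`; you can pass `labels=` if you prefer readable category names.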
|
3281 | 3254 | }, |
3282 | 3255 | "source": [ |
3283 | 3256 | "### 2.2) Feature Representations\n", |
3284 | | - "If any of your data is discrete, getting a good encoding of discrete features is particularly important. You want to create “opportunities” for your model to find the underlying regularities. \n", |
| 3257 | + "If any of your data is discrete, getting a good encoding of discrete features is particularly important. You want to create “opportunities” for your model to find the underlying regularities.\n", |
3285 | 3258 | "\n", |
3286 | 3259 | "**2.2A) For each of the following encodings, name an example of data the encoding would work well on, as well as an example of data it would not work as well for. Explain your answers.**\n", |
3287 | 3260 | "\n", |
3288 | | - "* *Numeric* Assign each of these values a number, say 1.0/k, 2.0/k, . . . , 1.0. \n", |
| 3261 | + "* *Numeric* Assign each of the $k$ discrete values a number, say $1.0/k, 2.0/k, \\ldots, 1.0$.\n",
3289 | 3262 | "\n", |
3290 | 3263 | "* *Thermometer code* Use a vector of $k$ binary variables, where we convert discrete input value $1 \\le j \\le k$ into a vector in which the first $j$ values are 1.0 and the rest are 0.0.\n",
3291 | 3264 | "\n", |
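Both encodings are a few lines of numpy. This sketch assumes the $k$ discrete values are labeled $1, \ldots, k$ (the function names are our own, not from any library):

```python
import numpy as np

def numeric_encoding(j, k):
    # Map discrete value j in {1, ..., k} to the single number j/k
    return j / k

def thermometer_encoding(j, k):
    # Vector of k binary variables: first j entries are 1.0, the rest 0.0
    vec = np.zeros(k)
    vec[:j] = 1.0
    return vec

print(numeric_encoding(3, 5))
print(thermometer_encoding(3, 5))
```

The numeric encoding imposes an ordering and fixed spacing on the values; the thermometer code keeps the ordering but lets the model weight each threshold separately.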
|
3317 | 3290 | "source": [ |
3318 | 3291 | "### 2.3) Standardization\n", |
3319 | 3292 | "It is typically useful to scale numeric data, so that it tends to be in the range [−1, +1]. Without performing this transformation, if you have\n", |
3320 | | - "one feature with much larger values than another, it will take the learning algorithm a lot of work to find parameters that can put them on an equal basis. \n", |
| 3293 | + "one feature with much larger values than another, it will take the learning algorithm a lot of work to find parameters that can put them on an equal basis.\n", |
3321 | 3294 | "\n", |
3322 | 3295 | "Typically, we use the transformation\n", |
3323 | 3296 | "$$ \\phi(x) = \\frac{x - \\bar{x}}{\\sigma} $$\n",
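This transformation is a one-liner in numpy. The income-like feature values below are made up for illustration:

```python
import numpy as np

# Hypothetical feature column with values far outside [-1, +1]
x = np.array([50000.0, 60000.0, 55000.0, 52000.0, 58000.0])

# Standardize: subtract the mean and divide by the standard deviation,
# so the transformed feature is centered at 0 with unit variance
phi = (x - x.mean()) / x.std()

print(phi.mean(), phi.std())
```

After the transformation the mean is (numerically) 0 and the standard deviation is 1, so features measured on very different scales start on an equal basis. scikit-learn's `StandardScaler` does the same thing while remembering the training-set mean and scale for reuse at prediction time.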
|