
Commit 6977d10

add more expected outputs
1 parent 5ed96f8 commit 6977d10

3 files changed: +329 −57 lines changed

Diff for: 2024/12-pydata-global/index.md

+185 −30
@@ -61,7 +61,7 @@ Create an empty notebook and connect it to a runtime.
 %pip install --upgrade bigframes --quiet
 ```
 
-Click the **🞂** button or press *Shift + Enter* to run the code cell.
+Click the **Run cell** button or press *Shift + Enter* to run the code cell.
 
 
 ## Read a public dataset
@@ -96,7 +96,7 @@ Data are collected by the Alcoholic Beverages Division within the Iowa Departmen
 
 
 In BigQuery, query the
-[bigquery-public-data.iowa_liquor_sales.sales](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=iowa_liquor_sales&t=sales&page=table).
+[bigquery-public-data.iowa_liquor_sales.sales](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=iowa_liquor_sales&t=sales&page=table)
 to analyze the Iowa liquor retail sales. Use the `bigframes.pandas.read_gbq()`
 method to create a DataFrame from a query string or table ID.
 
@@ -323,7 +323,7 @@ Check how good the fit is by using the `score` method.
 model.score(feature_columns, label_columns).to_pandas()
 ```
 
-**Expected output:**
+**Sample output:**
 
 ```
 mean_absolute_error mean_squared_error mean_squared_log_error median_absolute_error r2_score explained_variance
@@ -390,7 +390,12 @@ it reflects the average zip code, which won't necessarily be the same as (1)
 because different zip codes have different populations.
 
 ```
-df = bpd.read_gbq_table("bigquery-public-data.iowa_liquor_sales.sales")
+df = (
+    bpd.read_gbq_table("bigquery-public-data.iowa_liquor_sales.sales")
+    .assign(
+        zip_code=lambda _: _["zip_code"].str.replace(".0", "")
+    )
+)
 census_state = bpd.read_gbq(
     "bigquery-public-data.census_bureau_acs.state_2020_5yr",
     index_col="geo_id",
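An aside on the hunk above: the new `.assign` call strips a trailing `.0` that appears when zip codes have been parsed as floating-point numbers. A minimal illustration in plain pandas with made-up values (note `regex=False` makes the replacement literal; the bigframes call above presumably intends the same literal replacement):

```
import pandas as pd

# Hypothetical zip codes, two of them damaged by a float round-trip.
zips = pd.Series(["50014.0", "52240.0", "51503"])
print(zips.str.replace(".0", "", regex=False).tolist())
# ['50014', '52240', '51503']
```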
@@ -410,7 +415,7 @@ average_per_zip = volume_per_pop["liters_per_pop"].mean()
 average_per_zip
 ```
 
-**Expected output:** `37.468`
+**Expected output:** `67.139`
 
 Plot these averages, similar to above.
 
@@ -517,11 +522,6 @@ Instead, use a more traditional natural language processing package, NLTK, to
 process these data. Technology called a "stemmer" can merge plural and singular
 nouns into the same value, for example.
 
-### Create a new notebook
-
-Click the arrow in BigQuery Studio's tabbed editor and select
-**Create new Python notebook**.
-
 
 ### Using NLTK to stem words
 
@@ -715,19 +715,11 @@ Now, deploy your function to the dataset you just created. Add a
 steps.
 
 ```
-import bigframes.pandas as bpd
-
-
-bpd.options.bigquery.ordering_mode = "partial"
-bpd.options.display.repr_mode = "deferred"
-
-
 @bpd.remote_function(
     dataset=f"{project_id}.functions",
     name="lemmatize",
     # TODO: Replace this with your version of nltk.
     packages=["nltk==3.9.1"],
-    # Replace this with your service account email.
     cloud_function_service_account=f"bigframes-no-permissions@{project_id}.iam.gserviceaccount.com",
     cloud_function_ingress_settings="internal-only",
 )
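The body of the decorated function falls outside this hunk. For context, `bigframes.pandas.remote_function` decorates an ordinary scalar Python function. A hypothetical minimal body, assuming the notebook already defines `project_id` and imports `bigframes.pandas as bpd` (the codelab's real implementation is not shown in this diff):

```
@bpd.remote_function(
    dataset=f"{project_id}.functions",
    name="lemmatize",
    packages=["nltk==3.9.1"],
    cloud_function_service_account=f"bigframes-no-permissions@{project_id}.iam.gserviceaccount.com",
    cloud_function_ingress_settings="internal-only",
)
def lemmatize(word: str) -> str:
    # Hypothetical body: map a word to its WordNet lemma, fetching the
    # corpus on first use inside the deployed Cloud Function.
    import nltk
    from nltk.stem import WordNetLemmatizer

    nltk.download("wordnet", quiet=True)
    return WordNetLemmatizer().lemmatize(word)
```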
@@ -756,21 +748,27 @@ Deployment should take about two minutes.
 
 ### Using the remote functions
 
-Once the deployment completes, you can switch back to your original notebook
-to test this function.
+Once the deployment completes, you can test this function.
 
 ```
-import bigframes.pandas as bpd
-
-bpd.options.bigquery.ordering_mode = "partial"
-bpd.options.display.repr_mode = "deferred"
-
 lemmatize = bpd.read_gbq_function(f"{project_id}.functions.lemmatize")
 
 words = bpd.Series(["whiskies", "whisky", "whiskey", "vodkas", "vodka"])
 words.apply(lemmatize).to_pandas()
 ```
 
+**Expected output:**
+
+```
+0    whiskey
+1    whiskey
+2    whiskey
+3      vodka
+4      vodka
+
+dtype: string
+```
+
 ## Comparing alcohol consumption by county
 
 Now that the `lemmatize` function is available, use it to combine categories.
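An aside on the test above: you can approximate the deployed function locally with NLTK before paying for Cloud Function invocations. This is an illustration only; the local lemmas need not match the deployed function's output exactly, and spellings such as "whisky" vs. "whiskey" depend on the installed WordNet data:

```
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # one-time corpus download

lemmatizer = WordNetLemmatizer()
for word in ["whiskies", "whisky", "whiskey", "vodkas", "vodka"]:
    # WordNet merges plural nouns into singular lemmas, e.g. "vodkas" -> "vodka".
    print(word, "->", lemmatizer.lemmatize(word))
```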
@@ -793,6 +791,24 @@ categories = (
 categories.to_pandas()
 ```
 
+**Expected output:**
+
+```
+     category_name          total_orders
+0    100 PROOF VODKA        99124
+1    100% AGAVE TEQUILA     724374
+2    AGED DARK RUM          59433
+3    AMARETTO - IMPORTED    102
+4    AMERICAN ALCOHOL       24351
+...  ...                    ...
+98   WATERMELON SCHNAPPS    17844
+99   WHISKEY LIQUEUR        1442732
+100  WHITE CREME DE CACAO   7213
+101  WHITE CREME DE MENTHE  2459
+102  WHITE RUM              436553
+103 rows × 2 columns
+```
+
 Next, create a DataFrame of all words in the categories, except for a few
 filler words like punctuation and "item".
 

@@ -815,6 +831,19 @@ words = words[
815831
words.to_pandas()
816832
```
817833

834+
**Expected output:**
835+
836+
```
837+
category_name total_orders word num_words
838+
0 100 PROOF VODKA 99124 100 3
839+
1 100 PROOF VODKA 99124 proof 3
840+
2 100 PROOF VODKA 99124 vodka 3
841+
... ... ... ... ...
842+
252 WHITE RUM 436553 white 2
843+
253 WHITE RUM 436553 rum 2
844+
254 rows × 4 columns
845+
```
846+
818847
Note that by lemmatizing after grouping, you are reducing the load on your Cloud
819848
Function. It is possible to apply the lemmatize function on each of the several
820849
million rows in the database, but it would cost more than applying it after
@@ -825,7 +854,20 @@ lemmas = words.assign(lemma=lambda _: _["word"].apply(lemmatize))
 lemmas.to_pandas()
 ```
 
-Now that the words have been lemmatize, you need to select the lemma that best
+**Expected output:**
+
+```
+     category_name    total_orders  word   num_words  lemma
+0    100 PROOF VODKA  99124         100    3          100
+1    100 PROOF VODKA  99124         proof  3          proof
+2    100 PROOF VODKA  99124         vodka  3          vodka
+...  ...              ...           ...    ...        ...
+252  WHITE RUM        436553        white  2          white
+253  WHITE RUM        436553        rum    2          rum
+254 rows × 5 columns
+```
+
+Now that the words have been lemmatized, you need to select the lemma that best
 summarizes the category. Since there aren't many function words in the categories,
 use the heuristic that if a word appears in multiple other categories, it's
 likely better as a summarizing word (e.g. whiskey).
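A toy version of that heuristic in plain pandas, with made-up rows (the codelab's actual bigframes code follows in the next hunks): sum each lemma's orders across every category sharing it, then keep the lemma with the largest cross-category total per category.

```
import pandas as pd

# Hypothetical (category, lemma, orders) rows for illustration only.
lemmas = pd.DataFrame({
    "category_name": ["CORN WHISKEY", "CORN WHISKEY",
                      "WHISKEY LIQUEUR", "WHISKEY LIQUEUR",
                      "WHITE RUM", "WHITE RUM",
                      "AGED DARK RUM", "AGED DARK RUM"],
    "lemma": ["corn", "whiskey", "whiskey", "liqueur", "white", "rum", "dark", "rum"],
    "total_orders": [24351, 24351, 1442732, 1442732, 436553, 436553, 59433, 59433],
})
# Orders summed across every category sharing a lemma: a lemma common to
# several categories (e.g. "whiskey") accumulates a larger total.
lemmas["orders_with_lemma"] = lemmas.groupby("lemma")["total_orders"].transform("sum")
# Per category, keep the lemma with the largest cross-category total.
best = lemmas.loc[lemmas.groupby("category_name")["orders_with_lemma"].idxmax()]
print(best[["category_name", "lemma"]])
# e.g. WHISKEY LIQUEUR -> whiskey, WHITE RUM -> rum
```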
@@ -858,6 +900,20 @@ categories_mapping = categories_with_max[
 categories_mapping.to_pandas()
 ```
 
+**Expected output:**
+
+```
+     category_name          total_orders  word     num_words  lemma    total_orders_with_lemma  max_lemma_count
+0    100 PROOF VODKA        99124         vodka    3          vodka    7575769                  7575769
+1    100% AGAVE TEQUILA     724374        tequila  3          tequila  1601092                  1601092
+2    AGED DARK RUM          59433         rum      3          rum      3226633                  3226633
+...  ...                    ...           ...      ...        ...      ...                      ...
+100  WHITE CREME DE CACAO   7213          white    4          white    446225                   446225
+101  WHITE CREME DE MENTHE  2459          white    4          white    446225                   446225
+102  WHITE RUM              436553        rum      2          rum      3226633                  3226633
+103 rows × 7 columns
+```
+
 Now that there is a single lemma summarizing each category, merge this to the
 original DataFrame.
 
@@ -867,6 +923,19 @@ df_with_lemma = df.merge(
     on="category_name",
     how="left"
 )
+df_with_lemma[df_with_lemma['category_name'].notnull()].peek()
+```
+
+**Expected output:**
+
+```
+   invoice_and_item_number  ...  lemma  total_orders_with_lemma  max_lemma_count
+0  S30989000030             ...  vodka  7575769                  7575769
+1  S30538800106             ...  vodka  7575769                  7575769
+2  S30601200013             ...  vodka  7575769                  7575769
+3  S30527200047             ...  vodka  7575769                  7575769
+4  S30833600058             ...  vodka  7575769                  7575769
+5 rows × 30 columns
 ```
 
 ### Comparing counties
@@ -900,12 +969,39 @@ county_max_lemma = county_lemma[
 county_max_lemma.to_pandas()
 ```
 
+**Expected output:**
+
+```
+                    volume_sold_liters  volume_sold_int64
+county     lemma
+SCOTT      vodka    6044393.1           6044393
+APPANOOSE  whiskey  292490.44           292490
+HAMILTON   whiskey  329118.92           329118
+...        ...      ...                 ...
+WORTH      whiskey  100542.85           100542
+MITCHELL   vodka    158791.94           158791
+RINGGOLD   whiskey  65107.8             65107
+101 rows × 2 columns
+```
+
 How different are the counties from each other?
 
 ```
 county_max_lemma.groupby("lemma").size().to_pandas()
 ```
 
+**Expected output:**
+
+```
+lemma
+american     1
+liqueur      1
+vodka       15
+whiskey     83
+
+dtype: Int64
+```
+
 In most counties, whiskey is the most popular product by volume, with vodka most
 popular in 15 counties. Compare this to the most popular liquor types statewide.
 
@@ -919,6 +1015,21 @@ total_liters = (
 total_liters.to_pandas()
 ```
 
+**Expected output:**
+
+```
+               volume_sold_liters
+lemma
+vodka          85356422.950001
+whiskey        85112339.980001
+rum            33891011.72
+american       19994259.64
+imported       14985636.61
+tequila        12357782.37
+cocktails/rtd  7406769.87
+...
+```
+
 Whiskey and vodka have nearly the same volume, with vodka a bit higher than
 whiskey statewide.
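A quick check on "nearly the same," using the two totals from the sample output above:

```
# Statewide totals (liters) taken from the expected output above.
vodka, whiskey = 85_356_422.95, 85_112_339.98
print(f"{(vodka - whiskey) / whiskey:.2%}")  # 0.29% — vodka leads by well under 1%
```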

@@ -957,9 +1068,39 @@ difference from the statewide proportion in each county.
 # that drink _less_ of a particular liquor than expected.
 largest_per_county = cohens_h.groupby("county").agg({"cohens_h_int": "max"})
 counties = cohens_h[cohens_h['cohens_h_int'] == largest_per_county["cohens_h_int"]]
-counties.to_pandas()
+counties.sort_values('cohens_h', ascending=False).to_pandas()
 ```
 
+**Expected output:**
+
+```
+                            cohens_h  cohens_h_int
+county       lemma
+EL PASO      liqueur        1.289667  1289667
+ADAMS        whiskey        0.373591  373590
+IDA          whiskey        0.306481  306481
+OSCEOLA      whiskey        0.295524  295523
+PALO ALTO    whiskey        0.293697  293696
+...          ...            ...       ...
+MUSCATINE    rum            0.053757  53757
+MARION       rum            0.053427  53427
+MITCHELL     vodka          0.048212  48212
+WEBSTER      rum            0.044896  44895
+CERRO GORDO  cocktails/rtd  0.027496  27495
+100 rows × 2 columns
+```
+
+The larger the Cohen's h value, the more likely it is that there is a
+statistically significant difference in the amount of that type of alcohol
+consumed compared to the state averages. For the smaller positive values,
+consumption differs from the statewide average, but the difference may be
+due to random variation.
+
+An aside: EL PASO doesn't appear to be a
+[county in Iowa](https://en.wikipedia.org/wiki/List_of_counties_in_Iowa);
+this may indicate a need for more data cleanup before fully depending on
+these results.
+
1103+
9631104
### Visualizing counties
9641105

9651106
Join with
@@ -983,6 +1124,20 @@ counties_plus = (
 counties_plus
 ```
 
+**Expected output:**
+
+```
+    county      lemma     cohens_h  cohens_h_int  geo_id  state_fips_code  ...
+0   ALLAMAKEE   american  0.087931  87930         19005   19               ...
+1   BLACK HAWK  american  0.106256  106256        19013   19               ...
+2   WINNESHIEK  american  0.093101  93101         19191   19               ...
+..  ...         ...       ...       ...           ...     ...              ...
+96  CLINTON     tequila   0.075708  75707         19045   19               ...
+97  POLK        tequila   0.087438  87438         19153   19               ...
+98  LEE         schnapps  0.064663  64663         19111   19               ...
+99 rows × 23 columns
+```
+
 Use GeoPandas to visualize these differences on a map.
 
 ```
@@ -1013,10 +1168,10 @@ Alternatively, delete the Cloud Functions, service accounts, and datasets create
 
 ## Congratulations!
 
-You have analyzed structured and unstructured data using BigQuery DataFrames.
+You have cleaned and analyzed structured data using BigQuery DataFrames.
 Along the way you've explored Google Cloud's Public Datasets, Python notebooks
-in BigQuery Studio, BigQuery ML, Vertex AI, and natural language to Python
-features of BigQuery Studio. Fantastic job!
+in BigQuery Studio, BigQuery ML, BigQuery Remote Functions, and the power of
+BigQuery DataFrames. Fantastic job!
 
 
 ### Next steps

Diff for: bigquery-dataframes-iowa-liquor-sales/codelab.json

+1 −1
@@ -3,7 +3,7 @@
   "format": "html",
   "prefix": "https://storage.googleapis.com",
   "mainga": "UA-49880327-14",
-  "updated": "2024-12-04T11:11:22-06:00",
+  "updated": "2024-12-04T13:47:41-06:00",
   "id": "bigquery-dataframes-iowa-liquor-sales",
   "duration": 0,
   "title": "Exploratory data analysis of Iowa liquor sales using the BigQuery DataFrames package",
