@@ -61,7 +61,7 @@ Create an empty notebook and connect it to a runtime.
%pip install --upgrade bigframes --quiet
```

- Click the **🞂** button or press *Shift + Enter* to run the code cell.
+ Click the **Run cell** button or press *Shift + Enter* to run the code cell.


## Read a public dataset
@@ -96,7 +96,7 @@ Data are collected by the Alcoholic Beverages Division within the Iowa Departmen


In BigQuery, query the
- [bigquery-public-data.iowa_liquor_sales.sales](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=iowa_liquor_sales&t=sales&page=table).
+ [bigquery-public-data.iowa_liquor_sales.sales](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=iowa_liquor_sales&t=sales&page=table)
to analyze the Iowa liquor retail sales. Use the `bigframes.pandas.read_gbq()`
method to create a DataFrame from a query string or table ID.

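For example, both call styles look like this (a minimal sketch; `date`, `county`, `category_name`, and `sale_dollars` are columns in this public table):

```
import bigframes.pandas as bpd

# Read the whole public table by its table ID.
df = bpd.read_gbq("bigquery-public-data.iowa_liquor_sales.sales")

# Or pass a SQL query string to read only what you need.
df_subset = bpd.read_gbq(
    """
    SELECT date, county, category_name, sale_dollars
    FROM `bigquery-public-data.iowa_liquor_sales.sales`
    """
)
```
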
@@ -323,7 +323,7 @@ Check how good the fit is by using the `score` method.
model.score(feature_columns, label_columns).to_pandas()
```

- **Expected output:**
+ **Sample output:**

```
mean_absolute_error  mean_squared_error  mean_squared_log_error  median_absolute_error  r2_score  explained_variance
@@ -390,7 +390,12 @@ it reflects the average zip code, which won't necessarily be the same as (1)
because different zip codes have different populations.

```
- df = bpd.read_gbq_table("bigquery-public-data.iowa_liquor_sales.sales")
+ df = (
+     bpd.read_gbq_table("bigquery-public-data.iowa_liquor_sales.sales")
+     .assign(
+         zip_code=lambda _: _["zip_code"].str.replace(".0", "")
+     )
+ )
census_state = bpd.read_gbq(
    "bigquery-public-data.census_bureau_acs.state_2020_5yr",
    index_col="geo_id",
@@ -410,7 +415,7 @@ average_per_zip = volume_per_pop["liters_per_pop"].mean()
average_per_zip
```

- **Expected output:** `37.468`
+ **Expected output:** `67.139`

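A toy example (made-up numbers, plain pandas) shows why the unweighted per-zip average differs from a population-weighted figure:

```
import pandas as pd

# Two hypothetical zip codes with very different populations.
toy = pd.DataFrame({
    "zip_code": ["50001", "50313"],
    "liters": [1_000, 90_000],
    "population": [500, 60_000],
})
toy["liters_per_pop"] = toy["liters"] / toy["population"]

# Unweighted mean of per-zip rates: each zip code counts equally.
print(toy["liters_per_pop"].mean())  # 1.75

# Population-weighted rate: total liters over total population.
print(toy["liters"].sum() / toy["population"].sum())  # ~1.50
```
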
Plot these averages, similar to above.

@@ -517,11 +522,6 @@ Instead, use a more traditional natural language processing package, NLTK, to
process these data. Technology called a "stemmer" can merge plural and singular
nouns into the same value, for example.

- ### Create a new notebook
-
- Click the arrow in BigQuery Studio's tabbed editor and select
- **Create new Python notebook**.
-

### Using NLTK to stem words

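For example, a quick sketch with NLTK's WordNet lemmatizer (exact outputs depend on the WordNet data you download):

```
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # one-time download of the WordNet data

wnl = WordNetLemmatizer()

# Plural and singular nouns collapse to the same base form.
print(wnl.lemmatize("vodkas", pos="n"))    # 'vodka'
print(wnl.lemmatize("whiskies", pos="n"))  # 'whisky'
```
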
@@ -715,19 +715,11 @@ Now, deploy your function to the dataset you just created. Add a
steps.

```
- import bigframes.pandas as bpd
-
-
- bpd.options.bigquery.ordering_mode = "partial"
- bpd.options.display.repr_mode = "deferred"
-
-
@bpd.remote_function(
    dataset=f"{project_id}.functions",
    name="lemmatize",
    # TODO: Replace this with your version of nltk.
    packages=["nltk==3.9.1"],
-     # Replace this with your service account email.
    cloud_function_service_account=f"bigframes-no-permissions@{project_id}.iam.gserviceaccount.com",
    cloud_function_ingress_settings="internal-only",
)
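# Note: the nltk version pinned in packages= above should match the version
# in your local environment; check it with `import nltk; nltk.__version__`.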
@@ -756,21 +748,27 @@ Deployment should take about two minutes.

### Using the remote functions

- Once the deployment completes, you can switch back to your original notebook
- to test this function.
+ Once the deployment completes, you can test this function.

```
- import bigframes.pandas as bpd
-
- bpd.options.bigquery.ordering_mode = "partial"
- bpd.options.display.repr_mode = "deferred"
-
lemmatize = bpd.read_gbq_function(f"{project_id}.functions.lemmatize")

words = bpd.Series(["whiskies", "whisky", "whiskey", "vodkas", "vodka"])
words.apply(lemmatize).to_pandas()
```

+ **Expected output:**
+
+ ```
+ 0    whiskey
+ 1    whiskey
+ 2    whiskey
+ 3      vodka
+ 4      vodka
+
+ dtype: string
+ ```
+
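As an aside, the deployed routine is a regular BigQuery remote function, so you can also invoke it from SQL (a sketch, assuming the same project and dataset names as above):

```
bpd.read_gbq(
    f"SELECT `{project_id}.functions.lemmatize`('whiskies') AS lemma"
).to_pandas()
```
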

## Comparing alcohol consumption by county
Now that the `lemmatize` function is available, use it to combine categories.
@@ -793,6 +791,24 @@ categories = (
categories.to_pandas()
```

+ **Expected output:**
+
+ ```
+      category_name          total_orders
+ 0    100 PROOF VODKA               99124
+ 1    100% AGAVE TEQUILA           724374
+ 2    AGED DARK RUM                 59433
+ 3    AMARETTO - IMPORTED             102
+ 4    AMERICAN ALCOHOL              24351
+ ...  ...                             ...
+ 98   WATERMELON SCHNAPPS           17844
+ 99   WHISKEY LIQUEUR             1442732
+ 100  WHITE CREME DE CACAO           7213
+ 101  WHITE CREME DE MENTHE          2459
+ 102  WHITE RUM                    436553
+ 103 rows × 2 columns
+ ```
+
Next, create a DataFrame of all words in the categories, except for a few
filler words like punctuation and "item".

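The cell that builds this words table is unchanged here; as a plain-pandas sketch of one way to do it (illustrative only, not the codelab's exact cell):

```
import pandas as pd

# Split each category name into words, one row per (category, word) pair,
# and attach the per-category word count.
cats = pd.DataFrame({
    "category_name": ["100 PROOF VODKA", "WHITE RUM"],
    "total_orders": [99124, 436553],
})

words = (
    cats.assign(word=lambda _: _["category_name"].str.lower().str.split())
    .explode("word")
    .assign(
        num_words=lambda _: _.groupby("category_name")["word"].transform("count")
    )
)
print(words)
```
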
@@ -815,6 +831,19 @@ words = words[
words.to_pandas()
```

+ **Expected output:**
+
+ ```
+      category_name    total_orders   word  num_words
+ 0    100 PROOF VODKA         99124    100          3
+ 1    100 PROOF VODKA         99124  proof          3
+ 2    100 PROOF VODKA         99124  vodka          3
+ ...              ...           ...    ...        ...
+ 252        WHITE RUM        436553  white          2
+ 253        WHITE RUM        436553    rum          2
+ 254 rows × 4 columns
+ ```
+
Note that by lemmatizing after grouping, you are reducing the load on your Cloud
Function. It is possible to apply the lemmatize function on each of the several
million rows in the database, but it would cost more than applying it after
@@ -825,7 +854,20 @@ lemmas = words.assign(lemma=lambda _: _["word"].apply(lemmatize))
lemmas.to_pandas()
```

- Now that the words have been lemmatize, you need to select the lemma that best
+ **Expected output:**
+
+ ```
+      category_name    total_orders   word  num_words  lemma
+ 0    100 PROOF VODKA         99124    100          3    100
+ 1    100 PROOF VODKA         99124  proof          3  proof
+ 2    100 PROOF VODKA         99124  vodka          3  vodka
+ ...              ...           ...    ...        ...    ...
+ 252        WHITE RUM        436553  white          2  white
+ 253        WHITE RUM        436553    rum          2    rum
+ 254 rows × 5 columns
+ ```
+
+ Now that the words have been lemmatized, you need to select the lemma that best
summarizes the category. Since there aren't many function words in the categories,
use the heuristic that if a word appears in multiple other categories, it's
likely better as a summarizing word (e.g. whiskey).
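The next cells implement this heuristic. As a plain-pandas toy of the selection rule (made-up numbers), a lemma that appears in several categories accumulates more orders than one unique to a single category:

```
import pandas as pd

# One row per (category, lemma) pair, carrying the category's total orders.
toy = pd.DataFrame({
    "category_name": ["WHISKEY LIQUEUR", "WHISKEY LIQUEUR",
                      "STRAIGHT WHISKEY", "STRAIGHT WHISKEY"],
    "lemma": ["whiskey", "liqueur", "straight", "whiskey"],
    "total_orders": [100, 100, 250, 250],
})

# Total orders across every category in which a lemma appears.
toy["total_orders_with_lemma"] = (
    toy.groupby("lemma")["total_orders"].transform("sum")
)

# For each category, keep the lemma with the largest cross-category total.
best = toy.loc[toy.groupby("category_name")["total_orders_with_lemma"].idxmax()]
print(best)  # both categories end up summarized by 'whiskey'
```
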
@@ -858,6 +900,20 @@ categories_mapping = categories_with_max[
categories_mapping.to_pandas()
```

+ **Expected output:**
+
+ ```
+      category_name          total_orders  word     num_words  lemma    total_orders_with_lemma  max_lemma_count
+ 0    100 PROOF VODKA               99124  vodka            3  vodka                    7575769          7575769
+ 1    100% AGAVE TEQUILA           724374  tequila          3  tequila                  1601092          1601092
+ 2    AGED DARK RUM                 59433  rum              3  rum                      3226633          3226633
+ ...  ...                             ...  ...            ...  ...                          ...              ...
+ 100  WHITE CREME DE CACAO           7213  white            4  white                     446225           446225
+ 101  WHITE CREME DE MENTHE          2459  white            4  white                     446225           446225
+ 102  WHITE RUM                    436553  rum              2  rum                      3226633          3226633
+ 103 rows × 7 columns
+ ```
+
Now that there is a single lemma summarizing each category, merge this to the
original DataFrame.

@@ -867,6 +923,19 @@ df_with_lemma = df.merge(
    on="category_name",
    how="left"
)
+ df_with_lemma[df_with_lemma['category_name'].notnull()].peek()
+ ```
+
+ **Expected output:**
+
+ ```
+   invoice_and_item_number  ...  lemma  total_orders_with_lemma  max_lemma_count
+ 0            S30989000030  ...  vodka                  7575769          7575769
+ 1            S30538800106  ...  vodka                  7575769          7575769
+ 2            S30601200013  ...  vodka                  7575769          7575769
+ 3            S30527200047  ...  vodka                  7575769          7575769
+ 4            S30833600058  ...  vodka                  7575769          7575769
+ 5 rows × 30 columns
```

### Comparing counties
@@ -900,12 +969,39 @@ county_max_lemma = county_lemma[
county_max_lemma.to_pandas()
```

+ **Expected output:**
+
+ ```
+                     volume_sold_liters  volume_sold_int64
+ county     lemma
+ SCOTT      vodka             6044393.1            6044393
+ APPANOOSE  whiskey           292490.44             292490
+ HAMILTON   whiskey           329118.92             329118
+ ...        ...                     ...                ...
+ WORTH      whiskey           100542.85             100542
+ MITCHELL   vodka             158791.94             158791
+ RINGGOLD   whiskey             65107.8              65107
+ 101 rows × 2 columns
+ ```
+
How different are the counties from each other?

```
county_max_lemma.groupby("lemma").size().to_pandas()
```

+ **Expected output:**
+
+ ```
+ lemma
+ american     1
+ liqueur      1
+ vodka       15
+ whiskey     83
+
+ dtype: Int64
+ ```
+
In most counties, whiskey is the most popular product by volume, with vodka most
popular in 15 counties. Compare this to the most popular liquor types statewide.

@@ -919,6 +1015,21 @@ total_liters = (
total_liters.to_pandas()
```

+ **Expected output:**
+
+ ```
+                volume_sold_liters
+ lemma
+ vodka             85356422.950001
+ whiskey           85112339.980001
+ rum                   33891011.72
+ american              19994259.64
+ imported              14985636.61
+ tequila               12357782.37
+ cocktails/rtd          7406769.87
+ ...
+ ```
+
Whiskey and vodka have nearly the same volume, with vodka a bit higher than
whiskey statewide.

@@ -957,9 +1068,39 @@ difference from the statewide proportion in each county.
# that drink _less_ of a particular liquor than expected.
largest_per_county = cohens_h.groupby("county").agg({"cohens_h_int": "max"})
counties = cohens_h[cohens_h['cohens_h_int'] == largest_per_county["cohens_h_int"]]
- counties.to_pandas()
+ counties.sort_values('cohens_h', ascending=False).to_pandas()
```

+ **Expected output:**
+
+ ```
+                             cohens_h  cohens_h_int
+ county       lemma
+ EL PASO      liqueur        1.289667       1289667
+ ADAMS        whiskey        0.373591        373590
+ IDA          whiskey        0.306481        306481
+ OSCEOLA      whiskey        0.295524        295523
+ PALO ALTO    whiskey        0.293697        293696
+ ...          ...                 ...           ...
+ MUSCATINE    rum            0.053757         53757
+ MARION       rum            0.053427         53427
+ MITCHELL     vodka          0.048212         48212
+ WEBSTER      rum            0.044896         44895
+ CERRO GORDO  cocktails/rtd  0.027496         27495
+ 100 rows × 2 columns
+ ```
+
+ The larger the Cohen's h value, the more likely it is that there is a
+ statistically significant difference in the amount of that type of alcohol
+ consumed compared to the state average. For the smaller positive values,
+ consumption still differs from the statewide average, but the difference may
+ be due to random variation.
+
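For reference, Cohen's h compares two proportions p1 and p2 as h = 2*arcsin(sqrt(p1)) - 2*arcsin(sqrt(p2)), with |h| around 0.2 conventionally read as a small effect, 0.5 medium, and 0.8 large. A minimal sketch with made-up proportions:

```
import numpy as np

# Made-up example: a county where whiskey is 40% of volume sold,
# versus a statewide whiskey share of 31%.
p_county, p_state = 0.40, 0.31

h = 2 * np.arcsin(np.sqrt(p_county)) - 2 * np.arcsin(np.sqrt(p_state))
print(h)  # ~0.19, a small effect by Cohen's conventions
```
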
+ An aside: EL PASO county doesn't appear to be a
+ [county in Iowa](https://en.wikipedia.org/wiki/List_of_counties_in_Iowa);
+ this may indicate another need for data cleanup before fully depending on these
+ results.
+
### Visualizing counties

Join with
@@ -983,6 +1124,20 @@ counties_plus = (
counties_plus
```

+ **Expected output:**
+
+ ```
+     county      lemma     cohens_h  cohens_h_int  geo_id  state_fips_code  ...
+ 0   ALLAMAKEE   american  0.087931         87930   19005               19  ...
+ 1   BLACK HAWK  american  0.106256        106256   19013               19  ...
+ 2   WINNESHIEK  american  0.093101         93101   19191               19  ...
+ ..         ...       ...       ...           ...     ...              ...  ...
+ 96  CLINTON     tequila   0.075708         75707   19045               19  ...
+ 97  POLK        tequila   0.087438         87438   19153               19  ...
+ 98  LEE         schnapps  0.064663         64663   19111               19  ...
+ 99 rows × 23 columns
+ ```
+
Use GeoPandas to visualize these differences on a map.

```
@@ -1013,10 +1168,10 @@ Alternatively, delete the Cloud Functions, service accounts, and datasets create

## Congratulations!

- You have analyzed structured and unstructured data using BigQuery DataFrames.
+ You have cleaned and analyzed structured data using BigQuery DataFrames.
Along the way you've explored Google Cloud's Public Datasets, Python notebooks
- in BigQuery Studio, BigQuery ML, Vertex AI, and natural language to Python
- features of BigQuery Studio. Fantastic job!
+ in BigQuery Studio, BigQuery ML, BigQuery Remote Functions, and the power of
+ BigQuery DataFrames. Fantastic job!


### Next steps