For this homework, you will need the following datasets:
- Green Taxi dataset (2019 and 2020)
- Yellow Taxi dataset (2019 and 2020)
- For Hire Vehicle dataset (2019)
- Make sure you have them, at the very least, in GCS with an External Table OR a Native Table - use whichever method you prefer to accomplish that (Workflow Orchestration with pandas-gbq, dlt for GCS, dlt for BigQuery, gsutil, etc.); see the sketch after this list for one option
- You should have exactly 7,778,101 records in your Green Taxi table
- You should have exactly 109,047,518 records in your Yellow Taxi table
- You should have exactly 43,244,696 records in your FHV table
- Build the staging models for green/yellow as shown here
- Build the dimension/fact for taxi_trips joining with `dim_zones`, as shown here
Note: If you don't have access to GCP, you can spin up a local Postgres instance and ingest the datasets above
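If you choose the GCS + External Table route, a minimal sketch in BigQuery SQL might look like the following (the bucket name and file layout are hypothetical; adjust them to wherever you uploaded the parquet files):

```sql
-- Sketch only: an External Table over parquet files already sitting in GCS
-- (bucket and paths below are placeholders, not part of the homework)
CREATE OR REPLACE EXTERNAL TABLE `myproject.raw_nyc_tripdata.ext_green_taxi`
OPTIONS (
  format = 'PARQUET',
  uris = [
    'gs://your-bucket/green/green_tripdata_2019-*.parquet',
    'gs://your-bucket/green/green_tripdata_2020-*.parquet'
  ]
);
```

The same pattern applies to the yellow and FHV files; a Native Table can then be derived from it with a `CREATE TABLE ... AS SELECT` if you prefer.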
Provided you've got the following `sources.yaml`:

```yaml
version: 2

sources:
  - name: raw_nyc_tripdata
    database: "{{ env_var('DBT_BIGQUERY_PROJECT', 'dtc_zoomcamp_2025') }}"
    schema:   "{{ env_var('DBT_BIGQUERY_SOURCE_DATASET', 'raw_nyc_tripdata') }}"
    tables:
      - name: ext_green_taxi
      - name: ext_yellow_taxi
```
with the following env variables set up where dbt runs:

```bash
export DBT_BIGQUERY_PROJECT=myproject
export DBT_BIGQUERY_DATASET=my_nyc_tripdata
```
What does this .sql model compile to?

```sql
select *
from {{ source('raw_nyc_tripdata', 'ext_green_taxi') }}
```

- `select * from dtc_zoomcamp_2025.raw_nyc_tripdata.ext_green_taxi`
- `select * from dtc_zoomcamp_2025.my_nyc_tripdata.ext_green_taxi`
- `select * from myproject.raw_nyc_tripdata.ext_green_taxi`
- `select * from myproject.my_nyc_tripdata.ext_green_taxi`
- `select * from dtc_zoomcamp_2025.raw_nyc_tripdata.green_taxi`
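If you want to sanity-check your reasoning locally (assuming you have a working dbt project wired to this `sources.yaml`), `dbt compile` renders the Jinja and writes the resulting SQL under `target/compiled/`, which you can inspect directly:

```bash
# Reproduce the environment from the question, then compile
export DBT_BIGQUERY_PROJECT=myproject
export DBT_BIGQUERY_DATASET=my_nyc_tripdata
dbt compile --select my_model   # 'my_model' is a placeholder for the model above
```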
Say you have to modify the following dbt model (`fct_recent_taxi_trips.sql`) to enable Analytics Engineers to dynamically control the date range:
- In development, you want to process only the last 7 days of trips
- In production, you need to process the last 30 days for analytics
```sql
select *
from {{ ref('fact_taxi_trips') }}
where pickup_datetime >= CURRENT_DATE - INTERVAL '30' DAY
```
What would you change to accomplish that, in such a way that command-line arguments take precedence over ENV_VARs, which take precedence over the DEFAULT value?
- Add `ORDER BY pickup_datetime DESC` and `LIMIT {{ var("days_back", 30) }}`
- Update the WHERE clause to `pickup_datetime >= CURRENT_DATE - INTERVAL '{{ var("days_back", 30) }}' DAY`
- Update the WHERE clause to `pickup_datetime >= CURRENT_DATE - INTERVAL '{{ env_var("DAYS_BACK", "30") }}' DAY`
- Update the WHERE clause to `pickup_datetime >= CURRENT_DATE - INTERVAL '{{ var("days_back", env_var("DAYS_BACK", "30")) }}' DAY`
- Update the WHERE clause to `pickup_datetime >= CURRENT_DATE - INTERVAL '{{ env_var("DAYS_BACK", var("days_back", "30")) }}' DAY`
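As a reminder of how the two mechanisms are fed at runtime (the values below are illustrative only): environment variables come from the shell that launches dbt, while dbt variables are passed on the command line with `--vars`:

```bash
export DAYS_BACK=30                                                  # read by env_var()
dbt run --select fct_recent_taxi_trips --vars '{"days_back": 7}'     # read by var()
```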
Considering the data lineage below, and that `taxi_zone_lookup` is the only materialization built (from a .csv seed file):

Select the option that does NOT apply for materializing `fct_taxi_monthly_zone_revenue`:
- `dbt run`
- `dbt run --select +models/core/dim_taxi_trips.sql+ --target prod`
- `dbt run --select +models/core/fct_taxi_monthly_zone_revenue.sql`
- `dbt run --select +models/core/`
- `dbt run --select models/staging/+`
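As a refresher on dbt's graph operators (illustrative only, not the answer): a leading `+` selects a model plus all of its upstream parents, while a trailing `+` selects it plus all of its downstream children:

```bash
dbt run --select +my_model    # my_model and everything upstream of it
dbt run --select my_model+    # my_model and everything downstream of it
dbt run --select +my_model+   # both directions
```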
Consider you're dealing with sensitive data (e.g. PII) that is only available to your team and a select few individuals, in the raw layer of your DWH (e.g. a specific BigQuery dataset or PostgreSQL schema):

- Among other things, you decide to obfuscate/mask that data through your staging models, and make it available in a different schema (a staging layer) for other Data/Analytics Engineers to explore
- And optionally, yet another layer (a service layer), where you'll build your dimension (`dim_`) and fact (`fct_`) tables (assuming Star Schema dimensional modeling) for dashboarding and for Tech Product Owners/Managers
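As an illustration of the kind of masking a staging model could do (the PII column below is hypothetical and not part of the homework datasets), hashing with `SHA256` is one simple option in BigQuery:

```sql
-- Sketch of PII obfuscation in a staging model (hypothetical column names)
select
    to_hex(sha256(cast(passenger_phone_number as string))) as passenger_phone_hash,
    pickup_datetime,
    dropoff_datetime,
    total_amount
from {{ source('raw_nyc_tripdata', 'ext_green_taxi') }}
```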
You decide to make a macro to wrap the schema-resolution logic:
```jinja
{% macro resolve_schema_for(model_type) -%}

    {%- set target_env_var = 'DBT_BIGQUERY_TARGET_DATASET'  -%}
    {%- set stging_env_var = 'DBT_BIGQUERY_STAGING_DATASET' -%}

    {%- if model_type == 'core' -%} {{- env_var(target_env_var) -}}
    {%- else -%}                    {{- env_var(stging_env_var, env_var(target_env_var)) -}}
    {%- endif -%}

{%- endmacro %}
```
And use it in your staging, `dim_` and `fact_` models as:

```jinja
{{ config(
    schema=resolve_schema_for('core'),
) }}
```
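For the macro to resolve anything, the env vars it reads need to be set in the environment where dbt runs, e.g. (dataset names are illustrative):

```bash
export DBT_BIGQUERY_TARGET_DATASET=prod
export DBT_BIGQUERY_STAGING_DATASET=staging
```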
That all being said, regarding the macro above, select all statements that are true for the models using it:

- Setting a value for `DBT_BIGQUERY_TARGET_DATASET` env var is mandatory, or it'll fail to compile
- Setting a value for `DBT_BIGQUERY_STAGING_DATASET` env var is mandatory, or it'll fail to compile
- When using `core`, it materializes in the dataset defined in `DBT_BIGQUERY_TARGET_DATASET`
- When using `stg`, it materializes in the dataset defined in `DBT_BIGQUERY_STAGING_DATASET`, or defaults to `DBT_BIGQUERY_TARGET_DATASET`
- When using `staging`, it materializes in the dataset defined in `DBT_BIGQUERY_STAGING_DATASET`, or defaults to `DBT_BIGQUERY_TARGET_DATASET`
Alright, in module 1, you had a SQL refresher, so now let's build on top of that with some serious SQL.
These are not meant to be easy - but they'll boost your SQL and Analytics skills to the next level.
So, without any further ado, let's get started...
You might want to add some new dimensions: `year` (e.g. 2019, 2020), `quarter` (1, 2, 3, 4), `year_quarter` (e.g. `2019/Q1`, `2019-Q2`), and `month` (e.g. 1, 2, ..., 12), extracted from `pickup_datetime`, to your `fct_taxi_trips` OR `dim_taxi_trips.sql` models to facilitate filtering your queries.
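A minimal sketch of those derived columns in BigQuery SQL, to drop into the model's select list (the exact `year_quarter` formatting is up to you):

```sql
extract(year    from pickup_datetime) as year,
extract(quarter from pickup_datetime) as quarter,
extract(month   from pickup_datetime) as month,
concat(cast(extract(year    from pickup_datetime) as string), '/Q',
       cast(extract(quarter from pickup_datetime) as string)) as year_quarter,
```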
- Create a new model `fct_taxi_trips_quarterly_revenue.sql`
- Compute the Quarterly Revenues for each year based on `total_amount`
- Compute the Quarterly YoY (Year-over-Year) revenue growth
  - e.g.: In 2020/Q1, Green Taxi had -12.34% revenue growth compared to 2019/Q1
  - e.g.: In 2020/Q4, Yellow Taxi had +34.56% revenue growth compared to 2019/Q4
Important Note: The Year-over-Year (YoY) growth percentages provided in the examples are purely illustrative. You will not be able to reproduce these exact values using the datasets provided for this homework.
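One way to express the YoY comparison described above, assuming revenue has already been aggregated to one row per `service_type`/`year`/`quarter` (a sketch, not the only valid approach):

```sql
select
    service_type,
    year,
    quarter,
    revenue,
    round(100.0 * (revenue / lag(revenue) over (
        partition by service_type, quarter
        order by year
    ) - 1), 2) as yoy_growth_pct
from quarterly_revenue  -- hypothetical CTE/model holding the aggregated quarterly revenue
```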
Considering the YoY Growth in 2020, which were the yearly quarters with the best (or least bad) and worst results for green and yellow?
- green: {best: 2020/Q2, worst: 2020/Q1}, yellow: {best: 2020/Q2, worst: 2020/Q1}
- green: {best: 2020/Q2, worst: 2020/Q1}, yellow: {best: 2020/Q3, worst: 2020/Q4}
- green: {best: 2020/Q1, worst: 2020/Q2}, yellow: {best: 2020/Q2, worst: 2020/Q1}
- green: {best: 2020/Q1, worst: 2020/Q2}, yellow: {best: 2020/Q1, worst: 2020/Q2}
- green: {best: 2020/Q1, worst: 2020/Q2}, yellow: {best: 2020/Q3, worst: 2020/Q4}
- Create a new model `fct_taxi_trips_monthly_fare_p95.sql`
- Filter out invalid entries (`fare_amount > 0`, `trip_distance > 0`, and `payment_type_description in ('Cash', 'Credit card')`)
- Compute the continuous percentile of `fare_amount` partitioning by service_type, year and month (sketched below)
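In BigQuery, `percentile_cont` is an analytic (window) function, so one way to get a single row per group is to pair it with `select distinct` over the partition keys (a sketch under that assumption):

```sql
select distinct
    service_type,
    year,
    month,
    percentile_cont(fare_amount, 0.97) over (partition by service_type, year, month) as p97,
    percentile_cont(fare_amount, 0.95) over (partition by service_type, year, month) as p95,
    percentile_cont(fare_amount, 0.90) over (partition by service_type, year, month) as p90
from valid_trips  -- hypothetical CTE with the filters above already applied
```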
Now, what are the values of `p97`, `p95`, and `p90` for Green Taxi and Yellow Taxi, in April 2020?
- green: {p97: 55.0, p95: 45.0, p90: 26.5}, yellow: {p97: 52.0, p95: 37.0, p90: 25.5}
- green: {p97: 55.0, p95: 45.0, p90: 26.5}, yellow: {p97: 31.5, p95: 25.5, p90: 19.0}
- green: {p97: 40.0, p95: 33.0, p90: 24.5}, yellow: {p97: 52.0, p95: 37.0, p90: 25.5}
- green: {p97: 40.0, p95: 33.0, p90: 24.5}, yellow: {p97: 31.5, p95: 25.5, p90: 19.0}
- green: {p97: 55.0, p95: 45.0, p90: 26.5}, yellow: {p97: 52.0, p95: 25.5, p90: 19.0}
Prerequisites:
- Create a staging model for the FHV Data (2019), and DO NOT add a deduplication step; just filter the data with `where dispatching_base_num is not null` (see the sketch after this list)
- Create a core model for the FHV Data (`dim_fhv_trips.sql`) joining with `dim_zones`, similar to what has been done here
- Add some new dimensions `year` (e.g. 2019) and `month` (e.g. 1, 2, ..., 12), based on `pickup_datetime`, to the core model to facilitate filtering for your queries
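A minimal sketch of that staging model (the source and column names are assumptions based on the raw FHV 2019 files; adjust them to your own source definition):

```sql
select
    dispatching_base_num,
    cast(pickup_datetime  as timestamp) as pickup_datetime,
    cast(dropoff_datetime as timestamp) as dropoff_datetime,
    cast(pulocationid     as integer)   as pickup_locationid,
    cast(dolocationid     as integer)   as dropoff_locationid
from {{ source('raw_nyc_tripdata', 'ext_fhv_taxi') }}  -- hypothetical source table name
where dispatching_base_num is not null
```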
Now...
- Create a new model `fct_fhv_monthly_zone_traveltime_p90.sql`
- For each record in `dim_fhv_trips.sql`, compute the timestamp_diff in seconds between dropoff_datetime and pickup_datetime - we'll call it `trip_duration` for this exercise
- Compute the continuous `p90` of `trip_duration` partitioning by year, month, pickup_location_id, and dropoff_location_id (sketched below)
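A sketch of those two steps combined, assuming the listed columns exist under these names in `dim_fhv_trips`:

```sql
with durations as (
    select
        year,
        month,
        pickup_locationid,
        dropoff_locationid,
        pickup_zone,
        dropoff_zone,
        timestamp_diff(dropoff_datetime, pickup_datetime, second) as trip_duration
    from {{ ref('dim_fhv_trips') }}
)
select distinct
    year,
    month,
    pickup_locationid,
    dropoff_locationid,
    pickup_zone,
    dropoff_zone,
    percentile_cont(trip_duration, 0.90) over (
        partition by year, month, pickup_locationid, dropoff_locationid
    ) as p90
from durations
```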
For the Trips that respectively started from `Newark Airport`, `SoHo`, and `Yorkville East`, in November 2019, what are the dropoff zones with the 2nd longest p90 `trip_duration`?
- LaGuardia Airport, Chinatown, Garment District
- LaGuardia Airport, Park Slope, Clinton East
- LaGuardia Airport, Saint Albans, Howard Beach
- LaGuardia Airport, Rosedale, Bath Beach
- LaGuardia Airport, Yorkville East, Greenpoint
- Form for submitting: https://courses.datatalks.club/de-zoomcamp-2025/homework/hw4
- Solution: to be published after the deadline