In this homework we'll put what we learned about Spark into practice.
For this homework we will be using the Yellow taxi trip data for October 2024 (2024-10) from the official website:
wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-10.parquet
- Install Spark
- Run PySpark
- Create a local Spark session
- Execute spark.version.
What's the output?
Note
To install PySpark follow this guide
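The steps above can be sketched as follows. This is a minimal example, assuming PySpark is already installed and importable; the helper name `create_local_session` and the application name `homework` are hypothetical and not part of the assignment.

```python
def create_local_session(app_name="homework"):
    """Build (or reuse) a SparkSession that runs on the local machine."""
    # Imported inside the function so this file still loads on a machine
    # where PySpark has not been installed yet.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master("local[*]")   # run locally, using all available cores
        .appName(app_name)    # hypothetical name; it is shown in the Spark UI
        .getOrCreate()
    )
```

With a session in hand, `create_local_session().version` returns the version string the question asks about.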
Read the October 2024 Yellow data into a Spark DataFrame.
Repartition the DataFrame into 4 partitions and save it as Parquet.
What is the average size, in MB, of the Parquet files (those ending with the .parquet extension) that were created? Select the answer that most closely matches.
- 6MB
- 25MB
- 75MB
- 100MB
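One possible way to approach this question is sketched below. It assumes an existing `SparkSession` is passed in as `spark`; the function names and paths are hypothetical, and 1 MB is taken to mean 1024 * 1024 bytes, which is an assumption about how the answer options were measured.

```python
from pathlib import Path


def repartition_and_save(spark, src, dst, partitions=4):
    """Read a Parquet file, repartition it, and write it back as Parquet."""
    df = spark.read.parquet(src)
    # mode("overwrite") lets the step be re-run without a "path exists" error.
    df.repartition(partitions).write.mode("overwrite").parquet(dst)


def average_parquet_size_mb(directory):
    """Mean size in MB of the files ending in .parquet inside `directory`."""
    # Spark also writes _SUCCESS and .crc files; the glob pattern skips them.
    sizes = [p.stat().st_size for p in Path(directory).glob("*.parquet")]
    if not sizes:
        raise ValueError(f"no .parquet files found in {directory}")
    return sum(sizes) / len(sizes) / (1024 * 1024)
```

For example, `repartition_and_save(spark, "yellow_tripdata_2024-10.parquet", "data/yellow")` followed by `average_parquet_size_mb("data/yellow")` would give the figure to compare against the options; `data/yellow` is just an illustrative output directory.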
How many taxi trips were there on the 15th of October?
Consider only trips that started on the 15th of October.
- 85,567
- 105,567
- 125,567
- 145,567
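A count like this can be expressed in Spark SQL. The sketch below assumes the trips DataFrame has been registered as a temp view (for instance with `df.createOrReplaceTempView("trips")`) and that the pickup timestamp column is named `tpep_pickup_datetime`, as in the published Yellow taxi schema; the function name and view name are hypothetical.

```python
from datetime import date


def count_trips_on_day(spark, view, day):
    """Count rows in `view` whose pickup timestamp falls on `day` (YYYY-MM-DD)."""
    # Parsing the date first rejects malformed input before it reaches the query.
    day = date.fromisoformat(day).isoformat()
    query = f"""
        SELECT COUNT(*) AS trips
        FROM {view}
        WHERE to_date(tpep_pickup_datetime) = DATE '{day}'
    """
    return spark.sql(query).first()["trips"]
```

Filtering on the pickup column only matches the instruction to consider trips that started on that day, regardless of when they ended.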
What is the length of the longest trip in the dataset in hours?
- 122
- 142
- 162
- 182
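The trip length can be derived from the two timestamps of each row. The sketch below assumes a temp view over the trips data and the column names `tpep_pickup_datetime` and `tpep_dropoff_datetime` from the published Yellow taxi schema; the function name is hypothetical, and whether `unix_timestamp` accepts the column type as read may depend on your Spark version.

```python
def longest_trip_hours(spark, view):
    """Return the largest dropoff-minus-pickup duration in `view`, in hours."""
    query = f"""
        SELECT MAX(
            (unix_timestamp(tpep_dropoff_datetime)
             - unix_timestamp(tpep_pickup_datetime)) / 3600
        ) AS hours
        FROM {view}
    """
    # unix_timestamp yields seconds since the epoch, so the difference is in
    # seconds and dividing by 3600 converts it to hours.
    return spark.sql(query).first()["hours"]
```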
On which local port does Spark's user interface, which shows the application's dashboard, run?
- 80
- 443
- 4040
- 8080
Load the zone lookup data into a temp view in Spark:
wget https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv
Using the zone lookup data and the Yellow October 2024 data, what is the name of the LEAST frequent pickup location Zone?
- Governor's Island/Ellis Island/Liberty Island
- Arden Heights
- Rikers Island
- Jamaica Bay
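A join between the two datasets could look like the sketch below. It assumes an existing `SparkSession`, a temp view over the trips data, and the column names `PULocationID` (trips) and `LocationID` / `Zone` (lookup CSV) from the published files; the function names and the default view name `zones` are hypothetical.

```python
def register_zones(spark, csv_path, view="zones"):
    """Load the zone lookup CSV and expose it to Spark SQL as a temp view."""
    zones = spark.read.csv(csv_path, header=True, inferSchema=True)
    zones.createOrReplaceTempView(view)


def least_frequent_pickup_zone(spark, trips_view, zones_view):
    """Return (zone, trips) for the zone with the fewest pickups."""
    query = f"""
        SELECT z.Zone AS zone, COUNT(*) AS trips
        FROM {trips_view} AS t
        JOIN {zones_view} AS z ON t.PULocationID = z.LocationID
        GROUP BY z.Zone
        ORDER BY trips ASC
        LIMIT 1
    """
    row = spark.sql(query).first()
    return row["zone"], row["trips"]
```

Because this is an inner join, zones with no pickups at all do not appear in the result, so the answer is the least frequent zone among those that have at least one trip.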
- Form for submitting: https://courses.datatalks.club/de-zoomcamp-2025/homework/hw5
- Deadline: See the website