This project contains a collection of code snippets designed for hands-on exploration of Apache Spark functionalities, with a strong focus on performance optimization.
It is the core of the breakout session "Drastically Reducing Processing Costs with Delta Lake" presented at the 2024 edition of the Data+AI Summit.
Some of the problems addressed:
- Spill
- Thread Contention
- Read amplification (addressed with partition pruning, DFP, z-ordering, FP, ...)
- Write amplification (addressed with deletion vectors, ...)
- Spark UI retention limits
- ...
Each snippet is pedagogic, self-contained, and executable; each focuses on one performance problem and provides at least one possible solution.
Thanks to this, users can learn about these problems and features by:
- exploring and reading the snippets
- running them
- following them on Spark UI
- monitoring them
- modifying them
- ...
A typical snippet has the following structure:
// Spark: <version of spark this snippet is intended to be used with>
// Local: <spark shell extra options when used locally, e.g. --master local[2] --driver-memory 1G >
// Databricks: <details about the runtime used to test the snippet, unused for now>
// COMMAND ----------
/*
<brief description of problem and a possible solution high level>
# Symptom
<problem symptoms>
# Explanation
<explanation of the potential solution, the analysis or any other detail about how to address the problem>
# What to aim for concretely
<concrete steps to get to the metrics / indicator that the solution proposed is working properly>
*/
// COMMAND ----------
spark.sparkContext.setJobDescription("JOB") // label the job, to easily spot it in the Spark UI
val path = ???
val airports = spark.read.option("delimiter", "^").option("header", "true").csv(path)
// ... more Scala code ...
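For illustration, here is a minimal, hypothetical snippet following that structure (the problem chosen, the names and the path are placeholders, not an actual snippet from the project):

// Spark: 3.4.1
// Local: --master local[2] --driver-memory 1G
// Databricks: (unused for now)
// COMMAND ----------
/*
Reading a CSV without an explicit schema forces Spark to infer it, which triggers an extra pass over the data.
# Symptom
An additional job appears in the Spark UI before the actual read.
# Explanation
Passing the schema explicitly avoids the inference pass.
# What to aim for concretely
A single job for the read in the Spark UI, instead of two.
*/
// COMMAND ----------
import org.apache.spark.sql.types.{StringType, StructField, StructType}
spark.sparkContext.setJobDescription("Read CSV with explicit schema")
val schema = StructType(Seq(StructField("id", StringType), StructField("name", StringType)))
val path = "/tmp/airports.csv" // placeholder path
val airports = spark.read.option("delimiter", "^").option("header", "true").schema(schema).csv(path)
airports.count()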
The very first time you use the project, you need to do some setup.
Add the following to your ~/.bashrc (~/.zshrc on macOS):
export PERF_HIKES_PATH=<this-path>
source $PERF_HIKES_PATH/source.sh # to define the aliases
You can use the project:
- locally, using spark-shell, or
- remotely, using your own Databricks workspace
For local use:
- Install Spark locally and make sure spark-shell is in your PATH.
You can install Apache Spark and the Spark Shell by following the instructions on the Apache Spark website. To reproduce the expected behavior of each snippet as closely as possible, install the same version of Spark as the one mentioned in the snippet (at least the same major.minor; a quick way to check your local version is shown after this list).
- Append the following to your ~/.bashrc (~/.zshrc on macOS):
export PATH=$PATH:<spark-shell-directory>
- Then set up some sample datasets by running, in your shell:
spark-init-datasets-local
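As mentioned above, the Spark version matters. A quick sanity check of what your local installation provides (assuming spark-shell is already on your PATH):

spark-shell --version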
For Databricks use:
- Configure the Databricks CLI on your machine. Be sure to:
  - install Databricks CLI version 0.205 or above (cf. https://docs.databricks.com/en/dev-tools/cli/tutorial.html)
  - define the Databricks profile (containing the name, URL and credentials of your Databricks workspace) in your ~/.databrickscfg file (an example entry is shown after this list).
- Then set up some sample datasets:
spark-init-datasets-databricks <databricks-profile-name>
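For reference, a profile entry in ~/.databrickscfg looks like this (the profile name and values below are placeholders):

[my-workspace]
host  = https://<your-workspace-url>
token = <your-personal-access-token>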
To run a snippet locally:
spark-run-local <snippet.scala>
To run a snippet on Databricks, first copy the snippet you want to your Databricks workspace:
spark-import-databricks <snippet.scala> <databricks-profile-name> <destination-folder> <extra-args: e.g., --overwrite>
Then run it as a notebook.
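For example, a hypothetical invocation (the snippet name, profile and destination folder below are placeholders):

spark-import-databricks spill.increase-memory.scala my-workspace /Users/me@example.com/perf-hikes --overwrite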
To write a new snippet, first set up the environment as described above.
We encourage you to create a build.sbt file and load the project as a Scala project:
val sparkVersion = "3.4.1"
val deltaVersion = "2.4.0"
val deps = Seq(
"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" %% "spark-sql" % sparkVersion,
"org.apache.spark" %% "spark-avro" % sparkVersion,
"io.delta" %% "delta-core" % deltaVersion
)
lazy val root = (project in file("."))
.settings(
name := "perf-hikes",
scalaVersion := "2.12.13",
libraryDependencies := deps
)
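If you prefer to experiment with Delta Lake directly in a local spark-shell session (outside sbt), one option is to pull in the matching Delta artifacts on the fly; a sketch, assuming the versions declared above:

spark-shell --master local[2] --driver-memory 1G \
  --packages io.delta:delta-core_2.12:2.4.0 \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog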
A new snippet must honor the description and the structure provided above.
The naming convention is: <problem-description>.<solution-description>.scala
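For example, a hypothetical name following this convention:

spill.increase-shuffle-partitions.scala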
Please follow the Apache Software Foundation Code of Conduct.
All contributions must go through a Pull Request review before being accepted.
Before opening a Pull Request, please consider the following questions:
- Is the change ready enough to ask the community to spend time reviewing?
- Is the change being proposed clearly explained and motivated?
When you contribute code, you affirm that the contribution is your original work and that you license the work to the project under the project's open source license. Whether or not you state this explicitly, by submitting any copyrighted material via pull request, email, or other means you agree to license the material under the project's open source license and warrant that you have the legal authority to do so.