Get Started with XGBoost4J-Spark with Apache Toree Jupyter Notebook

This is a getting started guide to XGBoost4J-Spark using an Apache Toree Jupyter notebook. At the end of this guide, you will be able to run a sample notebook that runs on NVIDIA GPUs.

Before you begin, please ensure that you have set up a Spark cluster (Standalone or YARN). Change the --master configuration according to your cluster architecture; for example, set --master yarn for Spark on YARN.

It is assumed that the SPARK_MASTER and SPARK_HOME environment variables are defined and point to the Spark master URL (e.g. spark://localhost:7077) and the Apache Spark home directory, respectively.
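
For example (a sketch only; adjust the master URL and the Spark installation path to match your environment):

    # Standalone cluster master:
    export SPARK_MASTER=spark://localhost:7077
    # or, for Spark on YARN:
    # export SPARK_MASTER=yarn
    export SPARK_HOME=/opt/spark   # assumed install location; use your actual Spark home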

  1. Make sure you have Jupyter Notebook and sbt installed first.
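
    A quick sanity check (optional; the exact version output will vary):

    jupyter notebook --version
    sbt --version   # or: sbt sbtVersion, depending on your sbt launcher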

  2. Build Toree locally to support Scala 2.12, and install it.

    # Download toree
    wget https://github.com/apache/incubator-toree/archive/refs/tags/v0.5.0-incubating-rc4.tar.gz
    tar -xvzf v0.5.0-incubating-rc4.tar.gz
    # Build the Toree pip package.
    cd incubator-toree-0.5.0-incubating-rc4
    make pip-release
    # Install Toree
    pip install dist/toree-pip/toree-0.5.0.tar.gz
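
    Optionally verify that the Toree pip package is now visible to pip (output varies by environment):

    pip show toree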
  3. Prepare packages and dataset.

    Make sure you have prepared the necessary packages and dataset by following this guide.
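
    The kernel install commands in the next step reference ${RAPIDS_JAR} and ${SAMPLE_JAR}. Export them to point at the jars you prepared; the paths and file names below are placeholders only and depend on the versions from that guide:

    export RAPIDS_JAR=/path/to/rapids-4-spark_2.12-<version>.jar
    export SAMPLE_JAR=/path/to/<sample-apps-jar-from-the-guide>.jar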

  4. Install a new kernel with GPU enabled and launch the notebook.

    Note: For ETL jobs, set spark.task.resource.gpu.amount to 1/spark.executor.cores (e.g. 0.1 when spark.executor.cores=10, as in the command below).

    For ETL:

    jupyter toree install \
      --spark_home=${SPARK_HOME} \
      --user \
      --toree_opts='--nosparkcontext' \
      --kernel_name="ETL-Spark" \
      --spark_opts='--master ${SPARK_MASTER} \
        --jars ${RAPIDS_JAR},${SAMPLE_JAR} \
        --conf spark.plugins=com.nvidia.spark.SQLPlugin \
        --conf spark.executor.extraClassPath=${RAPIDS_JAR} \
        --conf spark.executor.cores=10 \
        --conf spark.task.resource.gpu.amount=0.1 \
        --conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
        --files $SPARK_HOME/examples/src/main/scripts/getGpusResources.sh'

    For XGBoost:

    jupyter toree install \
      --spark_home=${SPARK_HOME} \
      --user \
      --toree_opts='--nosparkcontext' \
      --kernel_name="XGBoost-Spark" \
      --spark_opts='--master ${SPARK_MASTER} \
        --jars ${RAPIDS_JAR},${SAMPLE_JAR} \
        --conf spark.plugins=com.nvidia.spark.SQLPlugin \
        --conf spark.executor.extraClassPath=${RAPIDS_JAR} \
        --conf spark.rapids.memory.gpu.pool=NONE \
        --conf spark.executor.resource.gpu.amount=1 \
        --conf spark.executor.cores=10 \
        --conf spark.task.resource.gpu.amount=1 \
        --conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
        --files $SPARK_HOME/examples/src/main/scripts/getGpusResources.sh'
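
    Optionally, confirm that both kernels were registered (Toree typically appends its own suffix, so the listed kernel names may differ slightly):

    jupyter kernelspec list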

    Launch the notebook:

    jupyter notebook
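
    If the notebook server runs on a remote machine, the standard Jupyter options below may be useful (adjust the IP and port for your setup):

    jupyter notebook --no-browser --ip=0.0.0.0 --port=8888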
  5. Launch ETL Part

  • Mortgage ETL Notebook: Scala
  • Taxi ETL Notebook: Scala
  • Note: Agaricus does not have an ETL part.
  6. Launch XGBoost Part
  • Mortgage XGBoost Notebook: Scala
  • Taxi XGBoost Notebook: Scala
  • Agaricus XGBoost Notebook: Scala