This is a getting-started guide for XGBoost4J-Spark using an Apache Toree Jupyter notebook. At the end of this guide, you will be able to run a sample notebook on NVIDIA GPUs.

Before you begin, please ensure that you have set up a Spark cluster (Standalone or YARN).
You should change the `--master` config according to your cluster architecture; for example, set `--master yarn` for Spark on YARN.

It is assumed that the `SPARK_MASTER` and `SPARK_HOME` environment variables are defined and point to the Spark master URL (e.g. `spark://localhost:7077`) and the home directory of Apache Spark, respectively.
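For example, assuming a standalone cluster with Spark installed under `/opt/spark` (both values below are placeholders; adjust them for your environment):

```shell
# Placeholders -- point these at your actual Spark installation and master URL
export SPARK_HOME=/opt/spark
export SPARK_MASTER=spark://localhost:7077
```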
- Make sure you have Jupyter Notebook and sbt installed first.
- Build Toree locally to support Scala 2.12, and install it.

  ```shell
  # Download Toree
  wget https://github.com/apache/incubator-toree/archive/refs/tags/v0.5.0-incubating-rc4.tar.gz
  tar -xvzf v0.5.0-incubating-rc4.tar.gz

  # Build the Toree pip package
  cd incubator-toree-0.5.0-incubating-rc4
  make pip-release

  # Install Toree
  pip install dist/toree-pip/toree-0.5.0.tar.gz
  ```
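  As a quick sanity check (not part of the original steps), you can confirm the package is visible to pip:

  ```shell
  # Verify the Toree pip package is installed
  pip show toree
  ```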
- Prepare packages and dataset.

  Make sure you have prepared the necessary packages and dataset by following this guide.
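  The kernel-installation commands below reference the `RAPIDS_JAR` and `SAMPLE_JAR` environment variables. A minimal sketch of defining them, assuming the jars from the preparation guide sit in the current directory (the file names are illustrative placeholders; substitute your actual versions):

  ```shell
  # Illustrative placeholders -- replace <version> with the versions you downloaded
  export RAPIDS_JAR=$(pwd)/rapids-4-spark_2.12-<version>.jar
  export SAMPLE_JAR=$(pwd)/sample_xgboost_apps-<version>-jar-with-dependencies.jar
  ```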
- Install a new kernel with GPU enabled and launch the notebook.

  Note: for ETL jobs, set `spark.task.resource.gpu.amount` to `1/spark.executor.cores`. For example, with `spark.executor.cores=10`, setting `spark.task.resource.gpu.amount=0.1` lets up to 10 concurrent tasks share each GPU.

  For ETL:
  ```shell
  jupyter toree install \
  --spark_home=${SPARK_HOME} \
  --user \
  --toree_opts='--nosparkcontext' \
  --kernel_name="ETL-Spark" \
  --spark_opts='--master ${SPARK_MASTER} \
    --jars ${RAPIDS_JAR},${SAMPLE_JAR} \
    --conf spark.plugins=com.nvidia.spark.SQLPlugin \
    --conf spark.executor.extraClassPath=${RAPIDS_JAR} \
    --conf spark.executor.cores=10 \
    --conf spark.task.resource.gpu.amount=0.1 \
    --conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
    --files $SPARK_HOME/examples/src/main/scripts/getGpusResources.sh'
  ```
  For XGBoost:

  ```shell
  jupyter toree install \
  --spark_home=${SPARK_HOME} \
  --user \
  --toree_opts='--nosparkcontext' \
  --kernel_name="XGBoost-Spark" \
  --spark_opts='--master ${SPARK_MASTER} \
    --jars ${RAPIDS_JAR},${SAMPLE_JAR} \
    --conf spark.plugins=com.nvidia.spark.SQLPlugin \
    --conf spark.executor.extraClassPath=${RAPIDS_JAR} \
    --conf spark.rapids.memory.gpu.pool=NONE \
    --conf spark.executor.resource.gpu.amount=1 \
    --conf spark.executor.cores=10 \
    --conf spark.task.resource.gpu.amount=1 \
    --conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
    --files $SPARK_HOME/examples/src/main/scripts/getGpusResources.sh'
  ```
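  To verify that both kernels were registered (the display names may differ slightly depending on your Toree version), list the installed kernelspecs:

  ```shell
  # Both the ETL-Spark and XGBoost-Spark kernels should show up here
  jupyter kernelspec list
  ```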
  Launch the notebook:

  ```shell
  jupyter notebook
  ```
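  If the notebook server runs on a remote machine, you may need to make it reachable from your browser; a common variant (not part of the original steps) is:

  ```shell
  # Listen on all interfaces and skip opening a local browser
  jupyter notebook --ip=0.0.0.0 --port=8888 --no-browser
  ```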
- Launch ETL Part
- Launch XGBoost Part