Spark XGBoost Examples

The Spark XGBoost examples here showcase the benefit of GPU acceleration for the ETL+Training pipeline. The Scala-based XGBoost examples use DMLC's version of XGBoost, while the PySpark-based examples require installing RAPIDS via pip. Most data scientists spend a lot of time not only on training models but also on processing the large amounts of data needed to train them. As shown below, PySpark+XGBoost training on GPUs can be up to 13X faster, and data processing with the RAPIDS Accelerator is also accelerated, for an end-to-end GPU speed-up of 11X compared to CPU. In the public cloud, better performance can lead to significantly lower costs, as demonstrated in this blog.

(Figure: mortgage-speedup benchmark chart)

Note that the training results above are based on 4 years of the Fannie Mae Single-Family Loan Performance Data on a cluster with 8 A100 GPUs and 1024 CPU vcores; performance is affected by many factors, including data size and GPU type.

This folder contains three blueprints for learning how to use Spark XGBoost and the RAPIDS Accelerator on GPUs:

  1. Mortgage Prediction
  2. Agaricus Classification
  3. Taxi Fare Prediction

For each of these examples, we have prepared a sample dataset in this folder for testing. These datasets are provided only for convenience; to test performance, please download the larger datasets from their respective sources.

This README has three sections. The first section lists the notebooks that can be run on Jupyter with Python or Scala (Spylon kernel or Apache Toree kernel).

The second section provides sample jar files and source code for users who would like to build and run a Scala or PySpark Spark-XGBoost application.

The last section provides basic “Getting Started” guides for setting up GPU Spark-XGBoost on environments using different Apache Spark schedulers, such as YARN, Standalone, or Kubernetes.

SECTION 1: SPARK-XGBOOST EXAMPLE NOTEBOOKS

  1. Mortgage Notebooks
  2. Agaricus Notebooks
  3. Taxi Notebook

SECTION 2: BUILDING A PYSPARK OR A SCALA XGBOOST APPLICATION

The first step in building a Spark application is preparing the packages and datasets needed to build the jars. Please use the instructions below for building the jars.

In addition, we provide the source code for building reference applications. Below is the source code for the example Spark jobs:
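
For orientation, the sketch below shows the general shape of a PySpark Spark-XGBoost training job. It is illustrative only, not the repo's actual example code: the input path, column names, and parameter values are placeholders, and `device="cuda"` assumes xgboost >= 2.0 (older releases used `use_gpu=True`).

```python
# Minimal PySpark XGBoost training sketch (placeholder paths and columns).
from pyspark.sql import SparkSession
from xgboost.spark import SparkXGBClassifier

spark = SparkSession.builder.appName("xgboost-example").getOrCreate()

# Hypothetical parquet input with a "label" column and numeric feature columns.
df = spark.read.parquet("/path/to/train.parquet")
feature_cols = [c for c in df.columns if c != "label"]

# Passing a list of column names as features_col avoids a VectorAssembler step;
# device="cuda" enables GPU training (assumes xgboost >= 2.0).
classifier = SparkXGBClassifier(
    features_col=feature_cols,
    label_col="label",
    num_workers=2,
    device="cuda",
)

model = classifier.fit(df)
predictions = model.transform(df)
predictions.select("label", "prediction").show(5)
```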

SECTION 3: SETTING UP THE ENVIRONMENT

Please follow the steps below to run the example Spark jobs in different Spark environments:

Please follow the steps below to run the example notebooks in different notebook environments:

Note: Updating the default value of spark.sql.execution.arrow.maxRecordsPerBatch to a larger number (such as 200000) will
significantly improve performance by accelerating data transfer between the JVM and the Python process.
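
For example, assuming an existing SparkSession named `spark`, this setting can be changed at runtime (200000 is just the value suggested above; tune it for your data and memory budget):

```python
# Illustrative only: raise the Arrow batch size on a live SparkSession.
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "200000")
```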

For the CrossValidator job, we need to set spark.task.resource.gpu.amount=1 so that only one training task runs per GPU (executor); otherwise the customized CrossValidator may schedule more than one XGBoost training task onto a single executor simultaneously and trigger issue-131.

For the XGBoost job, if the number of tasks in the shuffle stage before training is less than num_worker, the training tasks will be scheduled onto only some of the nodes instead of all of them because of Spark's data locality feature. The workaround is to increase the number of partitions in the shuffle stage by setting spark.sql.files.maxPartitionBytes=RightNum.

If you are running the XGBoost Scala notebooks on Dataproc, please make sure to update the configs below to avoid job failures:

spark.dynamicAllocation.enabled=false
spark.task.resource.gpu.amount=1
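
As a sketch, these could be supplied when constructing the session; since they are static/cluster configurations, in practice on Dataproc they are more commonly passed as cluster or job properties (or via spark-submit --conf) instead:

```python
# Sketch: applying the configs above at session build time (illustrative only).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.dynamicAllocation.enabled", "false")
    .config("spark.task.resource.gpu.amount", "1")
    .getOrCreate()
)
```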