This directory provides instructions to reproduce Intel Gaudi's results for MLPerf™ Inference submission.
MLPerf™ is a trademark and service mark of MLCommons Association in the United States and other countries.
All rights reserved. Unauthorized use is strictly prohibited.
Please follow the instructions provided in the Intel® Gaudi® Installation Guide to set up the environment.
Set CPU to performance mode:
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
In order to verify the CPU mode, run:
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
It is recommended to set the number of huge pages as provided below:
#set current hugepages
sudo sysctl -w vm.nr_hugepages=100000
#Remove old entry if exists in sysctl.conf
sudo sed --in-place '/nr_hugepages/d' /etc/sysctl.conf
#Insert huge pages settings to persist
echo "vm.nr_hugepages=100000" | sudo tee -a /etc/sysctl.conf
mkdir -p /path/to/Intel-HabanaLabs
export INTEL_HABANALABS_DIR=/path/to/Intel-HabanaLabs
This README is located in code/llama2-70b-99.9 directory corresponding to Intel-HabanaLabs submission for llama2-70b-99.9 benchmark. Download the whole code folder along with all subfolders and copy it under $INTEL_HABANALABS_DIR
docker run --privileged --security-opt seccomp=unconfined \
--name mlperf-intel-habanalabs -td \
-v /dev:/dev \
--device=/dev:/dev \
-v /sys/kernel/debug:/sys/kernel/debug \
-v /tmp:/tmp \
-v $INTEL_HABANALABS_DIR:/root/Intel-HabanaLabs/ \
--cap-add=sys_nice --cap-add=SYS_PTRACE \
--user root --workdir=/root --net=host \
--ulimit memlock=-1:-1 vault.habana.ai/gaudi-docker-mlperf/ver4.0/pytorch-installer-2.1.1:1.14.98-33
docker exec -it mlperf-intel-habanalabs bash
Choose one of the two available paths for downloading the model.
MLCommons hosts the model and preprocessed dataset for download exclusively by MLCommons Members. You must first agree to the confidentiality notice, then follow the link to a directory containing Rclone download instructions.
Using an e-mail registered for HuggingFace (if you do not have an account, you will need to create one), go to llama2-request-link and make a request. Having it accepted, go to HuggingFace Llama2 page and ask for an access to the model. Please note your HuggingFace authentication credentials as you may be required to provide them when cloning the below. The download requires Git Large Files Storage:
mkdir -p /mnt/weka/data/pytorch/llama2/
pushd /mnt/weka/data/pytorch/llama2/
apt-get update
apt-get install git-lfs
git clone https://huggingface.co/meta-llama/Llama-2-70b-chat-hf Llama-2-70b-chat-hf
popd
pushd /root/Intel-HabanaLabs/code/llama2-70b-99.9/
export EXPORT_DIR=${PWD}/open_orca
mkdir -p /mnt/weka/data/mlperf_inference/llama2/
export DATASET_PATH=/mnt/weka/data/mlperf_inference/llama2/processed-data.pkl
curl https://rclone.org/install.sh | bash
rclone config create mlc-inference s3 provider=Cloudflare access_key_id=f65ba5eef400db161ea49967de89f47b \
secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b \
endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
rclone copy mlc-inference:mlcommons-inference-wg-public/open_orca ./open_orca -P
pushd open_orca
gzip -d open_orca_gpt4_tokenized_llama.sampled_24576.pkl.gz
popd
mv ${EXPORT_DIR}/open_orca_gpt4_tokenized_llama.sampled_24576.pkl ${DATASET_PATH}
md5sum ${DATASET_PATH}
popd
The md5sum of generated dataset file should be 5fe8be0a7ce5c3c9a028674fd24b00d5.
The same script was submitted for both 99 and 99.9 benchmarks - no additional improvements were made for low accuracy (99), and 99.9 results were used for 99 as well.
Source the necessary files:
cd /root/Intel-HabanaLabs/code/llama2-70b-99.9/
source functions.sh
To generate full submission results, run the following command:
build_mlperf_inference --output-dir <path_to_output_dir> --submission llama-99.9-fp8
The command produces results from accuracy and performance runs for both Offline and Server scenarios. Logs can be found under /output_dir/logs/model/, e.g. /results/logs/llama-99.9-fp8/
To generate results for Offline and Server scenarios separately, run the following commands:
build_mlperf_inference --output-dir <path_to_output_dir> --submission llama-99.9-fp8_Offline
build_mlperf_inference --output-dir <path_to_output_dir> --submission llama-99.9-fp8_Server
Logs can be found under /output_dir/logs/model/scenario/, e.g. /results/logs/llama-99.9-fp8/Offline/
To generate results for accuracy and performance separately, add --mode
flag as in one of the following commands:
build_mlperf_inference --output-dir <path_to_output_dir> --submission llama-99.9-fp8_Server --mode acc
build_mlperf_inference --output-dir <path_to_output_dir> --submission llama-99.9-fp8_Offline --mode perf
Logs can be found under /output_dir/logs/model/scenario/mode/, e.g. /results/logs/llama-99.9-fp8/Offline/accuracy/
All linear operators' input activations and weights (linear and matmul operators) are quantized to FP8-143. The weights are pre-quantized to FP8-143.
The submission contains measurement files required for FP8 quantization.
The following procedure was used to generate them:
cd /root/Intel-HabanaLabs/code/llama2-70b-99.9/llama
export QUANT_CONFIG=hqt/llama2-70b-8x/config_meas_maxabs.json
deepspeed --num_gpus 8 llama_greedy.py --model_name_or_path /mnt/weka/data/pytorch/llama2/Llama-2-70b-chat-hf/ \
--bf16 --attn_softmax_bf16 --use_hpu_graphs --use_kv_cache --batch_size 128 --reuse_cache \
--trim_logits --limit_hpu_graphs --dataset $EXPORT_DIR/open_orca_gpt4_tokenized_llama.calibration_1000.pkl
The Quantization Toolkit is described in the Intel Gaudi documentation.
Validated on | Intel Gaudi Software Version | Framework Version(s) | Mode |
---|---|---|---|
Gaudi 2 | 1.16.1 | PyTorch 2.2.2 | Inference |