- You have a machine with one or more NVIDIA A100 80 GB or NVIDIA H100 GPUs. If you have fewer than four GPUs, you can configure GPU time-slicing. Time-slicing oversubscribes the GPUs to simulate the four GPUs that are required, though at lower performance.
- You have access to Docker and Docker Compose to build container images. Refer to the Ubuntu installation instructions in the Docker documentation.
- You have Kubernetes installed and running on the machine with Ubuntu 22.04 or 20.04. Refer to the Kubernetes documentation or the NVIDIA Cloud Native Stack repository for more information.
- You have access to Git and Git LFS to clone the repository that provides the Dockerfiles and software for the container images.
- You downloaded the Llama 2 chat model weights from Meta or HuggingFace. Get the 13 billion or 7 billion parameter model. Request access to the model from Meta or refer to the meta-llama/Llama-2-13b-chat-hf page on HuggingFace. The directory with the model is shared as a host path volume mount with the Triton Inference Server pod. Sketches for verifying the prerequisites and fetching the model weights follow this list.
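A minimal sketch for checking the prerequisites on the node (the exact output depends on your driver, Docker, Kubernetes, and Git LFS versions):

$ nvidia-smi --query-gpu=index,name,memory.total --format=csv
$ docker --version && docker compose version
$ kubectl get nodes
$ git lfs version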
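One way to fetch the weights, assuming you have been granted access to the gated meta-llama repository and authenticate with your HuggingFace username and an access token when prompted, is to clone the model repository with Git LFS:

$ git lfs install
$ git clone https://huggingface.co/meta-llama/Llama-2-13b-chat-hf

The resulting directory on the node is what you later reference as the modelDirectory value for the Triton Inference Server.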
Use the NVIDIA GPU Operator to install, configure, and manage the NVIDIA GPU driver and NVIDIA container runtime on the Kubernetes node.
- Add the NVIDIA Helm repository:
$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
    && helm repo update
- Install the Operator:
$ helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator
- Optional: Configure GPU time-slicing if you have fewer than four GPUs.
- Create a file, time-slicing-config-all.yaml, with the following content:

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config-all
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4

The sample configuration creates four replicas from each GPU on the node.
- Add the config map to the Operator namespace:
$ kubectl create -n gpu-operator -f time-slicing-config-all.yaml
- Configure the device plugin with the config map and set the default time-slicing configuration:
$ kubectl patch clusterpolicy/cluster-policy \
    -n gpu-operator --type merge \
    -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config-all", "default": "any"}}}}'
- Verify that at least 4 GPUs are allocatable:
$ kubectl get nodes -l nvidia.com/gpu.present -o json | jq '.items[0].status.allocatable | with_entries(select(.key | startswith("nvidia.com/"))) | with_entries(select(.value != "0"))'
Example Output
{
  "nvidia.com/gpu": "4"
}
- For more information or to adjust the configuration, refer to Install NVIDIA GPU Operator and Time-Slicing GPUs in Kubernetes in the NVIDIA GPU Operator documentation.
- Get the Helm chart for the Operator:
$ helm fetch https://helm.ngc.nvidia.com/nvidia/cloud-native/charts/developer-llm-operator-0.1.0.tgz
- Install the Operator:
$ helm install --generate-name ./developer-llm-operator-0.1.0.tgz \
    -n kube-trailblazer-system --create-namespace
- Optional: Confirm the controller pod is running:
$ kubectl get pods -n kube-trailblazer-system
Example Output
NAME                                                   READY   STATUS    RESTARTS      AGE
kube-trailblazer-controller-manager-868bf8dc84-p2zgc   2/2     Running   2 (20h ago)   21h
- Clone the repository if you haven't already:
$ git lfs clone https://github.com/NVIDIA/GenerativeAIExamples.git
- Build the container images:
$ cd GenerativeAIExamples/deploy/compose
$ docker compose --env-file compose.env build
Building the images requires several minutes.
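Optionally, confirm that the images were built and are available locally; the image names below match the tags that are pushed in a later step:

$ docker images | grep -E 'llm-inference-server|chain-server|llm-playground|notebook-server'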
- Start a local registry, tag the images, and push the images to the registry.
- Start a local registry:
$ docker run -d -p 5000:5000 --name registry registry:2.7
- Tag and push the images that are not publicly available:
$ docker tag llm-inference-server localhost:5000/llm-inference-server
$ docker push localhost:5000/llm-inference-server
$ docker tag chain-server localhost:5000/chain-server
$ docker push localhost:5000/chain-server
$ docker tag llm-playground localhost:5000/llm-playground
$ docker push localhost:5000/llm-playground
$ docker tag notebook-server localhost:5000/notebook-server
$ docker push localhost:5000/notebook-server
- Optional: Confirm the images are available from the local registry:
$ curl -sSL "http://localhost:5000/v2/_catalog"
Example Output
{"repositories":["chain-server","llm-inference-server","llm-playground","notebook-server"]}
- Create a file, such as rag-llm-pipeline.yaml, with contents like the following example:

apiVersion: package.nvidia.com/v1alpha1
kind: HelmPipeline
metadata:
  name: rag-llm-pipeline
spec:
  pipeline:
  - repoEntry:
      url: "file:///helm-charts/staging"
    chartSpec:
      chart: "rag-llm-pipeline"
    chartValues:
      triton:
        modelDirectory: "<PATH>/llama2_13b_chat_hf_v1/"

Modify the modelDirectory value to match the location and name of the model directory on the Kubernetes node.
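For example, if the model weights were placed under /opt/models on the node (a hypothetical location), the value would look like this:

chartValues:
  triton:
    modelDirectory: "/opt/models/llama2_13b_chat_hf_v1/"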
- Apply the manifest:
$ kubectl apply -n kube-trailblazer-system -f rag-llm-pipeline.yaml
The Operator creates the rag-llm-pipeline namespace and creates deployments and services in the namespace. Downloading the container images and starting the pods can require a few minutes.
- Optional: Monitor progress.
- View the logs from the Operator controller pod:
$ kubectl logs -n kube-trailblazer-system $(kubectl get pod -n kube-trailblazer-system -o=jsonpath='{.items[0].metadata.name}')
- View the pods in the pipeline namespace:
$ kubectl get pods -n rag-llm-pipeline
Example Output
NAME                                       READY   STATUS    RESTARTS   AGE
jupyter-notebook-server-6d6b46578d-98xdq   1/1     Running   0          21h
llm-playground-6fd649ff8f-r2hp6            1/1     Running   0          22h
milvu-etcd-6559759884-9rvpz                1/1     Running   0          22h
milvus-minio-6fc5b9bdd4-d7l4z              1/1     Running   0          22h
milvus-standalone-9bfb5d974-tsjtp          1/1     Running   0          22h
query-router-77499f5459-6jjr9              1/1     Running   0          22h
triton-inference-server-79d5c499b-26nqq    0/1     Running   0          22h
- View the services and node ports:
$ kubectl get svc -n rag-llm-pipeline
Example Output
NAME                       TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
frontend-service           NodePort    10.111.66.10     <none>        8090:30001/TCP   22h
jupyter-notebook-service   NodePort    10.110.101.174   <none>        8888:30000/TCP   22h
llm                        ClusterIP   10.107.213.112   <none>        8001/TCP         22h
milvus                     ClusterIP   10.102.86.183    <none>        19530/TCP        22h
milvus-etcd                ClusterIP   10.109.74.142    <none>        2379/TCP         22h
milvus-minio               ClusterIP   10.103.238.28    <none>        9010/TCP         22h
query                      ClusterIP   10.110.199.69    <none>        8081/TCP         22h
The output shows that the chat web application, frontend-service, is mapped to port 30001 on the Kubernetes host through a node port. The output also shows that the Jupyter Notebook server is mapped to port 30000 on the host.
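As a quick reachability check, assuming you run it on the Kubernetes node itself (substitute the node IP address from another machine), confirm that the web application responds on the node port by printing the HTTP status code:

$ curl -sS -o /dev/null -w "%{http_code}\n" http://localhost:30001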
- Open a browser and access http://localhost:30001, or replace localhost with the IP address of the Kubernetes node.
- Upload a PDF file as a knowledge base for retrieval.
- Access http://localhost:30001/converse and click Knowledge Base.
- Browse to a local file and upload it to the web application.
- When you return to the Converse tab to ask a question, enable the Use knowledge base checkbox.
- Open a browser and access http://localhost:30000, or replace localhost with the IP address of the Kubernetes node. Browse and run the notebooks that are part of the container image.