# Quick Start Guide for Distributed Workloads with the CodeFlare Stack

<!--toc:start-->
- [Quick Start Guide for Distributed Workloads with the CodeFlare Stack](#quick-start-guide-for-distributed-workloads-with-the-codeflare-stack)
  - [Prerequisites](#prerequisites)
    - [Resources](#resources)
    - [GPU Worker Node](#gpu-worker-node)
  - [Automatic deployment for developers (optional - skip to run steps manually)](#automatic-deployment-for-developers-optional-skip-to-run-steps-manually)
    - [Next Steps After Automatic Deployment](#next-steps-after-automatic-deployment)
  - [Red Hat OpenShift AI](#red-hat-openshift-ai)
    - [Configure Data Science Components](#configure-data-science-components)
  - [Preparing Your Cluster for GPU Workloads](#preparing-your-cluster-for-gpu-workloads)
    - [Installing the Operators](#installing-the-operators)
    - [Configure the Node Feature Discovery Operator](#configure-the-node-feature-discovery-operator)
    - [Configure the NVIDIA GPU Operator](#configure-the-nvidia-gpu-operator)
  - [Configure Kueue for Task Scheduling](#configure-kueue-for-task-scheduling)
  - [Launch a Jupyter Notebook Instance](#launch-a-jupyter-notebook-instance)
  - [Submit your first job](#submit-your-first-job)
    - [Clone the demo code](#clone-the-demo-code)
    - [Run the Guided Demo Notebooks](#run-the-guided-demo-notebooks)
  - [Cleaning up the AI Platform Install](#cleaning-up-the-ai-platform-install)
    - [Manual Cleanup Steps](#manual-cleanup-steps)
  - [Next Steps](#next-steps)
<!--toc:end-->

This quick start guide is intended to walk users through installation of the CodeFlare stack and an initial demo using the CodeFlare-SDK from within a Jupyter notebook environment. This will enable users to run and submit distributed workloads.

The CodeFlare-SDK was built to make managing distributed compute infrastructure in the cloud easy and intuitive for Data Scientists. However, that means there needs to be some cloud infrastructure on the backend for users to get the benefit of using the SDK. Currently, we support the CodeFlare stack.

This stack integrates well with Red Hat OpenShift AI and [Open Data Hub](https://opendatahub.io/), and helps to bring batch workloads, jobs, and queuing to the Data Science platform. Although this document will guide you through setting up with Red Hat OpenShift AI (RHOAI), the steps are also applicable if you are using Open Data Hub (ODH). Both platforms are available in the OperatorHub, and the installation and configuration steps are quite similar. This guide will proceed with RHOAI, but feel free to apply the instructions to ODH as needed.

## Prerequisites

### Resources

In addition to the resources required by default Red Hat OpenShift AI deployments, you will need the following to deploy the Distributed Workloads stack infrastructure pods:

```text
Total:
    CPU: 1600m (1.6 vCPU)
    Memory: 2048Mi (2 GiB)
```

> [!NOTE]
> The above resources are just for the infrastructure pods. To be able to run actual
> workloads on your cluster you will need additional resources based on the size
> and type of workload.
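
If you want to confirm that your cluster has this headroom before installing, one option (a sketch assuming the `oc` CLI is installed and you are logged in with cluster-admin access) is to list each node's allocatable CPU and memory:

```bash
# Print allocatable CPU and memory for every node in the cluster.
oc get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory
```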

### GPU Worker Node

> [!IMPORTANT]
> This step is necessary only if you require GPU capabilities for your workloads and your OpenShift cluster does not already include GPU-equipped nodes. If that is the case, follow these steps:

1. **Open the OpenShift Cluster Console.**
2. Navigate to **your-cluster** -> **Machine pools**.
3. Click on **“Add machine pool”**.
4. Provide a **name** for the new machine pool.
5. In the **“Compute node instance type”** dropdown, scroll all the way down and search for the GPU instance type `g4dn.xlarge` or similar.
6. Click on **Add machine pool** to finalize the creation of your new GPU-enabled machine pool.

After adding the machine pool, OpenShift will begin provisioning the new GPU worker node. This process can take a few minutes. Once completed, the new node will be ready to handle GPU-accelerated workloads.

> [!NOTE]
> The `g4dn.xlarge` instance type is used for GPU worker nodes. Ensure this instance type meets your application needs or select another as required.
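
Once provisioning finishes, you can confirm that the new worker registered with the cluster. This check is a sketch that assumes the `oc` CLI and the standard Kubernetes instance-type label:

```bash
# Nodes created from the GPU machine pool carry the standard instance-type label.
oc get nodes -l node.kubernetes.io/instance-type=g4dn.xlarge
```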

## Automatic deployment for developers (optional - skip to run steps manually)

As a quick alternative to the following manual deployment steps, an automatic _Makefile_ script can be used to deploy the CodeFlare stack. The script deploys the prerequisite operators along with the entire CodeFlare stack.

1. Clone the repository:

```bash
git clone https://github.com/project-codeflare/codeflare-operator.git
cd codeflare-operator
```

2. Run the Makefile script:

```bash
make all-in-one
```

> [!TIP]
> Execute `make help` to list additional available operations.

### Next Steps After Automatic Deployment

After the automatic deployment is complete, you can proceed directly to the section [Configure Kueue for Task Scheduling](#configure-kueue-for-task-scheduling) to finish setting up your environment.

## Red Hat OpenShift AI

This quick start guide assumes that you have administrator access to an OpenShift cluster and that an existing Red Hat OpenShift AI (RHOAI) installation with version **>2.9** is present on your cluster. If you still need to install RHOAI, the quick steps are as follows:

1. Using the OpenShift web console, navigate to **Operators** -> **OperatorHub**.
2. Search for `Red Hat OpenShift AI`.
3. Install it using the `fast` channel.

### Configure Data Science Components

After the installation of the Red Hat OpenShift AI Operator, proceed to configure the necessary components for data science work:

1. From the OpenShift web console, navigate to the installed RHOAI Operator.
2. Look for the tab labeled **DSC Initialization**.
3. If one has not already been created, locate `Create DSCInitialization` and create one.
4. Look for the tab labeled **Data Science Cluster**.
5. Locate `Create DataScienceCluster` and create one. A minimal YAML sketch of this resource is shown below the list.
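
For reference, a minimal `DataScienceCluster` might look like the sketch below. The exact set of components and their defaults depend on your RHOAI version, so treat the component list as illustrative rather than definitive:

```yaml
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  components:
    dashboard:
      managementState: Managed   # RHOAI web dashboard
    workbenches:
      managementState: Managed   # Jupyter workbench support
    codeflare:
      managementState: Managed   # CodeFlare operator integration
    ray:
      managementState: Managed   # KubeRay for distributed compute
    kueue:
      managementState: Managed   # Kueue job queueing
```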

## Preparing Your Cluster for GPU Workloads

To leverage GPU-enabled workloads on your OpenShift cluster, you need to install both the Node Feature Discovery (NFD) Operator and the NVIDIA GPU Operator.

### Installing the Operators

Both the NFD and the NVIDIA GPU Operators can be installed from the OperatorHub. Detailed steps for installation and configuration are provided in the NVIDIA documentation, which can be accessed [here](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/openshift/steps-overview.html#high-level-steps).

1. **Open the OpenShift dashboard.**
2. Navigate to **OperatorHub**.
3. Search for and install the following operators (default settings are fine):
   - **Node Feature Discovery Operator**
   - **NVIDIA GPU Operator**

### Configure the Node Feature Discovery Operator

After installing the Node Feature Discovery Operator, you need to create a Node Feature Discovery Custom Resource (CR). You can use the default settings for this CR:

1. Create the Node Feature Discovery CR in the dashboard.
2. Several pods will start in the `openshift-nfd` namespace (which is the default). Wait for all these pods to become operational. Once active, your nodes will be labeled with numerous feature flags, indicating that the operator is functioning correctly (a quick label check is sketched below).
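
To spot-check that the labels were applied, you can inspect any node; replace the hypothetical `<node-name>` placeholder with one of your node names:

```bash
# NFD publishes feature labels under the feature.node.kubernetes.io/ prefix.
oc describe node <node-name> | grep feature.node.kubernetes.io
```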

### Configure the NVIDIA GPU Operator

After installing the NVIDIA GPU Operator, proceed with creating a GPU ClusterPolicy Custom Resource (CR):

1. Create the GPU ClusterPolicy CR through the dashboard.
2. This action will trigger several pods to start in the NVIDIA GPU namespace.

> [!NOTE]
> These pods may take some time to become operational as they compile the necessary drivers.
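
Once the driver pods settle, each GPU node should advertise an allocatable `nvidia.com/gpu` resource. A quick check, assuming the operator was installed into its default `nvidia-gpu-operator` namespace:

```bash
# Watch the operator pods until they are all Running or Completed ...
oc get pods -n nvidia-gpu-operator

# ... then confirm the GPU resource is advertised by the nodes.
oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
```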

## Configure Kueue for Task Scheduling

Kueue is used for managing and scheduling task workflows in your cluster. To configure Kueue in your environment, follow the detailed steps provided in the guide linked below; a condensed sketch of the resulting objects follows the list.

1. Install the Kueue resources, namely a Cluster Queue, a Resource Flavor, and a Local Queue:
   - Visit [Kueue Resources configuration](https://github.com/project-codeflare/codeflare-sdk/blob/main/docs/setup-kueue.md)
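
That guide walks through the full setup. For orientation, the three objects fit together roughly as in the sketch below (Kueue's `kueue.x-k8s.io/v1beta1` API; the quota numbers are placeholders to adjust for your cluster):

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue
spec:
  namespaceSelector: {}              # admit workloads from any namespace
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 8              # placeholder quota
      - name: "memory"
        nominalQuota: 32Gi           # placeholder quota
      - name: "nvidia.com/gpu"
        nominalQuota: 1              # placeholder quota
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: local-queue
  namespace: default                 # the namespace your workloads run in
spec:
  clusterQueue: cluster-queue
```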

## Launch a Jupyter Notebook Instance

After setting up the Data Science Cluster components, you can start using Jupyter notebooks for your data science projects. Here’s how to launch a Jupyter notebook:

1. Access the RHOAI Dashboard:
   - Navigate to the Red Hat OpenShift AI dashboard within your OpenShift web console.
2. Create a Data Science Project:
   - Go to the Data Science Projects section from the dashboard menu.
   - Click on `Create data science project` and follow the prompts to set up a new project.
3. Launch a Jupyter Workbench:
   - Inside your newly created project, find and click on the "Create Workbench" button.
   - On the Workbench creation page, select "Standard Data Science" from the list of available notebook images. This image includes common data science libraries and tools that you might need.
   - Configure any additional settings, such as compute resources or environment variables, as needed, then click `Create workbench`.
4. Access Your Notebook:
   - Once the workbench is ready, click on the provided link or button to open your Jupyter notebook.

## Submit your first job

We can now go ahead and submit our first distributed model training job to our cluster.

This can be done from any Python-based environment, including a script or a Jupyter notebook. For this guide, we'll assume you've selected the "Jupyter Data Science" image from the list of available images on your notebook spawner page.

### Clone the demo code

Once your notebook environment is ready, we will want to run through some of the demo notebooks provided by the CodeFlare community in order to test our CodeFlare stack. So let's start by cloning their repo into our working environment:

```bash
git clone https://github.com/project-codeflare/codeflare-sdk
cd codeflare-sdk
```

For further development guidelines and instructions on setting up your development environment for the codeflare-sdk, please refer to the [CodeFlare SDK README](https://github.com/project-codeflare/codeflare-sdk?tab=readme-ov-file#development).

### Run the Guided Demo Notebooks

Get started with the guided demo notebooks for the CodeFlare-SDK by following these steps (a condensed sketch of the SDK flow follows the list):

1. Access your Jupyter notebook server.
2. Update your notebook with your access token and server details:
   - Retrieve your OpenShift access token by selecting your username in the console, choosing "Copy Login Command", and then "Display Token".
   - Open your desired demo notebook from the `codeflare-sdk/demo-notebooks/guided-demos` directory.
   - Update the notebook with your access token and server details, then run the demos.
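
For orientation before you run them, the core flow the demos implement with the CodeFlare-SDK looks roughly like the sketch below. The exact `ClusterConfiguration` fields vary between SDK releases, and the token, server, and queue values are placeholders, so defer to the notebooks shipped with your installed version:

```python
from codeflare_sdk import Cluster, ClusterConfiguration, TokenAuthentication

# Authenticate against the OpenShift API using the values from
# "Copy Login Command" -> "Display Token" (placeholders below).
auth = TokenAuthentication(
    token="sha256~<your-token>",
    server="https://api.<your-cluster>:6443",
    skip_tls=False,
)
auth.login()

# Describe the Ray cluster that the SDK should request for the job.
cluster = Cluster(ClusterConfiguration(
    name="demo-cluster",
    namespace="default",            # your data science project namespace
    num_workers=2,
    worker_cpu_requests=1,
    worker_cpu_limits=1,
    worker_memory_requests=4,       # GiB
    worker_memory_limits=4,         # GiB
    local_queue="local-queue",      # the Kueue LocalQueue created earlier
))

cluster.up()          # submit the cluster for scheduling via Kueue
cluster.wait_ready()  # block until the Ray cluster is running
cluster.details()     # print dashboard and endpoint information

# ...run or submit your training workload here...

cluster.down()        # tear the cluster down when finished
```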

## Cleaning up the AI Platform Install

To completely clean up all the components after an install, run the following from the `codeflare-operator` repository:

```bash
make delete-all-in-one
```

### Manual Cleanup Steps

If you prefer to clean up the installation manually, or need to remove individual components and operators, follow these steps:

1. Uninstall the operators:
   - Open the OpenShift dashboard.
   - Go to **Installed Operators**.
   - Look for any operators you have installed, such as the NVIDIA GPU Operator, the Node Feature Discovery Operator, and the Red Hat OpenShift AI Operator.
   - Click on the operator, then click **Uninstall Operator**. Follow the prompts to remove the operator and its associated resources.

## Next Steps

And with that, you have gotten started using the CodeFlare stack alongside your Red Hat OpenShift AI deployment to add distributed workloads and batch computing to your machine learning platform.

You are now ready to try out the stack with your own machine learning workloads. If you'd like some more examples, you can also run through the existing demo code provided by the CodeFlare-SDK community:

- [Submit a basic job](https://github.com/project-codeflare/codeflare-sdk/blob/main/demo-notebooks/guided-demos/1_cluster_job_client.ipynb)
- [Run an interactive session](https://github.com/project-codeflare/codeflare-sdk/blob/main/demo-notebooks/guided-demos/2_basic_interactive.ipynb)