The purpose of this project is to create a scalable and power-efficient Large Language Model (LLM) inference service using Kubernetes. The service utilizes a custom power capping operator that accepts a Custom Resource Definition (CRD) to specify the power capping limit. The operator uses KEDA (Kubernetes Event-Driven Autoscaling) to scale the LLM inference service deployment based on the specified power cap. Kepler, a power monitoring tool, is used to monitor the power consumption of CPU and GPU resources on the server.
In addition to server-level power capping, the operator also considers rack-level heating issues and incorporates techniques for monitoring, capping, and scheduling workloads to reduce cooling requirements at the rack level. By leveraging rack-aware scheduling algorithms, the operator aims to minimize heat recirculation and optimize the placement of workloads across servers and racks.
Data centers face the challenge of efficiently utilizing their compute resources while ensuring that power and cooling constraints are not exceeded. Overpower and overheat incidents can lead to hardware damage, service disruptions, and increased operational costs. This project aims to provide a solution that enables data centers to evenly distribute workloads in time and space, reducing the risk of overpower or overheat incidents.
By implementing a power capping operator in Kubernetes, data centers can dynamically manage the power consumption of LLM inference workloads at both the server and rack levels. The operator optimizes workload placement and resource allocation to minimize power consumption, reduce cooling requirements, and ensure compliance with power cap limits and rack-level constraints.
The power capping operator follows an architecture similar to the Kubernetes Vertical Pod Autoscaler (VPA) controller. It consists of three main components:
- Recommender: Monitors the current and past resource and power consumption, and provides recommended actions for the actuator based on the defined policies.
- Actuator: Checks whether the managed pods have the correct power consumption set and, if not, kills or migrates them to conform to the power capping and performance-power ratio policies.
- Admission Plugin: Sets the correct resource requests on new pods and issues alerts for passive actuators.
```mermaid
graph LR
    A[Recommender] --> B[Actuator]
    A --> C[Admission Plugin]
    B --> D{Managed Pods}
    C --> D
```
The power capping operator integrates with existing Kubernetes tools and frameworks, such as KEDA for event-driven autoscaling, KServe for serving LLM inference workloads, and Kepler for power monitoring. It leverages these tools to optimize power consumption and workload placement based on the defined policies and constraints.
```mermaid
graph TD
    A[Power Capping Operator] --> B[KEDA]
    A --> C[KServe]
    A --> D[Kepler]
    B --> E{LLM Inference Service}
    C --> E
    D --> A
```
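
To make this integration concrete, the sketch below shows what a KEDA ScaledObject driven by a Kepler power metric scraped by Prometheus could look like. The resource names, the Prometheus query, and the threshold are assumptions for illustration only (Kepler metric names vary across versions), not the operator's actual output.

```yaml
# Illustrative KEDA ScaledObject sketch: scales an LLM inference Deployment
# based on a Kepler power metric exposed through Prometheus.
# Names, query, and threshold are assumptions, not the operator's real configuration.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llama2-7b-scaleobject
  namespace: operator-powercapping-system
spec:
  scaleTargetRef:
    name: llama2-7b                # the LLM inference Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 8               # a power capping operator can lower this to honor the cap
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://localhost:9090   # PROMETHEUS_HOST from the .env file
        # Approximate GPU power draw from Kepler's per-container energy counter;
        # the exact metric name depends on the installed Kepler version.
        query: sum(rate(kepler_container_gpu_joules_total[1m]))
        threshold: "200"           # assumed per-replica power budget in watts
```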
Out of the box, the power capping operator ships with built-in policies ("batteries included") for the Power Oversubscription and Performance-Power Ratio Optimization scenarios. These built-in policies serve as examples of how the system functions in simple scenarios. Data centers can develop, or purchase from the marketplace, more advanced algorithms to cover specific needs and use cases.
To install the power capping operator, follow these steps:
- Clone the repository:

  ```bash
  git clone https://github.com/Climatik-Project/Climatik-Project
  ```
- Create a `.env` file in the root folder with the secrets:

  ```
  SLACK_WEBHOOK_URL=<your-slack-webhook-url>
  GITHUB_USERNAME=<your-username>
  GITHUB_REPO=<your-repo-name>
  GITHUB_PAT=<your-github-pat>
  PROMETHEUS_HOST=http://localhost:9090
  SLACK_SIGNING_SECRET=<secret>  # see README-slack-webhook-server.md
  SLACK_BOT_TOKEN=<secret>       # see README-slack-webhook-server.md
  ```
- Install the Python libraries in a virtual environment:

  ```bash
  deactivate                     # leave any currently active virtualenv (ignore errors if none)
  python -m venv venv
  source venv/bin/activate
  pip install -r python/climatik_operator/requirements.txt
  ```
- Install the necessary CRDs and operators:

  ```bash
  make cluster-up
  make
  ```
- Verify that the resources (Pod, Deployment, ScaledObject) exist:

  ```bash
  kubectl get pods --all-namespaces
  kubectl get pods -n operator-powercapping-system
  kubectl get deployments -n operator-powercapping-system
  kubectl get scaledobjects -n operator-powercapping-system
  kubectl describe scaledobject mistral-7b-scaleobject -n operator-powercapping-system
  kubectl describe scaledobject llama2-7b-scaleobject -n operator-powercapping-system
  kubectl describe pod -n operator-powercapping-system operator-powercapping-controller-manager
  kubectl describe pod -n operator-powercapping-system operator-powercapping-webhook-manager
  kubectl describe pod -n operator-powercapping-system llama2-7b
  kubectl describe pod -n operator-powercapping-system mistral-7b
  ```
- Package visibility issue: when running

  ```bash
  kubectl describe pod -n operator-powercapping-system operator-powercapping-controller-manager
  kubectl describe pod -n operator-powercapping-system operator-powercapping-webhook-manager
  ```

  if you see

  ```
  failed to authorize: failed to fetch anonymous token: unexpected status from GET request to URL, 401 Unauthorized
  ```

  go to your GitHub account and change the visibility of your package to public.
- Check the container logs.

  For the manager:

  ```bash
  kubectl logs -n operator-powercapping-system operator-powercapping-controller-manager-${pod unique id} -c manager
  kubectl exec -it -n operator-powercapping-system deployment/llama2-7b -- /bin/sh
  # then, inside the container shell:
  ps aux
  ```

  For all containers:

  ```bash
  kubectl logs -n operator-powercapping-system operator-powercapping-controller-manager-${pod unique id} --all-containers=true
  ```

  For ScaledObjects:

  ```bash
  kubectl get scaledobject --all-namespaces
  kubectl logs -n keda -l app=keda-operator
  ```
- Test the operator locally:

  ```bash
  cd python/climatik_operator && kopf run operator.py
  ```
- Check the CRD:

  ```bash
  kubectl get crd
  ```
- Configure the power capping CRD with the desired power cap limit, rack-level constraints, and other parameters. Refer to the CRD documentation for more details; a hypothetical example is sketched after this list.
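
To give a rough sense of what such a configuration could look like, here is a minimal sketch of a power capping custom resource. The API group, kind, and field names below are illustrative assumptions, not the operator's actual schema; consult the CRD documentation in the repository for the authoritative definition.

```yaml
# Hypothetical power capping custom resource; the group, kind, and field names
# are illustrative only. See the project's CRD documentation for the real schema.
apiVersion: climatik-project.io/v1alpha1
kind: PowerCappingPolicy
metadata:
  name: llm-inference-power-cap
  namespace: operator-powercapping-system
spec:
  powerCapLimit: 1000              # assumed server-level cap, in watts
  rackConstraints:
    maxRackPowerWatts: 8000        # assumed rack-level power budget
    maxInletTempCelsius: 27        # assumed cooling constraint for rack-aware scheduling
  workloadSelector:
    matchLabels:
      app: llm-inference           # which LLM inference deployments the policy governs
```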
To reduce the risk of interrupting production workloads, data centers can initially use the power capping operator as a pure observability and recommendation tool after installation. The operator will provide alerts and recommendations based on the defined policies and constraints. Data center operators can manually review these recommendations and decide whether to take the suggested actions.
The power capping operator will log the system behaviors and provide a summary and comparison of the scenarios where the recommended actions were taken or not taken. If the recommendations are accepted, the system will simulate the behavior of not taking the actions, and vice versa. This allows data centers to make informed decisions based on real data and gradually adopt the power capping operator to automatically manage more workloads and use cases.
It is important to note that installing the power capping operator only sets up the necessary CRDs and operators and allows their parameters to be configured. The LLM inference services themselves are deployed and managed by other systems, such as KServe and vLLM; the power capping operator only affects the scaling behavior of these services to reach the optimization goals, such as energy capping or efficiency.
Contributions to the project are welcome! If you find any issues or have suggestions for improvement, please open an issue or submit a pull request on the GitHub repository.
This project is licensed under the Apache License 2.0.
For any questions or inquiries, please contact the project MAINTAINERS.