Power-Capped LLM Inference Service using Kubernetes

1. Overview

The purpose of this project is to create a scalable and power-efficient Large Language Model (LLM) inference service using Kubernetes. The service utilizes a custom power capping operator that accepts a Custom Resource Definition (CRD) to specify the power capping limit. The operator uses KEDA (Kubernetes Event-Driven Autoscaling) to scale the LLM inference service deployment based on the specified power cap. Kepler, a power monitoring tool, is used to monitor the power consumption of CPU and GPU resources on the server.

In addition to server-level power capping, the operator also considers rack-level heating issues and incorporates techniques for monitoring, capping, and scheduling workloads to reduce cooling requirements at the rack level. By leveraging rack-aware scheduling algorithms, the operator aims to minimize heat recirculation and optimize the placement of workloads across servers and racks.

2. Motivation

Data centers face the challenge of efficiently utilizing their compute resources while ensuring that power and cooling constraints are not exceeded. Overpower and overheat incidents can lead to hardware damage, service disruptions, and increased operational costs. This project aims to provide a solution that enables data centers to evenly distribute workloads in time and space, reducing the risk of overpower or overheat incidents.

By implementing a power capping operator in Kubernetes, data centers can dynamically manage the power consumption of LLM inference workloads at both the server and rack levels. The operator optimizes workload placement and resource allocation to minimize power consumption, reduce cooling requirements, and ensure compliance with power cap limits and rack-level constraints.

3. Architecture

The power capping operator follows an architecture similar to the Kubernetes Vertical Pod Autoscaler (VPA) controller. It consists of three main components:

  1. Recommender: Monitors the current and past resource and power consumption, and provides recommended actions for the actuator based on the defined policies.
  2. Actuator: Checks whether the managed pods have the correct power consumption settings and, if not, kills or migrates them to conform to the power capping and performance-power ratio policies.
  3. Admission Plugin: Sets the correct resource requests on new pods and issues alerts for passive actuators.
    graph LR
    A[Recommender] --> B[Actuator]
    A --> C[Admission Plugin]
    B --> D{Managed Pods}
    C --> D
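
To make these components concrete, below is a minimal sketch, in Python to match the kopf-based operator in this repository, of how a Recommender might poll Kepler power metrics through Prometheus and derive a replica recommendation for the Actuator. The metric name, label filter, cap value, and recommendation format are illustrative assumptions, not the operator's actual implementation:

    # Illustrative Recommender sketch -- not the actual implementation.
    # Assumes a Prometheus server scraping Kepler; the query and the
    # recommendation structure below are hypothetical.
    import time

    import requests

    PROMETHEUS_HOST = "http://localhost:9090"  # matches the .env example in Installation
    POWER_CAP_WATTS = 500.0  # hypothetical power cap for one deployment

    def current_power_watts(deployment: str) -> float:
        """Approximate the deployment's power draw from Kepler's joules counter."""
        query = (f'sum(rate(kepler_container_joules_total'
                 f'{{pod_name=~"{deployment}-.*"}}[1m]))')
        resp = requests.get(f"{PROMETHEUS_HOST}/api/v1/query", params={"query": query})
        resp.raise_for_status()
        results = resp.json()["data"]["result"]
        return float(results[0]["value"][1]) if results else 0.0

    def recommend(deployment: str, replicas: int) -> dict:
        """Recommend a replica ceiling that keeps projected power under the cap."""
        watts = current_power_watts(deployment)
        per_replica = max(watts / max(replicas, 1), 1e-9)
        return {
            "deployment": deployment,
            "currentWatts": watts,
            "recommendedMaxReplicas": max(int(POWER_CAP_WATTS // per_replica), 1),
        }

    if __name__ == "__main__":
        while True:
            print(recommend("llama2-7b", replicas=2))  # the Actuator would act on this
            time.sleep(30)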

The power capping operator integrates with existing Kubernetes tools and frameworks, such as KEDA for event-driven autoscaling, KServe for serving LLM inference workloads, and Kepler for power monitoring. It leverages these tools to optimize power consumption and workload placement based on the defined policies and constraints.

    graph TD
    A[Power Capping Operator] --> B[KEDA]
    A --> C[KServe]
    A --> D[Kepler]
    B --> E{LLM Inference Service}
    C --> E
    D --> A
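
For a concrete view of the KEDA piece, a ScaledObject along these lines could scale an inference deployment on a Prometheus query over Kepler's power metrics. This is a hand-written sketch: the query, threshold, and replica bounds are assumptions, not the ScaledObjects the operator actually creates (see the verification commands in the Installation section):

    # Sketch of a KEDA ScaledObject driven by Kepler power metrics via Prometheus.
    # The query, threshold, and replica bounds are illustrative assumptions.
    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: llama2-7b-scaleobject
      namespace: operator-powercapping-system
    spec:
      scaleTargetRef:
        name: llama2-7b
      minReplicaCount: 1
      maxReplicaCount: 4   # the operator would lower this to stay under the power cap
      triggers:
        - type: prometheus
          metadata:
            serverAddress: http://localhost:9090
            query: sum(rate(kepler_container_joules_total{pod_name=~"llama2-7b-.*"}[1m]))
            threshold: "100"   # hypothetical watts per replica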

Out of the box, the power capping operator ships with built-in policies ("batteries included") for the Power Oversubscription and Performance-Power Ratio Optimization scenarios. These built-in policies serve as examples of how the system behaves in simple scenarios. Data centers can develop or purchase more advanced algorithms from the marketplace to cover specific needs and use cases.

4. Installation

To install the power capping operator, follow these steps:

  1. Clone the repository:

    git clone https://github.com/Climatik-Project/Climatik-Project
  2. Create a .env file in the root folder with the following secrets:

    SLACK_WEBHOOK_URL=<your-slack-webhook-url>
    GITHUB_USERNAME=<your-username>
    GITHUB_REPO=<your-repo-name>
    GITHUB_PAT=<your-github-pat>
    PROMETHEUS_HOST=http://localhost:9090
    SLACK_SIGNING_SECRET=<secret> # see README-slack-webhook-server.md
    SLACK_BOT_TOKEN=<secret> # see README-slack-webhook-server.md
  3. Set up a Python virtual environment and install the required libraries:

    deactivate  # exit any previously activated virtual environment first
    python -m venv venv
    source venv/bin/activate
    pip install -r python/climatik_operator/requirements.txt
  4. Install the necessary CRDs and operators:

    make cluster-up
    make
  5. Verify resources (Pod, Deployment, ScaledObject) exist:

    kubectl get pods --all-namespaces
    kubectl get pods -n operator-powercapping-system
    kubectl get deployments -n operator-powercapping-system
    kubectl get scaledobjects -n operator-powercapping-system
    kubectl describe scaledobject mistral-7b-scaleobject -n operator-powercapping-system
    kubectl describe scaledobject llama2-7b-scaleobject -n operator-powercapping-system
    kubectl describe pod -n operator-powercapping-system operator-powercapping-controller-manager
    kubectl describe pod -n operator-powercapping-system operator-powercapping-webhook-manager
    kubectl describe pod -n operator-powercapping-system llama2-7b
    kubectl describe pod -n operator-powercapping-system mistral-7b
  6. Package Visibility Issue: when running

    kubectl describe pod -n operator-powercapping-system operator-powercapping-controller-manager
    kubectl describe pod -n operator-powercapping-system operator-powercapping-webhook-manager

    if you see an error like

    failed to authorize: failed to fetch anonymous token: unexpected status from GET request to URL, 401 Unauthorized

    Please go to your GitHub account and change the visibility of your package to public.

  7. Check logs for containers:

    For the manager container (and to inspect processes inside an inference pod):

    kubectl logs -n operator-powercapping-system operator-powercapping-controller-manager-<pod-unique-id> -c manager
    kubectl exec -it -n operator-powercapping-system deployment/llama2-7b -- /bin/sh
    ps aux

    For all containers:

    kubectl logs -n operator-powercapping-system operator-powercapping-controller-manager-<pod-unique-id> --all-containers=true

    For ScaledObjects:

    kubectl get scaledobject --all-namespaces
    kubectl logs -n keda -l app=keda-operator
  8. Test Operator Locally:

    cd python/climatik_operator && kopf run operator.py
  9. Check CRD:

    kubectl get crd
  10. Configure the power capping CRD with the desired power cap limit, rack-level constraints, and other parameters; a hypothetical sketch follows this list. Refer to the CRD documentation for more details.
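
Since the authoritative schema lives in the CRD documentation, the manifest below is only a sketch of what a power capping policy might look like. All field names (powerCapLimit, rackConstraints, and so on) and the API group are hypothetical placeholders, not the actual schema:

    # Hypothetical power capping custom resource -- field names are placeholders;
    # consult the CRD documentation for the real schema.
    apiVersion: climatik-project.io/v1alpha1   # assumed API group
    kind: PowerCappingPolicy
    metadata:
      name: example-power-cap
      namespace: operator-powercapping-system
    spec:
      powerCapLimit: 1000          # watts across the managed deployments
      rackConstraints:
        maxRackPower: 8000         # watts per rack
        minimizeHeatRecirculation: true
      selector:
        matchLabels:
          app: llm-inference

Saved as example-power-cap.yaml, such a manifest would be applied with kubectl apply -f example-power-cap.yaml.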

5. Usage

To reduce the risk of interrupting production workloads, data centers can initially use the power capping operator as a pure observability and recommendation tool after installation. The operator will provide alerts and recommendations based on the defined policies and constraints. Data center operators can manually review these recommendations and decide whether to take the suggested actions.

The power capping operator will log the system behaviors and provide a summary and comparison of the scenarios where the recommended actions were taken or not taken. If the recommendations are accepted, the system will simulate the behavior of not taking the actions, and vice versa. This allows data centers to make informed decisions based on real data and gradually adopt the power capping operator to automatically manage more workloads and use cases.

It's important to note that the power capping operator only installs the necessary CRDs and operators, and allows for configuration of the parameters. The LLM inference services themselves are deployed and managed by other systems like KServe and vLLM. The power capping operator will only affect the scaling behavior of these services to reach the optimization goals, such as energy capping or efficiency.
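
One convenient way to observe that scaling behavior while the operator runs (a suggestion, not a documented workflow) is to watch the ScaledObjects and deployment replica counts:

    kubectl get scaledobjects -n operator-powercapping-system -w
    kubectl get deployments -n operator-powercapping-system -w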

6. Documentation

7. Contributing

Contributions to the project are welcome! If you find any issues or have suggestions for improvement, please open an issue or submit a pull request on the GitHub repository.

8. License

This project is licensed under the Apache License 2.0.

9. Contact

For any questions or inquiries, please contact the project MAINTAINERS.
