Fix Helm charts health check, ingress, and values #257

Merged: 2 commits, Mar 26, 2025

@richardr1126 (Contributor) commented Mar 22, 2025

Update Helm Chart Deployment and Configuration with /examples

This PR updates the Helm chart and deployment configuration; the chart still works on all clouds. It also adds an example YAML file specifically for Azure Kubernetes Service (AKS) with the NVIDIA GPU Operator. The chart changes themselves improve ingress and load balancing, correct the default values, and improve the deployment health checks.
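
For readers unfamiliar with what a deployment health check looks like, here is a minimal sketch of container probes. The container name, the `/health` path, port `8880`, and all timing values are assumptions for illustration; they are not copied from this PR's chart.

```yaml
# Fragment of a Deployment container spec (illustrative only).
# Assumed: the API answers HTTP GET /health on port 8880.
containers:
  - name: kokoro-fastapi            # hypothetical container name
    ports:
      - name: http
        containerPort: 8880         # assumed port
    readinessProbe:                 # gates traffic until the pod responds
      httpGet:
        path: /health               # assumed endpoint
        port: http
      initialDelaySeconds: 30
      periodSeconds: 10
    livenessProbe:                  # restarts the container if it stops responding
      httpGet:
        path: /health
        port: http
      initialDelaySeconds: 60
      periodSeconds: 20
      failureThreshold: 3
```

A readiness probe keeps a replica out of the Service's endpoints until it responds, which matters for load balancing across several replicas.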

Key Updates:

  • Examples folder: Adds an example for deploying to Azure AKS with the NVIDIA GPU Operator, using GPU time-slicing to oversubscribe GPUs and TLS encryption for HTTPS traffic
  • Ingress and Load Balancing: Fixed previous issues with ingress and load balancing; the chart now supports multiple users at once
  • Ingress and TLS Encryption: Ingress can optionally be configured with TLS for HTTPS traffic
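
As a rough sketch of what an Ingress with TLS can look like when paired with the NGINX ingress controller and cert-manager, consider the manifest below. The hostname, resource names, `ClusterIssuer` name, and service port are all hypothetical placeholders, and the chart in this PR may render something different.

```yaml
# Illustrative Ingress with TLS; all names and the host are placeholders.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: kokoro-fastapi                                  # hypothetical
  annotations:
    # Asks cert-manager to issue a certificate for the hosts under spec.tls.
    cert-manager.io/cluster-issuer: letsencrypt-prod    # hypothetical issuer name
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - tts.example.com                               # placeholder host
      secretName: kokoro-fastapi-tls                    # cert-manager stores the cert here
  rules:
    - host: tts.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: kokoro-fastapi                    # hypothetical Service name
                port:
                  number: 8880                          # assumed service port
```

Dropping the `tls` block and the annotation leaves a plain HTTP Ingress, which is one way to read "if needed" above.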

Example folder YAML:

  • Uses 2x NVIDIA T4 16 GB GPUs on 2 separate spot nodes
    • Cheapest GPU instances on Azure: 2x for ~$6/day or ~$160/month
  • Uses the NVIDIA GPU Operator device plugin with time-slicing to advertise the 2 GPUs to the cluster as 8 smaller GPUs (about 4 GB each)
  • Provides efficient load balancing between 8 replicas of Kokoro-FastAPI
  • Uses the NGINX ingress controller, cert-manager, and external-dns (with Cloudflare), all installed with Helm
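
The time-slicing setup described above can be sketched as a device plugin ConfigMap in the format the NVIDIA GPU Operator documents. The ConfigMap name, namespace, and the `any` key are assumptions; `replicas: 4` is chosen so that 2 physical GPUs are advertised as 2 x 4 = 8 schedulable `nvidia.com/gpu` resources.

```yaml
# Illustrative time-slicing config for the NVIDIA device plugin.
# Name, namespace, and the "any" key are assumptions, not taken from this PR.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4        # each physical GPU is advertised 4 times
```

In the operator's documented workflow, the ClusterPolicy's `devicePlugin.config` setting is then pointed at this ConfigMap, and each application replica requests `nvidia.com/gpu: 1`. Note that time-slicing shares a GPU in time and does not enforce a per-replica memory limit, so the 4 GB figure is a budget rather than a hard partition.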

Version Details:

  • Set deployment version to v0.2.0 due to issues with v0.2.1, which requires CUDA 12.8
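
Pinning the version might look like the values fragment below. The key names (`image.repository`, `image.tag`) and the image path are assumptions for illustration; check the chart's own `values.yaml` for the actual keys.

```yaml
# Hypothetical values override pinning the image version.
image:
  repository: ghcr.io/remsky/kokoro-fastapi-gpu   # assumed image path
  tag: v0.2.0                                     # avoids v0.2.1, which requires CUDA 12.8
```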

Next Steps:

  • Proposal to update the Wiki instructions once changes are approved
  • Update example values.yaml files to reflect the new configuration

[Screenshot: running pods, captured 2025-03-22 at 6:17:21 AM]

Please review the updates, and let me know if further changes are needed.

@fireblade2534 (Collaborator) commented

@richardr1126 Looks good to me. Thanks for the changes :)

@fireblade2534 fireblade2534 merged commit d0c13f6 into remsky:master Mar 26, 2025
1 check failed