(deploying-on-argocd-example)=

# Deploying via ArgoCD

Below is an example template for deploying Ray on Kubernetes with ArgoCD. It defines three ArgoCD Applications: one for the KubeRay CRDs, one for the KubeRay operator, and one for a RayCluster with three worker groups:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ray-operator-crds
  namespace: argocd
spec:
  project: default
  destination:
    server: https://kubernetes.default.svc
    namespace: ray-cluster
  source:
    repoURL: https://github.com/ray-project/kuberay
    targetRevision: v1.4.0 # update this as necessary
    path: helm-chart/kuberay-operator/crds
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - Replace=true # the KubeRay CRDs are too large for client-side apply, so replace them instead

---

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ray-operator
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/ray-project/kuberay
    targetRevision: v1.4.0 # update this as necessary
    path: helm-chart/kuberay-operator
    helm:
      skipCrds: true # required: the CRDs are managed by the ray-operator-crds Application above
  destination:
    server: https://kubernetes.default.svc
    namespace: ray-cluster
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true

---

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: raycluster
  namespace: argocd
spec:
  project: default
  destination:
    server: https://kubernetes.default.svc
    namespace: ray-cluster
  ignoreDifferences:
    - group: ray.io
      kind: RayCluster
      name: raycluster-kuberay
      namespace: ray-cluster
      jsonPointers: # adjust this list to match the number of worker groups in the RayCluster
        - /spec/workerGroupSpecs/0/replicas
        - /spec/workerGroupSpecs/1/replicas
        - /spec/workerGroupSpecs/2/replicas
  source:
    repoURL: https://ray-project.github.io/kuberay-helm/
    chart: ray-cluster
    targetRevision: "1.4.1"
    helm:
      releaseName: raycluster
      valuesObject:
        image:
          repository: docker.io/rayproject/ray
          tag: "2.10.0" # pin a specific Ray version rather than latest; Autoscaler v2 requires 2.10.0 or newer
          pullPolicy: IfNotPresent
        head:
          rayStartParams:
            num-cpus: "0"
          enableInTreeAutoscaling: true
          autoscalerOptions:
            version: v2
            upscalingMode: Default
            idleTimeoutSeconds: 600 # 10 minutes
            env:
              - name: AUTOSCALER_MAX_CONCURRENT_LAUNCHES
                value: "100"
        worker:
          groupName: standard-worker
          replicas: 1
          minReplicas: 1
          maxReplicas: 200
          rayStartParams:
            resources: '"{\"standard-worker\": 1}"'
        additionalWorkerGroups:
          additional-worker-group1:
            image:
              repository: docker.io/rayproject/ray
              tag: "2.10.0"
              pullPolicy: IfNotPresent
            disabled: false
            replicas: 0
            minReplicas: 0
            maxReplicas: 30
            rayStartParams:
              resources: '"{\"additional-worker-group1\": 1}"'
          additional-worker-group2:
            image:
              repository: docker.io/rayproject/ray
              tag: "2.10.0"
              pullPolicy: IfNotPresent
            disabled: false
            replicas: 1
            minReplicas: 1
            maxReplicas: 200
            rayStartParams:
              resources: '"{\"additional-worker-group2\": 1}"'
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true

```
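
Each worker group advertises a custom resource through its `rayStartParams` (for example, `standard-worker`), which lets you pin tasks or actors to a specific group. The following is a minimal sketch of that pattern, assuming the script runs inside the cluster (for example, submitted as a Ray job); the function name is illustrative:

```python
import ray

# "auto" connects to the existing cluster rather than starting a local one.
ray.init(address="auto")

# Requesting one unit of the custom "standard-worker" resource (declared in
# rayStartParams above) schedules the task onto that worker group.
@ray.remote(resources={"standard-worker": 1})
def on_standard_worker() -> str:
    return "running on the standard-worker group"

print(ray.get(on_standard_worker.remote()))
```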

## Autoscaling

With regard to the Ray Autoscaler, note the `ignoreDifferences` section in the RayCluster Application:

```yaml
ignoreDifferences:
  - group: ray.io
    kind: RayCluster
    name: raycluster-kuberay
    namespace: ray-cluster
    jsonPointers: # adjust this list to match the number of worker groups
      - /spec/workerGroupSpecs/0/replicas
      - /spec/workerGroupSpecs/1/replicas
      - /spec/workerGroupSpecs/2/replicas
```

It has been observed that without this `ignoreDifferences` section, ArgoCD and
the Ray Autoscaler conflict: the Autoscaler adjusts the worker group replica
counts on the live RayCluster, and ArgoCD's automated self-heal reverts them to
the values defined in Git. As a result, requesting workers dynamically
(e.g. with `ray.autoscaler.sdk.request_resources`) behaves unexpectedly:
when requesting N workers, the Autoscaler would not spin up N workers.
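
For reference, a dynamic request of this kind looks like the sketch below. It asks the Autoscaler to provision capacity for ten bundles of the custom `additional-worker-group1` resource; since each worker in that group advertises exactly one unit (see the `rayStartParams` above), this scales the group toward ten workers. As before, this assumes the script runs inside the cluster:

```python
import ray
from ray.autoscaler.sdk import request_resources

ray.init(address="auto")

# Ask the Autoscaler for enough nodes to hold 10 bundles of the custom
# "additional-worker-group1" resource, i.e. roughly 10 workers of that group.
request_resources(bundles=[{"additional-worker-group1": 1}] * 10)
```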