Conversation

@sudiptob2 (Collaborator) commented Dec 5, 2025

closes G-Research/spark#148
I attempted to implement client mode on top of the latest changes. Thanks to @GeorgeJahad for the initial POC.

How to run

config.sh:

```shell
export IMAGE_NAME="spark:armada"
export ARMADA_MASTER="local://armada://localhost:30002"
export ARMADA_QUEUE="test"
export ARMADA_LOOKOUT_URL="http://localhost:30000"
export INCLUDE_PYTHON=true
export USE_KIND=true
export SPARK_DRIVER_HOST="172.18.0.1"
export SPARK_DRIVER_PORT="7078"
```
1. Cluster mode + dynamic allocation: `./scripts/submitArmadaSpark.sh -M cluster -A dynamic 100`
2. Cluster mode + static allocation: `./scripts/submitArmadaSpark.sh -M cluster -A static 100`
3. Client mode + dynamic allocation: `./scripts/submitArmadaSpark.sh -M client -A dynamic 100`
4. Client mode + static allocation: `./scripts/submitArmadaSpark.sh -M client -A static 100`

I added an extra condition to the check based on ARMADA_JOB_SET_ID to determine cluster mode, since this variable is only set in the driver's environment in cluster mode. This is somewhat hacky, but it prevents extra executor requests, and the E2E tests are now passing. The root cause of this problem still needs to be identified and should be fixed when we implement an E2E test for client mode.

@GeorgeJahad (Collaborator)

When I refactored, I created this trait to manage all the combinations of deployment and allocation modes: https://github.com/armadaproject/armada-spark/blob/master/src/main/scala/org/apache/spark/deploy/armada/DeploymentModeHelper.scala#L22

That way we don't have to litter the code with if statements testing client/dynamic etc. Can you take a look and see if you can take more advantage of that approach?
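To make the idea concrete, here is a minimal, hypothetical sketch of such a trait. The names and the exact rule for proactive executor requests are illustrative assumptions, not the actual `DeploymentModeHelper` API linked above:

```scala
// Hypothetical sketch: one trait answers every mode question, so call sites
// never branch on client/cluster or static/dynamic themselves.
trait ModeHelper {
  def isClusterMode: Boolean
  def isDynamicAllocation: Boolean
  // Assumed rule (illustrative): only a client-mode driver with static
  // allocation needs to ask for its executors up front.
  def shouldProactivelyRequestExecutors: Boolean =
    !isClusterMode && !isDynamicAllocation
}

// One object per mode combination; call sites depend only on the trait.
object ClientStatic extends ModeHelper {
  val isClusterMode = false
  val isDynamicAllocation = false
}

object ClusterDynamic extends ModeHelper {
  val isClusterMode = true
  val isDynamicAllocation = true
}
```

A caller then writes `if (modeHelper.shouldProactivelyRequestExecutors) ...` instead of re-testing the mode combination inline.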

@sudiptob2 sudiptob2 marked this pull request as ready for review December 8, 2025 17:57
```scala
 *
 * In cluster mode, jobSetId comes from environment variable ARMADA_JOB_SET_ID. In client mode,
 * jobSetId comes from config or falls back to application ID.
 *
```
Collaborator:

I can't remember the reason why these are different between client and cluster mode. Is there a reason why we shouldn't just make them both come from the config/appId?

Collaborator (Author):

Not sure about the specific reason, but in general I don't see any reason for them to be handled differently based on mode. It should come from config regardless of mode.

Collaborator:

I agree. Please make it so.
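A unified resolution could look like this sketch, where the config key name is an illustrative assumption and a plain `Map` stands in for `SparkConf`:

```scala
// Sketch: resolve jobSetId the same way in both modes - from config if set,
// otherwise fall back to the application ID. No environment-variable path.
// "spark.armada.jobSetId" is an illustrative key name, not a confirmed one.
def resolveJobSetId(conf: Map[String, String], appId: String): String =
  conf.getOrElse("spark.armada.jobSetId", appId)
```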

```diff
 driverJobId: String,
-executorCount: Int
+executorCount: Int,
+driverHostname: Option[String] = None
```
Collaborator:

Instead of adding a new parameter here, can't we just have the mode helper return the driver hostname?

```shell
DEPLOY_MODE_ARGS=(
  --conf spark.driver.host=$SPARK_DRIVER_HOST
  --conf spark.driver.port=$SPARK_DRIVER_PORT
  --conf spark.app.id=armada-spark-$(openssl rand -hex 3)
```
Collaborator:

I'd really prefer the rand to happen inside our Scala code. The default should be random; if the user wants to override it for some reason, they can, but I don't want us to have to do this here.
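Moving that default into Scala could look like the sketch below, mirroring the script's `openssl rand -hex 3` (3 random bytes rendered as 6 hex characters). The helper name is made up, and a user-supplied `spark.app.id` would still take precedence:

```scala
import java.security.SecureRandom

// Sketch: generate the default app id in Scala instead of the shell script.
// Produces e.g. "armada-spark-" followed by 6 lowercase hex characters.
def defaultAppId(): String = {
  val bytes = new Array[Byte](3)
  new SecureRandom().nextBytes(bytes)
  // Mask to 0-255 so negative bytes format as two hex digits.
  "armada-spark-" + bytes.map(b => f"${b & 0xff}%02x").mkString
}
```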

Collaborator (Author):

Made some changes to fix this.

@sudiptob2 sudiptob2 force-pushed the feat/client-mode-poc branch 3 times, most recently from 61df38c to 6d588c6 Compare December 10, 2025 15:53
@sudiptob2 sudiptob2 force-pushed the feat/client-mode-poc branch 2 times, most recently from d6fe5a7 to b070665 Compare December 16, 2025 00:06
@sudiptob2 sudiptob2 force-pushed the feat/client-mode-poc branch from b070665 to d954211 Compare December 16, 2025 01:05
Comment on lines +141 to +149
```scala
val isClusterModeEnvCheck = sys.env.contains("ARMADA_JOB_SET_ID")
val shouldProactivelyRequest =
  !isClusterModeEnvCheck && modeHelper.shouldProactivelyRequestExecutors && initialExecutors > 0

if (shouldProactivelyRequest) {
  val executorCount = modeHelper.getExecutorCount
  val defaultProfile = ResourceProfile.getOrCreateDefaultProfile(conf)
  doRequestTotalExecutors(Map(defaultProfile -> executorCount))
}
```
Collaborator (Author):

The previous if check, intended to ensure client mode, was not sufficient. As a result, additional executors were being requested even in cluster mode, which caused the E2E failures.

I added an extra condition to the check based on ARMADA_JOB_SET_ID, since this variable is only set in the driver’s environment in cluster mode. This is somewhat hacky, but it prevents extra executor requests and the E2E tests are now passing.

Please let me know if there’s a better way to reliably determine the mode without relying on ARMADA_JOB_SET_ID.
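One possible alternative, offered only as a hedged suggestion rather than what the PR does: Spark records the deploy mode in its own `spark.submit.deployMode` setting, which the driver could consult instead of an Armada-specific environment variable. A minimal sketch, with a plain `Map` standing in for `SparkConf`:

```scala
// Sketch: derive cluster mode from Spark's spark.submit.deployMode setting
// rather than from the ARMADA_JOB_SET_ID environment variable.
// Spark defaults deployMode to "client" when it is unset.
def isClusterMode(conf: Map[String, String]): Boolean =
  conf.getOrElse("spark.submit.deployMode", "client") == "cluster"
```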

Signed-off-by: Sudipto Baral <[email protected]>
@GeorgeJahad (Collaborator) left a comment:

lgtm, thanks @sudiptob2!

@GeorgeJahad GeorgeJahad merged commit adc51d7 into armadaproject:master Dec 17, 2025
12 checks passed


Development

Successfully merging this pull request may close these issues.

Support client mode while keeping clustermode intact
