Conversation

@sudiptob2 (Collaborator) commented Dec 5, 2025

closes G-Research/spark#148
I attempted to implement client mode on top of the latest changes. Thanks to @GeorgeJahad for the initial POC.

How to run

config.sh:

```shell
export IMAGE_NAME="spark:armada"
export ARMADA_MASTER="local://armada://localhost:30002"
export ARMADA_QUEUE="test"
export ARMADA_LOOKOUT_URL="http://localhost:30000"
export INCLUDE_PYTHON=true
export USE_KIND=true
export SPARK_DRIVER_HOST="172.18.0.1"
export SPARK_DRIVER_PORT="7078"
```
1. Cluster mode + dynamic allocation: `./scripts/submitArmadaSpark.sh -M cluster -A dynamic 100`
2. Cluster mode + static allocation: `./scripts/submitArmadaSpark.sh -M cluster -A static 100`
3. Client mode + dynamic allocation: `./scripts/submitArmadaSpark.sh -M client -A dynamic 100`
4. Client mode + static allocation: `./scripts/submitArmadaSpark.sh -M client -A static 100`

I added an extra condition to the check based on ARMADA_JOB_SET_ID to determine cluster mode, since this variable is only set in the driver's environment in cluster mode. This is somewhat hacky, but it prevents extra executor requests, and the E2E tests are now passing. The root cause of this problem still needs to be identified and should be fixed when we implement an E2E test for client mode.

@GeorgeJahad (Collaborator)

When I refactored, I created this trait to manage all the combinations of deployment and allocation modes: https://github.com/armadaproject/armada-spark/blob/master/src/main/scala/org/apache/spark/deploy/armada/DeploymentModeHelper.scala#L22

That way we don't have to litter the code with if statements testing client/dynamic etc. Can you take a look and see if you can take more advantage of that approach?
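To make the idea concrete, here is a minimal, hypothetical sketch of such a trait. The names and the exact rule for proactive executor requests are illustrative assumptions, not the actual `DeploymentModeHelper` API linked above:

```scala
// Hypothetical sketch: one trait answers every mode question, so call sites
// never branch on client/cluster or static/dynamic themselves.
trait ModeHelper {
  def isClusterMode: Boolean
  def isDynamicAllocation: Boolean
  // Assumed rule (illustrative): only a client-mode driver with static
  // allocation needs to ask for its executors up front.
  def shouldProactivelyRequestExecutors: Boolean =
    !isClusterMode && !isDynamicAllocation
}

// One object per mode combination; call sites depend only on the trait.
object ClientStatic extends ModeHelper {
  val isClusterMode = false
  val isDynamicAllocation = false
}

object ClusterDynamic extends ModeHelper {
  val isClusterMode = true
  val isDynamicAllocation = true
}
```

A caller then writes `if (modeHelper.shouldProactivelyRequestExecutors) ...` instead of re-testing the mode combination inline.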

@sudiptob2 sudiptob2 marked this pull request as ready for review December 8, 2025 17:57
```scala
 *
 * In cluster mode, jobSetId comes from environment variable ARMADA_JOB_SET_ID. In client mode,
 * jobSetId comes from config or falls back to application ID.
 *
```
Collaborator:

I can't remember the reason why these are different between client and cluster mode. Is there a reason why we shouldn't just make them both come from the config/appId?

Collaborator (Author):

Not sure about the specific reason, but in general I don't see any reason for them to be handled differently based on mode. It should come from config regardless of mode.

Collaborator:

I agree. Please make it so.
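A unified resolution could look like this sketch, where the config key name is an illustrative assumption and a plain `Map` stands in for `SparkConf`:

```scala
// Sketch: resolve jobSetId the same way in both modes - from config if set,
// otherwise fall back to the application ID. No environment-variable path.
// "spark.armada.jobSetId" is an illustrative key name, not a confirmed one.
def resolveJobSetId(conf: Map[String, String], appId: String): String =
  conf.getOrElse("spark.armada.jobSetId", appId)
```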

```diff
 driverJobId: String,
-executorCount: Int
+executorCount: Int,
+driverHostname: Option[String] = None
```
Collaborator:

Instead of adding a new parameter here, can't we just have the mode helper return the driver hostname?

```shell
DEPLOY_MODE_ARGS=(
  --conf spark.driver.host=$SPARK_DRIVER_HOST
  --conf spark.driver.port=$SPARK_DRIVER_PORT
  --conf spark.app.id=armada-spark-$(openssl rand -hex 3)
```
Collaborator:

I'd really prefer the rand to happen inside our Scala code. The default should be random; if the user wants to override it for some reason, they can, but I don't want us to have to do this here.
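Moving that default into Scala could look like the sketch below, mirroring the script's `openssl rand -hex 3` (3 random bytes rendered as 6 hex characters). The helper name is made up, and a user-supplied `spark.app.id` would still take precedence:

```scala
import java.security.SecureRandom

// Sketch: generate the default app id in Scala instead of the shell script.
// Produces e.g. "armada-spark-" followed by 6 lowercase hex characters.
def defaultAppId(): String = {
  val bytes = new Array[Byte](3)
  new SecureRandom().nextBytes(bytes)
  // Mask to 0-255 so negative bytes format as two hex digits.
  "armada-spark-" + bytes.map(b => f"${b & 0xff}%02x").mkString
}
```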

Collaborator (Author):

Made some changes to fix this.

@sudiptob2 sudiptob2 force-pushed the feat/client-mode-poc branch 3 times, most recently from 61df38c to 6d588c6 Compare December 10, 2025 15:53
@sudiptob2 sudiptob2 force-pushed the feat/client-mode-poc branch 2 times, most recently from d6fe5a7 to b070665 Compare December 16, 2025 00:06
@sudiptob2 sudiptob2 force-pushed the feat/client-mode-poc branch from b070665 to d954211 Compare December 16, 2025 01:05
Comment on lines +141 to +149
```scala
val isClusterModeEnvCheck = sys.env.contains("ARMADA_JOB_SET_ID")
val shouldProactivelyRequest =
  !isClusterModeEnvCheck && modeHelper.shouldProactivelyRequestExecutors && initialExecutors > 0

if (shouldProactivelyRequest) {
  val executorCount = modeHelper.getExecutorCount
  val defaultProfile = ResourceProfile.getOrCreateDefaultProfile(conf)
  doRequestTotalExecutors(Map(defaultProfile -> executorCount))
}
```
Collaborator (Author):

The previous if check, intended to ensure client mode, was not sufficient. As a result, additional executors were being requested even in cluster mode, which caused the E2E failures.

I added an extra condition to the check based on ARMADA_JOB_SET_ID, since this variable is only set in the driver’s environment in cluster mode. This is somewhat hacky, but it prevents extra executor requests and the E2E tests are now passing.

Please let me know if there’s a better way to reliably determine the mode without relying on ARMADA_JOB_SET_ID.
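One possible alternative, offered only as a hedged suggestion rather than what the PR does: Spark records the deploy mode in its own `spark.submit.deployMode` setting, which the driver could consult instead of an Armada-specific environment variable. A minimal sketch, with a plain `Map` standing in for `SparkConf`:

```scala
// Sketch: derive cluster mode from Spark's spark.submit.deployMode setting
// rather than from the ARMADA_JOB_SET_ID environment variable.
// Spark defaults deployMode to "client" when it is unset.
def isClusterMode(conf: Map[String, String]): Boolean =
  conf.getOrElse("spark.submit.deployMode", "client") == "cluster"
```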

Signed-off-by: Sudipto Baral <[email protected]>
@GeorgeJahad (Collaborator) left a comment:

lgtm, thanks @sudiptob2!

@GeorgeJahad GeorgeJahad merged commit adc51d7 into armadaproject:master Dec 17, 2025
12 checks passed


Development

Successfully merging this pull request may close these issues.

Support client mode while keeping clustermode intact
