Implement clientmode #85
Conversation
src/main/scala/org/apache/spark/scheduler/cluster/armada/ArmadaClusterManagerBackend.scala
When I refactored, I created this trait to manage all the combinations of deployment and allocation modes: https://github.com/armadaproject/armada-spark/blob/master/src/main/scala/org/apache/spark/deploy/armada/DeploymentModeHelper.scala#L22 That way we don't have to litter the code with if statements testing client/dynamic etc. Can you take a look and see if you can take more advantage of that approach?
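A minimal sketch of the trait-based dispatch being suggested (the real trait lives at the link above; the two implementation classes here are hypothetical, while `shouldProactivelyRequestExecutors` and `getExecutorCount` are names that appear in the backend code later in this thread):

```scala
// Sketch only: one helper implementation per deployment/allocation
// combination, so call sites never branch on client/cluster or
// static/dynamic themselves.
trait DeploymentModeHelper {
  def shouldProactivelyRequestExecutors: Boolean
  def getExecutorCount: Int
}

// Hypothetical implementations for two of the four combinations.
class ClientStaticModeHelper(executorCount: Int) extends DeploymentModeHelper {
  // Client + static: the driver asks for all executors up front.
  override def shouldProactivelyRequestExecutors: Boolean = true
  override def getExecutorCount: Int = executorCount
}

class ClusterDynamicModeHelper extends DeploymentModeHelper {
  // Cluster + dynamic: Spark's dynamic allocation scales executors on demand.
  override def shouldProactivelyRequestExecutors: Boolean = false
  override def getExecutorCount: Int = 0
}
```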
… a single script Signed-off-by: Sudipto Baral <[email protected]>
src/main/scala/org/apache/spark/scheduler/cluster/armada/ArmadaExecutorAllocator.scala
```scala
 * In cluster mode, jobSetId comes from environment variable ARMADA_JOB_SET_ID. In client mode,
 * jobSetId comes from config or falls back to application ID.
```
I can't remember the reason why these are different between client and cluster mode. Is there a reason why we shouldn't just make them both come from the config/appId?
Not sure about the specific reason, but in a general sense I don't see any reason for them to be handled differently based on mode. It should come from the config regardless of mode.
I agree. Please make it so.
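A minimal sketch of what that unification might look like, assuming a config key such as `spark.armada.jobSetId` (the actual key name in armada-spark may differ):

```scala
import org.apache.spark.SparkConf

object JobSetIdResolver {
  // Same lookup in both client and cluster mode: take the configured
  // job set ID if present, otherwise fall back to the application ID.
  // "spark.armada.jobSetId" is an assumed key name for illustration.
  def resolveJobSetId(conf: SparkConf, appId: String): String =
    conf.getOption("spark.armada.jobSetId").getOrElse(appId)
}
```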
```diff
   driverJobId: String,
-  executorCount: Int
+  executorCount: Int,
+  driverHostname: Option[String] = None
```
Instead of adding a new parameter here, can't we just have the mode helper return the driver hostname?
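A sketch of that suggestion, with a hypothetical `getDriverHostname` method on the helper in place of the extra parameter:

```scala
import org.apache.spark.SparkConf

trait DeploymentModeHelper {
  // Hypothetical: Some(host) in client mode, where the driver runs outside
  // the cluster and executors must be told where to connect; None in
  // cluster mode, where the driver is itself an Armada job.
  def getDriverHostname(conf: SparkConf): Option[String]
}
```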
scripts/submitArmadaSpark.sh
```bash
DEPLOY_MODE_ARGS=(
  --conf spark.driver.host=$SPARK_DRIVER_HOST
  --conf spark.driver.port=$SPARK_DRIVER_PORT
  --conf spark.app.id=armada-spark-$(openssl rand -hex 3)
```
I'd really prefer the rand to happen inside our scala code. The default should be random. If the user wants to override for some reason, he can, but I don't want us to have to do this here.
Made some changes to fix this.
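A sketch of what moving the randomness into Scala could look like: random by default, overridable by an explicitly configured `spark.app.id` (the helper name and suffix scheme are illustrative, not the actual implementation):

```scala
import java.util.UUID

import org.apache.spark.SparkConf

object AppIdResolver {
  // Random by default; the user can still override via spark.app.id.
  // The "armada-spark-" prefix and 6-character suffix mirror the
  // shell snippet above; the helper name is hypothetical.
  def resolveAppId(conf: SparkConf): String =
    conf.getOption("spark.app.id")
      .getOrElse(s"armada-spark-${UUID.randomUUID().toString.take(6)}")
}
```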
Force-pushed 61df38c to 6d588c6
Force-pushed d6fe5a7 to b070665
Force-pushed b070665 to d954211
```scala
// ARMADA_JOB_SET_ID is only set in the driver's environment in cluster
// mode, so its presence distinguishes cluster mode from client mode.
val isClusterModeEnvCheck = sys.env.contains("ARMADA_JOB_SET_ID")
val shouldProactivelyRequest =
  !isClusterModeEnvCheck && modeHelper.shouldProactivelyRequestExecutors && initialExecutors > 0

if (shouldProactivelyRequest) {
  // In client mode, request the initial executors up front.
  val executorCount = modeHelper.getExecutorCount
  val defaultProfile = ResourceProfile.getOrCreateDefaultProfile(conf)
  doRequestTotalExecutors(Map(defaultProfile -> executorCount))
}
```
The previous if check, intended to ensure client mode, was not sufficient. As a result, additional executors were being requested even in cluster mode, which caused the E2E failures.
I added an extra condition based on ARMADA_JOB_SET_ID, since this variable is only set in the driver's environment in cluster mode. This is somewhat hacky, but it prevents the extra executor requests, and the E2E tests are now passing.
Please let me know if there's a better way to reliably determine the mode without relying on ARMADA_JOB_SET_ID.
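One possible alternative, sketched here untested against this backend: spark-submit records the requested mode in the `spark.submit.deployMode` property, which the backend could read instead of sniffing the environment:

```scala
import org.apache.spark.SparkConf

object ModeCheck {
  // spark.submit.deployMode is set by spark-submit to "client" or
  // "cluster"; reading it avoids relying on ARMADA_JOB_SET_ID.
  def isClusterMode(conf: SparkConf): Boolean =
    conf.getOption("spark.submit.deployMode").contains("cluster")
}
```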
GeorgeJahad left a comment:
lgtm, thanks @sudiptob2!
closes G-Research/spark#148
I attempted to implement client mode on top of the latest changes. Thanks to @GeorgeJahad for the initial POC.
How to run

Set up `config.sh`, then:

Cluster mode + Dynamic allocation

```bash
./scripts/submitArmadaSpark.sh -M cluster -A dynamic 100
```

Cluster mode + Static allocation

```bash
./scripts/submitArmadaSpark.sh -M cluster -A static 100
```

Client mode + Dynamic allocation

```bash
./scripts/submitArmadaSpark.sh -M client -A dynamic 100
```

Client mode + Static allocation

```bash
./scripts/submitArmadaSpark.sh -M client -A static 100
```