Community Note

- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
- Please do not leave "+1" or other comments that do not add relevant new information or questions; they generate extra noise for issue followers and do not help prioritize the request.
- If you are interested in working on this issue or have submitted a pull request, please leave a comment.
What is the outcome that you are trying to reach?
We seek to reduce the startup time for newly submitted Spark applications. Ideally, we would eliminate the duplicate dependency downloads performed by the operator and the driver.
Describe the solution you would like
TBD
Describe alternatives you have considered
We have worked around this internally by mounting a PVC at the operator's $SPARK_HOME/jars directory and sharing it across all drivers and executors (sketched below), but we consider this a 'hacky' solution.
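A minimal sketch of the driver/executor side of this workaround, assuming the stock Spark image (SPARK_HOME=/opt/spark); the PVC name, the application fields, and how the same claim is mounted into the operator pod are illustrative assumptions, not our exact configuration:

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: shared-jars-example            # hypothetical name
spec:
  # ... type, mode, image, mainApplicationFile, etc. ...
  volumes:
    - name: spark-jars
      persistentVolumeClaim:
        claimName: shared-spark-jars   # assumed: the same PVC is also mounted
                                       # at $SPARK_HOME/jars in the operator pod
  driver:
    volumeMounts:
      - name: spark-jars
        mountPath: /opt/spark/jars     # $SPARK_HOME/jars in the stock image
  executor:
    volumeMounts:
      - name: spark-jars
        mountPath: /opt/spark/jars
```

With the operator deployment mounting the same claim at its own $SPARK_HOME/jars, dependencies downloaded once become visible to every subsequent driver and executor.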
Additional context
The Spark Operator currently runs spark-submit twice. First, the operator pod runs spark-submit in cluster mode, which triggers the first set of dependency downloads on the operator. Then, the driver runs spark-submit in client mode, which triggers a second set of dependency downloads on the driver pod. If the job runs to completion and is re-submitted with the same dependencies, the operator will not have to re-download them (unless Ivy chooses to re-resolve), but the newly created driver will download the dependencies again.
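For example, any application that pulls Maven packages through spec.deps.packages hits this twice; the names and coordinates below are purely illustrative:

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: deps-example                              # hypothetical
spec:
  type: Scala
  mode: cluster
  mainApplicationFile: local:///opt/app/app.jar   # hypothetical
  mainClass: com.example.Main                     # hypothetical
  deps:
    packages:
      # Resolved by Ivy once in the operator pod (cluster-mode spark-submit)
      # and again in the driver pod (client-mode spark-submit).
      - org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1
```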