We are running Spark operator v1beta2-1.6.2-3.5.0 in production. We have a SparkApplication with the following restart policy:
restartPolicy:
  onFailureRetryInterval: 10
  onSubmissionFailureRetries: 5
  onSubmissionFailureRetryInterval: 20
  type: Always
When the driver pod fails for some reason, the operator logs the following:
UID:"263d8e51-4a32-4832-94b5-f73043e4bd69", APIVersion:"sparkoperator.k8s.io/v1beta2", ResourceVersion:"1343465049", FieldPath:""}): type: 'Warning' reason: 'SparkDriverFailed' Driver **-regular-driver failed
I0810 13:25:06.814383 10 controller.go:860] Update the status of SparkApplication namespace/**-regular from:
{ "sparkApplicationId": "spark-5a8712b706984da5bd90ac3fa7f35b86", "submissionID": "9b80c355-fc31-4680-b3ce-7ce34f7c31ec", "lastSubmissionAttemptTime": "2024-08-10T10:26:46Z", "terminationTime": null, "driverInfo": { "webUIServiceName": "**-regular-ui-svc", "webUIPort": 4040, "webUIAddress": "10.204.49.228:0", "podName": "**-regular-driver" }, "applicationState": { "state": "RUNNING" }, "executorState": { "**-8f53b6913bd57dcf-exec-7": "UNKNOWN", "**-8f53b6913bd57dcf-exec-8": "FAILED", "**-8f53b6913bd57dcf-exec-9": "COMPLETED" }, "executionAttempts": 63, "submissionAttempts": 1 }
to:
{ "sparkApplicationId": "spark-5a8712b706984da5bd90ac3fa7f35b86", "submissionID": "9b80c355-fc31-4680-b3ce-7ce34f7c31ec", "lastSubmissionAttemptTime": "2024-08-10T10:26:46Z", "terminationTime": "2024-08-10T13:25:06Z", "driverInfo": { "webUIServiceName": "**-regular-ui-svc", "webUIPort": 4040, "webUIAddress": "10.204.49.228:0", "podName": "**-regular-driver" }, "applicationState": { "state": "FAILING", "errorMessage": "driver container failed with ExitCode: 1, Reason: Error" }, "executorState": { "**-8f53b6913bd57dcf-exec-8": "FAILED", "**-8f53b6913bd57dcf-exec-9": "COMPLETED" }, "executionAttempts": 63, "submissionAttempts": 1 }
I0810 13:25:06.827513 10 metrics.go:125] Decrementing spark_app_running_count with labels map[app_type:Unknown] metricVal to 4
I0810 13:25:06.827552 10 sparkapp_metrics.go:287] Exporting Metrics for Executor **-8f53b6913bd57dcf-exec-4. OldState: UNKNOWN NewState: FAILED
I0810 13:25:06.827564 10 metrics.go:125] Decrementing spark_app_executor_running_count with labels map[app_type:Unknown] metricVal to 29
I0810 13:25:06.827571 10 sparkapp_metrics.go:287] Exporting Metrics for Executor **-8f53b6913bd57dcf-exec-18. OldState: UNKNOWN NewState: FAILED
I0810 13:25:06.827575 10 metrics.go:125] Decrementing spark_app_executor_running_count with labels map[app_type:Unknown] metricVal to 28
I0810 13:25:06.827579 10 sparkapp_metrics.go:287] Exporting Metrics for Executor **-8f53b6913bd57dcf-exec-19. OldState: UNKNOWN NewState: FAILED
I0810 13:25:06.827586 10 metrics.go:125] Decrementing spark_app_executor_running_count with labels map[app_type:Unknown] metricVal to 27
I0810 13:25:06.827593 10 sparkapp_metrics.go:287] Exporting Metrics for Executor **-8f53b6913bd57dcf-exec-3. OldState: UNKNOWN NewState: FAILED
I0810 13:25:06.827607 10 metrics.go:125] Decrementing spark_app_executor_running_count with labels map[app_type:Unknown] metricVal to 26
I0810 13:25:06.827613 10 sparkapp_metrics.go:287] Exporting Metrics for Executor **-8f53b6913bd57dcf-exec-35. OldState: UNKNOWN NewState: FAILED
I0810 13:25:06.827622 10 metrics.go:125] Decrementing spark_app_executor_running_count with labels map[app_type:Unknown] metricVal to 25
I0810 13:25:06.827631 10 sparkapp_metrics.go:287] Exporting Metrics for Executor **-8f53b6913bd57dcf-exec-7. OldState: UNKNOWN NewState: FAILED
I0810 13:25:06.827640 10 metrics.go:125] Decrementing spark_app_executor_running_count with labels map[app_type:Unknown] metricVal to 24
I0810 13:25:06.827647 10 sparkapp_metrics.go:287] Exporting Metrics for Executor **-8f53b6913bd57dcf-exec-1. OldState: UNKNOWN NewState: FAILED
I0810 13:25:06.827654 10 metrics.go:125] Decrementing spark_app_executor_running_count with labels map[app_type:Unknown] metricVal to 23
I0810 13:25:06.827662 10 sparkapp_metrics.go:287] Exporting Metrics for Executor **-8f53b6913bd57dcf-exec-24. OldState: UNKNOWN NewState: FAILED
I0810 13:25:06.827669 10 metrics.go:125] Decrementing spark_app_executor_running_count with labels map[app_type:Unknown] metricVal to 22
I0810 13:25:06.827675 10 sparkapp_metrics.go:287] Exporting Metrics for Executor **-8f53b6913bd57dcf-exec-40. OldState: UNKNOWN NewState: FAILED
I0810 13:25:06.827679 10 metrics.go:125] Decrementing spark_app_executor_running_count with labels map[app_type:Unknown] metricVal to 21
I0810 13:25:06.827683 10 sparkapp_metrics.go:287] Exporting Metrics for Executor **-8f53b6913bd57dcf-exec-5. OldState: UNKNOWN NewState: FAILED
I0810 13:25:06.827689 10 metrics.go:125] Decrementing spark_app_executor_running_count with labels map[app_type:Unknown] metricVal to 20
I0810 13:25:06.827693 10 sparkapp_metrics.go:287] Exporting Metrics for Executor **-8f53b6913bd57dcf-exec-2. OldState: UNKNOWN NewState: FAILED
I0810 13:25:06.827697 10 metrics.go:125] Decrementing spark_app_executor_running_count with labels map[app_type:Unknown] metricVal to 19
I0810 13:25:06.827701 10 sparkapp_metrics.go:287] Exporting Metrics for Executor **-8f53b6913bd57dcf-exec-32. OldState: UNKNOWN NewState: FAILED
I0810 13:25:06.827707 10 metrics.go:125] Decrementing spark_app_executor_running_count with labels map[app_type:Unknown] metricVal to 18
I0810 13:25:06.827711 10 sparkapp_metrics.go:287] Exporting Metrics for Executor **-8f53b6913bd57dcf-exec-38. OldState: UNKNOWN NewState: FAILED
I0810 13:25:06.827717 10 metrics.go:125] Decrementing spark_app_executor_running_count with labels map[app_type:Unknown] metricVal to 17
I0810 13:25:06.827722 10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:25:06.827792 10 controller.go:227] SparkApplication namespace/**-regular was updated, enqueuing it
I0810 13:25:06.827814 10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:25:06.827855 10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:25:07.452988 10 controller.go:227] SparkApplication namespace/**-regular was updated, enqueuing it
I0810 13:25:07.453037 10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:25:07.453109 10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:25:07.909475 10 spark_pod_eventhandler.go:58] Pod **-regular-driver updated in namespace namespace.
I0810 13:25:07.909518 10 spark_pod_eventhandler.go:95] Enqueuing SparkApplication namespace/**-regular for app update processing.
I0810 13:25:07.909544 10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:25:07.909634 10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:25:07.969839 10 spark_pod_eventhandler.go:58] Pod **-regular-driver updated in namespace namespace.
I0810 13:25:07.969874 10 spark_pod_eventhandler.go:95] Enqueuing SparkApplication namespace/**-regular for app update processing.
I0810 13:25:07.969899 10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:25:07.969972 10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:25:08.822351 10 spark_pod_eventhandler.go:58] Pod **-regular-driver updated in namespace namespace.
I0810 13:25:08.822380 10 spark_pod_eventhandler.go:95] Enqueuing SparkApplication namespace/**-regular for app update processing.
I0810 13:25:08.822402 10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:25:08.822479 10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:25:37.453315 10 controller.go:227] SparkApplication namespace/**-regular was updated, enqueuing it
I0810 13:25:37.453383 10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:25:37.453478 10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:26:07.453988 10 controller.go:227] SparkApplication namespace/**-regular was updated, enqueuing it
I0810 13:26:07.454063 10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:26:07.454133 10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:26:37.454371 10 controller.go:227] SparkApplication namespace/**-regular was updated, enqueuing it
I0810 13:26:37.454424 10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:26:37.454507 10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:27:07.455311 10 controller.go:227] SparkApplication namespace/**-regular was updated, enqueuing it
I0810 13:27:07.455388 10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:27:07.455463 10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:27:37.455431 10 controller.go:227] SparkApplication namespace/**-regular was updated, enqueuing it
I0810 13:27:37.455488 10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:27:37.455562 10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:27:50.436345 10 spark_pod_eventhandler.go:58] Pod **-regular-driver updated in namespace namespace.
I0810 13:27:50.436379 10 spark_pod_eventhandler.go:95] Enqueuing SparkApplication namespace/**-regular for app update processing.
I0810 13:27:50.436405 10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:27:50.436492 10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:27:50.440231 10 spark_pod_eventhandler.go:77] Pod **-regular-driver deleted in namespace namespace.
I0810 13:27:50.440254 10 spark_pod_eventhandler.go:95] Enqueuing SparkApplication namespace/**-regular for app update processing.
I0810 13:27:50.440271 10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:27:50.440324 10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:28:07.456297 10 controller.go:227] SparkApplication namespace/**-regular was updated, enqueuing it
I0810 13:28:07.456357 10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:28:07.456440 10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:28:37.456984 10 controller.go:227] SparkApplication namespace/**-regular was updated, enqueuing it
I0810 13:28:37.457038 10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:28:37.457124 10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:29:07.457237 10 controller.go:227] SparkApplication namespace/**-regular was updated, enqueuing it
I0810 13:29:07.457300 10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:29:07.457400 10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:29:37.458045 10 controller.go:227] SparkApplication namespace/**-regular was updated, enqueuing it
I0810 13:29:37.458105 10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:29:37.458285 10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:30:07.458243 10 controller.go:227] SparkApplication namespace/**-regular was updated, enqueuing it
I0810 13:30:07.458322 10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:30:07.458431 10 controller.go:274] Ending processing key: "namespace/**-regular"
The driver pod is in the Error state, and the SparkApplication status is:
NAME         STATUS    ATTEMPTS   START                  FINISH                 AGE
**-regular   FAILING   63         2024-08-10T10:26:46Z   2024-08-10T13:25:06Z   4d5h
Only after I deleted the SparkApplication manually did the operator log the following and start the Spark application again:
I0810 13:30:31.202106 10 controller.go:896] Deleting pod **-regular-driver in namespace namespace
I0810 13:30:31.205664 10 controller.go:904] Deleting Spark UI Service **-regular-ui-svc in namespace namespace
I0810 13:30:31.224970 10 event.go:364] Event(v1.ObjectReference{Kind:"SparkApplication", Namespace:"namespace", Name:"**-regular", UID:"263d8e51-4a32-4832-94b5-f73043e4bd69", APIVersion:"sparkoperator.k8s.io/v1beta2", ResourceVersion:"1343488188", FieldPath:""}): type: 'Normal' reason: 'SparkApplicationDeleted' SparkApplication **-regular was deleted
I0810 13:30:35.985819 10 controller.go:188] SparkApplication namespace/**-regular was added, enqueuing it for submission
I0810 13:30:35.985869 10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:30:35.985929 10 driveringress.go:287] Creating a service **-regular-ui-svc for the Driver Ingress for application **-regular
I0
How can I make sure that my SparkApplication is restarted when the driver fails? This happens regularly.
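One thing that may be worth ruling out is retry back-off. The sketch below assumes the operator waits `executionAttempts × onFailureRetryInterval` seconds after `terminationTime` before resubmitting a FAILING application. That linear back-off is an assumption about the operator's retry logic, not something confirmed in this report, and the helper names `parse_utc` and `next_retry_due` are made up for illustration. The input values come from the status and logs above.

```python
from datetime import datetime, timedelta, timezone

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_utc(ts: str) -> datetime:
    """Parse a Kubernetes-style UTC timestamp such as 2024-08-10T13:25:06Z."""
    return datetime.strptime(ts, TIME_FORMAT).replace(tzinfo=timezone.utc)


def next_retry_due(termination_time: str, execution_attempts: int,
                   retry_interval_s: int) -> datetime:
    """Earliest retry time under the ASSUMED linear back-off:
    terminationTime + executionAttempts * onFailureRetryInterval."""
    wait = timedelta(seconds=execution_attempts * retry_interval_s)
    return parse_utc(termination_time) + wait


# Values taken from the status above: terminationTime, executionAttempts,
# and onFailureRetryInterval from the restart policy.
due = next_retry_due("2024-08-10T13:25:06Z", 63, 10)

# Time of the manual deletion, read from the operator log (13:30:31).
deleted_at = parse_utc("2024-08-10T13:30:31Z")

print(due.isoformat())    # 2024-08-10T13:35:36+00:00
print(deleted_at < due)   # True
```

If that assumption holds, the retry would only have come due at 13:35:36, about five minutes after the manual deletion, so in this particular log window the application may have been waiting out its back-off rather than being stuck. It would not by itself explain an application that stays in FAILING for much longer than that.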
Reproduction: Submit the SparkApplication to the Spark operator with the restart policy above. When the driver pod fails, the SparkApplication fails and is not submitted again.
Expected behavior: The Spark application should be restarted.
Actual behavior: The SparkApplication has a finish time and stays in the FAILING state.