[BUG] SparkApplication in FAILING state has finish time #2118

Open
1 task done
BalaMahesh opened this issue Aug 10, 2024 · 0 comments

Description

We are running Spark operator v1beta2-1.6.2-3.5.0 in production. One of our SparkApplications uses the following restart policy:

  restartPolicy:
    onFailureRetryInterval: 10
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 20
    type: Always

When the driver pod failed, the operator logged the following:

UID:"263d8e51-4a32-4832-94b5-f73043e4bd69", APIVersion:"sparkoperator.k8s.io/v1beta2", ResourceVersion:"1343465049", FieldPath:""}): type: 'Warning' reason: 'SparkDriverFailed' Driver **-regular-driver failed
I0810 13:25:06.814383      10 controller.go:860] Update the status of SparkApplication namespace/**-regular from:
{
  "sparkApplicationId": "spark-5a8712b706984da5bd90ac3fa7f35b86",
  "submissionID": "9b80c355-fc31-4680-b3ce-7ce34f7c31ec",
  "lastSubmissionAttemptTime": "2024-08-10T10:26:46Z",
  "terminationTime": null,
  "driverInfo": {
    "webUIServiceName": "**-regular-ui-svc",
    "webUIPort": 4040,
    "webUIAddress": "10.204.49.228:0",
    "podName": "**-regular-driver"
  },
  "applicationState": {
    "state": "RUNNING"
  },
  "executorState": {
    "**-8f53b6913bd57dcf-exec-7": "UNKNOWN",
    "**-8f53b6913bd57dcf-exec-8": "FAILED",
    "**-8f53b6913bd57dcf-exec-9": "COMPLETED"
  },
  "executionAttempts": 63,
  "submissionAttempts": 1
}
to:
{
  "sparkApplicationId": "spark-5a8712b706984da5bd90ac3fa7f35b86",
  "submissionID": "9b80c355-fc31-4680-b3ce-7ce34f7c31ec",
  "lastSubmissionAttemptTime": "2024-08-10T10:26:46Z",
  "terminationTime": "2024-08-10T13:25:06Z",
  "driverInfo": {
    "webUIServiceName": "**-regular-ui-svc",
    "webUIPort": 4040,
    "webUIAddress": "10.204.49.228:0",
    "podName": "**-regular-driver"
  },
  "applicationState": {
    "state": "FAILING",
    "errorMessage": "driver container failed with ExitCode: 1, Reason: Error"
  },
  "executorState": {
    "**-8f53b6913bd57dcf-exec-8": "FAILED",
    "**-8f53b6913bd57dcf-exec-9": "COMPLETED"
  },
  "executionAttempts": 63,
  "submissionAttempts": 1
}
I0810 13:25:06.827513      10 metrics.go:125] Decrementing spark_app_running_count with labels map[app_type:Unknown] metricVal to 4
I0810 13:25:06.827552      10 sparkapp_metrics.go:287] Exporting Metrics for Executor **-8f53b6913bd57dcf-exec-4. OldState: UNKNOWN NewState: FAILED
I0810 13:25:06.827564      10 metrics.go:125] Decrementing spark_app_executor_running_count with labels map[app_type:Unknown] metricVal to 29
I0810 13:25:06.827571      10 sparkapp_metrics.go:287] Exporting Metrics for Executor **-8f53b6913bd57dcf-exec-18. OldState: UNKNOWN NewState: FAILED
I0810 13:25:06.827575      10 metrics.go:125] Decrementing spark_app_executor_running_count with labels map[app_type:Unknown] metricVal to 28
I0810 13:25:06.827579      10 sparkapp_metrics.go:287] Exporting Metrics for Executor **-8f53b6913bd57dcf-exec-19. OldState: UNKNOWN NewState: FAILED
I0810 13:25:06.827586      10 metrics.go:125] Decrementing spark_app_executor_running_count with labels map[app_type:Unknown] metricVal to 27
I0810 13:25:06.827593      10 sparkapp_metrics.go:287] Exporting Metrics for Executor **-8f53b6913bd57dcf-exec-3. OldState: UNKNOWN NewState: FAILED
I0810 13:25:06.827607      10 metrics.go:125] Decrementing spark_app_executor_running_count with labels map[app_type:Unknown] metricVal to 26
I0810 13:25:06.827613      10 sparkapp_metrics.go:287] Exporting Metrics for Executor **-8f53b6913bd57dcf-exec-35. OldState: UNKNOWN NewState: FAILED
I0810 13:25:06.827622      10 metrics.go:125] Decrementing spark_app_executor_running_count with labels map[app_type:Unknown] metricVal to 25
I0810 13:25:06.827631      10 sparkapp_metrics.go:287] Exporting Metrics for Executor **-8f53b6913bd57dcf-exec-7. OldState: UNKNOWN NewState: FAILED
I0810 13:25:06.827640      10 metrics.go:125] Decrementing spark_app_executor_running_count with labels map[app_type:Unknown] metricVal to 24
I0810 13:25:06.827647      10 sparkapp_metrics.go:287] Exporting Metrics for Executor **-8f53b6913bd57dcf-exec-1. OldState: UNKNOWN NewState: FAILED
I0810 13:25:06.827654      10 metrics.go:125] Decrementing spark_app_executor_running_count with labels map[app_type:Unknown] metricVal to 23
I0810 13:25:06.827662      10 sparkapp_metrics.go:287] Exporting Metrics for Executor **-8f53b6913bd57dcf-exec-24. OldState: UNKNOWN NewState: FAILED
I0810 13:25:06.827669      10 metrics.go:125] Decrementing spark_app_executor_running_count with labels map[app_type:Unknown] metricVal to 22
I0810 13:25:06.827675      10 sparkapp_metrics.go:287] Exporting Metrics for Executor **-8f53b6913bd57dcf-exec-40. OldState: UNKNOWN NewState: FAILED
I0810 13:25:06.827679      10 metrics.go:125] Decrementing spark_app_executor_running_count with labels map[app_type:Unknown] metricVal to 21
I0810 13:25:06.827683      10 sparkapp_metrics.go:287] Exporting Metrics for Executor **-8f53b6913bd57dcf-exec-5. OldState: UNKNOWN NewState: FAILED
I0810 13:25:06.827689      10 metrics.go:125] Decrementing spark_app_executor_running_count with labels map[app_type:Unknown] metricVal to 20
I0810 13:25:06.827693      10 sparkapp_metrics.go:287] Exporting Metrics for Executor **-8f53b6913bd57dcf-exec-2. OldState: UNKNOWN NewState: FAILED
I0810 13:25:06.827697      10 metrics.go:125] Decrementing spark_app_executor_running_count with labels map[app_type:Unknown] metricVal to 19
I0810 13:25:06.827701      10 sparkapp_metrics.go:287] Exporting Metrics for Executor **-8f53b6913bd57dcf-exec-32. OldState: UNKNOWN NewState: FAILED
I0810 13:25:06.827707      10 metrics.go:125] Decrementing spark_app_executor_running_count with labels map[app_type:Unknown] metricVal to 18
I0810 13:25:06.827711      10 sparkapp_metrics.go:287] Exporting Metrics for Executor **-8f53b6913bd57dcf-exec-38. OldState: UNKNOWN NewState: FAILED
I0810 13:25:06.827717      10 metrics.go:125] Decrementing spark_app_executor_running_count with labels map[app_type:Unknown] metricVal to 17
I0810 13:25:06.827722      10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:25:06.827792      10 controller.go:227] SparkApplication namespace/**-regular was updated, enqueuing it
I0810 13:25:06.827814      10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:25:06.827855      10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:25:07.452988      10 controller.go:227] SparkApplication namespace/**-regular was updated, enqueuing it
I0810 13:25:07.453037      10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:25:07.453109      10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:25:07.909475      10 spark_pod_eventhandler.go:58] Pod **-regular-driver updated in namespace namespace.
I0810 13:25:07.909518      10 spark_pod_eventhandler.go:95] Enqueuing SparkApplication namespace/**-regular for app update processing.
I0810 13:25:07.909544      10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:25:07.909634      10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:25:07.969839      10 spark_pod_eventhandler.go:58] Pod **-regular-driver updated in namespace namespace.
I0810 13:25:07.969874      10 spark_pod_eventhandler.go:95] Enqueuing SparkApplication namespace/**-regular for app update processing.
I0810 13:25:07.969899      10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:25:07.969972      10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:25:08.822351      10 spark_pod_eventhandler.go:58] Pod **-regular-driver updated in namespace namespace.
I0810 13:25:08.822380      10 spark_pod_eventhandler.go:95] Enqueuing SparkApplication namespace/**-regular for app update processing.
I0810 13:25:08.822402      10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:25:08.822479      10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:25:37.453315      10 controller.go:227] SparkApplication namespace/**-regular was updated, enqueuing it
I0810 13:25:37.453383      10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:25:37.453478      10 controller.go:274] Ending processing key: "namespace/**-regular"


I0810 13:26:07.453988      10 controller.go:227] SparkApplication namespace/**-regular was updated, enqueuing it
I0810 13:26:07.454063      10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:26:07.454133      10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:26:37.454371      10 controller.go:227] SparkApplication namespace/**-regular was updated, enqueuing it
I0810 13:26:37.454424      10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:26:37.454507      10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:27:07.455311      10 controller.go:227] SparkApplication namespace/**-regular was updated, enqueuing it
I0810 13:27:07.455388      10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:27:07.455463      10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:27:37.455431      10 controller.go:227] SparkApplication namespace/**-regular was updated, enqueuing it
I0810 13:27:37.455488      10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:27:37.455562      10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:27:50.436345      10 spark_pod_eventhandler.go:58] Pod **-regular-driver updated in namespace namespace.
I0810 13:27:50.436379      10 spark_pod_eventhandler.go:95] Enqueuing SparkApplication namespace/**-regular for app update processing.
I0810 13:27:50.436405      10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:27:50.436492      10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:27:50.440231      10 spark_pod_eventhandler.go:77] Pod **-regular-driver deleted in namespace namespace.
I0810 13:27:50.440254      10 spark_pod_eventhandler.go:95] Enqueuing SparkApplication namespace/**-regular for app update processing.
I0810 13:27:50.440271      10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:27:50.440324      10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:28:07.456297      10 controller.go:227] SparkApplication namespace/**-regular was updated, enqueuing it
I0810 13:28:07.456357      10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:28:07.456440      10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:28:37.456984      10 controller.go:227] SparkApplication namespace/**-regular was updated, enqueuing it
I0810 13:28:37.457038      10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:28:37.457124      10 controller.go:274] Ending processing key: "namespace/**-regular"
I0810 13:29:07.457237      10 controller.go:227] SparkApplication namespace/**-regular was updated, enqueuing it
I0810 13:29:07.457300      10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:29:07.457400      10 controller.go:274] Ending processing key: "namespace/**-regular"


I0810 13:29:37.458045      10 controller.go:227] SparkApplication namespace/**-regular was updated, enqueuing it
I0810 13:29:37.458105      10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:29:37.458285      10 controller.go:274] Ending processing key: "namespace/**-regular"


I0810 13:30:07.458243      10 controller.go:227] SparkApplication namespace/**-regular was updated, enqueuing it
I0810 13:30:07.458322      10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:30:07.458431      10 controller.go:274] Ending processing key: "namespace/**-regular"
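
To make the status transition in the log above easier to read, the two status objects can be compared field by field. The following is a minimal Python sketch, not part of the operator: the helper name `diff_status` is hypothetical, and the dictionaries are abbreviated copies of the `from:` and `to:` objects logged at `controller.go:860`.

```python
def diff_status(old, new, prefix=""):
    """Return {dotted.path: (old_value, new_value)} for every field that differs."""
    changes = {}
    for key in sorted(set(old) | set(new)):
        path = f"{prefix}{key}"
        before, after = old.get(key), new.get(key)
        if isinstance(before, dict) and isinstance(after, dict):
            # Recurse into nested objects such as applicationState.
            changes.update(diff_status(before, after, prefix=f"{path}."))
        elif before != after:
            changes[path] = (before, after)
    return changes


# Abbreviated copies of the "from:" and "to:" status objects in the log.
old_status = {
    "terminationTime": None,
    "applicationState": {"state": "RUNNING"},
    "executorState": {
        "**-8f53b6913bd57dcf-exec-7": "UNKNOWN",
        "**-8f53b6913bd57dcf-exec-8": "FAILED",
        "**-8f53b6913bd57dcf-exec-9": "COMPLETED",
    },
    "executionAttempts": 63,
    "submissionAttempts": 1,
}
new_status = {
    "terminationTime": "2024-08-10T13:25:06Z",
    "applicationState": {
        "state": "FAILING",
        "errorMessage": "driver container failed with ExitCode: 1, Reason: Error",
    },
    "executorState": {
        "**-8f53b6913bd57dcf-exec-8": "FAILED",
        "**-8f53b6913bd57dcf-exec-9": "COMPLETED",
    },
    "executionAttempts": 63,
    "submissionAttempts": 1,
}

changes = diff_status(old_status, new_status)
for path, (before, after) in changes.items():
    print(f"{path}: {before!r} -> {after!r}")
```

In this update only four fields differ: the state moves from RUNNING to FAILING, an error message is added, `terminationTime` is set, and the exec-7 entry is dropped. `executionAttempts` stays at 63 and `submissionAttempts` stays at 1.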

The driver pod is in the Error state, and the SparkApplication status is:

NAME         STATUS    ATTEMPTS   START                  FINISH                 AGE
**-regular   FAILING   63         2024-08-10T10:26:46Z   2024-08-10T13:25:06Z   4d5h
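
For reference, the gap between the recorded finish time and the last operator log line before the manual delete (13:30:07) can be compared with the configured `onFailureRetryInterval`. The following is a rough Python sketch of that arithmetic. It assumes the klog timestamps are in UTC (the 13:25:06 status update matches `terminationTime` 2024-08-10T13:25:06Z) and that `onFailureRetryInterval` is expressed in seconds; the variable names are illustrative only.

```python
from datetime import datetime, timezone

# From the restartPolicy above; assumed to be a number of seconds.
ON_FAILURE_RETRY_INTERVAL_S = 10

# FINISH column / status.terminationTime of the SparkApplication.
termination_time = datetime.strptime(
    "2024-08-10T13:25:06Z", "%Y-%m-%dT%H:%M:%SZ"
).replace(tzinfo=timezone.utc)

# Last "processing key" log line before the manual delete: I0810 13:30:07.
# klog omits the year and time zone; 2024 and UTC are assumed here.
last_log_time = datetime(2024, 8, 10, 13, 30, 7, tzinfo=timezone.utc)

elapsed_s = int((last_log_time - termination_time).total_seconds())
missed_intervals = elapsed_s // ON_FAILURE_RETRY_INTERVAL_S

print(f"time spent in FAILING without resubmission: {elapsed_s}s")
print(f"configured retry interval elapsed {missed_intervals} times")
```

Under these assumptions the application stayed in FAILING for 301 seconds, roughly 30 times the configured interval, and no resubmission appears in the log during that window.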

Only after we deleted the SparkApplication manually did the operator log the following and start the Spark application again:

I0810 13:30:31.202106      10 controller.go:896] Deleting pod **-regular-driver in namespace namespace
I0810 13:30:31.205664      10 controller.go:904] Deleting Spark UI Service **-regular-ui-svc in namespace namespace
I0810 13:30:31.224970      10 event.go:364] Event(v1.ObjectReference{Kind:"SparkApplication", Namespace:"namespace", Name:"**-regular", UID:"263d8e51-4a32-4832-94b5-f73043e4bd69", APIVersion:"sparkoperator.k8s.io/v1beta2", ResourceVersion:"1343488188", FieldPath:""}): type: 'Normal' reason: 'SparkApplicationDeleted' SparkApplication **-regular was deleted
I0810 13:30:35.985819      10 controller.go:188] SparkApplication namespace/**-regular was added, enqueuing it for submission
I0810 13:30:35.985869      10 controller.go:267] Starting processing key: "namespace/**-regular"
I0810 13:30:35.985929      10 driveringress.go:287] Creating a service **-regular-ui-svc for the Driver Ingress for application **-regular
I0

How can I make sure that my SparkApplication gets restarted when the driver fails? This is happening regularly.

  • ✋ I have searched the open/closed issues and my issue is not listed.

Reproduction Code [Required]

Submit a SparkApplication to the Spark operator with the following restart policy:

  restartPolicy:
    onFailureRetryInterval: 10
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 20
    type: Always

When the driver pod fails, the SparkApplication moves to the FAILING state and is not submitted again.

Expected behavior

The Spark application should be restarted.

Actual behavior

The SparkApplication has a finish time and remains in the FAILING state.

Terminal Output Screenshot(s)

Environment & Versions

  • Spark Operator App version: v1beta2-1.6.2-3.5.0
  • Helm Chart Version: 1.4.5
  • Kubernetes Version: v1.28.7-gke.1026001
  • Apache Spark version: 3.4.0

Additional context
