suspended_choice --> Suspended=True: TrainJob is suspended.
Suspended=True --> Suspended=True: Wait for unsuspending.
Suspended=True --> Suspended=False: TrainJob is unsuspended.
suspended_choice --> Suspended=False: TrainJob is not suspended.

#FAILURE
state terminal_choice <<choice>>
Suspended=False --> terminal_choice: Actual Jobs go to terminal phase.
terminal_choice --> Failed=True: Actual Jobs (e.g., JobSet) failed.
Failed=True --> [*]

#COMPLETION
terminal_choice --> Complete=True: Actual Jobs (e.g., JobSet) completed.
Complete=True --> [*]
```

In the state transitions above, `Created=False` occurs in the following situations,
which can be distinguished by the condition reason (`.status.conditions.[type="Created"].reason`).

- `JobsBuildFailed`: The TrainJob controller failed to construct the objects (resources) using the [runtime framework interfaces](../../../pkg/runtime.v2/framework/interface.go).
- `JobsCreationFailed`: The TrainJob controller constructed the objects successfully but failed to deploy them to the cluster.
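
As a rough illustration of how a client could tell these two cases apart, the sketch below looks up the `Created` condition and branches on its reason. The `Condition` struct is a simplified stand-in for the Kubernetes condition type, and `findCondition` and `describeCreated` are hypothetical helpers written for this example; only the two reason strings come from the list above.

```go
package main

import "fmt"

// Condition is a simplified, hypothetical stand-in for a Kubernetes status condition.
type Condition struct {
	Type    string
	Status  string // "True", "False", or "Unknown"
	Reason  string
	Message string
}

// Reason values taken from the list above.
const (
	ReasonJobsBuildFailed    = "JobsBuildFailed"
	ReasonJobsCreationFailed = "JobsCreationFailed"
)

// findCondition returns the first condition with the given type, or nil if absent.
func findCondition(conds []Condition, condType string) *Condition {
	for i := range conds {
		if conds[i].Type == condType {
			return &conds[i]
		}
	}
	return nil
}

// describeCreated is a hypothetical helper that explains a Created=False condition.
func describeCreated(conds []Condition) string {
	c := findCondition(conds, "Created")
	switch {
	case c == nil:
		return "no Created condition reported yet"
	case c.Status != "False":
		return "Created is not False"
	case c.Reason == ReasonJobsBuildFailed:
		return "objects could not be constructed"
	case c.Reason == ReasonJobsCreationFailed:
		return "objects were constructed but could not be deployed"
	default:
		return "Created=False with unrecognized reason " + c.Reason
	}
}

func main() {
	conds := []Condition{
		{Type: "Created", Status: "False", Reason: ReasonJobsCreationFailed},
	}
	fmt.Println(describeCreated(conds))
}
```

Branching on the reason rather than on the message keeps the check stable even if the human-readable message changes.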

Additionally, we extend the [runtime framework interfaces](../../../pkg/runtime.v2/framework/interface.go)
to allow each plugin to propagate arbitrary conditions to the TrainJob.
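
One way to picture this propagation is sketched below. The `ConditionPlugin` interface, its `Conditions` method, and the `propagate` and `setCondition` helpers are all hypothetical names invented for this example and are not taken from `interface.go`. The sketch also assumes that a condition reported by a plugin replaces any existing condition of the same type.

```go
package main

import "fmt"

// Condition is a simplified, hypothetical stand-in for a Kubernetes status condition.
type Condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// ConditionPlugin is a hypothetical extension point: a plugin that
// reports conditions it wants to surface on the TrainJob.
type ConditionPlugin interface {
	Conditions() []Condition
}

// setCondition replaces the condition with the same Type, or appends it if absent.
func setCondition(conds []Condition, c Condition) []Condition {
	for i := range conds {
		if conds[i].Type == c.Type {
			conds[i] = c
			return conds
		}
	}
	return append(conds, c)
}

// propagate merges the conditions reported by every plugin into conds, in plugin order.
func propagate(conds []Condition, plugins []ConditionPlugin) []Condition {
	for _, p := range plugins {
		for _, c := range p.Conditions() {
			conds = setCondition(conds, c)
		}
	}
	return conds
}

// examplePlugin is an illustrative plugin that reports one custom condition.
type examplePlugin struct{}

func (examplePlugin) Conditions() []Condition {
	return []Condition{{Type: "ExampleReady", Status: "True", Reason: "ExampleReason"}}
}

func main() {
	conds := []Condition{{Type: "Suspended", Status: "False"}}
	conds = propagate(conds, []ConditionPlugin{examplePlugin{}})
	for _, c := range conds {
		fmt.Printf("%s=%s\n", c.Type, c.Status)
	}
}
```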

## The Training Runtime API

The `TrainingRuntime` is the pre-created configurations of model training on the cluster,