-
Notifications
You must be signed in to change notification settings - Fork 402
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FLINK-33634] Add Conditions to Flink CRD's Status field #749
base: main
Are you sure you want to change the base?
Conversation
private void setCRStatus(FlinkDeployment flinkApp) { | ||
final List<Condition> conditions = new ArrayList<>(); | ||
FlinkDeploymentStatus deploymentStatus = flinkApp.getStatus(); | ||
switch (deploymentStatus.getJobManagerDeploymentStatus()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need a bit more logic here. We need to check if the status doesn't already have a condition in it's list that matches the same type, we need to then check the reason, status and message of this condition and only if they are different replace the existing condition with the new one.
We shouldn't just replace the existing list of conditions with a new one, there may be additional conditions of different types added later that would get removed. This may be one of the reasons why the tests have had to be modified to bump up the number of status changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Modified logic to update the conditions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel that this PR is a little half baked at this point. I understand the general motivation of introducing a Condition field but we should reuse existing logic like the resource lifecycle state to populate it.
Setting this somewhat arbitrarily based on the deployment state makes it inconsistent between application and session jobs etc.
import java.util.Date; | ||
|
||
/** Status of CR. */ | ||
public class CommonCRStatus { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very confusing class name. We already have CommonStatus
which is actually part of the status (superclass). This class here is just a utility class for creating Condition
objects. So maybe FlinkConditions
or ConditionUtils
would be a better name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using ConditionUtils
public static Condition crReadyTrueCondition(final String message) { | ||
return crCondition("Ready", "True", message, "Ready"); | ||
} | ||
|
||
public static Condition crReadyFalseCondition(final String message) { | ||
return crCondition("Ready", "False", message, "Progressing"); | ||
} | ||
|
||
public static Condition crErrorCondition(final String message) { | ||
return crCondition("Error", "True", message, "UnhandledException"); | ||
} | ||
|
||
public static Condition crCondition( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if we rename the class we can also rename this methods simply to: neady
/ notReady
/ error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Method names has been changed
} | ||
|
||
public static Condition crErrorCondition(final String message) { | ||
return crCondition("Error", "True", message, "UnhandledException"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the message UnhandledException
?
@@ -227,4 +235,28 @@ private boolean validateDeployment(FlinkResourceContext<FlinkDeployment> ctx) { | |||
} | |||
return true; | |||
} | |||
|
|||
private void setCRStatus(FlinkDeployment flinkApp) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of having to call setCrStatus
can we move this logic to within the actual CommonStatus
or FlinkDeploymentStatus
so that it's computed automatically like the ResourceLifecycleState
?
Also I think we could actually compute this directly from the resource lifecycle state instead of adding arbitrary new logic here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made changes to populate Conditions in CommonStatus using resource lifecycle state
Pushed the changes for re review. |
updateConditionIfNotExist( | ||
conditions, | ||
ConditionUtils.notReady( | ||
"The resource was created in Kubernetes but not yet handled by the operator")); | ||
break; | ||
case SUSPENDED: | ||
updateConditionIfNotExist( | ||
conditions, | ||
ConditionUtils.notReady("The resource (job) has been suspended")); | ||
break; | ||
case UPGRADING: | ||
updateConditionIfNotExist( | ||
conditions, ConditionUtils.notReady("The resource is being upgraded")); | ||
break; | ||
case DEPLOYED: | ||
updateConditionIfNotExist( | ||
conditions, | ||
ConditionUtils.ready( | ||
"The resource is deployed, but it’s not yet considered to be stable and might be rolled back in the future")); | ||
break; | ||
case ROLLING_BACK: | ||
updateConditionIfNotExist( | ||
conditions, | ||
ConditionUtils.notReady( | ||
"The resource is being rolled back to the last stable spec")); | ||
break; | ||
case ROLLED_BACK: | ||
updateConditionIfNotExist( | ||
conditions, | ||
ConditionUtils.ready("The resource is deployed with the last stable spec")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a lot of duplicated code and copy-pasted strings from the ResourceLifeCycleState
, I think we should add this conversion to the enum (or a utility class using the enum directly)
if (conditions.stream() | ||
.noneMatch(condition -> condition.getType().equals(newCondition.getType()))) { | ||
conditions.add(newCondition); | ||
} else if (conditions.removeIf( | ||
condition -> | ||
!(condition.getReason().equals(newCondition.getReason()) | ||
&& condition.getMessage().equals(newCondition.getMessage())))) { | ||
conditions.add(newCondition); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please explain what this is supposed to do exactly (and also add some docs to the code)? I am a bit confused by the logic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, will add the docs to code. above logic is to make sure that rather to blindly replace any existing conditions with new one , check for existing condition with same type and replace only if the same condition type has different message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I wrote in the last comment, I think we need a FLIP for this instead of fixing up this PR
} | ||
|
||
public static Condition error(final String message) { | ||
return crCondition("Error", "True", message, "The job terminally failed"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think error means that the job terminally failed in all cases. There are also operator side errors like reconciliation problems etc.
import java.text.SimpleDateFormat; | ||
import java.util.Date; | ||
|
||
/** Status of CR. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incorrect javadoc
.withMessage(message) | ||
.withReason(reason) | ||
.withLastTransitionTime( | ||
new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'").format(new Date())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't Instant.now().toString()
work?
} | ||
|
||
public static Condition notReady(final String message) { | ||
return crCondition("Ready", "False", message, "Progressing"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason why we arbitrarily chose to have Ready and Error conditions? Why not simply use the ResourceLifecycleState name as the type and description as the message?
.withType(type) | ||
.withStatus(status) | ||
.withMessage(message) | ||
.withReason(reason) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need a reason, I suggest let's just remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per the doc , https://maelvls.dev/kubernetes-conditions/ , looks like we can keep reason as simple category of cause of the current status.
Reason is intended to be used in concise output, such as one-line kubectl get output, and in summarizing occurrences of causes, whereas Message is intended to be presented to users in detailed status explanations, such as kubectl describe output.
I was thinking probably we can keep that for any not ready and error conditions, and not required for ready condition?.
@lajith2006 @tagarr To be honest I am not sure whether the current simplistic approach for representing a single status part (JobManagerDeplyomentStats / ResourceLifecycleState) makes too much sense. It doesn't really follow the semantics laid out by other Kubernetes resources like pod conditions etc. I think it would make sense to open a FLIP and properly design the conditions that we should represent instead of just adding this feature to "tick a box" |
@gyfora do you want me to open a FLIP for this?. |
I think we need a FLIP with the design of the conditions yes. So that the community can discuss and vote on it as I said previously. |
Sure, I will open FLIP with design. |
Hi @gyfora I was referring the page https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals to open a FLIP with all details of what should be there in FLIP . But what I was trying to understand was that how to login to open FLIP, I guess as per the page https://cwiki.apache.org/confluence/display/FLINK/General+Information+for+Committers , we need to have apache id to login and create FLIP template?. Could you please let me know steps or any document I can refer to get into FLIP template to Open FLIP. Thank you in advance. |
@lajith2006 do you already have a Jira account? |
@gyfora yes, I am having JIRA account using which I can login to https://issues.apache.org/jira/projects/FLINK/. |
okay then can you please tell me the account name? :D |
account name : lajithk |
It seems like you need to create a confluence account (cwiki.apache.org) once you have that I can give you permissions to create a FLIP page |
I have been checking on to create confluence account , https://cwiki.apache.org/confluence , it says for register go to Log in page, but don't see any option to register there in login page. On further digging noticed some thing like https://cwiki.apache.org/confluence/display/DIRxTRIPLESEC/User+Registration . Is that something I have to follow up ?. or are there any path I can look for registration?. |
@gyfora , Could you please point me anyone I can reach out to get assistance on helping to get account created in https://cwiki.apache.org/confluence?. Thank you in advance. |
Seems like confluence access for non-committers have been recently disabled. Based on a recent discussion we are adopting a new (somewhat simpler) process for new FLIPS: https://lists.apache.org/thread/rkpvlnwj9gv1hvx1dyklx6k88qpnvk2t
|
What is the purpose of the change
This PR is for issue https://issues.apache.org/jira/browse/FLINK-33634 to add Conditions field in the CR status of Flink Deployment and FlinkSessionJob.
Brief change log
add conditions status to CR.
READY
DEPLOYED_NOT_READY
DEPLOYING
ERROR
RUNNING
CREATED
CANCELED
FAILED
Verifying this change
This change is already covered by existing tests, such as FlinkDeploymentControllerTest and FlinkSessionJobControllerTest
Does this pull request potentially affect one of the following parts:
CustomResourceDescriptors
: (yes / no) yesDocumentation