Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-33634] Add Conditions to Flink CRD's Status field #749

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

lajith2006
Copy link

What is the purpose of the change

This PR is for issue https://issues.apache.org/jira/browse/FLINK-33634 to add Conditions field in the CR status of Flink Deployment and FlinkSessionJob.

Brief change log

  • Created a new class CommonCRStatus to build different Conditions with type and status. This reusable class can be used to build CR condition of type Ready or Error conditions based on jobmanager status and job status.
  • FlinkDeploymentStatus is updated to add list of Condition
  • FlinkDeploymentController is updated to add Conditions to FlinkDeploymentStatus based on JobManagerDeploymentStatus. Following JobManagerDeploymentStatus are considerd to
    add conditions status to CR.
    READY
    DEPLOYED_NOT_READY
    DEPLOYING
    ERROR
  • FlinkSessionJobStatus is updated to add list of Condition
  • FlinkSessionJobController is updated to add Conditions to FlinkSessionJobStatus based on JobStatus. Following JobStatus are considered to add conditions status to CR
    RUNNING
    CREATED
    CANCELED
    FAILED
  • FlinkDeploymentControllerTest and FlinkSessionJobControllerTest are modified to test the Status update with conditions.

Verifying this change

This change is already covered by existing tests, such as FlinkDeploymentControllerTest and FlinkSessionJobControllerTest

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no) no
  • The public API, i.e., is any changes to the CustomResourceDescriptors: (yes / no) yes
  • Core observer or reconciler logic that is regularly executed: (yes / no) yes

Documentation

  • Does this pull request introduce a new feature? (yes / no) yes
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented) docs

private void setCRStatus(FlinkDeployment flinkApp) {
final List<Condition> conditions = new ArrayList<>();
FlinkDeploymentStatus deploymentStatus = flinkApp.getStatus();
switch (deploymentStatus.getJobManagerDeploymentStatus()) {
Copy link
Contributor

@tagarr tagarr Jan 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need a bit more logic here. We need to check if the status doesn't already have a condition in it's list that matches the same type, we need to then check the reason, status and message of this condition and only if they are different replace the existing condition with the new one.

We shouldn't just replace the existing list of conditions with a new one, there may be additional conditions of different types added later that would get removed. This may be one of the reasons why the tests have had to be modified to bump up the number of status changes

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified logic to update the conditions.

Copy link
Contributor

@gyfora gyfora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel that this PR is a little half baked at this point. I understand the general motivation of introducing a Condition field but we should reuse existing logic like the resource lifecycle state to populate it.

Setting this somewhat arbitrarily based on the deployment state makes it inconsistent between application and session jobs etc.

import java.util.Date;

/** Status of CR. */
public class CommonCRStatus {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very confusing class name. We already have CommonStatus which is actually part of the status (superclass). This class here is just a utility class for creating Condition objects. So maybe FlinkConditions or ConditionUtils would be a better name.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using ConditionUtils

Comment on lines 29 to 41
public static Condition crReadyTrueCondition(final String message) {
return crCondition("Ready", "True", message, "Ready");
}

public static Condition crReadyFalseCondition(final String message) {
return crCondition("Ready", "False", message, "Progressing");
}

public static Condition crErrorCondition(final String message) {
return crCondition("Error", "True", message, "UnhandledException");
}

public static Condition crCondition(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we rename the class we can also rename this methods simply to: neady / notReady / error

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Method names has been changed

}

public static Condition crErrorCondition(final String message) {
return crCondition("Error", "True", message, "UnhandledException");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the message UnhandledException ?

@@ -227,4 +235,28 @@ private boolean validateDeployment(FlinkResourceContext<FlinkDeployment> ctx) {
}
return true;
}

private void setCRStatus(FlinkDeployment flinkApp) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of having to call setCrStatus can we move this logic to within the actual CommonStatus or FlinkDeploymentStatus so that it's computed automatically like the ResourceLifecycleState?

Also I think we could actually compute this directly from the resource lifecycle state instead of adding arbitrary new logic here

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made changes to populate Conditions in CommonStatus using resource lifecycle state

@lajith2006
Copy link
Author

Pushed the changes for re review.

Comment on lines +115 to +144
updateConditionIfNotExist(
conditions,
ConditionUtils.notReady(
"The resource was created in Kubernetes but not yet handled by the operator"));
break;
case SUSPENDED:
updateConditionIfNotExist(
conditions,
ConditionUtils.notReady("The resource (job) has been suspended"));
break;
case UPGRADING:
updateConditionIfNotExist(
conditions, ConditionUtils.notReady("The resource is being upgraded"));
break;
case DEPLOYED:
updateConditionIfNotExist(
conditions,
ConditionUtils.ready(
"The resource is deployed, but it’s not yet considered to be stable and might be rolled back in the future"));
break;
case ROLLING_BACK:
updateConditionIfNotExist(
conditions,
ConditionUtils.notReady(
"The resource is being rolled back to the last stable spec"));
break;
case ROLLED_BACK:
updateConditionIfNotExist(
conditions,
ConditionUtils.ready("The resource is deployed with the last stable spec"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a lot of duplicated code and copy-pasted strings from the ResourceLifeCycleState , I think we should add this conversion to the enum (or a utility class using the enum directly)

Comment on lines +164 to +172
if (conditions.stream()
.noneMatch(condition -> condition.getType().equals(newCondition.getType()))) {
conditions.add(newCondition);
} else if (conditions.removeIf(
condition ->
!(condition.getReason().equals(newCondition.getReason())
&& condition.getMessage().equals(newCondition.getMessage())))) {
conditions.add(newCondition);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please explain what this is supposed to do exactly (and also add some docs to the code)? I am a bit confused by the logic

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will add the docs to code. above logic is to make sure that rather to blindly replace any existing conditions with new one , check for existing condition with same type and replace only if the same condition type has different message.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I wrote in the last comment, I think we need a FLIP for this instead of fixing up this PR

}

public static Condition error(final String message) {
return crCondition("Error", "True", message, "The job terminally failed");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think error means that the job terminally failed in all cases. There are also operator side errors like reconciliation problems etc.

import java.text.SimpleDateFormat;
import java.util.Date;

/** Status of CR. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incorrect javadoc

.withMessage(message)
.withReason(reason)
.withLastTransitionTime(
new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'").format(new Date()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't Instant.now().toString() work?

}

public static Condition notReady(final String message) {
return crCondition("Ready", "False", message, "Progressing");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why we arbitrarily chose to have Ready and Error conditions? Why not simply use the ResourceLifecycleState name as the type and description as the message?

.withType(type)
.withStatus(status)
.withMessage(message)
.withReason(reason)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need a reason, I suggest let's just remove it.

Copy link
Author

@lajith2006 lajith2006 Jan 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per the doc , https://maelvls.dev/kubernetes-conditions/ , looks like we can keep reason as simple category of cause of the current status.

Reason is intended to be used in concise output, such as one-line kubectl get output, and in summarizing occurrences of causes, whereas Message is intended to be presented to users in detailed status explanations, such as kubectl describe output.

I was thinking probably we can keep that for any not ready and error conditions, and not required for ready condition?.

@gyfora
Copy link
Contributor

gyfora commented Jan 15, 2024

@lajith2006 @tagarr
I have been contemplating about this after reviewing the code and also read a bit more on the topic here: https://maelvls.dev/kubernetes-conditions/

To be honest I am not sure whether the current simplistic approach for representing a single status part (JobManagerDeplyomentStats / ResourceLifecycleState) makes too much sense. It doesn't really follow the semantics laid out by other Kubernetes resources like pod conditions etc.

I think it would make sense to open a FLIP and properly design the conditions that we should represent instead of just adding this feature to "tick a box"

@lajith2006
Copy link
Author

@gyfora do you want me to open a FLIP for this?.

@gyfora
Copy link
Contributor

gyfora commented Jan 16, 2024

@gyfora do you want me to open a FLIP for this?.

I think we need a FLIP with the design of the conditions yes. So that the community can discuss and vote on it as I said previously.

@lajith2006
Copy link
Author

Sure, I will open FLIP with design.

@lajith2006
Copy link
Author

Hi @gyfora I was referring the page https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals to open a FLIP with all details of what should be there in FLIP . But what I was trying to understand was that how to login to open FLIP, I guess as per the page https://cwiki.apache.org/confluence/display/FLINK/General+Information+for+Committers , we need to have apache id to login and create FLIP template?. Could you please let me know steps or any document I can refer to get into FLIP template to Open FLIP. Thank you in advance.

@gyfora
Copy link
Contributor

gyfora commented Feb 1, 2024

@lajith2006 do you already have a Jira account?

@lajith2006
Copy link
Author

@gyfora yes, I am having JIRA account using which I can login to https://issues.apache.org/jira/projects/FLINK/.

@gyfora
Copy link
Contributor

gyfora commented Feb 6, 2024

@gyfora yes, I am having JIRA account using which I can login to https://issues.apache.org/jira/projects/FLINK/.

okay then can you please tell me the account name? :D

@lajith2006
Copy link
Author

@gyfora yes, I am having JIRA account using which I can login to https://issues.apache.org/jira/projects/FLINK/.

okay then can you please tell me the account name? :D

account name : lajithk

@gyfora
Copy link
Contributor

gyfora commented Feb 6, 2024

@gyfora yes, I am having JIRA account using which I can login to https://issues.apache.org/jira/projects/FLINK/.

okay then can you please tell me the account name? :D

account name : lajithk

It seems like you need to create a confluence account (cwiki.apache.org) once you have that I can give you permissions to create a FLIP page

@lajith2006
Copy link
Author

@gyfora yes, I am having JIRA account using which I can login to https://issues.apache.org/jira/projects/FLINK/.

okay then can you please tell me the account name? :D

account name : lajithk

It seems like you need to create a confluence account (cwiki.apache.org) once you have that I can give you permissions to create a FLIP page

I have been checking on to create confluence account , https://cwiki.apache.org/confluence , it says for register go to Log in page, but don't see any option to register there in login page. On further digging noticed some thing like https://cwiki.apache.org/confluence/display/DIRxTRIPLESEC/User+Registration . Is that something I have to follow up ?. or are there any path I can look for registration?.

@lajith2006
Copy link
Author

@gyfora yes, I am having JIRA account using which I can login to https://issues.apache.org/jira/projects/FLINK/.

okay then can you please tell me the account name? :D

account name : lajithk

It seems like you need to create a confluence account (cwiki.apache.org) once you have that I can give you permissions to create a FLIP page

I have been checking on to create confluence account , https://cwiki.apache.org/confluence , it says for register go to Log in page, but don't see any option to register there in login page. On further digging noticed some thing like https://cwiki.apache.org/confluence/display/DIRxTRIPLESEC/User+Registration . Is that something I have to follow up ?. or are there any path I can look for registration?.

@gyfora , Could you please point me anyone I can reach out to get assistance on helping to get account created in https://cwiki.apache.org/confluence?. Thank you in advance.

@gyfora
Copy link
Contributor

gyfora commented Feb 21, 2024

@gyfora yes, I am having JIRA account using which I can login to https://issues.apache.org/jira/projects/FLINK/.

okay then can you please tell me the account name? :D

account name : lajithk

It seems like you need to create a confluence account (cwiki.apache.org) once you have that I can give you permissions to create a FLIP page

I have been checking on to create confluence account , https://cwiki.apache.org/confluence , it says for register go to Log in page, but don't see any option to register there in login page. On further digging noticed some thing like https://cwiki.apache.org/confluence/display/DIRxTRIPLESEC/User+Registration . Is that something I have to follow up ?. or are there any path I can look for registration?.

@gyfora , Could you please point me anyone I can reach out to get assistance on helping to get account created in https://cwiki.apache.org/confluence?. Thank you in advance.

Seems like confluence access for non-committers have been recently disabled. Based on a recent discussion we are adopting a new (somewhat simpler) process for new FLIPS: https://lists.apache.org/thread/rkpvlnwj9gv1hvx1dyklx6k88qpnvk2t

Contributors create a Google Doc and make that view-only, and post that Google Doc to the mailing list for a discussion thread. 
When the discussions have been resolved, the contributor ask on the Dev mailing list to a committer/PMC to copy the contents from the Google Doc, and create a FLIP number for them. 
The contributor can then use that FLIP to actually have a VOTE thread.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants