Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add resume-ci label #43929

Closed
wants to merge 1 commit into from
Closed

add resume-ci label #43929

wants to merge 1 commit into from

Conversation

MoLow
Copy link
Member

@MoLow MoLow commented Jul 21, 2022

fixes #40817
awaiting merge of nodejs/node-core-utils#642

@nodejs-github-bot
Copy link
Collaborator

Review requested:

  • @nodejs/actions
  • @nodejs/tsc

@nodejs-github-bot nodejs-github-bot added meta Issues and PRs related to the general management of the project. tools Issues and PRs related to the tools directory. labels Jul 21, 2022
@MoLow MoLow requested a review from benjamingr July 21, 2022 12:14
@targos
Copy link
Member

targos commented Jul 21, 2022

I'm very reluctant about adding this because I think it's a mistake to resume CI without looking at the failures, and if you are looking at the failures you can just click on the Jenkins button.

@aduh95
Copy link
Contributor

aduh95 commented Jul 21, 2022

I'm very reluctant about adding this because I think it's a mistake to resume CI without looking at the failures, and if you are looking at the failures you can just click on the Jenkins button.

Unless you are not a collaborator (e.g. a trigger), in which case you cannot resume using Jenkins CI. I agree with the sentiment though.

@MoLow
Copy link
Member Author

MoLow commented Jul 21, 2022

I agree, CI should only be resumed if a human has made sure the specific flakiness existed before that PR - this is a tool that can help triaggers do something they currently cannot do

this was already discussed in the original issue #40817 (comment)

@MoLow
Copy link
Member Author

MoLow commented Jul 21, 2022

also, as for my understanding, TSC has addressed (part) of this discussion as well? #42125 (comment)
https://github.com/mhdawson/TSC/blob/a52f5bb892d986e470661c15635d79d384302dd1/meetings/2022-03-10.md#nodejsnode

@targos
Copy link
Member

targos commented Jul 21, 2022

Was it considered to allow triagers to access the Jenkins feature?

@tniessen
Copy link
Member

Was it considered to allow triagers to access the Jenkins feature?

IIRC that is the intention here: not to make resuming CIs simpler in general, but as a workaround to give this particular permission to triagers.

@targos
Copy link
Member

targos commented Jul 21, 2022

I mean, do we really need a workaround, rather than giving them the permission inside Jenkins?

@MoLow
Copy link
Member Author

MoLow commented Jul 21, 2022

I mean, do we really need a workaround, rather than giving them the permission inside Jenkins?

If that is possible that sounds like a better solution to me

Copy link
Member

@mcollina mcollina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow, amazing

@GeoffreyBooth
Copy link
Member

I'm very reluctant about adding this because I think it's a mistake to resume CI without looking at the failures, and if you are looking at the failures you can just click on the Jenkins button.

If I didn't need to resume CI repeatedly for every PR, I would sympathize with this. But CI is much too flaky for looking at errors to be worth my time until after I've resumed at least three, maybe five times. It should resume by default and stop after 3-5 attempts.

Copy link
Member

@benjamingr benjamingr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not ideal that we have to do this but I do think this is the correct thing to do given the circumstances.

@tniessen
Copy link
Member

But CI is much too flaky for looking at errors to be worth my time until after I've resumed at least three, maybe five times. It should resume by default and stop after 3-5 attempts.

That is exactly how I assume #43522 was merged, and it has made CI much, much worse for all collaborators, to the point where it was virtually unusable for days.

It should resume by default and stop after 3-5 attempts.

Resuming CI without looking at errors makes it more likely to miss related failures and thus to introduce new flaky tests, which makes the situation worse for everyone, beyond that PR.

Let's say a PR introduces a test that flakes 50 % of the time. Running one CI and checking for errors gives you a 50 % chance of catching it. Resuming CI without checking what errors occurred reduces the chances of catching the flaky test exponentially!

I can't find it right now, but there was a PR a while ago that implemented something like this (i.e., automated resuming or something similar), which I was against for the same reason.


There are other approaches that might be worth investigating. For example, the test runner could, when a test fails, re-run it n times and count how many times it fails. This can be used to estimate whether a test flaked or whether it failed deterministically, but CI should still fail. If a test is marked as flaky, regardless of whether it passes, the test runner could run it n times and see if at least one run passes, to make sure it really is flaky and not failing every time. (But this would need some experimentation.)

@mcollina
Copy link
Member

The problem is more profound: it's virtually impossible to get a green CI without resuming. I currently have 7 PRs that I'm restarting CI every day.

Something that would be extremely helpful is to get the list of failed tests as a PR comment.

We could have a different approach:

  1. mark all tests that fails at least once a day as flaky
  2. only run a reduced set of very reliable tests on the platforms that are more likely to fail (windows, arm, smartos, aix). This can be based on the support tier

@tniessen
Copy link
Member

I completely agree @mcollina, we are in a tough spot right now. All I am saying is that resuming without properly checking for errors only makes things worse.

mark all tests that fails at least once a day as flaky

Big +1 as long as we treat it as an urgent TODO list. (Essentially what I wrote in #43754 (comment).)

aduh95 added a commit to aduh95/node that referenced this pull request Jul 23, 2022
nodejs-github-bot pushed a commit that referenced this pull request Jul 25, 2022
Refs: #43929 (comment)

PR-URL: #43954
Reviewed-By: Matteo Collina <[email protected]>
Reviewed-By: Tobias Nießen <[email protected]>
Reviewed-By: Feng Yu <[email protected]>
danielleadams pushed a commit that referenced this pull request Jul 26, 2022
Refs: #43929 (comment)

PR-URL: #43954
Reviewed-By: Matteo Collina <[email protected]>
Reviewed-By: Tobias Nießen <[email protected]>
Reviewed-By: Feng Yu <[email protected]>
targos pushed a commit that referenced this pull request Jul 31, 2022
Refs: #43929 (comment)

PR-URL: #43954
Reviewed-By: Matteo Collina <[email protected]>
Reviewed-By: Tobias Nießen <[email protected]>
Reviewed-By: Feng Yu <[email protected]>
targos pushed a commit that referenced this pull request Jul 31, 2022
Refs: #43929 (comment)

PR-URL: #43954
Reviewed-By: Matteo Collina <[email protected]>
Reviewed-By: Tobias Nießen <[email protected]>
Reviewed-By: Feng Yu <[email protected]>
targos pushed a commit that referenced this pull request Aug 1, 2022
Refs: #43929 (comment)

PR-URL: #43954
Reviewed-By: Matteo Collina <[email protected]>
Reviewed-By: Tobias Nießen <[email protected]>
Reviewed-By: Feng Yu <[email protected]>
@kvakil
Copy link
Contributor

kvakil commented Aug 5, 2022

I think that this PR would be a great change. But I also think that request-ci should always do the right thing: if there are no commits since the last CI, it should retry the CI. If there is an intervening commit, it should start a new build. People shouldn't need to think about which is correct. #44130 clears up some of the documentation here but I think it's hard to solve UX issues with documentation.

(signed, an idiot who didn't realize request and retry were different, and so requested too many builds)

@MoLow
Copy link
Member Author

MoLow commented Aug 5, 2022

@kvakil that sounds a great improvement! but:

  1. I feel this PR is currently too controversial to land, without addressing some of the feedback. Perhaps if we add your suggestion with a limit that after 3 times, it will comment on the PR - stating that "resume must be done manually after three failures." WDYT?
  2. The PR in ncu needs an approval first. It seems to me like it is ok merging it since running a cli command won't be abused easily as a label, can someone approve and merge it? CC @nodejs/node-core-utils

@kvakil
Copy link
Contributor

kvakil commented Aug 5, 2022 via email

@MoLow
Copy link
Member Author

MoLow commented Aug 5, 2022

@kvakil adding a request-ci label is still too automatic in the sence that CI failures should not be ignored, and need minimal inspection.
See #43929 (comment)

@aduh95
Copy link
Contributor

aduh95 commented Aug 5, 2022

I just think the user experience would be better.

Currently the user experience is to go to the Jenkins Web UI to check what are the failures, and if it turns out the failures are indeed unrelated to the PR to test, the "Resume build" button is right there on Jenkins UI, there's no reason to go back to GitHub to add a label. What I'm trying to say is this label is not meant to improve the UX (it won't), it's to enable triagers to resume CIs, which they currently can't.

People shouldn't need to think about which is correct.

That's exactly what's controversial about this PR: some are concerned that adding this label would make collaborators/triagers less likely to think about the CI failures and instead re-apply the label without checking the failures until they get a passing CI – in particular, it would enable the landing of PRs that introduce flaky tests, making the CI even less reliable than they currently are.

  1. Perhaps if we add your suggestion with a limit that after 3 times, it will comment on the PR - stating that "resume must be done manually after three failures." WDYT?

That'd be a very nice feature indeed! I would go further: only accept 1 request-ci and one resume-ci when no new commits have landed; with clear error messages it could help educate people on the specificities of our CI system, and it addresses the concern that folks could use the system to land flaky PRs.
(Lately you rarely need more than one resume to get a passing CI anyway).

@MoLow
Copy link
Member Author

MoLow commented Aug 5, 2022

That'd be a very nice feature indeed! I would go further: only accept 1 request-ci and one resume-ci when no new commits have landed; with clear error messages it could help educate people on the specificities of our CI system, and it addresses the concern that folks could use the system to land flaky PRs. (Lately you rarely need more than one resume to get a passing CI anyway).

yes, I will probably wait for my nomination to complete so my jenkins token will actually work :)

@kvakil
Copy link
Contributor

kvakil commented Aug 5, 2022

just think the user experience would be better.

Currently the user experience is to go to the Jenkins Web UI to check what are the failures, and if it turns out the failures are indeed unrelated to the PR to test, the "Resume build" button is right there on Jenkins UI, there's no reason to go back to GitHub to add a label. What I'm trying to say is this label is not meant to improve the UX (it won't), it's to enable triagers to resume CIs, which they currently can't.

The particular UX I dislike here is having resume-ci and request-ci which to me sound very similar. I think I understand the concerns around having one label better now, thanks for elaborating.

@MoLow MoLow closed this Sep 11, 2022
Fyko pushed a commit to Fyko/node that referenced this pull request Sep 15, 2022
Refs: nodejs#43929 (comment)

PR-URL: nodejs#43954
Reviewed-By: Matteo Collina <[email protected]>
Reviewed-By: Tobias Nießen <[email protected]>
Reviewed-By: Feng Yu <[email protected]>
guangwong pushed a commit to noslate-project/node that referenced this pull request Oct 10, 2022
Refs: nodejs/node#43929 (comment)

PR-URL: nodejs/node#43954
Reviewed-By: Matteo Collina <[email protected]>
Reviewed-By: Tobias Nießen <[email protected]>
Reviewed-By: Feng Yu <[email protected]>
aduh95 added a commit to aduh95/node that referenced this pull request Oct 22, 2022
Refs: nodejs#43929 (comment)

PR-URL: nodejs#43954
Reviewed-By: Matteo Collina <[email protected]>
Reviewed-By: Tobias Nießen <[email protected]>
Reviewed-By: Feng Yu <[email protected]>
danielleadams pushed a commit that referenced this pull request Oct 26, 2022
Refs: #43929 (comment)

PR-URL: #43954
Backport-PR-URL: #45126
Reviewed-By: Matteo Collina <[email protected]>
Reviewed-By: Tobias Nießen <[email protected]>
Reviewed-By: Feng Yu <[email protected]>
@MoLow MoLow deleted the add-resume-ci-label branch May 24, 2024 09:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
meta Issues and PRs related to the general management of the project. tools Issues and PRs related to the tools directory.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a resume-ci label to issues?
9 participants