
Conversation

@hakuna-matatah
Contributor

To test some theories

@k8s-ci-robot added the do-not-merge/work-in-progress (Indicates that a PR should not merge because it is a work in progress.) label on Nov 4, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign hakman for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot requested review from dims and hakman on November 4, 2025 21:38
@k8s-ci-robot added the cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.) and size/XS (Denotes a PR that changes 0-9 lines, ignoring generated files.) labels on Nov 4, 2025
@hakuna-matatah
Contributor Author

/test presubmit-kops-aws-scale-amazonvpc-using-cl2

@hakuna-matatah
Contributor Author

/test presubmit-kops-aws-scale-amazonvpc-using-cl2

1 similar comment
@hakuna-matatah
Contributor Author

/test presubmit-kops-aws-scale-amazonvpc-using-cl2

@hakuna-matatah
Contributor Author

/test presubmit-kops-aws-scale-amazonvpc-using-cl2

1 similar comment
@hakuna-matatah
Contributor Author

/test presubmit-kops-aws-scale-amazonvpc-using-cl2

@k8s-ci-robot added the size/S (Denotes a PR that changes 10-29 lines, ignoring generated files.) label and removed the size/XS (Denotes a PR that changes 0-9 lines, ignoring generated files.) label on Dec 19, 2025
@hakuna-matatah
Contributor Author

/test presubmit-kops-aws-small-scale-amazonvpc-using-cl2

2 similar comments
@hakuna-matatah
Contributor Author

/test presubmit-kops-aws-small-scale-amazonvpc-using-cl2

@hakuna-matatah
Contributor Author

/test presubmit-kops-aws-small-scale-amazonvpc-using-cl2

@hakuna-matatah
Contributor Author

/test presubmit-kops-aws-scale-amazonvpc-using-cl2

@hakuna-matatah
Contributor Author

It seems that the tests (both presubmits and periodics) are now not even able to create clusters, @dims @hakman; they fail saying they are unable to get bucket details.

I1220 07:05:34.766288   18101 subnets.go:224] Assigned CIDR 10.0.128.0/18 to subnet us-east-2c
Error: error building complete spec: failed to get bucket details for "s3://k8s-infra-kops-discovery-d19a-20251220070534/scale-5000.periodic.test-cncf-aws.k8s.io": Could not retrieve location for AWS bucket k8s-infra-kops-discovery-d19a-20251220070534
Error: error building complete spec: failed to get bucket details for "s3://k8s-infra-kops-discovery-8d16-20251220204549/e2e-fa029a0ba8-a2033.tests-kops-aws.k8s.io": Could not retrieve location for AWS bucket k8s-infra-kops-discovery-8d16-20251220204549

Do we know if these buckets still exist? I don't have access to the account to check this.
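
For reference, a minimal sketch of how one could check whether those buckets still exist and what location AWS reports for them, assuming the AWS SDK for Go v2 and credentials for the test account (GetBucketLocation is the likely source of the "Could not retrieve location" error, though that mapping is an assumption):

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := s3.NewFromConfig(cfg)

	// Bucket names taken from the error output above.
	for _, bucket := range []string{
		"k8s-infra-kops-discovery-d19a-20251220070534",
		"k8s-infra-kops-discovery-8d16-20251220204549",
	} {
		out, err := client.GetBucketLocation(ctx, &s3.GetBucketLocationInput{Bucket: aws.String(bucket)})
		if err != nil {
			// Typically NoSuchBucket if the bucket was deleted, or AccessDenied if it lives in another account.
			fmt.Printf("%s: %v\n", bucket, err)
			continue
		}
		fmt.Printf("%s: location %q\n", bucket, out.LocationConstraint)
	}
}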

@hakuna-matatah
Contributor Author

/test presubmit-kops-aws-small-scale-amazonvpc-using-cl2

@hakuna-matatah
Contributor Author

/test presubmit-kops-aws-scale-amazonvpc-using-cl2

1 similar comment
@hakuna-matatah
Contributor Author

/test presubmit-kops-aws-scale-amazonvpc-using-cl2

@hakman changed the title from "[WIP] Increase etcd iops" to "[WIP] test: Increase etcd IOPS for AWS scale jobs" on Dec 29, 2025
@hakman
Member

hakman commented Dec 29, 2025

@hakuna-matatah Looks like cluster validation passed with 5k nodes. I think this is mostly ready to merge.

@hakuna-matatah
Contributor Author

hakuna-matatah commented Dec 30, 2025

Looks like cluster validation passed with 5k nodes. I think this is mostly ready to merge.

Unfortunately, not yet. It appears the Prometheus stack failed to set up; we need to understand why. It also appears the job has somehow been stuck in the running state for the last 20 hours and hasn't dumped the logs yet - https://gcsweb.k8s.io/gcs/kubernetes-ci-logs/pr-logs/pull/kops/17741/presubmit-kops-aws-scale-amazonvpc-using-cl2/2005523352457842688/

W1229 07:09:21.681067   58568 util.go:72] error while calling prometheus api: the server is currently unable to handle the request (get services http:prometheus-k8s:9090), response: Status Failure: no endpoints available for service "prometheus-k8s" (ServiceUnavailable)
F1229 07:09:21.685136   58568 clusterloader.go:335] Error while setting up prometheus stack: timed out waiting for the condition
exit status 1
F1229 07:09:22.092900   53903 cl2.go:161] failed to run clusterloader2 tester: exit status 1

Will re-run to see if it kills the old run and whether the Prometheus stack setup failure is consistent ^^^; it's hard to debug when there are no logs explaining why the Prometheus stack failed to set up.
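
If the setup failure recurs, a minimal client-go sketch like the one below can confirm whether the prometheus-k8s Service ever gets ready endpoints; the monitoring namespace is an assumption based on the usual kube-prometheus layout deployed by clusterloader2:

package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The ServiceUnavailable error above means this Endpoints object has no ready addresses.
	ep, err := client.CoreV1().Endpoints("monitoring").Get(context.Background(), "prometheus-k8s", metav1.GetOptions{})
	if err != nil {
		log.Fatalf("get endpoints: %v", err)
	}
	ready := 0
	for _, subset := range ep.Subsets {
		ready += len(subset.Addresses)
	}
	fmt.Printf("prometheus-k8s ready endpoints: %d\n", ready)
}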

@hakuna-matatah
Contributor Author

/test presubmit-kops-aws-scale-amazonvpc-using-cl2

@hakuna-matatah
Contributor Author

hakuna-matatah commented Dec 30, 2025

@ameukam @BenTheElder It looks like the prow job has been stuck in the running state for the last 20 hours - https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kops/17741/presubmit-kops-aws-scale-amazonvpc-using-cl2/2005523352457842688

I vaguely remember this happening in the past, and there was an infra fix for it. Do you happen to know if it has regressed?

@k8s-ci-robot
Contributor

@hakuna-matatah: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
presubmit-kops-gce-small-scale-ipalias-using-cl2 e91dee9 link true /test presubmit-kops-gce-small-scale-ipalias-using-cl2
pull-kops-kubernetes-e2e-ubuntu-gce-build e91dee9 link false /test pull-kops-kubernetes-e2e-ubuntu-gce-build
presubmit-kops-aws-scale-amazonvpc-using-cl2 e91dee9 link false /test presubmit-kops-aws-scale-amazonvpc-using-cl2

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@hakman
Member

hakman commented Dec 30, 2025

Looks like cluster validation passed with 5k nodes. I think this is mostly ready to merge.

Unfortunately, not yet. It appears the Prometheus stack failed to set up; we need to understand why. It also appears the job has somehow been stuck in the running state for the last 20 hours and hasn't dumped the logs yet - https://gcsweb.k8s.io/gcs/kubernetes-ci-logs/pr-logs/pull/kops/17741/presubmit-kops-aws-scale-amazonvpc-using-cl2/2005523352457842688/

Will re-run to see if it kills the old run and whether the Prometheus stack setup failure is consistent ^^^; it's hard to debug when there are no logs explaining why the Prometheus stack failed to set up.

I remember @upodroid investigating this; it was related to perf-tests. Not sure why it's so frequent these days.

@upodroid
Member

Double-check whether the PVC of the Prometheus pod is up and running.

Also, we should be running 100-node jobs every other hour, like we do for GCE.
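
A minimal sketch of that PVC check (the monitoring namespace is again an assumption based on the standard kube-prometheus layout):

package main

import (
	"context"
	"fmt"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	pvcs, err := client.CoreV1().PersistentVolumeClaims("monitoring").List(context.Background(), metav1.ListOptions{})
	if err != nil {
		log.Fatalf("list PVCs: %v", err)
	}
	for _, pvc := range pvcs.Items {
		// A Prometheus PVC stuck in Pending (e.g. no matching StorageClass or volume) would explain the missing endpoints.
		if pvc.Status.Phase != corev1.ClaimBound {
			fmt.Printf("PVC %s is %s (not Bound)\n", pvc.Name, pvc.Status.Phase)
		}
	}
}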

@hakuna-matatah
Contributor Author

hakuna-matatah commented Dec 30, 2025

Double-check whether the PVC of the Prometheus pod is up and running.

I think even for this we would need API server audit logs.

It appears that the Prometheus stack setup went fine in the last test, but the test itself failed due to API SLO breaches.

<failure type="Failure">:0 [measurement call APIResponsivenessPrometheus - APIResponsivenessPrometheus error: top 
latency metric: there should be no high-latency requests, but: [got: &{Resource:events Subresource: Verb:DELETE 
Scope:namespace Latency:perc50: 1m0s, perc90: 1m0s, perc99: 1m0s Count:120 SlowCount:68}; expected perc99 <= 30s]] :0</failure>

Unfortunately, we don't have API server audit logs with the kOps setup to debug where the latency is coming from. I wonder whether requests are waiting in the API Priority and Fairness (AP&F) queue for a long time before executing and thus breaching the SLO, or whether etcd is the culprit.
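
Since Prometheus did come up in that run, one way to narrow this down without audit logs would be to compare AP&F queue wait time against etcd-side latency around the failure window. A rough sketch, assuming a port-forward to the prometheus-k8s Service on localhost:9090 and using the standard kube-apiserver histograms apiserver_flowcontrol_request_wait_duration_seconds and etcd_request_duration_seconds (queries here are aggregated; label filters could narrow them to the slow verb/resource):

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Address assumes e.g. `kubectl port-forward svc/prometheus-k8s 9090` is running.
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		log.Fatal(err)
	}
	papi := promv1.NewAPI(client)

	queries := map[string]string{
		// p99 time requests spend queued by API Priority and Fairness.
		"apf wait p99": `histogram_quantile(0.99, sum(rate(apiserver_flowcontrol_request_wait_duration_seconds_bucket[5m])) by (le))`,
		// p99 etcd request latency as observed by the apiserver.
		"etcd p99": `histogram_quantile(0.99, sum(rate(etcd_request_duration_seconds_bucket[5m])) by (le))`,
	}
	for name, q := range queries {
		val, _, err := papi.Query(context.Background(), q, time.Now())
		if err != nil {
			log.Fatalf("%s: %v", name, err)
		}
		fmt.Printf("%s: %v\n", name, val)
	}
}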
