Skip to content

CASSGO-50 Endless query execution fix proposal #1875

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 19, 2025

Conversation

tengu-alt
Copy link
Contributor

@tengu-alt tengu-alt commented Apr 8, 2025

I've researched possible endless query execution when HostSelectionPolicy returns the same downed host (CASSGO-50) and propose three approaches to prevent it.

  • Wrap hostIter to prevent it from returning downed host before executing the query.
    e.g.
  hostIter = func() SelectedHost {
    checkIter := hostIter()
    if !checkIter.Info().IsUp() {
      return nil
    }
    return checkIter
  }

But that approach can cause breaking changes for applications which already relies on user-defined policies that can return downed hosts.

  • Properly documented type NextHost func():
    Add a comment for NextHost type describing that NextHost should return not downedhost or return nil

  • Add the threshold limit based on the number of hosts inside (q *queryExecutor) do which will prevent the endless execution without the breaking changes

Implementations of the threshold and documentation are presented in this PR.

@tengu-alt tengu-alt force-pushed the fix-endless-execution branch from a11640f to 531ffa0 Compare April 8, 2025 14:09
@tengu-alt
Copy link
Contributor Author

@joao-r-reis @jameshartig can you please take a look at this?

@joao-r-reis
Copy link
Contributor

Hmm I don't like the idea of introducing a dependency on the number of connection pools but wrapping hostIter won't solve the issue because whether the host is down or nil doesn't matter much, it will lead to the same behavior...

I actually think it's fine for the driver to assume that the host selection policy is behaving correctly (i.e. not returning the same host over and over) but we still have the issue that in #1793 we added an internal policy that does this (returns the same host everytime the function is called).

I've commented this before but I truly think the best option would be to make the internal host selection policy added in #1793 return the host only once even if that comes at a cost of having no retries or speculative executions when the user uses this feature.

Later down the line if we see that there is demand for retries/speculative executions when the featured added in #1793 is used then we can reevaluate.

One potential alternative if we absolutely want to keep retries when a specific host is being targeted by the query is to add a special flag so that the query executor treats this host selection policy differently (i.e. break from the loop immediately when the returned host is nil or downed).

@joao-r-reis
Copy link
Contributor

@tengu-alt do you have time to address this? We're trying to resolve the pending PRs for the 2.0 release

@tengu-alt tengu-alt force-pushed the fix-endless-execution branch from 531ffa0 to 8ea1c0b Compare May 7, 2025 10:32
@tengu-alt
Copy link
Contributor Author

tengu-alt commented May 7, 2025

@tengu-alt do you have time to address #1875 (comment)? We're trying to resolve the pending PRs for the 2.0 release

Sorry for the late response. I've pushed another approach according to your suggestions.
I propose the deletion of the pool after we got it using the getPoolByHostID, and keep the documentation about policies correct behavior.

@tengu-alt tengu-alt force-pushed the fix-endless-execution branch from 8ea1c0b to 75718a1 Compare May 8, 2025 11:36
// we return nil to avoid endless query execution in queryExecutor.do()
if !ok || !pool.host.IsUp() {
return nil
if host != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To ensure this hostIter can be safely called concurrently I think we should make the outer variable an int and then use atomic.CompareAndSwap in this if

		returnedHostOnce := 0
		hostIter = func() SelectedHost {
			if atomic.CompareAndSwapInt32(&returnedHostOnce, 0, 1) {
                            return return (*selectedHost)(host)
                        }
                        return nil
                }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood, refactored.

@@ -89,14 +89,18 @@ func (q *queryExecutor) executeQuery(qry ExecutableQuery) (*Iter, error) {
// check if the host id is specified for the query,
// if it is, the query should be executed at the corresponding host.
if hostID := qry.GetHostID(); hostID != "" {
pool, ok := q.pool.getPoolByHostID(hostID)
if !ok || !pool.host.IsUp() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this now that we return nil after the first execution since this check is already being done in the query executor. We needed this before because we needed a way to prevent the hostIter from returning the same unresponsive host over and over again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, changed.

@tengu-alt tengu-alt force-pushed the fix-endless-execution branch 2 times, most recently from bb18a5c to 570c26e Compare May 9, 2025 10:12
policies.go Outdated
@@ -323,6 +323,7 @@ func (host *selectedHost) Info() *HostInfo {
func (host *selectedHost) Mark(err error) {}

// NextHost is an iteration function over picked hosts
// Should return nil if SelectedHost is not Up to prevent endless query execution.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the driver checks if the host is not up, the function should "eventually" return nil regardless of what causes it.

// Should return nil eventually to prevent endless query execution.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.

pool, ok := q.pool.getPoolByHostID(hostID)
if !ok {
return nil, ErrNoConnections
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove lines 93-96, it's not needed as I explained before

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't clearly understand: we need the pool to get the host from it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh sorry hmm I think it would make more sense to just get the host from the ring here, I didn't even realize it was only getting the pool to get the host before.
Can you replace this with a call to session.ring.getHost(id)? Maybe change getHost to return host, bool so that we can check whether the lookup failed here and return an error

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, refactored

if !ok || !pool.host.IsUp() {
return nil
if atomic.CompareAndSwapInt32(&returnedHostOnce, 0, 1) {
return (*selectedHost)(host)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should add a test to ensure that if a host id is set on the query then it only gets tried once even if it fails

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I checked, TestQuery_SetHostID in the cassandra_test.go already covers this case.

@tengu-alt tengu-alt force-pushed the fix-endless-execution branch from 570c26e to 962d4d0 Compare May 15, 2025 11:47
Copy link
Contributor

@joao-r-reis joao-r-reis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 nice, can you add to the changelog and format the commit message? Then I'll merge

Fix for the endless query execution when HostSelectionPolicy returns
the same downed host.
Internal HostSelectionPolicy will return the host only once if HostID is set.
Documentation for external policies was added.

patch by Oleksandr Luzhniy; reviewed by João Reis, for CASSGO-50
@tengu-alt tengu-alt force-pushed the fix-endless-execution branch from 962d4d0 to c66da4a Compare May 16, 2025 13:13
@tengu-alt
Copy link
Contributor Author

👍 nice, can you add to the changelog and format the commit message? Then I'll merge

Done.

@joao-r-reis joao-r-reis merged commit 05e6036 into apache:trunk May 19, 2025
72 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants