-
Notifications
You must be signed in to change notification settings - Fork 637
CASSGO-50 Endless query execution fix proposal #1875
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
a11640f
to
531ffa0
Compare
@joao-r-reis @jameshartig can you please take a look at this? |
Hmm I don't like the idea of introducing a dependency on the number of connection pools but wrapping I actually think it's fine for the driver to assume that the host selection policy is behaving correctly (i.e. not returning the same host over and over) but we still have the issue that in #1793 we added an internal policy that does this (returns the same host everytime the function is called). I've commented this before but I truly think the best option would be to make the internal host selection policy added in #1793 return the host only once even if that comes at a cost of having no retries or speculative executions when the user uses this feature. Later down the line if we see that there is demand for retries/speculative executions when the featured added in #1793 is used then we can reevaluate. One potential alternative if we absolutely want to keep retries when a specific host is being targeted by the query is to add a special flag so that the query executor treats this host selection policy differently (i.e. break from the loop immediately when the returned host is nil or downed). |
@tengu-alt do you have time to address this? We're trying to resolve the pending PRs for the 2.0 release |
531ffa0
to
8ea1c0b
Compare
Sorry for the late response. I've pushed another approach according to your suggestions. |
8ea1c0b
to
75718a1
Compare
query_executor.go
Outdated
// we return nil to avoid endless query execution in queryExecutor.do() | ||
if !ok || !pool.host.IsUp() { | ||
return nil | ||
if host != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To ensure this hostIter
can be safely called concurrently I think we should make the outer variable an int
and then use atomic.CompareAndSwap
in this if
returnedHostOnce := 0
hostIter = func() SelectedHost {
if atomic.CompareAndSwapInt32(&returnedHostOnce, 0, 1) {
return return (*selectedHost)(host)
}
return nil
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood, refactored.
query_executor.go
Outdated
@@ -89,14 +89,18 @@ func (q *queryExecutor) executeQuery(qry ExecutableQuery) (*Iter, error) { | |||
// check if the host id is specified for the query, | |||
// if it is, the query should be executed at the corresponding host. | |||
if hostID := qry.GetHostID(); hostID != "" { | |||
pool, ok := q.pool.getPoolByHostID(hostID) | |||
if !ok || !pool.host.IsUp() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need this now that we return nil
after the first execution since this check is already being done in the query executor. We needed this before because we needed a way to prevent the hostIter
from returning the same unresponsive host over and over again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, changed.
bb18a5c
to
570c26e
Compare
policies.go
Outdated
@@ -323,6 +323,7 @@ func (host *selectedHost) Info() *HostInfo { | |||
func (host *selectedHost) Mark(err error) {} | |||
|
|||
// NextHost is an iteration function over picked hosts | |||
// Should return nil if SelectedHost is not Up to prevent endless query execution. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the driver checks if the host is not up, the function should "eventually" return nil regardless of what causes it.
// Should return nil eventually to prevent endless query execution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed.
pool, ok := q.pool.getPoolByHostID(hostID) | ||
if !ok { | ||
return nil, ErrNoConnections | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove lines 93-96, it's not needed as I explained before
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't clearly understand: we need the pool to get the host from it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh sorry hmm I think it would make more sense to just get the host from the ring here, I didn't even realize it was only getting the pool to get the host before.
Can you replace this with a call to session.ring.getHost(id)
? Maybe change getHost
to return host, bool
so that we can check whether the lookup failed here and return an error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, refactored
if !ok || !pool.host.IsUp() { | ||
return nil | ||
if atomic.CompareAndSwapInt32(&returnedHostOnce, 0, 1) { | ||
return (*selectedHost)(host) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should add a test to ensure that if a host id is set on the query then it only gets tried once even if it fails
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I checked, TestQuery_SetHostID
in the cassandra_test.go
already covers this case.
570c26e
to
962d4d0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 nice, can you add to the changelog and format the commit message? Then I'll merge
Fix for the endless query execution when HostSelectionPolicy returns the same downed host. Internal HostSelectionPolicy will return the host only once if HostID is set. Documentation for external policies was added. patch by Oleksandr Luzhniy; reviewed by João Reis, for CASSGO-50
962d4d0
to
c66da4a
Compare
Done. |
I've researched possible endless query execution when
HostSelectionPolicy
returns the same downed host (CASSGO-50) and propose three approaches to prevent it.hostIter
to prevent it from returning downed host before executing the query.e.g.
But that approach can cause breaking changes for applications which already relies on user-defined policies that can return downed hosts.
Properly documented
type NextHost func()
:Add a comment for
NextHost
type describing thatNextHost
should return not downedhost or return nilAdd the threshold limit based on the number of hosts inside
(q *queryExecutor) do
which will prevent the endless execution without the breaking changesImplementations of the threshold and documentation are presented in this PR.