BUG(go client): when the go client is writing to one partition and the replica node core dumps, the go client finishes after the timeout without updating the configuration. #1856

Open
lengyuexuexuan opened this issue Jan 16, 2024 · 2 comments
Labels
type/bug This issue reports a bug.

Comments

@lengyuexuexuan
Collaborator

In the code, when a replica core dumps, the function loopForResponse() returns nil.
The caller then stays blocked in the function CallWithGpid() until the request times out.

Why not update the table configuration and retry the previous operation when this happens?

@lengyuexuexuan lengyuexuexuan added the type/bug This issue reports a bug. label Jan 16, 2024
@acelyc111
Member

@lengyuexuexuan Thanks for the feedback, could you please submit a patch to fix it?

@lengyuexuexuan
Collaborator Author

> @lengyuexuexuan Thanks for the feedback, could you please submit a patch to fix it?

OK. No problem.

empiredan pushed a commit that referenced this issue Jun 20, 2024
…to primary meta server if it was changed (#1916)

#1880
#1856

As for #1856:
When the go client is writing to one partition and the replica node core dumps, the go client
finishes after the timeout without updating the configuration. In this case, the only way to
recover is to restart the go client.

In this PR, the client updates the table configuration automatically when a replica core dumps.
Testing showed that the error returned when the replica core dumps is "context.DeadlineExceeded"
(incubator-pegasus/go-client/pegasus/table_connector.go).

Therefore, when the client encounters this error, it updates the configuration automatically.
The failed request itself is not retried. The configuration is only updated after a timeout has
occurred, so a retry issued before the update finishes would still fail, and there is also a risk
of infinite retries. It is therefore better to return the request error directly to the user and
let the user try again.

As for #1880:
When the client sends the RPC message "RPC_CM_QUERY_PARTITION_CONFIG_BY_INDEX" to a meta
server that is not the primary, that meta server returns a response that forwards the client to
the primary meta server.

Based on this behaviour, even if the primary meta server is not in the client's configuration,
the client can still connect to it by following the forward.
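
A minimal sketch of following such a forward is shown below. Everything in it is an assumption made for illustration: `queryResp`, its `ForwardTo` field, `queryFunc`, and `queryPrimary` are hypothetical names, the addresses are placeholders, and the real response format of the meta server is not reproduced here.

```go
package main

import (
	"errors"
	"fmt"
)

// queryResp is a hypothetical, simplified response to a partition-config query.
type queryResp struct {
	ForwardTo  string // non-empty when the queried meta server is not the primary
	Partitions int
}

// queryFunc sends the query to one meta server address.
type queryFunc func(addr string) (queryResp, error)

// queryPrimary tries each configured meta server in turn. When a server answers
// with a forward hint, the hint is followed once, even if the hinted address is
// not in the configured list.
func queryPrimary(metas []string, query queryFunc) (string, queryResp, error) {
	for _, addr := range metas {
		resp, err := query(addr)
		if err != nil {
			continue // unreachable server: try the next configured one
		}
		if resp.ForwardTo == "" {
			return addr, resp, nil // this server is the primary
		}
		fwd, err := query(resp.ForwardTo)
		if err == nil && fwd.ForwardTo == "" {
			return resp.ForwardTo, fwd, nil
		}
	}
	return "", queryResp{}, errors.New("no primary meta server reachable")
}

func main() {
	const primary = "127.0.0.1:34601" // placeholder address, not in the client config

	// Fake cluster: only the primary answers the query, the others forward to it.
	query := func(addr string) (queryResp, error) {
		if addr == primary {
			return queryResp{Partitions: 8}, nil
		}
		return queryResp{ForwardTo: primary}, nil
	}

	addr, resp, err := queryPrimary([]string{"127.0.0.1:34602", "127.0.0.1:34603"}, query)
	fmt.Println(addr, resp.Partitions, err)
}
```

Following the hint only once per configured server keeps the lookup bounded if two servers were ever to point at each other.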

About tests:
1. Start onebox without adding the primary meta server to the go client configuration.
2. Have the go client write data to a certain partition, then kill the replica process.
ruojieranyishen pushed a commit to ruojieranyishen/incubator-pegasus that referenced this issue Jul 17, 2024
…to primary meta server if it was changed (apache#1916)
