Fix race condition in PageListener #1351

Merged on Oct 24, 2024 (2 commits)

kaituo (Collaborator) commented Oct 23, 2024

Description

This PR:

  • Introduces an AtomicInteger called pagesInFlight to track the number of pages currently being processed.
  • Increments pagesInFlight before processing each page and decrements it after processing completes.
  • Changes the condition in scheduleImputeHCTask to require both pagesInFlight.get() == 0 (all pages have been processed) and sentOutPages.get() == receivedPages.get() (all responses have been received) before scheduling the imputeHC task.
  • Removes the previous final check in onResponse that decided when to schedule imputeHC, relying instead on the updated counters for synchronization.

These changes address a race condition in which sentOutPages might not yet have been incremented when the code checked whether to schedule the imputeHC task. By tracking both the in-flight pages and the sent pages, we ensure that imputeHC runs only after all pages have been fully processed and all responses have been received.
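
The gating condition described above can be sketched roughly as follows. This is a minimal illustration, not the plugin's code: the class PageTracker and its method names are hypothetical, and only the three counter names and the two-part condition come from the description.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical, simplified stand-in for the listener's bookkeeping (not the plugin's code).
class PageTracker {
    private final AtomicInteger pagesInFlight = new AtomicInteger();
    private final AtomicInteger sentOutPages = new AtomicInteger();
    private final AtomicInteger receivedPages = new AtomicInteger();

    // Call before any work on a page starts.
    void pageStarted() { pagesInFlight.incrementAndGet(); }

    // Call when a page's features have been sent out for processing.
    void requestSent() { sentOutPages.incrementAndGet(); }

    // Call when the response for a sent page arrives.
    void responseReceived() { receivedPages.incrementAndGet(); }

    // Call once local handling of the page is complete.
    void pageFinished() { pagesInFlight.decrementAndGet(); }

    // Both conditions from the description must hold before imputeHC may be scheduled.
    boolean readyForImputeHC() {
        return pagesInFlight.get() == 0 && sentOutPages.get() == receivedPages.get();
    }

    public static void main(String[] args) {
        PageTracker tracker = new PageTracker();
        tracker.pageStarted();
        System.out.println(tracker.readyForImputeHC()); // a page is still in flight
        tracker.requestSent();
        tracker.pageFinished();
        System.out.println(tracker.readyForImputeHC()); // a response is still outstanding
        tracker.responseReceived();
        System.out.println(tracker.readyForImputeHC()); // everything is accounted for
    }
}
```

Note that the two reads in readyForImputeHC are not a single atomic snapshot; the sketch only illustrates the condition and assumes callers re-check it after each counter update.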

Testing done:

  1. Reproduced the race condition by starting two detectors with imputation enabled. The race causes an out-of-order illegal argument exception from RCF. Verified that the change fixes the problem.
  2. Added an integration test (IT) for the above scenario.

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Kaituo Li <[email protected]>

codecov bot commented Oct 23, 2024

Codecov Report

Attention: Patch coverage is 85.71429% with 1 line in your changes missing coverage. Please review.

Project coverage is 80.00%. Comparing base (da73506) to head (2b497b9).
Report is 1 commit behind head on main.

Files with missing lines | Patch % | Lines
...ensearch/timeseries/transport/ResultProcessor.java | 85.71% | 0 Missing and 1 partial ⚠️
Additional details and impacted files

@@            Coverage Diff            @@
##               main    #1351   +/-   ##
=========================================
  Coverage     80.00%   80.00%           
- Complexity     5662     5673   +11     
=========================================
  Files           533      533           
  Lines         23429    23430    +1     
  Branches       2335     2334    -1     
=========================================
+ Hits          18745    18746    +1     
- Misses         3573     3578    +5     
+ Partials       1111     1106    -5     
Flag | Coverage Δ
plugin | 80.00% <85.71%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines | Coverage Δ
...imeseries/transport/ResultBulkTransportAction.java | 70.58% <ø> (ø)
...ensearch/timeseries/transport/ResultProcessor.java | 78.90% <85.71%> (+0.88%) ⬆️

... and 14 files with indirect coverage changes

}

@Override
public void onResponse(CompositeRetriever.Page entityFeatures) {
    // start processing next page after sending out features for previous page
    if (pageIterator.hasNext()) {
        pageIterator.next(this);
    } else if (config.getImputationOption() != null) {
        scheduleImputeHCTask();
Member

Why do we enter this branch first and only then increment pagesInFlight? Shouldn't we increment first?

Collaborator Author

Yes, changed.

Member

Actually, I am not 100% sure about this. Would pagesInFlight.get() == 0 then never be reached? I was also thinking about the first case: if the counter is 0, the check might pass right away.

Collaborator Author

Once we have finished processing all of the in-flight requests, pagesInFlight.get() == 0, right?
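
The ordering concern in this thread can be illustrated with a deterministic, single-threaded sketch. Everything here is hypothetical (the class IncrementOrderDemo, the method name, and the queue that stands in for a thread pool); it only shows why a counter incremented inside the queued task can still read 0 when a completion check runs between submission and execution.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo: simulates "submit a page, then check for completion before the task runs".
class IncrementOrderDemo {

    // Returns what a completion check (pagesInFlight == 0) sees after a page
    // has been queued but before its task has executed.
    static boolean looksIdleAfterSubmit(boolean incrementBeforeSubmit) {
        AtomicInteger pagesInFlight = new AtomicInteger();
        Queue<Runnable> pending = new ArrayDeque<>(); // stand-in for a thread pool's work queue

        if (incrementBeforeSubmit) {
            // Count the page first, then hand it off; the task only decrements.
            pagesInFlight.incrementAndGet();
            pending.add(() -> pagesInFlight.decrementAndGet());
        } else {
            // Late increment: the page is not counted until the task actually runs.
            pending.add(() -> {
                pagesInFlight.incrementAndGet();
                pagesInFlight.decrementAndGet();
            });
        }

        // The completion check happens here, while the task is still queued.
        boolean looksIdle = pagesInFlight.get() == 0;

        // Drain the queue so the counter returns to zero in both variants.
        while (!pending.isEmpty()) {
            pending.poll().run();
        }
        return looksIdle;
    }

    public static void main(String[] args) {
        System.out.println("late increment looks idle: " + looksIdleAfterSubmit(false));
        System.out.println("early increment looks idle: " + looksIdleAfterSubmit(true));
    }
}
```

With the increment placed before the hand-off, the queued page is visible to the check; the counter still reaches 0 once the task finishes, which matches the author's reply above.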

} else {
    // No entity features to process
    // Decrement pagesInFlight immediately
    pagesInFlight.decrementAndGet();
Member

Just to confirm: every page here comes from the feature aggregation results, and we want to make sure we have received every page and sent it out for processing (that is, sent the aggregated feature data to the correct model and called .process())? Then, afterwards, we check for each entity whether data was received, and send an impute call to place the imputed value?

Collaborator Author

Yes.
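
One way to picture the last step of the flow confirmed above is to select the entities that received no data in any page. The sketch below is an assumption-heavy illustration: the class ImputeSelection, the method name, and the representation of a page as a map from entity name to feature values are all hypothetical and are not taken from the plugin.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: finds entities that appeared in no page and would need an impute call.
class ImputeSelection {

    // knownEntities: entities the detector already tracks (assumed input).
    // pages: each page maps an entity name to its aggregated feature values (assumed shape).
    static Set<String> entitiesNeedingImpute(Set<String> knownEntities,
                                             List<Map<String, double[]>> pages) {
        Set<String> seen = new HashSet<>();
        for (Map<String, double[]> page : pages) {
            seen.addAll(page.keySet());
        }
        // Sorted copy so the result has a stable iteration order.
        Set<String> missing = new TreeSet<>(knownEntities);
        missing.removeAll(seen);
        return missing;
    }

    public static void main(String[] args) {
        Set<String> known = Set.of("host-a", "host-b", "host-c");
        List<Map<String, double[]>> pages = List.of(
            Map.of("host-a", new double[] { 1.0 }),
            Map.of("host-b", new double[] { 2.0 })
        );
        System.out.println(entitiesNeedingImpute(known, pages));
    }
}
```

This selection is only meaningful once every page has been processed, which is why the completion check has to hold before imputation is scheduled.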

@amitgalitz
Member

Overall, were we still processing the results of detector 1 when detector 2 had already finished and moved on to scheduleImputeHCTask?

@kaituo
Collaborator Author

kaituo commented Oct 24, 2024

> Overall, were we still processing the results of detector 1 when detector 2 had already finished and moved on to scheduleImputeHCTask?

scheduleImputeHCTask is detector-specific, so when detector 2 finishes, its scheduleImputeHCTask starts regardless of detector 1.
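
A rough way to picture "detector-specific" is that each detector run owns its own counter, so one detector's progress never gates another's. The class DetectorRun and its methods below are hypothetical and only model that independence, not the plugin's actual structure.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical per-detector state: every detector run has its own in-flight counter.
class DetectorRun {
    private final String detectorId;
    private final AtomicInteger pagesInFlight = new AtomicInteger();

    DetectorRun(String detectorId) {
        this.detectorId = detectorId;
    }

    void pageStarted() { pagesInFlight.incrementAndGet(); }

    void pageFinished() { pagesInFlight.decrementAndGet(); }

    // Depends only on this detector's counter, never on another detector's.
    boolean canScheduleImputeHC() { return pagesInFlight.get() == 0; }

    String id() { return detectorId; }

    public static void main(String[] args) {
        DetectorRun detector1 = new DetectorRun("detector-1");
        DetectorRun detector2 = new DetectorRun("detector-2");

        detector1.pageStarted();  // detector 1 is still working on a page
        detector2.pageStarted();
        detector2.pageFinished(); // detector 2 is done with its page

        System.out.println(detector1.id() + " can schedule: " + detector1.canScheduleImputeHC());
        System.out.println(detector2.id() + " can schedule: " + detector2.canScheduleImputeHC());
    }
}
```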

@kaituo kaituo added the bug Something isn't working label Oct 24, 2024
@kaituo kaituo merged commit f62885a into opensearch-project:main Oct 24, 2024
21 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Oct 24, 2024
* Fix race condition in PageListener

Signed-off-by: Kaituo Li <[email protected]>

* make sure increment before schedule

Signed-off-by: Kaituo Li <[email protected]>

---------

Signed-off-by: Kaituo Li <[email protected]>
(cherry picked from commit f62885a)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
kaituo pushed a commit that referenced this pull request Oct 24, 2024
* Fix race condition in PageListener

* make sure increment before schedule



---------


(cherry picked from commit f62885a)

Signed-off-by: Kaituo Li <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2 participants