Skip to content

Conversation

@michaelkipper
Copy link

What

This PR attempts to address #244. It throttles the number of bulkhead tickets to 1 when the circuit opens, throttles to success_threshold when the circuit moves to half_open, and removes the throttle when the circuit finally closes.

Why

When the circuit is in an open state, we still allow all the workers that can acquire tickets to attempt the transition to half_open. When the circuit is half open, we allow all those workers to attempt the transition to closed, which can sap much needed capacity during a sustained outage.

Note: This PR is still a work in progress, and there are several corner cases that need to be addressed. I'm looking for a high-level review of the approach.

Issues

  • It's a little gross to couple the bulkhead and circuit breaker in this way, but I'm not sure how else to approach it.
  • Not sure how SEM_UNDO works when one process applies the throttle, and another removes it. Or what happens when the process that applied the throttle terminates. Needs to be thought through more.

cc: @Shopify/servcomm

@michaelkipper michaelkipper force-pushed the mkipper/global-circuit-breaker-throttle-half-open branch from 5e9b561 to 13a7241 Compare July 4, 2019 16:03
Copy link

@thegedge thegedge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you had any success with semsim to show the capacity before and after?

I'm also interested in how this behaves with a large error_timeout. In particular, we may have capacity improvements, but that would be at the expense for a slower recovery. In the worst case scenario, where all workers had their circuit open at the same time, n - 1 workers will wait (edit: at least) an additional error_timeout seconds after recovery. I think I got that right 🤔

void test_sliding_window()
{
semian_simple_sliding_window_shared_t window;
assert_le_int(sizeof(window), 4096, "window size is greater than a page");
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about sysconf(_SC_PAGESIZE) instead of 4096?

@michaelkipper
Copy link
Author

Have you had any success with semsim?

No, I don't think this works at all...
SI_SEM_CONFIGURED_TICKETS is basically a bookkeeping value. The real value we want to modify is SI_SEM_TICKETS.
The problem is that at the point when we most likely want to reduce SI_SEM_TICKETS to 1, it's already close to zero. If we attempt to do a -throttle operation, that worker would block until the rest of the workers timed out on their requests. Given that we know we're opening the circuit, it would be awesome if we could terminate those requests immediately, but my brief research into this suggests that we'd have to take down the worker with a SIGTERM which is likely a very expensive operation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants