Rate limit restructure (server-side) #933
base: master
Conversation
Ha. Interesting - I would have thought the ++ would have caused an error. Good to know!
It did! However, the test was wrong as well: because the sliding window was set to 0, no timestamps were actually stored for the peers. So that part of the code never ran until I started up a node and began to experiment with the config values.
…quickstart documentation
…seemingly causes tests to fail
humaite left a comment:
Looks okay to me at first glance. I will probably re-read it later.
gen_server:stop(LimiterRef).

%% gen_server callbacks
init([Args]) ->
Why use [Args] and not just Args?
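For context on the question, a minimal sketch of the difference (the start_link/1 wrapper shown here is illustrative): the second argument of gen_server:start_link/3 is handed to init/1 verbatim, so wrapping it in a list only changes the pattern init/1 must match.

```erlang
%% Wrapped form: init/1 must unwrap the single-element list.
start_link(Args) ->
    gen_server:start_link(?MODULE, [Args], []).

init([Args]) ->
    {ok, Args}.
%% With gen_server:start_link(?MODULE, Args, []) the callback would
%% simply be init(Args) -> {ok, Args}.
```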
TEST_PATH="$(./rebar3 as ${TEST_PROFILE} path)"

## TODO: Generate path for all apps -> Should we fetch this from somewhere?
APPS=("arweave" "arweave_config" "arweave_limiter" "arweave_diagnostic")
You could remove the bash-style array; it's not really portable and could cause trouble. In a POSIX shell, a list is just a string of words separated by spaces or null characters:

APPS="arweave arweave_config arweave_limiter arweave_diagnostic"

In this case, the code on L523 will simply look like this:

for app in ${APPS}
do
    ...
done

Using bash-specific features is usually not a good idea.
PARAMS="-pa ${TEST_PATH} ${TEST_PATH_BASE} -config ${TEST_CONFIG} -noshell"
TEST_PATH="$(./rebar3 as ${TEST_PROFILE} path)"

## TODO: Generate path for all apps -> Should we fetch this from somewhere?
In theory, those paths are generated by rebar3, but it seems that for tests this was not done correctly.
@@ -0,0 +1,69 @@
-module(ar_http_iface_rate_limiter_middleware).
Can you add a header with a bit of documentation to explain the role of this middleware?
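A minimal sketch of what such a header might look like, based on the middleware's role as discussed in this thread (the wording is illustrative):

```erlang
%%% @doc Cowboy middleware enforcing server-side rate limits.
%%% Runs before the request handlers: maps the requesting IP address and
%%% request path to a Rate Limiter Group, registers the request with the
%%% corresponding limiter process, and rejects the request when the
%%% limiter denies it.
-module(ar_http_iface_rate_limiter_middleware).
```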
config_peer_to_ip_addr({A, B, C, D, _}) -> {A, B, C, D}.

path_to_limiter_ref([<<"chunk">> | _]) -> chunk;
If I understand correctly, this function reads the path from the client request and returns the kind of limiter to use. A data structure like a proplist or a map would probably be easier to modify in the future. Not sure it would be easy to do right now, though.
The problem I see here is that if we add a new path, we will need to add it to this function (at least if it needs a special rate limit).
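For illustration, a map-based variant might look like the sketch below (the route table contents and the default atom are assumptions, not the PR's actual limiter set):

```erlang
%% Hypothetical lookup-table variant of path_to_limiter_ref/1.
-define(LIMITER_ROUTES, #{<<"chunk">> => chunk, <<"tx">> => tx}).

path_to_limiter_ref([Head | _]) ->
    maps:get(Head, ?LIMITER_ROUTES, default);
path_to_limiter_ref([]) ->
    default.
```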
Yes, it's certainly going to take some work to add a new Rate Limiter Group (RLG):

1. Add an RLG process to arweave_limiter_sup.
2. Add RLG config parameters as fields to the arweave_config config record.
   2b) Add defaults for the parameters.
3. Add Rate Limiter routing logic to ar_http_iface_rate_limiter_middleware.

IMHO, this mapping function is quite an idiomatic way to map paths to atoms (no dynamically named atoms, no need for lookups). See the sketch below.
- I'd argue that 2) is a bigger burden than 3) (extending this function).
- I'd like to point out that this PR doesn't aim to introduce dynamic RLGs.
- With a different approach to how we deal with the configuration, I think we can introduce RLGs started in a dynamic manner in the future, if necessary.
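For reference, the function-head style being defended looks like this; only the <<"chunk">> clause appears in the diff above, the other clauses are illustrative:

```erlang
%% Pattern matching on the first path segment; no dynamic atoms, no lookups.
path_to_limiter_ref([<<"chunk">> | _]) -> chunk;
path_to_limiter_ref([<<"tx">> | _]) -> tx;      %% illustrative
path_to_limiter_ref(_) -> default.              %% illustrative fallback
```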
We are not adding a lot of new endpoints, but perhaps documenting this procedure somewhere would be nice.
-export([start/0, ban_peer/2, is_peer_banned/1, cleanup_ban/1]).
-export([start_link/0]).

-ifdef(TEST).
Suggested change:
- -ifdef(TEST).
+ -ifdef(AR_TEST).
We've had some name collisions with the ?TEST define in some dependencies, so we use ?AR_TEST now to indicate whether or not Arweave is being run in test mode (it gets set in rebar.config).
There will be quite a few of these; I can change them all to AR_TEST.
execute(Req, Env) ->
	IPAddr = requesting_ip_addr(Req),
	{ok, Config} = arweave_config:get_env(),
	case lists:member(blacklist, Config#config.disable) of
To be honest, I didn't realize we had a disable blacklist option, and I'm not sure it is currently in use. It looks like it would completely disable rate limiting on the server side. @ldmberman @humaite are you aware of any nodes that use it?
It's possible some large miners use it on internal nodes used to serve data to mining nodes (for example). But I'm not sure it would even be that effective, since by default clients will self-throttle regardless of the server-side behavior.
This is a long way of asking:
- @khetzl did you mean to remove this option? (e.g. because you've replaced it with another configuration option)
- Even if the removal was unintentional... I'm not sure we actually need or want this particular flag. In general the enable and disable options are poorly understood and rarely used, so if we want to disable server-side rate limiting, a more formal config option is a better choice. The only question is whether anyone is using the current flag?
I intended to remove this; we shouldn't allow the public unrestricted access to resources.
Trusted peers can be listed so their requests are limited via a less restrictive Rate Limiter Group. (Currently, no limiting at all.)
Note: even if the disable blacklist flag is enabled in the current release, a global concurrency limit is enforced by cowboy/ranch.
Got it. I agree it's correct to remove. However, it's worth providing some context:
- In cases like this where it is the right call to remove an option, we may still need to flag and communicate the removal for any miners that are relying on it, both in the release notes and in Discord. I suspect this option isn't used much (or at all), so it's probably fine to silently remove.
- Many miners implement their own security in front of their node. Specifically, they will run nodes on a LAN without a direct internet connection - so even if they opt in to using the disable blacklist flag, it doesn't necessarily mean they or we are providing unrestricted access to the public.
- I don't think this flag is used, but if it is, it is likely due to the above. We've had a few miners run large networks with many internal (non publicly accessible) nodes. In the past they tried using local_peers, but since arweave doesn't currently support any sort of wildcards/subnetting, and because node shutdown/boot can take a long time, some of the miners had trouble managing rate limiting for their internal nodes. It's in that scenario where I could see disable blacklist being useful (set up an internal data-sharing node, disable blacklist, and then let all your other nodes connect to it without worrying about needing to restart periodically to update local_peers).

All that said: I do think it's fine to remove. But I wanted to call out that context, as our operating model is often different from that of non-blockchain companies (e.g. we build infrastructure software to be run by 3rd parties as well as by us internally, and those 3rd parties vary from novice to enterprise-level).
Yes, the client-side still relies on this. You're right, it will be removed later. Probably the most important change we want to implement on the client side is peers not relying on their local configuration for throttling, but altering their behaviour (throttling) dynamically according to the remote peer's response headers. That would render the local throttling configuration obsolete.
-define(DEFAULT_HTTP_API_LIMITER_IS_MANUAL_REDUCTION_DISABLED, false).

%% General RLG
-define(DEFAULT_HTTP_API_LIMITER_GENERAL_SLIDING_WINDOW_LIMIT, 150).
These are the defaults that apply across the board to all endpoints unless specifically overridden?
'http_api.limiter.get_previous_vdf_session.is_manual_reduction_disabled' =
	?DEFAULT_HTTP_API_LIMITER_IS_MANUAL_REDUCTION_DISABLED,

'http_api.limiter.metrics.sliding_window_limit' =
@khetzl @humaite What's the recommended approach in the new config system for how deeply to nest configs?
E.g. is there a clear preference between http_api_limiter.metrics.sliding_window_timestamp_cleanup_interval and http_api_limiter.metrics.sliding_window.timestamp_cleanup_interval?
Especially in a JSON format, I could see grouping all the sliding_window or leaky_tick options together making it a little easier to keep everything straight.
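To illustrate the trade-off, assuming (hypothetically) a config representation that supports nesting, the grouped form would collect the sliding_window options under one key; the values below are illustrative:

```erlang
%% Flat keys, as in this PR:
%%   'http_api.limiter.metrics.sliding_window_limit'
%%   'http_api.limiter.metrics.sliding_window_timestamp_cleanup_interval'
%% Hypothetical grouped form:
#{http_api => #{limiter => #{metrics => #{
    sliding_window => #{limit => 150,
                        timestamp_cleanup_interval => 1000}}}}}.
```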
%%% @doc Arweave Rate Limiter.
%%%
%%% `arweave_limiter' module is an interface to the Arweave
%%% Rate Limiter functionality.
Nice. I like this approach - really helps with modularization 👍
reduce_for_peer(LimiterRef, Peer) ->
	Result = gen_server:call(LimiterRef, {reduce_for_peer, Peer}),
	Result == ok andalso prometheus_counter:inc(ar_limiter_reduce_requests_total,
I haven't seen this structure before. Any benefit vs.

ok = Result = gen_server:call(LimiterRef, {reduce_for_peer, Peer}),
prometheus_counter:inc(ar_limiter_reduce_requests_total,

Ah wait, I think I see the distinction. My version would crash; your version would just silently not call prometheus_counter:inc, right?
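A sketch contrasting the two forms (the counter call is simplified from the diff, which passes additional arguments):

```erlang
reduce_for_peer(LimiterRef, Peer) ->
    Result = gen_server:call(LimiterRef, {reduce_for_peer, Peer}),
    %% Form 1: ok = Result asserts success; any non-ok reply crashes the
    %% caller with a badmatch error.
    %% Form 2 (used in the PR): short-circuit evaluation; the counter is
    %% incremented only when Result =:= ok, other replies are ignored.
    Result == ok andalso prometheus_counter:inc(ar_limiter_reduce_requests_total),
    Result.
```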
%% gen_server callbacks
init([Args]) ->
	process_flag(trap_exit, true),
We've gone back and forth on process_flag(trap_exit, true). At one point we were adding it to all gen_servers, and then we switched to only adding it when there was a specific cleanup action we wanted to take on exit.
What's the rationale for trapping the exit here?
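For context on the question, a minimal sketch of when trapping exits matters for a gen_server: without it, an exit signal from the supervisor terminates the process immediately and terminate/2 is never called.

```erlang
init(Args) ->
    %% Trapping exits converts exit signals into messages, so a shutdown
    %% request from the supervisor reaches terminate/2 below.
    process_flag(trap_exit, true),
    {ok, Args}.

terminate(_Reason, _State) ->
    %% Without trap_exit, this cleanup hook would be skipped on shutdown.
    ok.
```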
case Concurrency > ConcurrencyLimit of
	true ->
		%% Concurrency Hard Limit
		?LOG_WARNING([{event, ar_limiter_reject}, {reason, concurrency},
We may want to remove all of these ?LOG_WARNINGs or, at the very least, flip them to ?LOG_DEBUG.
We wrestle with how verbosely to log events that are initiated by clients. We've had issues in the past where a malicious client would attempt to flood a miner's logs via inexpensive HTTP requests.
I guess as a rule of thumb, if a log can be generated by a client, we should aim to have the event that triggers the log be difficult/rare. Although log bloat is a constant battle, and I'm sure there are places in the code where this rule of thumb has been violated.
register_concurrent(
	Peer, FromPid, ConcurrentRequests, ConcurrentMonitors),
{reply, {register, leaky},
	State#{leaky_tokens => NewLeakyTokens,
Question for my own understanding: what's the rationale for => vs. := here? => is like an upsert, right? And := will fail if the key doesn't already exist? Why do we only want to fail on concurrent_monitors?
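For reference, the semantics in question, as a minimal sketch:

```erlang
maps_demo() ->
    M = #{a => 1},
    _ = M#{b => 2},    %% =>: inserts the new key b (upsert)
    _ = M#{a := 3},    %% :=: updates a, which already exists
    M#{c := 4}.        %% :=: crashes with a {badkey,c} error, c is absent
```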
	concurrent_monitors := NewMonitors}}
end;
_ ->
	{NewRequests, NewMonitors} =
It's possible that we hit this branch (i.e. we have capacity under the sliding window limit) while also still having a non-zero Tokens count, right? I.e. we have requests registered against the burst limit and also have space in the sliding window.
I guess I could see this happening if we recently exceeded the sliding window limit and started registering tokens against the burst limit, and then some capacity opened up in the sliding window before we had finished "draining the burst limit bucket".
Is that correct? And that's expected, right?
	Accept
end.

reduce_for_peer(LimiterRef, Peer) ->
Where is this called? I only see it called from ar_http_iface_middleware:handle_post_tx_accepted. I was expecting to see it called whenever an inbound HTTP request is processed, but perhaps I misunderstood.
Ah yeah, looks like I misunderstood. Reading the comment, reduce_for_peer is a special case so that we don't penalize peers for sharing valid transactions with us. The normal leaky bucket and window counter tracking is handled elsewhere.
Maybe add a comment to this function to describe it? The main relevant point that I see is that this function is not the standard way to reduce the request count for a peer, but is an edge-case handler.
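A sketch of the suggested comment; the wording is illustrative, drawn from the discussion above:

```erlang
%% Note: this is not the standard accounting path. It is an edge-case
%% handler that refunds a request slot so peers are not penalized for
%% sharing valid transactions with us; the normal leaky bucket and
%% sliding window tracking happens elsewhere.
reduce_for_peer(LimiterRef, Peer) ->
    gen_server:call(LimiterRef, {reduce_for_peer, Peer}).
```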
	Timestamps.

add_and_order_timestamps(Ts, Timestamps) ->
	lists:reverse(do_add_and_order_timestamps(Ts, lists:reverse(Timestamps))).
Suggested change:
+ %% Timestamps is ordered oldest to newest. However we expect Ts to usually be the most recent timestamp. To optimize for this case we reverse, add (likely to the front), then re-reverse
  lists:reverse(do_add_and_order_timestamps(Ts, lists:reverse(Timestamps))).
Or some other comment just to explain the double-reverse
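To make the optimization concrete, one possible shape of the helper, assuming the reversed list is newest-first (this is an illustration, not necessarily the PR's actual implementation):

```erlang
%% Insert Ts into a newest-first list; the common case is O(1).
do_add_and_order_timestamps(Ts, [Newest | _] = Rev) when Ts >= Newest ->
    [Ts | Rev];
do_add_and_order_timestamps(Ts, [Newest | Rest]) ->
    [Newest | do_add_and_order_timestamps(Ts, Rest)];
do_add_and_order_timestamps(Ts, []) ->
    [Ts].
```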
ok = prometheus_histogram:new([
	{name, ar_limiter_response_time_milliseconds},
	{help, "Time it took for the limiter to respond to requests"},
	{buckets, [infinity]}, %% For meaningful results on this, we need buckets
Without buckets we can still get an average over time, because the histogram includes xxx_count and xxx_sum fields. Most of our internal charts just use this average-over-time value to track performance changes. But if some degree of bucketing would be helpful for this metric (e.g. if the distribution is skewed per timeslice), definitely add them!
We have a broader issue of how to deal with our growing number of metrics, but I think we'll likely need to tackle that holistically; until then, feel free to add whatever metrics you think are necessary.
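If bucketing does turn out to be useful, a sketch of what explicit boundaries might look like with the prometheus library (the boundary values are guesses, not recommendations):

```erlang
ok = prometheus_histogram:new([
    {name, ar_limiter_response_time_milliseconds},
    {help, "Time it took for the limiter to respond to requests"},
    %% Illustrative millisecond boundaries; tune to the observed distribution.
    {buckets, [1, 5, 10, 50, 100, 500, 1000]}
]).
```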
	]},
	{test, [
-		{deps, [{meck, "0.8.13"}]},
+		{deps, [{meck, "1.1.0"}]},
Did upgrading meck address the bug you noticed?
PR looks great! Most of my comments are minor. One annoying comment: unfortunately we're still using tab indenting rather than spaces. We'd like to reformat everything, but until then it would be good to have new code indented with tabs. One question: have you been able to run any performance tests using the new code? My guess is that the impact of the extra logic is minimal. The only potential bottleneck I can see is if the …
Multiple rate limiter actors to be called from middleware/cowboy_handler processes.
Implementation based on Leaky Bucket Token (https://gist.github.com/humaite/21a84c3b3afac07fcebe476580f3a40b) with additional limits for concurrency (similar to the Ranch concurrency limiter).
It could potentially allow complex rule sets to map requests to limiter pools and/or custom settings via the (reworked) ar_config.
The current implementation is not "wired in": the new middleware module is not added to ar_http_iface_server, and the supervisor is not yet part of the ar_node supervision tree. The functionality could be extended with values helpful to regulate clients via the Status 429 response headers.
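As a sketch of that possible extension (the header values are illustrative; retry-after is a standard HTTP header, and the x-ratelimit-* names are a common convention, not something this PR defines):

```erlang
%% Hypothetical 429 reply from the middleware, hinting clients how to throttle.
reply_rate_limited(Req) ->
    cowboy_req:reply(429, #{
        <<"retry-after">> => <<"30">>,          %% seconds until retry
        <<"x-ratelimit-limit">> => <<"150">>,   %% e.g. sliding window limit
        <<"x-ratelimit-remaining">> => <<"0">>
    }, Req).
```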