
Conversation

@khetzl (Collaborator) commented Dec 15, 2025:

This PR adds multiple rate limiter actors to be called from middleware/cowboy_handler processes.

The implementation is based on the Leaky Bucket Token approach (https://gist.github.com/humaite/21a84c3b3afac07fcebe476580f3a40b), with additional limits for concurrency (similar to the Ranch concurrency limiter).
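
For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of leaky token bucket accounting (illustrative only; the module name, function names, and refill math are assumptions, not this PR's actual API):

    %% A bucket holds up to Capacity tokens and refills at Rate tokens
    %% per second; each request consumes one token or is rejected.
    -module(leaky_bucket_sketch).
    -export([new/2, try_take/1]).

    new(Capacity, Rate) ->
        #{capacity => Capacity, rate => Rate,
          tokens => Capacity, last => erlang:monotonic_time(millisecond)}.

    try_take(Bucket = #{capacity := Cap, rate := Rate,
                        tokens := Tokens, last := Last}) ->
        Now = erlang:monotonic_time(millisecond),
        %% Refill proportionally to elapsed time, capped at capacity.
        Refilled = min(Cap, Tokens + (Now - Last) * Rate / 1000),
        case Refilled >= 1 of
            true  -> {ok, Bucket#{tokens := Refilled - 1, last := Now}};
            false -> {rejected, Bucket#{tokens := Refilled, last := Now}}
        end.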

It could potentially allow complex rule sets that map requests to limiter pools and/or custom settings via a (reworked) ar_config.

The current implementation is not yet "wired" in: the new middleware module is not added to ar_http_iface_server, and the supervisor is not yet part of the ar_node supervision tree.

The functionality could be extended with response header values on Status 429 responses that help clients regulate themselves.
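
For illustration, conventional rate-limit headers on such a response might look like this (not implemented in this PR; Retry-After is standard, while the X-RateLimit-* names are common conventions rather than a formal spec):

    HTTP/1.1 429 Too Many Requests
    Retry-After: 30
    X-RateLimit-Limit: 150
    X-RateLimit-Remaining: 0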

@JamesPiechota (Collaborator) commented:

Ha. Interesting - I would have thought the ++ would have caused an error. Good to know!

@khetzl (Collaborator, Author) commented Dec 23, 2025:

> Ha. Interesting - I would have thought the ++ would have caused an error. Good to know!

It did! However, the test was wrong as well: because the sliding window was set to 0, no timestamps were actually stored for the peers, so that part of the code never ran until I started up a node and began experimenting with the config values.
Two wrongs made a false right.

@khetzl changed the title from "WIP: Rate limit restructure" to "Rate limit restructure (server-side)" on Jan 9, 2026.

@humaite (Collaborator) left a comment:

Looks okay to me at first glance. I will probably re-read it later.

gen_server:stop(LimiterRef).

%% gen_server callbacks
init([Args]) ->

Collaborator:

Why use [Args] and not just Args?
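
For context, a general OTP note (not specific to this PR): gen_server passes the start argument to init/1 verbatim, so init([Args]) only matches when the caller wrapped the argument in a list:

    %% These two must agree:
    start_link(Args) ->
        gen_server:start_link(?MODULE, [Args], []).  %% wraps Args in a list

    init([Args]) ->  %% ... so init/1 unwraps it
        {ok, Args}.

    %% The unwrapped equivalent: gen_server:start_link(?MODULE, Args, [])
    %% paired with init(Args) -> {ok, Args}.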

TEST_PATH="$(./rebar3 as ${TEST_PROFILE} path)"

## TODO: Generate path for all apps -> Should we fetch this from somewhere?
APPS=("arweave" "arweave_config" "arweave_limiter" "arweave_diagnostic")

Collaborator:

You could remove the "bash-like" array. It's not really portable and could cause trouble. In POSIX shell, a list is simply a string of words separated by spaces or null characters.

APPS="arweave arweave_config arweave_limiter arweave_diagnostic"

In this case, the code on L523 will simply look like this:

for app in ${APPS}
do
  ...
done

Using specific bash features is usually not a good idea.

PARAMS="-pa ${TEST_PATH} ${TEST_PATH_BASE} -config ${TEST_CONFIG} -noshell"
TEST_PATH="$(./rebar3 as ${TEST_PROFILE} path)"

## TODO: Generate path for all apps -> Should we fetch this from somewhere?

Collaborator:

In theory, those paths are generated using rebar3, but it seems that for the tests this was not done correctly.

@@ -0,0 +1,69 @@
-module(ar_http_iface_rate_limiter_middleware).

Collaborator:

Can you add a header comment with a bit of documentation to explain the role of this middleware?


config_peer_to_ip_addr({A, B, C, D, _}) -> {A, B, C, D}.

path_to_limiter_ref([<<"chunk">> | _]) -> chunk;

Collaborator:

If I understand correctly, this function reads the path from the client request and returns the kind of limiter to use. Using a data structure like a proplist or a map would probably be easier to modify in the future. Not sure it would be easy to do right now, though.

The problem I see here is that if we add a new path, we will need to add it to this function (at least if we need a special rate limit for it).
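
A minimal sketch of the map-based alternative being suggested (illustrative; the table contents and the default fallback are assumptions):

    %% Route table as data rather than function clauses; unknown prefixes
    %% fall back to a default limiter group.
    -define(LIMITER_BY_PREFIX, #{<<"chunk">> => chunk}).

    path_to_limiter_ref([Prefix | _]) ->
        maps:get(Prefix, ?LIMITER_BY_PREFIX, default);
    path_to_limiter_ref([]) ->
        default.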

Collaborator (Author):

Yes, it's certainly going to take some work to add a new rate limiter group:

  1. add Rate Limiter Group (RLG) process to arweave_limiter_sup
  2. add Rate Limiter Group config parameters as fields to arweave_config config record.
    2b) add defaults for the parameters.
  3. add Rate Limiter routing logic to ar_http_iface_rate_limiter_middleware

IMHO, this mapping function is quite an idiomatic way to map paths to atoms (no dynamically created atoms, no lookups needed).

  • I'd argue that 2) is a bigger burden than 3) (extending this function; see the sketch below).
  • I'd like to point out: this PR doesn't aim to introduce dynamic RLGs.
  • With a different approach on how we deal with the configuration, I think we can introduce RLGs started in a dynamic manner in the future, if necessary.
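
To illustrate how small step 3 is, extending the mapping function for a new endpoint is a one-clause change (the <<"vdf">> path and vdf group below are hypothetical, for illustration only):

    path_to_limiter_ref([<<"vdf">> | _]) -> vdf;    %% hypothetical new clause
    path_to_limiter_ref([<<"chunk">> | _]) -> chunk;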

Collaborator:

We are not adding a lot of new endpoints, but perhaps documenting this procedure somewhere would be nice.

-export([start/0, ban_peer/2, is_peer_banned/1, cleanup_ban/1]).
-export([start_link/0]).

-ifdef(TEST).

Collaborator:

Suggested change:
-  -ifdef(TEST).
+  -ifdef(AR_TEST).

We've had some name collisions with the ?TEST define in some dependencies, so we use ?AR_TEST now to indicate whether or not Arweave is being run in test-mode (it gets set in rebar.config)
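
For reference, a compile-time define like this is typically set via erl_opts in rebar.config; a minimal sketch of the mechanism (not the project's exact configuration):

    {profiles, [
        {test, [
            {erl_opts, [{d, 'AR_TEST'}]}  %% enables -ifdef(AR_TEST). sections
        ]}
    ]}.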

Collaborator (Author):

There will be quite a few of these; I can change them all to AR_TEST.

@JamesPiechota (Collaborator) commented:

The RPM_BY_PATH logic as well as the ar_rate_limiter module - they're both related to the client-side rate limiting, right? i.e. no change to them in this PR, but they will likely be removed/updated in a future PR when the client-side functionality is added.

execute(Req, Env) ->
IPAddr = requesting_ip_addr(Req),
{ok, Config} = arweave_config:get_env(),
case lists:member(blacklist, Config#config.disable) of

Collaborator:

To be honest I didn't realize we had a disable blacklist option, and I'm not sure if it is currently in use. It looks like it would completely disable rate limiting on the server side. @ldmberman @humaite are you aware of any nodes that use it?

It's possible some large miners use it on internal nodes used to serve data to mining nodes (for example). But I'm not sure it would even be that effective, since by default clients will self-throttle regardless of the server-side behavior.

This is a long way of asking:

  1. @khetzl did you mean to remove this option? (e.g. because you've replaced it with another configuration option)
  2. Even if the removal was unintentional... I'm not sure we actually need or want this particular flag. In general the enable and disable options are poorly understood and rarely used, so if we want to disable server-side rate limiting, a more formal config is a better choice. The only question is whether anyone is using the current flag.

Collaborator (Author):

I intended to remove this; we shouldn't allow unrestricted public access to resources.
Trusted peers can be listed so that their requests are limited via a less restrictive Rate Limiter Group (currently, no limiting at all).

Note: Even if the disable blacklist flag is enabled in the current release, a global concurrency limit is enforced by cowboy/ranch.

@JamesPiechota (Collaborator) commented Jan 9, 2026:

Got it. I agree it's correct to remove. However, it's worth providing some context:

  • In cases like this, where it is the right call to remove an option, we may still need to flag and communicate the removal to any miners that are relying on it, both in the release notes and in Discord. I suspect this option isn't used much (or at all), so it's probably fine to silently remove.
  • Many miners implement their own security in front of their node. Specifically they will run nodes on a LAN without direct internet connection - so even if they opt in to using the disable blacklist flag it doesn't necessarily mean they or we are providing unrestricted access to the public.
  • I don't think this flag is used, but if it is, it is likely due to the above. We've had a few miners run large networks with many internal (non publicly accessible) nodes. In the past they tried using local_peers, but since arweave doesn't currently support any sort of wildcards/subnetting, and because the node shutdown/boot time can take a long time, some of the miners had trouble managing rate limiting for their internal nodes. It's in that scenario where I could see disable blacklist being useful (set up an internal data-sharing node, disable the blacklist, and then let all your other nodes connect to it without worrying about needing to restart periodically to update local_peers).

All that said: I do think it's fine to remove. But I wanted to call out that context, as our operating model is often different from that of non-blockchain companies (e.g. we build infrastructure software to be run by 3rd parties as well as by us internally, and those 3rd parties vary from novice to enterprise-level).

@khetzl (Collaborator, Author) commented Jan 9, 2026:

> The RPM_BY_PATH logic as well as the ar_rate_limiter module - they're both related to the client-side rate limiting, right? i.e. no change to them in this PR, but they will likely be removed/updated in a future PR when the client-side functionality is added.

The RPM_BY_PATH macro was used on the server-side as well (apps/arweave/src/ar_blacklist_middleware.erl L153)

Yes, the client side still relies on this. You're right, it will be removed later. Probably the most important change we want to make on the client side is for peers to stop relying on their local configuration for throttling and instead adjust their throttling dynamically according to the remote peer's response headers. That would render the RPM_BY_PATH macro redundant.

-define(DEFAULT_HTTP_API_LIMITER_IS_MANUAL_REDUCTION_DISABLED, false).

%% General RLG
-define(DEFAULT_HTTP_API_LIMITER_GENERAL_SLIDING_WINDOW_LIMIT, 150).

Collaborator:

These are the defaults that apply across the board to all endpoints unless specifically overridden?

'http_api.limiter.get_previous_vdf_session.is_manual_reduction_disabled' =
?DEFAULT_HTTP_API_LIMITER_IS_MANUAL_REDUCTION_DISABLED,

'http_api.limiter.metrics.sliding_window_limit' =

Collaborator:

@khetzl @humaite What's the recommended approach for the new config system regarding how deep to nest configs?

E.g., is there a clear preference between http_api_limiter.metrics.sliding_window_timestamp_cleanup_interval and http_api_limiter.metrics.sliding_window.timestamp_cleanup_interval?

Especially in a JSON format, I could see grouping all the sliding_window or leaky_tick options together making it a little easier to keep everything straight.
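
For illustration (hypothetical keys and values, not the current schema), the deeper nesting would group the related options together in JSON like so:

    {
      "http_api_limiter": {
        "metrics": {
          "sliding_window": {
            "limit": 150,
            "timestamp_cleanup_interval": 60
          }
        }
      }
    }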

%%% @doc Arweave Rate Limiter.
%%%
%%% `arweave_limiter' module is an interface to the Arweave
%%% Rate Limiter functionality.

Collaborator:

Nice. I like this approach - really helps with modularization 👍


reduce_for_peer(LimiterRef, Peer) ->
Result = gen_server:call(LimiterRef, {reduce_for_peer, Peer}),
Result == ok andalso prometheus_counter:inc(ar_limiter_reduce_requests_total,

Collaborator:

I haven't seen this structure before. Any benefit vs.

ok = Result = gen_server:call(LimiterRef, {reduce_for_peer, Peer}),
prometheus_counter:inc(ar_limiter_reduce_requests_total,

Ah wait, I think I see the distinction. My version would crash; your version would just silently not call prometheus_counter:inc, right?


%% gen_server callbacks
init([Args]) ->
process_flag(trap_exit, true),

Collaborator:

We've gone back and forth on process_flag(trap_exit, true). At one point we were adding it to all gen_servers, and then we switched to only adding it when there was a specific cleanup action we wanted to take on exit.

What's the rationale for trapping the exit here?
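
For reference, a minimal sketch of the guideline being described (the cleanup body is a made-up placeholder): trapping exits mainly pays off when terminate/2 performs real cleanup, because it makes terminate/2 run on a supervisor-initiated shutdown:

    init(Args) ->
        %% Trap exits so terminate/2 is invoked on shutdown.
        process_flag(trap_exit, true),
        {ok, Args}.

    terminate(_Reason, State) ->
        %% Only worth trapping exits if something real happens here,
        %% e.g. releasing resources (cleanup/1 is hypothetical).
        cleanup(State),
        ok.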

case Concurrency > ConcurrencyLimit of
true ->
%% Concurrency Hard Limit
?LOG_WARNING([{event, ar_limiter_reject}, {reason, concurrency},

Collaborator:

We may want to remove all of these ?LOG_WARNINGs or, at best, flip them to ?LOG_DEBUG.

We wrestle with how verbosely to log events that are initiated by clients. We've had issues in the past where a malicious client would attempt to flood a miner's logs via inexpensive HTTP requests.

I guess as a rule of thumb, if a log can be generated by a client, we should aim to have the event that triggers the log be difficult/rare. Log bloat is a constant battle, though, and I'm sure there are places in the code where this rule of thumb has been violated.

register_concurrent(
Peer, FromPid, ConcurrentRequests, ConcurrentMonitors),
{reply, {register, leaky},
State#{leaky_tokens => NewLeakyTokens,

Collaborator:

Question for my own understanding: what's the rationale for => vs. := here? => is like upsert, right? And := will fail if the key doesn't already exist? Why do we only want to fail on concurrent_monitors?
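
For reference, the standard map-update semantics in a nutshell (throwaway keys for illustration):

    M = #{a => 1}.
    M#{a := 2}.  %% ok: key exists, value replaced
    M#{b => 3}.  %% ok: key inserted (upsert)
    M#{b := 3}.  %% fails with a {badkey,b} error: := requires an existing key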

concurrent_monitors := NewMonitors}}
end;
_ ->
{NewRequests, NewMonitors} =

Collaborator:

It's possible that we hit this branch (i.e. we have capacity under the sliding window limit) while also still having a non-zero Tokens count, right? I.e., we have requests registered against the burst limit and also have space in the sliding window.

I guess I could see this happening if we recently exceeded the sliding window limit and started registering tokens against the burst limit, and then some capacity opened up in the sliding window before we had finished "draining the burst limit bucket".

Is that correct? And that's expected, right?

Accept
end.

reduce_for_peer(LimiterRef, Peer) ->

Collaborator:

Where is this called? I only see it called from ar_http_iface_middleware:handle_post_tx_accepted. I was expecting to see it called whenever an inbound HTTP request is processed, but perhaps I misunderstood.

Collaborator:

Ah yeah, looks like I misunderstood. Reading the comment, reduce_for_peer is a special case so that we don't penalize peers for sharing valid transactions with us. The normal leaky bucket and window counter tracking is handled elsewhere.

Collaborator:

Maybe add a comment to this function to describe it? The main relevant point I see is that this function is not the standard way to reduce the request count for a peer, but an edge-case handler.

Timestamps.

add_and_order_timestamps(Ts, Timestamps) ->
lists:reverse(do_add_and_order_timestamps(Ts, lists:reverse(Timestamps))).

Collaborator:

Suggested change:
-  lists:reverse(do_add_and_order_timestamps(Ts, lists:reverse(Timestamps))).
+  %% Timestamps is ordered oldest to newest. However, we expect Ts to usually
+  %% be the most recent timestamp. To optimize for this case we reverse, add
+  %% (likely to the front), then re-reverse.
+  lists:reverse(do_add_and_order_timestamps(Ts, lists:reverse(Timestamps))).

Or some other comment, just to explain the double reverse.

ok = prometheus_histogram:new([
{name, ar_limiter_response_time_milliseconds},
{help, "Time it took for the limiter to respond to requests"},
{buckets, [infinity]}, %% For meaningful results on this, we need buckets

Collaborator:

Without buckets we can still get an average over time, because the histogram includes xxx_count and xxx_sum fields. Most of our internal charts just use this average-over-time value to track performance changes. But if some degree of bucketing would be helpful for this metric (e.g. if the distribution is skewed per timeslice), definitely add them!

We have a broader issue of how to deal with our growing number of metrics, but I think we'll likely need to tackle that holistically. Until then, feel free to add whatever metrics you think are necessary.
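
For illustration, explicit buckets in the prometheus.erl API look like this (the bucket bounds are made-up values to be tuned against real latencies):

    ok = prometheus_histogram:new([
        {name, ar_limiter_response_time_milliseconds},
        {help, "Time it took for the limiter to respond to requests"},
        %% Upper bounds in milliseconds; a final +Inf bucket is implied.
        {buckets, [1, 5, 10, 50, 100, 500, 1000]}
    ]).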

]},
{test, [
{deps, [{meck, "0.8.13"}]},
{deps, [{meck, "1.1.0"}]},

Collaborator:

Did upgrading meck address the bug you noticed?

@JamesPiechota (Collaborator) commented:

PR looks great! Most of my comments are minor.

One annoying comment: unfortunately we're still using tab indentation rather than spaces. We'd like to reformat everything, but until then it would be good to have new code indented with tabs.

One question: have you been able to run any performance tests using the new code? My guess is the impact of the extra logic is minimal. The only potential bottleneck I can see is the arweave_limiter_group process itself; now that all requests for a given group are sent to a single process, I wonder if it could get backed up under load.
