@shizzard

This PR introduces the OOM monitor process.

To run it, enable the oom_monitor flag, either on the command line or in the config file:

bin/start ... enable oom_monitor
{
    "enable": ["oom_monitor"]
}

Related configuration options are:

  • oom_monitor_report_period: time interval between reports in seconds; default is 5;
  • oom_monitor_filename: filename to write reports into; default is /tmp/arweave_oom_monitor.log;
  • oom_monitor_top_procs: number of top processes to sample for each category of the report; default is 20.

Sample report:

# 2025-07-10T14:05:34.514Z Arweave OOM Monitor
This file is automatically generated by the Arweave OOM Monitor.
It contains information about the memory usage of the Erlang VM and the processes running on it.
It is used to diagnose potential memory issues and to help prevent Out of Memory crashes.
This file will be overwritten upon node restart.

...

## 2025-07-10T14:24:37.469Z Memory Report

### Basic Memory Info
Total Memory: 65342125696 bytes (62315 MiB); since last: -0.4% -274782312 bytes (-262 MiB)
Processes Memory: 3456938728 bytes (3296 MiB); since last: -6.8% -253911112 bytes (-242 MiB)
System Memory: 61885186968 bytes (59018 MiB); since last: -0.0% -20871200 bytes (-19 MiB)
Atom Memory: 1179937 bytes (1 MiB); since last: +0.0% +0 bytes (+0 MiB)
Binary Memory: 57155375120 bytes (54507 MiB); since last: -0.0% -20792728 bytes (-19 MiB)
Code Memory: 28304486 bytes (26 MiB); since last: +0.0% +0 bytes (+0 MiB)
ETS Memory: 4519942888 bytes (4310 MiB); since last: -0.0% -8136 bytes (-0 MiB)
Process Count: 1093; since last: -2.1% -24
Port Count: 91; since last: --18.0% -20

### Top Memory Processes
  ar_sync_record_storage_module_0_replica_2_9_1 (proc_lib:init_p/5): 367905368 bytes (350 MiB)
  ar_sync_record_storage_module_30_replica_2_9_1 (proc_lib:init_p/5): 306587976 bytes (292 MiB)
  ar_sync_record_storage_module_1_replica_2_9_1 (proc_lib:init_p/5): 306587976 bytes (292 MiB)
  ar_sync_record_storage_module_31_replica_2_9_1 (proc_lib:init_p/5): 255490152 bytes (243 MiB)
  ar_sync_record_storage_module_3_replica_2_9_1 (proc_lib:init_p/5): 212908632 bytes (203 MiB)
  ar_wallets (proc_lib:init_p/5): 147853528 bytes (141 MiB)
  ar_sync_record_storage_module_2_replica_2_9_1 (proc_lib:init_p/5): 123211440 bytes (117 MiB)
  ar_sync_record_storage_module_32_replica_2_9_1 (proc_lib:init_p/5): 102676368 bytes (97 MiB)
  ar_sync_record_storage_module_33_replica_2_9_1 (proc_lib:init_p/5): 85563808 bytes (81 MiB)
  ar_sync_record_storage_module_56_replica_2_9_1 (proc_lib:init_p/5): 71303344 bytes (68 MiB)
  ar_sync_record_storage_module_46_replica_2_9_1 (proc_lib:init_p/5): 71303344 bytes (68 MiB)
  ar_tx_blacklist (proc_lib:init_p/5): 67417016 bytes (64 MiB)
  ar_data_discovery (proc_lib:init_p/5): 49517200 bytes (47 MiB)
  ar_sync_record_storage_module_47_replica_2_9_1 (proc_lib:init_p/5): 49516520 bytes (47 MiB)
  ar_sync_record_storage_module_34_replica_2_9_1 (proc_lib:init_p/5): 49516520 bytes (47 MiB)
  ar_sync_record_storage_module_6_replica_2_9_1 (proc_lib:init_p/5): 49516520 bytes (47 MiB)
  ar_sync_record_storage_module_7_replica_2_9_1 (proc_lib:init_p/5): 41263936 bytes (39 MiB)
  ar_sync_record_storage_module_55_replica_2_9_1 (proc_lib:init_p/5): 34386784 bytes (32 MiB)
  ar_sync_record_storage_module_53_replica_2_9_1 (proc_lib:init_p/5): 28655824 bytes (27 MiB)
  ar_sync_record_storage_module_8_replica_2_9_1 (proc_lib:init_p/5): 28655824 bytes (27 MiB)


### Top Binary Processes
  ar_mining_hash (proc_lib:init_p/5): 15427548158 bytes (14712 MiB)
  ar_mining_worker_6_10 (proc_lib:init_p/5): 2817400585 bytes (2686 MiB)
  ar_mining_worker_61_10 (proc_lib:init_p/5): 2257887159 bytes (2153 MiB)
  ar_mining_worker_29_10 (proc_lib:init_p/5): 2214633408 bytes (2112 MiB)
  ar_mining_worker_25_10 (proc_lib:init_p/5): 2084869131 bytes (1988 MiB)
  ar_mining_worker_49_10 (proc_lib:init_p/5): 2012777194 bytes (1919 MiB)
  ar_mining_worker_16_10 (proc_lib:init_p/5): 2008320047 bytes (1915 MiB)
  ar_mining_worker_48_10 (proc_lib:init_p/5): 1927577403 bytes (1838 MiB)
  ar_mining_worker_13_10 (proc_lib:init_p/5): 1640785777 bytes (1564 MiB)
  ar_mining_worker_3_10 (proc_lib:init_p/5): 1634757980 bytes (1559 MiB)
  ar_mining_worker_22_10 (proc_lib:init_p/5): 1505254626 bytes (1435 MiB)
  ar_mining_worker_43_10 (proc_lib:init_p/5): 1496603199 bytes (1427 MiB)
  ar_mining_worker_41_10 (proc_lib:init_p/5): 1482185762 bytes (1413 MiB)
  ar_mining_worker_21_10 (proc_lib:init_p/5): 1398821216 bytes (1334 MiB)
  ar_mining_worker_40_10 (proc_lib:init_p/5): 1353993935 bytes (1291 MiB)
  ar_mining_worker_35_10 (proc_lib:init_p/5): 1275086752 bytes (1216 MiB)
  ar_mining_worker_12_10 (proc_lib:init_p/5): 1263027990 bytes (1204 MiB)
  ar_mining_worker_50_10 (proc_lib:init_p/5): 1232881229 bytes (1175 MiB)
  ar_mining_worker_56_10 (proc_lib:init_p/5): 1193035965 bytes (1137 MiB)
  ar_mining_worker_18_10 (proc_lib:init_p/5): 1185695273 bytes (1130 MiB)


### Top Mailbox Processes
  <0.12394.0> (erlang:apply/2): 3 messages
  <0.12392.0> (erlang:apply/2): 3 messages
  <0.12408.0> (erlang:apply/2): 2 messages
  <0.12400.0> (erlang:apply/2): 2 messages
  <0.12386.0> (erlang:apply/2): 2 messages
  ar_tx_blacklist (proc_lib:init_p/5): 2 messages
  <0.12407.0> (erlang:apply/2): 1 messages
  <0.12405.0> (erlang:apply/2): 1 messages
  <0.12399.0> (erlang:apply/2): 1 messages
  <0.12397.0> (erlang:apply/2): 1 messages
  <0.95905.0> (proc_lib:init_p/5): 0 messages
  <0.95899.0> (proc_lib:init_p/5): 0 messages
  <0.95895.0> (proc_lib:init_p/5): 0 messages
  <0.95886.0> (proc_lib:init_p/5): 0 messages
  <0.95876.0> (proc_lib:init_p/5): 0 messages
  <0.95860.0> (proc_lib:init_p/5): 0 messages
  <0.95851.0> (proc_lib:init_p/5): 0 messages
  <0.95849.0> (proc_lib:init_p/5): 0 messages
  <0.95841.0> (proc_lib:init_p/5): 0 messages
  <0.95838.0> (proc_lib:init_p/5): 0 messages


### Growing Memory Processes
  ar_mining_worker_3_10 (proc_lib:init_p/5): 3039384 bytes (2 MiB)
  ar_mining_worker_32_10 (proc_lib:init_p/5): 2952744 bytes (2 MiB)
  ar_mining_worker_23_10 (proc_lib:init_p/5): 1824728 bytes (1 MiB)
  ar_mining_worker_15_10 (proc_lib:init_p/5): 1812408 bytes (1 MiB)
  ar_mining_worker_7_10 (proc_lib:init_p/5): 1715712 bytes (1 MiB)
  ar_mining_worker_40_10 (proc_lib:init_p/5): 1579408 bytes (1 MiB)
  ar_mining_worker_0_10 (proc_lib:init_p/5): 1571000 bytes (1 MiB)
  ar_mining_worker_13_10 (proc_lib:init_p/5): 1493816 bytes (1 MiB)
  ar_mining_worker_62_10 (proc_lib:init_p/5): 1448072 bytes (1 MiB)
  ar_mining_worker_36_10 (proc_lib:init_p/5): 1362088 bytes (1 MiB)
  ar_mining_worker_47_10 (proc_lib:init_p/5): 1333816 bytes (1 MiB)
  ar_mining_worker_61_10 (proc_lib:init_p/5): 1290456 bytes (1 MiB)
  ar_mining_worker_16_10 (proc_lib:init_p/5): 984032 bytes (0 MiB)
  ar_mining_worker_25_10 (proc_lib:init_p/5): 977752 bytes (0 MiB)
  ar_mining_worker_49_10 (proc_lib:init_p/5): 963152 bytes (0 MiB)
  ar_mining_worker_58_10 (proc_lib:init_p/5): 889400 bytes (0 MiB)
  ar_mining_worker_17_10 (proc_lib:init_p/5): 692432 bytes (0 MiB)
  ar_mining_worker_44_10 (proc_lib:init_p/5): 582696 bytes (0 MiB)
  ar_mining_worker_19_10 (proc_lib:init_p/5): 580032 bytes (0 MiB)
  ar_mining_worker_51_10 (proc_lib:init_p/5): 551456 bytes (0 MiB)


### Growing Mailbox Processes
  <0.95939.0> (proc_lib:init_p/5): 0 messages
  <0.95931.0> (proc_lib:init_p/5): 0 messages
  <0.95920.0> (proc_lib:init_p/5): 0 messages
  <0.95912.0> (proc_lib:init_p/5): 0 messages
  <0.95905.0> (proc_lib:init_p/5): 0 messages
  <0.95895.0> (proc_lib:init_p/5): 0 messages
  <0.95876.0> (proc_lib:init_p/5): 0 messages
  <0.95860.0> (proc_lib:init_p/5): 0 messages
  <0.95851.0> (proc_lib:init_p/5): 0 messages
  <0.95849.0> (proc_lib:init_p/5): 0 messages
  <0.95841.0> (proc_lib:init_p/5): 0 messages
  <0.95838.0> (proc_lib:init_p/5): 0 messages
  <0.95832.0> (proc_lib:init_p/5): 0 messages
  <0.95789.0> (proc_lib:init_p/5): 0 messages
  <0.95784.0> (proc_lib:init_p/5): 0 messages
  <0.95772.0> (proc_lib:init_p/5): 0 messages
  <0.95762.0> (proc_lib:init_p/5): 0 messages
  <0.95760.0> (proc_lib:init_p/5): 0 messages
  <0.95756.0> (proc_lib:init_p/5): 0 messages
  <0.95749.0> (proc_lib:init_p/5): 0 messages


### Binary Leak Processes
  ar_mining_worker_6_10 (proc_lib:init_p/5): reclaimed 2248 bytes (0 MiB)
  ar_chunk_storage_storage_module_17_replica_2_9_1 (proc_lib:init_p/5): reclaimed 1040 bytes (0 MiB)
  ar_chunk_storage_storage_module_30_replica_2_9_1 (proc_lib:init_p/5): reclaimed 780 bytes (0 MiB)
  ar_chunk_storage_storage_module_3_replica_2_9_1 (proc_lib:init_p/5): reclaimed 780 bytes (0 MiB)
  ar_chunk_storage_storage_module_0_replica_2_9_1 (proc_lib:init_p/5): reclaimed 715 bytes (0 MiB)
  ar_mining_worker_0_10 (proc_lib:init_p/5): reclaimed 606 bytes (0 MiB)
  ar_chunk_storage_storage_module_55_replica_2_9_1 (proc_lib:init_p/5): reclaimed 585 bytes (0 MiB)
  ar_chunk_storage_storage_module_32_replica_2_9_1 (proc_lib:init_p/5): reclaimed 585 bytes (0 MiB)
  ar_chunk_storage_storage_module_2_replica_2_9_1 (proc_lib:init_p/5): reclaimed 585 bytes (0 MiB)
  ar_chunk_storage_storage_module_31_replica_2_9_1 (proc_lib:init_p/5): reclaimed 520 bytes (0 MiB)
  ar_chunk_storage_storage_module_18_replica_2_9_1 (proc_lib:init_p/5): reclaimed 520 bytes (0 MiB)
  ar_mining_worker_1_10 (proc_lib:init_p/5): reclaimed 471 bytes (0 MiB)
  ar_chunk_storage_storage_module_59_replica_2_9_1 (proc_lib:init_p/5): reclaimed 455 bytes (0 MiB)
  ar_mining_worker_2_10 (proc_lib:init_p/5): reclaimed 408 bytes (0 MiB)
  ar_mining_worker_32_10 (proc_lib:init_p/5): reclaimed 402 bytes (0 MiB)
  ar_chunk_storage_storage_module_33_replica_2_9_1 (proc_lib:init_p/5): reclaimed 390 bytes (0 MiB)
  <0.72669.0> (erlang:apply/2): reclaimed 336 bytes (0 MiB)
  <0.12408.0> (erlang:apply/2): reclaimed 329 bytes (0 MiB)
  ar_chunk_storage_storage_module_53_replica_2_9_1 (proc_lib:init_p/5): reclaimed 260 bytes (0 MiB)
  ar_chunk_storage_storage_module_1_replica_2_9_1 (proc_lib:init_p/5): reclaimed 260 bytes (0 MiB)


### Allocator Info
  binary_alloc: 59529789440 bytes (56772 MB)
  eheap_alloc: 3662675968 bytes (3493 MB)
  ets_alloc: 4701749248 bytes (4483 MB)
  fix_alloc: 33325056 bytes (31 MB)
  temp_alloc: 16908288 bytes (16 MB)
  driver_alloc: 59277312 bytes (56 MB)
  sl_alloc: 4227072 bytes (4 MB)
  ll_alloc: 198967296 bytes (189 MB)
  std_alloc: 32800768 bytes (31 MB)

@shizzard requested a review from JamesPiechota on July 10, 2025, 14:47
parse_cli_args(["oom_monitor_report_period", Period|Rest], C) ->
	try list_to_integer(Period) of
		P when P >= 0 ->
			parse_cli_args(Rest, C#config{ oom_monitor_report_period = P });
Collaborator

Does this need a P * 1000?
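
To make the question concrete: the option is documented in seconds, while the log field `report_period_ms` further down suggests the server works in milliseconds, so the conversion has to happen somewhere. A minimal sketch, assuming a hypothetical helper `parse_period/1` in a hypothetical module (neither is part of this PR), that converts at parse time:

```erlang
-module(oom_period_sketch).
-export([parse_period/1]).

%% Hypothetical helper, not the PR's code: parse a period given in seconds
%% and return it in milliseconds, the unit used by timer-based APIs such as
%% erlang:send_after/3.
-spec parse_period(string()) -> {ok, non_neg_integer()} | {error, bad_period}.
parse_period(Str) ->
	try list_to_integer(Str) of
		Seconds when Seconds >= 0 ->
			%% timer:seconds/1 multiplies by 1000.
			{ok, timer:seconds(Seconds)};
		_Negative ->
			{error, bad_period}
	catch
		error:badarg ->
			%% Str was not a valid integer literal.
			{error, bad_period}
	end.
```

If the conversion already happens where the timer is armed, the parser can keep storing seconds; the point is only that it should happen in exactly one place.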


init([]) ->
	?LOG_INFO([{event, ar_oom_monitor_init}]),
	%% Trap exit to ensure we close the file handle cleanly
Collaborator

Do we still need this, or is it boilerplate copied from other gen_servers?

We've been wrestling with process_flag(trap_exit, true), as it was one of the contributors to the shutdown hang. I think we had previously gotten into the habit of adding it to all new processes by default (e.g. by copying it from an existing process). Just want to call it out here in case that's the case. If the trap_exit is needed and not just boilerplate, then all good.
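
For context on the trade-off: when a gen_server is shut down by its supervisor, terminate/2 is only invoked if the process traps exits, so the flag matters exactly when terminate/2 does cleanup that has to run. A minimal sketch, using a hypothetical module name and a simplified state (just the file handle), not the PR's actual code:

```erlang
-module(trap_exit_sketch).
-behaviour(gen_server).

-export([start_link/1]).
-export([init/1, handle_call/3, handle_cast/2, terminate/2]).

%% Hypothetical entry point: Filename is the report file to open.
start_link(Filename) ->
	gen_server:start_link(?MODULE, Filename, []).

init(Filename) ->
	%% Without this flag, an exit signal from the supervisor kills the
	%% process directly and terminate/2 below is skipped.
	process_flag(trap_exit, true),
	case file:open(Filename, [write, raw]) of
		{ok, Handle} -> {ok, Handle};
		{error, Reason} -> {stop, Reason}
	end.

handle_call(_Request, _From, Handle) ->
	{reply, ok, Handle}.

handle_cast(_Msg, Handle) ->
	{noreply, Handle}.

%% Cleanup that depends on trap_exit when the stop comes from a supervisor.
terminate(_Reason, Handle) ->
	file:close(Handle).
```

If terminate/2 has nothing essential to do, the flag can be dropped and the process simply dies on shutdown.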

case file:open(Filename, [write, raw, {encoding, utf8}]) of
{ok, FileHandle} ->
?LOG_INFO([
{event, ar_oom_monitor_started}, {report_period_ms, ReportPeriod}, {filename, Filename}]),
Collaborator

Suggested change
{event, ar_oom_monitor_started}, {report_period_ms, ReportPeriod}, {filename, Filename}]),
<tab><tab>{event, ar_oom_monitor_started}, {report_period_ms, ReportPeriod}, {filename, Filename}]),

Maybe? This line shows up as 8 spaces in my diff rather than 2 tabs.

		Handle ->
			file:close(Handle)
	end,
	ok.
Collaborator

Do we always want to return ok here even if file:close yields an error?
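
One possible shape for this, as a sketch only: if the snippet is a gen_server terminate/2 callback (an assumption, the hunk does not show the function head), its return value is ignored by gen_server, so the practical option is to log the failure rather than return it. The module and function names below are hypothetical:

```erlang
-module(close_result_sketch).
-export([close_and_report/1]).

%% Hypothetical helper, not the PR's code: close the handle and surface a
%% failure in the log instead of silently discarding it.
-spec close_and_report(undefined | file:io_device()) -> ok | {error, term()}.
close_and_report(undefined) ->
	%% Nothing was opened, nothing to close.
	ok;
close_and_report(Handle) ->
	case file:close(Handle) of
		ok ->
			ok;
		{error, Reason} = Error ->
			logger:warning("oom monitor: failed to close report file: ~p", [Reason]),
			Error
	end.
```

A caller that cannot act on the error can still ignore the return value; the log line is what preserves the information.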

Comment on lines +184 to +197
get_top_binary_processes(TopProcs) ->
Procs = lists:map(fun(Pid) ->
case erlang:process_info(Pid, [binary, registered_name, initial_call, current_function]) of
undefined -> {Pid, 0, []};
Props ->
TotalSize = lists:sum([Size || {_, Size, _} <- proplists:get_value(binary, Props, [])]),
InitialCall = proplists:get_value(initial_call, Props, {unknown, unknown, 0}),
CurrentFunction = proplists:get_value(current_function, Props, {unknown, unknown, 0}),
case proplists:get_value(registered_name, Props, undefined) of
undefined -> {Pid, TotalSize, [{initial_call, InitialCall}, {current_function, CurrentFunction}]};
RegName -> {Pid, TotalSize, [RegName, {initial_call, InitialCall}, {current_function, CurrentFunction}]}
end
end
end, erlang:processes()),
Collaborator

Some spaces in here instead of tabs (haha, the new ChatGPT tell)

AllocData = recon_alloc:memory(allocated_types),
maps:from_list([{Type, Size} || {Type, Size} <- AllocData]).

write_memory_report(FileHandle, MemoryReport, LastMemoryReport) ->
Collaborator

spaces vs. tabs

3 participants