RabbitMQ CPU usage progressively increasing over time #14861

Mistra · 2025-10-30T10:24:10Z

Mistra
Oct 30, 2025

Description

We are observing a progressive increase in CPU usage over time on a RabbitMQ cluster with Kazoo application connected.

The increase appears to be almost linear, no sudden jumps, and continues even under low message traffic and very low queue churn.
Once the Kazoo consumers disconnect, CPU usage drops immediately, suggesting a correlation between Kazoo connections and the observed CPU load.
From rabbitmq-diagnostics runtime_thread_stats, we can see that the dirty_cpu schedulers show a growing percentage of time spent in gc and gc_full over time.

It looks like RabbitMQ is increasingly spending CPU cycles performing garbage collection rather than message processing.
Only with Kazoo connected we are able to replicate this behavior otherwise with simple consumers and producers we are not able to reproduce this issue.

Environment

RabbitMQ version: 4.1.4
Erlang/OTP version: 27.3.4.3
Deployment: Docker on AWS EC2
Node setup:
- Staging: single node (t3.small, 2 vCPU, 2 GB RAM)
- Production: 2-node cluster, on AWS EC2 M6G
Clients: Kazoo (standard AMQP library, no custom fork)
Queue definition sample:
- x-message-ttl: 60000
- x-max-length: 100
- Classic queues (non-durable, auto-delete)

Observations

The problem occurs even in staging, where the message traffic is low.
Channel and queue churn are minimal or null (connections remain stable).
There are ~20 AMQP connections in total.
In stagin, as example, most connections hold 20–30 channels each, with 3–4 connections reaching 300 channels.
Over time, CPU usage gradually increases, primarily attributed to garbage collection (gc, gc_full) in dirty_cpu schedulers.
Memory and process counts remain within reasonable limits (no resource exhaustion).
Once Kazoo disconnects, CPU usage immediately drops and remains low until the next reconnect.

Reproduction steps

Spawn RabbitMQ
Attach the application
Let run for some day (some weeks, depending on the environment)

Expected behavior

It appears that RabbitMQ’s dirty schedulers are increasingly busy performing garbage collection while we expect them to remain stable.
The load persists even when message volume is low, and resets when those consumers are disconnected.
We would appreciate any guidance or suggestions on how to further diagnose or mitigate this behavior.

Additional context

We are attaching some screenshots and an anonymized report.

anonymized_report.log

mkuratczyk · 2025-10-30T11:59:01Z

mkuratczyk
Oct 30, 2025
Maintainer

Most likely, it's the application doing something differently over time. Perhaps it uses basic.get instead of consume?
https://www.rabbitmq.com/docs/consumers#polling

Garbage collection is a side effect of doing some other work. To learn what the CPU is really doing, you can use perf: https://www.rabbitmq.com/blog/2022/05/31/flame-graphs

0 replies

michaelklishin · 2025-10-30T19:43:47Z

michaelklishin
Oct 30, 2025
Maintainer

@Mistra we cannot know what your applications are doing. Chances are, they are opening more and more connections or declaring more and more queues, including leaking such resources.

RabbitMQ has metrics for each object category for this and other reasons. There are dozens of metrics available, instead of looking at one of them, use others to correlate, then use your applications' metrics to correlate further.

There is a separate metric for basic.get frequency in the management UI called, see deliver / get (it could be hidden since the columns are configurable). Increasingly aggressive basic.get usage does looks like a probably cause if the number of connections and queues is stable, and those queues are not classic priority queues.

0 replies

michaelklishin · 2025-10-30T19:44:26Z

michaelklishin
Oct 30, 2025
Maintainer

On an unrelated nodes, two node clusters are explicitly recommended against.

0 replies

michaelklishin · 2025-10-30T20:01:03Z

michaelklishin
Oct 30, 2025
Maintainer

I don't know if this instance of Kazoo uses these config files but they both use a recommended way of reducing CPU footprint in mostly idle environments and allow for an unlimited number of channels at the same time.

You can set channel_max for this cluster to something like 16, 32 or 64 (all perfectly reasonable values for a lot of apps) and see if the client begins logging exceptions.

Besides a resource leak that hasn't been identified (it could be in a different virtual host, for example), my only other hypothesis was the use of periodic GC for all processes which is disabled by default since Jun 2017 (8d52a09).

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

RabbitMQ CPU usage progressively increasing over time #14861

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 4 comments

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

RabbitMQ CPU usage progressively increasing over time #14861

Uh oh!

Uh oh!

Mistra Oct 30, 2025

Description

Environment

Observations

Reproduction steps

Expected behavior

Additional context

Replies: 4 comments

Uh oh!

mkuratczyk Oct 30, 2025 Maintainer

Uh oh!

Uh oh!

michaelklishin Oct 30, 2025 Maintainer

Uh oh!

michaelklishin Oct 30, 2025 Maintainer

Uh oh!

michaelklishin Oct 30, 2025 Maintainer

Mistra
Oct 30, 2025

mkuratczyk
Oct 30, 2025
Maintainer

michaelklishin
Oct 30, 2025
Maintainer

michaelklishin
Oct 30, 2025
Maintainer

michaelklishin
Oct 30, 2025
Maintainer