
Process stuck with growing message queue #5420

Closed
jamesaimonetti opened this issue Feb 1, 2025 · 3 comments

@jamesaimonetti

We are seeing ever-worsening performance in CouchDB 3.3.3.

Description

Over time, queries to CouchDB take longer, eventually start returning 500s, and we see performance continue to degrade.

We've found a process with a growing mailbox:

process_info(pid(0,289,0)).
[{registered_name,couch_server_10},
 {current_function,{erts_internal,await_result,1}},
 {initial_call,{proc_lib,init_p,5}},
 {status,running},
 {message_queue_len,69312},
 {links,[<0.18892.3178>,<0.25843.3474>,<0.28109.2975>,
         <0.30613.3209>,<0.32351.3494>,<0.31224.3509>,<0.30413.3250>,
         <0.27158.3496>,<0.28042.3560>,<0.19662.3364>,<0.22065.3445>,
         <0.22667.3591>,<0.20881.3172>,<0.19280.3563>,<0.19642.3408>,
         <0.19654.3416>,<0.19041.3365>,<0.9166.3328>,<0.17046.2913>,
         <0.17074.3321>,<0.17825.3408>|...]},
 {dictionary,[{'$ancestors',[couch_primary_services,
                             couch_sup,<0.256.0>]},
              {'$initial_call',{couch_server,init,1}}]},
 {trap_exit,true},
 {error_handler,error_handler},
 {priority,normal}, 
 {group_leader,<0.255.0>},
 {total_heap_size,365113},
 {heap_size,46422},
 {stack_size,45},
 {reductions,99576710041},
 {garbage_collection,[{max_heap_size,#{error_logger => true,kill => true,size => 0}},
                      {min_bin_vheap_size,46422},
                      {min_heap_size,233},
                      {fullsweep_after,65535},
                      {minor_gcs,16048}]},
 {suspending,[]}]
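
For anyone chasing the same symptom, here is a minimal remsh sketch (not from the original report) for finding the processes with the largest mailboxes; a backed-up couch_server_N shard like the one above should float to the top:

%% List the N processes with the largest message queues, along with their
%% registered names. Processes that have exited (process_info -> undefined)
%% are skipped by the generator pattern.
F = fun(N) ->
        Mailboxes = [{Len, Pid, process_info(Pid, registered_name)}
                     || Pid <- processes(),
                        {message_queue_len, Len} <- [process_info(Pid, message_queue_len)]],
        lists:sublist(lists:reverse(lists:sort(Mailboxes)), N)
    end,
F(5).

If recon happens to be on the code path, recon:proc_count(message_queue_len, 5) gives a similar answer with less typing.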

Looking at the linked processes, we see a lot of db updaters that appear to be stuck in do_call:

[{current_function,{gen,do_call,4}},
 {initial_call,{proc_lib,init_p,5}},
 {status,waiting},
 {message_queue_len,2},
 {links,[<7672.6048.3484>,<7672.289.0>]},
 {dictionary,[{'$ancestors',[<7672.25449.3401>]},
              {io_priority,{db_update,<<"shards/00000000-ffffffff/account/0d/a0/175b29c55f3888839e47caf2821e-202502.1738041440">>}},
              {last_id_merged,<<"202502-ledgers_monthly_rollover">>},
              {'$initial_call',{couch_db_updater,init,1}},
              {idle_limit,61000}]},
 {trap_exit,false},
 {error_handler,error_handler},
 {priority,normal},
 {group_leader,<7672.255.0>},
 {total_heap_size,4185},
 {heap_size,4185},
 {stack_size,44},
 {reductions,21157},
 {garbage_collection,[{max_heap_size,#{error_logger => true,kill => true,
                                       size => 0}},
                      {min_bin_vheap_size,46422},
                      {min_heap_size,233},
                      {fullsweep_after,65535},
                      {minor_gcs,0}]},
 {suspending,[]}]
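
A hedged sketch for repeating that inspection, assuming couch_server_10 is the backed-up shard: dump the registered name, current function, and mailbox length of every process linked to it.

%% Links may include ports, so keep only pids; entries for processes that
%% have since exited come back as undefined.
{links, Linked} = process_info(whereis(couch_server_10), links),
[process_info(Pid, [registered_name, current_function, message_queue_len])
 || Pid <- Linked, is_pid(Pid)].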

Steps to Reproduce

This develops over time but appears correlated with a number of tasks we run at the beginning of the month.

Expected Behaviour

Don't lock up.

Your Environment

  • CouchDB version used: 3.3.3 / OTP 24
  • Browser name and version: N/A
  • Operating system and version: CentOS

Additional Context

It's a 3-node cluster and we see this on all three nodes.

@nickva
Contributor

nickva commented Feb 1, 2025

@jamesaimonetti thanks for reaching out

A backlog in the couch_server message queue is often due to frequent database opens and closes, especially when the db handle LRU is full and there are not enough idle handles left to replace.

So a few things to try could be (a combined config sketch follows this list):

  • Raise the max_dbs_open if memory allows it.

  • Disable idle_check_timeout (this setting was later removed altogether): [couchdb] idle_check_timeout = infinity

  • Try toggling update_lru_on_read: [couchdb] update_lru_on_read = false. If it was set to true, try setting it to false, and vice versa. It's kind of dependent on your traffic pattern.

  • Increase the number of CPUs (schedulers) available if possible. couch_server processes are sharded across the number of available schedulers, so having 32 schedulers vs 16 would spread the open calls and the LRU across 32 couch_server processes.

  • Inspect your logs to see if there is anything timing out or crashing constantly.
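
Pulling the config-level suggestions together, a hedged local.ini sketch might look like this (the max_dbs_open value is illustrative rather than a recommendation, and update_lru_on_read should be flipped relative to whatever it is currently set to):

; Sketch only: tune for your own memory budget and traffic pattern.
[couchdb]
; raise only if RAM allows keeping more handles open
max_dbs_open = 5000
; disable the idle handle checker (the setting was removed in later releases)
idle_check_timeout = infinity
; toggle relative to the current value and compare behaviour
update_lru_on_read = false

The scheduler count is a VM-level knob rather than an ini setting; erlang:system_info(schedulers) in a remsh shows how many couch_server shards the open calls are being spread across.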

@jamesaimonetti
Author

@nickva thanks for the pointers!

We were able to halt/1 one of the CouchDB VMs, so we'll poke around in the crash dump and see if anything jumps out.

I'm also working on a test tool to load CouchDB up the way we were seeing and try to make this happen reliably. We'll take your points above and incorporate them into our configs.

@jamesaimonetti
Author

@nickva thanks again for the pointers. It turns out the customer had 12 virtual CPUs, with other CPU-hogging services running at the same time as a data migration in CouchDB, causing lots of contention.

We've moved their Couch instances to separate servers and increased CPU count and max_dbs_open. We'll be closely monitoring the situation at the end of this month to see if that's enough, but knowing to check couch_server_X message queues will help.

Much appreciated!
