Skip to content

Commit 2dc140b

Browse files
committed
TROUBLESHOOTING: Performance tuning in sssd
This page will describe the performance related issues in sssd and how to troubleshoot those issues.
1 parent bb6c855 commit 2dc140b

2 files changed

Lines changed: 139 additions & 0 deletions

File tree

src/contents.rst

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -93,6 +93,7 @@ Table of Contents
9393
Basics <troubleshooting/basics>
9494
Backend <troubleshooting/backend>
9595
SSSD Errors <troubleshooting/errors>
96+
Performance Tuning in SSSD <troubleshooting/performance>
9697
Log Analyzer <troubleshooting/analyzer>
9798
Fleet Commander <troubleshooting/fleet-commander>
9899
SUDO <troubleshooting/sudo>
Lines changed: 138 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,138 @@
1+
Performance tuning in SSSD
2+
##########################
3+
4+
Slow id lookup
5+
**************
6+
This has been noticed id lookup become slow if the LDAP/AD user is a member of large groups say for example user is a member of 500+ groups. ``id`` is very heavy. ``id`` does a lot under its hood.
7+
Behind the scenes, when the ``id $user`` command is executed it triggers the following:
8+
9+
- Get user information - getpwnam() for the user
10+
11+
- Get primary group information - getgrgid() for the primary group of the user
12+
13+
- Get list of groups - getgrouplist() for the user
14+
15+
- Get group information for each group returned from step 3 - getgrid() for all GIDs returned from getgrouplist() call.
16+
17+
We need to identify out of the above 4 steps which step is actually slow. In order to collect detailed infromation we need to add ``debug_level = 9`` under the ``[$domain]`` section of the ``/etc/sssd/sssd.conf`` followed by a ``sssd`` restart. We often noticed step 4 is the step where sssd takes most of its time because the most data-intensive operation is downloading the groups including their members and by default this feature is enabled we can turn this off by setting ``ignore_group_members = true``.
18+
Usually, we are interested in what groups a user is a member of (id aduser@ad_domain) as the initial step rather than what members do specific groups include (getent group adgroup@ad_domain). Setting the ignore_group_members option to True makes all groups appear as empty, thus downloading only information about the group objects themselves and not their members, providing a significant performance boost. Please note that id aduser@ad_domain would still return all the correct groups.
19+
20+
- Pros: getgrnam/getgrgid calls are significantly faster.
21+
- Cons: getgrnam/getgrgid calls only return the group information, not the members
22+
23+
**WARNING! If the compat tree is used, DO NOT SET ignore_group_members = true because it breaks the compatibility logic.**
24+
25+
If after disbaling the group_members call still the look is slow in that case we can get into the sssd debug logs and verify how long the ``Initgroups`` call is taking this can be done by grepping the ``CR`` no. of that id lookup request. In this example here ``sssd_nss`` is taking ``1 sec`` to process the user group membership here we have only 39 groups associated with the user if the count is large say for example 500-600 and the ``ignore_group_members`` is not to set true then this is expected the id lookup will take some time with the cold cache.
26+
27+
.. code-block:: sssd-log
28+
29+
$ grep 'CR #3\:' /var/log/sssd/sssd_nss.log
30+
(2023-06-08 12:21:31): [nss] [cache_req_set_plugin] (0x2000): CR #3: Setting "Initgroups by name" plugin
31+
(2023-06-08 12:21:31): [nss] [cache_req_send] (0x0400): CR #3: New request 'Initgroups by name'
32+
(2023-06-08 12:21:31): [nss] [cache_req_process_input] (0x0400): CR #3: Parsing input name [roy]
33+
(2023-06-08 12:21:31): [nss] [cache_req_set_name] (0x0400): CR #3: Setting name [roy]
34+
(2023-06-08 12:21:31): [nss] [cache_req_select_domains] (0x0400): CR #3: Performing a multi-domain search
35+
(2023-06-08 12:21:31): [nss] [cache_req_search_domains] (0x0400): CR #3: Search will check the cache and check the data provider
36+
(2023-06-08 12:21:31): [nss] [cache_req_set_domain] (0x0400): CR #3: Using domain [redhat.com]
37+
(2023-06-08 12:21:31): [nss] [cache_req_prepare_domain_data] (0x0400): CR #3: Preparing input data for domain [redhat.com] rules
38+
(2023-06-08 12:21:31): [nss] [cache_req_search_send] (0x0400): CR #3: Looking up roy@redhat.com
39+
(2023-06-08 12:21:31): [nss] [cache_req_search_ncache] (0x0400): CR #3: Checking negative cache for [roy@redhat.com]
40+
(2023-06-08 12:21:31): [nss] [cache_req_search_ncache] (0x0400): CR #3: [roy@redhat.com] is not present in negative cache
41+
(2023-06-08 12:21:31): [nss] [cache_req_search_cache] (0x0400): CR #3: Looking up [roy@redhat.com] in cache
42+
(2023-06-08 12:21:31): [nss] [cache_req_search_send] (0x0400): CR #3: Object found, but needs to be refreshed.
43+
(2023-06-08 12:21:31): [nss] [cache_req_search_dp] (0x0400): CR #3: Looking up [roy@redhat.com] in data provider
44+
(2023-06-08 12:21:32): [nss] [cache_req_search_cache] (0x0400): CR #3: Looking up [roy@redhat.com] in cache
45+
(2023-06-08 12:21:32): [nss] [cache_req_search_ncache_filter] (0x0400): CR #3: This request type does not support filtering result by negative cache
46+
(2023-06-08 12:21:32): [nss] [cache_req_search_done] (0x0400): CR #3: Returning updated object [roy@redhat.com]
47+
(2023-06-08 12:21:32): [nss] [cache_req_create_and_add_result] (0x0400): CR #3: Found 39 entries in domain redhat.com <---------
48+
(2023-06-08 12:21:32): [nss] [cache_req_done] (0x0400): CR #3: Finished: Success
49+
50+
The above troubleshooting can be done easily with ``sssctl analyze request list`` and ``sssctl analyze request show <request_no>``. For more details, please refer to :doc:`Log Analyzer <analyzer>`.
51+
52+
Moreover, if the initgroup call is taking time then add ``debug_microseconds = true`` under the ``[$doamin]`` section of the ``/etc/sssd/sssd.conf`` and try to check the details ``sssd_nss.log`` and ``sssd_$domain.log`` where initgroup call is taking time. In ``sssd_$domain_.log`` start chaecking from ``[BE_USER]`` it will show you if sssd can retrieve the user after that sssd will start processing groups of the respective user which can be seen in ``[Initgroups]`` call in ``sssd_$domain.log``.By looking at this call you will be able to confirm if the user lookup and processing of membership is done properly. Here you will able see where the lookup is taking time or if there is any failure. Also, we can look at ``ltrace -r`` this will tell us which library calls is taking time. In the below example sssd is spending 26 sec in ``getgrouplist()``.
53+
54+
.. code-block:: sssd-log
55+
56+
0.000976 strrchr("id", '/') = nil
57+
….
58+
0.000383 __printf_chk(1, 0x557407f39a3c, 0x55740813d0ba, 0) = 14
59+
0.000175 getpwuid(0x6be2724d, 14, 0x7f9996950860, 0) = 0x7f99969514e0
60+
0.000860 __printf_chk(1, 0x557407f39a2d, 0x557409bb81b0, 0) = 9
61+
0.000119 dcgettext(0, 0x557407f39a32, 5, 0x55740813d09a) = 0x557407f39a32
62+
0.000119 __printf_chk(1, 0x557407f39a32, 0x55740813d09a, 0) = 15
63+
0.000117 getgrgid(0x6be26a81, 15, 0x7f9996950860, 0) = 0x7f9996951340
64+
0.290352 __printf_chk(1, 0x557407f39a2d, 0x557409bb9780, 0) = 14
65+
0.000174 realloc(nil, 40) = 0x557409bb9b90
66+
26.000152 getgrouplist(0x557409bb85c0, 0x6be26a81, 0x557409bb9b90, 0x7ffc51b78934) = 0xffffffff
67+
6.450594 realloc(0x557409bb9b90, 360)
68+
69+
On a side note, we can try to reduce the time spent in id lookup to some extent not that significantly by reducing the ``ldap_group_nesting_level`` to 0. The default value is 2.
70+
71+
72+
Mount the cache in tmpfs
73+
************************
74+
Quite a bit of the time spent processing the request is writing LDAP objects to the cache. Because the cache maintains full ACID properties, it does disk syncs with every internal SSSD transaction, which causes data to be written to disk. On the plus side, this makes sure the cache is always available in case of a network outage and will be usable after a machine crash, but on the other hand, writing the data takes time. You can mount the cache into a ramdisk, eliminating the disk IO cost.
75+
76+
- On RHEL 8.6 and higher, confirm that the SSSD user owns the config.ldb file:
77+
78+
.. code-block:: bash
79+
80+
# ls -al /var/lib/sss/db/config.ldb
81+
-rw-------. 1 sssd sssd 1286144 Jun 8 16:41 /var/lib/sss/db/config.ldb
82+
83+
In this case, add the following entry to the /etc/fstab file as a single line:
84+
85+
.. code-block:: bash
86+
87+
tmpfs /var/lib/sss/db/ tmpfs size=300M,mode=0700,uid=sssd,gid=sssd,rootcontext=system_u:object_r:sssd_var_lib_t:s0 0 0
88+
89+
- On RHEL 8 versions lower than 8.6, the config.ldb file is owned by the root user:
90+
91+
.. code-block:: bash
92+
93+
# ls -al /var/lib/sss/db/config.ldb
94+
-rw-------. 1 root root 1286144 Jun 8 14:15 /var/lib/sss/db/config.ldb
95+
96+
In this case, add the following entry to the /etc/fstab file as a single line:
97+
98+
.. code-block:: bash
99+
100+
tmpfs /var/lib/sss/db/ tmpfs size=300M,mode=0700,rootcontext=system_u:object_r:sssd_var_lib_t:s0 0 0
101+
102+
Tune the size parameter according to your IdM and AD directory size, estimating 100 MBs per 10,000 LDAP entries.
103+
104+
- Then mount the directory and restart the sssd afterwards:
105+
106+
.. code-block:: bash
107+
108+
# mount /var/lib/sss/db/
109+
# systemctl restart sssd
110+
111+
- Pros: I/O operations on the cache are much faster.
112+
- Cons: The cache does not persist across reboots. Practically, this means the cache must be rebuilt after machine reboot, but also that cached password are lost after a reboot.
113+
114+
Slow ssh/login
115+
**************
116+
117+
In order to troubleshoot slow ssh we should basically look into the ssh and sshd both part. We can enable ssh debugging or collect strace in order to check why ssh is slow `How can I gather straces from ssh and sshd? <https://access.redhat.com/solutions/5352541>`_. But here we will check from the sssd perspective if sssd is playing any role in slow ssh. In order to do that we will first valiadte if the ``id`` lookup is slow because at the time of ssh ``initgroup`` calls will be executed so we need to fix the slow id lookup which we have already discussed.
118+
If the ``id`` lookup is fast then we need to check the sssd debug logs. Go to the ``/var/log/sssd/sssd_$domain.log`` search for ``PAM_AUTH`` section this is the first phase of authentication we need to validated if this has been passed properly or we are noticing any issues here. A common example we often notice ``[krb5_child_timeout] (0x0040): Timeout for child [23514] reached.`` this indicates the KDC server is responding very slowly due to some reason. One of them could be an issue with the firewall or a slow network. As a workaround, consider increasing the value of ``krb5_auth_timeout`` which is 6 seconds by default.
119+
120+
.. code-block:: sssd-log
121+
122+
(Fri Apr 14 16:37:19 2023) [sssd[be[example.com]]] [krb5_child_timeout] (0x0040): Timeout for child [23514] reached. In case KDC is distant or network is slow you may consider increasing value of krb5_auth_timeout.
123+
124+
Another common example is ``pam_id_timeout`` user group membership is set for processes during login time. Therefore, during PAM conversation, SSSD has to prefer precision over speed and contact the server for accurate information. However, a single login can span over multiple PAM requests as PAM processing is split into several stages – for example, there might be a request for authentication and a separate request for account check (HBAC). It’s not beneficial to contact the server separately for both requests, therefore we set a very short timeout for the PAM responder during which the requests will be answered from in-memory cache. The default value of 5 seconds might not be enough in cases where complex group memberships are populated on server and client side. The recommended value of this option is as long as a single un-cached login takes. Add ``pam_id_timeout = n `` under the ``[pam]`` section of the ``/etc/sssd/sssd.conf`` followed by a ``sssd`` restart.
125+
126+
You could also notice ``sdap_async_sys_connect request failed: [110]: Connection timed out.`` sdap_async_sys_connect request failed occurs if sssd is not able to connect to the LDAP server within 6 seconds. This could be an issue with DNS or the network. Validate the DNS SRV records; if SRV records are not working, hardcoding the AD/LDAP server may help here. For an example, if id_provider = ad is being used then hardcoding of AD servers can be done as: add ``ad_server = ad1.example.com, ad2.example.com`` under the ``[$domain]`` section of the ``/etc/sssd/sssd.conf``. If the network is slow or ldap_network_timeout is reached, then consider increasing the value of ``ldap_network_timeout`` which is set to 6 seconds by default.
127+
128+
.. code-block:: sssd-log
129+
130+
(Fri Apr 14 16:07:19 2023) [sssd[be[example.com]]] [sssd_async_socket_init_done] (0x0020): sdap_async_sys_connect request failed: [110]: Connection timed out.
131+
132+
If the ``PAM_AUTH`` section has passed successfully without any errors then ``sssd`` jump into the authorization section i.e: ``PAM_ACC``. In this section sssd checks for ``GPO`` access control or any other authorization set by the user example: ``simple_allow_groups``, ``simple_allow_user`` or ``access_filter``. Go to the ``/var/log/sssd/sssd_$domain.log`` serach for ``PAM_ACC`` section and check if we notice any error or delay here. Sometime we have noticed GPO processing is slow if the user a very large policy.
133+
134+
High CPU Usage by sssd
135+
**********************
136+
137+
High Memory Usage by sssd
138+
*************************

0 commit comments

Comments
 (0)