TROUBLESHOOTING: Performance tuning in sssd

Roy214 · Roy214 · commit 2dc140b6c0cd · 2023-08-30T16:43:38.000-05:00
This page will describe the performance related issues in sssd and how
to troubleshoot those issues.
diff --git a/src/contents.rst b/src/contents.rst
@@ -93,6 +93,7 @@ Table of Contents
    Basics <troubleshooting/basics>
    Backend <troubleshooting/backend>
    SSSD Errors <troubleshooting/errors>
+   Performance Tuning in SSSD <troubleshooting/performance>
    Log Analyzer <troubleshooting/analyzer>
    Fleet Commander <troubleshooting/fleet-commander>
    SUDO <troubleshooting/sudo>
diff --git a/src/troubleshooting/performance.rst b/src/troubleshooting/performance.rst
@@ -0,0 +1,138 @@
+Performance tuning in SSSD
+##########################
+
+Slow id lookup
+**************
+This has been noticed id lookup become slow if the LDAP/AD user is a member of large groups say for example user is a member of 500+ groups. ``id`` is very heavy. ``id`` does a lot under its hood.
+Behind the scenes, when the ``id $user`` command is executed it triggers the following:
+
+- Get user information - getpwnam() for the user
+
+- Get primary group information - getgrgid() for the primary group of the user
+
+- Get list of groups - getgrouplist() for the user
+
+- Get group information for each group returned from step 3 - getgrid() for all GIDs returned from getgrouplist() call.
+
+We need to identify out of the above 4 steps which step is actually slow. In order to collect detailed infromation we need to add ``debug_level = 9`` under the ``[$domain]`` section of the ``/etc/sssd/sssd.conf`` followed by a ``sssd`` restart. We often noticed step 4 is the step where sssd takes most of its time because the most data-intensive operation is downloading the groups including their members and by default this feature is enabled we can turn this off by setting ``ignore_group_members = true``. 
+Usually, we are interested in what groups a user is a member of (id aduser@ad_domain) as the initial step rather than what members do specific groups include (getent group adgroup@ad_domain). Setting the ignore_group_members option to True makes all groups appear as empty, thus downloading only information about the group objects themselves and not their members, providing a significant performance boost. Please note that id aduser@ad_domain would still return all the correct groups.
+
+- Pros: getgrnam/getgrgid calls are significantly faster.
+- Cons: getgrnam/getgrgid calls only return the group information, not the members
+
+**WARNING! If the compat tree is used, DO NOT SET ignore_group_members = true because it breaks the compatibility logic.**
+
+If after disbaling the group_members call still the look is slow in that case we can get into the sssd debug logs and verify how long the ``Initgroups`` call is taking this can be done by grepping the ``CR`` no. of that id lookup request. In this example here ``sssd_nss`` is taking ``1 sec`` to process the user group membership here we have only 39 groups associated with the user if the count is large say for example 500-600 and the ``ignore_group_members`` is not to set true then this is expected the id lookup will take some time with the cold cache.
+
+.. code-block:: sssd-log
+
+        $ grep 'CR #3\:' /var/log/sssd/sssd_nss.log
+	(2023-06-08 12:21:31): [nss] [cache_req_set_plugin] (0x2000): CR #3: Setting "Initgroups by name" plugin
+	(2023-06-08 12:21:31): [nss] [cache_req_send] (0x0400): CR #3: New request 'Initgroups by name'
+	(2023-06-08 12:21:31): [nss] [cache_req_process_input] (0x0400): CR #3: Parsing input name [roy]
+	(2023-06-08 12:21:31): [nss] [cache_req_set_name] (0x0400): CR #3: Setting name [roy]
+	(2023-06-08 12:21:31): [nss] [cache_req_select_domains] (0x0400): CR #3: Performing a multi-domain search
+	(2023-06-08 12:21:31): [nss] [cache_req_search_domains] (0x0400): CR #3: Search will check the cache and check the data provider
+	(2023-06-08 12:21:31): [nss] [cache_req_set_domain] (0x0400): CR #3: Using domain [redhat.com]
+	(2023-06-08 12:21:31): [nss] [cache_req_prepare_domain_data] (0x0400): CR #3: Preparing input data for domain [redhat.com] rules
+	(2023-06-08 12:21:31): [nss] [cache_req_search_send] (0x0400): CR #3: Looking up roy@redhat.com
+	(2023-06-08 12:21:31): [nss] [cache_req_search_ncache] (0x0400): CR #3: Checking negative cache for [roy@redhat.com]
+	(2023-06-08 12:21:31): [nss] [cache_req_search_ncache] (0x0400): CR #3: [roy@redhat.com] is not present in negative cache
+	(2023-06-08 12:21:31): [nss] [cache_req_search_cache] (0x0400): CR #3: Looking up [roy@redhat.com] in cache
+	(2023-06-08 12:21:31): [nss] [cache_req_search_send] (0x0400): CR #3: Object found, but needs to be refreshed.
+	(2023-06-08 12:21:31): [nss] [cache_req_search_dp] (0x0400): CR #3: Looking up [roy@redhat.com] in data provider
+	(2023-06-08 12:21:32): [nss] [cache_req_search_cache] (0x0400): CR #3: Looking up [roy@redhat.com] in cache
+	(2023-06-08 12:21:32): [nss] [cache_req_search_ncache_filter] (0x0400): CR #3: This request type does not support filtering result by negative cache
+	(2023-06-08 12:21:32): [nss] [cache_req_search_done] (0x0400): CR #3: Returning updated object [roy@redhat.com]
+	(2023-06-08 12:21:32): [nss] [cache_req_create_and_add_result] (0x0400): CR #3: Found 39 entries in domain redhat.com <---------
+	(2023-06-08 12:21:32): [nss] [cache_req_done] (0x0400): CR #3: Finished: Success
+
+The above troubleshooting can be done easily with ``sssctl analyze request list`` and ``sssctl analyze request show <request_no>``. For more details, please refer to :doc:`Log Analyzer <analyzer>`.
+
+Moreover, if the initgroup call is taking time then add ``debug_microseconds = true`` under the ``[$doamin]`` section of the ``/etc/sssd/sssd.conf`` and try to check the details ``sssd_nss.log`` and ``sssd_$domain.log`` where initgroup call is taking time. In ``sssd_$domain_.log`` start chaecking from ``[BE_USER]`` it will show you if sssd can retrieve the user after that sssd will start processing groups of the respective user which can be seen in ``[Initgroups]`` call in ``sssd_$domain.log``.By looking at this call you will be able to confirm if the user lookup and processing of membership is done properly. Here you will able see where the lookup is taking time or if there is any failure. Also, we can look at ``ltrace -r`` this will tell us which library calls is taking time. In the below example sssd is spending 26 sec in ``getgrouplist()``.
+
+.. code-block:: sssd-log
+
+  0.000976 strrchr("id", '/')                                                                                 = nil
+  ….
+  0.000383 __printf_chk(1, 0x557407f39a3c, 0x55740813d0ba, 0)                                                 = 14
+  0.000175 getpwuid(0x6be2724d, 14, 0x7f9996950860, 0)                                                        = 0x7f99969514e0
+  0.000860 __printf_chk(1, 0x557407f39a2d, 0x557409bb81b0, 0)                                                 = 9
+  0.000119 dcgettext(0, 0x557407f39a32, 5, 0x55740813d09a)                                                    = 0x557407f39a32
+  0.000119 __printf_chk(1, 0x557407f39a32, 0x55740813d09a, 0)                                                 = 15
+  0.000117 getgrgid(0x6be26a81, 15, 0x7f9996950860, 0)                                                        = 0x7f9996951340
+  0.290352 __printf_chk(1, 0x557407f39a2d, 0x557409bb9780, 0)                                                 = 14
+  0.000174 realloc(nil, 40)                                                                                   = 0x557409bb9b90
+  26.000152 getgrouplist(0x557409bb85c0, 0x6be26a81, 0x557409bb9b90, 0x7ffc51b78934)                           = 0xffffffff
+  6.450594 realloc(0x557409bb9b90, 360)
+
+On a side note, we can try to reduce the time spent in id lookup to some extent not that significantly by reducing the ``ldap_group_nesting_level`` to 0. The default value is 2.
+
+
+Mount the cache in tmpfs
+************************
+Quite a bit of the time spent processing the request is writing LDAP objects to the cache. Because the cache maintains full ACID properties, it does disk syncs with every internal SSSD transaction, which causes data to be written to disk. On the plus side, this makes sure the cache is always available in case of a network outage and will be usable after a machine crash, but on the other hand, writing the data takes time. You can mount the cache into a ramdisk, eliminating the disk IO cost.
+
+- On RHEL 8.6 and higher, confirm that the SSSD user owns the config.ldb file:
+
+.. code-block:: bash
+
+        # ls -al /var/lib/sss/db/config.ldb
+        -rw-------. 1 sssd sssd 1286144 Jun  8 16:41 /var/lib/sss/db/config.ldb
+
+In this case, add the following entry to the /etc/fstab file as a single line:
+
+.. code-block:: bash
+
+        tmpfs /var/lib/sss/db/ tmpfs size=300M,mode=0700,uid=sssd,gid=sssd,rootcontext=system_u:object_r:sssd_var_lib_t:s0 0 0
+
+- On RHEL 8 versions lower than 8.6, the config.ldb file is owned by the root user:
+
+.. code-block:: bash
+
+   # ls -al /var/lib/sss/db/config.ldb
+   -rw-------. 1 root root 1286144 Jun  8 14:15 /var/lib/sss/db/config.ldb
+
+In this case, add the following entry to the /etc/fstab file as a single line:
+
+.. code-block:: bash
+
+   tmpfs /var/lib/sss/db/ tmpfs size=300M,mode=0700,rootcontext=system_u:object_r:sssd_var_lib_t:s0 0 0
+
+Tune the size parameter according to your IdM and AD directory size, estimating 100 MBs per 10,000 LDAP entries.
+
+- Then mount the directory and restart the sssd afterwards:
+
+.. code-block:: bash 
+
+   # mount /var/lib/sss/db/
+   # systemctl restart sssd
+
+- Pros: I/O operations on the cache are much faster.
+- Cons: The cache does not persist across reboots. Practically, this means the cache must be rebuilt after machine reboot, but also that cached password are lost after a reboot.
+
+Slow ssh/login
+**************
+
+In order to troubleshoot slow ssh we should basically look into the ssh and sshd both part. We can enable ssh debugging or collect strace in order to check why ssh is slow `How can I gather straces from ssh and sshd? <https://access.redhat.com/solutions/5352541>`_. But here we will check from the sssd perspective if sssd is playing any role in slow ssh. In order to do that we will first valiadte if the ``id`` lookup is slow because at the time of ssh ``initgroup`` calls will be executed so we need to fix the slow id lookup which we have already discussed.
+If the ``id`` lookup is fast then we need to check the sssd debug logs. Go to the ``/var/log/sssd/sssd_$domain.log`` search for ``PAM_AUTH`` section this is the first phase of authentication we need to validated if this has been passed properly or we are noticing any issues here. A common example we often notice ``[krb5_child_timeout] (0x0040): Timeout for child [23514] reached.`` this indicates the KDC server is responding very slowly due to some reason. One of them could be an issue with the firewall or a slow network. As a workaround, consider increasing the value of ``krb5_auth_timeout`` which is 6 seconds by default.
+
+.. code-block:: sssd-log
+
+      (Fri Apr 14 16:37:19 2023) [sssd[be[example.com]]] [krb5_child_timeout] (0x0040): Timeout for child [23514] reached. In case KDC is distant or network is slow you may consider increasing value of krb5_auth_timeout.
+
+Another common example is ``pam_id_timeout`` user group membership is set for processes during login time. Therefore, during PAM conversation, SSSD has to prefer precision over speed and contact the server for accurate information. However, a single login can span over multiple PAM requests as PAM processing is split into several stages – for example, there might be a request for authentication and a separate request for account check (HBAC). It’s not beneficial to contact the server separately for both requests, therefore we set a very short timeout for the PAM responder during which the requests will be answered from in-memory cache. The default value of 5 seconds might not be enough in cases where complex group memberships are populated on server and client side. The recommended value of this option is as long as a single un-cached login takes. Add ``pam_id_timeout = n `` under the ``[pam]`` section of the ``/etc/sssd/sssd.conf`` followed by a ``sssd`` restart.
+
+You could also notice ``sdap_async_sys_connect request failed: [110]: Connection timed out.`` sdap_async_sys_connect request failed occurs if sssd is not able to connect to the LDAP server within 6 seconds. This could be an issue with DNS or the network. Validate the DNS SRV records; if SRV records are not working, hardcoding the AD/LDAP server may help here. For an example, if id_provider = ad is being used then hardcoding of AD servers can be done as: add ``ad_server = ad1.example.com, ad2.example.com`` under the ``[$domain]`` section of the ``/etc/sssd/sssd.conf``. If the network is slow or ldap_network_timeout is reached, then consider increasing the value of ``ldap_network_timeout`` which is set to 6 seconds by default.
+
+.. code-block:: sssd-log
+
+      (Fri Apr 14 16:07:19 2023) [sssd[be[example.com]]] [sssd_async_socket_init_done] (0x0020): sdap_async_sys_connect request failed: [110]: Connection timed out.
+
+If the ``PAM_AUTH`` section has passed successfully without any errors then ``sssd`` jump into the authorization section i.e: ``PAM_ACC``. In this section sssd checks for ``GPO`` access control or any other authorization set by the user example: ``simple_allow_groups``, ``simple_allow_user`` or ``access_filter``. Go to the ``/var/log/sssd/sssd_$domain.log`` serach for ``PAM_ACC`` section and check if we notice any error or delay here. Sometime we have noticed GPO processing is slow if the user a very large policy.
+
+High CPU Usage by sssd
+**********************
+
+High Memory Usage by sssd
+*************************