Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

staticd: Fix an issue where SRv6 SIDs may not be allocated on heavily loaded systems #18317

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

cscarpitta
Copy link
Contributor

@cscarpitta cscarpitta commented Mar 5, 2025

On a heavily loaded system, it may take some time to establish a connection between staticd and SRv6 SID Manager.
As a result, staticd might try to allocate SIDs in SID Manager before the connection is fully established, leading to a failure.

This issue is reflected in the log with the following error message:

2025/03/04 15:24:01 STATIC: [HR66R-TWQYD][EC 100663302] srv6_manager_get_locator: invalid zclient socket

In this scenario, SIDs are still kept in the staticd configuration, but not allocated and not installed in the RIB.

This commit fixes the issue by forcing staticd to retry to allocate the SIDs once the connection with SID Manager is successfully established.

@frrbot frrbot bot added the staticd label Mar 5, 2025
@cscarpitta cscarpitta changed the title staticd: Allocate SRv6 SIDs after connecting with zebra staticd: Fix an issue where SRv6 SIDs may not be allocated on heavily loaded systems Mar 5, 2025
On a heavily loaded system, it may take some time to establish a
connection between staticd and SRv6 SID Manager.
As a result, staticd might try to allocate SIDs in SID Manager before the
connection is fully established, leading to a failure.

This issue is reflected in the log with the following error message:

```
2025/03/04 15:24:01 STATIC: [HR66R-TWQYD][EC 100663302] srv6_manager_get_locator: invalid zclient socket
```

In this scenario, SIDs are still kept in the staticd configuration,
but not allocated and not installed in the RIB.

This commit fixes the issue by forcing staticd to retry to allocate
the SIDs once the connection with SID Manager is successfully
established.

Signed-off-by: Carmine Scarpitta <[email protected]>
In same cases, it may happen staticd attempts to allocate a SID when it
does not know about the parent locator yet.
In this case, the locator pointer stored inside the SID data structure
is NULL. When `static_zebra_request_srv6_sid` attempts to dereference
the locator pointer, we get a NULL pointer dereference crash.

This commit adds a check to prevent the crash from happening.

If the locator pointer is NULL, it means that staticd is not aware
about the parent locator yet. Then, staticd should call
`static_zebra_srv6_manager_get_locator()` to get the locator from SRv6
Manager. Once staticd receives the requested locator, the existing
callback `static_zebra_process_srv6_locator_internal()` will take care
of retrying to allocate the SID.

Signed-off-by: Carmine Scarpitta <[email protected]>
@cscarpitta cscarpitta force-pushed the fix/static_request_srv6_sids branch from 896e120 to ecad2bd Compare March 5, 2025 16:27
@@ -921,6 +921,11 @@ extern void static_zebra_request_srv6_sid(struct static_srv6_sid *sid)
if (!sid)
return;

if (!sid->locator) {
static_zebra_srv6_manager_get_locator(sid->locator_name);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is not it that we missed the srv6 locator event ?
can you please confirm if the srv6 locators are really populated in static?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cscarpitta , I reformulate.
If staticd did not have access to locator information, this means zebra may not have sent locators information to static.

  • either static is not part of the daemons interested in locators
  • or a retry needs to be done
  • or your patch is fine

please advert,

Copy link
Member

@pguibert6WIND pguibert6WIND Mar 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1- This protection should be done for all dameons that read SRv6 SID. The same code pattern should be applied to ISIS and BGP.

2- A retry mechanism for locator should also be put in place. This hook should be provisioned in zebra, so that the current locator are resent to the various CP routing clients.

 hook_call(zserv_client_connect, client);

The 2nd part can be part of a separate pull request.

@Jafaral Jafaral added this to the 10.3 milestone Mar 6, 2025
Copy link

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants