Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bgpd: check rmac and nh of evpn imported routes #18372

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

dawkopagh
Copy link

Fix for an issue which has been described here: #18240
Basically in a setup where multiple peers advertise identical evpn type-5 rotues, if we clear session with any of them (and at least one is still up), we observe route withdrawals from kernel (even though we receive those routes from remaining peers). If the amount of routes is significant it causes a noticeable downtime before routes are readded to kernel.

I'd appreciate a feedback on why a route withdrawal is necessary in such case (per the comment above I can assume that rmac entry/nh might be lingering somewhere, but where? Change which introduced withdrawal has been merged long time ago, is it still necessary?)

Also menitoned behavior (where routes are being withdrawn immediately even if other peers advertise them) has started to occur after backpressure bgp zebra client #15524.

@donaldsharp
Copy link
Member

could you please add the verbiage you added to the PR to the actual commit message? I do not want to loose this data

Check rmac and nh of new bestpath for evpn imported prefix before withdrawal. Currently when
new bestpath is designated, evpn imported routes are being withdrawn from kernel causing
downtime which length depends on amount of routes to process.

Basically in a setup where multiple peers advertise identical evpn type-5 routes,
if we clear session with any of them (and at least one is still up), we observe
route withdrawals from kernel (even though we receive those routes from remaining peers).
If the amount of routes is significant it causes a noticeable downtime before routes are
readded to kernel.

Also menitoned behavior (where routes are being withdrawn immediately
even if other peers advertise them) has started to occur after
backpressure bgp zebra client FRRouting#15524

Let's check rmac entry and nh of new selected bestpath and do not actually withdraw them from kernel if
those two are the same. This fixes and issue where the same routes are being advertised by multiple
peers and we clear session with one of them FRRouting#18240

Signed-off-by: Dawid Kopec <[email protected]>
@dawkopagh dawkopagh force-pushed the check_rmac_nh_for_evpn_imported_route branch from 8833051 to 4865a82 Compare March 12, 2025 18:04
@ton31337
Copy link
Member

@Mergifyio backport stable/10.3 stable/10.2 stable/10.1

Copy link

mergify bot commented Mar 12, 2025

backport stable/10.3 stable/10.2 stable/10.1

🟠 Waiting for conditions to match

  • merged [📌 backport requirement]

@ton31337
Copy link
Member

@raja-rajasekar opinion?

@raja-rajasekar
Copy link
Contributor

raja-rajasekar commented Mar 12, 2025

The withdraw actual shouldn't be at this point here(An action item I had in my plate for some time but never got a chance to do it), and https://github.com/FRRouting/frr/pull/18158/files addressed that issue (Not yet merged).

This is because we can always end up in a situation where we immediately withdraw the route while the install is pending for later (backpressure), thereby blackhole-ing the traffic.

So I am not sure why we need to patch this code, and rather maybe directly check with the fix in https://github.com/FRRouting/frr/pull/18158/files (Of course this does some other things as well which I haven't looked at it yet, however the commit which removes this block is what I am talking about)

@dawkopagh
Copy link
Author

Indeed #18158 is a better fix and would be ideal to have. I have tested my case with frr build from that PR and there is no downtime (which is expected as the snippet with route withdrawal has been deleted).

However there were concerns with breaking a standard pattern in zebra and there were no updates under the PR so I thought that fix might not eventually make it to master and thought of less invasive fix addressing the case where we have an update with old and new bestpaths with the same next-hop resulting in a downtime due to kernel add/del operations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants