Conversation

sknat
Collaborator

@sknat sknat commented Sep 1, 2025

This patch splits the felix server in two pieces:

  • a felix watcher placed under agent/watchers/felix
  • a felix server placed under agent/felix

The former is responsible only for watching and
submitting events into a single event queue.
The latter receives the events in a single goroutine
and programs VPP from a single thread.

The intent is to move away from a model with multiple servers
replicating state and communicating over a pubsub. That model
is prone to race conditions and deadlocks, and provides few
benefits, as scale and asynchronicity are not a constraint on
nodes with a relatively small number of pods (~100), which is
the Kubernetes default.

sknat added 2 commits August 26, 2025 15:16
This patch changes the way we persist data on disk when
running Calico/VPP. Instead of using struc and a binary format,
we transition to JSON files. Size should not be an issue, as the
number of pods per node is typically low (~100). This will make
troubleshooting easier and errors clearer when parsing fails.

This lets us remove the /bin/debug troubleshooting utility,
which was only needed because the binary format was not human
readable.

Doing this, we also address an issue where PBL indexes were
reused upon dataplane restart, because they were stored in a
list. We now use a map to retain the containerIP mapping.

We also split the configuration from the runtime spec in
LocalPodSpec, and add a step to clear the runtime spec when the
corresponding VRFs are not found in VPP.

Finally, we address an issue where uRPF was not properly set up
for IPv6.

Signed-off-by: Nathan Skrzypczak <[email protected]>
This patch splits the felix server in two pieces:
- a felix watcher placed under `agent/watchers/felix`
- a felix server placed under `agent/felix`

The former is responsible only for watching and
submitting events into a single event queue.
The latter receives the events in a single goroutine
and programs VPP from a single thread.

The intent is to move away from a model with multiple servers
replicating state and communicating over a pubsub. That model
is prone to race conditions and deadlocks, and provides few
benefits, as scale and asynchronicity are not a constraint on
nodes with a relatively small number of pods (~100), which is
the Kubernetes default.

Signed-off-by: Nathan Skrzypczak <[email protected]>
@sknat sknat force-pushed the nsk-split-felix-server branch from 0354524 to adbe7fe Compare September 1, 2025 13:35
@sknat sknat self-assigned this Sep 2, 2025