L2 forwarding & learning

An L2 network is a single broadcast domain stretched across every node that hosts it, sharing one IP pool (see Networks). There is no kernel bridge spanning nodes; instead two cluster-scoped resources hold the L2 forwarding state, and every node projects them into its eBPF maps.

The data plane forwards from two maps, each backed by a resource that is the cluster-wide source of truth:

  • FDBEntry{network, MAC} → a delivery target and the endpoint’s policy segment. Drives both L2 forwarding and L3→L2 resolution.
  • Neighbor{network, IP} → MAC. Answers ARP and NDP locally (suppression) and resolves L3→L2; only IPs inside the network’s prefixes get one.

Every node watches both resources and programs its own copy of the maps. When an entry’s endpoint is local, the delivery target resolves to that pod’s device; when it is on another node, to that node’s SRv6 SID, so a frame for a remote MAC is encapsulated. The control plane never writes a node’s maps directly.
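
As a rough illustration of that projection, the sketch below resolves one cluster-wide binding into the target a given node would program. All type and function names here (`FDBKey`, `Binding`, `Target`, `resolveTarget`) and the sample SID are hypothetical, chosen only for the example; the real resources and map layout may differ.

```go
package main

import "fmt"

// FDBKey mirrors the {network, MAC} key (hypothetical type name).
type FDBKey struct {
	Network string
	MAC     string
}

// Binding is the cluster-wide view of one FDB entry (hypothetical fields).
type Binding struct {
	Node    string // node hosting the endpoint
	Device  string // the pod's device on that node
	Segment uint32 // the endpoint's policy segment
}

// Target is what a node programs into its own copy of the map.
type Target struct {
	Local   bool
	Device  string // set for a local endpoint
	SID     string // set for a remote endpoint: the owning node's SRv6 SID
	Segment uint32
}

// resolveTarget projects a binding from the point of view of node self.
// sids maps node name to SRv6 SID; ok is false when the remote SID is unknown.
func resolveTarget(self string, b Binding, sids map[string]string) (Target, bool) {
	if b.Node == self {
		return Target{Local: true, Device: b.Device, Segment: b.Segment}, true
	}
	sid, ok := sids[b.Node]
	if !ok {
		return Target{}, false
	}
	return Target{SID: sid, Segment: b.Segment}, true
}

func main() {
	sids := map[string]string{"node-b": "fd00:0:b::"}
	key := FDBKey{Network: "blue", MAC: "02:00:00:00:00:01"}
	b := Binding{Node: "node-b", Device: "veth1", Segment: 7}

	for _, self := range []string{"node-a", "node-b"} {
		fdb := map[FDBKey]Target{} // this node's private projection
		if t, ok := resolveTarget(self, b, sids); ok {
			fdb[key] = t
		}
		fmt.Printf("%s: %+v\n", self, fdb[key])
	}
}
```

The same binding yields a device target on the hosting node and a SID target everywhere else, which is why each node can build its maps independently from the shared resources.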

Each resource splits into:

  • spec — the immutable key: the network, the MAC or IP, and a source of static or learned.
  • status — the resolved binding: the owning endpoint, the managing node, the segment (FDBEntry), the resolved MAC (Neighbor), and a lastObserved timestamp for learned entries.

source is immutable and separates two independent writers — the static path and the learning loop — so they can never clobber each other. A static and a learned entry for the same key cannot coexist, since both derive the same name from their key.
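
A minimal sketch of why the two cannot coexist, assuming a hypothetical naming scheme (a truncated SHA-256 of the key) and a create-only store. The actual name derivation is not specified here; the only property that matters is that `source` is not part of the name.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// FDBEntrySpec is a hypothetical rendering of the immutable spec.
type FDBEntrySpec struct {
	Network string
	MAC     string
	Source  string // "static" or "learned"
}

// entryName derives the object name from the key only (assumed scheme).
// Source is deliberately left out, so both writers compute the same name.
func entryName(s FDBEntrySpec) string {
	sum := sha256.Sum256([]byte(s.Network + "/" + strings.ToLower(s.MAC)))
	return "fdb-" + hex.EncodeToString(sum[:8])
}

// create mimics create-only semantics: a second object with the same
// name is rejected instead of overwriting the first.
func create(store map[string]FDBEntrySpec, s FDBEntrySpec) bool {
	name := entryName(s)
	if _, exists := store[name]; exists {
		return false
	}
	store[name] = s
	return true
}

func main() {
	store := map[string]FDBEntrySpec{}
	static := FDBEntrySpec{Network: "blue", MAC: "02:00:00:00:00:01", Source: "static"}
	learned := FDBEntrySpec{Network: "blue", MAC: "02:00:00:00:00:01", Source: "learned"}

	fmt.Println("static created:", create(store, static))
	fmt.Println("learned created:", create(store, learned))
	fmt.Println("same name:", entryName(static) == entryName(learned))
}
```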

Ordinary pods have their MAC and addresses the moment they attach, so their bindings need no learning. When a pod joins an L2 network, the node hosting it creates the static FDBEntry for the pod’s MAC and a static Neighbor per address, and claims the endpoint with a finalizer.

The lifecycle of these static entries is tied to that endpoint: when the pod is removed, the node deletes them as part of tearing the endpoint down. Static entries carry no timestamp and are never swept.

Workloads whose MAC isn’t known up front — VMs, macvlan guests, live-migrated workloads — are discovered from their traffic instead.

The L2 entry program inspects the source {MAC, IP} of the frames it forwards: ARP and NDP, and also every unicast IP frame, so an endpoint that never sends ARP or NDP is still seen as soon as it carries traffic. A per-CPU table suppresses repeats so a busy binding isn’t reported on every packet; new or due-for-refresh bindings are emitted on a ring buffer.
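
The suppression logic can be modelled in a few lines. The sketch below is a Go model of one CPU's table, not the eBPF program itself; the 30-second interval, the `obsKey` layout, and the function names are assumptions made for the example.

```go
package main

import "fmt"

// Hypothetical refresh interval in nanoseconds, chosen only for the example.
const refreshNs uint64 = 30_000_000_000

// obsKey is the observed source binding (hypothetical layout).
type obsKey struct {
	MAC string
	IP  string
}

// suppressor models one CPU's table: key -> time of the last report.
type suppressor struct {
	last map[obsKey]uint64
}

func newSuppressor() *suppressor {
	return &suppressor{last: map[obsKey]uint64{}}
}

// shouldEmit reports whether an observation at time now (monotonic
// nanoseconds) should be pushed to the ring buffer. A binding is emitted
// when it is new or when its last report is at least refreshNs old.
func (s *suppressor) shouldEmit(k obsKey, now uint64) bool {
	prev, seen := s.last[k]
	if seen && now-prev < refreshNs {
		return false // reported recently: suppress
	}
	s.last[k] = now
	return true
}

func main() {
	cpu0 := newSuppressor()
	k := obsKey{MAC: "02:00:00:00:00:01", IP: "10.0.0.5"}
	sec := uint64(1_000_000_000)
	for _, now := range []uint64{0, 1 * sec, 29 * sec, 30 * sec, 31 * sec} {
		fmt.Printf("t=%ds emit=%v\n", now/sec, cpu0.shouldEmit(k, now))
	}
}
```

Because each CPU keeps its own table in this model, a binding seen on several CPUs may be reported once per CPU within an interval; that trades a few duplicate events for a lock-free fast path.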

Not every address behind a MAC should become an IP→MAC binding. The network’s prefixes are the allow-list:

  • an address within a prefix is learned fully — the MAC (FDBEntry) and the IP→MAC binding (Neighbor),
  • an address outside every prefix, or no address at all yet, is learned as a partial: the MAC only, never a Neighbor. A router MAC fronting nested workloads forwards their out-of-prefix IPs under its own MAC; those must not enter the neighbor table, but the forwarding MAC itself still must.
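
The allow-list check above can be sketched with Go's `net/netip` package. The function names are hypothetical, and treating the unspecified address (for example 0.0.0.0) as "no address yet" is an assumption of this sketch rather than documented behavior.

```go
package main

import (
	"fmt"
	"net/netip"
)

// wantsNeighbor reports whether an observed source address should also
// produce a Neighbor. The FDBEntry for the MAC is learned either way.
func wantsNeighbor(prefixes []netip.Prefix, addr netip.Addr) bool {
	// Assumption: a missing or unspecified address counts as "none yet".
	if !addr.IsValid() || addr.IsUnspecified() {
		return false
	}
	for _, p := range prefixes {
		if p.Contains(addr) {
			return true // inside the allow-list: learned fully
		}
	}
	return false // outside every prefix: partial, MAC only
}

// wantsNeighborStr is a string-based convenience wrapper for the example.
func wantsNeighborStr(prefixes []string, addr string) bool {
	ps := make([]netip.Prefix, 0, len(prefixes))
	for _, s := range prefixes {
		ps = append(ps, netip.MustParsePrefix(s))
	}
	a, _ := netip.ParseAddr(addr) // empty or invalid input yields the zero Addr
	return wantsNeighbor(ps, a)
}

func main() {
	prefixes := []string{"10.0.0.0/24", "fd00::/64"}
	for _, addr := range []string{"10.0.0.5", "fd00::1", "192.168.1.9", ""} {
		fmt.Printf("%q neighbor=%v\n", addr, wantsNeighborStr(prefixes, addr))
	}
}
```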

The node plugin reads the ring buffer, resolves each event’s source device to the local NetworkEndpoint that owns it, and stages the observation; a reconcile then creates or refreshes the matching learned FDBEntry and Neighbor. The node that owns the source endpoint is the one that writes the binding.

status.lastObserved records when the binding was last seen. To avoid a write on every packet of a hot binding, the managing node refreshes it only on a coarse cadence — though a change to the segment, endpoint, or managing node republishes immediately regardless.

A learned entry is owned by the node that last observed it, recorded in status.managingNode. Only the node actually seeing a workload’s frames observes its binding, so ownership follows the workload: when it moves, the binding surfaces on the new node, which claims managingNode the moment it observes it. Takeover is not debounced — the data plane already rate-limits how often a binding resurfaces, so ownership follows a move as fast as it is seen.
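
The coarse refresh and the immediate takeover can be sketched as one decision function. The `Status` fields follow the status described earlier, but the Go types, the `reconcile` name, and the 30-second cadence are hypothetical values picked for the example.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical refresh cadence, chosen only for the example.
const refreshEvery = 30 * time.Second

// Status is a hypothetical rendering of a learned entry's status.
type Status struct {
	Endpoint     string
	ManagingNode string
	Segment      uint32
	LastObserved time.Time
}

// reconcile compares the published status with a fresh local observation
// and returns the status to publish plus whether a write is needed.
func reconcile(cur, seen Status) (Status, bool) {
	changed := cur.Endpoint != seen.Endpoint ||
		cur.ManagingNode != seen.ManagingNode ||
		cur.Segment != seen.Segment
	if changed {
		return seen, true // moved or changed: republish immediately
	}
	if seen.LastObserved.Sub(cur.LastObserved) >= refreshEvery {
		return seen, true // unchanged, but due for a refresh
	}
	return cur, false // unchanged and fresh: skip the write
}

// at is a helper for the example: the Unix epoch plus sec seconds.
func at(sec int) time.Time { return time.Unix(int64(sec), 0) }

func main() {
	cur := Status{Endpoint: "vm-1", ManagingNode: "node-a", Segment: 7, LastObserved: at(0)}

	hot := cur
	hot.LastObserved = at(10)
	_, write := reconcile(cur, hot)
	fmt.Println("hot binding, 10s later:", write)

	moved := hot
	moved.ManagingNode = "node-b" // the workload now appears on node-b
	next, write := reconcile(cur, moved)
	fmt.Println("takeover:", write, "managed by", next.ManagingNode)
}
```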

Because refresh is traffic-driven, a binding that goes silent must eventually be reaped. Each node periodically sweeps its own learned entries and deletes those whose lastObserved is older than the expiry window. Static entries, and entries managed by other nodes, are left alone.

The intervals are ordered refresh < expiry, so a binding that is briefly idle or mid-takeover is never deleted out from under an active workload. Each delete carries a resource-version precondition, so an entry refreshed between the sweep’s list and its delete is skipped rather than lost.
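
A sketch of the sweep and its precondition, using an in-memory stand-in for the API server. The `Store` type, its methods, and the 5-minute expiry are assumptions for illustration; only the rules (own learned entries only, older than expiry, delete guarded by resource version) come from the text above.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"time"
)

// Hypothetical expiry window, chosen only for the example.
const expiry = 5 * time.Minute

var errPrecondition = errors.New("entry missing or resource version changed")

type Entry struct {
	Name            string
	Learned         bool
	ManagingNode    string
	LastObserved    time.Time
	ResourceVersion int
}

// Store is an in-memory stand-in for the API server.
type Store struct{ items map[string]Entry }

// List returns a snapshot of all entries, sorted by name.
func (s *Store) List() []Entry {
	out := make([]Entry, 0, len(s.items))
	for _, e := range s.items {
		out = append(out, e)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	return out
}

// Refresh updates lastObserved and bumps the resource version.
func (s *Store) Refresh(name string, seen time.Time) {
	if e, ok := s.items[name]; ok {
		e.LastObserved = seen
		e.ResourceVersion++
		s.items[name] = e
	}
}

// Delete removes the entry only if its resource version still matches.
func (s *Store) Delete(name string, rv int) error {
	e, ok := s.items[name]
	if !ok || e.ResourceVersion != rv {
		return errPrecondition
	}
	delete(s.items, name)
	return nil
}

// sweep deletes self's learned entries that the snapshot shows as expired.
func sweep(s *Store, snapshot []Entry, self string, now time.Time) []string {
	var deleted []string
	for _, e := range snapshot {
		if !e.Learned || e.ManagingNode != self {
			continue // static entries and other nodes' entries are left alone
		}
		if now.Sub(e.LastObserved) <= expiry {
			continue
		}
		if err := s.Delete(e.Name, e.ResourceVersion); err != nil {
			continue // refreshed since the list: skip rather than lose it
		}
		deleted = append(deleted, e.Name)
	}
	return deleted
}

// at is a helper for the example: the Unix epoch plus sec seconds.
func at(sec int) time.Time { return time.Unix(int64(sec), 0) }

func main() {
	s := &Store{items: map[string]Entry{
		"idle":   {Name: "idle", Learned: true, ManagingNode: "node-a", LastObserved: at(0), ResourceVersion: 1},
		"busy":   {Name: "busy", Learned: true, ManagingNode: "node-a", LastObserved: at(0), ResourceVersion: 1},
		"static": {Name: "static", ManagingNode: "node-a", ResourceVersion: 1},
		"other":  {Name: "other", Learned: true, ManagingNode: "node-b", LastObserved: at(0), ResourceVersion: 1},
	}}
	snapshot := s.List()
	s.Refresh("busy", at(590)) // a refresh lands between the list and the delete
	fmt.Println("deleted:", sweep(s, snapshot, "node-a", at(600)))
}
```

In the run above, "busy" looks expired in the stale snapshot, but its refresh bumped the resource version, so the guarded delete fails and the entry survives.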

One gap: a managing node that dies cannot sweep its own entries. Another node reclaims them only if it still sees the binding, so a binding that has genuinely disappeared while its managing node is dead lingers.