Federation delivery observability
Event Federation (ADR 0006, Phase 2) pushes events between routerd nodes via
the outbox. This guide explains how to verify that deliveries are healthy, spot
problems, and act on them using routerctl.
Quick check
routerctl doctor federation --state-file /var/lib/routerd/routerd.db
A healthy system reports no WARN or FAIL results; a SKIP row only means the check had nothing to evaluate:
DOCTOR PASS pass=7 warn=0 fail=0 skip=0
AREA STATUS CHECK DETAIL REMEDY
federation PASS cloudedge/leaf-az failed deliveries no failed deliveries
federation PASS cloudedge/leaf-az pending deliveries no pending deliveries
federation PASS cloudedge/leaf-az stale TTL no stale TTL deliveries
federation PASS cloudedge/leaf-az delivery lag max delivery lag 2s
federation PASS cloudedge/leaf-az event expiry nearest event expires in 1740s
federation PASS cloudedge/leaf-az expected delivery all 3 active event(s) have delivery rows
federation SKIP expected peers no self-emitted active events to deliver
Commands
Delivery summary
Per-(group, peer) aggregate of all active delivery rows:
routerctl federation deliveries summary \
--group cloudedge \
--state-file /var/lib/routerd/routerd.db
Output:
GROUP PEER EVENTS DELIVERED STALE_TTL FAILED PENDING MAX_LAG MIN_EXPIRES_IN
cloudedge leaf-az 3 3 0 0 0 2s 29m0s
cloudedge leaf-oci 3 2 0 1 0 4s 29m0s
Add -o json or -o yaml for machine-readable output. Use --include-expired
to include events whose TTL has already passed.
Doctor federation
routerctl doctor federation runs two categories of checks:
- Recorded-delivery checks: examine existing delivery rows per (group, peer) for failures, pending expiry, stale TTL, and delivery lag.
- Expected-peer audit: derive the expected peer set from EventGroup and EventPeer resources in the startup config, then verify that every self-emitted active event has a delivery row for each expected peer.
Run the federation area on its own, passing --config to enable the expected-peer audit:
routerctl doctor federation \
--state-file /var/lib/routerd/routerd.db \
--config /usr/local/etc/routerd/router.yaml
--config is needed for expected-peer checks because the audit reads
EventGroup (for the self node name) and EventPeer (for declared peers) from
the startup config.
Reading the summary table
| Column | Meaning |
|---|---|
| GROUP | EventGroup name (e.g. cloudedge) |
| PEER | Target peer node name |
| EVENTS | Total active event-delivery pairs |
| DELIVERED | Successfully pushed to peer |
| STALE_TTL | Delivered but the event TTL was refreshed since delivery (event.expires_at > delivery.event_expires_at); the peer holds a stale copy |
| FAILED | All retry attempts exhausted |
| PENDING | Enqueued but not yet delivered |
| MAX_LAG | Worst-case time between event observation and delivery |
| MIN_EXPIRES_IN | Time until the soonest event expires; negative means already expired |
Healthy state: DELIVERED == EVENTS, FAILED == 0, PENDING == 0,
STALE_TTL == 0.
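The healthy-state rule above can be expressed as a small predicate. This is a minimal sketch, assuming each summary row has been loaded into a dict keyed by lowercase column names (events, delivered, stale_ttl, failed, pending); those keys are hypothetical, and the field names in the real -o json output may differ.

```python
def is_healthy(row: dict) -> bool:
    """Return True when a (group, peer) summary row meets the healthy-state rule."""
    return (
        row["delivered"] == row["events"]
        and row["failed"] == 0
        and row["pending"] == 0
        and row["stale_ttl"] == 0
    )


# Rows mirroring the sample summary table (key names are assumptions).
rows = [
    {"peer": "leaf-az", "events": 3, "delivered": 3, "stale_ttl": 0, "failed": 0, "pending": 0},
    {"peer": "leaf-oci", "events": 3, "delivered": 2, "stale_ttl": 0, "failed": 1, "pending": 0},
]
unhealthy = [r["peer"] for r in rows if not is_healthy(r)]
print(unhealthy)
```

With the sample rows, only leaf-oci is flagged, because one of its three deliveries failed.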
FederationSLO
You can declare a FederationSLO resource to override the default doctor
thresholds for a specific EventGroup:
apiVersion: federation.routerd.net/v1alpha1
kind: FederationSLO
metadata:
name: cloudedge-slo
spec:
groupRef: cloudedge
delivery:
lagWarnSeconds: 60 # default: 60
lagFailSeconds: 180 # default: 180
expiresSoonSeconds: 120 # default: 120
subscription:
maxPendingRuns: 0 # default: 0 (any pending triggers warn)
maxFailedRuns: 0 # default: 0 (any failure triggers fail)
When a FederationSLO is present, routerctl doctor federation uses its
thresholds instead of the hardcoded defaults. The JSON output includes an
slo object showing the effective thresholds and any violations.
Without a FederationSLO resource, the default thresholds apply unchanged.
Validation rules:
- spec.groupRef is required and must reference an existing EventGroup resource by metadata.name.
- Only one FederationSLO per EventGroup is allowed; duplicate groupRef values are rejected at config validation time.
- spec.delivery.lagWarnSeconds must be strictly less than lagFailSeconds after applying defaults (0 falls back to the default). For example, lagWarnSeconds: 200 with lagFailSeconds: 0 (default 180) is rejected because effective 200 >= 180.
- Zero values (0) mean "use the default threshold"; negative values are rejected.
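The threshold rules can be illustrated with a short sketch. It is not routerd's validator; it only mirrors the documented behaviour for spec.delivery (zero falls back to the default, negatives are rejected, and the effective warn threshold must be strictly below the fail threshold). The function name effective_delivery is made up for this example.

```python
# Documented defaults for spec.delivery.
DEFAULTS = {"lagWarnSeconds": 60, "lagFailSeconds": 180, "expiresSoonSeconds": 120}


def effective_delivery(spec: dict) -> dict:
    """Apply defaults to a spec.delivery mapping and enforce the rules above."""
    out = {}
    for key, default in DEFAULTS.items():
        value = spec.get(key, 0)
        if value < 0:
            raise ValueError(f"{key} must not be negative")
        out[key] = value if value != 0 else default  # 0 means "use the default"
    # Compare the effective values, not the raw ones.
    if out["lagWarnSeconds"] >= out["lagFailSeconds"]:
        raise ValueError(
            f"lagWarnSeconds {out['lagWarnSeconds']} >= "
            f"lagFailSeconds {out['lagFailSeconds']}"
        )
    return out


print(effective_delivery({"lagWarnSeconds": 30}))
```

Calling effective_delivery({"lagWarnSeconds": 200, "lagFailSeconds": 0}) raises ValueError, matching the rejected example in the rules.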
Remediation plan
Generate a plan of suggested actions without executing them:
routerctl doctor federation --remediation-plan -o json \
--state-file /var/lib/routerd/routerd.db \
--config /usr/local/etc/routerd/router.yaml
The plan includes structured actions with action, reason, targetPeer,
targetGroup, safe, and requiresOperatorApproval fields:
{
"remediationPlan": {
"generatedAt": "2026-06-18T01:00:00Z",
"actions": [
{
"action": "retry-failed-deliveries",
"reason": "1 of 1 delivery(s) failed",
"targetPeer": "leaf-az",
"targetGroup": "cloudedge",
"safe": true,
"requiresOperatorApproval": false
}
]
}
}
The --remediation-plan flag is opt-in; without it, no remediationPlan key
appears in the JSON output. The plan is read-only — it never mutates state.
Field meanings:
- safe: the suggested action is idempotent and can be retried without risk.
- requiresOperatorApproval: the action changes configuration and requires human review before execution.
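A script consuming the plan might separate actions that could be automated from those that need a human. The sketch below assumes the JSON shape shown in the sample; the second action is an illustrative entry (not captured output) built from the documented configure-peer-endpoint flags, and split_actions is a hypothetical helper name.

```python
import json

# First action copied from the sample output; second one is illustrative.
PLAN = json.loads("""
{
  "remediationPlan": {
    "generatedAt": "2026-06-18T01:00:00Z",
    "actions": [
      {"action": "retry-failed-deliveries", "reason": "1 of 1 delivery(s) failed",
       "targetPeer": "leaf-az", "targetGroup": "cloudedge",
       "safe": true, "requiresOperatorApproval": false},
      {"action": "configure-peer-endpoint", "reason": "EventPeer endpoint is empty",
       "targetPeer": "leaf-oci", "targetGroup": "cloudedge",
       "safe": false, "requiresOperatorApproval": true}
    ]
  }
}
""")


def split_actions(doc: dict) -> tuple:
    """Split plan actions into (automatable, needs_review)."""
    automatable, needs_review = [], []
    # remediationPlan is absent unless --remediation-plan was passed.
    for action in doc.get("remediationPlan", {}).get("actions", []):
        if action["safe"] and not action["requiresOperatorApproval"]:
            automatable.append(action)
        else:
            needs_review.append(action)
    return automatable, needs_review


auto, review = split_actions(PLAN)
print([a["action"] for a in auto])
print([a["action"] for a in review])
```

Treating anything that is not both safe and approval-free as needing review is a conservative choice for this sketch, not documented routerctl behaviour.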
Remediation action reference
| Action | Trigger | safe | requiresOperatorApproval |
|---|---|---|---|
| retry-failed-deliveries | Failed delivery rows exist | true | false |
| investigate-pending-deliveries | Pending deliveries with possible expiry risk | true | false |
| force-repush-stale-ttl | Stale TTL detected (event TTL refreshed since last push) | true | false |
| check-peer-connectivity | Delivery lag exceeds SLO threshold | true | false |
| configure-peer-endpoint | EventPeer endpoint is empty | false | true |
| investigate-missing-delivery-rows | Expected peer has no delivery rows | true | false |
| inspect-failed-subscription-runs | Subscription runs in failed status | true | false |
Doctor check reference
Recorded-delivery checks
These run per (group, peer) pair that has delivery rows in the state store.
Thresholds are derived from FederationSLO when present, otherwise defaults
apply.
| Check | FAIL | WARN | PASS |
|---|---|---|---|
| failed deliveries | Any delivery row in failed status | — | FAILED == 0 |
| pending deliveries | Pending events whose TTL expires within expiresSoonSeconds | Pending exists but no imminent expiry | PENDING == 0 |
| stale TTL | All delivered events have stale TTL (STALE_TTL == DELIVERED) | Some deliveries have stale TTL | STALE_TTL == 0 |
| delivery lag | Max lag >= lagFailSeconds (default 180 s) | Max lag >= lagWarnSeconds (default 60 s) | Below lagWarnSeconds |
| event expiry | Nearest expiry < expiresSoonSeconds (default 120 s) with pending/failed deliveries | Nearest expiry < expiresSoonSeconds, all delivered | Comfortable margin |
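Two rows of the table translate directly into code. The sketch below restates the delivery lag and stale TTL rules with the default thresholds; it illustrates the documented outcomes and is not the doctor's actual implementation. Function names are hypothetical.

```python
def classify_lag(max_lag_s: float, warn_s: int = 60, fail_s: int = 180) -> str:
    """Delivery lag: FAIL at or above fail_s, WARN at or above warn_s."""
    if max_lag_s >= fail_s:
        return "FAIL"
    if max_lag_s >= warn_s:
        return "WARN"
    return "PASS"


def classify_stale_ttl(stale: int, delivered: int) -> str:
    """Stale TTL: PASS when none, FAIL when every delivery is stale, else WARN."""
    if stale == 0:
        return "PASS"
    if stale == delivered:
        return "FAIL"
    return "WARN"


print(classify_lag(2), classify_lag(75), classify_lag(180))
print(classify_stale_ttl(0, 3), classify_stale_ttl(1, 3), classify_stale_ttl(3, 3))
```

Checking the fail threshold first matters: a lag of 180 s satisfies both comparisons and must report FAIL.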
Expected-peer audit
These check that the config-declared peers actually have delivery rows.
| Check | FAIL | SKIP | PASS |
|---|---|---|---|
| expected delivery | Self-emitted active events exist but no delivery row for this peer | No self-emitted events in the group | All active events have delivery rows |
| empty endpoint | EventPeer.spec.endpoint is empty | — | — |
The audit excludes the self node: if EventGroup.spec.nodeName matches
EventPeer.spec.nodeName, that peer is skipped (a node does not push events to
itself).
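The audit logic can be sketched as follows. The data shapes are assumptions made for illustration (a list of peer names, event dicts with id and source_node, and a set of (event_id, peer) pairs), and the node name hub is hypothetical. The sketch reports SKIP per peer for simplicity, whereas the sample doctor output shows a single SKIP row.

```python
def audit_expected_peers(self_node, peers, events, delivery_rows):
    """Return {peer: (status, missing_event_ids)} for each expected peer."""
    # Only self-emitted events are pushed by the outbox.
    self_emitted = [e["id"] for e in events if e["source_node"] == self_node]
    result = {}
    for peer in peers:
        if peer == self_node:
            continue  # a node does not push events to itself
        if not self_emitted:
            result[peer] = ("SKIP", [])
            continue
        missing = [eid for eid in self_emitted if (eid, peer) not in delivery_rows]
        result[peer] = ("FAIL" if missing else "PASS", missing)
    return result


events = [
    {"id": "evt-abc", "source_node": "hub"},
    {"id": "evt-def", "source_node": "hub"},
    {"id": "evt-ghi", "source_node": "hub"},
    {"id": "evt-xyz", "source_node": "leaf-az"},  # received, not self-emitted
]
rows = {
    ("evt-abc", "leaf-az"), ("evt-def", "leaf-az"), ("evt-ghi", "leaf-az"),
    ("evt-ghi", "leaf-oci"),
}
report = audit_expected_peers("hub", ["hub", "leaf-az", "leaf-oci"], events, rows)
print(report)
```

In this example leaf-az passes and leaf-oci fails with two of three events missing a delivery row.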
Common failures and remedies
EventPeer endpoint not set
FAIL cloudedge/leaf-oci expected delivery EventPeer endpoint is empty
set spec.endpoint on EventPeer/leaf-oci for group cloudedge
The EventPeer resource declares the peer but has no endpoint URL. The outbox
cannot push events without a target. Set spec.endpoint to the peer's federation
listener (e.g. https://10.252.0.3:8443/v1/federation/events).
Missing delivery rows for expected peer
FAIL cloudedge/leaf-oci expected delivery 2 of 3 active event(s) have no delivery row: evt-abc, evt-def
outbox never enqueued delivery for this peer; check EventPeer config and outbox peer filter
The outbox creates delivery rows only for events where SourceNode matches the
local EventGroup.spec.nodeName. If the self node name in the EventGroup does
not match the source_node column in federation_events, no delivery is
enqueued. Verify:
routerctl federation event list --group cloudedge --state-file /var/lib/routerd/routerd.db
Check the SOURCE column matches the EventGroup nodeName. Also confirm that
the outbox controller is running (eventd daemon) and that no types /
subjectPrefixes filter on the EventPeer silently excludes the events.
HMAC authentication mismatch
The outbox push returns HTTP 403 or 401. Check journalctl -u routerd-eventd
for authentication errors. Verify that both ends share the same
EventGroup.spec.auth.hmacSecretRef and that the referenced Secret exists and
contains the correct key.
Outbox not running
FAIL cloudedge/leaf-az pending deliveries 3 pending; 2 event(s) expire within 120s without delivery
outbox may be stalled or peer unreachable; check eventd logs and peer endpoint
Delivery rows were created but nothing was pushed. Confirm routerd-eventd is
running:
systemctl status routerd-eventd
journalctl -u routerd-eventd --since "10 min ago"
Stale TTL after refresh
WARN cloudedge/leaf-az stale TTL 1 of 3 delivered event(s) have stale TTL (event.expiresAt > delivery.eventExpiresAt)
outbox should re-push refreshed events on next tick; if this persists, check outbox interval and delivery filtering
An event's TTL was extended (e.g. by re-emitting with --ttl) but the delivery
record still holds the old event_expires_at. The outbox should detect this on
its next tick and re-push (PR #531). If the warning persists, check the outbox
interval and that the re-push logic is working:
routerctl federation event deliveries \
--group cloudedge --peer leaf-az \
--state-file /var/lib/routerd/routerd.db
Compare EVENT_EXPIRES_AT in the delivery row against the event's EXPIRES
column in routerctl federation event list.
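The comparison itself is a timestamp inequality. A minimal sketch, assuming both values are RFC 3339 UTC strings like the timestamps shown elsewhere in this guide; the sample values are invented for illustration.

```python
from datetime import datetime


def parse_ts(value: str) -> datetime:
    """Parse an RFC 3339 timestamp; map a trailing Z to an explicit UTC offset."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def is_stale(event_expires: str, delivery_event_expires: str) -> bool:
    """True when the event expiry is later than the expiry recorded at delivery."""
    return parse_ts(event_expires) > parse_ts(delivery_event_expires)


# The event TTL was refreshed 30 minutes after the last push (sample values).
print(is_stale("2026-06-17T13:00:00Z", "2026-06-17T12:30:00Z"))
```

Parsing before comparing avoids mistakes when the two strings use different offsets or precision.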
SourceNode does not match EventGroup nodeName
Events emitted with a --source-node that differs from the local
EventGroup.spec.nodeName are not pushed by the outbox (the outbox only pushes
self-emitted events). The expected-peer audit will show these events as missing
delivery rows. Verify the source node matches:
routerctl federation event list --group cloudedge -o json \
--state-file /var/lib/routerd/routerd.db | jq '.[].sourceNode'
SAMSubnetPolicy delivery verification
For CloudEdge SAM deployments, routerd.mobility.shard.assigned events carry
the shard assignment from the hub to leaf nodes. Verify delivery:
# Hub node: check deliveries to all leaf peers
routerctl federation deliveries summary --group cloudedge \
--state-file /var/lib/routerd/routerd.db
# Hub node: doctor check including expected-peer audit
routerctl doctor federation \
--state-file /var/lib/routerd/routerd.db \
--config /usr/local/etc/routerd/router.yaml
# Leaf node: confirm the event was received and the subscription processed it
routerctl federation event list --group cloudedge \
--state-file /var/lib/routerd/routerd.db
routerctl dynamic list --state-file /var/lib/routerd/routerd.db
On the leaf, routerctl dynamic list should show a DynamicConfigPart with
provenance routerd.net/event-group: cloudedge. If the event is delivered but
no dynamic config appears, the issue is on the subscription/plugin side, not
delivery — see Event Federation Subscription.
JSON output for automation
Both commands support JSON output for integration with monitoring or scripts:
# Summary as JSON
routerctl federation deliveries summary --group cloudedge -o json \
--state-file /var/lib/routerd/routerd.db
# Doctor as JSON
routerctl doctor federation -o json \
--state-file /var/lib/routerd/routerd.db \
--config /usr/local/etc/routerd/router.yaml
The doctor JSON includes summary.overall ("pass", "warn", or "fail")
for alerting thresholds. A non-"pass" overall status indicates at least one
check that warrants investigation.
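For monitoring systems that expect an exit code, summary.overall can be mapped to one. The sketch below uses the common 0/1/2/3 convention (OK, warning, critical, unknown); that mapping is a choice made for this example, not something routerctl defines.

```python
import json

# Exit-code mapping chosen for this example; 3 is used for anything unexpected.
EXIT_CODES = {"pass": 0, "warn": 1, "fail": 2}


def exit_code_for(doctor_json: str) -> int:
    """Translate doctor JSON into a monitoring-style exit code."""
    try:
        doc = json.loads(doctor_json)
    except json.JSONDecodeError:
        return 3
    if not isinstance(doc, dict):
        return 3
    overall = doc.get("summary", {}).get("overall")
    return EXIT_CODES.get(overall, 3)


print(exit_code_for('{"summary": {"overall": "warn"}}'))
```

Returning a distinct code for unparsable or missing output keeps a broken check from being mistaken for a healthy one.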
Federation summary in doctor JSON
When the federation area is checked, the doctor JSON includes a
federation object with aggregated delivery statistics:
{
"federation": {
"severityCounts": {"pass": 7, "warn": 0, "fail": 0, "skip": 1},
"failedDeliveryCount": 0,
"staleTTLCount": 0,
"pendingDeliveryCount": 0,
"missingExpectedPeerCount": 0,
"maxDeliveryLagSeconds": 2,
"minExpiresInSeconds": 1740,
"totalEvents": 3,
"totalDelivered": 3,
"subscriptionRunsTotal": 3,
"subscriptionRunsSucceeded": 3,
"subscriptionRunsFailed": 0,
"subscriptionRunsPending": 0,
"slo": {
"groups": [
{
"group": "cloudedge",
"defined": true,
"thresholds": {
"delivery": {
"lagWarnSeconds": 60,
"lagFailSeconds": 180,
"expiresSoonSeconds": 120
},
"subscription": {
"maxPendingRuns": 0,
"maxFailedRuns": 0
}
},
"violations": []
}
]
}
}
}
The slo.groups array contains one entry per EventGroup (union of config
and observed groups, sorted by name). Each entry has:
- defined: whether a FederationSLO resource is configured for this group
- thresholds.delivery: effective delivery lag/expiry thresholds
- thresholds.subscription: effective subscription run thresholds
- violations: threshold breaches with check name, threshold, actual value, and severity (empty [] when healthy)
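A script could walk slo.groups to report only the groups that breach their thresholds. In the sketch below the cloudedge entry follows the sample JSON, while the branch group and the keys inside its violation entry are assumptions, since only the meaning of a violation is documented here, not its field names.

```python
def groups_with_violations(doc: dict) -> dict:
    """Return {group: violations} for groups whose violations list is non-empty."""
    groups = doc.get("federation", {}).get("slo", {}).get("groups", [])
    return {g["group"]: g["violations"] for g in groups if g.get("violations")}


doctor = {
    "federation": {
        "slo": {
            "groups": [
                {"group": "cloudedge", "defined": True, "violations": []},
                # Hypothetical group; violation keys are assumed for illustration.
                {"group": "branch", "defined": False,
                 "violations": [{"check": "delivery lag", "threshold": 60,
                                 "actual": 75, "severity": "warn"}]},
            ]
        }
    }
}
breaches = groups_with_violations(doctor)
print(sorted(breaches))
```

The function passes violation entries through untouched, so it keeps working whatever the real field names are.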
jq examples
# Overall status for alerting
routerctl doctor federation -o json \
--state-file /var/lib/routerd/routerd.db \
--config /usr/local/etc/routerd/router.yaml \
| jq -r '.summary.overall'
# Failed delivery count (exit 1 if > 0)
routerctl doctor federation -o json \
--state-file /var/lib/routerd/routerd.db \
--config /usr/local/etc/routerd/router.yaml \
| jq -e '.federation.failedDeliveryCount == 0'
# Missing expected peers
routerctl doctor federation -o json \
--state-file /var/lib/routerd/routerd.db \
--config /usr/local/etc/routerd/router.yaml \
| jq '.federation.missingExpectedPeerCount'
# All failing checks with remedies
routerctl doctor federation -o json \
--state-file /var/lib/routerd/routerd.db \
--config /usr/local/etc/routerd/router.yaml \
| jq '[.checks[] | select(.status == "fail") | {name, detail, remedy}]'
# Stale TTL and pending counts for monitoring
routerctl doctor federation -o json \
--state-file /var/lib/routerd/routerd.db \
--config /usr/local/etc/routerd/router.yaml \
| jq '{stale: .federation.staleTTLCount, pending: .federation.pendingDeliveryCount, lag: .federation.maxDeliveryLagSeconds}'
# Subscription run failures (exit 1 if > 0)
routerctl doctor federation -o json \
--state-file /var/lib/routerd/routerd.db \
--config /usr/local/etc/routerd/router.yaml \
| jq -e '.federation.subscriptionRunsFailed == 0'
Subscription monitoring
When EventSubscription resources are configured, routerctl doctor federation
also checks subscription processing health. Thresholds are derived from the
FederationSLO for the subscription's groupRef:
| Check | FAIL | WARN | PASS |
|---|---|---|---|
| subscription runs | Failed > maxFailedRuns (default 0) | Pending > maxPendingRuns (default 0) | Within SLO thresholds |
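The subscription check follows the same pattern as the delivery checks. A minimal sketch of the documented rule with the default thresholds; the function name is hypothetical.

```python
def classify_subscription_runs(
    failed: int, pending: int, max_failed_runs: int = 0, max_pending_runs: int = 0
) -> str:
    """FAIL when failures exceed the SLO, WARN when pending runs do, else PASS."""
    if failed > max_failed_runs:
        return "FAIL"
    if pending > max_pending_runs:
        return "WARN"
    return "PASS"


print(classify_subscription_runs(failed=0, pending=0))
print(classify_subscription_runs(failed=0, pending=2))
print(classify_subscription_runs(failed=1, pending=2))
```

The comparisons are strict, so a threshold of 0 tolerates zero runs and any single failed or pending run trips the check.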
Inspecting subscription runs
List processing records for a specific subscription:
routerctl federation subscription runs \
--subscription EventSubscription/cloud-claims \
--state-file /var/lib/routerd/routerd.db
Output:
ID SUBSCRIPTION EVENT_ID GROUP PLUGIN STATUS ATTEMPTS STARTED COMPLETED ERROR
1 EventSubscription/cloud-claims evt-abc cloudedge cloud-claims succeeded 1 2026-06-17T12:00:00Z 2026-06-17T12:00:01Z
2 EventSubscription/cloud-claims evt-def cloudedge cloud-claims failed 3 2026-06-17T12:01:00Z 2026-06-17T12:01:05Z plugin exited 1
Add -o json for machine-readable output. Failed runs indicate the plugin
returned an error; check the plugin output and event payload.
End-to-end delivery verification
To verify the full pipeline from event emission to DynamicConfigPart:
# 1. Verify events are delivered
routerctl federation deliveries summary --group cloudedge \
--state-file /var/lib/routerd/routerd.db
# 2. Verify subscriptions processed events
routerctl federation subscription runs \
--subscription EventSubscription/cloud-claims \
--state-file /var/lib/routerd/routerd.db
# 3. Verify DynamicConfigParts were generated
routerctl dynamic list --state-file /var/lib/routerd/routerd.db
If events are delivered but subscription runs show failures, the issue is in the plugin or its inputs. If subscription runs succeed but no DynamicConfigPart appears, check the plugin output format.
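The three-step check amounts to a short decision chain. The sketch below assumes the operator (or a wrapper script) has already reduced each step to a boolean; the function name and the returned messages are illustrative only.

```python
def triage(delivered: bool, runs_succeeded: bool, parts_present: bool) -> str:
    """Point at the first pipeline stage that looks unhealthy."""
    if not delivered:
        return "delivery: check the outbox, peer endpoint, and eventd logs"
    if not runs_succeeded:
        return "subscription: check the plugin and its inputs"
    if not parts_present:
        return "plugin output: check the plugin output format"
    return "ok"


print(triage(delivered=True, runs_succeeded=False, parts_present=False))
```

Stages are evaluated in pipeline order, so a failure upstream is reported before any downstream symptom it causes.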
OpenTelemetry metrics reference
routerd-eventd emits the following metrics when an OTLP endpoint is configured
(see Send telemetry to an OTLP collector):
Outbox delivery metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| routerd_eventd_outbox_delivery_total | counter | group, peer, event_type, status | Delivery results (delivered/failed) |
| routerd_eventd_outbox_delivery_attempts_total | counter | group, peer, event_type, status | Total push attempts |
| routerd_eventd_outbox_delivery_lag_seconds | histogram | group, peer, event_type | Time between event observation and delivery |
| routerd_eventd_outbox_repush_total | counter | group, peer, event_type | Re-pushes after TTL refresh |
| routerd_eventd_outbox_stale_ttl_delivery_total | counter | group, peer, event_type | Stale TTL detections |
Outbox loop health metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| routerd_eventd_outbox_tick_total | counter | group | Outbox loop iterations |
| routerd_eventd_outbox_tick_errors_total | counter | group | Failed outbox iterations |
| routerd_eventd_outbox_tick_duration_seconds | histogram | group | Time per outbox iteration |
Receiver metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| routerd_eventd_receiver_accepted_total | counter | group | Events accepted |
| routerd_eventd_receiver_duplicate_total | counter | group | Duplicate events received |
| routerd_eventd_receiver_reject_total | counter | group, reason | Events rejected (reason: bad_timestamp, stale_timestamp, bad_signature, bad_body, validation, read_body, store_error) |
Pruner metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| routerd_eventd_pruner_tick_total | counter | group | Pruner loop iterations |
| routerd_eventd_pruner_pruned_total | counter | group | Events pruned by retention |
| routerd_eventd_pruner_tick_errors_total | counter | group | Failed pruner iterations |
Label policy
Allowed labels: group, peer, event_type, status, reason. High-cardinality
values (event_id, subject, address, dedupe_key, endpoint, raw error
strings) are never used as metric labels.
Alerting thresholds
Recommended Prometheus alerting rules:
# Outbox loop stalled (no tick in 5 minutes)
rate(routerd_eventd_outbox_tick_total[5m]) == 0
# High delivery failure rate
rate(routerd_eventd_outbox_delivery_total{status="failed"}[5m]) > 0
# Outbox loop errors
rate(routerd_eventd_outbox_tick_errors_total[5m]) > 0
# Receiver rejecting events
rate(routerd_eventd_receiver_reject_total[5m]) > 0.1
# Pruner errors
rate(routerd_eventd_pruner_tick_errors_total[5m]) > 0
# Delivery lag p99 above 60s
histogram_quantile(0.99, rate(routerd_eventd_outbox_delivery_lag_seconds_bucket[5m])) > 60