Federation delivery observability

Event Federation (ADR 0006, Phase 2) pushes events between routerd nodes via the outbox. This guide explains how to verify that deliveries are healthy, spot problems, and act on them using routerctl.

Quick check

routerctl doctor federation --state-file /var/lib/routerd/routerd.db

A healthy system reports PASS for every check that applies; a SKIP row means the check had nothing to evaluate and is not a problem:

DOCTOR PASS pass=7 warn=0 fail=0 skip=0
AREA STATUS CHECK DETAIL REMEDY
federation PASS cloudedge/leaf-az failed deliveries no failed deliveries
federation PASS cloudedge/leaf-az pending deliveries no pending deliveries
federation PASS cloudedge/leaf-az stale TTL no stale TTL deliveries
federation PASS cloudedge/leaf-az delivery lag max delivery lag 2s
federation PASS cloudedge/leaf-az event expiry nearest event expires in 1740s
federation PASS cloudedge/leaf-az expected delivery all 3 active event(s) have delivery rows
federation SKIP expected peers no self-emitted active events to deliver

Commands

Delivery summary

Per-(group, peer) aggregate of all active delivery rows:

routerctl federation deliveries summary \
--group cloudedge \
--state-file /var/lib/routerd/routerd.db

Output:

GROUP PEER EVENTS DELIVERED STALE_TTL FAILED PENDING MAX_LAG MIN_EXPIRES_IN
cloudedge leaf-az 3 3 0 0 0 2s 29m0s
cloudedge leaf-oci 3 2 0 1 0 4s 29m0s

Add -o json or -o yaml for machine-readable output. Use --include-expired to include events whose TTL has already passed.

Doctor federation

routerctl doctor federation runs two categories of checks:

  1. Recorded-delivery checks — examine existing delivery rows per (group, peer) for failures, pending expiry, stale TTL, and delivery lag.
  2. Expected-peer audit — derive the expected peer set from EventGroup and EventPeer resources in the startup config, then verify that every self-emitted active event has a delivery row for each expected peer.

Pass the startup config so that both categories are evaluated:

routerctl doctor federation \
--state-file /var/lib/routerd/routerd.db \
--config /usr/local/etc/routerd/router.yaml

--config is needed for expected-peer checks because the audit reads EventGroup (for the self node name) and EventPeer (for declared peers) from the startup config.

Reading the summary table

| Column | Meaning |
| --- | --- |
| GROUP | EventGroup name (e.g. cloudedge) |
| PEER | Target peer node name |
| EVENTS | Total active event-delivery pairs |
| DELIVERED | Successfully pushed to peer |
| STALE_TTL | Delivered, but the event TTL was refreshed since delivery (event.expires_at > delivery.event_expires_at); the peer holds a stale copy |
| FAILED | All retry attempts exhausted |
| PENDING | Enqueued but not yet delivered |
| MAX_LAG | Worst-case time between event observation and delivery |
| MIN_EXPIRES_IN | Time until the soonest event expires; negative means already expired |

Healthy state: DELIVERED == EVENTS, FAILED == 0, PENDING == 0, STALE_TTL == 0.
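
The healthy-state rule is easy to check mechanically. The Python sketch below parses the plain-text summary table and lists the rows that break the rule. It assumes the column order of the sample output above and that no cell contains whitespace; the function name unhealthy_rows is illustrative and not part of routerctl.

```python
# Sample text copied from the summary output shown earlier in this guide.
SAMPLE = """\
GROUP PEER EVENTS DELIVERED STALE_TTL FAILED PENDING MAX_LAG MIN_EXPIRES_IN
cloudedge leaf-az 3 3 0 0 0 2s 29m0s
cloudedge leaf-oci 3 2 0 1 0 4s 29m0s
"""


def unhealthy_rows(table_text):
    """Return (group, peer, reasons) for rows that violate the healthy-state rule."""
    lines = [ln.split() for ln in table_text.splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    problems = []
    for cells in rows:
        row = dict(zip(header, cells))
        reasons = []
        # Healthy state: DELIVERED == EVENTS and the three problem counters are 0.
        if int(row["DELIVERED"]) != int(row["EVENTS"]):
            reasons.append("DELIVERED != EVENTS")
        for col in ("FAILED", "PENDING", "STALE_TTL"):
            if int(row[col]) != 0:
                reasons.append(f"{col} != 0")
        if reasons:
            problems.append((row["GROUP"], row["PEER"], reasons))
    return problems


print(unhealthy_rows(SAMPLE))
# [('cloudedge', 'leaf-oci', ['DELIVERED != EVENTS', 'FAILED != 0'])]
```

For scripts, the -o json output is the sturdier input; the text parser here only mirrors the table a human reads.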

FederationSLO

You can declare a FederationSLO resource to override the default doctor thresholds for a specific EventGroup:

apiVersion: federation.routerd.net/v1alpha1
kind: FederationSLO
metadata:
  name: cloudedge-slo
spec:
  groupRef: cloudedge
  delivery:
    lagWarnSeconds: 60       # default: 60
    lagFailSeconds: 180      # default: 180
    expiresSoonSeconds: 120  # default: 120
  subscription:
    maxPendingRuns: 0        # default: 0 (any pending triggers warn)
    maxFailedRuns: 0         # default: 0 (any failure triggers fail)

When a FederationSLO is present, routerctl doctor federation uses its thresholds instead of the hardcoded defaults. The JSON output includes an slo object showing the effective thresholds and any violations.

Without a FederationSLO resource, the default thresholds apply unchanged.

Validation rules:

  • spec.groupRef is required and must reference an existing EventGroup resource by metadata.name.
  • Only one FederationSLO per EventGroup is allowed; duplicate groupRef values are rejected at config validation time.
  • spec.delivery.lagWarnSeconds must be strictly less than lagFailSeconds after applying defaults (0 falls back to the default). For example, lagWarnSeconds: 200 with lagFailSeconds: 0 (default 180) is rejected because effective 200 >= 180.
  • Zero values (0) mean "use the default threshold"; negative values are rejected.
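
The threshold rules above can be sketched in a few lines of Python. This is an illustration of the documented behaviour, not routerd's validator: the function name is hypothetical, and treating a missing key the same as 0 is an assumption.

```python
# Defaults as documented for spec.delivery.
DEFAULTS = {"lagWarnSeconds": 60, "lagFailSeconds": 180, "expiresSoonSeconds": 120}


def effective_delivery_thresholds(delivery):
    """Apply defaults to a spec.delivery mapping and enforce the validation rules."""
    effective = {}
    for key, default in DEFAULTS.items():
        value = delivery.get(key, 0)  # assumption: an absent key behaves like 0
        if value < 0:
            raise ValueError(f"{key} must not be negative (got {value})")
        # Zero means "use the default threshold".
        effective[key] = default if value == 0 else value
    # The comparison runs on effective values, after defaults are applied.
    if effective["lagWarnSeconds"] >= effective["lagFailSeconds"]:
        raise ValueError(
            "lagWarnSeconds must be strictly less than lagFailSeconds "
            f"(effective {effective['lagWarnSeconds']} >= {effective['lagFailSeconds']})"
        )
    return effective


print(effective_delivery_thresholds({"lagWarnSeconds": 30}))
try:
    effective_delivery_thresholds({"lagWarnSeconds": 200, "lagFailSeconds": 0})
except ValueError as err:
    print("rejected:", err)
```

The second call reproduces the rejected example from the rule list: 200 is compared against the defaulted 180, not against the literal 0.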

Remediation plan

Generate a plan of suggested actions without executing them:

routerctl doctor federation --remediation-plan -o json \
--state-file /var/lib/routerd/routerd.db \
--config /usr/local/etc/routerd/router.yaml

The plan includes structured actions with action, reason, targetPeer, targetGroup, safe, and requiresOperatorApproval fields:

{
  "remediationPlan": {
    "generatedAt": "2026-06-18T01:00:00Z",
    "actions": [
      {
        "action": "retry-failed-deliveries",
        "reason": "1 of 1 delivery(s) failed",
        "targetPeer": "leaf-az",
        "targetGroup": "cloudedge",
        "safe": true,
        "requiresOperatorApproval": false
      }
    ]
  }
}

The --remediation-plan flag is opt-in; without it, no remediationPlan key appears in the JSON output. The plan is read-only — it never mutates state.

Field meanings:

  • safe: the suggested action is idempotent and can be retried without risk.
  • requiresOperatorApproval: the action changes configuration and requires human review before execution.
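
A consumer of the plan will usually want to separate actions that can be retried freely from those that need a human. The Python sketch below does that split using only the documented fields. The first sample action is copied from the example plan; the second is an illustrative entry built from the action reference table, and the function name split_actions is hypothetical.

```python
import json

PLAN_JSON = """
{
  "remediationPlan": {
    "generatedAt": "2026-06-18T01:00:00Z",
    "actions": [
      {"action": "retry-failed-deliveries", "reason": "1 of 1 delivery(s) failed",
       "targetPeer": "leaf-az", "targetGroup": "cloudedge",
       "safe": true, "requiresOperatorApproval": false},
      {"action": "configure-peer-endpoint", "reason": "EventPeer endpoint is empty",
       "targetPeer": "leaf-oci", "targetGroup": "cloudedge",
       "safe": false, "requiresOperatorApproval": true}
    ]
  }
}
"""


def split_actions(doc):
    """Split plan actions into (retryable, needs_review) lists."""
    retryable, needs_review = [], []
    # The remediationPlan key is absent unless --remediation-plan was passed.
    for action in doc.get("remediationPlan", {}).get("actions", []):
        if action["safe"] and not action["requiresOperatorApproval"]:
            retryable.append(action)
        else:
            needs_review.append(action)
    return retryable, needs_review


retryable, needs_review = split_actions(json.loads(PLAN_JSON))
print([a["action"] for a in retryable])     # ['retry-failed-deliveries']
print([a["action"] for a in needs_review])  # ['configure-peer-endpoint']
```

The plan itself never mutates state, so anything in either list still has to be carried out by an operator or separate tooling.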

Remediation action reference

| Action | Trigger | safe | requiresOperatorApproval |
| --- | --- | --- | --- |
| retry-failed-deliveries | Failed delivery rows exist | true | false |
| investigate-pending-deliveries | Pending deliveries with possible expiry risk | true | false |
| force-repush-stale-ttl | Stale TTL detected (event TTL refreshed since last push) | true | false |
| check-peer-connectivity | Delivery lag exceeds SLO threshold | true | false |
| configure-peer-endpoint | EventPeer endpoint is empty | false | true |
| investigate-missing-delivery-rows | Expected peer has no delivery rows | true | false |
| inspect-failed-subscription-runs | Subscription runs in failed status | true | false |

Doctor check reference

Recorded-delivery checks

These run per (group, peer) pair that has delivery rows in the state store. Thresholds are derived from FederationSLO when present, otherwise defaults apply.

| Check | FAIL | WARN | PASS |
| --- | --- | --- | --- |
| failed deliveries | Any delivery row in failed status | | FAILED == 0 |
| pending deliveries | Pending events whose TTL expires within expiresSoonSeconds | Pending exists but no imminent expiry | PENDING == 0 |
| stale TTL | All delivered events have stale TTL (STALE_TTL == DELIVERED) | Some deliveries have stale TTL | STALE_TTL == 0 |
| delivery lag | Max lag >= lagFailSeconds (default 180 s) | Max lag >= lagWarnSeconds (default 60 s) | Below lagWarnSeconds |
| event expiry | Nearest expiry < expiresSoonSeconds (default 120 s) with pending/failed deliveries | Nearest expiry < expiresSoonSeconds, all delivered | Comfortable margin |
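
As a worked illustration of two rows of this table, the Python sketch below classifies delivery lag and stale TTL counts. The function names are hypothetical and the logic is a reading of the table, not the doctor's actual source.

```python
def classify_lag(max_lag_seconds, lag_warn=60, lag_fail=180):
    """Map the worst-case delivery lag to a doctor-style status."""
    if max_lag_seconds >= lag_fail:
        return "FAIL"
    if max_lag_seconds >= lag_warn:
        return "WARN"
    return "PASS"


def classify_stale_ttl(stale, delivered):
    """FAIL when every delivered event is stale, WARN when only some are."""
    if stale == 0:
        return "PASS"
    if stale == delivered:
        return "FAIL"
    return "WARN"


print(classify_lag(2), classify_lag(60), classify_lag(180))  # PASS WARN FAIL
print(classify_stale_ttl(1, 3), classify_stale_ttl(3, 3))    # WARN FAIL
```

Both boundaries are inclusive: a lag of exactly lagWarnSeconds already warns, and exactly lagFailSeconds already fails.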

Expected-peer audit

These check that the config-declared peers actually have delivery rows.

| Check | FAIL | SKIP | PASS |
| --- | --- | --- | --- |
| expected delivery | Self-emitted active events exist but no delivery row for this peer | No self-emitted events in the group | All active events have delivery rows |
| empty endpoint | EventPeer.spec.endpoint is empty | | |

The audit excludes the self node: if EventGroup.spec.nodeName matches EventPeer.spec.nodeName, that peer is skipped (a node does not push events to itself).
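
The audit logic can be sketched as follows. This is a simplified Python model under stated assumptions: events are dicts with a hypothetical id key plus the sourceNode key seen in the JSON output, and delivery rows are modelled as a set of (event id, peer) pairs rather than real state-store rows.

```python
def missing_deliveries(self_node, peers, active_events, delivery_rows):
    """Return {peer: [event ids]} for self-emitted events that lack a delivery row."""
    # A node does not push events to itself, so drop the self node.
    expected_peers = [p for p in peers if p != self_node]
    # Only self-emitted events are pushed by the outbox.
    self_emitted = [e for e in active_events if e["sourceNode"] == self_node]
    missing = {}
    for peer in expected_peers:
        gaps = [e["id"] for e in self_emitted if (e["id"], peer) not in delivery_rows]
        if gaps:
            missing[peer] = gaps
    return missing


events = [
    {"id": "evt-abc", "sourceNode": "hub"},
    {"id": "evt-def", "sourceNode": "hub"},
    {"id": "evt-ghi", "sourceNode": "hub"},
    {"id": "evt-xyz", "sourceNode": "leaf-az"},  # received, not self-emitted
]
rows = {
    ("evt-abc", "leaf-az"), ("evt-def", "leaf-az"), ("evt-ghi", "leaf-az"),
    ("evt-ghi", "leaf-oci"),
}
print(missing_deliveries("hub", ["hub", "leaf-az", "leaf-oci"], events, rows))
# {'leaf-oci': ['evt-abc', 'evt-def']}
```

The node names hub, leaf-az, and leaf-oci are sample values; an empty result corresponds to the PASS column of the table.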

Common failures and remedies

EventPeer endpoint not set

FAIL cloudedge/leaf-oci expected delivery EventPeer endpoint is empty
set spec.endpoint on EventPeer/leaf-oci for group cloudedge

The EventPeer resource declares the peer but has no endpoint URL. The outbox cannot push events without a target. Set spec.endpoint to the peer's federation listener (e.g. https://10.252.0.3:8443/v1/federation/events).

Missing delivery rows for expected peer

FAIL cloudedge/leaf-oci expected delivery 2 of 3 active event(s) have no delivery row: evt-abc, evt-def
outbox never enqueued delivery for this peer; check EventPeer config and outbox peer filter

The outbox creates delivery rows only for events where SourceNode matches the local EventGroup.spec.nodeName. If the self node name in the EventGroup does not match the source_node column in federation_events, no delivery is enqueued. Verify:

routerctl federation event list --group cloudedge --state-file /var/lib/routerd/routerd.db

Check the SOURCE column matches the EventGroup nodeName. Also confirm that the outbox controller is running (eventd daemon) and that no types / subjectPrefixes filter on the EventPeer silently excludes the events.

HMAC authentication mismatch

The outbox push returns HTTP 403 or 401. Check journalctl -u routerd-eventd for authentication errors. Verify that both ends share the same EventGroup.spec.auth.hmacSecretRef and that the referenced Secret exists and contains the correct key.

Outbox not running

FAIL cloudedge/leaf-az pending deliveries 3 pending; 2 event(s) expire within 120s without delivery
outbox may be stalled or peer unreachable; check eventd logs and peer endpoint

Delivery rows were created but nothing was pushed. Confirm routerd-eventd is running:

systemctl status routerd-eventd
journalctl -u routerd-eventd --since "10 min ago"

Stale TTL after refresh

WARN cloudedge/leaf-az stale TTL 1 of 3 delivered event(s) have stale TTL (event.expiresAt > delivery.eventExpiresAt)
outbox should re-push refreshed events on next tick; if this persists, check outbox interval and delivery filtering

An event's TTL was extended (e.g. by re-emitting with --ttl) but the delivery record still holds the old event_expires_at. The outbox should detect this on its next tick and re-push (PR #531). If the warning persists, check the outbox interval and that the re-push logic is working:

routerctl federation event deliveries \
--group cloudedge --peer leaf-az \
--state-file /var/lib/routerd/routerd.db

Compare EVENT_EXPIRES_AT in the delivery row against the event's EXPIRES column in routerctl federation event list.
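
The comparison itself is a single inequality on two timestamps. The Python sketch below assumes both values are RFC 3339 strings in UTC, like the timestamps elsewhere in routerctl output; the helper names are illustrative.

```python
from datetime import datetime


def parse_ts(value):
    """Parse an RFC 3339 timestamp; older Pythons need 'Z' spelled as +00:00."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def is_stale(event_expires_at, delivery_event_expires_at):
    """True when the event's TTL was refreshed after the recorded delivery."""
    return parse_ts(event_expires_at) > parse_ts(delivery_event_expires_at)


print(is_stale("2026-06-17T12:30:00Z", "2026-06-17T12:00:00Z"))  # True
print(is_stale("2026-06-17T12:00:00Z", "2026-06-17T12:00:00Z"))  # False
```

Equal timestamps are not stale: the peer already holds the current expiry.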

SourceNode does not match EventGroup nodeName

Events emitted with a --source-node that differs from the local EventGroup.spec.nodeName are not pushed by the outbox (the outbox only pushes self-emitted events). The expected-peer audit will show these events as missing delivery rows. Verify the source node matches:

routerctl federation event list --group cloudedge -o json \
--state-file /var/lib/routerd/routerd.db | jq '.[].sourceNode'

SAMSubnetPolicy delivery verification

For CloudEdge SAM deployments, routerd.mobility.shard.assigned events carry the shard assignment from the hub to leaf nodes. Verify delivery:

# Hub node: check deliveries to all leaf peers
routerctl federation deliveries summary --group cloudedge \
--state-file /var/lib/routerd/routerd.db

# Hub node: doctor check including expected-peer audit
routerctl doctor federation \
--state-file /var/lib/routerd/routerd.db \
--config /usr/local/etc/routerd/router.yaml

# Leaf node: confirm the event was received and the subscription processed it
routerctl federation event list --group cloudedge \
--state-file /var/lib/routerd/routerd.db
routerctl dynamic list --state-file /var/lib/routerd/routerd.db

On the leaf, routerctl dynamic list should show a DynamicConfigPart with provenance routerd.net/event-group: cloudedge. If the event is delivered but no dynamic config appears, the issue is on the subscription/plugin side, not delivery — see Event Federation Subscription.

JSON output for automation

Both commands support JSON output for integration with monitoring or scripts:

# Summary as JSON
routerctl federation deliveries summary --group cloudedge -o json \
--state-file /var/lib/routerd/routerd.db

# Doctor as JSON
routerctl doctor federation -o json \
--state-file /var/lib/routerd/routerd.db \
--config /usr/local/etc/routerd/router.yaml

The doctor JSON includes summary.overall ("pass", "warn", or "fail"), which works well as a single alerting signal. Any value other than "pass" means at least one check warrants investigation.
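
For monitoring systems that work from exit codes, summary.overall can be mapped to the common Nagios-style convention (0 OK, 1 warning, 2 critical, 3 unknown). The Python sketch below shows one such mapping; the mapping is a choice made for this example, not something routerctl prescribes.

```python
import json

# Nagios-style plugin exit codes, chosen for this example.
EXIT_CODES = {"pass": 0, "warn": 1, "fail": 2}
UNKNOWN = 3


def exit_code_for(doctor_json_text):
    """Translate doctor JSON text into a monitoring exit code."""
    try:
        doc = json.loads(doctor_json_text)
    except json.JSONDecodeError:
        return UNKNOWN
    if not isinstance(doc, dict):
        return UNKNOWN
    overall = doc.get("summary", {}).get("overall")
    return EXIT_CODES.get(overall, UNKNOWN)


print(exit_code_for('{"summary": {"overall": "warn"}}'))  # 1
print(exit_code_for("not json"))                          # 3
```

A wrapper script would feed the function the output of routerctl doctor federation -o json and pass the result to sys.exit.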

Federation summary in doctor JSON

When the federation area is checked, the doctor JSON includes a federation object with aggregated delivery statistics:

{
  "federation": {
    "severityCounts": {"pass": 7, "warn": 0, "fail": 0, "skip": 1},
    "failedDeliveryCount": 0,
    "staleTTLCount": 0,
    "pendingDeliveryCount": 0,
    "missingExpectedPeerCount": 0,
    "maxDeliveryLagSeconds": 2,
    "minExpiresInSeconds": 1740,
    "totalEvents": 3,
    "totalDelivered": 3,
    "subscriptionRunsTotal": 3,
    "subscriptionRunsSucceeded": 3,
    "subscriptionRunsFailed": 0,
    "subscriptionRunsPending": 0,
    "slo": {
      "groups": [
        {
          "group": "cloudedge",
          "defined": true,
          "thresholds": {
            "delivery": {
              "lagWarnSeconds": 60,
              "lagFailSeconds": 180,
              "expiresSoonSeconds": 120
            },
            "subscription": {
              "maxPendingRuns": 0,
              "maxFailedRuns": 0
            }
          },
          "violations": []
        }
      ]
    }
  }
}

The slo.groups array contains one entry per EventGroup (union of config and observed groups, sorted by name). Each entry has:

  • defined: whether a FederationSLO resource is configured for this group
  • thresholds.delivery: effective delivery lag/expiry thresholds
  • thresholds.subscription: effective subscription run thresholds
  • violations: threshold breaches with check name, threshold, actual value, and severity (empty [] when healthy)
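
A script can walk slo.groups to find groups in breach and groups that still run on default thresholds. The Python sketch below relies only on the group, defined, and violations keys shown in the sample JSON. The second sample group and the keys inside its violation entry are placeholders, since the exact violation field names are not listed here.

```python
def slo_report(doctor_doc):
    """Return ({group: violations}, [groups without a FederationSLO])."""
    groups = doctor_doc.get("federation", {}).get("slo", {}).get("groups", [])
    violating = {g["group"]: g["violations"] for g in groups if g.get("violations")}
    using_defaults = [g["group"] for g in groups if not g.get("defined", False)]
    return violating, using_defaults


SAMPLE = {
    "federation": {"slo": {"groups": [
        {"group": "cloudedge", "defined": True, "violations": []},
        # Hypothetical group; the violation keys below are placeholders.
        {"group": "labedge", "defined": False,
         "violations": [{"check": "delivery lag", "severity": "warn"}]},
    ]}}
}

violating, using_defaults = slo_report(SAMPLE)
print(sorted(violating))  # ['labedge']
print(using_defaults)     # ['labedge']
```

Violation entries are passed through untouched, so the sketch keeps working whatever their inner field names turn out to be.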

jq examples

# Overall status for alerting
routerctl doctor federation -o json \
--state-file /var/lib/routerd/routerd.db \
--config /usr/local/etc/routerd/router.yaml \
| jq -r '.summary.overall'

# Assert there are no failed deliveries (jq -e exits 1 when the count is > 0)
routerctl doctor federation -o json \
--state-file /var/lib/routerd/routerd.db \
--config /usr/local/etc/routerd/router.yaml \
| jq -e '.federation.failedDeliveryCount == 0'

# Missing expected peers
routerctl doctor federation -o json \
--state-file /var/lib/routerd/routerd.db \
--config /usr/local/etc/routerd/router.yaml \
| jq '.federation.missingExpectedPeerCount'

# All failing checks with remedies
routerctl doctor federation -o json \
--state-file /var/lib/routerd/routerd.db \
--config /usr/local/etc/routerd/router.yaml \
| jq '[.checks[] | select(.status == "fail") | {name, detail, remedy}]'

# Stale TTL and pending counts for monitoring
routerctl doctor federation -o json \
--state-file /var/lib/routerd/routerd.db \
--config /usr/local/etc/routerd/router.yaml \
| jq '{stale: .federation.staleTTLCount, pending: .federation.pendingDeliveryCount, lag: .federation.maxDeliveryLagSeconds}'

# Assert there are no failed subscription runs (jq -e exits 1 when the count is > 0)
routerctl doctor federation -o json \
--state-file /var/lib/routerd/routerd.db \
--config /usr/local/etc/routerd/router.yaml \
| jq -e '.federation.subscriptionRunsFailed == 0'

Subscription monitoring

When EventSubscription resources are configured, routerctl doctor federation also checks subscription processing health. Thresholds are derived from the FederationSLO for the subscription's groupRef:

| Check | FAIL | WARN | PASS |
| --- | --- | --- | --- |
| subscription runs | Failed > maxFailedRuns (default 0) | Pending > maxPendingRuns (default 0) | Within SLO thresholds |

Inspecting subscription runs

List processing records for a specific subscription:

routerctl federation subscription runs \
--subscription EventSubscription/cloud-claims \
--state-file /var/lib/routerd/routerd.db

Output:

ID SUBSCRIPTION EVENT_ID GROUP PLUGIN STATUS ATTEMPTS STARTED COMPLETED ERROR
1 EventSubscription/cloud-claims evt-abc cloudedge cloud-claims succeeded 1 2026-06-17T12:00:00Z 2026-06-17T12:00:01Z
2 EventSubscription/cloud-claims evt-def cloudedge cloud-claims failed 3 2026-06-17T12:01:00Z 2026-06-17T12:01:05Z plugin exited 1

Add -o json for machine-readable output. Failed runs indicate the plugin returned an error; check the plugin output and event payload.

End-to-end delivery verification

To verify the full pipeline from event emission to DynamicConfigPart:

# 1. Verify events are delivered
routerctl federation deliveries summary --group cloudedge \
--state-file /var/lib/routerd/routerd.db

# 2. Verify subscriptions processed events
routerctl federation subscription runs \
--subscription EventSubscription/cloud-claims \
--state-file /var/lib/routerd/routerd.db

# 3. Verify DynamicConfigParts were generated
routerctl dynamic list --state-file /var/lib/routerd/routerd.db

If events are delivered but subscription runs show failures, the issue is in the plugin or its inputs. If subscription runs succeed but no DynamicConfigPart appears, check the plugin output format.

OpenTelemetry metrics reference

routerd-eventd emits the following metrics when an OTLP endpoint is configured (see Send telemetry to an OTLP collector):

Outbox delivery metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| routerd_eventd_outbox_delivery_total | counter | group, peer, event_type, status | Delivery results (delivered/failed) |
| routerd_eventd_outbox_delivery_attempts_total | counter | group, peer, event_type, status | Total push attempts |
| routerd_eventd_outbox_delivery_lag_seconds | histogram | group, peer, event_type | Time between event observation and delivery |
| routerd_eventd_outbox_repush_total | counter | group, peer, event_type | Re-pushes after TTL refresh |
| routerd_eventd_outbox_stale_ttl_delivery_total | counter | group, peer, event_type | Stale TTL detections |

Outbox loop health metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| routerd_eventd_outbox_tick_total | counter | group | Outbox loop iterations |
| routerd_eventd_outbox_tick_errors_total | counter | group | Failed outbox iterations |
| routerd_eventd_outbox_tick_duration_seconds | histogram | group | Time per outbox iteration |

Receiver metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| routerd_eventd_receiver_accepted_total | counter | group | Events accepted |
| routerd_eventd_receiver_duplicate_total | counter | group | Duplicate events received |
| routerd_eventd_receiver_reject_total | counter | group, reason | Events rejected (reason: bad_timestamp, stale_timestamp, bad_signature, bad_body, validation, read_body, store_error) |

Pruner metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| routerd_eventd_pruner_tick_total | counter | group | Pruner loop iterations |
| routerd_eventd_pruner_pruned_total | counter | group | Events pruned by retention |
| routerd_eventd_pruner_tick_errors_total | counter | group | Failed pruner iterations |

Label policy

Allowed labels: group, peer, event_type, status, reason. High-cardinality values (event_id, subject, address, dedupe_key, endpoint, raw error strings) are never used as metric labels.
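
A label policy like this is straightforward to enforce in a test or lint step. The Python sketch below checks a proposed label set against the allowed list; it is an illustration of the policy, not code from routerd-eventd, and the function name is hypothetical.

```python
# Allowed label names, as listed in the label policy.
ALLOWED_LABELS = frozenset({"group", "peer", "event_type", "status", "reason"})


def disallowed_labels(labels):
    """Return the label names that the policy does not permit, sorted."""
    return sorted(set(labels) - ALLOWED_LABELS)


print(disallowed_labels({"group": "cloudedge", "peer": "leaf-az"}))      # []
print(disallowed_labels({"group": "cloudedge", "event_id": "evt-abc"}))  # ['event_id']
```

Checking names against an allow-list, rather than blocking known high-cardinality names, means a newly invented label is rejected by default.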

Alerting thresholds

Recommended Prometheus alerting rules:

# Outbox loop stalled (no tick in 5 minutes)
rate(routerd_eventd_outbox_tick_total[5m]) == 0

# Any delivery failures in the last 5 minutes
rate(routerd_eventd_outbox_delivery_total{status="failed"}[5m]) > 0

# Outbox loop errors
rate(routerd_eventd_outbox_tick_errors_total[5m]) > 0

# Receiver rejecting events
rate(routerd_eventd_receiver_reject_total[5m]) > 0.1

# Pruner errors
rate(routerd_eventd_pruner_tick_errors_total[5m]) > 0

# Delivery lag p99 above 60s
histogram_quantile(0.99, rate(routerd_eventd_outbox_delivery_lag_seconds_bucket[5m])) > 60