---
title: Keystone Operator Prometheus Metrics
quadrant: operator
---


# Keystone Operator Prometheus Metrics

Reference catalogue for every Prometheus collector the Keystone operator exposes on the controller-runtime metrics endpoint. Every metric name, label set, and histogram bucket list below is authoritative: the contract test `operators/keystone/internal/metrics/contract_test.go` fails the build if this document drifts from the registered collectors.

All metrics are registered via the process-wide `sync.Once` initializer `globalCollectors()` in `operators/keystone/internal/metrics/collectors.go` and attached to the controller-runtime registry (`sigs.k8s.io/controller-runtime/pkg/metrics`). They are served on the operator's metrics listener at `:8080/metrics` by default; see How to enable the Keystone operator metrics endpoint for cluster-side wiring.

## Metric summary

| Metric | Type | Labels | Purpose |
| --- | --- | --- | --- |
| `keystone_operator_reconcile_duration_seconds` | Histogram | `sub_reconciler` | Latency of each sub-reconciler invocation |
| `keystone_operator_reconcile_errors_total` | Counter | `sub_reconciler`, `condition_type` | Count of sub-reconciler failures per Ready sub-condition |
| `keystone_operator_key_rotation_age_seconds` | Gauge | `keystone`, `namespace`, `key_type` | Age of the most recent successful Fernet/credential rotation |
| `keystone_operator_db_sync_total` | Counter | `keystone`, `namespace`, `result` | Terminal db_sync Job outcomes |
| `keystone_operator_db_sync_duration_seconds` | Histogram | `keystone`, `namespace` | Wall-clock duration of terminated db_sync Jobs |

## `keystone_operator_reconcile_duration_seconds`

Wall-clock duration of a single sub-reconciler invocation in seconds.

| Attribute | Value |
| --- | --- |
| Type | Histogram |
| Labels | `sub_reconciler` |
| Buckets (seconds) | 0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30 |
| Call site | `instrumentSubReconciler` in `operators/keystone/internal/controller/instrumentation.go` |

Label cardinality. The `sub_reconciler` value is drawn from a closed set of 14 strings (the keys of `subReconcilerConditionTypes`), so the series count is bounded regardless of how many Keystone CRs exist. The metric deliberately omits `keystone`/`namespace` labels to keep cardinality fleet-independent; per-CR attribution is available via logs.

Example PromQL. p95 per sub-reconciler over the last 5 minutes:

```promql
histogram_quantile(
  0.95,
  sum by (sub_reconciler, le) (
    rate(keystone_operator_reconcile_duration_seconds_bucket[5m])
  )
)
```

## `keystone_operator_reconcile_errors_total`

Count of sub-reconciler errors, partitioned by the offending sub-reconciler and the `status.conditions[]` type it failed to drive to `True`.

| Attribute | Value |
| --- | --- |
| Type | Counter |
| Labels | `sub_reconciler`, `condition_type` |
| Call site | `instrumentSubReconciler` (on non-nil error return) |

Label cardinality. Both labels are drawn from closed sets: the 14 sub-reconciler names and the 12 Ready sub-condition types listed in `subConditionTypes`. The `sub_reconciler` → `condition_type` mapping is one-to-one for most entries; the exceptions (`Secrets`, `DBConnectionSecret`, and `Config` all map to `SecretsReady`) collapse into fewer distinct `(sub_reconciler, condition_type)` pairs in practice.

The counter is not pre-created; a (sub_reconciler, condition_type) series only appears after its first error, so a healthy fleet emits no series for this metric.

Example PromQL. Error rate per sub-reconciler over the last 5 minutes:

```promql
sum by (sub_reconciler) (
  rate(keystone_operator_reconcile_errors_total[5m])
)
```
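Because a healthy fleet emits no series for this counter, the rate query above returns an empty result rather than zero, which renders as "no data" on dashboards. A common PromQL zero-fill idiom (a sketch, not part of the contract-tested surface) is:

```promql
sum(rate(keystone_operator_reconcile_errors_total[5m])) or vector(0)
```

Note that `or vector(0)` only zero-fills the labelless aggregate; a per-`sub_reconciler` zero-fill is not possible without an existing series to join against.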

## `keystone_operator_key_rotation_age_seconds`

Age in seconds of the most recent successful key rotation for a given Keystone CR and key type, measured as `time.Since(rotation-completed-at)` at reconcile time.

| Attribute | Value |
| --- | --- |
| Type | Gauge |
| Labels | `keystone`, `namespace`, `key_type` |
| Call site | `reconcileFernetKeys` / `reconcileCredentialKeys` on every reconcile pass |

`key_type` values. Exactly `fernet` or `credential`.

Annotation source. The gauge reads the `forge.c5c3.io/rotation-completed-at` annotation, preferring the production keys Secret. `applyRotationOutput` stamps that annotation on the production Secret on every successful apply, so the timestamp is durable across the inter-rotation steady state, and the gauge value (recomputed as `time.Since(completedAt)` on every reconcile) tracks wall-clock age correctly even after the staging Secret has been deleted. If the production Secret carries no annotation yet, i.e. the very first rotation has not been applied, the lookup falls back to the staging Secret to cover the post-CronJob-PATCH, pre-apply window.

When the gauge is NOT set. Both lookups skip the gauge when the `forge.c5c3.io/rotation-completed-at` annotation is absent or malformed. A missing annotation on both Secrets means rotation has never completed (alert via PromQL `absent(...)`); a malformed annotation is a script bug surfaced via the RotationAnnotationInvalid event. Validation rejections of a staging payload are tracked via the RotationRejected event (see Keystone Controller Events) and indirectly via `keystone_operator_reconcile_errors_total{condition_type="FernetKeysReady"}`.
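The `absent(...)` alert mentioned above could be sketched as follows. Note that `absent()` fires only when no CR in the fleet reports a series for the given `key_type`, so this detects "rotation never completed anywhere" rather than a single missing CR:

```promql
absent(keystone_operator_key_rotation_age_seconds{key_type="fernet"})
  or
absent(keystone_operator_key_rotation_age_seconds{key_type="credential"})
```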

Series lifecycle. When a Keystone CR is deleted, every rotation-age series tagged with that `(keystone, namespace)` pair is dropped by `DeleteForKeystone` in `reconcileDelete`, preventing orphan series from accumulating as CRs are created and deleted.

Example PromQL. Alert if any rotation is older than 14 days:

```promql
max by (keystone, namespace, key_type) (
  keystone_operator_key_rotation_age_seconds
) > (14 * 86400)
```
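Wrapped in a Prometheus alerting rule, the expression could look like the following sketch; the group and alert names, `for` duration, and severity are illustrative, not part of any shipped rule file:

```yaml
groups:
  - name: keystone-operator.rules
    rules:
      - alert: KeystoneKeyRotationStale
        expr: |
          max by (keystone, namespace, key_type) (
            keystone_operator_key_rotation_age_seconds
          ) > (14 * 86400)
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.key_type }} rotation for {{ $labels.namespace }}/{{ $labels.keystone }} is older than 14 days"
```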

## `keystone_operator_db_sync_total`

Count of DB-related Job terminal transitions per Keystone CR, labelled by the terminal state.

| Attribute | Value |
| --- | --- |
| Type | Counter |
| Labels | `keystone`, `namespace`, `result` |
| Call site | `reconcileDatabase`, `reconcileExpand`, `reconcileMigrate`, `reconcileContract` on each Job's first terminal transition |

Covered Jobs. The counter aggregates terminal transitions from every DB-related Job the operator runs against the Keystone CR:

- `<keystone>-db-sync` — non-upgrade schema sync (`reconcileDatabase`).
- `<keystone>-schema-check` — post-db_sync drift detection.
- `<keystone>-db-expand` / `-db-migrate` / `-db-contract` — the three phases of the expand-migrate-contract upgrade flow.

This deliberate aggregation keeps the dashboard panel and failure-rate alerts populated during upgrades. The counter does NOT carry a per-phase label; use the matching DatabaseReady condition reasons (`DBSyncFailed`, `ExpandFailed`, `MigrateFailed`, `ContractFailed`, `SchemaDriftDetected`) on the CR to attribute a failure to a specific phase.

`result` values. `succeeded` (Job `Complete=True`) or `failed` (Job `Failed=True`).

Single-emit guarantee. Each phase is guarded by an independent transition predicate keyed on the Job UID and persisted via the `forge.c5c3.io/last-<phase>-job-uid` annotation on the Keystone CR; repeated reconciles observing the same terminated Job contribute at most one increment, so polling does not inflate the counter. The annotation patch is committed BEFORE the metric is recorded, so a transient apiserver failure cannot cause the next reconcile to double-count the same Job UID. On patch failure, the metric emission is deferred to the next reconcile, an Info-level log line is emitted, and a Warning event with reason DBSyncMetricEmissionDeferred is recorded on the Keystone CR, making persistent apiserver failures visible at default log verbosity and via `kubectl describe keystone`.

Series lifecycle. Deleted alongside the per-CR rotation-age gauge in `DeleteForKeystone` when the Keystone CR is removed.

Example PromQL. DB-related failure rate (any phase) over the last hour:

```promql
sum by (keystone, namespace) (
  rate(keystone_operator_db_sync_total{result="failed"}[1h])
)
```

## `keystone_operator_db_sync_duration_seconds`

Duration in seconds of terminated DB-related Jobs, measured as `condition.LastTransitionTime − Job.CreationTimestamp`.

| Attribute | Value |
| --- | --- |
| Type | Histogram |
| Labels | `keystone`, `namespace` |
| Buckets (seconds) | 1, 5, 10, 30, 60, 120, 300, 600 |
| Call site | `reconcileDatabase`, `reconcileExpand`, `reconcileMigrate`, `reconcileContract` on each Job's first terminal transition |

The histogram records one sample per terminated DB-related Job (db_sync, schema-check, db-expand, db-migrate, db-contract), regardless of result. Use `keystone_operator_db_sync_total{result="failed"}` to scope alerting to failures only; use this histogram to reason about how long a sync took when it did run.

Example PromQL. p95 db_sync duration over the last 24 hours:

```promql
histogram_quantile(
  0.95,
  sum by (keystone, namespace, le) (
    rate(keystone_operator_db_sync_duration_seconds_bucket[24h])
  )
)
```

## Enabling the ServiceMonitor

The Helm chart ships an opt-in `monitoring.coreos.com/v1` ServiceMonitor. Enable it via `values.yaml`:

```yaml
# values.yaml fragment
monitoring:
  serviceMonitor:
    enabled: true
    interval: 30s
```

The ServiceMonitor selector targets the operator Service by `keystone-operator.selectorLabels` on port `metrics`, scraping the path `/metrics` at the configured interval. Prometheus-operator CRDs (ServiceMonitor) must be installed in the cluster; the chart does not install them. For the end-to-end operator-side enablement flow, see How to enable the Keystone operator metrics endpoint.
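For reference, the rendered object is roughly the following sketch; the object name and selector labels come from the chart's `keystone-operator.selectorLabels` helper, so the concrete values shown here are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: keystone-operator                         # illustrative; chart-derived in practice
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: keystone-operator   # illustrative selector labels
  endpoints:
    - port: metrics
      path: /metrics
      interval: 30s
```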