Monitoring¶

Hybernate ships with Prometheus metrics, Grafana dashboards, and alerting rules.

Prometheus¶

ServiceMonitor¶

The project includes a ServiceMonitor in config/prometheus/ for automatic Prometheus scraping:

kubectl apply -f config/prometheus/monitor.yaml

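This configures Prometheus to scrape the operator's metrics endpoint. ServiceMonitor is a Prometheus Operator custom resource, so the Prometheus Operator CRDs must already be installed in the cluster.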
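For orientation, a ServiceMonitor generally has the shape sketched below. This is not a copy of the shipped config/prometheus/monitor.yaml, which remains authoritative: the resource name, the selector label, the port name, and the `release` label are all assumptions made for illustration.

```yaml
# Illustrative sketch only; compare against config/prometheus/monitor.yaml.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hybernate-controller-manager-metrics   # hypothetical name
  namespace: hybernate-system
  labels:
    # Assumption: some Prometheus installations only pick up ServiceMonitors
    # carrying a specific label; adjust or drop this to match yours.
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      control-plane: controller-manager        # assumed label on the metrics Service
  endpoints:
    - port: metrics                            # assumed name of the Service port
      path: /metrics
```

If the targets do not appear in Prometheus after applying the manifest, a mismatch between these labels and the Prometheus resource's `serviceMonitorSelector` is a common cause worth checking first.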

Key Metrics to Watch¶

Cluster health dashboard:

# Are workloads being managed?
hybernate_workloads_total

# Estimated savings (only realized if a cluster autoscaler is running)
hybernate_cost_estimated_savings_dollars

# Are there errors?
rate(hybernate_reconcile_errors_total[5m]) > 0

Per-workload health:

# Is the prediction engine learning?
hybernate_prediction_confidence_percent{season="daily"}

# Are workloads cycling too fast?
rate(hybernate_lifecycle_transitions_total[1h])

# Are scale-downs being blocked?
rate(hybernate_scale_guard_blocked_total[1h])
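If dashboards evaluate these rate queries often, they could be precomputed as Prometheus recording rules. The sketch below is one possible layout, not something shipped with the project: the group name and the `record:` names are hypothetical, and the `sum(...)` aggregation assumes a cluster-wide total is what you want.

```yaml
# Hypothetical recording rules built from the queries above.
groups:
  - name: hybernate-recording        # hypothetical group name
    interval: 1m
    rules:
      - record: hybernate:reconcile_errors:rate5m
        expr: sum(rate(hybernate_reconcile_errors_total[5m]))
      - record: hybernate:lifecycle_transitions:rate1h
        expr: sum(rate(hybernate_lifecycle_transitions_total[1h]))
      - record: hybernate:scale_guard_blocked:rate1h
        expr: sum(rate(hybernate_scale_guard_blocked_total[1h]))
```

With the Prometheus Operator, the same `groups:` content would sit under `spec:` of a PrometheusRule resource rather than in a standalone rule file.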

Grafana Dashboards¶

Pre-built dashboards are available in config/grafana/:

Hybernate Overview: cluster-wide workload counts, cost savings, phase distribution
Workload Detail: per-workload prediction confidence, scaling history, idle detection state

Import them via Grafana's dashboard import feature or deploy them as ConfigMaps if using the Grafana sidecar.
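For the sidecar route, a dashboard ConfigMap might look like the following sketch. The ConfigMap name, the namespace, and the data key are hypothetical, and the `grafana_dashboard` label is only the default used by the Grafana Helm chart's sidecar; check your own sidecar configuration for the label it actually watches.

```yaml
# Sketch of a dashboard ConfigMap for the Grafana sidecar.
apiVersion: v1
kind: ConfigMap
metadata:
  name: hybernate-overview-dashboard   # hypothetical name
  namespace: monitoring                # assumption: namespace watched by the sidecar
  labels:
    grafana_dashboard: "1"             # assumed sidecar label; verify in your setup
data:
  hybernate-overview.json: |
    <contents of the Hybernate Overview dashboard JSON from config/grafana/>
```

The placeholder line must be replaced with the full dashboard JSON; one ConfigMap per dashboard keeps each object well under the ConfigMap size limit.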

Alerting Rules¶

Sample alerting rules are provided in config/prometheus/.

Suggested Alerts¶

Alert	Condition	Severity
HybernateReconcileErrors	`rate(hybernate_reconcile_errors_total[5m]) > 0`	warning
HybernatePredictionLowConfidence	`hybernate_prediction_confidence_percent{season="daily"} < 50` for 1h	info
HybernateTargetUnavailable	`increase(hybernate_target_unavailable_total[10m]) > 0`	warning
HybernatePVCRetentionExpiring	`hybernate_pvc_retention_remaining_seconds < 3600`	warning
HybernateRegimeChange	`increase(hybernate_prediction_regime_changes_total[1h]) > 0`	info

Example Alert Rule¶

alerts.yaml
groups:
  - name: hybernate
    rules:
      - alert: HybernateReconcileErrors
        expr: rate(hybernate_reconcile_errors_total[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Hybernate reconciliation errors detected"
          description: "Controller {{ $labels.controller }} is experiencing reconcile errors."
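The remaining alerts from the Suggested Alerts table could be written in the same form. The expressions, the severities, and the 1h hold for low confidence come from that table; the `summary` texts are illustrative wording, and no `for:` is set where the table does not give one.

```yaml
# Sketch: the other suggested alerts, to be merged into the same rules list
# as HybernateReconcileErrors (group names must be unique within a file).
groups:
  - name: hybernate
    rules:
      - alert: HybernatePredictionLowConfidence
        expr: hybernate_prediction_confidence_percent{season="daily"} < 50
        for: 1h
        labels:
          severity: info
        annotations:
          summary: "Daily prediction confidence has stayed below 50% for 1h"
      - alert: HybernateTargetUnavailable
        expr: increase(hybernate_target_unavailable_total[10m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "A Hybernate target was reported unavailable in the last 10m"
      - alert: HybernatePVCRetentionExpiring
        expr: hybernate_pvc_retention_remaining_seconds < 3600
        labels:
          severity: warning
        annotations:
          summary: "PVC retention expires in less than one hour"
      - alert: HybernateRegimeChange
        expr: increase(hybernate_prediction_regime_changes_total[1h]) > 0
        labels:
          severity: info
        annotations:
          summary: "Prediction engine detected a regime change in the last hour"
```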

Health Checks¶

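The operator exposes liveness and readiness endpoints on port 8081. The commands below assume that port is reachable on localhost, for example when the operator runs locally or the port is forwarded from the pod: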

# Liveness
curl http://localhost:8081/healthz

# Readiness
curl http://localhost:8081/readyz

The default manifests already wire these endpoints into the Deployment's liveness and readiness probes. If you maintain your own manifests, configure the probes there.
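For custom manifests, the probe section of the container spec could look roughly like this. The paths and port match the endpoints above; the container name and the timing values are assumptions, so compare them with the default manifests before relying on them.

```yaml
# Fragment of a Deployment's pod template (spec.template.spec).
containers:
  - name: manager                # hypothetical container name
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8081
      initialDelaySeconds: 15    # illustrative timing
      periodSeconds: 20
    readinessProbe:
      httpGet:
        path: /readyz
        port: 8081
      initialDelaySeconds: 5     # illustrative timing
      periodSeconds: 10
```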

Operator Logs¶

For debugging, stream the operator logs:

kubectl logs -n hybernate-system deployment/hybernate-controller-manager -f

Key log entries to watch for:

"pool scaled": scaling events with before/after counts
"phase transition": lifecycle state changes
"idle confirmed": idle detection results
"regime change": prediction engine pattern shifts
"drift detected": external replica changes