Monitoring
Hybernate ships with Prometheus metrics, Grafana dashboards, and alerting rules.
Prometheus
ServiceMonitor
The project includes a ServiceMonitor in config/prometheus/ that configures Prometheus to scrape the operator's metrics endpoint automatically.
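For orientation, a ServiceMonitor of this kind might look roughly like the sketch below. It is not the manifest shipped in config/prometheus/: the resource name, namespace, labels, and port name are all assumptions chosen for illustration, so check them against the actual file and your metrics Service before applying anything.

```yaml
# Illustrative sketch only; every name and label here is an assumption.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hybernate-metrics          # hypothetical name
  namespace: hybernate-system      # hypothetical namespace
  labels:
    release: prometheus            # only needed if your Prometheus selects ServiceMonitors by label
spec:
  selector:
    matchLabels:
      control-plane: controller-manager   # must match the labels on the operator's metrics Service (assumed)
  endpoints:
    - port: metrics                # name of the Service port exposing metrics (assumed)
      path: /metrics
      interval: 30s                # scrape interval; tune to taste
```

If no targets appear in Prometheus, the usual culprits are a selector that does not match the Service labels or a Prometheus instance that does not select this ServiceMonitor's namespace or labels.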
Key Metrics to Watch
Cluster health dashboard:
```promql
# Are workloads being managed?
hybernate_workloads_total
# Estimated savings (requires cluster autoscaler for realization)
hybernate_cost_estimated_savings_dollars
# Are there errors?
rate(hybernate_reconcile_errors_total[5m]) > 0
```
Per-workload health:
```promql
# Is the prediction engine learning?
hybernate_prediction_confidence_percent{season="daily"}
# Are workloads cycling too fast?
rate(hybernate_lifecycle_transitions_total[1h])
# Are scale-downs being blocked?
rate(hybernate_scale_guard_blocked_total[1h])
```
Grafana Dashboards
Pre-built dashboards are available in config/grafana/:
- Hybernate Overview: cluster-wide workload counts, cost savings, phase distribution
- Workload Detail: per-workload prediction confidence, scaling history, idle detection state
Import them via Grafana's dashboard import feature or deploy them as ConfigMaps if using the Grafana sidecar.
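As a rough sketch of the ConfigMap route: the Grafana sidecar watches for ConfigMaps carrying a specific label and loads any dashboard JSON they contain. The label key and value below (`grafana_dashboard: "1"`) are a common default in Helm-based installs but depend on your sidecar configuration, and the ConfigMap name, namespace, and file name are hypothetical.

```yaml
# Illustrative sketch; the name, namespace, label, and file name are assumptions.
apiVersion: v1
kind: ConfigMap
metadata:
  name: hybernate-overview-dashboard   # hypothetical name
  namespace: monitoring                # namespace watched by the Grafana sidecar (assumed)
  labels:
    grafana_dashboard: "1"             # must match the sidecar's configured label and value
data:
  # Replace this placeholder with the dashboard JSON from config/grafana/.
  hybernate-overview.json: |
    { "title": "Hybernate Overview", "panels": [] }
```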
Alerting Rules
Sample alerting rules are provided in config/prometheus/.
Suggested Alerts
| Alert | Condition | Severity |
|---|---|---|
| HybernateReconcileErrors | `rate(hybernate_reconcile_errors_total[5m]) > 0` | warning |
| HybernatePredictionLowConfidence | `hybernate_prediction_confidence_percent{season="daily"} < 50` for 1h | info |
| HybernateTargetUnavailable | `increase(hybernate_target_unavailable_total[10m]) > 0` | warning |
| HybernatePVCRetentionExpiring | `hybernate_pvc_retention_remaining_seconds < 3600` | warning |
| HybernateRegimeChange | `increase(hybernate_prediction_regime_changes_total[1h]) > 0` | info |
Example Alert Rule
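A minimal sketch of how two of the suggested alerts could be expressed as a Prometheus Operator PrometheusRule. The alert names, expressions, severities, and the 1h hold period come from the table of suggested alerts; the resource name, namespace, group name, and annotation text are assumptions, and the sample rules in config/prometheus/ may be structured differently.

```yaml
# Illustrative sketch; metadata, group name, and annotations are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: hybernate-alerts           # hypothetical name
  namespace: hybernate-system      # hypothetical namespace
spec:
  groups:
    - name: hybernate.rules        # hypothetical group name
      rules:
        - alert: HybernateReconcileErrors
          expr: 'rate(hybernate_reconcile_errors_total[5m]) > 0'
          labels:
            severity: warning
          annotations:
            summary: Hybernate reconciliation is returning errors
        - alert: HybernatePredictionLowConfidence
          expr: 'hybernate_prediction_confidence_percent{season="daily"} < 50'
          for: 1h                  # fire only after confidence stays low for an hour
          labels:
            severity: info
          annotations:
            summary: Daily prediction confidence has been below 50% for an hour
```

The first rule has no `for` clause, so it fires as soon as any error rate is observed; adding a short hold period is a reasonable choice if transient errors are noisy in your cluster.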
Health Checks
The operator exposes health endpoints for liveness and readiness checks.
Point your Deployment's liveness and readiness probes at these endpoints (the default manifests already do this).
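If you maintain your own Deployment, the probe wiring might look like the fragment below. The paths `/healthz` and `/readyz` and port `8081` are the conventional defaults for operators scaffolded with controller-runtime; they are assumptions here, so confirm them against the default manifests before relying on them.

```yaml
# Container spec fragment; paths, port, and timings are assumed, not taken from the Hybernate manifests.
livenessProbe:
  httpGet:
    path: /healthz        # assumed liveness endpoint
    port: 8081            # assumed health probe port
  initialDelaySeconds: 15
  periodSeconds: 20
readinessProbe:
  httpGet:
    path: /readyz         # assumed readiness endpoint
    port: 8081
  initialDelaySeconds: 5
  periodSeconds: 10
```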
Operator Logs
For debugging, check the operator logs.
Key log entries to watch for:
- "pool scaled": scaling events with before/after counts
- "phase transition": lifecycle state changes
- "idle confirmed": idle detection results
- "regime change": prediction engine pattern shifts
- "drift detected": external replica changes