Skip to content

Metrics Reference

All metrics are prefixed with hybernate_ and registered with the controller-runtime metrics registry. They are served on the metrics endpoint (default :8443 HTTPS).

Tier 1: Cluster Health

These metrics provide a high-level view of Hybernate's impact.

Metric Type Labels Description
hybernate_workloads_total Gauge phase Managed workloads by lifecycle phase
hybernate_active_workloads Gauge Workloads in a running state
hybernate_paused_workloads Gauge Currently paused workloads
hybernate_destroyed_workloads Gauge Destroyed workloads
hybernate_reconcile_errors_total Counter controller Reconciliation errors by controller
hybernate_lifecycle_transitions_total Counter from, to Phase transitions
hybernate_lifecycle_action_duration_seconds Histogram action Duration of lifecycle actions (pause, resume, destroy, scale)
hybernate_cost_estimated_savings_dollars Gauge Estimated monthly savings (requires autoscaler for realization)
hybernate_cost_estimated_dollars Gauge Total estimated monthly cost
hybernate_cost_estimated_without_management_dollars Gauge Estimated cost without Hybernate
hybernate_resource_reduction_cpu_millicores Gauge Total CPU millicores freed by Hybernate
hybernate_resource_reduction_memory_bytes Gauge Total memory bytes freed by Hybernate

Tier 2: Operational Insight

These metrics help you understand what the operator is doing and why.

Metric Type Labels Description
hybernate_prediction_confidence_percent Gauge season, namespace, workload Prediction accuracy by season
hybernate_prediction_phase Gauge namespace, workload Engine phase (0-4)
hybernate_prediction_data_points Gauge namespace, workload Data points collected
hybernate_prediction_anomalies_total Counter namespace, workload Anomalies detected
hybernate_cost_cpu_hours Gauge Total vCPU-hours this month
hybernate_cost_memory_hours Gauge Total GiB memory hours
hybernate_cost_storage_hours Gauge Total GiB storage hours
hybernate_scale_events_total Counter direction, namespace, workload Scale events by direction
hybernate_scale_replicas Gauge namespace, workload Current replica count
hybernate_idle_detections_total Counter action, namespace, workload Idle detections by action
hybernate_pause_expiry_actions_total Counter action Pause expiry events
hybernate_drift_detections_total Counter policy Replica drift detections

Tier 3: Debugging

These metrics help troubleshoot specific workload behavior.

Metric Type Labels Description
hybernate_idle_signal_result Gauge namespace, workload Idle signal status (1=active, 2=signals_confirm, 3=grace_period, 4=idle)
hybernate_scale_guard_blocked_total Counter namespace, workload Scale-downs blocked by guard probes
hybernate_idle_fluke_total Counter namespace, workload Signals confirmed idle but prediction disagreed
hybernate_prediction_regime_changes_total Counter namespace, workload Regime changes detected
hybernate_pvc_retention_remaining_seconds Gauge namespace, workload Seconds until PVC cleanup
hybernate_automation_skipped_total Counter namespace, workload Automation skipped (manual override active)
hybernate_dryrun_actions_total Counter action Actions that would have been taken in dry-run
hybernate_target_unavailable_total Counter namespace, workload Target workload not found

Discovery

Metric Type Labels Description
hybernate_discovery_scan_duration_seconds Histogram Scan duration
hybernate_discovery_workloads Gauge classification Discovered workloads by class
hybernate_discovery_estimated_savings_dollars Gauge Estimated savings from discoveries
hybernate_discovery_auto_managed_total Counter Auto-created ManagedWorkloads

Prediction Phase Values

The hybernate_prediction_phase gauge uses numeric values:

Value Phase
0 Observing
1 DailySuggesting
2 DailyActive
3 WeeklySuggesting
4 FullyActive

Useful PromQL Queries

# Total workloads by phase
hybernate_workloads_total

# Estimated monthly savings trend
hybernate_cost_estimated_savings_dollars

# Workloads with low prediction confidence
hybernate_prediction_confidence_percent{season="daily"} < 70

# Recent scale events
rate(hybernate_scale_events_total[1h])

# Idle flukes (signals say idle, prediction disagrees)
rate(hybernate_idle_fluke_total[1h]) > 0

# Workloads stuck in grace period
hybernate_idle_signal_result == 3