Metrics Reference¶

All metrics are prefixed with hybernate_ and registered with the controller-runtime metrics registry. They are served on the metrics endpoint (default :8443 HTTPS).

Tier 1: Cluster Health¶

These metrics provide a high-level view of Hybernate's impact.

Metric	Type	Labels	Description
`hybernate_workloads_total`	Gauge	`phase`	Managed workloads by lifecycle phase
`hybernate_active_workloads`	Gauge		Workloads in a running state
`hybernate_paused_workloads`	Gauge		Currently paused workloads
`hybernate_destroyed_workloads`	Gauge		Destroyed workloads
`hybernate_reconcile_errors_total`	Counter	`controller`	Reconciliation errors by controller
`hybernate_lifecycle_transitions_total`	Counter	`from`, `to`	Phase transitions
`hybernate_lifecycle_action_duration_seconds`	Histogram	`action`	Duration of lifecycle actions (pause, resume, destroy, scale)
`hybernate_cost_estimated_savings_dollars`	Gauge		Estimated monthly savings (requires autoscaler for realization)
`hybernate_cost_estimated_dollars`	Gauge		Total estimated monthly cost
`hybernate_cost_estimated_without_management_dollars`	Gauge		Estimated cost without Hybernate
`hybernate_resource_reduction_cpu_millicores`	Gauge		Total CPU millicores freed by Hybernate
`hybernate_resource_reduction_memory_bytes`	Gauge		Total memory bytes freed by Hybernate

Tier 2: Operational Insight¶

These metrics help you understand what the operator is doing and why.

Metric	Type	Labels	Description
`hybernate_prediction_confidence_percent`	Gauge	`season`, `namespace`, `workload`	Prediction accuracy by season
`hybernate_prediction_phase`	Gauge	`namespace`, `workload`	Engine phase (0-4)
`hybernate_prediction_data_points`	Gauge	`namespace`, `workload`	Data points collected
`hybernate_prediction_anomalies_total`	Counter	`namespace`, `workload`	Anomalies detected
`hybernate_cost_cpu_hours`	Gauge		Total vCPU-hours this month
`hybernate_cost_memory_hours`	Gauge		Total GiB memory hours
`hybernate_cost_storage_hours`	Gauge		Total GiB storage hours
`hybernate_scale_events_total`	Counter	`direction`, `namespace`, `workload`	Scale events by direction
`hybernate_scale_replicas`	Gauge	`namespace`, `workload`	Current replica count
`hybernate_idle_detections_total`	Counter	`action`, `namespace`, `workload`	Idle detections by action
`hybernate_pause_expiry_actions_total`	Counter	`action`	Pause expiry events
`hybernate_drift_detections_total`	Counter	`policy`	Replica drift detections

Tier 3: Debugging¶

These metrics help troubleshoot specific workload behavior.

Metric	Type	Labels	Description
`hybernate_idle_signal_result`	Gauge	`namespace`, `workload`	Idle signal status (1=active, 2=signals_confirm, 3=grace_period, 4=idle)
`hybernate_scale_guard_blocked_total`	Counter	`namespace`, `workload`	Scale-downs blocked by guard probes
`hybernate_idle_fluke_total`	Counter	`namespace`, `workload`	Signals confirmed idle but prediction disagreed
`hybernate_prediction_regime_changes_total`	Counter	`namespace`, `workload`	Regime changes detected
`hybernate_pvc_retention_remaining_seconds`	Gauge	`namespace`, `workload`	Seconds until PVC cleanup
`hybernate_automation_skipped_total`	Counter	`namespace`, `workload`	Automation skipped (manual override active)
`hybernate_dryrun_actions_total`	Counter	`action`	Actions that would have been taken in dry-run
`hybernate_target_unavailable_total`	Counter	`namespace`, `workload`	Target workload not found

Discovery¶

Metric	Type	Labels	Description
`hybernate_discovery_scan_duration_seconds`	Histogram		Scan duration
`hybernate_discovery_workloads`	Gauge	`classification`	Discovered workloads by class
`hybernate_discovery_estimated_savings_dollars`	Gauge		Estimated savings from discoveries
`hybernate_discovery_auto_managed_total`	Counter		Auto-created ManagedWorkloads

Prediction Phase Values¶

The hybernate_prediction_phase gauge uses numeric values:

Value	Phase
0	Observing
1	DailySuggesting
2	DailyActive
3	WeeklySuggesting
4	FullyActive

Useful PromQL Queries¶

# Total workloads by phase
hybernate_workloads_total

# Estimated monthly savings trend
hybernate_cost_estimated_savings_dollars

# Workloads with low prediction confidence
hybernate_prediction_confidence_percent{season="daily"} < 70

# Recent scale events
rate(hybernate_scale_events_total[1h])

# Idle flukes (signals say idle, prediction disagrees)
rate(hybernate_idle_fluke_total[1h]) > 0

# Workloads stuck in grace period
hybernate_idle_signal_result == 3