Troubleshooting

Common Issues

Workload is idle but not being paused

Check 1: Is dry run enabled?

kubectl get managedworkload my-api -n staging -o jsonpath='{.spec.dryRun}'

If this prints true, the operator evaluates idle signals but takes no action. Set spec.dryRun to false to let it act.
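If dry run turns out to be enabled, one way to switch it off is a JSON merge patch against the field queried above. This is a minimal sketch, assuming spec.dryRun is a plain boolean; the helper name set_dry_run is hypothetical and not part of the operator:

```shell
# Hypothetical helper: toggle spec.dryRun with a JSON merge patch.
# Usage: set_dry_run <name> <namespace> <true|false>
set_dry_run() {
  case "$3" in
    true|false) ;;
    *) echo "value must be true or false" >&2; return 1 ;;
  esac
  # --type merge is used because strategic merge patches do not apply
  # to custom resources.
  kubectl patch managedworkload "$1" -n "$2" \
    --type merge -p "{\"spec\":{\"dryRun\":$3}}"
}
```

For example, run set_dry_run my-api staging false and then repeat the jsonpath query to confirm the change.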

Check 2: Are all signals confirming?

kubectl describe managedworkload my-api -n staging

Look at the events for the result of each signal evaluation. If any single signal denies, idle detection resets.

Check 3: Is the grace period still running?

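Check the idle signal metric. A value of 3 means the workload is still inside its grace period, so no pause happens until that period elapses: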

# 3 = InGracePeriod, 4 = Idle
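kubectl get --raw /metrics | grep hybernate_idle_signal_result

If grep prints nothing, keep in mind that kubectl get --raw /metrics reads the API server's own metrics endpoint; the operator's metrics may need to be read from the controller manager's metrics endpoint instead.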
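The numeric values are easier to scan with their names attached. The sketch below labels each sample read from standard input. Only the two values documented above (3 and 4) are mapped; anything else is printed as unmapped rather than guessed. The function name label_idle_signal is hypothetical, and the sketch assumes the samples carry no trailing timestamp:

```shell
# Hypothetical helper: annotate hybernate_idle_signal_result samples.
# Reads Prometheus text-format lines on stdin and treats the last
# whitespace-separated field as the sample value.
label_idle_signal() {
  awk '
    /^hybernate_idle_signal_result/ {
      value = $NF + 0
      if (value == 3)      state = "InGracePeriod"
      else if (value == 4) state = "Idle"
      else                 state = "unmapped (" $NF ")"
      print $0 "  # " state
    }'
}
```

Pipe the metrics output into it in place of grep, for example: kubectl get --raw /metrics | label_idle_signal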

Check 4: Is the prediction engine active?

kubectl get managedworkload my-api -n staging -o jsonpath='{.status.prediction}'

If dailyPhase is Observing, the engine has not collected enough data yet; it needs 24+ hours of observations.

Workload keeps cycling between paused and running

This usually means idle detection triggers a pause, and auto-resume then immediately detects "not idle". CPU is an unreliable signal here: a paused workload has zero CPU, which is below the threshold, but it also has no pods to measure.

Fix: Ensure your Prometheus signals check for actual traffic, not just CPU:

managedworkload.yaml
idlePolicy:
  signals:
    - source: prometheus
      promQL: 'rate(http_requests_total{service="my-api"}[10m]) == 0'

Prediction confidence stays at 0

The confidence scorer needs a full 24-hour window of data before reporting. Wait 24+ hours after creating the ManagedWorkload.

Also check that metrics-server is running and returning data:

kubectl top pods -n staging

Scale-down is blocked

kubectl describe managedworkload my-api -n staging

Look for:

"in stabilization window": cooldown from a recent scale event. Wait for the stabilization period to elapse.
Guard probe denial: a Prometheus guard query returned zero/empty. Check the query against Prometheus directly.
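To check a guard query against Prometheus directly, the Prometheus HTTP query API can be called with curl. This is a minimal sketch, assuming Prometheus is reachable at http://localhost:9090 (for example through a port-forward; the address is an assumption, not something the operator defines). The helper names are hypothetical, and the emptiness check is a plain text match rather than a JSON parser:

```shell
# Assumed address; adjust to wherever your Prometheus is reachable.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Run an instant query and print the raw JSON response.
prom_query() {
  curl -sS -G "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# Succeeds (exit 0) when the JSON response on stdin has an empty result set.
prom_result_is_empty() {
  tr -d ' \n\t' | grep -q '"result":\[\]'
}
```

For example, prom_query 'your guard query' | prom_result_is_empty && echo "query returned no series" shows whether the guard would see an empty result.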

Target not found

kubectl get managedworkload my-api -n staging -o jsonpath='{.status.conditions}'

If you see a Degraded condition with "target not found":

Verify the target exists: kubectl get deployment my-api -n staging
Check that target.kind matches (Deployment vs StatefulSet)
Ensure the ManagedWorkload is in the same namespace as the target

Duplicate target error

Only one ManagedWorkload can manage a given workload. Check for duplicates:

kubectl get managedworkloads -n staging -o jsonpath='{range .items[*]}{.metadata.name}: {.spec.target.name}{"\n"}{end}'
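With many ManagedWorkloads in a namespace, duplicates are easier to spot when the output is grouped by target name. The sketch below reads the "name: target" lines produced by the command above on standard input and prints each target claimed more than once, together with the ManagedWorkloads claiming it. The function name find_duplicate_targets is hypothetical, and it compares target names only, so two targets of different kinds that share a name are reported as well:

```shell
# Hypothetical helper: list targets referenced by more than one ManagedWorkload.
# Input lines look like "<managedworkload-name>: <target-name>".
find_duplicate_targets() {
  awk -F': ' '
    NF >= 2 {
      count[$2]++
      if ($2 in owners) owners[$2] = owners[$2] ", " $1
      else              owners[$2] = $1
    }
    END {
      for (target in count)
        if (count[target] > 1)
          print target " <- " owners[target]
    }' | sort
}
```

Append | find_duplicate_targets to the kubectl command above; no output means no target name is claimed twice in that namespace.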

Cost data shows $0.00

Cost accumulation requires metrics-server data. Check that kubectl top pods -n staging returns usage figures.
On day 1 of the month, estimatedMonthlyCost shows "pending" until day 2.

Debug Checklist

Operator running? kubectl get pods -n hybernate-system
CRDs installed? kubectl get crd managedworkloads.hybernate.io
Metrics server running? kubectl top nodes
RBAC correct? kubectl auth can-i get pods --as=system:serviceaccount:hybernate-system:hybernate-controller-manager
Events? kubectl describe managedworkload <name> -n <ns>
Logs? kubectl logs -n hybernate-system deployment/hybernate-controller-manager
Status? kubectl get managedworkload <name> -n <ns> -o yaml
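When preparing an issue report, it can help to capture the whole checklist in one pass. The sketch below runs the same commands as the checklist and writes each result to its own file. The function names and file names are hypothetical, and a failing command is recorded in its file rather than aborting the run:

```shell
# Run a command, saving stdout and stderr to a file; note failures in the file.
capture() {
  file="$1"; shift
  "$@" > "$file" 2>&1 || echo "exit status: $?" >> "$file"
}

# Usage: collect_debug <managedworkload-name> <namespace> [output-dir]
collect_debug() {
  name="$1"; ns="$2"; dir="${3:-hybernate-debug}"
  sa="system:serviceaccount:hybernate-system:hybernate-controller-manager"
  mkdir -p "$dir" || return 1

  capture "$dir/operator-pods.txt" kubectl get pods -n hybernate-system
  capture "$dir/crd.txt"           kubectl get crd managedworkloads.hybernate.io
  capture "$dir/top-nodes.txt"     kubectl top nodes
  capture "$dir/rbac.txt"          kubectl auth can-i get pods --as="$sa"
  capture "$dir/describe.txt"      kubectl describe managedworkload "$name" -n "$ns"
  capture "$dir/logs.txt"          kubectl logs -n hybernate-system deployment/hybernate-controller-manager
  capture "$dir/status.yaml"       kubectl get managedworkload "$name" -n "$ns" -o yaml
  echo "wrote debug output to $dir"
}
```

For example, collect_debug my-api staging writes the files to ./hybernate-debug. Review and sanitize them before sharing.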

Getting Help

If you've gone through this checklist and still have issues:

Check the GitHub Issues for known problems
Open a new issue with:
  Operator version and Kubernetes version
  ManagedWorkload YAML (sanitized)
  Relevant operator logs
  Output of kubectl describe managedworkload