The Complete Kubernetes Cluster Health Check Checklist

A production-ready checklist covering security posture, RBAC policies, resource utilization, network policies, and CIS benchmark compliance — the same framework we use in every client engagement.

Every Kubernetes cluster drifts. What started as a well-architected deployment gradually accumulates technical debt — overly permissive RBAC bindings added during a late-night incident, resource limits that were "temporary," network policies that never got written. A structured health check is the only way to catch these issues before they become outages or security incidents.

At k8s.sa, we've conducted health checks on clusters ranging from 5-node development environments to 200+ node production fleets. This checklist distills the patterns we see repeatedly. It's not theoretical — every item here has been the root cause of a real production issue we've encountered in client engagements.

1. Control Plane Health

The control plane is the nervous system of your cluster. If it's degraded, everything downstream is unreliable. Start here before examining workloads.

API Server

API server response latency p99 is under 1 second for non-LIST requests
Audit logging is enabled and shipping to a durable backend (not just local files)
API server is configured with --anonymous-auth=false (unless explicitly required)
Admission controllers are enabled: NodeRestriction, PodSecurity, ResourceQuota, LimitRanger

Check API server health directly:

kubectl get --raw='/readyz?verbose' | grep -E '^\[|ok|failed'
kubectl get --raw='/metrics' | grep apiserver_request_duration_seconds

etcd

etcd cluster has an odd number of members (3 or 5 for production)
etcd database size is well below the backend quota (2GB by default; 8GB is the recommended maximum)
Defragmentation is scheduled regularly
etcd encryption at rest is enabled for Secrets (--encryption-provider-config)

# Check etcd health (from a control plane node or etcd pod)
etcdctl endpoint health --cluster
etcdctl endpoint status --cluster -w table

Scheduler & Controller Manager

Both components report healthy via /healthz endpoints
Leader election is functioning (check logs for leader transitions)
No persistent scheduling failures visible in events

2. RBAC & Authentication

RBAC misconfigurations are the single most common security finding in our audits. The pattern is always the same: someone grants cluster-admin to unblock a deployment, and it never gets reverted.

No unnecessary cluster-admin ClusterRoleBindings exist beyond system components
Service accounts use namespace-scoped RoleBindings, not ClusterRoleBindings
Default service account tokens are not auto-mounted (automountServiceAccountToken: false)
No Roles or ClusterRoles grant wildcard (*) access to resources or verbs
OIDC or external identity provider is configured for human access (not client certificates)

Audit your RBAC bindings with:

# Find all cluster-admin bindings
kubectl get clusterrolebindings -o json | jq -r '
  .items[] | select(.roleRef.name=="cluster-admin") |
  .metadata.name as $binding | .subjects[]? |
  "\($binding): \(.kind)/\(.name)"'

# Find roles with wildcard verbs or resources
kubectl get clusterroles -o json | jq -r '
  .items[] |
  select(any(.rules[]?; (.verbs[]? == "*") or (.resources[]? == "*"))) |
  .metadata.name'

We also recommend running kubectl-who-can from the Aqua Security project to audit specific sensitive operations like create secrets or delete pods across all namespaces.
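
Disabling default token auto-mounting (the checklist item above) can be sketched at the service account level; the `app-sa` name and `production` namespace are illustrative:

```yaml
# Hypothetical service account with token auto-mounting disabled
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-sa            # illustrative name
  namespace: production
automountServiceAccountToken: false
```

Pods that genuinely need API access can opt back in by setting automountServiceAccountToken: true in their own spec.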

3. Resource Management

Resource limits aren't optional in production. Without them, a single misbehaving pod can trigger OOM kills across an entire node, taking down unrelated workloads.

Every container has CPU and memory requests defined
Every container has memory limits defined (CPU limits are debatable — see below)
ResourceQuotas are set on every namespace
LimitRanges provide sensible defaults for pods without explicit limits
No pods are in a perpetual Pending state due to insufficient resources
Vertical Pod Autoscaler (VPA) recommendations have been reviewed for right-sizing

# Find containers without resource limits
kubectl get pods -A -o json | jq -r '
  .items[] | .metadata as $m | .spec.containers[] |
  select(.resources.limits == null) |
  "\($m.namespace)/\($m.name)/\(.name) has no limits"'

# Check node resource pressure
kubectl top nodes
kubectl describe nodes | grep -A5 "Allocated resources"

A note on CPU limits: There's an ongoing debate in the Kubernetes community about whether CPU limits cause more harm than good due to CFS throttling. Our recommendation: always set CPU requests for scheduling, but evaluate CPU limits on a per-workload basis. Latency-sensitive services often perform better without CPU limits. Use container_cpu_cfs_throttled_seconds_total in Prometheus to detect throttling.
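
A LimitRange that supplies the defaults mentioned in the checklist might look like this — a sketch with illustrative values, tuned per namespace in practice:

```yaml
# Hypothetical namespace defaults for containers that omit requests/limits
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
    - type: Container
      defaultRequest:      # applied when a container omits requests
        cpu: 100m
        memory: 128Mi
      default:             # applied when a container omits limits
        memory: 256Mi      # memory limit only; decide CPU limits per workload
```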

4. Network Policies

By default, every pod in a Kubernetes cluster can communicate with every other pod. This is the equivalent of running a flat network with no firewall rules. In a multi-tenant or production environment, this is unacceptable.

A default-deny ingress NetworkPolicy exists in every namespace
A default-deny egress NetworkPolicy exists in every namespace (with DNS exceptions)
Explicit allow policies exist for each legitimate communication path
CNI plugin supports NetworkPolicy enforcement (Calico, Cilium, or equivalent)
Network policies have been tested — not just applied (use netcat or curl from test pods)

# Default deny all ingress in a namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
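
The companion default-deny egress policy from the checklist, with the DNS exception, could be sketched as follows; the kube-dns selector assumes a standard CoreDNS deployment in kube-system:

```yaml
# Default deny all egress, but allow DNS lookups
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```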

If you're using Cilium, consider upgrading to CiliumNetworkPolicy resources for L7 visibility and DNS-aware policies. The ability to write policies like "allow egress only to api.stripe.com" is a significant security improvement over IP-based rules.

5. Pod Security

Pod Security Admission (PSA) replaced PodSecurityPolicies in Kubernetes 1.25. If you're still running without any pod security enforcement, every pod in your cluster can run as root, mount the host filesystem, and access the host network.

Pod Security Admission is configured at the namespace level (restricted or baseline)
No containers run as root (runAsNonRoot: true)
No containers use privileged mode
All containers drop all capabilities and add back only what's needed
Read-only root filesystem is enabled where possible (readOnlyRootFilesystem: true)
Host namespaces are not shared (hostNetwork, hostPID, hostIPC are false)

# Label namespaces for Pod Security Admission
kubectl label namespace production \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/audit=restricted \
  pod-security.kubernetes.io/warn=restricted
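
At the container level, the checklist items above translate to a securityContext along these lines (an illustrative sketch satisfying the restricted profile):

```yaml
# Container securityContext meeting the restricted Pod Security Standard
securityContext:
  runAsNonRoot: true
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop:
      - ALL
  seccompProfile:
    type: RuntimeDefault
```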

6. Image Security & Supply Chain

All images are pulled from a private registry (not directly from Docker Hub)
Image tags are immutable — use digests (@sha256:...) or a tag-immutability policy
A vulnerability scanner (Trivy, Grype, or Snyk) runs in CI and as an admission controller
Image pull policy is Always for mutable tags or IfNotPresent for digest-pinned images
Cosign or Notary signatures are verified before admission (via Kyverno or OPA Gatekeeper)

# Scan all running images with Trivy
kubectl get pods -A -o jsonpath='{range .items[*]}{range .spec.containers[*]}{.image}{"\n"}{end}{end}' | \
  sort -u | xargs -I{} trivy image --severity HIGH,CRITICAL {}
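
Signature verification at admission (the last checklist item) can be enforced with a Kyverno policy along these lines — a sketch assuming key-based Cosign signing; the registry pattern and public key are placeholders:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-image-signatures
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-cosign-signature
      match:
        any:
          - resources:
              kinds:
                - Pod
      verifyImages:
        - imageReferences:
            - "registry.example.com/*"   # placeholder: your private registry
          attestors:
            - entries:
                - keys:
                    publicKeys: |-
                      -----BEGIN PUBLIC KEY-----
                      ...
                      -----END PUBLIC KEY-----
```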

7. Secrets Management

Kubernetes Secrets are base64-encoded, not encrypted. Without additional measures, anyone with read access to Secrets in a namespace can decode every credential stored there.
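
To see why base64 is not encryption — decoding takes one command and no key. The encoded string below is illustrative, as is the commented-out in-cluster equivalent:

```shell
# Base64 "protection" reverses trivially
echo 'c3VwZXItc2VjcmV0LXBhc3N3b3Jk' | base64 -d
# prints: super-secret-password

# In-cluster equivalent (hypothetical Secret name and key):
# kubectl get secret db-credentials -o jsonpath='{.data.password}' | base64 -d
```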

etcd encryption at rest is enabled for Secret resources
External secrets management is in use (HashiCorp Vault, AWS Secrets Manager, or equivalent)
Secrets are not stored in Git (even encrypted — use External Secrets Operator or Sealed Secrets)
RBAC restricts Secret read access to only the service accounts that need it
Secret rotation is automated and tested
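
With the External Secrets Operator mentioned above, only a reference lives in Git while the actual value stays in the backing store; the names and Vault path here are illustrative:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials            # illustrative
  namespace: production
spec:
  refreshInterval: 1h             # re-sync from the backing store hourly
  secretStoreRef:
    name: vault-backend           # assumes a configured ClusterSecretStore
    kind: ClusterSecretStore
  target:
    name: db-credentials          # Kubernetes Secret managed by the operator
  data:
    - secretKey: password
      remoteRef:
        key: database/production  # illustrative Vault path
        property: password
```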

8. Monitoring & Observability

Prometheus (or equivalent) is scraping all nodes, pods, and control plane components
Alerting rules exist for: node not ready, pod crash loops, PVC near capacity, certificate expiry
Centralized logging is configured (Loki, Elasticsearch, or cloud-native equivalent)
Distributed tracing is in place for service-to-service calls (Jaeger, Tempo, or equivalent)
Dashboards exist for the four golden signals: latency, traffic, errors, saturation
PagerDuty, Opsgenie, or equivalent on-call routing is configured for critical alerts
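
If you run the Prometheus Operator with kube-state-metrics, one of the alerts from the checklist might be expressed as follows (a sketch; the threshold and labels are illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cluster-health
  namespace: monitoring
spec:
  groups:
    - name: node-health
      rules:
        - alert: KubeNodeNotReady
          # kube-state-metrics exposes node conditions as 0/1 gauges
          expr: kube_node_status_condition{condition="Ready",status="true"} == 0
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "Node {{ $labels.node }} has been NotReady for 10 minutes"
```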

9. Backup & Disaster Recovery

etcd snapshots are taken automatically on a schedule (at minimum daily)
etcd restore has been tested — not just the backup (test quarterly at minimum)
Velero or equivalent backs up namespaced resources and persistent volumes
Backup storage is in a different failure domain than the cluster
Recovery Time Objective (RTO) and Recovery Point Objective (RPO) are documented and tested

# Take an etcd snapshot
etcdctl snapshot save /backup/etcd-$(date +%Y%m%d).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify the snapshot
etcdctl snapshot status /backup/etcd-$(date +%Y%m%d).db -w table

10. CIS Benchmark Compliance

The CIS Kubernetes Benchmark provides a comprehensive set of security recommendations. Running it isn't optional for production clusters — it's the baseline.

kube-bench has been run against all control plane and worker nodes
All FAIL results have been triaged — either remediated or documented with accepted risk
CIS benchmark scans are automated and run on a schedule (not just once)

# Run kube-bench on a node
docker run --rm --pid=host -v /etc:/etc:ro -v /var:/var:ro \
  aquasec/kube-bench:latest run --targets=node

# Or as a Kubernetes Job
kubectl apply -f https://raw.githubusercontent.com/aquasecurity/kube-bench/main/job.yaml
kubectl logs -l app=kube-bench

Putting It All Together

A health check isn't a one-time event. The clusters that stay healthy are the ones where these checks are automated and run continuously. Tools like Polaris, Popeye, and kube-score can automate many of these checks and integrate into your CI/CD pipeline.

At minimum, start with the items that carry the highest risk: RBAC, network policies, and backup verification. The goal isn't perfection on day one — it's establishing a baseline and improving continuously. Everything else can follow.

Need a Professional Cluster Health Check?

Our team conducts thorough Kubernetes health assessments with actionable findings and prioritized remediation plans. No vendor lock-in — just expert advisory.

Schedule a Health Check