Every Kubernetes cluster drifts. What started as a well-architected deployment gradually accumulates technical debt — overly permissive RBAC bindings added during a late-night incident, resource limits that were "temporary," network policies that never got written. A structured health check is the only way to catch these issues before they become outages or security incidents.
At k8s.sa, we've conducted health checks on clusters ranging from 5-node development environments to 200+ node production fleets. This checklist distills the patterns we see repeatedly. It's not theoretical — every item here has been the root cause of a real production issue we've encountered in client engagements.
1. Control Plane Health
The control plane is the nervous system of your cluster. If it's degraded, everything downstream is unreliable. Start here before examining workloads.
API Server
- `--anonymous-auth=false` is set (unless anonymous access is explicitly required)
- Key admission plugins are enabled: NodeRestriction, PodSecurity, ResourceQuota, LimitRanger

Check API server health directly:
kubectl get --raw='/readyz?verbose' | grep -E '^\[|ok|failed'
kubectl get --raw='/metrics' | grep apiserver_request_duration_seconds
etcd
- Encryption at rest is configured (`--encryption-provider-config`)

# Check etcd health (from a control plane node or etcd pod)
etcdctl endpoint health --cluster
etcdctl endpoint status --cluster -w table
Scheduler & Controller Manager
- Both components report healthy on their `/healthz` endpoints

2. RBAC & Authentication
RBAC misconfigurations are the single most common security finding in our audits. The pattern is always the same: someone grants cluster-admin to unblock a deployment, and it never gets reverted.
- No `cluster-admin` ClusterRoleBindings exist beyond system components
- Service accounts that don't need API access disable token mounting (`automountServiceAccountToken: false`)
- No roles grant wildcard (`*`) access to resources or verbs

Audit your RBAC bindings with:
# Find all cluster-admin bindings
kubectl get clusterrolebindings -o json | jq -r '
  .items[] | select(.roleRef.name == "cluster-admin") |
  .metadata.name as $binding | .subjects[]? |
  "\($binding): \(.kind)/\(.name)"'
# Find roles with wildcard permissions
kubectl get clusterroles -o json | jq -r '
  .items[] | select(any(.rules[]?; .verbs[]? == "*")) |
  .metadata.name'
We also recommend running kubectl-who-can from the Aqua Security project to audit specific sensitive operations like create secrets or delete pods across all namespaces.
3. Resource Management
Resource limits aren't optional in production. Without them, a single misbehaving pod can trigger OOM kills across an entire node, taking down unrelated workloads.
- No pods are stuck in the Pending state due to insufficient resources

# Find containers without resource limits
kubectl get pods -A -o json | jq -r '
  .items[] | .metadata.namespace as $ns | .metadata.name as $pod |
  .spec.containers[] | select(.resources.limits == null) |
  "\($ns)/\($pod)/\(.name) has no limits"'
# Check node resource pressure
kubectl top nodes
kubectl describe nodes | grep -A5 "Allocated resources"
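Where teams can't yet add limits to every manifest, a LimitRange gives a namespace sane fallbacks. A minimal sketch — the namespace name and values here are illustrative, not recommendations:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
  - type: Container
    defaultRequest:       # applied when a container omits requests
      cpu: 100m
      memory: 128Mi
    default:              # applied when a container omits limits
      cpu: 500m
      memory: 256Mi
```

Defaults only apply to containers that omit values, so explicit per-workload settings still win.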
A note on CPU limits: There's an ongoing debate in the Kubernetes community about whether CPU limits cause more harm than good due to CFS throttling. Our recommendation: always set CPU requests for scheduling, but evaluate CPU limits on a per-workload basis. Latency-sensitive services often perform better without CPU limits. Use
`container_cpu_cfs_throttled_seconds_total` in Prometheus to detect throttling.
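As a sketch, the throttling signal can back a Prometheus alerting rule. This uses the companion `container_cpu_cfs_throttled_periods_total` and `container_cpu_cfs_periods_total` counters so the ratio is unitless; the 25% threshold and labels are illustrative assumptions:

```yaml
groups:
- name: cpu-throttling
  rules:
  - alert: ContainerCPUThrottled
    # Fires when a container spends >25% of its CFS periods throttled.
    expr: |
      rate(container_cpu_cfs_throttled_periods_total[5m])
        / rate(container_cpu_cfs_periods_total[5m]) > 0.25
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.container }} is being CPU-throttled"
```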
4. Network Policies
By default, every pod in a Kubernetes cluster can communicate with every other pod. This is the equivalent of running a flat network with no firewall rules. In a multi-tenant or production environment, this is unacceptable.
- Policies are verified to actually block traffic (test with netcat or curl from test pods)

# Default deny all ingress in a namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
If you're using Cilium, consider upgrading to CiliumNetworkPolicy resources for L7 visibility and DNS-aware policies. The ability to write policies like "allow egress only to api.stripe.com" is a significant security improvement over IP-based rules.
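A sketch of the DNS-aware egress rule described above, assuming Cilium with DNS visibility enabled; the namespace and labels are illustrative:

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-stripe-egress
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: payments
  egress:
  # Allow DNS lookups so FQDN rules can be resolved and observed
  - toEndpoints:
    - matchLabels:
        k8s:io.kubernetes.pod.namespace: kube-system
        k8s-app: kube-dns
    toPorts:
    - ports:
      - port: "53"
        protocol: UDP
      rules:
        dns:
        - matchPattern: "*"
  # Allow HTTPS egress only to api.stripe.com
  - toFQDNs:
    - matchName: "api.stripe.com"
    toPorts:
    - ports:
      - port: "443"
        protocol: TCP
```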
5. Pod Security
Pod Security Admission (PSA) replaced PodSecurityPolicies in Kubernetes 1.25. If you're still running without any pod security enforcement, every pod in your cluster can run as root, mount the host filesystem, and access the host network.
- Every namespace enforces a Pod Security Standard (restricted or baseline)
- Containers run as non-root (runAsNonRoot: true)
- Root filesystems are read-only where possible (readOnlyRootFilesystem: true)
- Host namespaces are disabled (hostNetwork, hostPID, hostIPC are false)

# Label namespaces for Pod Security Admission
kubectl label namespace production \
pod-security.kubernetes.io/enforce=restricted \
pod-security.kubernetes.io/audit=restricted \
pod-security.kubernetes.io/warn=restricted
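For reference, a pod that passes the restricted profile typically declares a security context along these lines — a minimal sketch, with illustrative names and image:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: restricted-example
  namespace: production
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault    # required by the restricted profile
  containers:
  - name: app
    image: registry.example.com/app:1.2.3
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]         # restricted requires dropping ALL
```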
6. Image Security & Supply Chain
- Images are pinned by digest (@sha256:...) or protected by a tag-immutability policy
- imagePullPolicy is Always for mutable tags, or IfNotPresent for digest-pinned images

# Scan all running images with Trivy
kubectl get pods -A -o jsonpath='{range .items[*]}{range .spec.containers[*]}{.image}{"\n"}{end}{end}' | \
sort -u | xargs -I{} trivy image --severity HIGH,CRITICAL {}
7. Secrets Management
Kubernetes Secrets are base64-encoded, not encrypted. Without additional measures, anyone with read access to Secrets in a namespace can decode every credential stored there.
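To make the point concrete, here's the round trip — a local demo, with "hunter2" standing in for any stored credential:

```shell
# Kubernetes stores Secret values base64-encoded, not encrypted.
# Encoding and decoding are trivial -- treat Secret read access as
# plaintext credential access.
printf 'hunter2' | base64            # encodes to aHVudGVyMg==
printf 'aHVudGVyMg==' | base64 -d    # decodes right back to hunter2
```

This is why encryption at rest on etcd (section 1) and strict RBAC on Secret reads (section 2) matter together.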
8. Monitoring & Observability
9. Backup & Disaster Recovery
# Take an etcd snapshot
etcdctl snapshot save /backup/etcd-$(date +%Y%m%d).db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Verify the snapshot
etcdctl snapshot status /backup/etcd-$(date +%Y%m%d).db -w table
10. CIS Benchmark Compliance
The CIS Kubernetes Benchmark provides a comprehensive set of security recommendations. Running it isn't optional for production clusters — it's the baseline.
# Run kube-bench on a node
docker run --rm --pid=host -v /etc:/etc:ro -v /var:/var:ro \
aquasec/kube-bench:latest run --targets=node
# Or as a Kubernetes Job
kubectl apply -f https://raw.githubusercontent.com/aquasecurity/kube-bench/main/job.yaml
kubectl logs -l app=kube-bench
Putting It All Together
A health check isn't a one-time event. The clusters that stay healthy are the ones where these checks are automated and run continuously. Tools like Polaris, Popeye, and kube-score can automate many of these checks and integrate into your CI/CD pipeline.
At minimum, we recommend:
- Weekly: Automated scans with Polaris or kube-score, image vulnerability scanning
- Monthly: RBAC audit, resource utilization review, certificate expiry check
- Quarterly: Full CIS benchmark run, disaster recovery test, network policy review
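As one way to put the weekly cadence into practice, a scheduled CI job can score rendered manifests. A sketch assuming GitHub Actions, a kustomize layout, and kube-score available on the runner (installation steps elided):

```yaml
# .github/workflows/weekly-cluster-lint.yml -- illustrative
name: weekly-cluster-lint
on:
  schedule:
    - cron: "0 6 * * 1"   # every Monday, 06:00 UTC
jobs:
  kube-score:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Render and score manifests
        run: |
          kustomize build overlays/production > manifests.yaml
          kube-score score manifests.yaml
```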
The goal isn't perfection on day one — it's establishing a baseline and improving continuously. Start with the items that carry the highest risk: RBAC, network policies, and backup verification. Everything else can follow.
Need a Professional Cluster Health Check?
Our team conducts thorough Kubernetes health assessments with actionable findings and prioritized remediation plans. No vendor lock-in — just expert advisory.
Schedule a Health Check