Kubernetes failures rarely arrive as neat root causes. They show up as symptoms: a pod stuck in CrashLoopBackOff, a deployment that never becomes ready, a service that works inside the cluster but not from outside, or a node that suddenly starts evicting workloads. This guide is built as a practical Kubernetes troubleshooting hub you can return to whenever something breaks. Instead of treating debugging as a set of isolated commands, it organizes the work by symptom, explains the most common causes behind each failure mode, and gives you a repeatable path to isolate, confirm, and fix issues with less guesswork.
Overview
The goal of Kubernetes troubleshooting is not to memorize every possible error message. It is to narrow the search space quickly. Most production issues can be traced to a small set of layers: application startup, container image and runtime, scheduling, configuration, storage, networking, permissions, or cluster resource pressure.
A useful mental model is to troubleshoot from the outside in:
- What is the visible symptom? A failed rollout, a restarting pod, a timeout, an unschedulable workload, a DNS issue, or degraded node health.
- What changed recently? New image, new config, new secret, node pool change, autoscaler event, network policy, ingress rule, or dependency outage.
- Which Kubernetes object first shows the failure? Pod, ReplicaSet, Deployment, StatefulSet, Job, Service, Ingress, PersistentVolumeClaim, or Node.
- What does the control plane say? Events, conditions, readiness status, scheduling messages, and restart history often tell you where to look next.
For day-to-day kubectl troubleshooting, a small command set solves a surprising amount of the work:
- kubectl get pods -A to spot broad symptoms quickly
- kubectl describe pod <name> to inspect events, conditions, mounts, probes, and restart causes
- kubectl logs <pod> --previous to capture logs from the last crashed container
- kubectl get events -A --sort-by=.lastTimestamp to identify recent cluster-level signals
- kubectl get deploy,rs,svc,ing,pvc -n <namespace> to understand surrounding objects
- kubectl exec -it <pod> -- sh for in-container checks when the container stays up long enough
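The read-only commands above can be strung together into a first-look helper. This is a minimal sketch, assuming a POSIX shell with kubectl on the PATH and a working kubeconfig; the function name first_look and its two arguments are hypothetical, not part of kubectl.

```shell
# first_look NAMESPACE POD
# Runs the read-only checks from the command list against one pod.
# The function name and argument order are illustrative assumptions.
first_look() {
  ns=${1:?usage: first_look NAMESPACE POD}
  pod=${2:?usage: first_look NAMESPACE POD}

  echo "== describe =="
  kubectl describe pod "$pod" -n "$ns"

  echo "== logs from the previous container instance =="
  # --previous fails when the container has never restarted; that is not fatal here.
  kubectl logs "$pod" -n "$ns" --previous || echo "(no previous container logs)"

  echo "== recent events in the namespace =="
  kubectl get events -n "$ns" --sort-by=.lastTimestamp

  echo "== surrounding objects =="
  kubectl get deploy,rs,svc,ing,pvc -n "$ns"
}
```

Calling first_look shop api-123 would print all four views in one pass, which is usually enough to decide which symptom section applies.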
If your team runs a delivery platform or internal developer platform, it helps to formalize these checks into standard runbooks. That reduces panic-driven debugging and improves developer experience, especially when engineers outside the platform team need to diagnose their own workloads.
Topic map
Use this section as your symptom-based index. Start with the problem you can observe, then work downward until you find a likely cause.
1. Pod stuck in CrashLoopBackOff
This is one of the most common Kubernetes errors. The pod starts, fails, restarts, and enters an exponential backoff cycle.
Common causes:
- Application process exits immediately because the command or entrypoint is wrong
- Missing environment variables, secrets, or config files
- Dependency failures during startup, such as database connection errors
- Liveness probe kills the container before startup completes
- Port binding issues or startup scripts that assume a non-container environment
- Out-of-memory termination leading to repeated restarts
What to check:
- kubectl describe pod for restart counts, termination reason, and event messages
- kubectl logs and kubectl logs --previous for startup failures
- Probe settings: initial delay, timeout, path, and port
- Container command, args, image tag, and environment injection
- Exit code patterns such as 1, 137, or 139
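Exit codes are easier to act on once they are decoded. The sketch below is one way to do that, assuming a POSIX shell; the helper names explain_exit_code and last_exit_code are hypothetical, and the hints are rules of thumb rather than a diagnosis. Codes above 128 conventionally mean the process was ended by a signal numbered code minus 128.

```shell
# explain_exit_code CODE
# Maps a container exit code to a likely meaning (rule of thumb only).
explain_exit_code() {
  code=${1:?usage: explain_exit_code CODE}
  case "$code" in
    1)   echo "application error; read the logs from the previous container" ;;
    137) echo "SIGKILL (128+9); often an OOM kill, confirm the termination reason" ;;
    139) echo "SIGSEGV (128+11); segmentation fault in the process" ;;
    143) echo "SIGTERM (128+15); the container was asked to stop" ;;
    *)   echo "no hint recorded for exit code $code" ;;
  esac
}

# last_exit_code NAMESPACE POD
# Prints the exit code of the first container's last terminated instance.
# Only the first container is inspected; adjust the index for multi-container pods.
last_exit_code() {
  kubectl get pod "${2:?pod}" -n "${1:?namespace}" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
}
```

The two can be combined as explain_exit_code "$(last_exit_code shop api-123)".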
Typical fixes:
- Correct the startup command or image entrypoint
- Adjust liveness and readiness probes so slow-starting apps are not killed prematurely
- Mount the expected config and secrets
- Increase memory limits if OOM is confirmed, but only after checking for leaks or oversized startup behavior
2. Pod in Pending state
When a pod stays Pending, the scheduler usually cannot find a node that satisfies its constraints, or a prerequisite such as volume binding has not completed.
Common causes:
- Insufficient CPU or memory on available nodes
- Taints without matching tolerations
- Node selector or affinity rules that are too restrictive
- PersistentVolumeClaim not yet bound
- Image pull secret or admission policy issues in some environments
What to check:
- kubectl describe pod for scheduler messages
- kubectl get nodes and kubectl describe node for allocatable resources and taints
- kubectl get pvc for storage binding state
- Affinity, anti-affinity, topology spread constraints, and tolerations in the pod spec
Typical fixes:
- Reduce requests if they are unrealistically high
- Add capacity or adjust autoscaling if the cluster is resource constrained
- Correct labels, node selectors, and toleration logic
- Provision or troubleshoot storage classes and volume binding
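The scheduling checks for a Pending pod can be gathered in one place. A minimal sketch, assuming a POSIX shell and kubectl; why_pending and list_pending are hypothetical helper names.

```shell
# why_pending NAMESPACE POD
# Collects the signals that most often explain a Pending pod.
why_pending() {
  ns=${1:?usage: why_pending NAMESPACE POD}
  pod=${2:?usage: why_pending NAMESPACE POD}

  # Scheduler decisions are recorded as events on the pod.
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.name=$pod,reason=FailedScheduling"

  # Taint keys per node, to compare against the pod's tolerations.
  kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'

  # Claims in the namespace and whether they are Bound.
  kubectl get pvc -n "$ns"
}

# list_pending
# Lists every Pending pod across namespaces.
list_pending() {
  kubectl get pods -A --field-selector=status.phase=Pending
}
```

If the FailedScheduling events mention insufficient CPU or memory, compare the pod's requests with the Allocatable section of kubectl describe node before adding capacity.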
3. ImagePullBackOff or ErrImagePull
If Kubernetes cannot fetch the container image, the pod never gets to application startup.
Common causes:
- Incorrect image name or tag
- Private registry credentials missing or invalid
- Registry network access blocked
- Image removed, renamed, or not yet pushed by CI/CD
What to check:
- The exact image reference in the manifest
- Pod events for registry authentication or DNS messages
- Image pull secrets and service account configuration
- Whether the artifact was published successfully in the pipeline
Typical fixes:
- Fix the image tag or repository path
- Attach the correct pull secret to the namespace or service account
- Verify that your CI/CD pipeline pushes the image before deployment steps run
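To confirm exactly which image reference and pull secrets a pod is using, something like the following can help. It is a sketch under assumptions: image_refs and image_tag are hypothetical helpers, and image_tag is a simplified parser that only distinguishes an explicit tag, a digest pin, and the implicit latest tag.

```shell
# image_refs NAMESPACE POD
# Prints each container's image and the pull secrets attached to the pod.
image_refs() {
  ns=${1:?usage: image_refs NAMESPACE POD}
  pod=${2:?usage: image_refs NAMESPACE POD}
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.image}{"\n"}{end}'
  printf 'pull secrets: '
  kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.imagePullSecrets[*].name}'
  echo
}

# image_tag IMAGE_REF
# Prints the explicit tag, or notes a digest pin or the implicit "latest".
image_tag() {
  ref=${1:?usage: image_tag IMAGE_REF}
  case "$ref" in
    *@*) echo "pinned by digest"; return 0 ;;
  esac
  last=${ref##*/}   # last path component, so a registry port is not read as a tag
  case "$last" in
    *:*) echo "${last##*:}" ;;
    *)   echo "latest (implicit)" ;;
  esac
}
```

An implicit latest tag is worth flagging on its own, because it makes it hard to tell which build a node actually pulled.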
4. Deployment rollout hangs or fails readiness
A deployment can create pods successfully but still fail to complete if readiness checks never pass or old replicas cannot be replaced.
Common causes:
- Readiness probe path or port mismatch
- Application depends on a service, secret, or migration that is not ready
- Rolling update parameters too aggressive for the workload
- Config drift between environments
What to check:
- kubectl rollout status deployment/<name>
- kubectl describe deployment and ReplicaSet events
- Pod readiness conditions and probe failures
- Whether startup time exceeds probe windows
Typical fixes:
- Separate startup, liveness, and readiness concerns more clearly
- Tune max surge and max unavailable for safer rollouts
- Use staged rollout patterns where appropriate as part of your Kubernetes deployment strategy
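A rollout check can be bounded with a timeout so a hung deployment surfaces as a failure rather than an indefinite wait. The sketch below assumes a POSIX shell and kubectl; check_rollout and its default of 120s are illustrative choices, not recommended values.

```shell
# check_rollout NAMESPACE DEPLOYMENT [TIMEOUT]
# Waits for the rollout; on failure, prints the objects worth inspecting.
check_rollout() {
  ns=${1:?usage: check_rollout NAMESPACE DEPLOYMENT [TIMEOUT]}
  dep=${2:?usage: check_rollout NAMESPACE DEPLOYMENT [TIMEOUT]}
  timeout=${3:-120s}   # assumed default; tune to the workload's startup time

  if kubectl rollout status "deployment/$dep" -n "$ns" --timeout="$timeout"; then
    echo "rollout complete"
  else
    echo "rollout did not complete within $timeout; inspecting"
    kubectl describe deployment "$dep" -n "$ns"
    kubectl get rs -n "$ns"
    kubectl rollout history "deployment/$dep" -n "$ns"
    return 1
  fi
}
```

If the new revision is clearly bad, kubectl rollout undo deployment/<name> returns the Deployment to its previous revision while you investigate.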
5. Service unreachable or traffic not routing
Networking issues often feel random until you check each hop in order: pod, service, endpoints, ingress, and external DNS.
Common causes:
- Service selector does not match pod labels
- Container listens on a different port than the service targetPort
- Ingress backend misconfiguration
- NetworkPolicy blocks traffic
- DNS or external load balancer propagation issues
What to check:
- kubectl get svc,endpoints
- Pod labels versus service selectors
- Actual listening port inside the container
- Ingress rules, TLS settings, and backend service names
- Cluster DNS resolution using a temporary debug pod
Typical fixes:
- Align labels and selectors
- Correct port mappings and backend references
- Review network policies namespace by namespace, not only at the destination workload
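A selector mismatch is quick to confirm by asking for pods with the Service's own selector. This is a sketch, assuming a POSIX shell, kubectl, and sed; service_selector and check_service are hypothetical helper names.

```shell
# service_selector NAMESPACE SERVICE
# Prints the Service selector as key=value,key=value.
service_selector() {
  kubectl get svc "${2:?service}" -n "${1:?namespace}" \
    -o go-template='{{range $k, $v := .spec.selector}}{{$k}}={{$v}},{{end}}' \
    | sed 's/,$//'   # drop the trailing comma left by the template
}

# check_service NAMESPACE SERVICE
# Shows which pods the selector matches and which endpoints exist.
check_service() {
  ns=${1:?usage: check_service NAMESPACE SERVICE}
  svc=${2:?usage: check_service NAMESPACE SERVICE}
  sel=$(service_selector "$ns" "$svc")
  if [ -z "$sel" ]; then
    echo "service $svc has no selector, so endpoints are not populated automatically"
    return 1
  fi
  echo "selector: $sel"
  kubectl get pods -n "$ns" -l "$sel" -o wide
  kubectl get endpoints "$svc" -n "$ns"
}
```

If the pod list is empty, the labels and selector disagree. If pods are listed but the endpoints are empty, check readiness next, since pods that are not ready are left out of the endpoints.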
6. DNS resolution failures inside the cluster
When workloads cannot resolve service names, teams often blame upstream systems before checking cluster DNS.
Common causes:
- CoreDNS issues or overloaded DNS pods
- Wrong namespace-qualified service name
- Network policy blocking DNS traffic
- Application-level resolver caching oddities
What to check:
- Run DNS lookups from a debug pod in the same namespace
- Inspect CoreDNS pod health and logs
- Confirm the exact service FQDN being queried
Typical fixes:
- Restore DNS pod health and capacity
- Permit DNS egress in network policy rules
- Use consistent service naming conventions to avoid namespace mistakes
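The DNS checks above can be run from a throwaway pod. The sketch below makes several assumptions that may not hold in your cluster: the cluster domain is cluster.local, the DNS pods carry the label k8s-app=kube-dns, and the busybox:1.36 image is pullable. The helper names are hypothetical.

```shell
# service_fqdn NAMESPACE SERVICE [CLUSTER_DOMAIN]
# Builds the fully qualified name of a Service.
service_fqdn() {
  echo "${2:?service}.${1:?namespace}.svc.${3:-cluster.local}"
}

# dns_check FROM_NAMESPACE TARGET_NAMESPACE SERVICE
# Resolves the Service name from a temporary pod in FROM_NAMESPACE.
dns_check() {
  from_ns=${1:?usage: dns_check FROM_NAMESPACE TARGET_NAMESPACE SERVICE}
  name=$(service_fqdn "${2:?target namespace}" "${3:?service}")
  kubectl run dns-debug --rm -i --restart=Never -n "$from_ns" \
    --image=busybox:1.36 -- nslookup "$name"
}

# coredns_status
# Shows DNS pod health and recent logs (label is an assumption, see above).
coredns_status() {
  kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
  kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
}
```

Running the lookup from the same namespace as the failing workload matters, because network policies and short service names both depend on where the query originates.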
7. OOMKilled and resource pressure
An OOMKilled event is usually clear in hindsight but can be misdiagnosed as an application bug if you only look at restarts.
Common causes:
- Memory limits set too low
- Unexpected traffic or batch load
- Large startup allocations
- Language runtime tuning mismatch inside containers
What to check:
- Pod termination reason and restart history
- Requests versus limits
- Recent deployment changes, especially cache sizes and concurrency settings
- Node-level pressure events
Typical fixes:
- Right-size requests and limits from observed usage, not guesswork
- Reduce concurrency or memory-heavy startup work
- Track recurring patterns with observability tools so resource tuning becomes proactive rather than reactive
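Before changing limits, it helps to confirm that the last termination really was an OOM kill. A minimal sketch, assuming a POSIX shell and kubectl; oom_report, was_oom_killed, and resource_settings are hypothetical helper names.

```shell
# oom_report NAMESPACE POD
# Prints container name, restart count, and last termination reason per container.
oom_report() {
  kubectl get pod "${2:?pod}" -n "${1:?namespace}" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
}

# was_oom_killed NAMESPACE POD
# Exit status 0 if any container's last termination reason was OOMKilled.
was_oom_killed() {
  oom_report "$1" "$2" | grep -q 'OOMKilled'
}

# resource_settings NAMESPACE POD
# Prints the requests and limits configured for each container.
resource_settings() {
  kubectl get pod "${2:?pod}" -n "${1:?namespace}" \
    -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources}{"\n"}{end}'
}
```

A typical use is was_oom_killed shop api-123 && resource_settings shop api-123, which only prints the limits when the OOM kill is confirmed.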
8. Persistent volume or mount problems
Storage-related failures can block pod startup entirely or cause subtle runtime errors.
Common causes:
- PVC unbound due to storage class issues
- Access mode mismatch
- Mount path conflicts inside the container
- Permissions problems on attached volumes
What to check:
- kubectl get pvc,pv
- Events on the pod and PVC
- Storage class behavior and volume binding mode
- Container user and file system permissions
Typical fixes:
- Use the correct storage class and access mode
- Review init containers or security context settings that affect ownership and permissions
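Unbound claims are easy to miss in a long listing. The sketch below filters for them, assuming a POSIX shell, kubectl, and awk, and relying on the default column order of kubectl get pvc -A (NAMESPACE, NAME, STATUS). The helper names are hypothetical.

```shell
# unbound_pvcs
# Lists claims in any namespace whose STATUS column is not Bound.
unbound_pvcs() {
  kubectl get pvc -A --no-headers \
    | awk '$3 != "Bound" { print $1 "/" $2 " (" $3 ")" }'
}

# pvc_details NAMESPACE PVC
# Shows the claim's events plus the storage classes and their binding modes.
pvc_details() {
  ns=${1:?usage: pvc_details NAMESPACE PVC}
  pvc=${2:?usage: pvc_details NAMESPACE PVC}
  kubectl describe pvc "$pvc" -n "$ns"
  kubectl get storageclass
}
```

A claim that stays Pending under a storage class with the WaitForFirstConsumer binding mode is expected until a pod that uses it is scheduled, so check the binding mode before treating it as a fault.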
9. RBAC and permission-denied errors
If a pod can start but cannot call the Kubernetes API or a cluster-integrated service, RBAC is a common culprit.
Common causes:
- Service account lacks required Role or ClusterRole bindings
- Namespace-scoped permissions assumed to be cluster-wide
- Token or identity changes during platform hardening
What to check:
- The pod's service account
- Bound roles and namespaces
- Audit messages or application logs showing forbidden actions
Typical fixes:
- Grant only the precise verbs and resources needed
- Document workload identity patterns clearly, especially in regulated or multi-team clusters
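Permission questions can be answered directly by impersonating the workload's service account. This is a sketch, assuming a POSIX shell and kubectl, and assuming your own user is allowed to impersonate service accounts; the helper names are hypothetical.

```shell
# sa_subject NAMESPACE SERVICE_ACCOUNT
# Builds the user name Kubernetes assigns to a service account.
sa_subject() {
  echo "system:serviceaccount:${1:?namespace}:${2:?service account}"
}

# pod_service_account NAMESPACE POD
# Prints the service account the pod runs as.
pod_service_account() {
  kubectl get pod "${2:?pod}" -n "${1:?namespace}" \
    -o jsonpath='{.spec.serviceAccountName}'
}

# can_sa NAMESPACE SERVICE_ACCOUNT VERB RESOURCE
# Asks the API server whether the service account may perform VERB on RESOURCE.
can_sa() {
  ns=${1:?usage: can_sa NAMESPACE SERVICE_ACCOUNT VERB RESOURCE}
  sa=${2:?usage: can_sa NAMESPACE SERVICE_ACCOUNT VERB RESOURCE}
  verb=${3:?usage: can_sa NAMESPACE SERVICE_ACCOUNT VERB RESOURCE}
  resource=${4:?usage: can_sa NAMESPACE SERVICE_ACCOUNT VERB RESOURCE}
  kubectl auth can-i "$verb" "$resource" -n "$ns" --as="$(sa_subject "$ns" "$sa")"
}
```

For example, can_sa shop api list pods prints yes or no for that one verb and resource, which makes it straightforward to verify a least-privilege binding after a change.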
10. Node not ready, evictions, or widespread instability
When multiple workloads degrade at once, the issue may be node or cluster level rather than app specific.
Common causes:
- Disk pressure, memory pressure, or network disruption on nodes
- Container runtime problems
- Misbehaving daemonsets
- Cluster autoscaler lag or faulty node images
What to check:
- kubectl get nodes and node conditions
- Recent eviction events across namespaces
- System component logs where available
- Whether a rollout coincided with node saturation
Typical fixes:
- Relieve the pressure first, then cordon and drain unstable nodes when needed
- Review daemonset resource usage and node image changes
- Treat recurring node issues as platform engineering work, not one-off firefighting
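Node-level checks follow the same pattern: find the unhealthy nodes, read their conditions, then take them out of rotation if needed. A minimal sketch, assuming a POSIX shell, kubectl, and awk; the helper names are hypothetical, and retire_node evicts workloads, so treat it as something to run deliberately.

```shell
# not_ready_nodes
# Lists nodes whose STATUS column does not start with Ready.
# Cordoned but healthy nodes (Ready,SchedulingDisabled) are not reported.
not_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 !~ /^Ready/ { print $1 }'
}

# node_conditions NODE
# Prints each condition as Type=Status, e.g. pressure conditions and Ready.
node_conditions() {
  kubectl get node "${1:?usage: node_conditions NODE}" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{"\n"}{end}'
}

# retire_node NODE
# Stops new scheduling on the node, then evicts its pods.
retire_node() {
  node=${1:?usage: retire_node NODE}
  kubectl cordon "$node" &&
    kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
}
```

Draining with --delete-emptydir-data discards emptyDir contents on that node, so confirm that nothing important lives there before running it.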
Related subtopics
Good Kubernetes debugging connects symptoms to surrounding operational practices. If you only fix the immediate pod failure, the same class of incident often returns.
Observability and incident handling. Logs and events are not enough for every outage. Metrics, traces, and dependency maps can help confirm whether a restart loop is the cause of the outage or just a secondary effect. If your team needs a structured way to move from detection to coordination, see Incident Response Runbook Checklist for DevOps and SRE Teams.
Software delivery quality. Many cluster issues originate before workloads reach Kubernetes: bad image tags, environment mismatches, fragile rollout logic, or missing deployment checks. To connect troubleshooting effort back to engineering performance, it helps to review deployment-oriented metrics over time. See DORA Metrics Benchmarks: What Good Looks Like for Elite, High, and Medium Performing Teams.
CI/CD pipeline design. Repeated image pull errors, broken manifests, or rollout failures are often delivery pipeline issues as much as infrastructure issues. If your current tooling makes release validation difficult, compare your options in GitHub Actions vs GitLab CI vs Jenkins: Which CI/CD Tool Fits Your Team in 2026? and Best Jenkins Alternatives for Modern CI/CD Teams.
Identity and access. Permission failures in cluster-integrated workloads become more common as organizations standardize service accounts, workload identity, and least-privilege access. For teams thinking beyond human users and API keys, Workload Identity for AI Agents: Separating Who from What They Can Do in Multi‑Protocol Systems is a useful adjacent read.
Domain-specific operations. Troubleshooting patterns also change when reliability, data handling, or integration complexity is unusually high. Teams working on regulated or data-heavy systems may benefit from environment-specific delivery guidance such as Operationalizing Payer Interoperability: DevOps Patterns for Healthcare Integrations, Designing Robust Payer‑to‑Payer APIs: Member Identity, Consent and Reliable Exchanges, and CI/CD for Spatial Apps: Testing, Dataset Versioning and Reproducible Deployments.
How to use this hub
The fastest way to improve Kubernetes troubleshooting is to make it systematic. Use this hub as a lightweight triage workflow rather than a reference you scan only after an outage grows.
- Start with the symptom, not the theory. If a pod is pending, do not begin with application logs. If a service is unreachable, do not assume ingress first. Match the symptom to the section above.
- Check events before changing manifests. Kubernetes often tells you what it cannot do: schedule a pod, mount a volume, pull an image, or route traffic. Events reduce guesswork.
- Inspect the surrounding object chain. For a pod issue, also inspect the Deployment or StatefulSet. For networking, inspect Service, Endpoints, and Ingress together. For storage, include PVC and storage class state.
- Look for recent change boundaries. New image, new node pool, changed secret, tightened network policy, altered resource limits, or CI/CD edits can sharply narrow likely causes.
- Confirm a fix with the same symptom path. If readiness was failing, verify readiness transitions and rollout completion. If DNS was broken, test name resolution from the same namespace and workload context.
- Write down the pattern. Every repeated issue should become a short runbook entry: symptom, commands, likely causes, and verified fix. That is how a debugging culture becomes platform knowledge.
A few habits make this easier over time:
- Standardize labels, namespaces, and service naming conventions
- Use consistent probe design across services
- Keep resource requests realistic and review them periodically
- Make pipeline artifacts and deployment metadata easy to trace back to source commits
- Add temporary debug tooling thoughtfully rather than shipping bloated production images
If you support many teams, package these checks into templates, dashboards, or internal docs. That is a practical form of platform engineering: reducing repeated cognitive load for engineers who need to solve incidents quickly.
When to revisit
Come back to this guide whenever your Kubernetes environment changes in ways that create new failure modes. Troubleshooting advice stays useful longest when you update it with the patterns your own systems actually produce.
Revisit this hub when:
- You adopt a new ingress controller, service mesh, CNI, or storage backend
- Your team introduces stricter RBAC, workload identity, or network policy controls
- You move from a few services to many teams and namespaces
- You add autoscaling, spot capacity, or new node pool classes
- Your CI/CD workflow changes image publication, promotion, or deployment logic
- You see the same incident category appear more than twice in a quarter
Turn this into an action plan:
- Create a symptom index in your internal docs that mirrors the sections in this article.
- For each major incident, record the first visible symptom, the confirming signal, and the final fix.
- Convert high-frequency failures into guardrails: admission checks, manifest linting, policy defaults, safer rollout settings, or reusable templates.
- Review troubleshooting trends alongside delivery metrics so repeated platform problems are visible to engineering leadership.
- Schedule a quarterly cleanup of stale runbooks, deprecated commands, and outdated assumptions about your cluster setup.
Kubernetes troubleshooting gets easier when your team stops treating every incident as unique. The cluster will always produce new edge cases, but the path to diagnosis is often repeatable. Build around symptoms, standardize the first checks, and keep adding your own patterns to the map. That is what makes a troubleshooting guide worth revisiting.