Kubernetes Troubleshooting Guide: Errors and Fixes

A practical Kubernetes troubleshooting hub organized by symptoms, common causes, and repeatable fixes for pods, networking, storage, and nodes.

Kubernetes failures rarely arrive as neat root causes. They show up as symptoms: a pod stuck in CrashLoopBackOff, a deployment that never becomes ready, a service that works inside the cluster but not from outside, or a node that suddenly starts evicting workloads. This guide is built as a practical Kubernetes troubleshooting hub you can return to whenever something breaks. Instead of treating debugging as a set of isolated commands, it organizes the work by symptom, explains the most common causes behind each failure mode, and gives you a repeatable path to isolate, confirm, and fix issues with less guesswork.

Overview

The goal of Kubernetes troubleshooting is not to memorize every possible error message. It is to narrow the search space quickly. Most production issues can be traced to a small set of layers: application startup, container image and runtime, scheduling, configuration, storage, networking, permissions, or cluster resource pressure.

A useful mental model is to troubleshoot from the outside in:

What is the visible symptom? A failed rollout, a restarting pod, a timeout, an unschedulable workload, a DNS issue, or degraded node health.
What changed recently? New image, new config, new secret, node pool change, autoscaler event, network policy, ingress rule, or dependency outage.
Which Kubernetes object first shows the failure? Pod, ReplicaSet, Deployment, StatefulSet, Job, Service, Ingress, PersistentVolumeClaim, or Node.
What does the control plane say? Events, conditions, readiness status, scheduling messages, and restart history often tell you where to look next.

For day-to-day kubectl troubleshooting, a small command set covers a surprising share of the work:

kubectl get pods -A to spot broad symptoms quickly
kubectl describe pod <name> to inspect events, conditions, mounts, probes, and restart causes
kubectl logs <pod> --previous to capture logs from the last crashed container
kubectl get events -A --sort-by=.lastTimestamp to identify recent cluster-level signals
kubectl get deploy,rs,svc,ing,pvc -n <namespace> to understand surrounding objects
kubectl exec -it <pod> -- sh for in-container checks when the container stays up long enough
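
The commands above can be bundled into one first-pass helper. This is a minimal sketch rather than a standard tool: the function name first_look and its argument order (namespace, then pod) are assumptions made for illustration, and it uses only the commands listed above plus the -n namespace flag.

```shell
# first_look NAMESPACE POD: run the first-pass checks for a single pod.
# The function name and argument order are illustrative assumptions.
first_look() {
  ns="$1"
  pod="$2"
  # Events, conditions, mounts, probes, and restart causes for the pod.
  kubectl describe pod "$pod" -n "$ns"
  # Logs from the previous container instance. This call fails when the
  # container has never restarted, so the error is deliberately ignored.
  kubectl logs "$pod" -n "$ns" --previous || true
  # Events in the namespace, sorted so the most recent appear last.
  kubectl get events -n "$ns" --sort-by=.lastTimestamp
  # Surrounding objects that often explain the pod's state.
  kubectl get deploy,rs,svc,ing,pvc -n "$ns"
}
```

With the function sourced into your shell, first_look payments api-7c9d would print all four views for a hypothetical pod api-7c9d in a hypothetical namespace payments.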

If your team runs a delivery platform or internal developer platform, it helps to formalize these checks into standard runbooks. That reduces panic-driven debugging and improves developer experience, especially when engineers outside the platform team need to diagnose their own workloads.

Topic map

Use this section as your symptom-based index. Start with the problem you can observe, then work downward until you find a likely cause.

1. Pod stuck in CrashLoopBackOff

This is one of the most common Kubernetes errors. The pod starts, fails, restarts, and enters an exponential backoff cycle.

Common causes:

Application process exits immediately because the command or entrypoint is wrong
Missing environment variables, secrets, or config files
Dependency failures during startup, such as database connection errors
Liveness probe kills the container before startup completes
Port binding issues or startup scripts that assume a non-container environment
Out-of-memory termination leading to repeated restarts

What to check:

kubectl describe pod for restart counts, termination reason, and event messages
kubectl logs and kubectl logs --previous for startup failures
Probe settings: initial delay, timeout, path, and port
Container command, args, image tag, and environment injection
Exit code patterns such as 1 (generic application error), 137 (killed with SIGKILL, often an OOM kill), or 139 (segmentation fault)
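
Exit codes are easier to act on once they are translated. The sketch below relies on the common convention that a code above 128 means the process was terminated by signal number (code - 128). The function name explain_exit_code and the hint wording are assumptions, and the hints are starting points, not diagnoses.

```shell
# explain_exit_code CODE: print a hint for common container exit codes.
# Codes above 128 conventionally mean "terminated by signal (CODE - 128)".
explain_exit_code() {
  code="$1"
  case "$code" in
    0)   echo "exited cleanly; check why the process is not long-running" ;;
    1)   echo "generic application error; read the container logs" ;;
    137) echo "SIGKILL (128+9); commonly an OOM kill, confirm the termination reason" ;;
    139) echo "SIGSEGV (128+11); segmentation fault in the process" ;;
    143) echo "SIGTERM (128+15); the container was asked to stop" ;;
    *)
      # Non-numeric input makes the test fail and falls through to else.
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "terminated by signal $((code - 128))"
      else
        echo "application-defined exit code $code"
      fi
      ;;
  esac
}
```

The last exit code of a pod's first container can be read with kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}' and passed to the function.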

Typical fixes:

Correct the startup command or image entrypoint
Relax liveness probe timing, or add a startup probe, so slow-starting apps are not killed prematurely; readiness probes only gate traffic and do not restart containers
Mount the expected config and secrets
Increase memory limits if OOM is confirmed, but only after checking for leaks or oversized startup behavior

2. Pod in Pending state

When a pod stays Pending, Kubernetes usually either cannot schedule it onto a node or, once it is scheduled, cannot finish the setup its containers need before they start, such as pulling images or attaching volumes.

Common causes:

Insufficient CPU or memory on available nodes
Taints without matching tolerations
Node selector or affinity rules that are too restrictive
PersistentVolumeClaim not yet bound
Image pull secret or admission policy issues in some environments

What to check:

kubectl describe pod for scheduler messages
kubectl get nodes and kubectl describe node for allocatable resources and taints
kubectl get pvc for storage binding state
Affinity, anti-affinity, topology spread constraints, and tolerations in the pod spec

Typical fixes:

Reduce requests if they are unrealistically high
Add capacity or adjust autoscaling if the cluster is resource constrained
Correct labels, node selectors, and toleration logic
Provision or troubleshoot storage classes and volume binding
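
The checks for a Pending pod can be gathered in one pass. This is a sketch under stated assumptions: the function name pending_report is hypothetical, and the taint column prints only taint keys, so you still need to compare values and effects against the pod's tolerations by hand.

```shell
# pending_report NAMESPACE POD: gather the usual scheduling evidence.
# The function name is an illustrative assumption.
pending_report() {
  ns="$1"
  pod="$2"
  # All pods currently in the Pending phase, cluster-wide.
  kubectl get pods -A --field-selector=status.phase=Pending
  # Scheduler messages recorded against this pod.
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.name=$pod,reason=FailedScheduling"
  # Taint keys per node, to compare against the pod's tolerations.
  kubectl get nodes \
    -o 'custom-columns=NAME:.metadata.name,TAINTS:.spec.taints[*].key'
  # Binding state of the claims in the pod's namespace.
  kubectl get pvc -n "$ns"
}
```

If the FailedScheduling events are empty but the pod is still Pending, the pod has probably been scheduled already, and kubectl describe pod will show whether image pulls or volume mounts are holding it up.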

3. ImagePullBackOff or ErrImagePull

If Kubernetes cannot fetch the container image, the pod never gets to application startup.

Common causes:

Incorrect image name or tag
Private registry credentials missing or invalid
Registry network access blocked
Image removed, renamed, or not yet pushed by CI/CD

What to check:

The exact image reference in the manifest
Pod events for registry authentication or DNS messages
Image pull secrets and service account configuration
Whether the artifact was published successfully in the pipeline

Typical fixes:

Fix the image tag or repository path
Attach the correct pull secret to the namespace or service account
Verify that your CI/CD pipeline pushes the image before deployment steps run
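
To see what the kubelet is actually trying to pull and with which credentials, you can read the relevant fields straight from the API. The following is a minimal sketch: the function name image_pull_context is an assumption, and it only reports configuration, so pod events remain the place to read the registry's actual error.

```shell
# image_pull_context NAMESPACE POD: show the image references and the pull
# secrets in play. The function name is an illustrative assumption.
image_pull_context() {
  ns="$1"
  pod="$2"
  # Exact image references from the pod spec.
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.spec.containers[*].image}{"\n"}'
  # Pull secrets set on the pod itself.
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.spec.imagePullSecrets[*].name}{"\n"}'
  # Pull secrets on the pod's service account, which new pods pick up
  # at creation time when they define none of their own.
  sa="$(kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.spec.serviceAccountName}')"
  kubectl get serviceaccount "$sa" -n "$ns" \
    -o jsonpath='{.imagePullSecrets[*].name}{"\n"}'
}
```

An empty line in the second or third position means no pull secret is configured at that level, which is expected for public images and a likely cause for private ones.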

4. Deployment rollout hangs or fails readiness

A Deployment can create pods successfully and the rollout can still stall if readiness checks never pass or old replicas cannot be replaced.

Common causes:

Readiness probe path or port mismatch
Application depends on a service, secret, or migration that is not ready
Rolling update parameters too aggressive for the workload
Config drift between environments

What to check:

kubectl rollout status deployment/<name>
kubectl describe deployment and ReplicaSet events
Pod readiness conditions and probe failures
Whether startup time exceeds probe windows

Typical fixes:

Separate startup, liveness, and readiness concerns more clearly
Tune max surge and max unavailable for safer rollouts
Use staged rollout patterns where appropriate as part of your Kubernetes deployment strategy
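
A rollout check is more useful when it has a deadline and collects evidence on failure. This sketch assumes a hypothetical function name watch_rollout and a default timeout of 120s chosen only for illustration; pick a value that matches your application's real startup time.

```shell
# watch_rollout NAMESPACE DEPLOYMENT [TIMEOUT]: wait for a rollout and, if
# it does not finish in time, print what is needed to diagnose it.
# The function name and the 120s default are illustrative assumptions.
watch_rollout() {
  ns="$1"
  deploy="$2"
  timeout="${3:-120s}"
  if kubectl rollout status "deployment/$deploy" -n "$ns" --timeout="$timeout"; then
    echo "rollout complete"
    return 0
  fi
  # Deployment conditions and ReplicaSet scaling events.
  kubectl describe deployment "$deploy" -n "$ns"
  # Pods in the namespace; the READY column shows ready/total containers.
  kubectl get pods -n "$ns" -o wide
  return 1
}
```

If the evidence points at the new revision itself, kubectl rollout undo deployment/<name> returns the Deployment to its previous revision while you investigate.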

5. Service unreachable or traffic not routing

Networking issues often feel random until you check each hop in order: pod, service, endpoints, ingress, and external DNS.

Common causes:

Service selector does not match pod labels
Container listens on a different port than the service targetPort
Ingress backend misconfiguration
NetworkPolicy blocks traffic
DNS or external load balancer propagation issues

What to check:

kubectl get svc,endpoints
Pod labels versus service selectors
Actual listening port inside the container
Ingress rules, TLS settings, and backend service names
Cluster DNS resolution using a temporary debug pod

Typical fixes:

Align labels and selectors
Correct port mappings and backend references
Review network policies namespace by namespace, not only at the destination workload
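
Checking each hop in order is easier when the Service's selector, ports, and endpoints are printed side by side. This is a sketch under assumptions: the function name service_path is hypothetical, and the final label comparison is left to the reader rather than automated.

```shell
# service_path NAMESPACE SERVICE: print each hop behind a Service.
# The function name is an illustrative assumption.
service_path() {
  ns="$1"
  svc="$2"
  # Selector the Service uses to find pods (printed as a map).
  kubectl get svc "$svc" -n "$ns" -o jsonpath='{.spec.selector}{"\n"}'
  # Port and targetPort pairs the Service forwards.
  kubectl get svc "$svc" -n "$ns" \
    -o jsonpath='{range .spec.ports[*]}{.port}{" -> "}{.targetPort}{"\n"}{end}'
  # Pod addresses currently backing the Service. An empty list usually
  # means the selector matches no ready pods.
  kubectl get endpoints "$svc" -n "$ns"
  # Labels on the pods, to compare against the selector by eye.
  kubectl get pods -n "$ns" --show-labels
}
```

If the endpoints list is populated and traffic still fails, the problem is more likely in the port mapping, a NetworkPolicy, or a hop outside the cluster than in the selector.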

6. DNS resolution failures inside the cluster

When workloads cannot resolve service names, upstream systems are often blamed too early.

Common causes:

CoreDNS issues or overloaded DNS pods
Wrong namespace-qualified service name
Network policy blocking DNS traffic
Application-level resolver caching oddities

What to check:

Run DNS lookups from a debug pod in the same namespace
Inspect CoreDNS pod health and logs
Confirm the exact service FQDN being queried

Typical fixes:

Restore DNS pod health and capacity
Permit DNS egress in network policy rules
Use consistent service naming conventions to avoid namespace mistakes
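
Testing resolution from the failing workload's own namespace removes a lot of guesswork. The sketch below makes several assumptions: the function names are hypothetical, cluster.local is only the common default cluster domain, and busybox:1.36 is just one image that ships nslookup.

```shell
# service_fqdn SERVICE NAMESPACE [CLUSTER_DOMAIN]: build the full DNS name.
# cluster.local is the common default domain; your cluster may differ.
service_fqdn() {
  echo "$1.$2.svc.${3:-cluster.local}"
}

# dns_check SERVICE NAMESPACE FROM_NAMESPACE: resolve the name from a
# throwaway pod in the namespace where the failing workload runs.
# The pod name dns-debug and the busybox image are illustrative choices.
dns_check() {
  kubectl run dns-debug --rm -i --restart=Never \
    --image=busybox:1.36 -n "$3" -- nslookup "$(service_fqdn "$1" "$2")"
}
```

In many clusters the DNS pods carry the label k8s-app=kube-dns in the kube-system namespace, so kubectl get pods -n kube-system -l k8s-app=kube-dns is a reasonable first look at their health; verify the label in your own distribution.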

7. OOMKilled and resource pressure

An OOMKilled event is usually clear in hindsight but can be misdiagnosed as an application bug if you only look at restarts.

Common causes:

Memory limits set too low
Unexpected traffic or batch load
Large startup allocations
Language runtime tuning mismatch inside containers

What to check:

Pod termination reason and restart history
Requests versus limits
Recent deployment changes, especially cache sizes and concurrency settings
Node-level pressure events

Typical fixes:

Right-size requests and limits from observed usage, not guesswork
Reduce concurrency or memory-heavy startup work
Track recurring patterns with observability tools so resource tuning becomes proactive rather than reactive
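
Confirming an OOM kill takes three facts: the termination reason, the configured limits, and current usage. This is a minimal sketch with a hypothetical function name, oom_check; the last command assumes metrics-server is installed, which is why its failure is ignored.

```shell
# oom_check NAMESPACE POD: confirm whether restarts are memory related.
# The function name is an illustrative assumption.
oom_check() {
  ns="$1"
  pod="$2"
  # Last termination reason and exit code per container,
  # printed as "<container> <reason> <exit code>".
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{" "}{.lastState.terminated.reason}{" "}{.lastState.terminated.exitCode}{"\n"}{end}'
  # Configured requests and limits per container.
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{range .spec.containers[*]}{.name}{" "}{.resources}{"\n"}{end}'
  # Live usage per container; needs metrics-server, so errors are ignored.
  kubectl top pod "$pod" -n "$ns" --containers || true
}
```

A reason of OOMKilled with usage close to the limit points at sizing; the same reason with low steady-state usage suggests a spike at startup or under load.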

8. Persistent volume or mount problems

Storage-related failures can block pod startup entirely or cause subtle runtime errors.

Common causes:

PVC unbound due to storage class issues
Access mode mismatch
Mount path conflicts inside the container
Permissions problems on attached volumes

What to check:

kubectl get pvc,pv
Events on the pod and PVC
Storage class behavior and volume binding mode
Container user and file system permissions

Typical fixes:

Use the correct storage class and access mode
Review init containers or security context settings that affect ownership and permissions
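
The storage checks above fit in one short helper. This is a sketch, not a complete diagnosis: the function name storage_report is an assumption, and it reports state without interpreting provisioner-specific errors.

```shell
# storage_report NAMESPACE PVC: show claim state and storage class behavior.
# The function name is an illustrative assumption.
storage_report() {
  ns="$1"
  pvc="$2"
  # Status, bound volume, capacity, access modes, and storage class.
  kubectl get pvc "$pvc" -n "$ns" -o wide
  # Events on the claim, including provisioning failures.
  kubectl describe pvc "$pvc" -n "$ns"
  # Provisioner and binding mode for every storage class.
  kubectl get storageclass \
    -o 'custom-columns=NAME:.metadata.name,PROVISIONER:.provisioner,BINDING:.volumeBindingMode'
}
```

A binding mode of WaitForFirstConsumer means the claim stays Pending until a pod that uses it is scheduled, so an unbound claim under that mode is not a failure by itself.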

9. RBAC and permission-denied errors

If a pod can start but cannot call the Kubernetes API or a cluster-integrated service, RBAC is a common culprit.

Common causes:

Service account lacks required Role or ClusterRole bindings
Namespace-scoped permissions assumed to be cluster-wide
Token or identity changes during platform hardening

What to check:

The pod's service account
Bound roles and namespaces
Audit messages or application logs showing forbidden actions

Typical fixes:

Grant only the precise verbs and resources needed
Document workload identity patterns clearly, especially in regulated or multi-team clusters
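
You can ask the API server directly whether a service account is allowed to do something, instead of inferring it from bindings. The sketch below assumes hypothetical function names, and it only works if your own user is permitted to impersonate service accounts.

```shell
# sa_subject NAMESPACE SERVICEACCOUNT: the username form that RBAC uses
# for a service account.
sa_subject() {
  echo "system:serviceaccount:$1:$2"
}

# can_sa NAMESPACE SERVICEACCOUNT VERB RESOURCE: ask the API server whether
# the service account may perform VERB on RESOURCE in NAMESPACE.
# Both function names are illustrative assumptions.
can_sa() {
  kubectl auth can-i "$3" "$4" -n "$1" --as="$(sa_subject "$1" "$2")"
}
```

For example, can_sa payments api list pods would check whether a hypothetical service account api in a hypothetical namespace payments may list pods there. Running the same check before and after a binding change confirms the fix.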

10. Node not ready, evictions, or widespread instability

When multiple workloads degrade at once, the issue may be node or cluster level rather than app specific.

Common causes:

Disk pressure, memory pressure, or network disruption on nodes
Container runtime problems
Misbehaving daemonsets
Cluster autoscaler lag or faulty node images

What to check:

kubectl get nodes and node conditions
Recent eviction events across namespaces
System component logs where available
Whether a rollout coincided with node saturation

Typical fixes:

Relieve the pressure first, then cordon and drain unstable nodes when needed
Review daemonset resource usage and node image changes
Treat recurring node issues as platform engineering work, not one-off firefighting
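
Node-level triage usually separates looking from acting. This sketch keeps the two apart in hypothetical functions, node_report and isolate_node. The drain flags shown are common choices, not requirements, and --delete-emptydir-data discards emptyDir contents on the drained node, so review it before use.

```shell
# node_report NODE: read-only view of node health and recent evictions.
node_report() {
  node="$1"
  # Status, roles, versions, and addresses for all nodes.
  kubectl get nodes -o wide
  # Conditions such as Ready, MemoryPressure, and DiskPressure, plus
  # taints and allocated resources for the suspect node.
  kubectl describe node "$node"
  # Recent eviction events across namespaces.
  kubectl get events -A --field-selector reason=Evicted \
    --sort-by=.lastTimestamp
}

# isolate_node NODE: stop new scheduling, then move workloads away.
# Both function names are illustrative assumptions.
isolate_node() {
  node="$1"
  kubectl cordon "$node"
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
}
```

Running node_report first and isolate_node only after the evidence supports it avoids draining a node that was merely a bystander to an application problem.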

Good Kubernetes debugging connects symptoms to surrounding operational practices. If you only fix the immediate pod failure, the same class of incident often returns.

Observability and incident handling. Logs and events are not enough for every outage. Metrics, traces, and dependency maps can help confirm whether a restart loop is the cause of the outage or just a secondary effect. If your team needs a structured way to move from detection to coordination, see Incident Response Runbook Checklist for DevOps and SRE Teams.

Software delivery quality. Many cluster issues originate before workloads reach Kubernetes: bad image tags, environment mismatches, fragile rollout logic, or missing deployment checks. To connect troubleshooting effort back to engineering performance, it helps to review deployment-oriented metrics over time. See DORA Metrics Benchmarks: What Good Looks Like for Elite, High, and Medium Performing Teams.

CI/CD pipeline design. Repeated image pull errors, broken manifests, or rollout failures are often delivery pipeline issues as much as infrastructure issues. If your current tooling makes release validation difficult, compare your options in GitHub Actions vs GitLab CI vs Jenkins: Which CI/CD Tool Fits Your Team in 2026? and Best Jenkins Alternatives for Modern CI/CD Teams.

Identity and access. Permission failures in cluster-integrated workloads become more common as organizations standardize service accounts, workload identity, and least-privilege access. For teams thinking beyond human users and API keys, Workload Identity for AI Agents: Separating Who from What They Can Do in Multi‑Protocol Systems is a useful adjacent read.

Domain-specific operations. Troubleshooting patterns also change when reliability, data handling, or integration complexity is unusually high. Teams working on regulated or data-heavy systems may benefit from environment-specific delivery guidance such as Operationalizing Payer Interoperability: DevOps Patterns for Healthcare Integrations, Designing Robust Payer‑to‑Payer APIs: Member Identity, Consent and Reliable Exchanges, and CI/CD for Spatial Apps: Testing, Dataset Versioning and Reproducible Deployments.

How to use this hub

The fastest way to improve Kubernetes troubleshooting is to make it systematic. Use this hub as a lightweight triage workflow rather than a reference you scan only after an outage grows.

Start with the symptom, not the theory. If a pod is pending, do not begin with application logs. If a service is unreachable, do not assume ingress first. Match the symptom to the section above.
Check events before changing manifests. Kubernetes often tells you what it cannot do: schedule a pod, mount a volume, pull an image, or route traffic. Events reduce guesswork.
Inspect the surrounding object chain. For a pod issue, also inspect the Deployment or StatefulSet. For networking, inspect Service, Endpoints, and Ingress together. For storage, include PVC and storage class state.
Look for recent change boundaries. New image, new node pool, changed secret, tightened network policy, altered resource limits, or CI/CD edits can sharply narrow likely causes.
Confirm a fix with the same symptom path. If readiness was failing, verify readiness transitions and rollout completion. If DNS was broken, test name resolution from the same namespace and workload context.
Write down the pattern. Every repeated issue should become a short runbook entry: symptom, commands, likely causes, and verified fix. That is how a debugging culture becomes platform knowledge.

A few habits make this easier over time:

Standardize labels, namespaces, and service naming conventions
Use consistent probe design across services
Keep resource requests realistic and review them periodically
Make pipeline artifacts and deployment metadata easy to trace back to source commits
Add temporary debug tooling thoughtfully rather than shipping bloated production images

If you support many teams, package these checks into templates, dashboards, or internal docs. That is a practical form of platform engineering: reducing repeated cognitive load for engineers who need to solve incidents quickly.

When to revisit

Come back to this guide whenever your Kubernetes environment changes in ways that create new failure modes. Troubleshooting advice stays useful longest when you update it with the patterns your own systems actually produce.

Revisit this hub when:

You adopt a new ingress controller, service mesh, CNI, or storage backend
Your team introduces stricter RBAC, workload identity, or network policy controls
You move from a few services to many teams and namespaces
You add autoscaling, spot capacity, or new node pool classes
Your CI/CD workflow changes image publication, promotion, or deployment logic
You see the same incident category appear more than twice in a quarter

Turn this into an action plan:

Create a symptom index in your internal docs that mirrors the sections in this article.
For each major incident, record the first visible symptom, the confirming signal, and the final fix.
Convert high-frequency failures into guardrails: admission checks, manifest linting, policy defaults, safer rollout settings, or reusable templates.
Review troubleshooting trends alongside delivery metrics so repeated platform problems are visible to engineering leadership.
Schedule a quarterly cleanup of stale runbooks, deprecated commands, and outdated assumptions about your cluster setup.

Kubernetes troubleshooting gets easier when your team stops treating every incident as unique. The cluster will always produce new edge cases, but the path to diagnosis is often repeatable. Build around symptoms, standardize the first checks, and keep adding your own patterns to the map. That is what makes a troubleshooting guide worth revisiting.
