The Reality of Production Incidents
It is 3 AM. Your pager rings. You type kubectl get pods. The screen shows your pods are Running. Yet your app is still dropping user traffic. Looking at app logs will not help much here. In large incidents, logs often stop at the container level. A Running status only means the main process is alive. It does not mean the entire system is working as expected. Skilled engineers do more than react to symptoms. They identify the root cause by examining the system layer by layer.
When you check these layers in order, you can resolve incidents much faster:
| Layer | What to Inspect |
|---|---|
| Container & Process | Exit codes, memory limits, JVM off-heap usage, and hidden OOM kills. |
| Runtime & Probes | CPU throttling, dead processes, and bad health checks. |
| Node & Eviction | Node memory and disk pressure, PodDisruptionBudgets, and eviction-driven crash loops. |
| Storage Layer | Volume locks, stuck storage, and ConfigMap bugs. |
| Network & Routing | DNS delays, full conntrack tables, and stale endpoint IPs. |
This guide shows you the exact commands and simple explanations to fix these problems.
Container & Process Level: Decoding Exit Codes and Memory
When a container dies, Linux saves an exit code. Read these codes carefully. They tell you exactly what the system did.
Exit Code 137: The OOMKilled Reality
Exit Code 137 means the kernel's OOM killer terminated your container because it exceeded its memory limit. Do not just add more memory. Find out why it happened first. The kernel picks its victim by an OOM score, which Kubernetes biases through oom_score_adj (you can read the value directly on the node, as shown after the list below).
- Guaranteed pods: The score is -997. They are very hard to kill.
- BestEffort pods: The score is 1000. They get killed first.
- Burstable pods: The score depends on memory requests. Warning: smaller memory requests actually give your pod a higher chance to be killed.
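To see how exposed a specific container is, you can read the adjustment directly on the node. A minimal sketch, assuming containerd with crictl installed on the node; the container ID and PID are whatever crictl reports in your environment:
# Find the container's main PID, then read the OOM score adjustment the kubelet assigned
crictl inspect <container-id> | grep -i '"pid"'
cat /proc/<pid>/oom_score_adj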
Before Kubernetes 1.28, if a child process used too much memory, only that child died and Kubernetes did not notice. Since Kubernetes 1.28 on cgroup v2 nodes, the kubelet enables the cgroup's group-kill behavior, so the whole container is killed if any process inside it breaches the limit.
If you want the old behavior back, Kubernetes 1.32 added a kubelet setting called singleProcessOOMKill. Set it to true to kill only the offending process, not the whole container. Kubernetes 1.36 also added Tiered Memory Protection: Guaranteed pods get strict protection using memory.min, and Burstable pods get soft protection using memory.low.
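A minimal sketch of opting a node back into per-process OOM kills, assuming the field name above and the default kubelet config path of /var/lib/kubelet/config.yaml; verify both against your distribution and version before rolling this out:
# Append the setting to the kubelet config and restart the kubelet (node-level change)
echo 'singleProcessOOMKill: true' | sudo tee -a /var/lib/kubelet/config.yaml
sudo systemctl restart kubelet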
# Get the exact termination reason from the Kubernetes API
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
Exit Code 139: Segmentation Faults
Exit Code 139 means the app tried to use memory it does not own. This is usually a bug in C/C++ code. You need to check the core dump file to fix it.
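Where the dump lands depends on the node's core_pattern; the sketch below assumes systemd-coredump, which many distributions use by default:
# See where the kernel writes core dumps on the node
cat /proc/sys/kernel/core_pattern
# With systemd-coredump: list recent crashes and export one for analysis
coredumpctl list
coredumpctl dump <pid> -o app.core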
JVM Memory Misalignment and cgroup v2
Java apps often get killed even when the heap is only half full. This happens because the JVM uses memory outside the heap for metaspace, thread stacks, GC structures, and direct buffers. If you set a static -Xmx close to the container limit, that off-heap memory pushes the total over the limit and the system kills the pod.
Fix: Do not use the static -Xmx flag. Use -XX:MaxRAMPercentage=75.0 instead. This tells Java to size the heap from the container limit and leave roughly 25% for off-heap memory.
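A minimal sketch, assuming an OpenJDK-based image and an illustrative deployment name: JAVA_TOOL_OPTIONS is read by the JVM at startup, so no image rebuild is needed.
# Size the heap from the cgroup limit instead of a hard-coded -Xmx
kubectl set env deployment/<app-name> JAVA_TOOL_OPTIONS="-XX:MaxRAMPercentage=75.0"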
Runtime Limits & Probes: When the Pod Lies About Its Health
A running pod can still be broken. Bad health checks can restart a perfectly healthy app.
CPU Throttling-Induced Probe Timeouts
If a health check times out, your app might not be frozen. It might just be throttled. The kernel's CFS scheduler gives your container a fixed CPU quota per scheduling period (100 ms by default). If your app burns through the quota early, the kernel pauses it until the next period. If Kubernetes runs a health check during one of those pauses, the check fails and Kubernetes restarts a perfectly healthy app.
To check for this:
# Check cgroup v2 CPU throttle stats directly on the node
cat /sys/fs/cgroup/cpu.stat | grep throttled
# Or check per-container stats via containerd
crictl stats <container-id>
The Cascading Failure of Dependency-Checking Probes
Never check your database from a liveness probe. If the database goes offline for 5 seconds, all your pod liveness probes will fail at the same time. Kubernetes will kill and restart all your pods at once. When the database comes back, hundreds of pods will hit it at the exact same time. The database will crash again.
Rule: Liveness probes should only check if the app itself is stuck.
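A minimal sketch of the split, with illustrative names, paths, and ports: the liveness endpoint checks only in-process health, while readiness is the probe that is allowed to fail when a dependency is down.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest
    livenessProbe:
      httpGet: { path: /healthz, port: 8080 }   # in-process check only: is the app itself stuck?
      periodSeconds: 10
      timeoutSeconds: 2
    readinessProbe:
      httpGet: { path: /ready, port: 8080 }     # this probe may check the database
      periodSeconds: 5
EOF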
Exec Probes and Zombie Processes
Using shell scripts for health checks can leave dead "zombie" processes behind. If nothing reaps them, they slowly fill the node's process table.
Fix: Use an init system like dumb-init inside your container, or use simple HTTP probes instead.
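To see whether a container is already accumulating zombies, list its processes; this assumes procps is available in the image (for distroless images, attach an ephemeral debug container instead, as described later):
# Processes in state Z are zombies waiting to be reaped
kubectl exec <pod-name> -- ps -eo pid,stat,comm | awk '$2 ~ /^Z/'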
Node Pressure & Eviction: Debugging CrashLoopBackOff
CrashLoopBackOff means your pod keeps crashing, and Kubernetes waits longer and longer before each restart. If the logs are empty, the pod is probably not crashing from a code bug. The kubelet is evicting it.
Kubelet Evictions vs. Application Crashes
When a node runs out of memory or disk space, it kicks pods off. If your pod gets sent back to the exact same full node, it gets kicked off again. This looks like a crash loop. Stop looking at app logs. Look at the node status instead.
# Check why the previous container instance actually died
kubectl logs <pod-name> --previous
# Inspect node pressure conditions
kubectl describe node <node-name> | grep -A5 Conditions
# Find all evicted pods across the cluster
kubectl get pods --all-namespaces --field-selector status.phase=Failed
kubectl get events --all-namespaces --field-selector reason=Evicted
The PodDisruptionBudget Mask
A PodDisruptionBudget (PDB) keeps your app safe when you want to drain a node. But a node that is out of memory does not care about your PDB. It will kick the pod out immediately. Make sure your PDB allows at least one pod to go down, or it might hide bigger node problems.
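A minimal sketch of a budget that always leaves room for one voluntary disruption; the name and labels are illustrative:
kubectl apply -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  maxUnavailable: 1          # never 0 -- that blocks every node drain
  selector:
    matchLabels:
      app: my-app
EOF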
Storage Layer: Deadlocks and Finalizers
Sometimes storage volumes make pods refuse to start or stop.
The ReadWriteOncePod (RWOP) Evolution
Old storage volumes used a mode called ReadWriteOnce (RWO). This locked the storage to one node. During an update, a new pod on a new node could get stuck. It would wait forever for the old node to let go of the storage.
Fix: In Kubernetes 1.29, ReadWriteOncePod (RWOP) became fully ready (GA). It locks the storage to exactly one pod in the whole cluster. Use this to stop pods from getting stuck.
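A minimal sketch of an RWOP claim, assuming an illustrative storage class and a CSI driver that supports the mode:
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-rwop
spec:
  accessModes: ["ReadWriteOncePod"]   # exactly one pod in the whole cluster may mount it
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 10Gi
EOF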
PVCs Stuck in Terminating
If a node dies suddenly, the storage might stay locked to it.
Fix: Check if the storage is detached in the cloud. Only after that, remove the finalizer to unlock it.
# Inspect stuck volume attachments
kubectl get volumeattachments
# Remove the protection finalizer -- only after confirming storage detached
kubectl patch pvc <pvc-name> -p '{"metadata":{"finalizers":null}}'
The ConfigMap subPath Trap
If you mount a ConfigMap as a full folder, Kubernetes updates the files automatically. But if you mount just one file using subPath, it will never update. This happens because Linux locks onto the file's ID (called an inode). When Kubernetes makes a new file, the ID changes, but the pod stays locked to the old one.
Fix: Mount the whole folder instead of using subPath.
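A minimal sketch with illustrative names: mounting the ConfigMap as a directory lets the kubelet's symlink swap propagate updates, while the commented subPath variant would pin the container to the old inode.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: config-demo
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest
    volumeMounts:
    - name: app-config
      mountPath: /etc/app          # whole-directory mount: updates propagate
      # adding subPath: app.conf here would freeze the file at the old inode
  volumes:
  - name: app-config
    configMap:
      name: app-config
EOF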
Network Layer: DNS Delays and IP Exhaustion
Network bugs are very hard to find. They usually look like random timeouts.
The DNS 4-Query Tax (ndots:5)
Kubernetes sets ndots:5 in every pod's resolv.conf. This makes the resolver try the cluster search domains before treating a name as a real external address. If you look up api.external.com, the resolver tries three wrong cluster names first. This makes every external lookup slower.
Fix: Add a trailing dot to the address in your code (api.external.com.). A fully qualified name skips the extra searches.
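You can confirm what a pod actually received; workloads that mostly resolve external names can also lower ndots through spec.dnsConfig.options:
# Inspect the search path and ndots value inside the pod
kubectl exec <pod-name> -- cat /etc/resolv.conf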
Conntrack Exhaustion and the 5-Second Delay
If your DNS sometimes takes exactly 5 seconds, you are hitting a known Linux race. When your app sends the IPv4 and IPv6 lookups at the same instant over UDP, the connection tracker (nf_conntrack) races on inserting the two entries and drops one packet. Because UDP does not retransmit immediately, the resolver sits there for its full 5-second timeout before retrying.
# Check current conntrack table fill level on the node
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
Fix: Install NodeLocal DNSCache. It runs a caching DNS agent on every node, bypasses conntrack for its own traffic, and upgrades upstream queries to TCP, which removes the race entirely.
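The upstream manifests deploy the cache as a DaemonSet called node-local-dns in kube-system (your distribution may package it under a different name); confirm it is actually scheduled on every node:
kubectl get daemonset -n kube-system node-local-dns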
Subnet IP Exhaustion (CNI Limitations)
In cloud setups, pods get routable IP addresses from the VPC network. The CNI plugin keeps a warm pool of pre-allocated IPs on each node. If your subnet is small, it can run out of IPs before pods even start.
Fix: Put pod IP addresses in a different, larger network space than your main nodes.
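Pods stuck in Pending or ContainerCreating with sandbox errors usually point at the CNI rather than the scheduler; the event reason below is one common signature:
kubectl get events --all-namespaces --field-selector reason=FailedCreatePodSandBox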
Service-Level Routing: Beyond iptables
If your pod is healthy but you see 502 Bad Gateway errors, service routing is stale. kube-proxy in iptables mode evaluates rules sequentially and rewrites them wholesale, so on large clusters rule sync falls behind and traffic is sent to pod IPs that no longer exist. You can check which mode a cluster is running, as shown after the table below.
| Solution | Mechanism | Scale | Recommended For |
|---|---|---|---|
| kube-proxy (iptables) | Sequential rules | Up to ~1,000 nodes | Small clusters |
| kube-proxy (IPVS) | Fast hash-based lookups | Up to ~3,000 nodes | Medium clusters |
| Cilium (eBPF) | Deep kernel routing | Any scale | 2026 standard |
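On kubeadm-based clusters the active mode lives in the kube-proxy ConfigMap; clusters running Cilium in kube-proxy replacement mode will not have it at all:
kubectl get configmap -n kube-system kube-proxy -o yaml | grep "mode:"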
Observability & Advanced Debugging Tools
Ephemeral Containers for Distroless Images
Very safe container images do not have tools like curl or a shell. You cannot use kubectl exec. Instead, use an ephemeral container. This attaches a temporary container with tools right into your running pod:
kubectl debug -it <pod-name> \
  --image=nicolaka/netshoot \
  --target=<container-name>
If you need to check the node itself:
# Spawn a debug pod on the node itself, with the host filesystem mounted at /host
kubectl debug node/<node-name> -it --image=ubuntu -- bash
eBPF: Kernel-Level Tracing
eBPF lets you watch the system deep inside the kernel without changing your code. Old security tools checked data before it was fully loaded. Hackers could change the data and hide. This is called a TOCTOU attack. Modern eBPF tools like Tetragon use LSM hooks. LSM hooks read the data deep inside the kernel where it is safe from hackers.
The Clock Skew Anomaly in Distributed Tracing
If your tracing tools show a child task starting before its parent, your code is probably fine. This happens when the physical hardware clock on a worker node is out of sync (NTP drift).
Fix: Resynchronize the node's clock with NTP (for example with chrony).
# Debug NTP synchronization directly on the affected node
kubectl debug node/<node-name> -it --image=ubuntu -- \
  bash -c "apt-get update -q && apt-get install -y chrony && chronyc tracking"
The Live Debug Playbook
Use this table when your pager rings. Run these commands before guessing what is wrong.
| Symptom | First Commands to Run | Likely Root Cause |
|---|---|---|
| Exit Code 137 (OOMKilled) | dmesg \| grep -i oom; kubectl get pod -o jsonpath='{...reason}' | JVM heap + off-heap over cgroup limit; wrong QoS class |
| Exit Code 139 (SIGSEGV) | Generate core dump; check JNI library CPU architecture | Native C/C++ bug or wrong-arch shared library |
| CrashLoopBackOff (empty logs) | kubectl logs --previous; kubectl describe node \| grep Conditions | Node memory/disk pressure eviction loop |
| 502 Bad Gateway (pod healthy) | kubectl get endpoints <svc>; iptables -L -t nat \| grep <svc-ip> | Stale kube-proxy iptables rules routing to dead pod IP |
| DNS 5-second delay | cat /proc/sys/net/netfilter/nf_conntrack_count | conntrack UDP race; deploy NodeLocal DNSCache |
| Pods stuck in Pending | kubectl describe pod | grep Events | CNI subnet IP exhaustion; expand CIDR or use ENIConfig |
| PVC stuck in Terminating | kubectl get volumeattachments | Dead node holding VolumeAttachment; patch PVC finalizers |
| Trace spans out of order | kubectl debug node/<n> -- chronyc tracking | NTP clock drift on worker node; resync chronyd |
Building Systemic Reliability
Do not just look at application logs. Good engineers look at the whole system. Use this guide to find out if the problem is in the kernel, the network, or the hardware. By learning these simple patterns, you will stop guessing. You will find and fix the real problem the first time.

