Health Checks: Probes
Configure liveness, readiness, and startup probes to ensure your applications are running correctly.
In the previous tutorial, we learned about labels and selectors — the sticky notes and search engine of Kubernetes. Now let's tackle something critical: how does Kubernetes know if your app is actually working?
Here's the dirty truth: a running container doesn't mean a healthy application. Your app might be deadlocked, stuck in an infinite loop, or just sitting there staring into the void. Like that coworker who's technically at their desk but hasn't done anything productive in three hours.
Probes let Kubernetes detect these conditions and take action automatically. No more babysitting.
Three Types of Probes
Think of these as three different doctors, each asking a different question:
| Probe | Question | On Failure |
|---|---|---|
| Liveness | "Are you even alive?" | Restart the container (the Kubernetes equivalent of "have you tried turning it off and on again?") |
| Readiness | "Are you ready to work?" | Remove from Service endpoints (stop sending traffic) |
| Startup | "Are you done starting up?" | Keep checking until failureThreshold is exhausted, then restart; liveness/readiness stay on hold until it passes |
Liveness Probes
"Is my application alive or just pretending?"
If the liveness probe fails, Kubernetes kills the container and starts a new one. It's brutal, but effective. Like a bouncer checking if you're still conscious.
Use this to recover from deadlocks or stuck states.
HTTP Liveness Probe
The most common type for web applications — just hit an endpoint and see if you get a 200 back:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: liveness-http
spec:
  containers:
  - name: app
    image: nginx
    livenessProbe:
      httpGet:
        path: /healthz
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 10
      timeoutSeconds: 3
      failureThreshold: 3
```
| Field | Description |
|---|---|
| `httpGet.path` | Endpoint to check |
| `httpGet.port` | Port to connect to |
| `initialDelaySeconds` | Seconds to wait before the first check |
| `periodSeconds` | How often to check |
| `timeoutSeconds` | How long to wait for a response |
| `failureThreshold` | Consecutive failures before restarting ("three strikes and you're out") |
Kubernetes expects HTTP 200-399 for success. Anything else? That container is getting recycled.
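That success rule is easy to sketch. Here's a simplified stand-in for how the kubelet classifies an HTTP probe result (the real implementation also treats connection errors and timeouts as failures):

```python
# Simplified sketch of HTTP probe classification: any status code
# in [200, 400) counts as success; everything else is a failure.

def http_probe_success(status_code: int) -> bool:
    """Return True if the status code counts as a passing probe."""
    return 200 <= status_code < 400

print(http_probe_success(200))  # True: healthy
print(http_probe_success(302))  # True: redirects still pass
print(http_probe_success(500))  # False: server error, probe fails
```

Note that a 302 redirect passes — which can mask a broken app if your health endpoint redirects to a login page.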
TCP Liveness Probe
For non-HTTP services like databases or caches — just check if the port is open:
```yaml
livenessProbe:
  tcpSocket:
    port: 3306
  initialDelaySeconds: 15
  periodSeconds: 20
```
Success simply means a TCP connection could be established — there's no application-level check, so the port being open is all you learn.
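Under the hood, a tcpSocket probe boils down to "can I connect?" — a minimal sketch, with a throwaway local listener standing in for your service:

```python
# Sketch of tcpSocket probe semantics: try to open a TCP connection;
# if it succeeds, the probe passes. Host and port are just examples.
import socket

def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Demo: start a throwaway listener on an ephemeral port and probe it.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
_, port = server.getsockname()
print(tcp_probe("127.0.0.1", port))  # True: port is open
server.close()
```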
Command Liveness Probe
Run a command inside the container — if it returns exit code 0, you're good:
```yaml
livenessProbe:
  exec:
    command:
    - cat
    - /tmp/healthy
  initialDelaySeconds: 5
  periodSeconds: 5
```
Exit code 0 means success. Anything else means trouble.
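The same exit-code rule is easy to reproduce outside the cluster. A rough sketch of exec-probe semantics, using the same `cat`-a-sentinel-file pattern as the manifest above:

```python
# Sketch of exec-probe semantics: run a command; exit code 0 = healthy,
# anything else = failure.
import os
import subprocess
import tempfile

def exec_probe(command: list[str]) -> bool:
    """Run the command and report success iff it exits with code 0."""
    return subprocess.run(command, capture_output=True).returncode == 0

# Probe by cat-ing a sentinel file, as in the manifest above.
path = os.path.join(tempfile.mkdtemp(), "healthy")
print(exec_probe(["cat", path]))  # False: file does not exist yet
open(path, "w").close()
print(exec_probe(["cat", path]))  # True: file exists, cat exits 0
```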
Readiness Probes
"Is my application ready to receive traffic, or is it still warming up?"
This is the key difference from liveness: if the readiness probe fails, the Pod is removed from Service endpoints — no traffic is routed to it. But the container is NOT restarted. It just gets taken out of the rotation until it's ready again.
Perfect for startup, loading data, or when your app is temporarily overwhelmed.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: readiness-demo
spec:
  containers:
  - name: app
    image: nginx
    readinessProbe:
      httpGet:
        path: /ready
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 5
      successThreshold: 1
      failureThreshold: 3
```
| Field | Description |
|---|---|
| `successThreshold` | Consecutive successes needed after a failure to be marked ready again |
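"Removed from Service endpoints" just means the Service stops load-balancing to that Pod. A toy sketch (the pod names and ready flags are made up for illustration):

```python
# Sketch of endpoint selection: a Service only routes traffic to Pods
# whose readiness probe is currently passing. Unready Pods keep running;
# they just get no traffic.

pods = [
    {"name": "web-1", "ready": True},
    {"name": "web-2", "ready": False},  # failing its readiness probe
    {"name": "web-3", "ready": True},
]

endpoints = [p["name"] for p in pods if p["ready"]]
print(endpoints)  # ['web-1', 'web-3'] -- web-2 is out of the rotation
```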
Liveness vs Readiness — When to Use Which
This table will save you so much confusion:
| Scenario | Liveness | Readiness |
|---|---|---|
| Application deadlocked | ✓ Restart it | |
| Temporary overload | | ✓ Stop sending traffic |
| Warming up cache | | ✓ Wait until ready |
| Broken beyond repair | ✓ Restart it | |
| Database connection lost | | ✓ Stop traffic until reconnected |
The rule of thumb: "Can it be fixed by restarting?" → Liveness. "Is it temporary?" → Readiness.
Use both together for the full picture:
```yaml
spec:
  containers:
  - name: app
    image: myapp
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
```
Startup Probes
"My app takes 2 minutes to start. Won't liveness kill it before it's even ready?"
Excellent question! That's exactly what startup probes solve. Some applications take forever to start — loading large datasets, warming caches, Java apps doing... Java things. Without startup probes, you'd need a ridiculously long initialDelaySeconds for liveness, which delays actual failure detection.
Startup probes disable liveness and readiness checks until the app has fully started:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: slow-start
spec:
  containers:
  - name: app
    image: myapp
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      failureThreshold: 30
      periodSeconds: 10
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5
```
The startup probe allows up to 300 seconds (30 × 10) for the app to start. Once it passes, liveness and readiness take over. Think of it as saying "let them finish getting dressed before you start checking if they're working."
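The budget math is worth internalizing, because you'll tune it for every slow-starting app. It's simply:

```python
# The maximum time a startup probe allows before the container is
# restarted is failureThreshold * periodSeconds.

def startup_budget(failure_threshold: int, period_seconds: int) -> int:
    """Seconds the app has to start before the startup probe gives up."""
    return failure_threshold * period_seconds

print(startup_budget(30, 10))  # 300 seconds, as in the manifest above
```

If your app sometimes takes 4 minutes, give it a budget comfortably above that — the probe stops early the moment the app responds, so a generous ceiling costs nothing on fast starts.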
Practical Example: Web Application
Okay, let's put it all together. Here's a realistic configuration for a web application with all three probes:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx
        ports:
        - containerPort: 80
        startupProbe:
          httpGet:
            path: /
            port: 80
          failureThreshold: 12
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 0
          periodSeconds: 15
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 0
          periodSeconds: 5
          timeoutSeconds: 3
          successThreshold: 1
          failureThreshold: 2
```
Here's what happens, step by step:
- Pod starts
- Startup probe checks every 5s (up to 60s total for the app to boot)
- Once startup passes, readiness kicks in (every 5s)
- Liveness checks every 15s in the background
- If liveness fails 3 times → restart the container
- If readiness fails 2 times → remove from Service (but keep it alive)
How cool is that? Three layers of protection, all automatic.
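One detail worth emphasizing: only consecutive failures count toward failureThreshold — a single success resets the streak. A toy simulation of that counting logic:

```python
# Toy simulation of failureThreshold: only *consecutive* failures
# count, and one passing probe resets the streak to zero.

def needs_restart(results: list[bool], failure_threshold: int = 3) -> bool:
    """Given a sequence of probe results, does the streak rule trigger?"""
    streak = 0
    for ok in results:
        streak = 0 if ok else streak + 1
        if streak >= failure_threshold:
            return True
    return False

print(needs_restart([True, False, False, True, False]))  # False: reset at the True
print(needs_restart([True, False, False, False]))        # True: 3 in a row
```

This is why a flaky endpoint that fails every other check never trips a threshold of 3 — and why intermittent failures can hide for a long time.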
See Probe Status
Want to see what's actually happening with your probes? describe is your friend:
```bash
kubectl describe pod <pod-name>
```
Look at the Conditions section:
```
Conditions:
  Type              Status
  Initialized       True
  Ready             True    # Readiness probe passing
  ContainersReady   True
  PodScheduled      True
```
And the Events section shows the drama as it unfolds:
```
Events:
  Type     Reason     Message
  ----     ------     -------
  Warning  Unhealthy  Readiness probe failed: HTTP probe failed...
  Warning  Unhealthy  Liveness probe failed: HTTP probe failed...
  Normal   Killing    Container failed liveness probe, will be restarted
```
If you see a Killing event — that's Kubernetes pulling the trigger because liveness failed. Cold, but effective.
Test Probes
Wanna see probes in action? Let's create a Pod that starts healthy and then becomes unhealthy. It's like a controlled experiment for chaos:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probe-test
spec:
  containers:
  - name: app
    image: busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/healthy; sleep 30; rm -f /tmp/healthy; sleep 600
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 5
      periodSeconds: 5
```
Watch the drama unfold:
```bash
kubectl apply -f probe-test.yaml
kubectl get pod probe-test --watch
```
After 30 seconds, the file gets deleted, liveness starts failing, and — boom — restart. You'll see the RESTARTS counter go up. It's weirdly satisfying to watch Kubernetes handle this automatically.
Probe Configuration Tips
Alright, here's the wisdom section. These tips will save you from 3am pager alerts.
Set Appropriate Timeouts
This is an art, not a science:
- Too short → false positives during garbage collection pauses or high load (your app is fine, you're just impatient)
- Too long → slow failure detection (your app has been dead for 5 minutes and no one noticed)
```yaml
# Good starting point for most apps
timeoutSeconds: 5
periodSeconds: 10
failureThreshold: 3
```
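To reason about "too long", estimate the worst-case detection time. A back-of-the-envelope sketch (not the kubelet's exact timing — real scheduling jitters a bit):

```python
# Rough worst-case time to detect a dead app: each of failureThreshold
# consecutive checks can be up to periodSeconds apart and hang for up
# to timeoutSeconds. A back-of-the-envelope bound, not exact kubelet timing.

def worst_case_detection(period: int, timeout: int, threshold: int) -> int:
    return threshold * (period + timeout)

# With the suggested defaults above: restart within roughly 45s of death.
print(worst_case_detection(period=10, timeout=5, threshold=3))  # 45
```

If 45 seconds of downtime per replica is unacceptable, tighten the period — but only after confirming your app never legitimately pauses that long (GC, load spikes).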
Use Different Endpoints
This is important — liveness and readiness should check different things:
```yaml
livenessProbe:
  httpGet:
    path: /healthz   # Basic "am I alive" check
readinessProbe:
  httpGet:
    path: /ready     # Check DB connection, dependencies
```
Liveness Should Be Simple
The liveness check should be fast and simple. Don't check external dependencies — if the database is down, restarting your app won't fix the database. That's like changing the tires because you ran out of gas.
```python
from flask import Flask

app = Flask(__name__)

# Good liveness endpoint: no external dependencies
@app.route('/healthz')
def healthz():
    return 'OK', 200

# Good readiness endpoint: checks the app's actual dependencies
# (database and cache here stand in for your own connection objects)
@app.route('/ready')
def ready():
    if database.is_connected() and cache.is_ready():
        return 'OK', 200
    return 'Not Ready', 503
```
See the difference? Liveness is just "am I alive?" Readiness is "am I actually ready to work?"
Account for Startup Time
Java apps and apps loading large datasets need longer startup times. Don't be stingy:
```yaml
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 60   # Allow up to 5 minutes (60 × 5s)
  periodSeconds: 5
```
Common Patterns
Here are some copy-paste-ready probe configs for common services:
Database Connection Check (PostgreSQL)
```yaml
readinessProbe:
  exec:
    command:
    - pg_isready
    - -h
    - localhost
    - -p
    - "5432"
  periodSeconds: 10
```
Redis Check
```yaml
livenessProbe:
  exec:
    command:
    - redis-cli
    - ping
  periodSeconds: 5
```
gRPC Health Check
```yaml
livenessProbe:
  grpc:
    port: 50051
  periodSeconds: 10
```
(Requires Kubernetes 1.24+ and an application that implements the gRPC Health Checking Protocol)
Troubleshooting
When probes go wrong, here's your debugging playbook:
Container Keeps Restarting
Something is killing your container. Let's find out what:
```bash
kubectl describe pod <pod-name>
kubectl logs <pod-name> --previous
```
Common causes:
- initialDelaySeconds too short (app hasn't started yet and you're already poking it)
- Endpoint returns 500 during high load
- Timeout too short (app is slow, not dead)
Pod Never Becomes Ready
```bash
kubectl describe pod <pod-name>
```
Check readiness probe configuration. Is the endpoint correct? Is the port right? Is the app actually exposing that endpoint?
High Restart Count
```bash
kubectl get pods
```

```
NAME        READY   STATUS    RESTARTS   AGE
myapp-xyz   1/1     Running   47         2h
```
47 restarts in 2 hours?! That container is having a really bad day. Either the liveness probe is too aggressive, or the app has genuine issues. Check the logs with kubectl logs <pod-name> --previous to see what happened right before the last crash.
Clean Up
```bash
kubectl delete pod liveness-http readiness-demo slow-start probe-test 2>/dev/null
kubectl delete deployment web-app 2>/dev/null
```
What's Next?
Nice work! You now know how to make Kubernetes automatically detect and recover from application failures. No more manually checking if things are running — your cluster is self-healing now.
But what about tasks that aren't meant to run forever? Like batch processing, database migrations, or scheduled cleanup scripts? In the next tutorial, we'll dive into Jobs and CronJobs — Kubernetes' way of running "do this once" and "do this every Tuesday at 3am" workloads. Let's go!