Health Checks: Probes

Configure liveness, readiness, and startup probes to ensure your applications are running correctly.

In the previous tutorial, we learned about labels and selectors — the sticky notes and search engine of Kubernetes. Now let's tackle something critical: how does Kubernetes know if your app is actually working?

Here's the dirty truth: a running container doesn't mean a healthy application. Your app might be deadlocked, stuck in an infinite loop, or just sitting there staring into the void. Like that coworker who's technically at their desk but hasn't done anything productive in three hours.

Probes let Kubernetes detect these conditions and take action automatically. No more babysitting.

Three Types of Probes

Think of these as three different doctors, each asking a different question:

Probe     | Question                    | On Failure
----------|-----------------------------|-----------
Liveness  | "Are you even alive?"       | Restart the container (the Kubernetes equivalent of "have you tried turning it off and on again?")
Readiness | "Are you ready to work?"    | Remove from Service endpoints (stop sending traffic)
Startup   | "Are you done starting up?" | Keep checking (delays liveness/readiness checks)

Liveness Probes

"Is my application alive or just pretending?"

If the liveness probe fails, Kubernetes kills the container and starts a new one. It's brutal, but effective. Like a bouncer checking if you're still conscious.

Use this to recover from deadlocks or stuck states.

HTTP Liveness Probe

The most common type for web applications — just hit an endpoint and see if you get a 200 back:

apiVersion: v1
kind: Pod
metadata:
  name: liveness-http
spec:
  containers:
  - name: app
    image: nginx
    livenessProbe:
      httpGet:
        path: /healthz
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 10
      timeoutSeconds: 3
      failureThreshold: 3

Field               | Description
--------------------|------------
httpGet.path        | Endpoint to check
httpGet.port        | Port to connect to
initialDelaySeconds | Wait before the first check
periodSeconds       | How often to check
timeoutSeconds      | How long to wait for a response
failureThreshold    | Failures before restarting ("three strikes and you're out")

Kubernetes expects HTTP 200-399 for success. Anything else? That container is getting recycled.
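That success rule is simple enough to sketch in a few lines of Python. This mirrors the documented 200-399 range; it is not kubelet code, just the rule made executable:

```python
def http_probe_passes(status_code: int) -> bool:
    """An HTTP probe succeeds when the status code is in [200, 400)."""
    return 200 <= status_code < 400

# Redirects (3xx) count as healthy; server errors (5xx) get you restarted.
print(http_probe_passes(200))  # True
print(http_probe_passes(302))  # True
print(http_probe_passes(500))  # False
```

Note that a 302 redirect passes, which occasionally surprises people whose "health" endpoint redirects to a login page.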

TCP Liveness Probe

For non-HTTP services like databases or caches — just check if the port is open:

livenessProbe:
  tcpSocket:
    port: 3306
  initialDelaySeconds: 15
  periodSeconds: 20

Success means the TCP connection was established; a refused or timed-out connection counts as a failure.
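Conceptually, the check is just "can I complete a TCP handshake before the timeout?" A rough Python equivalent of that idea (a sketch, not the kubelet's actual implementation):

```python
import socket

def tcp_probe_passes(host: str, port: int, timeout: float = 1.0) -> bool:
    """Succeed if a TCP connection can be established before the timeout."""
    try:
        # create_connection completes the handshake or raises OSError
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Notice there is no application-level check at all: a database that accepts connections but deadlocks on every query still passes a TCP probe. That is the trade-off you accept with this probe type.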

Command Liveness Probe

Run a command inside the container — if it returns exit code 0, you're good:

livenessProbe:
  exec:
    command:
    - cat
    - /tmp/healthy
  initialDelaySeconds: 5
  periodSeconds: 5

Exit code 0 means success. Anything else means trouble.
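The exit-code convention is the same one your shell uses. A minimal sketch of the check, using Python's subprocess module (illustrative only; the kubelet execs inside the container runtime, not via subprocess):

```python
import subprocess

def exec_probe_passes(command: list[str]) -> bool:
    """An exec probe succeeds only when the command exits with code 0."""
    result = subprocess.run(command, capture_output=True)
    return result.returncode == 0
```

So `cat /tmp/healthy` passes while the file exists (cat exits 0) and fails the moment it is deleted (cat exits 1).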

Readiness Probes

"Is my application ready to receive traffic, or is it still warming up?"

This is the key difference from liveness: if the readiness probe fails, the Pod is removed from Service endpoints — no traffic is routed to it. But the container is NOT restarted. It just gets taken out of the rotation until it's ready again.

Perfect for startup, loading data, or when your app is temporarily overwhelmed.

apiVersion: v1
kind: Pod
metadata:
  name: readiness-demo
spec:
  containers:
  - name: app
    image: nginx
    readinessProbe:
      httpGet:
        path: /ready
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 5
      successThreshold: 1
      failureThreshold: 3

Field            | Description
-----------------|------------
successThreshold | Consecutive successes needed after a failure to be marked ready again

Liveness vs Readiness — When to Use Which

This table will save you so much confusion:

Scenario                 | Liveness     | Readiness
-------------------------|--------------|----------
Application deadlocked   | ✓ Restart it |
Temporary overload       |              | ✓ Stop sending traffic
Warming up cache         |              | ✓ Wait until ready
Broken beyond repair     | ✓ Restart it |
Database connection lost |              | ✓ Stop traffic until reconnected

The rule of thumb: "Can it be fixed by restarting?" → Liveness. "Is it temporary?" → Readiness.

Use both together for the full picture:

spec:
  containers:
  - name: app
    image: myapp
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5

Startup Probes

"My app takes 2 minutes to start. Won't liveness kill it before it's even ready?"

Excellent question! That's exactly what startup probes solve. Some applications take forever to start — loading large datasets, warming caches, Java apps doing... Java things. Without startup probes, you'd need a ridiculously long initialDelaySeconds for liveness, which delays actual failure detection.

Startup probes disable liveness and readiness checks until the app has fully started:

apiVersion: v1
kind: Pod
metadata:
  name: slow-start
spec:
  containers:
  - name: app
    image: myapp
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      failureThreshold: 30
      periodSeconds: 10
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5

The startup probe allows up to 300 seconds (30 × 10) for the app to start. Once it passes, liveness and readiness take over. Think of it as saying "let them finish getting dressed before you start checking if they're working."
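That 300-second figure comes from a simple product, which is worth having as a formula when you tune these numbers. (This mirrors the tutorial's own approximation; real-world timing also includes timeoutSeconds and a bit of scheduling jitter.)

```python
def max_startup_seconds(failure_threshold: int, period_seconds: int) -> int:
    """Worst-case time a startup probe allows before the container is killed."""
    return failure_threshold * period_seconds

# The manifest above: 30 allowed failures x 10s between checks = 300 seconds.
print(max_startup_seconds(30, 10))  # 300
```

When a slow app gets killed during boot, this product is the first number to check.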

Practical Example: Web Application

Okay, let's put it all together. Here's a realistic configuration for a web application with all three probes:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx
        ports:
        - containerPort: 80
        startupProbe:
          httpGet:
            path: /
            port: 80
          failureThreshold: 12
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 0
          periodSeconds: 15
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 0
          periodSeconds: 5
          timeoutSeconds: 3
          successThreshold: 1
          failureThreshold: 2

Here's what happens, step by step:

  1. Pod starts
  2. Startup probe checks every 5s (up to 60s total for the app to boot)
  3. Once startup passes, readiness kicks in (every 5s)
  4. Liveness checks every 15s in the background
  5. If liveness fails 3 times → restart the container
  6. If readiness fails 2 times → remove from Service (but keep it alive)

How cool is that? Three layers of protection, all automatic.
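The threshold logic in steps 5 and 6 boils down to counting consecutive failures, with any success resetting the count. A simplified sketch of that bookkeeping (the class name and structure are illustrative, not kubelet internals):

```python
class ProbeTracker:
    """Count consecutive probe failures and report when a threshold is crossed."""

    def __init__(self, failure_threshold: int):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def record(self, success: bool) -> bool:
        """Record one probe result; return True when the threshold is hit."""
        if success:
            self.consecutive_failures = 0  # any success resets the streak
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold

# failureThreshold: 3 — two failures, a recovery, then three failures in a row.
liveness = ProbeTracker(failure_threshold=3)
results = [False, False, True, False, False, False]
print([liveness.record(r) for r in results])
# [False, False, False, False, False, True]
```

This is why a single slow response rarely hurts you: only an unbroken streak of failures triggers the restart (liveness) or the traffic removal (readiness).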

See Probe Status

Want to see what's actually happening with your probes? describe is your friend:

kubectl describe pod <pod-name>

Look at the Conditions section:

Conditions:
  Type              Status
  Initialized       True
  Ready             True      # Readiness probe passing
  ContainersReady   True
  PodScheduled      True

And the Events section shows the drama as it unfolds:

Events:
  Type     Reason     Message
  ----     ------     -------
  Warning  Unhealthy  Readiness probe failed: HTTP probe failed...
  Warning  Unhealthy  Liveness probe failed: HTTP probe failed...
  Normal   Killing    Container failed liveness probe, will be restarted

If you see a Killing event — that's Kubernetes pulling the trigger because liveness failed. Cold, but effective.

Test Probes

Wanna see probes in action? Let's create a Pod that starts healthy and then becomes unhealthy. It's like a controlled experiment for chaos:

apiVersion: v1
kind: Pod
metadata:
  name: probe-test
spec:
  containers:
  - name: app
    image: busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/healthy; sleep 30; rm -f /tmp/healthy; sleep 600
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 5
      periodSeconds: 5

Watch the drama unfold:

kubectl apply -f probe-test.yaml
kubectl get pod probe-test --watch

After 30 seconds, the file gets deleted, liveness starts failing, and — boom — restart. You'll see the RESTARTS counter go up. It's weirdly satisfying to watch Kubernetes handle this automatically.

Probe Configuration Tips

Alright, here's the wisdom section. These tips will save you from 3am pager alerts.

Set Appropriate Timeouts

This is an art, not a science:

  • Too short → false positives during garbage collection pauses or high load (your app is fine, you're just impatient)
  • Too long → slow failure detection (your app has been dead for 5 minutes and no one noticed)
# Good starting point for most apps
timeoutSeconds: 5
periodSeconds: 10
failureThreshold: 3
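Those three numbers together determine how long a hung app goes unnoticed. A rough upper bound, as a sketch (actual timing also depends on check jitter, but this is the right order of magnitude):

```python
def worst_case_detection_seconds(period: int, timeout: int,
                                 failure_threshold: int) -> int:
    """Rough upper bound on time-to-restart once the app stops responding."""
    # Each failed check waits up to `timeout`, and checks fire every `period`.
    return failure_threshold * (period + timeout)

# The starting point above: 3 x (10 + 5) = 45 seconds before the restart fires.
print(worst_case_detection_seconds(period=10, timeout=5, failure_threshold=3))  # 45
```

If 45 seconds of downtime is too long for your SLO, tighten periodSeconds before you touch failureThreshold; dropping the threshold below 3 invites restarts on transient blips.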

Use Different Endpoints

This is important — liveness and readiness should check different things:

livenessProbe:
  httpGet:
    path: /healthz      # Basic "am I alive" check
readinessProbe:
  httpGet:
    path: /ready        # Check DB connection, dependencies

Liveness Should Be Simple

The liveness check should be fast and simple. Don't check external dependencies — if the database is down, restarting your app won't fix the database. That's like changing the tires because you ran out of gas.

# Good liveness endpoint: no external dependencies
from flask import Flask

app = Flask(__name__)

@app.route('/healthz')
def healthz():
    return 'OK', 200

# Good readiness endpoint: verify dependencies before accepting traffic
@app.route('/ready')
def ready():
    if database.is_connected() and cache.is_ready():
        return 'OK', 200
    return 'Not Ready', 503

See the difference? Liveness is just "am I alive?" Readiness is "am I actually ready to work?"

Account for Startup Time

Java apps and apps loading large datasets need longer startup times. Don't be stingy:

startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 60    # Allow 5 minutes
  periodSeconds: 5

Common Patterns

Here are some copy-paste-ready probe configs for common services:

Database Connection Check (PostgreSQL)

readinessProbe:
  exec:
    command:
    - pg_isready
    - -h
    - localhost
    - -p
    - "5432"
  periodSeconds: 10

Redis Check

livenessProbe:
  exec:
    command:
    - redis-cli
    - ping
  periodSeconds: 5

gRPC Health Check

livenessProbe:
  grpc:
    port: 50051
  periodSeconds: 10

(Requires Kubernetes 1.24 or later, where the built-in grpc probe type is enabled by default, and an app that implements the gRPC Health Checking Protocol)

Troubleshooting

When probes go wrong, here's your debugging playbook:

Container Keeps Restarting

Something is killing your container. Let's find out what:

kubectl describe pod <pod-name>
kubectl logs <pod-name> --previous

Common causes:

  • initialDelaySeconds too short (app hasn't started yet and you're already poking it)
  • Endpoint returns 500 during high load
  • Timeout too short (app is slow, not dead)

Pod Never Becomes Ready

kubectl describe pod <pod-name>

Check readiness probe configuration. Is the endpoint correct? Is the port right? Is the app actually exposing that endpoint?

High Restart Count

kubectl get pods
NAME        READY   STATUS    RESTARTS   AGE
myapp-xyz   1/1     Running   47         2h

47 restarts in 2 hours?! That container is having a really bad day. Either the liveness probe is too aggressive, or the app has genuine issues. Check the logs with kubectl logs <pod-name> --previous to see what happened right before the last crash.

Clean Up

kubectl delete pod liveness-http readiness-demo slow-start probe-test 2>/dev/null
kubectl delete deployment web-app 2>/dev/null

What's Next?

Nice work! You now know how to make Kubernetes automatically detect and recover from application failures. No more manually checking if things are running — your cluster is self-healing now.

But what about tasks that aren't meant to run forever? Like batch processing, database migrations, or scheduled cleanup scripts? In the next tutorial, we'll dive into Jobs and CronJobs — Kubernetes' way of running "do this once" and "do this every Tuesday at 3am" workloads. Let's go!