Analyzing 5xx Errors During Kubernetes Rolling Deployment

Jan 27, 2025

Reference: https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-best-practices-terminating-with-grace

Recently, while testing, I observed 5xx errors during a Kubernetes rolling deployment. When a pod is being terminated and the application is not configured to handle termination gracefully, clients may receive 5xx errors.

Let’s set things up and run some tests to analyze the issue.

Step 1: Follow Steps 1–9 at https://aws.plainenglish.io/cloudfront-blue-green-deployment-using-gitlab-where-origin-is-alb-eks-se-8f2d95b14ffd to create an nginx deployment (you don’t need the blue/green deployment for this test).

Instead of index.html, create app.py with the following code:

from flask import Flask

app = Flask(__name__)

# Define a route for the home page
@app.route("/")
def home():
    return "Hello, World! Welcome to the Flask App!"

# Define a health check endpoint
@app.route("/health")
def health():
    return "OK", 200

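if __name__ == "__main__":
    # Run Flask's built-in server for local testing only.
    # In the container, Gunicorn imports app:app and this block is skipped.
    # Port 80 is an assumption chosen to match the Dockerfile.
    app.run(host="0.0.0.0", port=80)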

and a Dockerfile like this:

# Use the official Python image as the base image
FROM python:3.9-slim

# Set the working directory in the container
WORKDIR /app

# Copy the application code to the container
COPY app.py /app/

# Install dependencies
RUN pip install flask gunicorn

# Expose the port that Gunicorn will run on
EXPOSE 80

# Command to start Gunicorn with 2 workers and proper signal handling
CMD ["gunicorn", "-w", "2", "-b", "0.0.0.0:80", "--timeout", "30", "--graceful-timeout", "30", "--log-level", "debug", "app:app"]

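Here we use Gunicorn to run the application. On SIGTERM, Gunicorn performs a graceful shutdown, waiting up to --graceful-timeout seconds for workers to finish their current requests.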
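The same Gunicorn flags can also live in a config file, which keeps the CMD line short. This is a minimal sketch, assuming a file named gunicorn.conf.py placed next to app.py (that file is not part of the original setup); the values simply mirror the CMD above.

```python
# gunicorn.conf.py (hypothetical file; values mirror the CMD flags above)

# Address and port Gunicorn listens on inside the container
bind = "0.0.0.0:80"

# Number of worker processes
workers = 2

# Seconds a worker may stay silent before it is killed and restarted
timeout = 30

# Seconds workers get to finish in-flight requests after a graceful
# shutdown signal (SIGTERM) before they are force-killed
graceful_timeout = 30

# Verbose logging makes shutdown events visible in the pod logs
loglevel = "debug"
```

With this file copied into the image, the command would become `gunicorn -c gunicorn.conf.py app:app`.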

Step 2: Edit the nginx deployment to add a preStop lifecycle hook and liveness/readiness probes.

spec:
      containers:
      - image: vinycoolguy/mypyapp:v14
        imagePullPolicy: IfNotPresent
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/sleep
              - "10"
        livenessProbe:
          failureThreshold: 2
          httpGet:
            path: /
            port: 80
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: nginx
        ports:
        - containerPort: 80
          protocol: TCP
        readinessProbe:
          failureThreshold: 2
          httpGet:
            path: /health
            port: 80
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
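
One thing worth checking in this manifest is the timing. Kubernetes starts the terminationGracePeriodSeconds countdown when termination begins, so the preStop sleep uses up part of it before the container even receives SIGTERM. The sketch below (the function name is made up for illustration) works through the arithmetic with the values used in this post.

```python
def shutdown_budget(grace_period_s, pre_stop_sleep_s, graceful_timeout_s):
    """Return (seconds left for the app after the preStop hook,
    whether the app's graceful timeout fits into that window)."""
    # The preStop hook runs first and counts against the grace period.
    remaining = max(grace_period_s - pre_stop_sleep_s, 0)
    return remaining, graceful_timeout_s <= remaining


# Values from this post: terminationGracePeriodSeconds=30,
# preStop sleep=10, Gunicorn --graceful-timeout=30
remaining, fits = shutdown_budget(
    grace_period_s=30, pre_stop_sleep_s=10, graceful_timeout_s=30
)
print(f"time left after preStop: {remaining}s, graceful timeout fits: {fits}")
```

With these values Gunicorn has about 20 seconds, not the 30 requested by --graceful-timeout, before the kubelet sends SIGKILL. Raising terminationGracePeriodSeconds to at least 40 would be one way to make the numbers fit; whether that affects the occasional errors seen in testing is untested here.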

Then change the target group’s deregistration delay to 30 seconds (the same as the Pod’s terminationGracePeriodSeconds).

Also configure the target group’s health check settings. Make sure the target group’s unhealthy threshold and interval match the readiness probe’s failureThreshold and periodSeconds (2 and 10 seconds in the manifest above).
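
If you prefer to script the target group changes instead of using the console, something like the following could work. It is only a sketch: the ARN is a placeholder, the helper names are invented for this example, and it assumes boto3 and AWS credentials are available. If the target group is managed by a controller or by infrastructure-as-code, direct changes like these may be reverted.

```python
# Hypothetical placeholder; substitute your own target group ARN.
TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:REGION:ACCOUNT_ID:targetgroup/NAME/ID"


def deregistration_delay_args(target_group_arn, delay_s=30):
    """Build arguments for elbv2 modify_target_group_attributes."""
    return {
        "TargetGroupArn": target_group_arn,
        "Attributes": [
            # Attribute values are passed as strings
            {"Key": "deregistration_delay.timeout_seconds", "Value": str(delay_s)},
        ],
    }


def health_check_args(target_group_arn, failure_threshold=2, period_s=10):
    """Build arguments for elbv2 modify_target_group, mirroring the
    readiness probe's failureThreshold and periodSeconds."""
    return {
        "TargetGroupArn": target_group_arn,
        "HealthCheckIntervalSeconds": period_s,
        "UnhealthyThresholdCount": failure_threshold,
    }


def apply(target_group_arn):
    """Push both settings to AWS; needs boto3 and valid credentials."""
    import boto3  # imported lazily so the helpers above work without it

    elbv2 = boto3.client("elbv2")
    elbv2.modify_target_group_attributes(**deregistration_delay_args(target_group_arn))
    elbv2.modify_target_group(**health_check_args(target_group_arn))
```

Calling `apply(TARGET_GROUP_ARN)` with a real ARN would set the deregistration delay to 30 seconds and align the health check with the readiness probe.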

Now do a rollout restart or a fresh deployment and monitor the responses while it progresses. In my first couple of deployments I didn’t see any 5xx errors.
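
For the monitoring part, a small polling script is enough to see whether any 5xx responses show up during the rollout. This is a rough sketch using only the Python standard library; the function names are invented for this example, and the URL you pass in would be your own load balancer endpoint.

```python
import http.client
import time
import urllib.error
import urllib.request
from collections import Counter


def probe(url, timeout_s=5):
    """Return the HTTP status code for one GET, or 0 if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses are raised as HTTPError
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return 0  # connection refused/reset, timeout, DNS failure, ...


def summarize(codes):
    """Return (count per status code, total number of 5xx responses)."""
    counts = Counter(codes)
    errors = sum(n for code, n in counts.items() if 500 <= code <= 599)
    return dict(counts), errors


def run(url, duration_s=120, interval_s=0.2):
    """Poll the URL for duration_s seconds and summarize what came back."""
    codes = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        codes.append(probe(url))
        time.sleep(interval_s)
    return summarize(codes)
```

Start `run("http://<your-load-balancer-dns>/")` in one terminal, trigger `kubectl rollout restart` for the deployment in another, and look at the returned counts once the rollout finishes.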

Across many more deployments, though, I occasionally saw a few 502/504 errors, so the setup is not completely robust yet. If you have any suggestions for improving it, do let me know in the comments.

Written by Vinayak Pandey
