Analyzing 5xx Errors During Kubernetes Rolling Deployment

Vinayak Pandey
Jan 27, 2025

Reference: https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-best-practices-terminating-with-grace

Recently, while doing some testing, I observed 5xx errors during a Kubernetes rolling deployment. When a pod is being terminated and your application is not configured to handle the shutdown gracefully, requests that are still being sent to that pod can fail with 5xx errors.

Let’s set things up and run some tests to analyze the issue.

Step 1: Follow Steps 1–9 given at https://aws.plainenglish.io/cloudfront-blue-green-deployment-using-gitlab-where-origin-is-alb-eks-se-8f2d95b14ffd to create an nginx deployment (you don’t need the blue/green deployment for this test).

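Instead of index.html, create app.py with the following code:

from flask import Flask

app = Flask(__name__)


# Define a route for the home page
@app.route("/")
def home():
    return "Hello, World! Welcome to the Flask App!"


# Define a health check endpoint
@app.route("/health")
def health():
    return "OK", 200


if __name__ == "__main__":
    # Run the app with Flask's development server (local testing only;
    # the container starts the app through Gunicorn instead).
    # Assumption: the original snippet is cut off here, so the host and
    # port below are chosen to match the Dockerfile.
    app.run(host="0.0.0.0", port=80)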

and a Dockerfile like this:

# Use the official Python image as the base image
FROM python:3.9-slim

# Set the working directory in the container
WORKDIR /app

# Copy the application code to the container
COPY app.py /app/

# Install dependencies
RUN pip install flask gunicorn

# Expose the port that Gunicorn will run on
EXPOSE 80

# Command to start Gunicorn with 2 workers and proper signal handling
CMD ["gunicorn", "-w", "2", "-b", "0.0.0.0:80", "--timeout", "30", "--graceful-timeout", "30", "--log-level", "debug", "app:app"]

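Here we use Gunicorn to run the application. On SIGTERM, Gunicorn shuts down gracefully: workers get up to --graceful-timeout seconds (30 here) to finish in-flight requests before they are killed.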
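The same options can also live in a Gunicorn configuration file rather than on the command line. Below is a minimal sketch, assuming a file named gunicorn.conf.py placed next to app.py. This file is not part of the original setup, and the values simply mirror the CMD in the Dockerfile.

```python
# gunicorn.conf.py (hypothetical file; values mirror the Dockerfile CMD)

# Listen on all interfaces on port 80, matching EXPOSE 80 (-b 0.0.0.0:80)
bind = "0.0.0.0:80"

# Two worker processes (-w 2)
workers = 2

# Kill and restart a worker that stays silent for more than 30 seconds
# (--timeout 30)
timeout = 30

# After SIGTERM, give workers up to 30 seconds to finish in-flight requests
# (--graceful-timeout 30)
graceful_timeout = 30

# Verbose logging, useful while watching shutdown behaviour (--log-level debug)
loglevel = "debug"
```

With such a file copied into the image, the container command could be shortened to gunicorn -c gunicorn.conf.py app:app.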

Step 2: Edit the nginx deployment to add a preStop lifecycle hook and liveness/readiness probes.

spec:
  containers:
  - image: vinycoolguy/mypyapp:v14
    imagePullPolicy: IfNotPresent
    lifecycle:
      preStop:
        exec:
          command:
          - /bin/sleep
          - "10"
    livenessProbe:
      failureThreshold: 2
      httpGet:
        path: /
        port: 80
        scheme: HTTP
      initialDelaySeconds: 5
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    name: nginx
    ports:
    - containerPort: 80
      protocol: TCP
    readinessProbe:
      failureThreshold: 2
      httpGet:
        path: /health
        port: 80
        scheme: HTTP
      initialDelaySeconds: 5
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
  dnsPolicy: ClusterFirst
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  terminationGracePeriodSeconds: 30

Also change the target group's deregistration delay to 30 seconds (the same as the pod's terminationGracePeriodSeconds).
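One thing worth checking with these values is how the timers fit inside the pod's grace period. Kubernetes starts the terminationGracePeriodSeconds countdown before the preStop hook runs, so the 10-second sleep is spent from the same 30-second budget. The sketch below is only a back-of-the-envelope check, not a diagnosis of the errors seen in this test. The class and function names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class TerminationSettings:
    """Timing values from the setup above, all in seconds."""
    termination_grace_period: int = 30  # terminationGracePeriodSeconds
    pre_stop_sleep: int = 10            # preStop hook: /bin/sleep 10
    graceful_timeout: int = 30          # gunicorn --graceful-timeout


def time_left_after_sigterm(s: TerminationSettings) -> int:
    """Seconds between SIGTERM and SIGKILL.

    The grace period countdown begins before the preStop hook runs,
    so the sleep is subtracted from the same budget.
    """
    return max(s.termination_grace_period - s.pre_stop_sleep, 0)


def report(s: TerminationSettings) -> list[str]:
    """Return warnings for timers that do not fit inside the grace period."""
    warnings = []
    left = time_left_after_sigterm(s)
    if s.graceful_timeout > left:
        warnings.append(
            f"graceful timeout ({s.graceful_timeout}s) exceeds the "
            f"{left}s left after the preStop sleep"
        )
    return warnings


for line in report(TerminationSettings()):
    print(line)
```

With the values used in this post, the application has 20 seconds between SIGTERM and SIGKILL, which is less than the 30-second Gunicorn graceful timeout. Raising terminationGracePeriodSeconds or lowering the graceful timeout would make the two fit together.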

Also configure the target group's health check settings so that they match the readiness probe: the unhealthy threshold should equal the probe's failureThreshold (2) and the interval should equal its periodSeconds (10 seconds).
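The reason for matching these values is that Kubernetes and the load balancer then take roughly the same time to stop treating a failing pod as healthy. A rough worked calculation is sketched below. The helper name is hypothetical, and the estimate ignores probe timeouts and where in the interval the first failure lands.

```python
def seconds_to_mark_unhealthy(interval_seconds: int, failure_threshold: int) -> int:
    """Approximate time for consecutive failed checks to mark a target unhealthy."""
    return interval_seconds * failure_threshold


# Readiness probe values from the deployment spec: periodSeconds=10, failureThreshold=2
readiness = seconds_to_mark_unhealthy(interval_seconds=10, failure_threshold=2)

# Target group values chosen to mirror the probe: interval=10, unhealthy threshold=2
target_group = seconds_to_mark_unhealthy(interval_seconds=10, failure_threshold=2)

print(f"readiness probe: ~{readiness}s, target group: ~{target_group}s")
```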

Now do a rollout restart or a fresh deployment and monitor the status. In my first couple of deployments, I didn’t see any 5xx errors.
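One way to monitor the status is to send requests continuously while the rollout is in progress and count the response codes. The following is a minimal sketch using only the Python standard library. The function names and the pluggable fetch parameter are illustrative choices, not something used in the original test.

```python
import time
import urllib.error
import urllib.request
from collections import Counter
from typing import Callable


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET request, or 0 if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses are raised as HTTPError
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, DNS failure, or timeout


def poll(
    url: str,
    requests: int,
    interval: float = 0.2,
    fetch: Callable[[str], int] = fetch_status,
) -> Counter:
    """Send `requests` GET requests to `url` and count the status codes seen."""
    counts: Counter = Counter()
    for _ in range(requests):
        counts[fetch(url)] += 1
        time.sleep(interval)
    return counts
```

Start something like poll("http://<your-endpoint>/", requests=300) in one terminal, with the placeholder replaced by your load balancer URL, and trigger the rollout restart in another. Any 502/504 keys in the returned counter are requests that failed during the rollout.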

Over repeated deployments, however, I occasionally got a few 502/504 errors. So if you have any suggestions to make this setup more robust, do let me know in the comments.

Written by Vinayak Pandey

Experienced Cloud Engineer with a knack for automation. LinkedIn profile: https://www.linkedin.com/in/vinayakpandeyit/