
Kubernetes Best Practices for Production Workloads
Why Production Kubernetes Is Different
Getting a Kubernetes cluster running locally is straightforward. Running it in production — with real traffic, real SLAs, and real cost pressure — is a different challenge entirely. Most teams learn this the hard way after their first incident at 2am.
At Let'sOps, we've operated production Kubernetes clusters for startups and established companies across the MENA region. Here are the practices that consistently make the difference.
1. Resource Requests and Limits
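Every container should declare CPU and memory requests and limits. The scheduler places pods based on their requests, so without them it can't make informed decisions; without limits, a single container can starve its neighbors on the node. The result is unpredictable evictions and OOM kills.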
resources:
  requests:
    cpu: "100m"
    memory: "128Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"
Start with requests set to your average usage and limits at 2-3x that figure; treat the values above as placeholders, not recommendations. Monitor actual consumption and adjust over time.
2. Pod Disruption Budgets
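PDBs prevent Kubernetes from taking down too many pods at once during voluntary disruptions such as node drains for upgrades or cluster-autoscaler scale-downs. Without them, draining several nodes can evict every replica of a service at the same time and take it offline temporarily. (A Deployment's own rolling update is bounded by its rollout strategy, not by the PDB.)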
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api-server
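A PDB only limits evictions. To keep the same service available during its own rollouts, the Deployment's update strategy can be constrained as well. The sketch below is one possible pairing with the PDB above; the replica count, image name, and surge values are assumptions for illustration, not values from a real cluster.

```yaml
# Hypothetical Deployment matching the PDB's selector (app: api-server).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3                 # assumed; must exceed the PDB's minAvailable of 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0       # never drop below the desired replica count
      maxSurge: 1             # add one new pod at a time
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
        - name: api-server
          image: example.com/api-server:1.0.0   # placeholder image
```

With three replicas and minAvailable: 2, a node drain can evict at most one of these pods at a time.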
3. Health Checks That Matter
Liveness and readiness probes are not optional. A liveness probe that always returns 200 is worse than no probe at all — it hides real failures.
- Liveness probe: checks if the process is alive. Failure triggers a restart.
- Readiness probe: checks if the pod can serve traffic. Failure removes it from the Service's endpoints until the probe passes again.
- Startup probe: gives slow-starting containers time to initialize without being killed.
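The three probe types can be combined on a single container. The fragment below is a minimal sketch; the /healthz and /ready paths, port 8080, the image name, and all timings are assumptions to adapt to your application.

```yaml
# Container spec fragment (goes under spec.template.spec in a Deployment).
containers:
  - name: api-server
    image: example.com/api-server:1.0.0   # placeholder image
    ports:
      - containerPort: 8080
    startupProbe:               # liveness and readiness checks wait until this succeeds
      httpGet:
        path: /healthz          # assumed endpoint
        port: 8080
      periodSeconds: 5
      failureThreshold: 30      # allows up to 150s for startup
    livenessProbe:              # repeated failure restarts the container
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
    readinessProbe:             # failure stops traffic without restarting
      httpGet:
        path: /ready            # assumed endpoint that also checks dependencies
        port: 8080
      periodSeconds: 5
      failureThreshold: 2
```

Keeping the liveness endpoint cheap and free of external dependencies, and putting dependency checks in the readiness endpoint, avoids restart loops when a downstream service is the one that is failing.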
4. Namespace Isolation
Don't run everything in the default namespace. Use namespaces to isolate teams, environments, or services. Combine with NetworkPolicies to enforce service-to-service communication rules.
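As a rough sketch of this pattern, the manifests below create a namespace, deny all ingress to it by default, and then allow one specific caller. The namespace names (payments, frontend), the app label, and the port are hypothetical. NetworkPolicies are only enforced if the cluster's network plugin supports them.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: payments              # hypothetical team namespace
---
# Default-deny: selects every pod in the namespace and allows no ingress.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: payments
spec:
  podSelector: {}
  policyTypes:
    - Ingress
---
# Allow pods in the "frontend" namespace to reach api-server on TCP 8080.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: api-server
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: frontend   # label set automatically on namespaces
      ports:
        - protocol: TCP
          port: 8080
```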
5. Observability From Day One
You can't manage what you can't see. Set up metrics (Prometheus), logs (Loki or ELK), and traces (Jaeger or Tempo) before your first production deployment — not after your first incident.
Key metrics to watch: pod restart count, request latency p99, error rate, node CPU/memory pressure, and persistent volume usage.
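Two of these metrics can be turned into alerts with a Prometheus rule file along the following lines. This is a sketch under assumptions: it presumes kube-state-metrics and kubelet metrics are being scraped, and the thresholds, durations, and alert names are illustrative only.

```yaml
groups:
  - name: kubernetes-workloads
    rules:
      # Pod restart count (metric exposed by kube-state-metrics).
      - alert: PodRestartingFrequently
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} in {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
      # Persistent volume usage (metrics exposed by the kubelet).
      - alert: PersistentVolumeAlmostFull
        expr: kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes < 0.10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "PVC {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} has less than 10% free space"
```

Latency and error-rate alerts depend on the metric names your application or ingress exposes, so they are left out here.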
Bottom Line
Production Kubernetes isn't about complexity — it's about consistency. These five practices form the foundation. Get them right, and everything else (autoscaling, CI/CD, multi-cluster) becomes much easier to build on top of.