Kubernetes in Production: What They Don’t Tell You
Hard-won lessons from running Kubernetes in production, from resource management to debugging failures at 3 AM.
The Hype vs Reality
Kubernetes solves real problems. But it also introduces complexity that catches teams off guard. After migrating 30+ services to Kubernetes and carrying the pager for the production incidents that followed, here’s what I wish someone had told me.
Resource Management: Get This Right First
CPU and Memory Requests vs Limits
This is where most teams get burned:
resources:
  requests:
    memory: "256Mi"   # Guaranteed amount
    cpu: "100m"       # 0.1 CPU core
  limits:
    memory: "512Mi"   # Maximum allowed
    cpu: "500m"       # Can burst to 0.5 cores
Key lessons:
- Set requests based on actual usage - Profile your app under load
- Memory limits kill pods - If you hit the limit, pod gets OOMKilled
- CPU limits cause throttling - Your app will slow down, not crash
- Different for different workloads - Web servers need different resources than batch jobs
The OOMKilled Disaster
Our payment service kept crashing. Logs showed nothing. Turns out:
$ kubectl describe pod payment-service-xxx
...
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
The memory limit was too low. Under load, memory usage spiked and Kubernetes killed the pod (exit code 137 is 128 + SIGKILL, the signature of an OOM kill).
Fix: Increased memory limit from 256Mi to 512Mi based on actual metrics.
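For reference, the change itself was a one-line bump in the Deployment’s resource stanza (values mirror the numbers above):

resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi"   # was 256Mi; raised after profiling under load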
Deployment Strategies
Rolling Updates Done Right
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1          # How many extra pods during update
    maxUnavailable: 0    # Keep all pods running during update
Critical: Set maxUnavailable: 0 for production services so a rollout never leaves you with fewer than the desired number of replicas running.
Health Checks That Actually Work
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
Liveness vs Readiness:
- Liveness: Is the app alive? If not, restart it
- Readiness: Is the app ready for traffic? If not, remove from load balancer
Don’t make liveness too aggressive or you’ll get crash loops.
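One way to keep liveness forgiving is to require several consecutive failures before a restart; a sketch, with numbers you should tune to your app’s startup and recovery profile:

livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3   # restart only after 3 consecutive failures (~30s of grace)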
Debugging Production Issues
The CrashLoopBackOff Nightmare
Your pod keeps restarting. Here’s the debugging process:
# Check pod status
kubectl get pods
# Get recent events
kubectl describe pod problem-pod-xxx
# Check logs (including previous crashes)
kubectl logs problem-pod-xxx --previous
# Get a shell if pod is running long enough
kubectl exec -it problem-pod-xxx -- /bin/sh
Common causes:
- App crashes on startup (check logs)
- Liveness probe failing (check probe config)
- OOMKilled (check memory limits)
- Missing config/secrets (check environment)
Network Policies Gone Wrong
Traffic between services suddenly stopped working. Turns out we added a NetworkPolicy that was too restrictive:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-ingress
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend   # Only allow frontend pods
      ports:
        - port: 8080
Lesson: Start with permissive policies, then lock down incrementally.
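A reasonable starting point is an explicit allow-all ingress policy for the namespace, which you then replace with narrower rules like the one above:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-all-ingress
spec:
  podSelector: {}   # applies to every pod in the namespace
  policyTypes:
    - Ingress
  ingress:
    - {}            # a single empty rule allows traffic from anywhere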
ConfigMaps and Secrets
ConfigMaps for Configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  database_host: "postgres.default.svc.cluster.local"
  log_level: "info"
  feature_flags: |
    {
      "new_checkout": true,
      "beta_features": false
    }
Mount as environment variables or files:
envFrom:
  - configMapRef:
      name: app-config
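Or mount the same ConfigMap as files, which works better for larger blobs like the feature_flags JSON; the mount path here is illustrative:

# In the pod spec:
volumes:
  - name: config
    configMap:
      name: app-config
# In the container spec:
volumeMounts:
  - name: config
    mountPath: /etc/app-config   # each key becomes a file under this directory
    readOnly: true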
Secrets for Sensitive Data
apiVersion: v1
kind: Secret
metadata:
  name: db-credentials
type: Opaque
data:
  username: YWRtaW4=       # base64 encoded
  password: cGFzc3dvcmQ=
Important: Base64 is NOT encryption. Use external secret management (AWS Secrets Manager, HashiCorp Vault) for production.
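Pods consume a Secret much like a ConfigMap; for example, injecting one key as an environment variable (the variable name is illustrative):

env:
  - name: DB_PASSWORD
    valueFrom:
      secretKeyRef:
        name: db-credentials
        key: password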
Horizontal Pod Autoscaling
Scale based on CPU, memory, or custom metrics:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
Gotcha: HPA needs metrics-server installed. If pods don’t scale, check if metrics-server is running.
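A quick sanity check (metrics-server normally lives in kube-system):

# Is metrics-server deployed and ready?
kubectl get deployment metrics-server -n kube-system
# Does the metrics API respond at all?
kubectl top nodes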
Persistent Storage
StatefulSets for Databases
Don’t run databases as a plain Deployment. Use a StatefulSet:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:14
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 100Gi
StatefulSets provide:
- Stable network identities (postgres-0, postgres-1)
- Stable persistent storage
- Ordered deployment and scaling
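The stable DNS names come from the governing headless Service that serviceName points at; a minimal sketch:

apiVersion: v1
kind: Service
metadata:
  name: postgres
spec:
  clusterIP: None      # headless: per-pod DNS records, no load-balanced VIP
  selector:
    app: postgres
  ports:
    - port: 5432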
PersistentVolumeClaims
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-storage
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: gp3   # AWS EBS gp3
Storage classes matter: gp3 is cheaper and faster than gp2; use io2 for high-IOPS workloads.
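If your cluster doesn’t already expose a gp3 class, you can define one yourself; this sketch assumes the AWS EBS CSI driver is installed:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
allowVolumeExpansion: true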
Monitoring and Observability
Prometheus for Metrics
Expose metrics in your app:
import "github.com/prometheus/client_golang/prometheus"
var (
httpRequestsTotal = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "http_requests_total",
Help: "Total HTTP requests",
},
[]string{"method", "endpoint", "status"},
)
)
// Instrument handlers
httpRequestsTotal.WithLabelValues("GET", "/api/users", "200").Inc()
Prometheus then scrapes the /metrics endpoint on its own, as long as your scrape configuration discovers the pod (for example via scrape annotations or a ServiceMonitor).
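With the common annotation-based discovery convention (this depends entirely on how your Prometheus scrape config is written, so treat it as an example rather than built-in behavior), the pod template advertises where to scrape:

metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"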
Logging Best Practices
import log "github.com/sirupsen/logrus"

log.SetFormatter(&log.JSONFormatter{})   // emit JSON so fields are indexable
log.WithFields(log.Fields{
    "request_id":  requestID,
    "user_id":     userID,
    "duration_ms": duration,
}).Info("Request completed")
Use structured logging (JSON) so you can search/filter in CloudWatch/ELK.
Cost Optimization
Right-Size Your Pods
We wasted $3000/month on over-provisioned pods:
# Check actual resource usage
kubectl top pods
# Compare to requests/limits
kubectl describe pod xxx | grep -A 5 "Requests:"
If actual usage is much lower than requests, reduce requests.
Cluster Autoscaler
Scale nodes based on pending pods:
# Cluster Autoscaler is configured with flags on its own Deployment, not a ConfigMap.
# The node-group name below is illustrative; the min/max bounds and scale-down
# delay correspond to the values we actually run with.
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:<version>
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --nodes=2:10:my-node-group-asg       # min:max:node-group name
      - --scale-down-delay-after-add=10m
Nodes are added when pods can’t be scheduled and removed when capacity sits idle.
Lessons Learned
- Start simple - Don’t use every K8s feature on day one
- Resource limits are crucial - Prevent noisy neighbor problems
- Health checks save lives - Catch issues before users do
- Logs are essential - You’ll need them at 3 AM
- Cost adds up fast - Monitor and optimize regularly
When NOT to Use Kubernetes
Kubernetes is overkill if you:
- Have < 5 services
- Don’t need auto-scaling
- Have a small team (< 3 engineers)
- Run mostly serverless workloads
Consider ECS, Cloud Run, or App Runner instead.
Conclusion
Kubernetes solves real orchestration problems, but it’s complex. The payoff comes when you need its features: auto-scaling, self-healing, rolling updates, and multi-cloud portability.
Invest time in understanding the primitives (Pods, Services, Deployments). Get resource management right. Build good observability. Then Kubernetes becomes a powerful platform rather than a source of 3 AM pages.