Kubernetes in Production: What They Don’t Tell You
Hard-won lessons from running Kubernetes in production, from resource management to debugging failures at 3 AM.
The Hype vs Reality
Kubernetes solves real problems. But it also introduces complexity that catches teams off guard. After migrating 30+ services to Kubernetes and carrying the pager for the production incidents that followed, here’s what I wish someone had told me.
Resource Management: Get This Right First
CPU and Memory Requests vs Limits
This is where most teams get burned:
resources:
  requests:
    memory: "256Mi"   # Guaranteed amount
    cpu: "100m"       # 0.1 CPU core
  limits:
    memory: "512Mi"   # Maximum allowed
    cpu: "500m"       # Can burst to 0.5 cores
Key lessons:
- Set requests based on actual usage - Profile your app under load
- Memory limits kill pods - If you hit the limit, pod gets OOMKilled
- CPU limits cause throttling - Your app will slow down, not crash
- Different for different workloads - Web servers need different resources than batch jobs
The OOMKilled Disaster
Our payment service kept crashing. Logs showed nothing. Turns out:
$ kubectl describe pod payment-service-xxx
...
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
The memory limit was too low. Under load, memory usage spiked and Kubernetes killed the pod (exit code 137 is 128 + SIGKILL, the signature of an OOM kill).
Fix: Increased memory limit from 256Mi to 512Mi based on actual metrics.
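For reference, the change itself was a one-line bump in the Deployment’s resource stanza (values mirror the numbers above):

resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi"   # was 256Mi; raised after profiling under load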
Deployment Strategies
Rolling Updates Done Right
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1          # How many extra pods during update
    maxUnavailable: 0    # Keep all pods running during update
Critical: Set maxUnavailable: 0 for production services so a rollout never leaves you with fewer than the desired number of replicas running.
Health Checks That Actually Work
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
Liveness vs Readiness:
- Liveness: Is the app alive? If not, restart it
- Readiness: Is the app ready for traffic? If not, remove from load balancer
Don’t make liveness too aggressive or you’ll get crash loops.
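One way to keep liveness forgiving is to require several consecutive failures before a restart; a sketch, with numbers you should tune to your app’s startup and recovery profile:

livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3   # restart only after 3 consecutive failures (~30s of grace)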
Debugging Production Issues
The CrashLoopBackOff Nightmare
Your pod keeps restarting. Here’s the debugging process:
# Check pod status
kubectl get pods
# Get recent events
kubectl describe pod problem-pod-xxx
# Check logs (including previous crashes)
kubectl logs problem-pod-xxx --previous
# Get a shell if pod is running long enough
kubectl exec -it problem-pod-xxx -- /bin/sh
Common causes:
- App crashes on startup (check logs)
- Liveness probe failing (check probe config)
- OOMKilled (check memory limits)
- Missing config/secrets (check environment)
Network Policies Gone Wrong
Traffic between services suddenly stopped working. Turns out we added a NetworkPolicy that was too restrictive:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-ingress
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend   # Only allow frontend pods
      ports:
        - port: 8080
Lesson: Start with permissive policies, then lock down incrementally.
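A reasonable starting point is an explicit allow-all ingress policy for the namespace, which you then replace with narrower rules like the one above:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-all-ingress
spec:
  podSelector: {}   # applies to every pod in the namespace
  policyTypes:
    - Ingress
  ingress:
    - {}            # a single empty rule allows traffic from anywhere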
ConfigMaps and Secrets
ConfigMaps for Configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  database_host: "postgres.default.svc.cluster.local"
  log_level: "info"
  feature_flags: |
    {
      "new_checkout": true,
      "beta_features": false
    }
Mount as environment variables or files:
envFrom:
  - configMapRef:
      name: app-config
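Or mount the same ConfigMap as files, which works better for larger blobs like the feature_flags JSON; the mount path here is illustrative:

# In the pod spec:
volumes:
  - name: config
    configMap:
      name: app-config
# In the container spec:
volumeMounts:
  - name: config
    mountPath: /etc/app-config   # each key becomes a file under this directory
    readOnly: true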
Secrets for Sensitive Data
apiVersion: v1
kind: Secret
metadata:
  name: db-credentials
type: Opaque
data:
  username: YWRtaW4=       # base64 encoded
  password: cGFzc3dvcmQ=
Important: Base64 is NOT encryption. Use external secret management (AWS Secrets Manager, HashiCorp Vault) for production.
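Pods consume a Secret much like a ConfigMap; for example, injecting one key as an environment variable (the variable name is illustrative):

env:
  - name: DB_PASSWORD
    valueFrom:
      secretKeyRef:
        name: db-credentials
        key: password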
Horizontal Pod Autoscaling
Scale based on CPU, memory, or custom metrics:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
Gotcha: HPA needs metrics-server installed. If pods don’t scale, check if metrics-server is running.
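A quick sanity check (metrics-server normally lives in kube-system):

# Is metrics-server deployed and ready?
kubectl get deployment metrics-server -n kube-system
# Does the metrics API respond at all?
kubectl top nodes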
Persistent Storage
StatefulSets for Databases
Don’t run databases as a plain Deployment. Use a StatefulSet:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:14
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 100Gi
StatefulSets provide:
- Stable network identities (postgres-0, postgres-1)
- Stable persistent storage
- Ordered deployment and scaling
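The stable DNS names come from the governing headless Service that serviceName points at; a minimal sketch:

apiVersion: v1
kind: Service
metadata:
  name: postgres
spec:
  clusterIP: None      # headless: per-pod DNS records, no load-balanced VIP
  selector:
    app: postgres
  ports:
    - port: 5432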
PersistentVolumeClaims
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-storage
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: gp3   # AWS EBS gp3
Storage classes matter: gp3 is cheaper and faster than gp2; use io2 for high-IOPS workloads.
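If your cluster doesn’t already expose a gp3 class, you can define one yourself; this sketch assumes the AWS EBS CSI driver is installed:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
allowVolumeExpansion: true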
Monitoring and Observability
Prometheus for Metrics
Expose metrics in your app:
import "github.com/prometheus/client_golang/prometheus"
var (
httpRequestsTotal = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "http_requests_total",
Help: "Total HTTP requests",
},
[]string{"method", "endpoint", "status"},
)
)
// Instrument handlers
httpRequestsTotal.WithLabelValues("GET", "/api/users", "200").Inc()
Prometheus then scrapes the /metrics endpoint on its own, as long as your scrape configuration discovers the pod (for example via scrape annotations or a ServiceMonitor).
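With the common annotation-based discovery convention (this depends entirely on how your Prometheus scrape config is written, so treat it as an example rather than built-in behavior), the pod template advertises where to scrape:

metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"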
Logging Best Practices
import log "github.com/sirupsen/logrus"

log.SetFormatter(&log.JSONFormatter{})   // emit JSON so fields are indexable
log.WithFields(log.Fields{
    "request_id":  requestID,
    "user_id":     userID,
    "duration_ms": duration,
}).Info("Request completed")
Use structured logging (JSON) so you can search/filter in CloudWatch/ELK.
Cost Optimization
Right-Size Your Pods
We wasted $3000/month on over-provisioned pods:
# Check actual resource usage
kubectl top pods
# Compare to requests/limits
kubectl describe pod xxx | grep -A 5 "Requests:"
If actual usage is much lower than requests, reduce requests.
Cluster Autoscaler
Scale nodes based on pending pods:
# Cluster Autoscaler is configured with flags on its own Deployment, not a ConfigMap.
# The node-group name below is illustrative; the min/max bounds and scale-down
# delay correspond to the values we actually run with.
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:<version>
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --nodes=2:10:my-node-group-asg       # min:max:node-group name
      - --scale-down-delay-after-add=10m
Nodes are added when pods can’t be scheduled and removed when capacity sits idle.
Lessons Learned
- Start simple - Don’t use every K8s feature on day one
- Resource limits are crucial - Prevent noisy neighbor problems
- Health checks save lives - Catch issues before users do
- Logs are essential - You’ll need them at 3 AM
- Cost adds up fast - Monitor and optimize regularly
When NOT to Use Kubernetes
Kubernetes is overkill if you:
- Have < 5 services
- Don’t need auto-scaling
- Have a small team (< 3 engineers)
- Run mostly serverless workloads
Consider ECS, Cloud Run, or App Runner instead.
Conclusion
Kubernetes solves real orchestration problems, but it’s complex. The payoff comes when you need its features: auto-scaling, self-healing, rolling updates, and multi-cloud portability.
Invest time in understanding the primitives (Pods, Services, Deployments). Get resource management right. Build good observability. Then Kubernetes becomes a powerful platform rather than a source of 3 AM pages.