System Design: Designing for Scale

March 30, 2025 · simon balfe

Practical patterns for building systems that handle millions of users, from caching strategies to database sharding.

When 10 Users Becomes 10 Million

Our application worked great with 10 concurrent users. At 100, we saw occasional slowdowns. At 1,000, the database was the bottleneck. At 10,000, everything fell apart.

This is the story of how we redesigned for scale and the system design patterns that got us to millions of users.

The Scaling Journey

Phase 1: Single Server (0-1K users)

Everything on one machine:

[Users]
   |
[Single Server: App + DB + Cache]

Limitations:

  • Single point of failure
  • Vertical scaling only (limited)
  • Database becomes bottleneck

Phase 2: Separate Database (1K-10K users)

Move database to dedicated server:

[Load Balancer]
      |
[App Servers (2-3)]
      |
[Database Server]

Benefits:

  • Independent scaling
  • Better resource allocation
  • Still manageable

New problems:

  • Database still single point of failure
  • Network latency between app and DB
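
With the database on its own server, every query now crosses the network, so each app server should manage its connections deliberately. Here is a minimal sketch using Go's database/sql connection pool; the driver, DSN, and limits are illustrative assumptions:

import (
    "database/sql"
    "time"

    _ "github.com/lib/pq" // assuming Postgres; any database/sql driver works
)

func newDBPool(dsn string) (*sql.DB, error) {
    db, err := sql.Open("postgres", dsn)
    if err != nil {
        return nil, err
    }

    db.SetMaxOpenConns(50)                  // cap concurrent connections per app server
    db.SetMaxIdleConns(10)                  // keep a few warm to avoid reconnect latency
    db.SetConnMaxLifetime(30 * time.Minute) // recycle long-lived connections

    // Fail fast at startup if the database server is unreachable
    if err := db.Ping(); err != nil {
        return nil, err
    }
    return db, nil
}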

Caching Strategy

Phase 3: Add Caching Layer (10K-100K users)

Introduce Redis for frequently accessed data:

[Load Balancer]
      |
[App Servers (5-10)]
      |
   [Redis]    [Database]

What to cache:

  • User sessions
  • API responses (with TTL)
  • Database query results
  • Computed values

Implementation:

func GetUser(userID string) (*User, error) {
    // 1. Check cache
    cached, err := redis.Get(ctx, fmt.Sprintf("user:%s", userID))
    if err == nil {
        var user User
        if err := json.Unmarshal([]byte(cached), &user); err == nil {
            return &user, nil
        }
        // Corrupt cache entry: fall through to the database
    }
    
    // 2. Cache miss: query database
    user, err := db.QueryUser(userID)
    if err != nil {
        return nil, err
    }
    
    // 3. Store in cache (TTL: 5 minutes)
    data, _ := json.Marshal(user)
    redis.Set(ctx, fmt.Sprintf("user:%s", userID), data, 5*time.Minute)
    
    return user, nil
}

Cache Invalidation Strategies

As the old joke goes, there are only two hard problems in computer science: cache invalidation and naming things.

Time-based (TTL):

redis.Set(ctx, key, value, 5*time.Minute)
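
A common refinement is to add a little jitter to the TTL so that keys cached at the same moment don't all expire, and miss, together:

// Spread expirations over an extra 0-60s so a batch of keys doesn't miss at once
ttl := 5*time.Minute + time.Duration(rand.Intn(60))*time.Second
redis.Set(ctx, key, value, ttl)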

Event-based:

func UpdateUser(user *User) error {
    // Update database
    err := db.UpdateUser(user)
    if err != nil {
        return err
    }
    
    // Invalidate cache
    redis.Del(ctx, fmt.Sprintf("user:%s", user.ID))
    return nil
}

Write-through:

func UpdateUser(user *User) error {
    // Update database
    err := db.UpdateUser(user)
    if err != nil {
        return err
    }
    
    // Update cache immediately
    data, _ := json.Marshal(user)
    redis.Set(ctx, fmt.Sprintf("user:%s", user.ID), data, 5*time.Minute)
    return nil
}

Real-World Impact

After adding Redis:

  • Database load: Down 70%
  • API response time: 200ms → 45ms (4.4x faster)
  • Peak throughput: 500 req/s → 2,000 req/s (4x improvement)

Database Replication

Phase 4: Read Replicas (100K-1M users)

Most applications are read-heavy (often on the order of 90% reads to 10% writes), so reads are the first thing to scale out. Use read replicas:

[Load Balancer]
      |
[App Servers (10-20)]
      |
[Redis Cache]
      |
Primary DB ──→ Read Replica 1
     ├───────→ Read Replica 2
     └───────→ Read Replica 3

Implementation:

type DBCluster struct {
    primary  *sql.DB
    replicas []*sql.DB
    nextIdx  uint64 // advanced atomically for round-robin reads
}

func (c *DBCluster) Write(query string, args ...interface{}) error {
    // All writes go to primary
    _, err := c.primary.Exec(query, args...)
    return err
}

func (c *DBCluster) Read(query string, args ...interface{}) (*sql.Rows, error) {
    // Round-robin across replicas (atomic counter keeps this goroutine-safe)
    idx := atomic.AddUint64(&c.nextIdx, 1) % uint64(len(c.replicas))
    return c.replicas[idx].Query(query, args...)
}

Replication Lag:

Replicas aren’t instantly up-to-date. Typical lag: 1-100ms.

func CreatePost(post *Post) error {
    // Write to primary
    if err := db.Primary.Insert(post); err != nil {
        return err
    }
    
    // Read immediately after write?
    // MUST read from primary to avoid stale data from a lagging replica
    _, err := db.Primary.QueryOne(post.ID)
    return err
}

Benefits:

  • Horizontal scaling for reads
  • High availability (if primary fails, promote replica)
  • Geographically distributed replicas reduce latency

Load Balancing

Layer 7 (Application) Load Balancing

Route requests based on content:

[Application Load Balancer]
      |
      ├─→ /api/orders → Order Service
      ├─→ /api/users  → User Service
      └─→ /api/search → Search Service

ALB Configuration:

resource "aws_lb_listener_rule" "orders" {
  listener_arn = aws_lb_listener.app.arn
  priority     = 100

  action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.orders.arn
  }

  condition {
    path_pattern {
      values = ["/api/orders/*"]
    }
  }
}

Health Checks

Load balancers need to know which instances are healthy:

// Health check endpoint
http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
    // Check database connectivity
    err := db.Ping()
    if err != nil {
        w.WriteHeader(http.StatusServiceUnavailable)
        return
    }
    
    // Check Redis connectivity
    err = redis.Ping(ctx)
    if err != nil {
        w.WriteHeader(http.StatusServiceUnavailable)
        return
    }
    
    w.WriteHeader(http.StatusOK)
    w.Write([]byte("healthy"))
})

Sticky Sessions

Sometimes you need requests from the same user to go to the same server:

User A ──→ Server 1 (session stored)
User A ──→ Server 1 (same server!)
User B ──→ Server 2
User B ──→ Server 2 (same server!)

Implementation: Cookie-based routing or consistent hashing.

Better approach: Store sessions in Redis (shared state), then any server can handle any request.
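
As a sketch of what that shared store can look like, here is a session helper using go-redis (the client variable rdb, the Session shape, and the 24-hour TTL are illustrative assumptions):

import (
    "context"
    "encoding/json"
    "fmt"
    "time"

    "github.com/redis/go-redis/v9"
)

var (
    ctx = context.Background()
    rdb = redis.NewClient(&redis.Options{Addr: "localhost:6379"}) // assumed address
)

type Session struct {
    UserID    string    `json:"user_id"`
    CreatedAt time.Time `json:"created_at"`
}

func SaveSession(sessionID string, s *Session) error {
    data, err := json.Marshal(s)
    if err != nil {
        return err
    }
    // Any app server can read this session later, so no sticky routing is needed
    return rdb.Set(ctx, fmt.Sprintf("session:%s", sessionID), data, 24*time.Hour).Err()
}

func GetSession(sessionID string) (*Session, error) {
    data, err := rdb.Get(ctx, fmt.Sprintf("session:%s", sessionID)).Result()
    if err != nil {
        return nil, err // redis.Nil means the session expired or never existed
    }

    var s Session
    if err := json.Unmarshal([]byte(data), &s); err != nil {
        return nil, err
    }
    return &s, nil
}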

Database Sharding

Phase 5: Sharding (1M+ users)

When a single database can’t handle the load, shard it:

[App Servers]
      |
      ├─→ Shard 1 (Users A-F)
      ├─→ Shard 2 (Users G-M)
      ├─→ Shard 3 (Users N-S)
      └─→ Shard 4 (Users T-Z)

Sharding Strategies:

1. Range-based sharding:

func GetShardForUser(userID string) *Database {
    firstLetter := strings.ToUpper(userID[0:1])
    
    switch {
    case firstLetter <= "F":
        return shard1
    case firstLetter <= "M":
        return shard2
    case firstLetter <= "S":
        return shard3
    default:
        return shard4
    }
}

2. Hash-based sharding:

func GetShardForUser(userID string) *Database {
    hash := crc32.ChecksumIEEE([]byte(userID))
    shardIdx := hash % uint32(len(shards))
    return shards[shardIdx]
}

3. Geographic sharding:

func GetShardForUser(region string) *Database {
    switch region {
    case "us-east":
        return usEastShard
    case "eu-west":
        return euWestShard
    case "ap-south":
        return apSouthShard
    default:
        return defaultShard
    }
}

Challenges:

  • Cross-shard queries are complex
  • Rebalancing shards is difficult (see the consistent-hashing sketch below)
  • Application complexity increases
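
On the rebalancing point: with the modulo scheme above, adding a shard changes hash % N for almost every key, so nearly all data has to move. Consistent hashing (the same idea mentioned earlier for sticky sessions) limits that to roughly 1/N of the keys. A rough sketch, not a production router:

import (
    "fmt"
    "hash/crc32"
    "sort"
)

// hashRing assigns keys to shards with consistent hashing: each shard owns many
// points on a ring, and a key belongs to the first point clockwise from its hash.
type hashRing struct {
    points []uint32          // sorted hashes of virtual nodes
    owners map[uint32]string // ring point -> shard name
}

func newHashRing(shards []string, virtualNodes int) *hashRing {
    r := &hashRing{owners: make(map[uint32]string)}
    for _, shard := range shards {
        for i := 0; i < virtualNodes; i++ {
            p := crc32.ChecksumIEEE([]byte(fmt.Sprintf("%s#%d", shard, i)))
            r.points = append(r.points, p)
            r.owners[p] = shard
        }
    }
    sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
    return r
}

// GetShard returns the shard responsible for the given key.
func (r *hashRing) GetShard(key string) string {
    h := crc32.ChecksumIEEE([]byte(key))
    // Find the first ring point at or after the key's hash, wrapping around to 0
    i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
    if i == len(r.points) {
        i = 0
    }
    return r.owners[r.points[i]]
}

Adding a shard then only takes over the ring segments its virtual nodes land on, instead of reshuffling nearly every key.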

When to shard:

  • Single database can’t handle write load
  • Data doesn’t fit on one server
  • You’ve exhausted vertical scaling

CDN for Static Assets

Phase 6: CDN (Global scale)

Serve static assets from edge locations near users:

User in Tokyo ─→ CloudFront Edge (Tokyo) ─→ Origin (US)
                     ↓ (cached)
                Returns from edge

What to cache:

  • Images, videos
  • CSS, JavaScript bundles
  • API responses (with appropriate TTL)

CloudFront Configuration:

resource "aws_cloudfront_distribution" "main" {
  enabled = true

  origin {
    domain_name = aws_s3_bucket.assets.bucket_regional_domain_name
    origin_id   = "S3-Assets"
  }

  default_cache_behavior {
    allowed_methods  = ["GET", "HEAD"]
    cached_methods   = ["GET", "HEAD"]
    target_origin_id = "S3-Assets"

    forwarded_values {
      query_string = false
      cookies {
        forward = "none"
      }
    }

    min_ttl     = 0
    default_ttl = 3600  # 1 hour
    max_ttl     = 86400 # 24 hours
  }

  restrictions {
    geo_restriction {
      restriction_type = "none"
    }
  }

  viewer_certificate {
    cloudfront_default_certificate = true
  }
}

Benefits:

  • Reduced latency (assets served from nearby edge)
  • Reduced load on origin servers
  • Better user experience globally
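
If API responses are routed through the CDN as well, the origin decides how long the edge may cache them by setting Cache-Control headers (CloudFront clamps these between the min and max TTLs configured above). A small sketch of opting a route into edge caching; the route and listProductsHandler are illustrative:

// cacheFor marks a handler's responses as publicly cacheable for the given TTL,
// so the CDN (and browsers) can reuse them until max-age expires.
func cacheFor(ttl time.Duration, next http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        w.Header().Set("Cache-Control", fmt.Sprintf("public, max-age=%d", int(ttl.Seconds())))
        next(w, r)
    }
}

// Popular, rarely-changing endpoints are the best candidates
http.HandleFunc("/api/products", cacheFor(60*time.Second, listProductsHandler))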

Asynchronous Processing

Message Queues for Background Jobs

Don’t make users wait for slow operations:

User Request ─→ API Server ─→ Response (fast!)
                     |
                     ├─→ Queue: Send Email
                     ├─→ Queue: Process Image
                     └─→ Queue: Update Analytics

                        [Worker Processes]

Implementation with SQS:

// Enqueue job
type EmailJob struct {
    UserID    string
    Template  string
    Variables map[string]string
}

func SendWelcomeEmail(userID string) error {
    // Fetch the user so the template has a name to work with
    user, err := GetUser(userID)
    if err != nil {
        return err
    }
    
    job := EmailJob{
        UserID:   userID,
        Template: "welcome",
        Variables: map[string]string{
            "name": user.Name,
        },
    }
    
    data, _ := json.Marshal(job)
    _, err = sqs.SendMessage(&sqs.SendMessageInput{
        QueueUrl:    aws.String(queueURL),
        MessageBody: aws.String(string(data)),
    })
    
    return err
}

// Worker process
func ProcessEmailQueue() {
    for {
        // Receive messages (long polling keeps empty receives cheap)
        result, err := sqs.ReceiveMessage(&sqs.ReceiveMessageInput{
            QueueUrl:            aws.String(queueURL),
            MaxNumberOfMessages: aws.Int64(10),
            WaitTimeSeconds:     aws.Int64(20), // Long polling
        })
        if err != nil {
            log.Printf("receive failed: %v", err)
            continue
        }
        
        for _, message := range result.Messages {
            var job EmailJob
            if err := json.Unmarshal([]byte(*message.Body), &job); err != nil {
                continue
            }
            
            // Process job; on failure the message stays in the queue and is retried
            if err := emailService.Send(job); err == nil {
                // Delete message from queue
                sqs.DeleteMessage(&sqs.DeleteMessageInput{
                    QueueUrl:      aws.String(queueURL),
                    ReceiptHandle: message.ReceiptHandle,
                })
            }
        }
    }
}

Benefits:

  • Faster API response times
  • Retry failed jobs automatically
  • Scale workers independently
  • System survives traffic spikes (queue absorbs load)

Circuit Breaker Pattern

Prevent cascading failures:

type CircuitBreaker struct {
    maxFailures  int
    timeout      time.Duration
    failures     int
    lastAttempt  time.Time
    state        string // "closed", "open", "half-open"
}

func (cb *CircuitBreaker) Call(fn func() error) error {
    if cb.state == "open" {
        if time.Since(cb.lastAttempt) > cb.timeout {
            cb.state = "half-open"
        } else {
            return errors.New("circuit breaker is open")
        }
    }
    
    err := fn()
    if err != nil {
        cb.failures++
        cb.lastAttempt = time.Now()
        
        if cb.failures >= cb.maxFailures {
            cb.state = "open"
        }
        return err
    }
    
    // Success: reset
    cb.failures = 0
    cb.state = "closed"
    return nil
}

Usage:

cb := &CircuitBreaker{
    maxFailures: 5,
    timeout:     30 * time.Second,
}

// Call external service
err := cb.Call(func() error {
    return paymentService.Charge(amount)
})

if err != nil {
    // Circuit breaker is open or service failed
    return fallbackResponse()
}

Prevents:

  • Waiting for timeouts on failed services
  • Resource exhaustion from repeated failures
  • Cascading failures across system

Rate Limiting

Protect your system from abuse and overload:

type RateLimiter struct {
    redis  *redis.Client
    limit  int
    window time.Duration
}

func (rl *RateLimiter) Allow(userID string) (bool, error) {
    key := fmt.Sprintf("rate:%s", userID)
    
    // Increment counter
    count, err := rl.redis.Incr(ctx, key).Result()
    if err != nil {
        return false, err
    }
    
    // Set expiry on first request
    if count == 1 {
        rl.redis.Expire(ctx, key, rl.window)
    }
    
    return count <= int64(rl.limit), nil
}

// Usage in middleware
func RateLimitMiddleware(next http.Handler) http.Handler {
    limiter := &RateLimiter{
        redis:  redisClient,
        limit:  100,
        window: time.Minute,
    }
    
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        userID := getUserID(r)
        
        allowed, _ := limiter.Allow(userID)
        if !allowed {
            w.WriteHeader(http.StatusTooManyRequests)
            w.Write([]byte("Rate limit exceeded"))
            return
        }
        
        next.ServeHTTP(w, r)
    })
}

Monitoring and Observability

Metrics That Matter

import "github.com/prometheus/client_golang/prometheus"

var (
    requestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total HTTP requests",
        },
        []string{"method", "path", "status"},
    )
    
    requestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request latency",
            Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10},
        },
        []string{"method", "path"},
    )
)

// statusRecorder captures the status code the handler writes
type statusRecorder struct {
    http.ResponseWriter
    status int
}

func (sr *statusRecorder) WriteHeader(code int) {
    sr.status = code
    sr.ResponseWriter.WriteHeader(code)
}

// Instrument handlers
func instrumentHandler(path string, handler http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
        
        // Call handler
        handler(rec, r)
        
        // Record metrics with the real status code, not a hardcoded one
        duration := time.Since(start).Seconds()
        requestsTotal.WithLabelValues(r.Method, path, strconv.Itoa(rec.status)).Inc()
        requestDuration.WithLabelValues(r.Method, path).Observe(duration)
    }
}
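
The collectors above still have to be registered and the metrics endpoint exposed before Prometheus can scrape anything. A minimal sketch using promhttp; usersHandler and the port are placeholders:

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func init() {
    // Register the collectors defined above so they appear in /metrics
    prometheus.MustRegister(requestsTotal, requestDuration)
}

func main() {
    http.HandleFunc("/api/users", instrumentHandler("/api/users", usersHandler))

    // Expose metrics for the Prometheus scraper
    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":8080", nil)
}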

Key Metrics to Track

  1. Latency (p50, p95, p99)
  2. Error rate (errors/sec, % of requests)
  3. Throughput (requests/sec)
  4. Saturation (CPU, memory, disk, network)

Alert on SLOs

# Prometheus alert rules
groups:
  - name: api_alerts
    rules:
      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
          / rate(http_requests_total[5m]) > 0.05
        for: 5m
        annotations:
          summary: "High error rate detected"
          
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 1
        for: 5m
        annotations:
          summary: "P95 latency above 1 second"

The Final Architecture

After all optimizations:

                    [CloudFront CDN]
                            |
                    [API Gateway]
                            |
                  [Application Load Balancer]
                            |
        ┌───────────────────┼───────────────────┐
        |                   |                   |
   [App Servers]       [App Servers]      [App Servers]
        |                   |                   |
        └───────────────────┼───────────────────┘
                            |
                ┌───────────┼───────────┐
                |           |           |
           [Redis]   [Message Queue]  [Database]
                            |           |
                       [Workers]        ├─→ Primary DB + Read Replicas
                                        └─→ Shard 1 / Shard 2 / Shard 3

Real-World Numbers

Our journey from 10 to 10 million users:

Metric              | Before    | After        | Improvement
Response time (p95) | 2.5s      | 150ms        | 16.7x faster
Throughput          | 100 req/s | 50,000 req/s | 500x
Database load       | 90% CPU   | 30% CPU      | 3x reduction
Monthly cost        | $500      | $8,000       | 16x (but 1,000x the users!)
Cost per user       | $0.05     | $0.0008      | 62x reduction

Key Lessons

  1. Cache aggressively - Prevents repeated expensive operations
  2. Scale horizontally - Add more servers, not bigger servers
  3. Async everything - Don’t make users wait
  4. Measure everything - You can’t improve what you don’t measure
  5. Plan for failure - Circuit breakers, retries, fallbacks
  6. Shard when necessary - But not too early
  7. Use CDNs - Serve content from edge locations

Conclusion

Scaling from 10 to 10 million users isn’t one change—it’s many incremental improvements:

  1. Separate concerns (app, database, cache)
  2. Add caching layer
  3. Horizontal scaling with load balancing
  4. Database read replicas
  5. Async processing with queues
  6. CDN for static assets
  7. Sharding (when needed)

Don’t optimize prematurely. But understand these patterns so you’re ready when you need them. Your architecture should evolve with your user base.

Start simple, measure everything, and scale intentionally. Build for the scale you have today, with an architecture that can evolve for tomorrow.