Designing Microservices That Actually Scale

April 8, 2025 · simon balfe

Practical patterns for building distributed systems on AWS, from service communication to deployment strategies that work in production.

The Monolith Breaking Point

Our monolithic application was handling 1,000 requests per second when things started breaking. Not from lack of resources, but from tight coupling. A bug in the recommendation engine would take down checkout. Deploying a small fix required redeploying everything.

We decided to break it into microservices. Here’s what I learned building and running 20+ microservices in production.

Service Boundaries: The Foundation

Getting service boundaries right is crucial. We learned this the hard way:

Domain-Driven Design

Each service owns a specific business capability:

  • User Service - Authentication, profiles, preferences
  • Order Service - Cart, checkout, order management
  • Inventory Service - Stock levels, warehouses
  • Payment Service - Transactions, refunds

Key principle: If two services constantly need to talk to each other, they’re probably one service.

Database Per Service

Each microservice has its own database. No shared databases.

user-service    → PostgreSQL (user data)
order-service   → PostgreSQL (transactions)
inventory-service → MongoDB (stock levels)
analytics-service → TimescaleDB (time-series)

This prevents tight coupling but introduces distributed data challenges, which we cover in the Distributed Data Challenges section below.

Service Communication Patterns

Synchronous: REST/gRPC

For request-response patterns:

// gRPC service definition
service OrderService {
  rpc CreateOrder(CreateOrderRequest) returns (Order);
  rpc GetOrder(GetOrderRequest) returns (Order);
}

Use gRPC when:

  • Low latency is critical
  • Strong typing is valuable
  • Services are within your control

Use REST when:

  • External clients need access
  • Simplicity matters more than raw performance
  • Wide language support is needed

Asynchronous: Event-Driven

For decoupled, scalable communication:

// Publish order created event
event := OrderCreatedEvent{
    OrderID: order.ID,
    UserID: order.UserID,
    Amount: order.Total,
}
publisher.Publish("orders.created", event)

Services that care:

  • Inventory: Reserve stock
  • Email: Send confirmation
  • Analytics: Track conversion
  • Fraud: Check suspicious activity

They all subscribe independently. If email is down, orders still process.
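To make the "if email is down, orders still process" point concrete, here is a minimal in-memory sketch of independent subscribers. The Bus type, the Handler signature, and ErrEmailDown are hypothetical names for illustration only; a real setup would sit on a broker and deliver asynchronously, and the Amount unit (cents) is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// OrderCreatedEvent mirrors the event published above.
// Amount is in minor units (cents) here, which is an assumption.
type OrderCreatedEvent struct {
	OrderID string
	UserID  string
	Amount  int64
}

// Handler is a hypothetical subscriber callback.
type Handler func(OrderCreatedEvent) error

// ErrEmailDown simulates an unavailable email service.
var ErrEmailDown = errors.New("email service unavailable")

// Bus is a toy in-memory stand-in for a real message broker.
type Bus struct {
	subs map[string][]Handler
}

// NewBus returns an empty bus.
func NewBus() *Bus {
	return &Bus{subs: make(map[string][]Handler)}
}

// Subscribe registers a handler for a topic.
func (b *Bus) Subscribe(topic string, h Handler) {
	b.subs[topic] = append(b.subs[topic], h)
}

// Publish calls every subscriber and collects failures instead of
// stopping at the first one, so one broken consumer cannot block the rest.
func (b *Bus) Publish(topic string, e OrderCreatedEvent) []error {
	var errs []error
	for _, h := range b.subs[topic] {
		if err := h(e); err != nil {
			errs = append(errs, fmt.Errorf("subscriber on %q: %w", topic, err))
		}
	}
	return errs
}
```

If an email handler returns ErrEmailDown and an inventory handler succeeds, Publish reports one error and the inventory handler still runs. With a real broker the failed delivery would typically be retried or parked on a dead-letter queue rather than returned to the publisher.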

Deployment on AWS

Container Orchestration with ECS

We run our services on AWS ECS (Elastic Container Service):

# Service and task settings (simplified pseudo-config, not real ECS syntax)
service: order-service
image: <account-id>.dkr.ecr.<region>.amazonaws.com/order-service:v1.2.3
cpu: 512
memory: 1024
desired_count: 3

load_balancer:
  target_group: order-service-tg
  container_port: 8080

auto_scaling:
  min_capacity: 2
  max_capacity: 20
  cpu_threshold: 70

Why ECS over EKS:

  • Simpler for our team size (< 10 engineers)
  • Native AWS integration
  • Lower operational overhead
  • Still provides container orchestration

Service Discovery

Services find each other via AWS Cloud Map:

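// Discover and call user service; check both errors rather than discarding them
endpoint, err := discovery.GetServiceEndpoint("user-service")
if err != nil {
    return fmt.Errorf("discover user-service: %w", err)
}
resp, err := http.Get(fmt.Sprintf("%s/users/%s", endpoint, userID)) // check err, then close resp.Body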

No hardcoded URLs. Services can scale and move freely.
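Conceptually, discovery is a lookup from a service name to one of its live instances. The sketch below is a toy in-memory registry with round-robin selection. It is not the Cloud Map API; Registry, Register, and Resolve are hypothetical names used only to show the idea.

```go
package main

import (
	"fmt"
	"sync"
)

// Registry maps service names to instance endpoints.
// It is an in-memory illustration, not a Cloud Map client.
type Registry struct {
	mu        sync.Mutex
	instances map[string][]string
	next      map[string]int
}

// NewRegistry returns an empty registry.
func NewRegistry() *Registry {
	return &Registry{
		instances: make(map[string][]string),
		next:      make(map[string]int),
	}
}

// Register adds an instance endpoint for a service.
func (r *Registry) Register(service, endpoint string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.instances[service] = append(r.instances[service], endpoint)
}

// Resolve returns the next endpoint for a service in round-robin order,
// or an error when no instance is registered.
func (r *Registry) Resolve(service string) (string, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	eps := r.instances[service]
	if len(eps) == 0 {
		return "", fmt.Errorf("no instances registered for %q", service)
	}
	i := r.next[service] % len(eps)
	r.next[service] = i + 1
	return eps[i], nil
}
```

Callers only ever know the name "user-service". Instances can be added or replaced behind that name without any caller changing its configuration.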

Load Balancing

Application Load Balancer (ALB) routes traffic:

user.example.com/api/orders → order-service
user.example.com/api/users  → user-service
user.example.com/api/payment → payment-service

Health checks ensure traffic only goes to healthy instances.
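For the health check side, each service needs an endpoint the load balancer can poll. Here is a minimal sketch, assuming a /health path and a process-wide readiness flag. Both are illustrative choices rather than something the ALB mandates, and the probe helper only simulates a single check locally.

```go
package main

import (
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// ready flips to true once dependencies (database, caches) are initialised.
var ready atomic.Bool

// healthHandler answers 200 when the service is ready and 503 otherwise,
// so the load balancer can stop routing to an instance that is not ready.
func healthHandler(w http.ResponseWriter, r *http.Request) {
	if !ready.Load() {
		http.Error(w, "not ready", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
	_, _ = w.Write([]byte("ok"))
}

// newMux wires the health endpoint; the /health path is an assumption.
func newMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/health", healthHandler)
	return mux
}

// probe simulates a single health check request and returns the status code.
func probe(h http.Handler, path string) int {
	req := httptest.NewRequest(http.MethodGet, path, nil)
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}
```

Calling probe(newMux(), "/health") returns 503 until ready.Store(true) has been called, and 200 afterwards. Keeping the check cheap matters because it is polled continuously for every instance.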

Distributed Data Challenges

The Saga Pattern

When operations span multiple services, use sagas instead of distributed transactions:

// Order creation saga
func CreateOrderSaga(order Order) error {
    // 1. Reserve inventory
    if err := inventoryClient.Reserve(order.Items); err != nil {
        return err
    }
    
    // 2. Process payment
    if err := paymentClient.Charge(order.Payment); err != nil {
        // Compensate: unreserve inventory
        inventoryClient.Unreserve(order.Items)
        return err
    }
    
    // 3. Create order
    if err := orderDB.Create(order); err != nil {
        // Compensate: refund and unreserve
        paymentClient.Refund(order.Payment)
        inventoryClient.Unreserve(order.Items)
        return err
    }
    
    return nil
}
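The hand-written saga above repeats its compensation logic in every error branch and drops any error a compensation returns. One way to generalise it is a small runner that pairs each step with its undo and unwinds in reverse order. This is a sketch under assumptions: Step, RunSaga, and ErrOrderWrite are hypothetical names, and a production saga would also persist its progress so it can resume after a crash.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrOrderWrite is an example failure used to demonstrate compensation.
var ErrOrderWrite = errors.New("order write failed")

// Step pairs an action with the compensation that undoes it.
type Step struct {
	Name       string
	Action     func() error
	Compensate func() error // nil when there is nothing to undo
}

// RunSaga executes steps in order. When a step fails, it runs the
// compensations of all completed steps in reverse order and returns the
// original error joined with any compensation errors.
func RunSaga(steps []Step) error {
	for i, s := range steps {
		err := s.Action()
		if err == nil {
			continue
		}
		errs := []error{fmt.Errorf("step %q failed: %w", s.Name, err)}
		for j := i - 1; j >= 0; j-- {
			done := steps[j]
			if done.Compensate == nil {
				continue
			}
			if cerr := done.Compensate(); cerr != nil {
				errs = append(errs, fmt.Errorf("compensating %q failed: %w", done.Name, cerr))
			}
		}
		return errors.Join(errs...)
	}
	return nil
}
```

With steps reserve, charge, and create, a failure in create triggers refund and then unreserve, which matches the order used in CreateOrderSaga. A failed compensation is reported alongside the original error instead of being silently lost.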

Event Sourcing for Critical Domains

For order and payment services, we store events rather than current state:

type OrderEvent struct {
    OrderID   string
    EventType string // "created", "paid", "shipped"
    Timestamp time.Time
    Data      json.RawMessage
}

Benefits:

  • Complete audit trail
  • Can reconstruct state at any point in time
  • Natural fit for event-driven architecture
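Reconstructing state at a point in time amounts to folding over the event log up to that moment. A minimal sketch follows, assuming events arrive sorted by Timestamp and that the status is simply the latest event type. OrderState, Replay, and the at helper are hypothetical names added for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// OrderEvent matches the event record shown above.
type OrderEvent struct {
	OrderID   string
	EventType string // "created", "paid", "shipped"
	Timestamp time.Time
	Data      json.RawMessage
}

// OrderState is the view derived from events (hypothetical shape).
type OrderState struct {
	OrderID string
	Status  string
	Version int // number of events applied
}

// at is a small helper that builds a timestamp from Unix seconds.
func at(sec int64) time.Time {
	return time.Unix(sec, 0)
}

// Replay folds events, assumed sorted by Timestamp, up to and including
// asOf, and returns the state the order had at that moment.
func Replay(events []OrderEvent, asOf time.Time) (OrderState, error) {
	var s OrderState
	for _, e := range events {
		if e.Timestamp.After(asOf) {
			break
		}
		switch e.EventType {
		case "created", "paid", "shipped":
			s.OrderID = e.OrderID
			s.Status = e.EventType
			s.Version++
		default:
			return s, fmt.Errorf("unknown event type %q", e.EventType)
		}
	}
	return s, nil
}
```

Replaying a log of created, paid, and shipped events with asOf set between the second and third event yields Status "paid" and Version 2. Long-lived aggregates usually add periodic snapshots so a replay does not have to start from the first event every time.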

Observability at Scale

Distributed Tracing with X-Ray

Every request gets a trace ID that follows it across services:

// Start a segment; the returned context carries the trace to downstream calls
ctx, seg := xray.BeginSegment(r.Context(), "order-service")
defer seg.Close(nil)
resp := userService.GetUser(ctx, userID)

Now we can see the full request path:

API Gateway → Order Service → User Service → Database
   120ms         45ms           30ms          25ms

Centralized Logging

All services log to CloudWatch:

log.WithFields(log.Fields{
    "trace_id": traceID,
    "user_id": userID,
    "service": "order-service",
}).Info("Order created")

We can correlate logs across services using trace_id.

Metrics That Matter

Key metrics for each service:

  • Request rate (requests/sec)
  • Error rate (errors/sec)
  • Latency (p50, p95, p99)
  • Resource utilization (CPU, memory)

// Instrument endpoints
orderCounter.Inc()
orderLatency.Observe(duration.Seconds())
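To show what p50, p95, and p99 actually mean, here is a small nearest-rank percentile calculation over raw latency samples in milliseconds. It is an illustration only; the Percentile function is a hypothetical helper, and metrics backends typically estimate percentiles from histogram buckets rather than keeping every sample.

```go
package main

import (
	"math"
	"sort"
)

// Percentile returns the p-th percentile (0 < p <= 100) of the samples
// using the nearest-rank method. It returns 0 for an empty input and
// leaves the caller's slice unmodified.
func Percentile(samplesMs []float64, p float64) float64 {
	if len(samplesMs) == 0 {
		return 0
	}
	sorted := append([]float64(nil), samplesMs...)
	sort.Float64s(sorted)

	// Nearest rank: the smallest value with at least p% of samples at or below it.
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}
```

For ten samples of 10, 20, up to 100 ms, the p50 is 50 ms while both p95 and p99 are 100 ms. That gap is why averages alone hide the slow requests that users actually notice.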

Lessons Learned

  1. Start with the monolith - Don’t go microservices until you have to
  2. Conway’s Law is real - Service boundaries should match team boundaries
  3. Network calls will fail - Build resilience from day one
  4. Observability isn’t optional - You can’t debug what you can’t see
  5. Automate everything - Manual operations don’t scale
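As one concrete example of lesson 3, a retry with exponential backoff is often the first resilience building block. The sketch below is a simplified version under assumptions: Retry and ErrTransient are hypothetical names, and production code would usually also accept a context for cancellation, add jitter, and retry only errors known to be transient.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrTransient is an example of a failure worth retrying.
var ErrTransient = errors.New("transient failure")

// Retry calls fn up to attempts times and doubles the wait after each
// failure. It returns nil on the first success, or the last error
// wrapped with the number of attempts made.
func Retry(attempts int, base time.Duration, fn func() error) error {
	if attempts < 1 {
		return errors.New("retry: attempts must be at least 1")
	}
	var err error
	delay := base
	for i := 1; i <= attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i < attempts {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", attempts, err)
}
```

A call such as Retry(3, 100*time.Millisecond, callUserService), where callUserService is a placeholder for any func() error, waits 100 ms and then 200 ms between attempts. Retries are only safe for idempotent operations, so a payment call needs an idempotency key before it goes through a helper like this.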

Conclusion

Microservices aren’t a silver bullet. They solve scaling problems but introduce operational complexity. The key is knowing when the benefits outweigh the costs.

For us, microservices enabled independent deployment, per-service technology choices, and independent scaling. But they required investment in tooling, monitoring, and developer education.

If you’re considering microservices, start small. Extract one service, learn the patterns, and iterate. Your architecture should evolve with your needs.