Designing Microservices That Actually Scale
Practical patterns for building distributed systems on AWS, from service communication to deployment strategies that work in production.
The Monolith Breaking Point
Our monolithic application was handling 1,000 requests per second when things started breaking. Not from lack of resources, but from tight coupling. A bug in the recommendation engine would take down checkout. Deploying a small fix required redeploying everything.
We decided to break it into microservices. Here’s what I learned building and running 20+ microservices in production.
Service Boundaries: The Foundation
Getting service boundaries right is crucial. We learned this the hard way:
Domain-Driven Design
Each service owns a specific business capability:
- User Service - Authentication, profiles, preferences
- Order Service - Cart, checkout, order management
- Inventory Service - Stock levels, warehouses
- Payment Service - Transactions, refunds
Key principle: If two services constantly need to talk to each other, they’re probably one service.
Database Per Service
Each microservice has its own database. No shared databases.
user-service → PostgreSQL (user data)
order-service → PostgreSQL (transactions)
inventory-service → MongoDB (stock levels)
analytics-service → TimescaleDB (time-series)
This prevents tight coupling but introduces distributed data challenges. More on that later.
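Concretely, each service receives only its own connection string through its own deployment configuration, so nothing can reach into another service's schema. A minimal sketch for the order service, using a hypothetical ORDER_DB_DSN environment variable to stand in for however you inject secrets:

package orderdb

import (
    "database/sql"
    "log"
    "os"

    _ "github.com/lib/pq" // PostgreSQL driver for the order service's own database
)

// OpenOrderDB connects the order service to its dedicated PostgreSQL instance.
// ORDER_DB_DSN is an illustrative env var name; only the order service is given
// this value, so no other service can talk to this database directly.
func OpenOrderDB() *sql.DB {
    db, err := sql.Open("postgres", os.Getenv("ORDER_DB_DSN"))
    if err != nil {
        log.Fatalf("open order db: %v", err)
    }
    if err := db.Ping(); err != nil {
        log.Fatalf("ping order db: %v", err)
    }
    return db
}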
Service Communication Patterns
Synchronous: REST/gRPC
For request-response patterns:
// gRPC service definition
service OrderService {
    rpc CreateOrder(CreateOrderRequest) returns (Order);
    rpc GetOrder(GetOrderRequest) returns (Order);
}
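Calling this is a normal generated-stub invocation. Here's a minimal Go client sketch, assuming the proto above has been compiled with protoc-gen-go-grpc into a hypothetical orderpb package; the request fields, the service address, and the order ID accessor are illustrative:

package main

import (
    "context"
    "log"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"

    orderpb "example.com/gen/orderpb" // hypothetical package generated from the proto above
)

func main() {
    // Dial the order service; in production this address would come from service discovery.
    conn, err := grpc.Dial("order-service:50051",
        grpc.WithTransportCredentials(insecure.NewCredentials()))
    if err != nil {
        log.Fatalf("dial order-service: %v", err)
    }
    defer conn.Close()

    client := orderpb.NewOrderServiceClient(conn)

    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
    defer cancel()

    // CreateOrderRequest fields are assumptions; the proto above doesn't define them.
    order, err := client.CreateOrder(ctx, &orderpb.CreateOrderRequest{UserId: "u-123"})
    if err != nil {
        log.Fatalf("create order: %v", err)
    }
    log.Printf("created order %s", order.GetId())
}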
Use gRPC when:
- Low latency is critical
- Strong typing is valuable
- Services are within your control
Use REST when:
- External clients need access
- Simplicity over performance
- Wide language support needed
Asynchronous: Event-Driven
For decoupled, scalable communication:
// Publish order created event
// Publish order created event
event := OrderCreatedEvent{
    OrderID: order.ID,
    UserID:  order.UserID,
    Amount:  order.Total,
}
publisher.Publish("orders.created", event)
Services that care:
- Inventory: Reserve stock
- Email: Send confirmation
- Analytics: Track conversion
- Fraud: Check suspicious activity
They all subscribe independently. If email is down, orders still process.
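The consumer side is symmetrical. Here's a rough sketch of the inventory subscriber, with the broker hidden behind a hypothetical Subscriber interface since the specific messaging system isn't the point; field types on the event are assumed:

package inventory

import (
    "encoding/json"
    "log"
)

// OrderCreatedEvent mirrors the payload published above.
type OrderCreatedEvent struct {
    OrderID string
    UserID  string
    Amount  float64
}

// Subscriber is a stand-in for whatever broker carries the events
// (SNS/SQS, Kafka, NATS, ...).
type Subscriber interface {
    Subscribe(topic string, handler func(payload []byte))
}

// WireInventory registers the inventory service's handler. If this consumer is
// down, the broker retains the event and the order flow is unaffected.
func WireInventory(sub Subscriber, reserve func(orderID string) error) {
    sub.Subscribe("orders.created", func(payload []byte) {
        var evt OrderCreatedEvent
        if err := json.Unmarshal(payload, &evt); err != nil {
            log.Printf("bad orders.created payload: %v", err)
            return
        }
        if err := reserve(evt.OrderID); err != nil {
            log.Printf("reserve stock for order %s: %v", evt.OrderID, err)
        }
    })
}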
Deployment on AWS
Container Orchestration with ECS
We run our services on AWS ECS (Elastic Container Service):
# Task definition (simplified)
service: order-service
image: ecr.io/order-service:v1.2.3
cpu: 512
memory: 1024
desired_count: 3
load_balancer:
  target_group: order-service-tg
  container_port: 8080
auto_scaling:
  min_capacity: 2
  max_capacity: 20
  cpu_threshold: 70
Why ECS over EKS:
- Simpler for our team size (< 10 engineers)
- Native AWS integration
- Lower operational overhead
- Still provides container orchestration
Service Discovery
Services find each other via AWS Cloud Map:
// Discover and call user service
endpoint, _ := discovery.GetServiceEndpoint("user-service")
resp, _ := http.Get(fmt.Sprintf("%s/users/%s", endpoint, userID))
No hardcoded URLs. Services can scale and move freely.
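Under the hood, a helper like GetServiceEndpoint can be backed by Cloud Map's DiscoverInstances API. Here's a sketch using the AWS SDK for Go v2; the namespace name and the choice to take the first registered instance are assumptions about the setup:

package discovery

import (
    "context"
    "fmt"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/servicediscovery"
)

// GetServiceEndpoint resolves a logical service name to "http://ip:port" via
// AWS Cloud Map. In real code you would build the client once and cache results
// briefly instead of calling DiscoverInstances on every request.
func GetServiceEndpoint(service string) (string, error) {
    ctx := context.Background()
    cfg, err := config.LoadDefaultConfig(ctx)
    if err != nil {
        return "", err
    }
    client := servicediscovery.NewFromConfig(cfg)

    out, err := client.DiscoverInstances(ctx, &servicediscovery.DiscoverInstancesInput{
        NamespaceName: aws.String("prod.internal"), // assumed Cloud Map namespace
        ServiceName:   aws.String(service),
    })
    if err != nil {
        return "", err
    }
    if len(out.Instances) == 0 {
        return "", fmt.Errorf("no instances registered for %s", service)
    }

    // Cloud Map registers the instance address under standard attributes.
    attrs := out.Instances[0].Attributes
    return fmt.Sprintf("http://%s:%s", attrs["AWS_INSTANCE_IPV4"], attrs["AWS_INSTANCE_PORT"]), nil
}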
Load Balancing
Application Load Balancer (ALB) routes traffic:
user.example.com/api/orders → order-service
user.example.com/api/users → user-service
user.example.com/api/payment → payment-service
Health checks ensure traffic only goes to healthy instances.
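The health check itself can be a plain HTTP handler that returns 200 only when the instance's own dependencies are reachable. A minimal sketch; the /health path and the database ping are illustrative choices, not requirements:

package health

import (
    "database/sql"
    "net/http"
)

// Handler lets the ALB decide whether this instance should receive traffic.
// It returns 200 only when the service can reach its own database.
func Handler(db *sql.DB) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        if err := db.PingContext(r.Context()); err != nil {
            http.Error(w, "dependency unavailable", http.StatusServiceUnavailable)
            return
        }
        w.WriteHeader(http.StatusOK)
    }
}

// Register wires the handler onto the path the ALB target group polls.
func Register(mux *http.ServeMux, db *sql.DB) {
    mux.HandleFunc("/health", Handler(db))
}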
Distributed Data Challenges
The Saga Pattern
When operations span multiple services, use sagas instead of distributed transactions:
// Order creation saga
func CreateOrderSaga(order Order) error {
    // 1. Reserve inventory
    if err := inventoryClient.Reserve(order.Items); err != nil {
        return err
    }

    // 2. Process payment
    if err := paymentClient.Charge(order.Payment); err != nil {
        // Compensate: unreserve inventory
        inventoryClient.Unreserve(order.Items)
        return err
    }

    // 3. Create order
    if err := orderDB.Create(order); err != nil {
        // Compensate: refund and unreserve
        paymentClient.Refund(order.Payment)
        inventoryClient.Unreserve(order.Items)
        return err
    }

    return nil
}
Event Sourcing for Critical Domains
For order and payment services, we store events rather than current state:
type OrderEvent struct {
    OrderID   string
    EventType string // "created", "paid", "shipped"
    Timestamp time.Time
    Data      json.RawMessage
}
Benefits:
- Complete audit trail
- Can reconstruct state at any point in time (see the replay sketch below)
- Natural fit for event-driven architecture
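Reconstructing state is just a fold over the event stream. Here's a rough sketch of an order projection using the OrderEvent type above; the projection fields and status transitions are illustrative:

package orders

import (
    "encoding/json"
    "time"
)

// OrderEvent has the same shape as the type defined above.
type OrderEvent struct {
    OrderID   string
    EventType string // "created", "paid", "shipped"
    Timestamp time.Time
    Data      json.RawMessage
}

// OrderState is the projection rebuilt from events.
type OrderState struct {
    OrderID   string
    Status    string
    UpdatedAt time.Time
}

// ReplayOrder folds the event stream into the order's current state.
// Events are expected in timestamp order; replaying a prefix of the
// stream gives the state at that point in time.
func ReplayOrder(events []OrderEvent) OrderState {
    var state OrderState
    for _, evt := range events {
        state.OrderID = evt.OrderID
        state.UpdatedAt = evt.Timestamp
        switch evt.EventType {
        case "created":
            state.Status = "created"
        case "paid":
            state.Status = "paid"
        case "shipped":
            state.Status = "shipped"
        }
    }
    return state
}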
Observability at Scale
Distributed Tracing with X-Ray
Every request gets a trace ID that follows it across services:
// Propagate trace context: wrap the downstream call in a subsegment.
// Assumes the handler is wrapped by xray.Handler, so r.Context() already
// carries the request's segment.
ctx, seg := xray.BeginSubsegment(r.Context(), "call-user-service")
resp, err := userService.GetUser(ctx, userID)
seg.Close(err)
Now we can see the full request path:
API Gateway (120ms) → Order Service (45ms) → User Service (30ms) → Database (25ms)
Centralized Logging
All services log to CloudWatch:
log.WithFields(log.Fields{
    "trace_id": traceID,
    "user_id":  userID,
    "service":  "order-service",
}).Info("Order created")
We can correlate logs across services using trace_id.
Metrics That Matter
Key metrics for each service:
- Request rate (requests/sec)
- Error rate (errors/sec)
- Latency (p50, p95, p99)
- Resource utilization (CPU, memory)
// Instrument endpoints
orderCounter.Inc()
orderLatency.Observe(duration.Seconds())
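Those counters and histograms have to be defined and registered somewhere. Assuming Prometheus-style instrumentation with client_golang (the metric names here are illustrative), the definitions might look like this:

package metrics

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

// orderCounter tracks request rate for the order-creation endpoint.
var orderCounter = promauto.NewCounter(prometheus.CounterOpts{
    Name: "orders_created_total",
    Help: "Total number of orders created.",
})

// orderLatency feeds the p50/p95/p99 latency panels.
var orderLatency = promauto.NewHistogram(prometheus.HistogramOpts{
    Name:    "order_create_duration_seconds",
    Help:    "Time taken to create an order.",
    Buckets: prometheus.DefBuckets,
})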
Lessons Learned
- Start with the monolith - Don't reach for microservices until you have to
- Conway’s Law is real - Service boundaries should match team boundaries
- Network calls will fail - Build resilience from day one
- Observability isn’t optional - You can’t debug what you can’t see
- Automate everything - Manual operations don’t scale
Conclusion
Microservices aren’t a silver bullet. They solve scaling problems but introduce operational complexity. The key is knowing when the benefits outweigh the costs.
For us, microservices enabled independent deployment, technology choices, and scaling. But they required investment in tooling, monitoring, and developer education.
If you’re considering microservices, start small. Extract one service, learn the patterns, and iterate. Your architecture should evolve with your needs.