Infrastructure as Code: Terraform Best Practices
Building maintainable cloud infrastructure with Terraform, from module design to state management in production.
Why Infrastructure as Code Matters
We used to provision infrastructure through the AWS console. Click click click, wait for resources to spin up, forget what we did, and have no way to reproduce it. Then came the inevitable “it works on my account but not yours” conversations.
Terraform changed everything. Our entire infrastructure is now versioned, reviewable, and reproducible.
The Foundation: Project Structure
Here’s how we organize Terraform projects:
infrastructure/
├── modules/
│   ├── vpc/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── ecs-service/
│   └── rds/
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── terraform.tfvars
│   │   └── backend.tf
│   ├── staging/
│   └── prod/
└── README.md
Key principles:
- Modules are reusable components
- Environments consume modules with different configurations
- Each environment has its own state file
Writing Reusable Modules
Here’s a module for an ECS service:
# modules/ecs-service/main.tf
resource "aws_ecs_task_definition" "app" {
  family                   = var.service_name
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = var.cpu
  memory                   = var.memory
  execution_role_arn       = aws_iam_role.execution.arn
  task_role_arn            = aws_iam_role.task.arn

  container_definitions = jsonencode([{
    name  = var.service_name
    image = var.image

    portMappings = [{
      containerPort = var.container_port
      protocol      = "tcp"
    }]

    environment = [
      for key, value in var.environment : {
        name  = key
        value = value
      }
    ]

    logConfiguration = {
      logDriver = "awslogs"
      options = {
        "awslogs-group"         = aws_cloudwatch_log_group.app.name
        "awslogs-region"        = var.aws_region
        "awslogs-stream-prefix" = "ecs"
      }
    }
  }])
}
resource "aws_ecs_service" "app" {
  name            = var.service_name
  cluster         = var.cluster_id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = var.desired_count
  launch_type     = "FARGATE"

  network_configuration {
    subnets          = var.private_subnet_ids
    security_groups  = [aws_security_group.app.id]
    assign_public_ip = false
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.app.arn
    container_name   = var.service_name
    container_port   = var.container_port
  }
}
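The task definition above references supporting resources (the log group, IAM roles, security group) that live in the same module. Here's a sketch of two of them; the names match the module's references, but the retention period and ingress CIDR are illustrative assumptions you'd tune per environment:

# modules/ecs-service/main.tf (continued) — illustrative sketch
resource "aws_cloudwatch_log_group" "app" {
  name              = "/ecs/${var.service_name}"
  retention_in_days = 30 # assumption: adjust retention per environment
}

resource "aws_security_group" "app" {
  name_prefix = "${var.service_name}-"
  vpc_id      = var.vpc_id

  ingress {
    from_port   = var.container_port
    to_port     = var.container_port
    protocol    = "tcp"
    cidr_blocks = ["10.0.0.0/8"] # assumption: restrict to your VPC CIDR
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}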
Variables and Validation
Make your modules self-documenting with good variable definitions:
# modules/ecs-service/variables.tf
variable "service_name" {
  description = "Name of the ECS service"
  type        = string

  validation {
    condition     = can(regex("^[a-z0-9-]+$", var.service_name))
    error_message = "Service name must be lowercase alphanumeric with hyphens."
  }
}

variable "cpu" {
  description = "CPU units for the task (256, 512, 1024, 2048, 4096)"
  type        = number
  default     = 256

  validation {
    condition     = contains([256, 512, 1024, 2048, 4096], var.cpu)
    error_message = "CPU must be one of: 256, 512, 1024, 2048, 4096."
  }
}

variable "memory" {
  description = "Memory (MB) for the task"
  type        = number
  default     = 512
}

variable "environment" {
  description = "Environment variables for the container"
  type        = map(string)
  default     = {}
}
Using Modules in Environments
# environments/prod/main.tf
module "api_service" {
  source = "../../modules/ecs-service"

  service_name  = "api-prod"
  cluster_id    = aws_ecs_cluster.main.id
  cpu           = 1024
  memory        = 2048
  desired_count = 5

  image          = "123456789.dkr.ecr.us-east-1.amazonaws.com/api:latest"
  container_port = 8080

  environment = {
    NODE_ENV     = "production"
    DATABASE_URL = data.aws_ssm_parameter.db_url.value
    REDIS_URL    = aws_elasticache_cluster.redis.cache_nodes[0].address
  }

  vpc_id             = data.aws_vpc.main.id
  private_subnet_ids = data.aws_subnets.private.ids
}
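The data.aws_vpc and data.aws_subnets lookups above resolve existing networking by tag. A sketch of how that might look — the Environment and Tier tag keys are our tagging convention, an assumption for illustration:

# environments/prod/data.tf — illustrative sketch
data "aws_vpc" "main" {
  tags = {
    Environment = "prod" # assumption: VPCs are tagged by environment
  }
}

data "aws_subnets" "private" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.main.id]
  }

  tags = {
    Tier = "private" # assumption: subnets are tagged by tier
  }
}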
Remote State Management
Never store state files locally. Use S3 + DynamoDB for state locking:
# environments/prod/backend.tf
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-locks"
  }
}
Create the S3 bucket and DynamoDB table first:
# bootstrap/main.tf
resource "aws_s3_bucket" "terraform_state" {
  bucket = "mycompany-terraform-state"
}

resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-state-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
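The state bucket holds sensitive data (resource IDs, sometimes secrets), so it's also worth blocking all public access in the same bootstrap config:

# bootstrap/main.tf (continued)
resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}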
Managing Secrets
Never commit secrets to version control. Use AWS Parameter Store or Secrets Manager:
# Store secret in AWS Secrets Manager
resource "aws_secretsmanager_secret" "db_password" {
  name = "prod/database/password"
}

resource "aws_secretsmanager_secret_version" "db_password" {
  secret_id     = aws_secretsmanager_secret.db_password.id
  secret_string = random_password.db_password.result
}

# Reference in ECS task
container_definitions = jsonencode([{
  name = "app"
  secrets = [{
    name      = "DATABASE_PASSWORD"
    valueFrom = aws_secretsmanager_secret.db_password.arn
  }]
}])
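The random_password.db_password referenced above comes from the hashicorp/random provider, so the password never appears in your code, only in state:

# Generate the password — requires the hashicorp/random provider
resource "random_password" "db_password" {
  length  = 32
  special = true
}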
Terraform Workspaces (Use Sparingly)
Workspaces seem appealing but can cause issues:
# Create workspace
terraform workspace new staging
# Switch workspace
terraform workspace select prod
When to use:
- Quick dev/test environments
- Temporary infrastructure
When NOT to use:
- Production infrastructure
- Environments with different configurations
For prod, use separate directories with separate state files.
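Part of the problem is that workspace-based configs accumulate conditionals keyed on terraform.workspace, which makes production behavior harder to read and easier to break. An illustrative (and cautionary) sketch:

# Conditionals like this spread through workspace-based configs
resource "aws_instance" "web" {
  ami           = var.ami_id
  instance_type = terraform.workspace == "prod" ? "m5.large" : "t3.micro"

  tags = {
    Environment = terraform.workspace
  }
}

With separate directories, each environment states its configuration plainly in its own tfvars instead of branching at apply time.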
Plan and Apply Workflow
Always review changes before applying:
# Initialize (first time or after adding providers)
terraform init
# Format code
terraform fmt -recursive
# Validate configuration
terraform validate
# Plan changes
terraform plan -out=tfplan
# Review the plan carefully!
# Apply the plan
terraform apply tfplan
CI/CD Integration
Our GitHub Actions workflow for Terraform:
name: Terraform

on:
  pull_request:
    paths:
      - 'infrastructure/**'
  push:
    branches:
      - main

jobs:
  terraform:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: infrastructure/environments/prod

    steps:
      - uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1

      - name: Terraform Init
        run: terraform init

      - name: Terraform Format
        run: terraform fmt -check

      - name: Terraform Validate
        run: terraform validate

      - name: Terraform Plan
        run: terraform plan -no-color
        continue-on-error: true

      - name: Terraform Apply
        if: github.ref == 'refs/heads/main'
        run: terraform apply -auto-approve
Import Existing Resources
Already have infrastructure? Import it:
# Import existing EC2 instance
terraform import aws_instance.example i-1234567890abcdef
# Import S3 bucket
terraform import aws_s3_bucket.example my-bucket-name
# Import RDS instance
terraform import aws_db_instance.example mydb
Write the resource definition, then import its state.
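Terraform 1.5+ also supports declarative import blocks, which let you review the import as part of a normal plan instead of mutating state from the CLI:

# Terraform 1.5+: declarative import, reviewed via plan/apply
import {
  to = aws_s3_bucket.example
  id = "my-bucket-name"
}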
Handling State Drift
Resources get modified outside Terraform (console, CLI). Detect drift:
# Detect drift without changing anything (Terraform 0.15.4+)
terraform plan -refresh-only
# Legacy approach: refresh state, then see what changed
terraform refresh
terraform plan
Common Pitfalls
1. Hardcoded Values
# ❌ Bad: Hardcoded values
resource "aws_instance" "web" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t2.micro"
}

# ✅ Good: Use variables and data sources
data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"] # Canonical

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }
}

resource "aws_instance" "web" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = var.instance_type
}
2. Not Using Lifecycle Blocks
Prevent accidental resource deletion:
resource "aws_db_instance" "prod" {
  # ... configuration ...

  lifecycle {
    prevent_destroy = true
    ignore_changes = [
      password, # Don't detect drift in passwords
    ]
  }
}
3. Large Blast Radius
Keep environments separate. A bug in dev config shouldn’t affect prod.
Terraform vs Alternatives
CloudFormation:
- AWS-only
- Verbose YAML
- Better AWS support (new services)
Pulumi:
- Real programming languages
- Good for complex logic
- Steeper learning curve
CDK:
- Synthesizes to CloudFormation
- Type-safe
- AWS-focused
Terraform wins for:
- Multi-cloud
- Mature ecosystem
- Simple learning curve
- Large community
Real-World Impact
After adopting Terraform:
- Infrastructure changes: From hours to minutes
- Environment consistency: Dev/staging/prod truly identical
- Disaster recovery: Entire infrastructure rebuildable from code
- Collaboration: Infrastructure changes go through PR review
- Cost visibility: Easy to see what resources cost
Conclusion
Infrastructure as Code isn’t optional anymore. Terraform provides:
- Version control for infrastructure
- Reproducible environments
- Change review process
- Documentation through code
Start small. Terraform one service. Learn the patterns. Then expand. Your infrastructure will become more reliable, maintainable, and understandable.
The upfront investment in learning Terraform pays dividends every time you need to provision, modify, or understand your infrastructure.