Creating production-ready Kubernetes infrastructure requires careful planning, proper tooling, and adherence to best practices. In this comprehensive guide, we'll explore how to leverage Terraform to build scalable and secure Kubernetes clusters that can handle enterprise workloads.
Why Infrastructure as Code for Kubernetes?
Infrastructure as Code (IaC) transforms how we manage and deploy infrastructure. When applied to Kubernetes, it brings several critical advantages:
- Reproducibility: Deploy identical environments across dev, staging, and production
- Version Control: Track infrastructure changes alongside application code
- Automation: Reduce manual errors and deployment time
- Collaboration: Enable team members to review and contribute to infrastructure changes
Prerequisites
Before we begin, ensure you have:
- Terraform >= 1.0 installed
- kubectl configured
- Cloud provider CLI (AWS CLI, gcloud, or Azure CLI)
- Basic understanding of Kubernetes concepts
Setting Up the Project Structure
kubernetes-infrastructure/
├── modules/
│   ├── vpc/
│   ├── eks/
│   └── monitoring/
├── environments/
│   ├── dev/
│   ├── staging/
│   └── production/
├── main.tf
├── variables.tf
├── outputs.tf
└── terraform.tfvars
This modular approach promotes reusability and maintainability across different environments.
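Before wiring modules together, it's worth configuring remote state so every environment shares a locked, versioned state store. A minimal sketch, assuming a pre-created S3 bucket and DynamoDB lock table (both names below are placeholders):

# environments/production/backend.tf
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"     # pre-created bucket (placeholder)
    key            = "kubernetes/production/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"        # pre-created lock table (placeholder)
    encrypt        = true
  }
}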
Creating the VPC Module
First, let's create a robust networking foundation:
# modules/vpc/main.tf

# Availability zones that the subnets below are spread across
data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Name                                        = "${var.cluster_name}-vpc"
    "kubernetes.io/cluster/${var.cluster_name}" = "shared"
  }
}

resource "aws_subnet" "private" {
  count = length(var.private_subnet_cidrs)

  vpc_id            = aws_vpc.main.id
  cidr_block        = var.private_subnet_cidrs[count.index]
  availability_zone = data.aws_availability_zones.available.names[count.index]

  tags = {
    Name                                        = "${var.cluster_name}-private-${count.index + 1}"
    "kubernetes.io/cluster/${var.cluster_name}" = "owned"
    "kubernetes.io/role/internal-elb"           = "1"
  }
}
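The module references several input variables; a matching modules/vpc/variables.tf might look like this (the defaults are illustrative):

# modules/vpc/variables.tf
variable "cluster_name" {
  description = "Name of the EKS cluster, used in resource tags"
  type        = string
}

variable "vpc_cidr" {
  description = "CIDR block for the VPC"
  type        = string
  default     = "10.0.0.0/16"
}

variable "private_subnet_cidrs" {
  description = "One CIDR per private subnet / availability zone"
  type        = list(string)
  default     = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
}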
EKS Cluster Configuration
Now, let's create a production-ready EKS cluster:
# modules/eks/main.tf
resource "aws_eks_cluster" "main" {
  name     = var.cluster_name
  role_arn = aws_iam_role.cluster.arn
  version  = var.kubernetes_version

  vpc_config {
    subnet_ids              = var.subnet_ids
    endpoint_private_access = true
    endpoint_public_access  = true
    public_access_cidrs     = var.public_access_cidrs
  }

  encryption_config {
    provider {
      key_arn = aws_kms_key.eks.arn
    }
    resources = ["secrets"]
  }

  enabled_cluster_log_types = [
    "api",
    "audit",
    "authenticator",
    "controllerManager",
    "scheduler",
  ]

  depends_on = [
    aws_iam_role_policy_attachment.cluster_policy,
    aws_iam_role_policy_attachment.vpc_resource_controller,
  ]
}
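The cluster block references an IAM role for the control plane and a KMS key for envelope-encrypting Secrets. A minimal sketch of those supporting resources (the file placement is a suggestion):

# modules/eks/iam.tf
resource "aws_kms_key" "eks" {
  description         = "KMS key for EKS secrets envelope encryption"
  enable_key_rotation = true
}

resource "aws_iam_role" "cluster" {
  name = "${var.cluster_name}-cluster-role"

  # Allow the EKS control plane to assume this role
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "eks.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "cluster_policy" {
  role       = aws_iam_role.cluster.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSClusterPolicy"
}

resource "aws_iam_role_policy_attachment" "vpc_resource_controller" {
  role       = aws_iam_role.cluster.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSVPCResourceController"
}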
Node Group Configuration
Configure managed node groups with auto-scaling:
resource "aws_eks_node_group" "main" {
cluster_name = aws_eks_cluster.main.name
node_group_name = "${var.cluster_name}-nodes"
node_role_arn = aws_iam_role.node_group.arn
subnet_ids = var.private_subnet_ids
instance_types = var.instance_types
ami_type = "AL2_x86_64"
capacity_type = "ON_DEMAND"
scaling_config {
desired_size = var.desired_capacity
max_size = var.max_capacity
min_size = var.min_capacity
}
update_config {
max_unavailable = 1
}
# Ensure node group is created after all IAM policies
depends_on = [
aws_iam_role_policy_attachment.node_group_policy,
aws_iam_role_policy_attachment.cni_policy,
aws_iam_role_policy_attachment.registry_policy,
]
}
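The node group likewise needs its own IAM role carrying the three managed policies named in depends_on; a minimal sketch:

resource "aws_iam_role" "node_group" {
  name = "${var.cluster_name}-node-role"

  # Allow EC2 worker nodes to assume this role
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "node_group_policy" {
  role       = aws_iam_role.node_group.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy"
}

resource "aws_iam_role_policy_attachment" "cni_policy" {
  role       = aws_iam_role.node_group.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"
}

resource "aws_iam_role_policy_attachment" "registry_policy" {
  role       = aws_iam_role.node_group.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
}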
Security Best Practices
RBAC Configuration
Implement role-based access control:
# rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: developer-role
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps"]
    verbs: ["get", "list", "create", "update", "delete"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list", "create", "update", "delete"]
Network Policies
Implement network segmentation:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-ingress
spec:
  podSelector: {}
  policyTypes:
    - Ingress
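Default-deny blocks all inbound traffic, so pair it with explicit allow rules for the flows you do want. A sketch admitting traffic from frontend pods to backend pods (the labels and port are illustrative):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
spec:
  podSelector:
    matchLabels:
      app: backend              # illustrative label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend     # illustrative label
      ports:
        - protocol: TCP
          port: 8080            # illustrative port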
Monitoring and Observability
Integrate monitoring from the start:
# modules/monitoring/main.tf
resource "helm_release" "prometheus" {
  name             = "prometheus"
  repository       = "https://prometheus-community.github.io/helm-charts"
  chart            = "kube-prometheus-stack"
  namespace        = "monitoring"
  create_namespace = true

  values = [
    file("${path.module}/prometheus-values.yaml")
  ]
}
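The release reads its overrides from prometheus-values.yaml. What belongs there depends on your environment; a small illustrative sketch (the keys follow the kube-prometheus-stack chart's value layout, and the values themselves are assumptions):

# modules/monitoring/prometheus-values.yaml
prometheus:
  prometheusSpec:
    retention: 15d                 # how long to keep metrics
    storageSpec:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 50Gi        # persistent volume for metric data
grafana:
  adminPassword: change-me         # placeholder; source from a secret in practice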
Auto-Scaling Configuration
Configure scaling at both levels: the Cluster Autoscaler manages nodes, while the Horizontal Pod Autoscaler (shown after the release below) manages pod replicas:
resource "helm_release" "cluster_autoscaler" {
name = "cluster-autoscaler"
repository = "https://kubernetes.github.io/autoscaler"
chart = "cluster-autoscaler"
namespace = "kube-system"
set {
name = "autoDiscovery.clusterName"
value = var.cluster_name
}
set {
name = "awsRegion"
value = var.aws_region
}
}
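The Cluster Autoscaler handles nodes; pod-level horizontal scaling comes from a HorizontalPodAutoscaler. A minimal sketch targeting a hypothetical web-app Deployment (requires the metrics server):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app               # hypothetical Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70    # scale out above 70% average CPU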
Environment-Specific Configurations
Development Environment
# environments/dev/main.tf
module "kubernetes_cluster" {
  source = "../../modules/eks"

  cluster_name       = "dev-cluster"
  kubernetes_version = "1.28"
  instance_types     = ["t3.medium"]

  desired_capacity = 2
  min_capacity     = 1
  max_capacity     = 5
}
Production Environment
# environments/production/main.tf
module "kubernetes_cluster" {
  source = "../../modules/eks"

  cluster_name       = "prod-cluster"
  kubernetes_version = "1.28"
  instance_types     = ["m5.large", "m5.xlarge"]

  desired_capacity = 5
  min_capacity     = 3
  max_capacity     = 20
}
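After applying an environment, point kubectl at the new cluster with the AWS CLI (the region flag below is a placeholder):

# Fetch credentials and merge them into ~/.kube/config
aws eks update-kubeconfig --name prod-cluster --region us-east-1

# Confirm the worker nodes have joined and are Ready
kubectl get nodes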
Deployment Pipeline
Automate deployment with CI/CD:
# .github/workflows/terraform.yml
name: Terraform

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  terraform:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: 1.6.0

      - name: Terraform Plan
        run: |
          terraform init
          terraform plan

      - name: Terraform Apply
        if: github.ref == 'refs/heads/main'
        run: terraform apply -auto-approve
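As written, the workflow has no AWS credentials, so terraform init and plan will fail against real infrastructure. One common fix is the official credentials action, inserted before the plan step; the role ARN below is a placeholder and assumes an OIDC trust relationship:

- name: Configure AWS credentials
  uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: arn:aws:iam::123456789012:role/terraform-ci   # placeholder role ARN
    aws-region: us-east-1

With OIDC, the job also needs permissions: id-token: write.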
Backup and Disaster Recovery
Implement backup strategies:
resource "helm_release" "velero" {
name = "velero"
repository = "https://vmware-tanzu.github.io/helm-charts"
chart = "velero"
namespace = "velero"
create_namespace = true
set {
name = "configuration.provider"
value = "aws"
}
set {
name = "configuration.backupStorageLocation.bucket"
value = aws_s3_bucket.velero_backup.bucket
}
}
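The release references an S3 bucket for backup storage. A minimal sketch of that bucket, with versioning enabled so existing backups can't be silently overwritten (the naming scheme is a placeholder):

resource "aws_s3_bucket" "velero_backup" {
  bucket = "${var.cluster_name}-velero-backups"   # placeholder naming scheme
}

resource "aws_s3_bucket_versioning" "velero_backup" {
  bucket = aws_s3_bucket.velero_backup.id

  versioning_configuration {
    status = "Enabled"
  }
}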
Cost Optimization
Spot Instances
resource "aws_eks_node_group" "spot" {
cluster_name = aws_eks_cluster.main.name
node_group_name = "${var.cluster_name}-spot-nodes"
node_role_arn = aws_iam_role.node_group.arn
subnet_ids = var.private_subnet_ids
capacity_type = "SPOT"
instance_types = ["m5.large", "m5.xlarge", "m4.large"]
scaling_config {
desired_size = 2
max_size = 10
min_size = 0
}
}
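Because spot capacity can be reclaimed with roughly two minutes' notice, consider tainting the spot group so only workloads that explicitly tolerate interruption land there. A sketch with an illustrative taint key; the taint block goes inside the node group above:

  # Inside aws_eks_node_group "spot": repel pods that don't opt in
  taint {
    key    = "lifecycle"
    value  = "spot"
    effect = "NO_SCHEDULE"
  }

Interruption-tolerant workloads then opt in with a matching toleration in their pod spec:

  tolerations:
    - key: "lifecycle"
      operator: "Equal"
      value: "spot"
      effect: "NoSchedule"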
Resource Quotas
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: production
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
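Once a compute quota is in place, pods that omit requests and limits are rejected. A companion LimitRange fills in sensible defaults so that doesn't happen (the values are illustrative):

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
    - type: Container
      default:                  # applied when a container omits limits
        cpu: 500m
        memory: 512Mi
      defaultRequest:           # applied when a container omits requests
        cpu: 250m
        memory: 256Mi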
Troubleshooting Common Issues
Node Not Ready
Check these common causes, then run the diagnostic commands shown after the list:
- Network connectivity
- IAM permissions
- Security group configurations
- Instance capacity
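When a node sticks in NotReady, these commands usually surface the cause quickly:

# Inspect the node's conditions and recent events
kubectl describe node <node-name>

# Scan cluster events for IAM, networking, or capacity errors
kubectl get events --sort-by=.metadata.creationTimestamp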
Pod Scheduling Issues
Debug with:
# Inspect scheduling events and failure reasons for the pod
kubectl describe pod <pod-name>

# Review node status, capacity, and placement
kubectl get nodes -o wide

# Check current CPU/memory usage (requires the metrics server)
kubectl top nodes
Conclusion
Building scalable Kubernetes infrastructure with Terraform requires careful planning and adherence to best practices. By following the patterns outlined in this guide, you can create robust, secure, and maintainable infrastructure that scales with your organization's needs.
Key takeaways:
- Use modular Terraform code for reusability
- Implement security from day one
- Plan for monitoring and observability
- Automate everything through CI/CD
- Consider cost optimization strategies
Have you implemented Kubernetes with Terraform in your organization? What challenges did you face and how did you overcome them? Share your experiences in the comments below.