Building Scalable Kubernetes Infrastructure with Terraform
Kubernetes · Terraform · DevOps

Jananath Banuka · 8 min read

Creating production-ready Kubernetes infrastructure requires careful planning, proper tooling, and adherence to best practices. In this comprehensive guide, we'll explore how to leverage Terraform to build scalable and secure Kubernetes clusters that can handle enterprise workloads.

Why Infrastructure as Code for Kubernetes?

Infrastructure as Code (IaC) transforms how we manage and deploy infrastructure. When applied to Kubernetes, it brings several critical advantages:

  • Reproducibility: Deploy identical environments across dev, staging, and production
  • Version Control: Track infrastructure changes alongside application code
  • Automation: Reduce manual errors and deployment time
  • Collaboration: Enable team members to review and contribute to infrastructure changes

Prerequisites

Before we begin, ensure you have the following (a quick verification snippet follows the list):

  • Terraform >= 1.0 installed
  • kubectl configured
  • Cloud provider CLI (AWS CLI, gcloud, or Azure CLI)
  • Basic understanding of Kubernetes concepts
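
A fast way to confirm the tooling is in place (the aws CLI is shown, since every example here targets EKS):

terraform version
kubectl version --client
aws --version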

Setting Up the Project Structure

kubernetes-infrastructure/
├── modules/
│   ├── vpc/
│   ├── eks/
│   └── monitoring/
├── environments/
│   ├── dev/
│   ├── staging/
│   └── production/
├── main.tf
├── variables.tf
├── outputs.tf
└── terraform.tfvars

This modular approach promotes reusability and maintainability across different environments.

Creating the VPC Module

First, let's create a robust networking foundation:

# modules/vpc/main.tf

# Availability zones in the current region, referenced by the subnets below
data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Name = "${var.cluster_name}-vpc"
    "kubernetes.io/cluster/${var.cluster_name}" = "shared"
  }
}

resource "aws_subnet" "private" {
  count = length(var.private_subnet_cidrs)

  vpc_id            = aws_vpc.main.id
  cidr_block        = var.private_subnet_cidrs[count.index]
  availability_zone = data.aws_availability_zones.available.names[count.index]

  tags = {
    Name = "${var.cluster_name}-private-${count.index + 1}"
    "kubernetes.io/cluster/${var.cluster_name}" = "owned"
    "kubernetes.io/role/internal-elb" = "1"
  }
}
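
For reference, the inputs this module uses can be declared as follows — a minimal sketch, with illustrative CIDR defaults:

# modules/vpc/variables.tf (a minimal sketch; defaults are illustrative)
variable "cluster_name" {
  description = "Name of the EKS cluster, used in resource tags"
  type        = string
}

variable "vpc_cidr" {
  description = "CIDR block for the VPC"
  type        = string
  default     = "10.0.0.0/16"
}

variable "private_subnet_cidrs" {
  description = "One CIDR per private subnet / availability zone"
  type        = list(string)
  default     = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
}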

EKS Cluster Configuration

Now, let's create a production-ready EKS cluster:

# modules/eks/main.tf
resource "aws_eks_cluster" "main" {
  name     = var.cluster_name
  role_arn = aws_iam_role.cluster.arn
  version  = var.kubernetes_version

  vpc_config {
    subnet_ids              = var.subnet_ids
    endpoint_private_access = true
    endpoint_public_access  = true
    public_access_cidrs     = var.public_access_cidrs
  }

  encryption_config {
    provider {
      key_arn = aws_kms_key.eks.arn
    }
    resources = ["secrets"]
  }

  enabled_cluster_log_types = [
    "api",
    "audit",
    "authenticator",
    "controllerManager",
    "scheduler"
  ]

  depends_on = [
    aws_iam_role_policy_attachment.cluster_policy,
    aws_iam_role_policy_attachment.vpc_resource_controller,
  ]
}
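
The cluster references an IAM role, a KMS key, and two policy attachments that must live alongside it in the module. A minimal sketch of those supporting resources, using the standard AWS-managed policy ARNs:

# modules/eks/iam.tf (supporting resources referenced by the cluster; a minimal sketch)
resource "aws_iam_role" "cluster" {
  name = "${var.cluster_name}-cluster-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "eks.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy_attachment" "cluster_policy" {
  role       = aws_iam_role.cluster.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSClusterPolicy"
}

resource "aws_iam_role_policy_attachment" "vpc_resource_controller" {
  role       = aws_iam_role.cluster.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSVPCResourceController"
}

resource "aws_kms_key" "eks" {
  description         = "Envelope encryption for EKS secrets"
  enable_key_rotation = true
}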

Node Group Configuration

Configure managed node groups with auto-scaling:

resource "aws_eks_node_group" "main" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "${var.cluster_name}-nodes"
  node_role_arn   = aws_iam_role.node_group.arn
  subnet_ids      = var.private_subnet_ids

  instance_types = var.instance_types
  ami_type       = "AL2_x86_64"
  capacity_type  = "ON_DEMAND"

  scaling_config {
    desired_size = var.desired_capacity
    max_size     = var.max_capacity
    min_size     = var.min_capacity
  }

  update_config {
    max_unavailable = 1
  }

  # Ensure node group is created after all IAM policies
  depends_on = [
    aws_iam_role_policy_attachment.node_group_policy,
    aws_iam_role_policy_attachment.cni_policy,
    aws_iam_role_policy_attachment.registry_policy,
  ]
}
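
As with the cluster role, the node role and the three attachments named in depends_on need to exist in the module. A sketch using the AWS-managed worker node policies:

# modules/eks/node-iam.tf (a minimal sketch of the attachments named above)
resource "aws_iam_role" "node_group" {
  name = "${var.cluster_name}-node-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "ec2.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy_attachment" "node_group_policy" {
  role       = aws_iam_role.node_group.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy"
}

resource "aws_iam_role_policy_attachment" "cni_policy" {
  role       = aws_iam_role.node_group.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"
}

resource "aws_iam_role_policy_attachment" "registry_policy" {
  role       = aws_iam_role.node_group.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
}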

Security Best Practices

RBAC Configuration

Implement role-based access control:

# rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: developer-role
rules:
- apiGroups: [""]
  resources: ["pods", "services", "configmaps"]
  verbs: ["get", "list", "create", "update", "delete"]
- apiGroups: ["apps"]
  resources: ["deployments", "replicasets"]
  verbs: ["get", "list", "create", "update", "delete"]

Network Policies

Implement network segmentation:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-ingress
spec:
  podSelector: {}
  policyTypes:
  - Ingress
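
A default-deny policy blocks all pod-to-pod ingress in its namespace, so it is normally paired with explicit allow rules. A sketch that re-admits traffic from pods in the same namespace:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace
spec:
  podSelector: {}
  ingress:
  - from:
    - podSelector: {}
  policyTypes:
  - Ingress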

Monitoring and Observability

Integrate monitoring from the start:

# modules/monitoring/main.tf
resource "helm_release" "prometheus" {
  name       = "prometheus"
  repository = "https://prometheus-community.github.io/helm-charts"
  chart      = "kube-prometheus-stack"
  namespace  = "monitoring"

  create_namespace = true

  values = [
    file("${path.module}/prometheus-values.yaml")
  ]
}
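
The chart reads its overrides from the values file referenced above. A minimal sketch — the retention window and resource requests are illustrative and should be tuned to your workloads:

# modules/monitoring/prometheus-values.yaml (illustrative overrides)
prometheus:
  prometheusSpec:
    retention: 15d
    resources:
      requests:
        cpu: 500m
        memory: 2Gi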

Auto-Scaling Configuration

Configure scaling at both the node and the pod level. The Cluster Autoscaler adds and removes worker nodes in response to pending pods:

resource "helm_release" "cluster_autoscaler" {
  name       = "cluster-autoscaler"
  repository = "https://kubernetes.github.io/autoscaler"
  chart      = "cluster-autoscaler"
  namespace  = "kube-system"

  set {
    name  = "autoDiscovery.clusterName"
    value = var.cluster_name
  }

  set {
    name  = "awsRegion"
    value = var.aws_region
  }
}
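
The Cluster Autoscaler covers the node side; pod-level scaling comes from a HorizontalPodAutoscaler. A sketch targeting a hypothetical web Deployment:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

Resource-based HPAs rely on the metrics-server being installed in the cluster.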

Environment-Specific Configurations

Development Environment

# environments/dev/main.tf
module "kubernetes_cluster" {
  source = "../../modules/eks"

  cluster_name       = "dev-cluster"
  kubernetes_version = "1.28"
  instance_types     = ["t3.medium"]
  desired_capacity   = 2
  min_capacity       = 1
  max_capacity       = 5
}

Production Environment

# environments/production/main.tf
module "kubernetes_cluster" {
  source = "../../modules/eks"

  cluster_name       = "prod-cluster"
  kubernetes_version = "1.28"
  instance_types     = ["m5.large", "m5.xlarge"]
  desired_capacity   = 5
  min_capacity       = 3
  max_capacity       = 20
}
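
Each environment should also keep its own remote state, so a change in dev can never touch production's state file. A sketch of a per-environment backend — the bucket and table names are illustrative:

# environments/production/backend.tf (names are illustrative)
terraform {
  backend "s3" {
    bucket         = "my-org-terraform-state"
    key            = "kubernetes/production.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}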

Deployment Pipeline

Automate deployment with CI/CD:

# .github/workflows/terraform.yml
name: Terraform

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  terraform:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3

    - name: Setup Terraform
      uses: hashicorp/setup-terraform@v2
      with:
        terraform_version: 1.6.0

    # Without cloud credentials the plan/apply steps will fail;
    # these secret names are conventional, not mandated
    - name: Configure AWS credentials
      uses: aws-actions/configure-aws-credentials@v4
      with:
        aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
        aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        aws-region: us-east-1

    - name: Terraform Plan
      run: |
        terraform init
        terraform plan

    - name: Terraform Apply
      if: github.ref == 'refs/heads/main'
      run: terraform apply -auto-approve

Backup and Disaster Recovery

Implement backup strategies:

resource "helm_release" "velero" {
  name       = "velero"
  repository = "https://vmware-tanzu.github.io/helm-charts"
  chart      = "velero"
  namespace  = "velero"

  create_namespace = true

  set {
    name  = "configuration.provider"
    value = "aws"
  }

  set {
    name  = "configuration.backupStorageLocation.bucket"
    value = aws_s3_bucket.velero_backup.bucket
  }
}
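
The release references a bucket that Terraform also needs to manage. A minimal sketch, with versioning enabled so deleted backups remain recoverable:

resource "aws_s3_bucket" "velero_backup" {
  bucket = "${var.cluster_name}-velero-backups"
}

resource "aws_s3_bucket_versioning" "velero_backup" {
  bucket = aws_s3_bucket.velero_backup.id

  versioning_configuration {
    status = "Enabled"
  }
}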

Cost Optimization

Spot Instances

resource "aws_eks_node_group" "spot" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "${var.cluster_name}-spot-nodes"
  node_role_arn   = aws_iam_role.node_group.arn
  subnet_ids      = var.private_subnet_ids

  capacity_type  = "SPOT"
  instance_types = ["m5.large", "m5.xlarge", "m4.large"]

  scaling_config {
    desired_size = 2
    max_size     = 10
    min_size     = 0
  }
}
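
Spot capacity can be reclaimed with only two minutes' notice, so a common pattern is to taint the spot group and let interruption-tolerant workloads opt in via a matching toleration. The taint block belongs inside the aws_eks_node_group resource above; the key and value are illustrative:

  # Inside aws_eks_node_group "spot": only pods tolerating this taint schedule here
  taint {
    key    = "node-type"
    value  = "spot"
    effect = "NO_SCHEDULE"
  }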

Resource Quotas

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: production
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
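
Quotas apply per namespace; once applied you can watch consumption against the limits:

kubectl apply -f resource-quota.yaml   # file name is illustrative
kubectl describe resourcequota compute-quota -n production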

Troubleshooting Common Issues

Node Not Ready

Check these common causes, then narrow things down with the commands below:

  1. Network connectivity
  2. IAM permissions
  3. Security group configurations
  4. Instance capacity
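
A few commands that usually surface the cause (angle-bracketed values are placeholders):

kubectl describe node <node-name>
kubectl get events -A --sort-by=.lastTimestamp
aws eks describe-nodegroup --cluster-name <cluster-name> --nodegroup-name <nodegroup-name>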

Pod Scheduling Issues

Debug with:

kubectl describe pod <pod-name>
kubectl get nodes -o wide
kubectl top nodes

Conclusion

Building scalable Kubernetes infrastructure with Terraform requires careful planning and adherence to best practices. By following the patterns outlined in this guide, you can create robust, secure, and maintainable infrastructure that scales with your organization's needs.

Key takeaways:

  • Use modular Terraform code for reusability
  • Implement security from day one
  • Plan for monitoring and observability
  • Automate everything through CI/CD
  • Consider cost optimization strategies

Have you implemented Kubernetes with Terraform in your organization? What challenges did you face and how did you overcome them? Share your experiences in the comments below.