Blog
September 20, 2022 Marie H.

Terraform for Azure: Patterns That Work at Scale

Azure is my least favorite cloud to write Terraform for. That's not a complaint about the platform itself — it's functional, has solid enterprise features, and the networking model is reasonable. But the azurerm provider has historically lagged behind Azure's actual feature releases by months, the resource model has quirks that bite you the first time (and occasionally the second and third), and the access control model has evolved in ways that left inconsistent patterns across resources. After spending most of this year managing multi-subscription Azure infrastructure with Terraform, here's what I've actually learned.

Provider Authentication

The azurerm provider needs a tenant_id, a subscription_id, and credentials for a service principal (traditionally a client_id plus a client_secret). In development you can run az login and the provider will pick up your local credentials automatically. In CI you need explicit values.

The wrong way is to put a long-lived service principal secret in your CI environment variables. These secrets don't expire unless you rotate them, they can be exfiltrated from CI logs if you're not careful, and they're not scoped to a single pipeline run. The right way is federated credentials.

For GitHub Actions:

provider "azurerm" {
  features {}

  client_id       = var.azure_client_id
  tenant_id       = var.azure_tenant_id
  subscription_id = var.azure_subscription_id
  use_oidc        = true
}

With use_oidc = true, the provider exchanges a GitHub-issued OIDC token for short-lived Azure credentials. No client_secret needed. The service principal needs a federated identity credential configured in Azure AD that trusts your GitHub repository and branch. In Azure DevOps, the same pattern works via service connections with workload identity federation.
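For reference, creating that federated identity credential with the az CLI looks roughly like this; the app ID, repository, and branch are placeholders for your own values:

```shell
# Create a federated identity credential on the app registration backing the
# service principal. APP_ID, my-org/my-repo, and the branch are placeholders.
az ad app federated-credential create \
  --id "$APP_ID" \
  --parameters '{
    "name": "github-main",
    "issuer": "https://token.actions.githubusercontent.com",
    "subject": "repo:my-org/my-repo:ref:refs/heads/main",
    "audiences": ["api://AzureADTokenExchange"]
  }'
```

The subject claim must match exactly what GitHub puts in the OIDC token, so a workflow running from a different branch (or from a pull request) needs its own credential entry.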

For the backend auth, you can either use the same service principal or a separate one with narrower permissions — just Storage Blob Data Contributor on the state storage account rather than broad subscription access.
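As a sketch, that narrower backend-only grant can itself be managed in Terraform; the storage account reference and the CI principal variable here are placeholders for your own:

```hcl
resource "azurerm_role_assignment" "state_blob_access" {
  scope                = azurerm_storage_account.tfstate.id
  role_definition_name = "Storage Blob Data Contributor"
  principal_id         = var.ci_principal_object_id # object ID of the CI service principal
}
```

Pair this with use_azuread_auth = true in the backend block so Terraform authenticates to the storage account with Azure AD rather than a storage access key.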

The Azure Resource Model

In AWS, resources mostly exist independently at the account level. In Azure, almost everything lives inside a Resource Group. Resource Groups are mandatory, and they matter for Terraform structure.

My convention: one Resource Group per logical stack, with all related resources in the same group. A networking stack gets a networking-rg, an application stack gets myapp-prod-rg. This aligns with how Azure handles resource lifecycle — deleting a Resource Group deletes everything in it, which is either very useful or very dangerous depending on whether you did it intentionally.

resource "azurerm_resource_group" "app" {
  name     = "myapp-prod-rg"
  location = var.location
  tags     = local.common_tags
}

Naming conventions matter in Azure more than in AWS. Storage account names must be globally unique, 3-24 characters, lowercase alphanumeric only — no hyphens. Key Vault names must be globally unique within Azure. Container Registry names must be globally unique. Build your naming convention accounting for this before you start creating resources; retrofitting it later requires terraform import or terraform state mv, neither of which is fun.
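One way to bake those constraints in from the start is a locals block that derives the restricted name variants once; the concrete names here are illustrative:

```hcl
locals {
  prefix = "myapp-prod"

  # Storage accounts and ACR disallow hyphens, so derive a stripped,
  # lowercase variant once and reuse it everywhere it's required.
  prefix_nodash = lower(replace(local.prefix, "-", ""))

  common_tags = {
    environment = "prod"
    managed_by  = "terraform"
  }
}
```

Every resource then interpolates from local.prefix or local.prefix_nodash, and a rename is a one-line change plus the state moves it implies.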

Core Networking Resources

A standard networking setup looks like this:

resource "azurerm_virtual_network" "main" {
  name                = "${local.prefix}-vnet"
  resource_group_name = azurerm_resource_group.app.name
  location            = azurerm_resource_group.app.location
  address_space       = ["10.0.0.0/16"]
}

resource "azurerm_subnet" "app" {
  name                 = "app-subnet"
  resource_group_name  = azurerm_resource_group.app.name
  virtual_network_name = azurerm_virtual_network.main.name
  address_prefixes     = ["10.0.1.0/24"]
}

resource "azurerm_network_security_group" "app" {
  name                = "${local.prefix}-app-nsg"
  resource_group_name = azurerm_resource_group.app.name
  location            = azurerm_resource_group.app.location

  security_rule {
    name                       = "allow-https"
    priority                   = 100
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "443"
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }
}

resource "azurerm_subnet_network_security_group_association" "app" {
  subnet_id                 = azurerm_subnet.app.id
  network_security_group_id = azurerm_network_security_group.app.id
}

The NSG-to-subnet association is a separate resource, which surprises people coming from AWS where you attach a security group directly to a resource. In Azure, the NSG attaches to the subnet or the NIC, not to the VM resource itself.
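For completeness, the NIC-level attachment is its own association resource as well; the network interface referenced here is hypothetical:

```hcl
resource "azurerm_network_interface_security_group_association" "app" {
  network_interface_id      = azurerm_network_interface.app.id
  network_security_group_id = azurerm_network_security_group.app.id
}
```

Subnet-level NSGs are usually the better default; NIC-level attachments are for the cases where one VM in a subnet genuinely needs different rules.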

Azure Key Vault and the Access Model

Key Vault has two access models: the legacy vault access policy model and the newer RBAC model. Use RBAC.

resource "azurerm_key_vault" "main" {
  name                       = "${local.prefix}-kv"
  resource_group_name        = azurerm_resource_group.app.name
  location                   = azurerm_resource_group.app.location
  tenant_id                  = data.azurerm_client_config.current.tenant_id
  sku_name                   = "standard"
  enable_rbac_authorization  = true
  purge_protection_enabled   = true
  soft_delete_retention_days = 90
}

resource "azurerm_role_assignment" "kv_secrets_user" {
  scope                = azurerm_key_vault.main.id
  role_definition_name = "Key Vault Secrets User"
  principal_id         = azurerm_linux_virtual_machine.app.identity[0].principal_id
}

With enable_rbac_authorization = true, access is granted via Azure RBAC role assignments rather than vault access policies. Role assignments integrate with your existing IAM model and are managed the same way as any other Azure RBAC assignment — you don't need a separate code path for Key Vault permissions. The access policy model predates RBAC and is effectively the legacy path now.

To write secrets during infrastructure provisioning:

resource "azurerm_key_vault_secret" "db_password" {
  name         = "db-password"
  value        = random_password.db.result
  key_vault_id = azurerm_key_vault.main.id
}

The Terraform service principal needs Key Vault Secrets Officer on the vault to create secrets. The application's managed identity gets Key Vault Secrets User — read-only. Principle of least privilege applies here as much as anywhere.
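If the Terraform service principal is also the identity running the apply, a sketch of granting it the Officer role using the current client config (declared once per configuration):

```hcl
data "azurerm_client_config" "current" {}

resource "azurerm_role_assignment" "kv_secrets_officer" {
  scope                = azurerm_key_vault.main.id
  role_definition_name = "Key Vault Secrets Officer"
  # object_id of whatever principal is running this Terraform apply
  principal_id         = data.azurerm_client_config.current.object_id
}
```

One caveat: role assignments take a short time to propagate, so creating the assignment and writing a secret in the same apply can occasionally fail on the first run and succeed on retry.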

AKS and ACR: Container Infrastructure

The common pattern for containerized workloads on Azure:

resource "azurerm_container_registry" "main" {
  name                = "${local.prefix_nodash}acr"
  resource_group_name = azurerm_resource_group.app.name
  location            = azurerm_resource_group.app.location
  sku                 = "Standard"
  admin_enabled       = false
}

resource "azurerm_kubernetes_cluster" "main" {
  name                = "${local.prefix}-aks"
  resource_group_name = azurerm_resource_group.app.name
  location            = azurerm_resource_group.app.location
  dns_prefix          = local.prefix

  default_node_pool {
    name       = "default"
    node_count = 3
    vm_size    = "Standard_D2s_v3"
  }

  identity {
    type = "SystemAssigned"
  }
}

resource "azurerm_role_assignment" "aks_acr_pull" {
  principal_id                     = azurerm_kubernetes_cluster.main.kubelet_identity[0].object_id
  role_definition_name             = "AcrPull"
  scope                            = azurerm_container_registry.main.id
  skip_service_principal_aad_check = true
}

The azurerm_role_assignment for AcrPull uses the cluster's kubelet_identity — the managed identity that the kubelet uses when pulling images. This is different from the cluster's own managed identity (identity[0].principal_id). The kubelet identity is what actually authenticates to ACR when pulling images for pods. Getting the wrong identity here means pods fail to start with image pull errors and the error message doesn't make this distinction obvious.
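A pair of outputs makes the distinction easy to inspect when debugging image pull failures; the output names are illustrative:

```hcl
output "cluster_identity_principal_id" {
  description = "Control-plane managed identity (used for the cluster's own role assignments)"
  value       = azurerm_kubernetes_cluster.main.identity[0].principal_id
}

output "kubelet_identity_object_id" {
  description = "Kubelet managed identity (what actually authenticates to ACR for image pulls)"
  value       = azurerm_kubernetes_cluster.main.kubelet_identity[0].object_id
}
```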

skip_service_principal_aad_check = true avoids the check that normally ensures the principal exists in AAD before creating the assignment. It's needed here because there's a propagation delay for newly created managed identities.

State in Azure Blob Storage

terraform {
  backend "azurerm" {
    resource_group_name  = "terraform-state-rg"
    storage_account_name = "mytfstate"
    container_name       = "tfstate"
    key                  = "prod/myapp/terraform.tfstate"
  }
}

The storage account, resource group, and container must exist before terraform init can run. This is a bootstrapping problem — you can't use Terraform to create the state storage until you have somewhere to store the state. The solution is a small, separate bootstrap script or a manual one-time setup. I use a short az cli script committed to the repo:

#!/bin/bash
az group create --name terraform-state-rg --location eastus
az storage account create \
  --name mytfstate \
  --resource-group terraform-state-rg \
  --sku Standard_LRS \
  --encryption-services blob
az storage container create \
  --name tfstate \
  --account-name mytfstate

Run this once, manually, then never again. The state storage is intentionally outside the Terraform lifecycle.
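One optional hardening step for the bootstrap, assuming current az CLI behavior: enable blob versioning on the state account, so a corrupted or bad state file can be rolled back to an earlier version:

```shell
# Keep prior versions of every state blob so a bad apply is recoverable
az storage account blob-service-properties update \
  --account-name mytfstate \
  --resource-group terraform-state-rg \
  --enable-versioning true
```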

The azapi Provider

Azure regularly releases features that the azurerm provider doesn't support yet. The azapi provider lets you use any Azure REST API directly:

resource "azapi_resource" "aks_maintenance_config" {
  type      = "Microsoft.ContainerService/managedClusters/maintenanceConfigurations@2022-07-01"
  name      = "default"
  parent_id = azurerm_kubernetes_cluster.main.id

  body = jsonencode({
    properties = {
      timeInWeek = [
        {
          day       = "Sunday"
          hourSlots = [0, 1, 2, 3]
        }
      ]
    }
  })
}

This is verbose and you lose the ergonomics of the azurerm provider — no computed attributes, no plan-time validation, just raw JSON in and raw JSON out. I use azapi only when azurerm genuinely doesn't have the resource or property yet. Once azurerm adds support, I migrate away from the azapi resource. The azapi_update_resource type is particularly useful for adding properties to existing azurerm-managed resources without replacing the resource — you can add a new API property to a resource that azurerm doesn't expose yet without having to fork the whole resource definition.
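As a sketch of azapi_update_resource, here's layering a property onto the ACR from earlier; treat the API version and property name as assumptions to verify against the current REST spec for your resource:

```hcl
resource "azapi_update_resource" "acr_anonymous_pull" {
  type        = "Microsoft.ContainerRegistry/registries@2022-02-01-preview"
  resource_id = azurerm_container_registry.main.id

  body = jsonencode({
    properties = {
      # Hypothetical example: a property azurerm didn't surface at the time
      anonymousPullEnabled = true
    }
  })
}
```

The registry itself stays managed by azurerm; azapi_update_resource only patches the one property on top of it.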

Import and Drift

Bringing existing Azure resources under Terraform management is a fact of life when you take over an environment. The classical way:

terraform import azurerm_resource_group.app /subscriptions/SUBSCRIPTION_ID/resourceGroups/myapp-prod-rg

The newer import block syntax (Terraform 1.5+) is cleaner for documenting what you imported and when:

import {
  to = azurerm_resource_group.app
  id = "/subscriptions/SUBSCRIPTION_ID/resourceGroups/myapp-prod-rg"
}

Run terraform plan after importing to see the diff between the imported state and your written configuration. There will almost always be a diff — properties you didn't specify in your config that have non-default values in the real resource. Work through the diff, add the missing properties to your config, and re-run plan until it's clean. Only then is the resource actually under Terraform management in a meaningful sense. An imported resource with a dirty plan is a resource that will be modified on the next apply in ways you may not expect.
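A related pattern: when drift is expected because some other system legitimately mutates a property (Azure Policy stamping tags is the classic case), ignore_changes keeps the plan clean rather than fighting it. Extending the earlier resource group as an illustration:

```hcl
resource "azurerm_resource_group" "app" {
  name     = "myapp-prod-rg"
  location = var.location
  tags     = local.common_tags

  lifecycle {
    # Tags written by Azure Policy would otherwise show as drift on every plan
    ignore_changes = [tags]
  }
}
```

Use this sparingly and only for properties you've confirmed are owned by another system; ignore_changes on anything else just hides real drift.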