Terraform State Management Across AWS, Azure, and GCP
In 2022 I was managing Terraform configurations for infrastructure on all three major cloud providers simultaneously — different clients, different stacks, but I was the common thread maintaining all of it. The thing I kept coming back to was state. Not the actual infrastructure, not the provider APIs, but how you store, protect, and isolate your Terraform state files. Get this wrong and you will eventually destroy something you didn't mean to. Get it right and it becomes invisible infrastructure that just works.
Why This Matters More With Multiple Clouds
When you're working across clouds, you're constantly context-switching. The directory structure, the variable names, the provider configs — they all look slightly different. In that environment, an accidental terraform apply in the wrong directory or against the wrong workspace can destroy production infrastructure in a cloud account you weren't even thinking about. State isolation is your safeguard. If the state for your AWS production VPC lives somewhere completely separate from your GCP staging cluster, you can't accidentally conflate them.
Beyond accidents, separate state backends also mean separate blast radii. A corrupted state file in one environment doesn't affect another. A locked state from a dead CI job in one stack doesn't block deployments to another.
Backend Configuration Per Cloud
Each cloud has a native backend. Here's what all three look like.
AWS S3:
```hcl
terraform {
  backend "s3" {
    bucket         = "my-terraform-state-prod"
    key            = "networking/vpc/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}
```
Azure Blob Storage:
```hcl
terraform {
  backend "azurerm" {
    resource_group_name  = "terraform-state-rg"
    storage_account_name = "mytfstateaccount"
    container_name       = "tfstate"
    key                  = "prod/networking/terraform.tfstate"
  }
}
```
GCP Cloud Storage:
```hcl
terraform {
  backend "gcs" {
    bucket = "my-terraform-state-prod"
    prefix = "networking/vpc"
  }
}
```
Authentication to these backends in CI is handled differently than your provider authentication. For S3, the CI runner needs an IAM role with s3:GetObject, s3:PutObject, s3:DeleteObject, and dynamodb:GetItem, dynamodb:PutItem, dynamodb:DeleteItem on the lock table. For Azure, the service principal needs Storage Blob Data Contributor on the storage account. For GCS, the service account needs storage.objects.create, storage.objects.get, and storage.objects.delete on the bucket.
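As a concrete sketch of the AWS side, those S3 and DynamoDB permissions can be expressed as an IAM policy. This is a minimal sketch, not a hardened policy: the bucket and table names come from the example configs above, the account ID is a placeholder, and in practice the backend also needs s3:ListBucket on the bucket itself:

```hcl
# Sketch of the state-access policy for a CI role using the S3 backend above.
# Bucket, table, and account ID are placeholder values; adjust to your setup.
resource "aws_iam_policy" "terraform_state" {
  name = "terraform-state-access" # hypothetical name

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid      = "StateObjects"
        Effect   = "Allow"
        Action   = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"]
        Resource = "arn:aws:s3:::my-terraform-state-prod/*"
      },
      {
        Sid      = "StateBucketList"
        Effect   = "Allow"
        Action   = ["s3:ListBucket"]
        Resource = "arn:aws:s3:::my-terraform-state-prod"
      },
      {
        Sid      = "StateLock"
        Effect   = "Allow"
        Action   = ["dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:DeleteItem"]
        Resource = "arn:aws:dynamodb:us-east-1:123456789012:table/terraform-state-lock"
      }
    ]
  })
}
```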
Workspace Strategy
Terraform workspaces give you environment separation within a single backend. terraform workspace new staging creates a separate state path in the same bucket. The state key is prefixed automatically: env:/staging/networking/vpc/terraform.tfstate for S3.
I use workspaces for simple environment separation when dev, staging, and prod are genuinely identical except for sizing and naming. For a single-region application with three environments that share the same topology, workspaces are clean and low-overhead.
I stop using workspaces when environments start diverging structurally. If prod has multi-region failover and dev is single-region, or if prod has a WAF and dev doesn't, workspace selection becomes a footgun — you might apply a plan generated in the prod workspace to infra that doesn't exist in staging. At that point, separate directories (or separate repos) with their own backends are the safer choice. The rule I use: workspaces for configuration differences, separate directories for topology differences.
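A minimal sketch of the "configuration differences" case: keying sizing and naming off terraform.workspace so one configuration serves all three environments. The instance types and the "private" subnet reference here are assumptions for illustration:

```hcl
locals {
  # Per-workspace sizing; the topology is identical across environments.
  instance_type = {
    dev     = "t3.micro"
    staging = "t3.medium"
    prod    = "m5.large"
  }[terraform.workspace]

  name_prefix = "app-${terraform.workspace}"
}

resource "aws_instance" "app" {
  ami           = var.ami_id
  instance_type = local.instance_type
  subnet_id     = aws_subnet.private.id

  tags = {
    Name        = "${local.name_prefix}-web"
    Environment = terraform.workspace
  }
}
```

The moment you find yourself writing count = terraform.workspace == "prod" ? 1 : 0 around whole subsystems, that's the topology divergence signal: switch to separate directories.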
Cross-Stack References With terraform_remote_state
When your network stack creates a VPC and your application stack needs that VPC ID, don't hardcode it. Use terraform_remote_state to read it directly from the networking stack's state file:
```hcl
data "terraform_remote_state" "networking" {
  backend = "s3"

  config = {
    bucket = "my-terraform-state-prod"
    key    = "networking/vpc/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_instance" "app" {
  subnet_id = data.terraform_remote_state.networking.outputs.private_subnet_id
  # ...
}
```
The networking stack just needs to declare the outputs block. This creates an explicit, auditable dependency between stacks rather than a value that got copy-pasted into a variable file and then never updated.
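On the producer side, the networking stack would declare something like the following, with the output name matching what the consumer reads (the aws_subnet.private reference assumes a subnet named "private" in that stack):

```hcl
# In the networking stack: expose the subnet ID for downstream stacks.
output "private_subnet_id" {
  value       = aws_subnet.private.id
  description = "ID of the private subnet, consumed by the application stack"
}
```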
The downside: it couples the stacks tightly. If the networking stack output disappears or is renamed, the application stack breaks on the next plan. For more decoupled architectures, reading the value from SSM Parameter Store (AWS), Key Vault (Azure), or Secret Manager (GCP) gives you the same cross-stack sharing without the state file dependency.
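On AWS, the decoupled variant looks roughly like this: the networking stack publishes the value to SSM Parameter Store, and the application stack reads it back without ever opening the networking state file. The parameter path is a hypothetical naming convention:

```hcl
# Networking stack: publish the subnet ID to Parameter Store.
resource "aws_ssm_parameter" "private_subnet_id" {
  name  = "/infra/prod/networking/private_subnet_id" # hypothetical path convention
  type  = "String"
  value = aws_subnet.private.id
}

# Application stack (separate configuration): read it back by name.
data "aws_ssm_parameter" "private_subnet_id" {
  name = "/infra/prod/networking/private_subnet_id"
}

resource "aws_instance" "app" {
  subnet_id = data.aws_ssm_parameter.private_subnet_id.value
  # ...
}
```

The consumer now depends only on the parameter path, not on the producer's backend location or output names.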
The CI/CD Pipeline Pattern
This is the pattern I use for every Terraform project, and I don't deviate from it:
1. terraform fmt -check — fails the pipeline if formatting is wrong, no exceptions
2. terraform validate — catches syntax errors and obvious config problems
3. terraform plan -out=plan.tfplan — generates and saves the plan
4. Human review of the plan output — required before apply
5. terraform apply plan.tfplan — applies exactly the reviewed plan
The -out flag is important. If you run terraform plan and then terraform apply without the saved plan, Terraform generates a new plan at apply time. In a fast-moving environment, the plan you reviewed and the plan that actually runs might differ. Saving the plan file and applying it ensures you're applying exactly what you approved. In CI, the plan file is saved as a build artifact and the apply job downloads it rather than regenerating.
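In shell terms, the plan/apply split looks roughly like this; how the plan file travels between jobs (artifact upload/download) is CI-specific, so here it's just a file:

```
# Plan job: generate and save the plan, then render it for review.
terraform fmt -check
terraform validate
terraform plan -out=plan.tfplan
terraform show plan.tfplan     # human-readable output for the reviewer

# Apply job (after approval): apply exactly the reviewed plan file.
terraform apply plan.tfplan
```

Note that plan files can contain sensitive values in cleartext, so treat the artifact with the same care as the state file itself.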
Never skip the review step. I know it slows things down. It's still the right call.
State Locking
All three backends support locking, but they implement it differently.
AWS S3 uses DynamoDB. This is the older pattern — the newer S3 native locking (the use_lockfile backend setting, available in Terraform 1.10+) stores the lock in the S3 bucket itself using conditional writes, and you can drop the DynamoDB table. If you're using the DynamoDB pattern, the table needs a partition key named LockID of type String, and that's it.
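With a recent enough Terraform, the DynamoDB-free variant of the earlier S3 backend would look like this (use_lockfile requires Terraform 1.10 or newer):

```hcl
terraform {
  backend "s3" {
    bucket       = "my-terraform-state-prod"
    key          = "networking/vpc/terraform.tfstate"
    region       = "us-east-1"
    encrypt      = true
    use_lockfile = true # S3 native locking; no dynamodb_table needed
  }
}
```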
Azure Blob Storage uses blob leases built into the storage API. No additional setup needed.
GCS uses object generation conditions — a built-in GCS feature. Also no additional setup.
All three work the same way from Terraform's perspective: a terraform apply that starts while another is running will fail with a lock error. This prevents two CI jobs from applying simultaneously and producing a corrupted state.
When a CI job dies mid-apply and leaves a stale lock, you'll see an error like:
```
Error: Error locking state: Error acquiring the state lock

Lock Info:
  ID: 12345678-1234-1234-1234-123456789012
  ...
```
After confirming the lock is genuinely stale (no apply is actually running), release it:
```
terraform force-unlock 12345678-1234-1234-1234-123456789012
```
Verify the infrastructure is in a consistent state before running another apply. Don't just unlock and immediately re-apply without checking.
The -target Trap
terraform apply -target=aws_instance.web applies changes only to a specific resource, ignoring everything else in your configuration. It sounds useful for quick fixes. It's almost always a mistake.
The problem: after a targeted apply, your state is partially updated. Your configuration says one thing about the rest of your infrastructure, your state says another. Subsequent terraform plan runs may show unexpected diffs because Terraform is now working from a state that doesn't fully reflect the last full apply. You've introduced drift between your declared configuration and your actual state.
I use -target exactly once: emergency mitigation when something is actively on fire and I need to bring one resource back to a known state right now. Even then, I follow up with a full terraform plan and apply as soon as the incident is resolved to re-sync the state.
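As commands, that incident sequence looks like this (the resource address is illustrative):

```
# Emergency only: fix the one resource that's actively on fire.
terraform apply -target=aws_instance.web

# As soon as the incident is resolved: full plan and apply to re-sync
# state with the complete configuration.
terraform plan -out=plan.tfplan
terraform apply plan.tfplan
```

If the follow-up plan shows diffs beyond the targeted resource, that's the drift the targeted apply introduced; review it carefully before applying.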
Provider Authentication in CI
This deserves its own post, but briefly: no long-lived credentials in CI.
For AWS, configure the Jenkins agent (or GitHub Actions runner) to assume an IAM role via OIDC, or use the EC2 instance profile if the Jenkins agent runs on EC2. No AWS_ACCESS_KEY_ID in environment variables.
For Azure, use a service principal with federated credentials tied to your CI system's OIDC provider. The service principal gets Contributor on the relevant subscriptions. Set ARM_USE_OIDC=true in the environment (or use_oidc = true in the provider block).
For GCP, use Workload Identity Federation. The CI provider authenticates via OIDC and impersonates a GCP service account. No JSON key files.
The pattern is the same across all three clouds: the CI system proves its identity via a short-lived OIDC token and exchanges it for a cloud-provider credential scoped to exactly what it needs. Static credentials that can be exfiltrated and used outside your CI system are a category of risk you can eliminate entirely.
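Taking the Azure case as an example, the provider side of that exchange is a one-line change; the IDs are supplied by the CI system, not hardcoded:

```hcl
provider "azurerm" {
  features {}

  use_oidc = true # exchange the CI runner's OIDC token for Azure credentials
  # client_id, tenant_id, and subscription_id are typically supplied via the
  # ARM_CLIENT_ID, ARM_TENANT_ID, and ARM_SUBSCRIPTION_ID environment
  # variables set by the CI job; no client secret appears anywhere.
}
```

The AWS and GCP providers have equivalent switches (role assumption via web identity, and Workload Identity Federation credential configs, respectively); the HCL stays free of secrets in all three.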
