GCP IAM Hardening: What We Actually Did
IAM hardening is one of those tasks that organizations know they should do and consistently deprioritize until something forces the issue. For us, the forcing function was the security hardening initiative that followed a broader audit — I led the GCP side of it. This post covers the concrete steps we took, the commands that matter, and the Terraform patterns that make it sustainable.
Starting With the IAM Audit
The first thing you need is a complete picture of what exists. Export the full IAM policy for each project:
gcloud projects get-iam-policy PROJECT_ID --format=json > iam-policy-PROJECT_ID.json
If you have multiple projects (and you do), script this across all projects in the organization:
mkdir -p iam-audit
gcloud projects list --format="value(projectId)" | while read -r PROJECT_ID; do
  gcloud projects get-iam-policy "$PROJECT_ID" --format=json > "iam-audit/${PROJECT_ID}.json"
done
Then read through the output. What you're looking for:
Service accounts with roles/owner or roles/editor. These are broad project-level roles that grant near-unlimited access. Service accounts almost never need these — they need specific permissions for specific resources. Every service account with roles/owner is a potential lateral movement vector if the key is compromised.
User accounts with project-level roles that should be resource-scoped. A user with roles/storage.objectAdmin at the project level can read and write every GCS bucket in the project. If they need access to one bucket for a specific job function, that's where the binding should be.
Unused service accounts. Service accounts that were created for a project or deployment that no longer exists, still sitting in the IAM policy. Each one is an attack surface — a compromised key for an account nobody is monitoring gives an attacker persistent access.
In our audit, we found 11 service accounts with roles/owner or roles/editor, 3 of which appeared to be unused. We found users with project-level storage access who needed bucket-level access. Typical findings — not extreme, but not clean either.
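A jq pass over the exported policies surfaces the first category quickly. This is a sketch: the iam-audit/ directory matches the export loop above, and the helper name is mine:

```shell
# find_broad_roles FILE: print service accounts bound to roles/owner or
# roles/editor in one exported IAM policy (helper name is illustrative)
find_broad_roles() {
  jq -r '.bindings[]
         | select(.role == "roles/owner" or .role == "roles/editor") as $b
         | $b.members[]
         | select(startswith("serviceAccount:"))
         | "\($b.role)  \(.)"' "$1"
}

# Run it across the audit exports:
# for f in iam-audit/*.json; do echo "== $f"; find_broad_roles "$f"; done
```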
Least Privilege in Practice: Project-Level to Resource-Level
The pattern for moving from project-level to resource-level bindings is straightforward in concept, tedious in execution, because you have to work out which resources each principal actually needs.
The Terraform change looks like this:
# Before: project-level binding
resource "google_project_iam_member" "storage_admin" {
  project = var.project_id
  role    = "roles/storage.objectAdmin"
  member  = "serviceAccount:my-service@PROJECT_ID.iam.gserviceaccount.com"
}

# After: resource-level binding on a specific bucket
resource "google_storage_bucket_iam_member" "storage_admin" {
  bucket = google_storage_bucket.my_bucket.name
  role   = "roles/storage.objectAdmin"
  member = "serviceAccount:my-service@PROJECT_ID.iam.gserviceaccount.com"
}
If the service account needs access to multiple specific buckets, you add multiple google_storage_bucket_iam_member resources. That's fine. It's more explicit, and explicit is what you want — you can see exactly what has access to what.
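Once the bucket list grows past a couple of entries, a for_each keeps it to one resource block. A sketch, assuming a list variable; the variable and bucket names are illustrative:

```hcl
# One binding per bucket, driven by a list variable (names are illustrative)
variable "scoped_buckets" {
  type    = list(string)
  default = ["bucket-a", "bucket-b"]
}

resource "google_storage_bucket_iam_member" "storage_admin" {
  for_each = toset(var.scoped_buckets)
  bucket   = each.value
  role     = "roles/storage.objectAdmin"
  member   = "serviceAccount:my-service@PROJECT_ID.iam.gserviceaccount.com"
}
```

Each binding still shows up individually in plan output, so reviewers keep the explicit view of what has access to what.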
The operational challenge: some of these bindings were added manually in the console months or years ago, so there's no Terraform state for them. You're working from the IAM audit export to reconstruct what should be scoped to what, while trying to avoid breaking running services. The approach that worked: add the new resource-level binding first, verify the service is still functioning, then remove the project-level binding in a separate commit. Two steps, not one.
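When reconstructing what a principal currently holds, a small helper against the audit export saves a lot of console clicking. A sketch; the helper name is mine:

```shell
# roles_for_member FILE MEMBER: list every role bound to MEMBER in an
# exported project IAM policy (helper name is illustrative)
roles_for_member() {
  jq -r --arg m "$2" '.bindings[] | select(.members | index($m)) | .role' "$1"
}

# Usage:
# roles_for_member iam-audit/my-project.json \
#   "serviceAccount:my-service@PROJECT_ID.iam.gserviceaccount.com"
```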
Workload Identity Migration
Key files for GKE service accounts are the old pattern. Workload Identity is the right approach: GKE pods assume GCP service account identities through a Kubernetes service account binding, no key file involved. No key means no rotation burden, no risk of a key file leaking through a misconfigured ConfigMap or getting committed to git.
Audit what's still using key files:
# Find Kubernetes secrets that contain service account keys
# (assumes the conventional service-account.json key name)
kubectl get secrets --all-namespaces -o json | \
  jq '.items[] | select(.data["service-account.json"] != null) |
      {namespace: .metadata.namespace, name: .metadata.name}'

# Find pods that mount secrets as volumes (candidates for key-file usage)
kubectl get pods --all-namespaces -o json | \
  jq '.items[] | select(any(.spec.volumes[]?; .secret != null)) |
      {namespace: .metadata.namespace, name: .metadata.name,
       secrets: [.spec.volumes[] | .secret.secretName // empty]}'
For each one you find, the migration to Workload Identity follows a standard pattern. On the GCP side:
# Allow the Kubernetes service account to impersonate the GCP service account
gcloud iam service-accounts add-iam-policy-binding \
  GCP_SA_EMAIL \
  --role="roles/iam.workloadIdentityUser" \
  --member="serviceAccount:PROJECT_ID.svc.id.goog[NAMESPACE/KSA_NAME]"
On the Kubernetes side, annotate the service account:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-ksa
  namespace: my-namespace
  annotations:
    iam.gke.io/gcp-service-account: GCP_SA_EMAIL
After migration, delete the key-file secret and the pod spec reference to it, then kubectl exec into the pod and confirm that GCP credentials are served by the metadata server rather than read from a mounted file.
Service Account Key Hygiene
Even after Workload Identity migration, some service accounts still have keys — for external systems that can't use Workload Identity, for local development, for CI/CD in some configurations. Those keys need active management.
Audit key ages across all service accounts:
gcloud iam service-accounts list --format="value(email)" --project=PROJECT_ID | \
while read -r SA_EMAIL; do
  echo "== $SA_EMAIL"
  gcloud iam service-accounts keys list \
    --iam-account="$SA_EMAIL" \
    --format="table[box](name.basename(), validAfterTime, validBeforeTime)" \
    2>/dev/null
done
Keys older than 90 days get rotated or deleted. The "deleted" option applies if the service account isn't actively being used — an old key for an unused service account is pure risk.
We added this check to our security pipeline as a weekly scan. If a key older than 90 days is found, a ticket is automatically created and assigned to the service account owner. Automate the audit; don't rely on someone remembering to run the command.
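The 90-day threshold is easy to compute from the validAfterTime field. A sketch using GNU date; the function name is mine, and the commented usage assumes the listing loop above:

```shell
# key_age_days RFC3339_TIMESTAMP: print the key's age in whole days
# (requires GNU date for the -d flag)
key_age_days() {
  echo $(( ($(date +%s) - $(date -d "$1" +%s)) / 86400 ))
}

# Feed it from the key listing (value() format gives clean fields):
# gcloud iam service-accounts keys list --iam-account="$SA_EMAIL" \
#   --format="value(name.basename(), validAfterTime)" |
# while read -r KEY CREATED; do
#   [ "$(key_age_days "$CREATED")" -gt 90 ] && echo "ROTATE: $SA_EMAIL $KEY"
# done
```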
GCS Bucket Policy Audit
Public GCS buckets are a recurring source of data exposure incidents. The audit is simple:
# Check IAM policy for a specific bucket
gsutil iam get gs://BUCKET_NAME
# Find all buckets with allUsers or allAuthenticatedUsers bindings
for BUCKET in $(gsutil ls); do
  POLICY=$(gsutil iam get "$BUCKET" 2>/dev/null)
  if echo "$POLICY" | grep -q "allUsers\|allAuthenticatedUsers"; then
    echo "PUBLIC BUCKET: $BUCKET"
    echo "$POLICY"
  fi
done
allUsers is fully public — no authentication required, readable by anyone on the internet. allAuthenticatedUsers means any authenticated Google account — also effectively public, since anyone can create a Google account. Neither is acceptable unless you're intentionally hosting public static assets, in which case it should be documented and scoped to roles/storage.objectViewer only.
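The grep above works, but it can false-positive on member or role names that merely contain the string. A jq filter over the same gsutil iam get output is more precise. A sketch; the helper name is mine:

```shell
# public_bindings: read a bucket IAM policy (JSON, as printed by
# `gsutil iam get`) on stdin and report roles granted to public members
public_bindings() {
  jq -r '.bindings[]? as $b
         | $b.members[]
         | select(. == "allUsers" or . == "allAuthenticatedUsers")
         | "\(.) -> \($b.role)"'
}

# Usage:
# gsutil iam get gs://BUCKET_NAME | public_bindings
```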
When you find unintentional public access:
# Remove public access
gsutil iam ch -d allUsers:objectViewer gs://BUCKET_NAME
gsutil iam ch -d allAuthenticatedUsers:objectViewer gs://BUCKET_NAME
# Or enforce at the project level to prevent future public buckets
gcloud resource-manager org-policies enable-enforce \
  constraints/storage.publicAccessPrevention --project=PROJECT_ID
That last command enables the storage.publicAccessPrevention org policy, the control that actually prevents this class of problem. Enforce it at the project or organization level, and new buckets can't be made public even if someone tries.
VPC Service Controls
For sensitive data, VPC Service Controls add a perimeter that restricts which identities and network paths can access GCP APIs — even if those identities have IAM access. The scenario VPC-SC addresses is insider exfiltration or compromised-credential exfiltration: if a credential is stolen, the attacker can only use it from inside the VPC perimeter, not from an arbitrary external location.
This is not a first-day control — it requires careful configuration and testing because it can break legitimate access patterns if not scoped correctly. But for production buckets containing PII, financial data, or secrets, it's worth the operational overhead.
Terraform State and Drift Detection
All IAM changes go through Terraform. This is the rule with no exceptions. When IAM is managed in Terraform, the state is authoritative — if something is in the IAM policy but not in Terraform, that's drift, and drift gets reviewed.
The enforcement mechanism is terraform plan in CI. Every change to IAM-related Terraform files runs a plan in the pull request, and the plan output is posted as a PR comment. Reviewers can see exactly what IAM bindings are being added or removed.
For drift detection, we run terraform plan as a scheduled job nightly against production. Any drift — resources that exist in GCP but not in Terraform, or resources in Terraform that no longer exist in GCP — generates an alert. Most drift is innocuous (a service auto-created a service account that needs to be imported), but some drift is a signal that someone made a manual change outside the review process.
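terraform plan has a flag built for exactly this job: -detailed-exitcode makes it exit 0 on a clean plan, 2 when changes are pending (drift), and 1 on error. A sketch of the nightly job's decision logic; the helper and the send_alert hook are mine:

```shell
# classify_plan_exit CODE: map `terraform plan -detailed-exitcode` results
# to an outcome for the nightly drift job (helper name is illustrative)
classify_plan_exit() {
  case "$1" in
    0) echo "clean" ;;   # plan succeeded, no changes
    2) echo "drift" ;;   # plan succeeded, changes pending: alert
    *) echo "error" ;;   # plan itself failed: page the on-call
  esac
}

# Nightly job sketch (send_alert is a hypothetical notification hook):
# terraform plan -detailed-exitcode -no-color > plan.out 2>&1
# [ "$(classify_plan_exit $?)" = "drift" ] && send_alert plan.out
```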
Manual IAM changes in the console happen under pressure — someone needs access now, the change-management process feels slow, they fix it directly. We don't prohibit this in an emergency, but we require that the Terraform state be updated within 24 hours and that the change be reviewed after the fact. That catches both the cases where the emergency justifies the exception and the cases where "emergency" was used to skip a step that wasn't actually urgent.
The IAM hardening work took about three months from initial audit to clean Terraform state. The ongoing maintenance is lighter — the controls are in place, the weekly key age scan runs, and the org policies prevent the most common misconfiguration patterns. The three months were worth it.
