GKE Multi-Cluster Operations: Lessons from Production
Managing a single GKE cluster is straightforward. Managing four or more — production, staging, sandbox, and a dedicated build cluster — introduces coordination problems that individual cluster management doesn't prepare you for. Here's what I've learned.
Fleet Registration
A Google Cloud Fleet is the unit of multi-cluster management. Registering clusters with a fleet enables Config Sync, Policy Controller, and Multi-cluster Ingress. It's the prerequisite for everything else.
# Register an existing GKE cluster with the fleet
gcloud container fleet memberships register prod-us-central1 \
  --gke-cluster us-central1/prod-us-central1 \
  --enable-workload-identity \
  --project my-project
# View fleet membership
gcloud container fleet memberships list --project my-project
Fleet membership also registers the cluster in the Connect gateway, which lets you use kubectl through the Connect proxy without direct API server access — useful for clusters in private networks:
gcloud container fleet memberships get-credentials prod-us-central1 \
  --project my-project
Fleet management is at the GCP project level: each fleet is hosted in exactly one project. Clusters that live in other projects can be registered into the fleet host, but multi-project setups (common in large organizations) add IAM complexity, and you'll want the cross-project workload identity federation approach.
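Fleets can hold memberships for clusters owned by other projects; registration then takes a full GKE resource URI instead of the short location/name form. A sketch with hypothetical project names (cross-project IAM must already be in place):

```shell
# Hypothetical projects: "fleet-host-project" hosts the fleet,
# "workload-project" owns the cluster being registered.
gcloud container fleet memberships register workload-prod \
  --gke-uri=https://container.googleapis.com/v1/projects/workload-project/locations/us-central1/clusters/prod-us-central1 \
  --enable-workload-identity \
  --project fleet-host-project
```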
Config Sync
Config Sync (formerly Anthos Config Management) syncs Kubernetes configuration from a git repository to all fleet clusters. Think of it as Flux/ArgoCD but managed by Google and integrated with the fleet layer.
Enable it via the fleet feature:
gcloud container fleet config-management enable --project my-project
Then configure it per-cluster with a ConfigManagement custom resource:
apiVersion: configmanagement.gke.io/v1
kind: ConfigManagement
metadata:
  name: config-management
spec:
  clusterName: prod-us-central1
  git:
    syncRepo: https://github.com/myorg/cluster-config
    syncBranch: main
    secretType: token
    policyDir: clusters/prod-us-central1
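After applying the resource, you can verify that clusters are actually in sync, either fleet-wide from gcloud or per-cluster with the nomos CLI. A sketch (context name matches the kubeconfig convention used later in this post):

```shell
# Fleet-wide Config Sync status: one row per registered cluster
gcloud container fleet config-management status --project my-project

# Per-cluster detail with the nomos CLI that ships with Config Sync tooling
nomos status --contexts gke_my-project_us-central1-a_prod-us-central1
```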
I structure the repo so each cluster has its own directory for cluster-specific config, plus a common/ directory for things every cluster gets:
cluster-config/
  common/
    namespaces/
    rbac/
    network-policies/
  clusters/
    prod-us-central1/
    staging-us-central1/
    sandbox-us-east1/
    build-us-central1/
Config Sync handles the namespace definitions, RBAC, and network policies that should be consistent across clusters. It doesn't replace ArgoCD for application deployments — Config Sync manages cluster-level config, ArgoCD manages application workloads.
The biggest benefit: new clusters start with a known-good configuration state automatically. Register a new cluster, point it at your git repo, and within minutes it has the correct RBAC, namespaces, and network policies without manual intervention.
Policy Controller
Policy Controller is managed OPA Gatekeeper. It enforces policies across all fleet clusters, blocking resource creation that violates your rules.
Enable via fleet:
gcloud container fleet policycontroller enable --project my-project
Policies come from constraint templates and constraints. Some examples we run:
Require resource limits on all containers:
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredResources
metadata:
  name: require-resource-limits
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    excludedNamespaces: ["kube-system", "velero"]
  parameters:
    limits: ["cpu", "memory"]
    requests: ["cpu", "memory"]
Block privileged pods:
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sPSPPrivilegedContainer
metadata:
  name: no-privileged-containers
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    excludedNamespaces: ["kube-system"]
Enforce image registry allowlisting (only pull from our internal registry or trusted public registries):
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedRepos
metadata:
  name: allowed-image-registries
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
  parameters:
    repos:
      - "us-central1-docker.pkg.dev/my-project/"
      - "gcr.io/distroless/"
      - "registry.k8s.io/"
Policy Controller constraints support a dryrun enforcement action (spec.enforcementAction: dryrun). Run new constraints in dryrun first to audit violations across your fleet before turning on enforcement — you'll almost always find violations in kube-system or monitoring namespaces that need exceptions.
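While a constraint is in dryrun, Gatekeeper's audit loop records violations in the constraint's status, which you can read with kubectl. A sketch, using the constraint names from the examples above:

```shell
# All constraints with their audited violation totals
kubectl get constraints

# Total violations recorded on one constraint while in dryrun
kubectl get k8srequiredresources require-resource-limits \
  -o jsonpath='{.status.totalViolations}'
```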
Multi-Cluster Ingress and Gateway API
Multi-cluster Ingress (now superseded by Gateway API) provides a single anycast IP that routes to backends across multiple clusters. Google's load balancer handles geographic routing and failover.
Enable the fleet feature:
gcloud container fleet ingress enable \
  --config-membership=prod-us-central1 \
  --project my-project
In your config cluster, a MultiClusterIngress and MultiClusterService (or, with the newer Gateway API, a Gateway using the gke-l7-global-external-managed-mc GatewayClass) create the global load balancer and register backends across clusters. Traffic hits the anycast IP, and Google routes it to the nearest healthy backend.
The practical benefit: blue/green across regions, gradual traffic migration, and automatic failover if a cluster becomes unhealthy — all without DNS changes or client-side load balancing logic.
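If you take the Gateway API route, the Gateway controller has to be enabled on the config cluster, and multi-cluster Services on the fleet so backends in other clusters can be exported. A sketch, assuming the regional prod cluster from earlier:

```shell
# Enable the GKE Gateway controller (installs Gateway API CRDs + controller)
gcloud container clusters update prod-us-central1 \
  --gateway-api=standard \
  --region us-central1

# Enable multi-cluster Services for cross-cluster backend export
gcloud container fleet multi-cluster-services enable --project my-project
```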
Cluster Upgrades
GKE managed upgrades with release channels are the right approach for most teams. You pick a channel:
- Rapid: 6-8 weeks behind upstream Kubernetes. Get features faster, but this is where Google finds bugs.
- Regular: 2-4 months behind upstream. The right balance for production.
- Stable: 5-6 months behind. Suitable if you have compliance requirements or need maximum predictability.
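Channel selection is a per-cluster setting. A sketch of pinning an existing cluster to Regular (zonal cluster assumed):

```shell
# Move an existing cluster onto the Regular release channel
gcloud container clusters update prod-us-central1 \
  --release-channel regular \
  --zone us-central1-a
```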
Set a maintenance window to avoid upgrades during business hours:
# Terraform
resource "google_container_cluster" "prod" {
maintenance_policy {
recurring_window {
start_time = "2022-01-01T02:00:00Z"
end_time = "2022-01-01T06:00:00Z"
recurrence = "FREQ=WEEKLY;BYDAY=SA,SU"
}
}
}
GKE upgrades control plane first, then nodes. Nodes upgrade one at a time (or in configurable surge batches). The upgrade respects PodDisruptionBudgets — if your PDB prevents eviction, the upgrade will stall. Check your PDBs before upgrades.
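A quick pre-upgrade check for PDBs that would stall a node drain — a sketch:

```shell
# Review allowed disruptions across all namespaces
kubectl get pdb --all-namespaces

# Just the PDBs that currently permit zero disruptions (these block drains)
kubectl get pdb --all-namespaces \
  -o jsonpath='{range .items[?(@.status.disruptionsAllowed==0)]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}'
```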
If something goes wrong post-upgrade, GKE allows rolling node pools back to the previous version within a window. I've used this once. The command is: gcloud container node-pools rollback POOL_NAME --cluster CLUSTER_NAME --zone ZONE. It rolls back only node pools, not the control plane.
Upgrade staging and sandbox first. Always. Even if you're on Regular channel, things break in your specific configuration.
Monitoring Across Clusters
Cloud Monitoring with Managed Prometheus is my current setup. Every GKE cluster has cluster and location as automatically-attached labels. A single Prometheus query or Monitoring dashboard can aggregate metrics across all clusters.
Enable Managed Prometheus (GKE 1.23+):
gcloud container clusters update my-cluster \
  --enable-managed-prometheus \
  --zone us-central1-a
Then a PodMonitoring resource targets your application:
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: api-server
spec:
  selector:
    matchLabels:
      app: api-server
  endpoints:
  - port: metrics
    interval: 30s
For logs, Cloud Logging auto-collects from all GKE clusters. Filter by cluster:
resource.type="k8s_container"
resource.labels.cluster_name="prod-us-central1"
severity>=ERROR
A single log explorer query can span clusters by omitting the cluster filter. Combine this with log-based alerting and you have cluster-agnostic alerting on error rates.
Console vs kubectl
For one-off operations and exploration, the Cloud Console is fine. The GKE cluster list, workload viewer, and fleet management UI are genuinely useful. For anything automated — upgrade pipelines, config deployment, fleet-wide operations — use gcloud and kubectl with explicit kubeconfig contexts.
Maintain a kubeconfig that has all your cluster contexts:
gcloud container clusters get-credentials prod-us-central1 \
  --zone us-central1-a --project my-project
# Results in context: gke_my-project_us-central1-a_prod-us-central1
kubectl config use-context gke_my-project_us-central1-a_prod-us-central1
For fleet-wide kubectl operations, I write shell scripts that iterate over a list of known contexts. Tooling like kubectx helps with context switching. For larger fleets, look at kubectl plugins like kube-multi-exec for running the same command across multiple clusters at once.
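A minimal sketch of such an iteration script — the context list here is an assumption; substitute your own:

```shell
#!/usr/bin/env bash
# Run the same kubectl command against every cluster context in the list.
set -u

CONTEXTS=(
  gke_my-project_us-central1-a_prod-us-central1
  gke_my-project_us-central1-a_staging-us-central1
)

# run_all get nodes   -> runs "kubectl get nodes" on every context
run_all() {
  local ctx
  for ctx in "${CONTEXTS[@]}"; do
    echo "=== ${ctx} ==="
    kubectl --context "${ctx}" "$@"
  done
}
```

Source the file and call `run_all get nodes`; note that a failing cluster stops the loop unless you append `|| true` to the kubectl invocation.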
The fleet layer reduces the operational burden significantly compared to managing clusters independently, but it doesn't eliminate per-cluster knowledge. You still need to understand what's running in each cluster and why upgrades or policy changes might affect it differently. Fleet management scales the operational model; it doesn't abstract away the Kubernetes fundamentals.