Dynamic Jenkins Agents on Kubernetes

The Jenkins setup I inherited at IBM was the kind that accumulates over years: a set of EC2 instances registered as static agents, each one a special snowflake with its own installed tools, its own maintenance history, and its own way of breaking. Agents sat idle most of the time, burned money, and fell out of sync with each other. When a build failed because "the agent doesn't have Go 1.13 installed," you'd SSH into the instance and install it, and maybe do the same on the other three agents, and then forget to document it. Months later someone would add a fifth agent and the problem would repeat.

The Kubernetes plugin solved this by making build agents ephemeral. A build starts, a pod spins up, the build runs, the pod terminates. The agent is defined as code in the Jenkinsfile. Every build gets a clean environment. No drift, no snowflakes.

The Problem with Static Agents

Static agents have a few failure modes that are all predictable but annoying:

Tool sprawl. Every team needs something slightly different installed. Over time the agent accumulates every tool any build has ever needed, pinned to whatever version was current when it was installed.

Executor waste. Static agents have a fixed executor count. Idle executors are paid-for capacity doing nothing. Busy executors queue your build behind someone else's.

State pollution. A build that leaves artifacts, modified files, or running processes on the agent affects subsequent builds on the same agent. The result is intermittent failures that are nearly impossible to reproduce.

How the Kubernetes Plugin Works

The Jenkins master runs somewhere persistent — in our case, an EKS pod with a persistent volume for the Jenkins home directory. When a build starts, the Kubernetes plugin creates a pod in the configured namespace. That pod contains at minimum a JNLP container (more on that shortly) and whatever build containers you've defined. The JNLP container connects back to the Jenkins master, registers as an agent, and the build runs. When the build finishes, the pod is deleted.

The pod definition lives in the Jenkinsfile. The agent is code, reviewed alongside the build code, versioned with the project.

Configuring the Cloud in Jenkins

Before the Jenkinsfile side works, you configure the Kubernetes cloud in Jenkins: Manage Jenkins → Manage Nodes and Clouds → Configure Clouds → Add a new cloud → Kubernetes.

Key fields:
- Kubernetes URL: the API server endpoint. If Jenkins is running inside the cluster, leave this blank and it uses the in-cluster config.
- Kubernetes Namespace: the namespace where build pods will be created.
- Jenkins URL: how the JNLP agent in the pod will reach the Jenkins master. Must be reachable from inside the cluster — use the Kubernetes service DNS name, not localhost.
- Credentials: a kubeconfig credential or a service account token.
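
As a rough illustration of the Jenkins URL field, the sketch below derives an in-cluster URL from a Service. The function name is hypothetical, and it assumes the Service's first port is the Jenkins web port and that the cluster uses the default cluster.local domain.

```shell
# Hypothetical helper: print the in-cluster URL for a Jenkins Service.
# Usage: jenkins_cluster_url <namespace> <service-name>
jenkins_cluster_url() {
    # Assumption: the first port listed on the Service is the Jenkins web port.
    port="$(kubectl get svc "$2" -n "$1" -o jsonpath='{.spec.ports[0].port}')" || return 1
    # Assumption: the cluster DNS domain is the default cluster.local.
    echo "http://$2.$1.svc.cluster.local:$port"
}
```

Calling jenkins_cluster_url jenkins jenkins for a Service exposing port 8080 would print http://jenkins.jenkins.svc.cluster.local:8080, which is the kind of value the Jenkins URL field expects.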

The Jenkinsfile Pod Template

pipeline {
    agent {
        kubernetes {
            yaml """
apiVersion: v1
kind: Pod
metadata:
  labels:
    build: key-manager
spec:
  serviceAccountName: jenkins-build
  containers:
    - name: jnlp
      image: jenkins/inbound-agent:4.3-4
      resources:
        requests:
          cpu: 100m
          memory: 256Mi
    - name: golang
      image: golang:1.13-alpine
      command: ['cat']
      tty: true
      resources:
        requests:
          cpu: 500m
          memory: 512Mi
        limits:
          cpu: 2
          memory: 2Gi
      volumeMounts:
        - name: go-module-cache
          mountPath: /go/pkg/mod
    - name: docker
      # dind runs its own Docker daemon inside the pod, so it needs privileged
      # mode; the host's /var/run/docker.sock must not be mounted over it.
      image: docker:19.03-dind
      securityContext:
        privileged: true
  volumes:
    - name: go-module-cache
      persistentVolumeClaim:
        claimName: go-module-cache-pvc
"""
        }
    }
    stages {
        stage('Test') {
            steps {
                container('golang') {
                    sh 'go test ./...'
                }
            }
        }
        stage('Build Image') {
            steps {
                container('docker') {
                    sh 'docker build -t key-manager:${BUILD_NUMBER} .'
                }
            }
        }
    }
}

The container('golang') step runs the sh commands inside its block in the golang container. Without it, commands run in the jnlp container, which has a minimal environment.

The JNLP Container: Do Not Mess With It

This is the most common mistake I see with the Kubernetes plugin. The jnlp container is how the pod connects back to the Jenkins master. It's not optional. If your pod template has no container named jnlp, the plugin adds a default one automatically. If you define a container with that name yourself, yours is used in place of the default, so it has to behave like the inbound agent.

Problems arise when:

- You define a container named jnlp with an image that isn't the Jenkins inbound agent. The plugin treats whichever container is named jnlp as the agent connector. If that container is your application, it won't speak the JNLP protocol and the agent will never connect.
- You set a custom command or args on the jnlp container that overrides the entrypoint. The inbound agent image has a specific entrypoint that handles the connection to the master. Overriding it breaks the connection.
- You forget that the jnlp container needs network access to the Jenkins master on both the web port and the JNLP port (default 50000). If your network policy doesn't allow this, the agent will spin up and then immediately fail to connect.

When an agent fails to connect, the symptom is that the pod starts and appears healthy in Kubernetes, but Jenkins never picks it up as an available executor. The build hangs in the queue with "waiting for agent."
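
One way to rule the network in or out is to probe both ports from a throwaway pod in the build namespace. This is a minimal sketch, not a definitive check: the function name, the pod name jnlp-port-check, the busybox image, and the 8080 web port default are all assumptions.

```shell
# Hypothetical helper: probe the Jenkins web and agent ports from inside the cluster.
# Usage: check_agent_ports <namespace> <jenkins-host> [web-port] [agent-port]
check_agent_ports() {
    ns="$1"; host="$2"; web_port="${3:-8080}"; agent_port="${4:-50000}"
    # Throwaway pod; --rm deletes it once the probe exits.
    kubectl run jnlp-port-check -n "$ns" --rm -i --restart=Never \
        --image=busybox:1.36 -- \
        sh -c "nc -z -w 5 $host $web_port && nc -z -w 5 $host $agent_port && echo reachable"
}
```

If your network policy selects pods by label, the probe pod only sees the same rules as a build pod when it carries the same labels, so add a matching --labels flag to the kubectl run call in that case.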

Debugging Failed Agents

When an agent pod isn't connecting:

# Watch pods as they come and go
kubectl get pods -n jenkins --watch

# Once you see the stuck pod, look at the JNLP container logs
kubectl logs -n jenkins <pod-name> -c jnlp

# Common error messages:
# "Failed to connect to Jenkins master" — network policy or wrong Jenkins URL
# "JNLP secret mismatch" — the agent was re-registered, old secret is stale
# Container in CrashLoopBackOff — usually the jnlp container command was overridden

The Jenkins master logs also show the agent connection attempts. Manage Jenkins → System Log → All Jenkins Logs, filter for the pod name.
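
When the jnlp log alone doesn't explain the failure, the pod's events and the previous container attempt usually do. The helper below is a sketch that bundles those lookups; its name is hypothetical, and it assumes the agent container is still called jnlp.

```shell
# Hypothetical helper: collect what Kubernetes knows about a stuck agent pod.
# Usage: agent_diagnostics <namespace> <pod-name>
agent_diagnostics() {
    ns="$1"; pod="$2"
    # Scheduling, image pull, and volume mount problems show up here.
    kubectl describe pod -n "$ns" "$pod"
    # The same events, oldest first, filtered to this pod.
    kubectl get events -n "$ns" \
        --field-selector "involvedObject.name=$pod" \
        --sort-by=.lastTimestamp
    # For a crash-looping jnlp container, the previous attempt's log holds the error.
    # It fails harmlessly if the container has never restarted.
    kubectl logs -n "$ns" "$pod" -c jnlp --previous || true
}
```

Build pods are deleted quickly, so run it while the pod from the --watch output still exists, for example agent_diagnostics jenkins <pod-name>.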

RBAC for the Build Service Account

The service account that Jenkins uses to talk to the Kubernetes API needs permission to manage pods in the build namespace. Minimal required permissions:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: jenkins-build
  namespace: jenkins
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["create", "delete", "get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get", "list"]
  - apiGroups: [""]
    resources: ["pods/exec"]
    # exec over WebSocket is authorized as "get", over SPDY as "create"
    verbs: ["create", "get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: jenkins-build
  namespace: jenkins
subjects:
  - kind: ServiceAccount
    name: jenkins-build
    namespace: jenkins
roleRef:
  kind: Role
  name: jenkins-build
  apiGroup: rbac.authorization.k8s.io

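The jenkins-build service account bound here is the same one named in serviceAccountName in the pod template. The permissions matter for whichever account the Jenkins master uses to call the Kubernetes API, so if that is a different account, bind the Role to it as well.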
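
To confirm the binding took effect, you can ask the API server directly with kubectl auth can-i, impersonating the service account. The sketch below loops over the permissions in the Role above; the function name is hypothetical, and running it needs a user who is allowed to impersonate service accounts.

```shell
# Hypothetical helper: ask the API server whether a service account holds
# each permission the Role is meant to grant.
# Usage: check_build_rbac <namespace> <service-account>
check_build_rbac() {
    ns="$1"; sa="system:serviceaccount:$1:$2"
    missing=0
    for verb in create delete get list watch; do
        # can-i exits non-zero when the action is not allowed.
        kubectl auth can-i "$verb" pods -n "$ns" --as="$sa" >/dev/null \
            || { echo "missing: $verb pods"; missing=1; }
    done
    kubectl auth can-i get pods --subresource=log -n "$ns" --as="$sa" >/dev/null \
        || { echo "missing: get pods/log"; missing=1; }
    kubectl auth can-i create pods --subresource=exec -n "$ns" --as="$sa" >/dev/null \
        || { echo "missing: create pods/exec"; missing=1; }
    return "$missing"
}
```

Running check_build_rbac jenkins jenkins-build prints one line per missing permission and returns non-zero if any are missing, which makes it easy to drop into a setup script.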

Caching with Persistent Volumes

Without caching, every build re-downloads Go modules, npm packages, or Maven dependencies from the internet. On a Go project with 40 dependencies, that's 30 extra seconds per build, and it means your builds fail if the module proxy is having a bad day.

The solution is a PVC mounted into the build container at the cache directory. For Go modules:

volumes:
  - name: go-module-cache
    persistentVolumeClaim:
      claimName: go-module-cache-pvc

# Create the PVC once
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: go-module-cache-pvc
  namespace: jenkins
spec:
  accessModes: [ReadWriteMany]
  storageClassName: efs-sc  # or any RWX-capable storage class
  resources:
    requests:
      storage: 10Gi
EOF

ReadWriteMany is required because multiple build pods may be running simultaneously and all need to read from (and potentially write to) the cache. A ReadWriteOnce volume can be mounted by only one node at a time, so build pods scheduled onto any other node will fail to mount it and never start.
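
Since the access mode is easy to get wrong and the failure only shows up once builds land on a second node, a quick check after creating the PVC can save some confusion. This is a sketch with a hypothetical function name; it reads the modes requested in the PVC spec.

```shell
# Hypothetical helper: succeed only if the PVC requests ReadWriteMany.
# Usage: pvc_is_rwx <namespace> <pvc-name>
pvc_is_rwx() {
    # jsonpath prints the access modes as a space-separated list.
    modes="$(kubectl get pvc "$2" -n "$1" -o jsonpath='{.spec.accessModes[*]}')" || return 2
    case " $modes " in
        *" ReadWriteMany "*) return 0 ;;
        *) echo "PVC $2 has access modes: $modes" >&2; return 1 ;;
    esac
}
```

For the PVC above, pvc_is_rwx jenkins go-module-cache-pvc should return zero; a non-zero result with a ReadWriteOnce message means the claim needs to be recreated against an RWX-capable storage class.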

What This Replaced

After we rolled this out, the static agent fleet went away. Build environments were defined in Jenkinsfiles alongside the code they built. New tool versions were a one-line image change in the Jenkinsfile, reviewable in a pull request. Build isolation meant the flaky "works on my agent" failures went away. The operational burden of maintaining agent VMs disappeared.

The main investment is upfront: configuring the Kubernetes cloud, getting the RBAC right, setting up the PVC storage class. Once it's running, it's largely self-maintaining in a way that a fleet of EC2 instances never was.