February 18, 2020 Marie H.

Writing Kubernetes Controllers in Go with controller-runtime

Photo by Martin Woortman on Unsplash

Custom controllers are how you extend Kubernetes to actually manage your things — not just Pods and Deployments, but your own abstractions. I've written a few now for our platform team at IBM and the pattern is consistent enough that I want to write it down clearly.

This post assumes you know what a CRD is and have at least read the Kubernetes docs on controllers. I'm going to focus on the implementation, not the theory.

The Reconciliation Loop

The core idea is simple: your controller watches a resource, and whenever something changes, it gets called with a request. Your job is to make the world match what the resource says it should look like. That function is the reconciler.

Observe. Compare. Act. Repeat.

You don't get a "what changed" diff. You get the current state and you figure out what to do. This is intentional — it means your controller can also handle external drift, not just API-triggered changes.

The key mental shift from imperative code: you're not handling events, you're enforcing desired state. If someone manually deletes a child resource your controller manages, the next reconcile will recreate it. That's correct behavior, not a bug.

controller-runtime vs client-go Directly

client-go is the low-level library. It's powerful and gives you complete control. It's also a lot of boilerplate — you're wiring up informers, listers, work queues, and event handlers yourself.

controller-runtime is the abstraction layer that kubebuilder uses. It handles the informer and queue machinery for you and exposes a clean Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) interface. For 90% of controller use cases, this is what you should be using.

The tradeoff: controller-runtime is opinionated. If you have unusual performance requirements or need fine-grained control over caching, you might need to drop down to client-go. For our encryption service controllers, controller-runtime has been more than sufficient.

kubebuilder Scaffolding

Install kubebuilder and scaffold a new project:

kubebuilder init --domain ibm.example.com --repo github.com/ibm-internal/keyservice-operator
kubebuilder create api --group crypto --version v1alpha1 --kind KeyPolicy

This generates your CRD type, controller skeleton, and the wiring to connect them. Don't fight the scaffolding — it's opinionated for good reasons and you can always add to it.

The generated api/v1alpha1/keypolicy_types.go is where you define your spec and status:

type KeyPolicySpec struct {
    Algorithm  string `json:"algorithm"`
    KeyLength  int    `json:"keyLength"`
    RotateDays int    `json:"rotateDays,omitempty"`
}

type KeyPolicyStatus struct {
    Phase              string       `json:"phase,omitempty"`
    LastRotatedAt      *metav1.Time `json:"lastRotatedAt,omitempty"`
    ObservedGeneration int64        `json:"observedGeneration,omitempty"`
}

Note that LastRotatedAt is a *metav1.Time: omitempty has no effect on struct values, so optional timestamps should be pointers, per the Kubernetes API conventions.

Always include ObservedGeneration in status. It lets you tell whether the status reflects the current spec or a previous one.

A Minimal Reconciler

Here's a stripped-down but complete reconciler that watches KeyPolicy objects:

package controllers

import (
    "context"
    "fmt"
    "time"

    "k8s.io/apimachinery/pkg/api/errors"
    "k8s.io/apimachinery/pkg/runtime"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/log"

    cryptov1alpha1 "github.com/ibm-internal/keyservice-operator/api/v1alpha1"
)

type KeyPolicyReconciler struct {
    client.Client
    Scheme *runtime.Scheme
}

func (r *KeyPolicyReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    log := log.FromContext(ctx)

    policy := &cryptov1alpha1.KeyPolicy{}
    if err := r.Get(ctx, req.NamespacedName, policy); err != nil {
        if errors.IsNotFound(err) {
            // Object was deleted before we got to it. Nothing to do.
            return ctrl.Result{}, nil
        }
        return ctrl.Result{}, fmt.Errorf("fetching KeyPolicy: %w", err)
    }

    // Check if deletion is in progress
    if !policy.DeletionTimestamp.IsZero() {
        return r.handleDeletion(ctx, policy)
    }

    // Ensure our finalizer is registered
    if err := r.ensureFinalizer(ctx, policy); err != nil {
        return ctrl.Result{}, err
    }

    // Reconcile the actual state
    if err := r.reconcileKeyMaterial(ctx, policy); err != nil {
        log.Error(err, "failed to reconcile key material")
        return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
    }

    // Update status
    policy.Status.Phase = "Active"
    policy.Status.ObservedGeneration = policy.Generation
    if err := r.Status().Update(ctx, policy); err != nil {
        return ctrl.Result{}, fmt.Errorf("updating status: %w", err)
    }

    // Schedule the next rotation check. RotateDays is optional: if it is zero,
    // RequeueAfter is zero, which means no periodic requeue is scheduled.
    return ctrl.Result{RequeueAfter: time.Duration(policy.Spec.RotateDays) * 24 * time.Hour}, nil
}

A few things to note here. I'm using fmt.Errorf with %w to wrap errors: this preserves the error chain, so logs show the full context and callers can still match the underlying error with errors.Is and errors.As. I return ctrl.Result{}, nil on NotFound because the object is already gone, not because something went wrong. And I distinguish between errors worth retrying immediately (return the error, and controller-runtime requeues with exponential backoff) and errors I want to rate-limit myself (return RequeueAfter with a nil error).

The Finalizer Pattern

Finalizers are how you do cleanup before an object is actually deleted. Without a finalizer, the object disappears and you never get a chance to clean up external resources — KMS keys, certificates, whatever your controller manages.

const finalizerName = "crypto.ibm.example.com/cleanup"

func (r *KeyPolicyReconciler) ensureFinalizer(ctx context.Context, policy *cryptov1alpha1.KeyPolicy) error {
    if containsString(policy.Finalizers, finalizerName) {
        return nil
    }
    policy.Finalizers = append(policy.Finalizers, finalizerName)
    return r.Update(ctx, policy)
}

func (r *KeyPolicyReconciler) handleDeletion(ctx context.Context, policy *cryptov1alpha1.KeyPolicy) (ctrl.Result, error) {
    if !containsString(policy.Finalizers, finalizerName) {
        return ctrl.Result{}, nil
    }

    // Do your external cleanup here
    if err := r.cleanupKeyMaterial(ctx, policy); err != nil {
        return ctrl.Result{}, fmt.Errorf("cleanup failed: %w", err)
    }

    // Remove our finalizer so Kubernetes can proceed with deletion
    policy.Finalizers = removeString(policy.Finalizers, finalizerName)
    return ctrl.Result{}, r.Update(ctx, policy)
}

Important: if cleanupKeyMaterial is idempotent (it should be), you can safely retry on failure. If cleanup fails and you return an error, the controller will requeue and try again. The object won't be deleted until all finalizers are removed.

Requeueing Strategy

ctrl.Result{} with no fields means "don't requeue unless the watch fires again." ctrl.Result{Requeue: true} requeues immediately. ctrl.Result{RequeueAfter: duration} is what you want for most scheduled work.

For our key rotation use case, I requeue after RotateDays * 24h. The controller wakes up, checks whether rotation is actually due, and either rotates or reschedules. This is better than a separate cron job because the controller naturally handles restarts — if it crashes and comes back up, all the requeues are rebuilt from the existing objects.

One thing to be careful about: RequeueAfter does not guarantee exact timing. If your controller is under load, it might fire late. Design for this — check actual timestamps, don't assume the requeue happened exactly when you scheduled it.

Wiring It Up

In main.go, register your controller with the manager:

if err = (&controllers.KeyPolicyReconciler{
    Client: mgr.GetClient(),
    Scheme: mgr.GetScheme(),
}).SetupWithManager(mgr); err != nil {
    setupLog.Error(err, "unable to create controller", "controller", "KeyPolicy")
    os.Exit(1)
}

And SetupWithManager in your controller:

func (r *KeyPolicyReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&cryptov1alpha1.KeyPolicy{}).
        Owns(&corev1.Secret{}).
        Complete(r)
}

Owns tells the manager that if a Secret owned by your KeyPolicy changes, trigger a reconcile on the parent KeyPolicy. This is how you handle child resource drift without writing custom watch logic.

Updated March 2026: kubebuilder v4 introduced several breaking changes: the project layout changed, controller-gen markers were updated, and a newer Go toolchain is required. The reconciler interface itself is stable, but if you're scaffolding a new project, use kubebuilder v4 and expect the generated boilerplate to look somewhat different from what's shown here. The controllerutil.AddFinalizer and controllerutil.RemoveFinalizer helpers in recent controller-runtime versions replace the manual string slice manipulation shown above; use those instead.

Testing with Ginkgo, Gomega, and Counterfeiter

We write controller tests in Ginkgo + Gomega — the BDD framework pair kubebuilder defaults to. The nested Describe/It structure maps cleanly to "given this resource state, when we reconcile, expect this outcome."

The fake client from sigs.k8s.io/controller-runtime/pkg/client/fake handles the Kubernetes API side without a running cluster:

var _ = Describe("KeyPolicy controller", func() {
    var (
        reconciler *KeyPolicyReconciler
        fakeClient client.Client
        ctx        context.Context
    )

    BeforeEach(func() {
        ctx = context.Background()
        scheme := runtime.NewScheme()
        Expect(cryptov1alpha1.AddToScheme(scheme)).To(Succeed())
        Expect(corev1.AddToScheme(scheme)).To(Succeed())

        // On recent controller-runtime versions the fake client needs
        // WithStatusSubresource for Status().Update to work:
        fakeClient = fake.NewClientBuilder().
            WithScheme(scheme).
            WithStatusSubresource(&cryptov1alpha1.KeyPolicy{}).
            Build()
        reconciler = &KeyPolicyReconciler{Client: fakeClient, Scheme: scheme}
    })

    It("should add a finalizer and set status to Active", func() {
        policy := &cryptov1alpha1.KeyPolicy{
            ObjectMeta: metav1.ObjectMeta{Name: "test-policy", Namespace: "default"},
            Spec:       cryptov1alpha1.KeyPolicySpec{Algorithm: "AES", KeyLength: 256, RotateDays: 90},
        }
        Expect(fakeClient.Create(ctx, policy)).To(Succeed())

        result, err := reconciler.Reconcile(ctx, ctrl.Request{
            NamespacedName: types.NamespacedName{Name: "test-policy", Namespace: "default"},
        })
        Expect(err).NotTo(HaveOccurred())
        Expect(result.RequeueAfter).To(Equal(90 * 24 * time.Hour))

        updated := &cryptov1alpha1.KeyPolicy{}
        Expect(fakeClient.Get(ctx, types.NamespacedName{Name: "test-policy", Namespace: "default"}, updated)).To(Succeed())
        Expect(updated.Status.Phase).To(Equal("Active"))
        Expect(updated.Finalizers).To(ContainElement("crypto.ibm.example.com/cleanup"))
    })
})

For external dependencies — the KMS wrap/unwrap calls, any storage operations — we use counterfeiter to generate fakes from interfaces. Define a thin interface around the external call and go generate produces a complete fake with call recording, argument capture, and configurable return values:

//go:generate go run github.com/maxbrunsfeld/counterfeiter/v6 . KeyMaterialStore

type KeyMaterialStore interface {
    StoreWrappedKey(ctx context.Context, keyID string, wrapped []byte) error
    GetWrappedKey(ctx context.Context, keyID string) ([]byte, error)
    DeleteKey(ctx context.Context, keyID string) error
}

Running go generate ./... produces fakes/fake_key_material_store.go. In tests:

fakeStore := &fakes.FakeKeyMaterialStore{}
fakeStore.GetWrappedKeyReturns([]byte("wrapped-dek"), nil)

reconciler.KeyStore = fakeStore

// after reconcile
Expect(fakeStore.StoreWrappedKeyCallCount()).To(Equal(1))
_, keyID, _ := fakeStore.StoreWrappedKeyArgsForCall(0)
Expect(keyID).NotTo(BeEmpty())

The generated fake is type-safe and tied to the interface. When the interface changes, the fake fails to compile until you regenerate — you can't forget to update the mock.

For integration scenarios that need real Kubernetes API machinery (webhook validation, status subresource behavior), use envtest. It's slower but catches things fake clients miss.

Controllers feel complex at first but the reconcile loop pattern is genuinely simple once it clicks. The complexity lives in your domain logic, not the framework.