Automating Key Rotation Workflows in Go

Manual key rotation is a liability. I've seen it at multiple organizations: someone sets a calendar reminder to rotate the database encryption key every 90 days, the reminder fires on a Friday afternoon, someone decides to do it Monday, Monday becomes next week, and now you're out of compliance and don't know it. Even when people are diligent, manual rotation creates gaps in audit trails, introduces human error risk, and doesn't scale across dozens of keys.

The fix is automation. Key rotation should be a background process: detect expiring keys, rotate them, update what they protect, and log every step. Here's how I build this in Go.

Updated March 2026: AWS KMS now supports fully automatic key rotation on a configurable schedule (annual by default, configurable between 90 and 2560 days since April 2024) without requiring custom automation for the KMS key itself. This post's rotation worker pattern is still relevant for envelope encryption workflows where you also need to re-wrap data encryption keys stored in your own database, and for IBM Key Protect, which has a different rotation model. For pure KMS key rotation, check whether your cloud provider's native automatic rotation covers your compliance requirements before building a custom worker.

Why Manual Rotation Fails

Compliance frameworks (SOC 2, PCI-DSS, HIPAA) typically require key rotation on a defined schedule. The requirement isn't just that keys get rotated — it's that you can prove they were rotated, on time, with an audit trail. A calendar reminder produces neither proof nor a trail.

Beyond compliance, there's the operational risk. Key rotation involves multiple steps: create the new key, re-encrypt data with the new key, update all references, retire the old key. If any step fails partway through, you need a clean rollback path. Manual processes handle partial failures badly. Automated processes can be designed to be idempotent and to roll back cleanly.

Rotation Workflow Design

The workflow I use has five stages:

Detect: Find keys that are within a threshold of their rotation deadline
Generate: Create a new key version (or new key, depending on the KMS)
Re-encrypt: Wrap existing data encryption keys with the new root key
Update references: Commit the newly wrapped DEKs to the data store
Retire: Mark the old key version as inactive (don't delete immediately — you may need it for decryption until all data is migrated)

Each stage should be idempotent: running it twice should produce the same result as running it once. This is the safety property that lets you retry on failure without making things worse.
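To make the property concrete, here is a minimal sketch of an idempotent "update references" stage. It assumes a hypothetical in-memory map of `dekRow` records that track which root key version wrapped each DEK; the names `dekRow` and `commitRewrap` are illustrative, not part of any library.

```go
package main

import "fmt"

// dekRow is a hypothetical stored record: the wrapped DEK bytes plus the root
// key version that wrapped them.
type dekRow struct {
    RootVersion string
    Wrapped     []byte
}

// commitRewrap stores a re-wrapped DEK unless the row is already on
// newVersion. It reports whether it wrote anything, so a retried stage can
// tell "did the work" apart from "work was already done".
func commitRewrap(rows map[string]dekRow, dekID, newVersion string, wrapped []byte) (bool, error) {
    row, ok := rows[dekID]
    if !ok {
        return false, fmt.Errorf("unknown DEK %q", dekID)
    }
    if row.RootVersion == newVersion {
        return false, nil // already migrated: the retry is a no-op
    }
    rows[dekID] = dekRow{RootVersion: newVersion, Wrapped: wrapped}
    return true, nil
}

func main() {
    rows := map[string]dekRow{"dek-1": {RootVersion: "v1", Wrapped: []byte("old")}}
    for attempt := 1; attempt <= 2; attempt++ {
        wrote, err := commitRewrap(rows, "dek-1", "v2", []byte("new"))
        fmt.Println(attempt, wrote, err, rows["dek-1"].RootVersion)
    }
}
```

The second attempt leaves the row untouched, which is exactly what makes a blind retry safe.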

The Rotation Worker

The worker runs on a ticker and processes each key that needs rotation:

type RotationWorker struct {
    kmsClient     KMSClient
    store         KeyStore
    logger        *slog.Logger
    checkInterval time.Duration
    rotateWithin  time.Duration // rotate when this close to expiry
}

func (w *RotationWorker) Run(ctx context.Context) error {
    ticker := time.NewTicker(w.checkInterval)
    defer ticker.Stop()

    // Run immediately on start, then on each tick
    if err := w.runCycle(ctx); err != nil {
        w.logger.Error("rotation cycle failed", "error", err)
    }

    for {
        select {
        case <-ticker.C:
            if err := w.runCycle(ctx); err != nil {
                w.logger.Error("rotation cycle failed", "error", err)
                // Don't return — keep running, alert separately
            }
        case <-ctx.Done():
            return ctx.Err()
        }
    }
}

func (w *RotationWorker) runCycle(ctx context.Context) error {
    keys, err := w.store.ListKeys(ctx)
    if err != nil {
        return fmt.Errorf("listing keys: %w", err)
    }

    deadline := time.Now().Add(w.rotateWithin)
    for _, key := range keys {
        if key.ExpiresAt.Before(deadline) {
            if err := w.rotateKey(ctx, key); err != nil {
                // Log and continue — one failure shouldn't block all rotations
                w.logger.Error("key rotation failed",
                    "key_id", key.ID,
                    "error", err,
                )
                recordRotationFailure(key.ID, err)
                continue
            }
        }
    }
    return nil
}

I log rotation failures and continue rather than returning an error from the cycle. One bad key shouldn't prevent the remaining keys from being checked. The failure gets recorded for alerting separately.
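The worker depends on a few types this post doesn't show. The sketch below infers them from how the code in this post calls them, so the exact method sets and field types are assumptions (versions as strings, wrapped keys as raw bytes). It also includes a hypothetical `fakeKMS` test double, which is not real cryptography, for exercising the worker without a cloud account.

```go
package main

import (
    "bytes"
    "context"
    "fmt"
    "time"
)

// KeyRecord and DEKRecord hold the fields the worker reads (types assumed).
type KeyRecord struct {
    ID        string
    Version   string
    ExpiresAt time.Time
}

type DEKRecord struct {
    ID         string
    WrappedKey []byte
}

// KMSClient is the subset of KMS operations the worker calls.
type KMSClient interface {
    RotateKey(ctx context.Context, keyID string) (newVersion string, err error)
    Wrap(ctx context.Context, keyID string, plaintext []byte) ([]byte, error)
    Unwrap(ctx context.Context, keyID string, wrapped []byte) ([]byte, error)
}

// KeyStore persists key metadata and wrapped DEKs.
type KeyStore interface {
    ListKeys(ctx context.Context) ([]KeyRecord, error)
    ListWrappedDEKs(ctx context.Context, keyID, version string) ([]DEKRecord, error)
    UpdateWrappedDEK(ctx context.Context, dekID string, wrapped []byte) error
    UpdateKeyVersion(ctx context.Context, keyID, version string, expiresAt time.Time) error
}

// fakeKMS "wraps" by prefixing the plaintext with the current version tag so
// tests can see which version was used. Every version stays valid for Unwrap.
type fakeKMS struct{ version int }

var _ KMSClient = (*fakeKMS)(nil) // compile-time interface check

func (f *fakeKMS) tag(v int) []byte { return []byte(fmt.Sprintf("v%d:", v)) }

func (f *fakeKMS) RotateKey(_ context.Context, _ string) (string, error) {
    f.version++
    return fmt.Sprintf("v%d", f.version), nil
}

func (f *fakeKMS) Wrap(_ context.Context, _ string, plaintext []byte) ([]byte, error) {
    return append(f.tag(f.version), plaintext...), nil
}

func (f *fakeKMS) Unwrap(_ context.Context, _ string, wrapped []byte) ([]byte, error) {
    for v := 1; v <= f.version; v++ {
        if tag := f.tag(v); bytes.HasPrefix(wrapped, tag) {
            return append([]byte(nil), wrapped[len(tag):]...), nil
        }
    }
    return nil, fmt.Errorf("no key version can unwrap this ciphertext")
}

// demo wraps a DEK under v1, rotates, and checks that the old ciphertext
// still unwraps while a fresh wrap uses v2.
func demo() (recovered, rewrapped string, err error) {
    ctx := context.Background()
    kms := &fakeKMS{version: 1}
    old, err := kms.Wrap(ctx, "root-1", []byte("dek-bytes"))
    if err != nil {
        return "", "", err
    }
    if _, err = kms.RotateKey(ctx, "root-1"); err != nil {
        return "", "", err
    }
    plain, err := kms.Unwrap(ctx, "root-1", old)
    if err != nil {
        return "", "", err
    }
    fresh, err := kms.Wrap(ctx, "root-1", plain)
    return string(plain), string(fresh), err
}

func main() {
    fmt.Println(demo())
}
```

Keeping the interfaces this small means a real AWS KMS or IBM Key Protect adapter only has to implement three methods, and the worker's logic can be unit-tested against the fake.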

The Rotate Key Operation

func (w *RotationWorker) rotateKey(ctx context.Context, key KeyRecord) error {
    start := time.Now()
    w.logger.Info("starting key rotation",
        "key_id", key.ID,
        "current_version", key.Version,
        "expires_at", key.ExpiresAt,
    )

    // Step 1: Create new key version in KMS
    newVersion, err := w.kmsClient.RotateKey(ctx, key.ID)
    if err != nil {
        return fmt.Errorf("KMS rotate: %w", err)
    }
    w.logger.Info("new key version created",
        "key_id", key.ID,
        "new_version", newVersion,
    )

    // Step 2: Find all DEKs wrapped with the old root key version
    deks, err := w.store.ListWrappedDEKs(ctx, key.ID, key.Version)
    if err != nil {
        return fmt.Errorf("listing DEKs for key %s v%s: %w", key.ID, key.Version, err)
    }

    // Step 3: Re-wrap each DEK with the new key version
    for _, dek := range deks {
        if err := w.rewrapDEK(ctx, key.ID, dek); err != nil {
            return fmt.Errorf("re-wrapping DEK %s: %w", dek.ID, err)
        }
    }

    // Step 4: Update key record with new version and extended expiry
    newExpiry := time.Now().Add(90 * 24 * time.Hour)
    if err := w.store.UpdateKeyVersion(ctx, key.ID, newVersion, newExpiry); err != nil {
        return fmt.Errorf("updating key record: %w", err)
    }

    w.logger.Info("key rotation complete",
        "key_id", key.ID,
        "new_version", newVersion,
        "new_expiry", newExpiry,
        "deks_rewrapped", len(deks),
        "duration_ms", time.Since(start).Milliseconds(),
    )
    return nil
}

Re-wrapping a DEK

func (w *RotationWorker) rewrapDEK(ctx context.Context, rootKeyID string, dek DEKRecord) error {
    // Unwrap with the old key (KMS keeps old versions for unwrap)
    plaintext, err := w.kmsClient.Unwrap(ctx, rootKeyID, dek.WrappedKey)
    if err != nil {
        return fmt.Errorf("unwrap DEK %s: %w", dek.ID, err)
    }

    // Wrap with the current (new) key version
    newWrapped, err := w.kmsClient.Wrap(ctx, rootKeyID, plaintext)
    // Zero out plaintext as soon as it is no longer needed, on success or failure
    clear(plaintext)
    if err != nil {
        return fmt.Errorf("re-wrap DEK %s: %w", dek.ID, err)
    }

    // Persist the new wrapped DEK
    return w.store.UpdateWrappedDEK(ctx, dek.ID, newWrapped)
}

Zeroing the plaintext slice after use is basic hygiene — you don't want decrypted key material sitting in memory longer than necessary.
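One way to make the zeroing hard to forget is to scope the plaintext to a callback. This is a minimal sketch, assuming a hypothetical `withPlaintext` helper. It only clears the buffer it is given; any copies made elsewhere (for example by the callee) are not affected.

```go
package main

import (
    "errors"
    "fmt"
)

// withPlaintext passes key material to fn and zeroes the buffer afterwards,
// whether fn succeeds, fails, or panics. Hypothetical helper for illustration.
func withPlaintext(plaintext []byte, fn func([]byte) error) error {
    defer func() {
        for i := range plaintext {
            plaintext[i] = 0
        }
    }()
    return fn(plaintext)
}

func main() {
    key := []byte("super-secret-dek")
    err := withPlaintext(key, func(p []byte) error {
        fmt.Println("using", len(p), "bytes of key material")
        return errors.New("wrap failed") // simulate a failing KMS call
    })
    fmt.Println(err, key[0] == 0)
}
```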

Idempotency and Rollback

The rotation is designed to be safe to retry:

RotateKey on most KMS providers (including IBM Key Protect and AWS KMS) is safe to repeat in the sense that old key versions remain valid for unwrap after rotation. You can unwrap DEKs that were wrapped with an old version even after rotating.
UpdateWrappedDEK updates in place. If it's called twice with the same new wrapped value, that's fine.
If rotateKey fails partway through (say, after creating the new key version but before re-wrapping all DEKs), the key record still carries the old version and expiry, so the next cycle retries it. The DEKs that weren't re-wrapped are still wrapped with the old version, which is still valid. As written, the retry calls RotateKey again and creates one more key version; that is harmless for decryption, but record the pending version before re-wrapping if you want the retry to reuse it.
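Here is a minimal sketch of a retry that reuses the version created by a failed attempt instead of asking the KMS for a new one. It assumes a hypothetical `PendingVersion` field persisted on the key record; `createVersion` and `rewrapAll` are stand-ins for the KMS call and the DEK loop.

```go
package main

import (
    "errors"
    "fmt"
)

// keyState is a hypothetical persisted record. PendingVersion is set as soon
// as the KMS returns a new version and cleared only when rotation completes.
type keyState struct {
    Version        string
    PendingVersion string
}

// rotate resumes an interrupted rotation instead of starting a new one.
func rotate(k *keyState, createVersion func() (string, error), rewrapAll func(to string) error) error {
    if k.PendingVersion == "" {
        v, err := createVersion()
        if err != nil {
            return fmt.Errorf("create version: %w", err)
        }
        k.PendingVersion = v // in a real store, persist this before re-wrapping
    }
    if err := rewrapAll(k.PendingVersion); err != nil {
        return fmt.Errorf("re-wrap: %w", err) // PendingVersion survives for the retry
    }
    k.Version, k.PendingVersion = k.PendingVersion, ""
    return nil
}

// simulate runs rotate twice, failing the first re-wrap, and reports how many
// versions were created and the final active version.
func simulate() (created int, final string) {
    k := &keyState{Version: "v1"}
    create := func() (string, error) {
        created++
        return fmt.Sprintf("v%d", created+1), nil
    }
    failOnce := true
    rewrap := func(string) error {
        if failOnce {
            failOnce = false
            return errors.New("database unavailable")
        }
        return nil
    }
    _ = rotate(k, create, rewrap) // first attempt fails midway
    _ = rotate(k, create, rewrap) // retry resumes with the same pending version
    return created, k.Version
}

func main() {
    fmt.Println(simulate())
}
```

Only one version is created across both attempts, so the audit trail shows a single rotation rather than one per retry.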

The one thing to be careful about: don't retire or delete the old key version until you've verified all DEKs have been re-wrapped. I keep old versions in a "pending retirement" state for 7 days after rotation, then retire them. This gives time to catch any DEKs that were missed.

If you're operating under compliance requirements that mandate hardware-backed keys, the rotation workflow here applies equally to HSM-backed key management. The re-wrapping logic is the same; the KMS client just points at a PKCS#11 interface instead of a cloud API.
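The retirement rule (grace period elapsed, no DEKs left on the old version) can be written as one guard. This is a sketch, assuming a hypothetical `canRetire` helper and a count of DEKs still on the old version supplied by your store.

```go
package main

import (
    "fmt"
    "time"
)

// retirementGrace is the 7-day "pending retirement" window.
const retirementGrace = 7 * 24 * time.Hour

// canRetire reports whether an old key version may be retired: the grace
// period since rotation has elapsed AND no DEK is still wrapped with it.
func canRetire(rotatedAt, now time.Time, deksOnOldVersion int) bool {
    return deksOnOldVersion == 0 && now.Sub(rotatedAt) >= retirementGrace
}

// canRetireAfterDays is canRetire with the elapsed time given in whole days,
// which keeps experiments and tests deterministic.
func canRetireAfterDays(daysSinceRotation, deksOnOldVersion int) bool {
    rotatedAt := time.Date(2026, time.March, 1, 0, 0, 0, 0, time.UTC)
    now := rotatedAt.Add(time.Duration(daysSinceRotation) * 24 * time.Hour)
    return canRetire(rotatedAt, now, deksOnOldVersion)
}

func main() {
    fmt.Println(canRetireAfterDays(3, 0)) // still inside the grace period
    fmt.Println(canRetireAfterDays(8, 2)) // stragglers remain on the old version
    fmt.Println(canRetireAfterDays(8, 0)) // safe to retire
}
```

Requiring both conditions means a missed DEK blocks retirement indefinitely, which is the failure mode you want: a stuck retirement is an alert, a premature one is data loss.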

Audit Logging

Every rotation event goes to structured logs with enough context to satisfy an auditor:

func recordRotationEvent(logger *slog.Logger, keyID, fromVersion, toVersion string, deksCount int, err error) {
    level := slog.LevelInfo
    status := "success"
    if err != nil {
        level = slog.LevelError
        status = "failure"
    }
    logger.Log(context.Background(), level, "key_rotation_event",
        "event_type", "key_rotation",
        "key_id", keyID,
        "from_version", fromVersion,
        "to_version", toVersion,
        "deks_rotated", deksCount,
        "status", status,
        "error", err,
        "timestamp", time.Now().UTC(),
    )
}

Structured logs feed into your SIEM or log aggregator. Each event is queryable: "show me all key rotations for key X in the last 90 days." That's your audit trail.
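For events to be queryable, the logger needs a structured handler. Here is a minimal sketch using the standard library's JSON handler. Writing to a buffer is only so the output can be inspected here, and the `auditLine` helper and field values are made up for illustration.

```go
package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "log/slog"
)

// auditLine logs one rotation event through a JSON handler and returns the
// decoded fields, roughly the way a log aggregator would see them.
func auditLine(keyID, from, to string, deks int) (map[string]any, error) {
    var buf bytes.Buffer
    logger := slog.New(slog.NewJSONHandler(&buf, nil))
    logger.Info("key_rotation_event",
        "event_type", "key_rotation",
        "key_id", keyID,
        "from_version", from,
        "to_version", to,
        "deks_rotated", deks,
        "status", "success",
    )
    var fields map[string]any
    err := json.Unmarshal(buf.Bytes(), &fields)
    return fields, err
}

func main() {
    fields, err := auditLine("orders-db", "v3", "v4", 12)
    if err != nil {
        panic(err)
    }
    fmt.Println(fields["key_id"], fields["to_version"], fields["deks_rotated"])
}
```

In production you would point the handler at os.Stdout or your log shipper instead of a buffer; the handler also adds time, level, and msg fields on its own.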

Alerting on Failure

The worker logs failures and continues, but failures need to trigger alerts. I use a metrics counter that Prometheus scrapes:

var keyRotationFailures = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "key_rotation_failures_total",
        Help: "Total number of key rotation failures",
    },
    []string{"key_id"},
)

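func init() {
    // Register once at startup; an unregistered counter is never exposed on /metrics
    prometheus.MustRegister(keyRotationFailures)
}

// recordRotationFailure increments the per-key failure counter. The error
// itself is logged by the caller, so it is not used here.
func recordRotationFailure(keyID string, err error) {
    keyRotationFailures.WithLabelValues(keyID).Inc()
}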

Alert when rate(key_rotation_failures_total[1h]) > 0. A rotation failure means a key is approaching expiry without being rotated — that's urgent, not just informational.

The goal is that nobody has to remember to rotate keys. The system handles it, proves it happened, and wakes someone up if it doesn't.