Five Major Velero Upgrades: What I Learned
We've been running Velero for GKE cluster backup since 2020. Over that time I've managed it through five major version jumps: 1.2 to 1.3, 1.3 to 1.6, 1.6 to 1.8, 1.8 to 1.10, and 1.10 to 1.11. Three clusters, varying node counts, a mix of workloads with PVs. Each upgrade taught me something that the previous one hadn't.
This is a retrospective on what made upgrades painful, what made them smooth, and what I do differently now.
Why Velero Upgrades Aren't Trivial
Velero's CRD schema evolves across versions. The Backup, Restore, Schedule, and BackupStorageLocation CRDs have changed between releases. When you upgrade the Velero server, it needs to be compatible with the CRD versions in the cluster, and the existing backup objects need to remain readable.
Plugin APIs also change. Velero has a plugin architecture for storage backends and volume snapshot providers. A plugin compiled against the Velero 1.8 plugin API may not work with Velero 1.10. You must upgrade plugins in lockstep with Velero itself, and if a plugin hasn't released a compatible version yet, you're blocked.
Finally, the backup format in GCS is stable across minor versions but the metadata format has changed across major versions. Velero provides migration tooling for major format changes, but you need to know it exists and run it.
The General Upgrade Process
For every upgrade, before touching production:
- Read the upgrade guide for the specific version pair. Velero's upgrade guides live at velero.io/docs/vX.Y/upgrade-to-vX-Y/ and cover the specific steps for that version.
- Upgrade one minor version at a time for large jumps. When I needed to get from 1.3 to 1.6, I went 1.3 → 1.4 → 1.5 → 1.6, not directly to 1.6. This is slower but significantly safer.
- Upgrade in dev first. Do a complete backup and restore test in dev before touching staging or production.
- Have a rollback plan. For Velero, rollback means reverting the Velero deployment to the previous version. Your existing backups in GCS are still there; you don't lose them by rolling back.
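The stepwise rule above is mechanical enough to script. A sketch, where minor_steps is my own helper name (not a Velero tool):

```shell
# Sketch of the stepwise upgrade rule. minor_steps is my own helper
# (not a Velero tool): it prints every intermediate 1.x version you
# need to pass through, one `helm upgrade` per line of output.
minor_steps() {
  from_minor=${1#1.}
  to_minor=${2#1.}
  m=$((from_minor + 1))
  while [ "$m" -le "$to_minor" ]; do
    printf '1.%s\n' "$m"
    m=$((m + 1))
  done
}

# minor_steps 1.3 1.6   prints 1.4, 1.5, 1.6 (one upgrade each)
```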
The Helm-based installation is straightforward:
```shell
helm repo update

# Check current version
helm list -n velero

# Upgrade (note: --version pins the Helm chart version, which is
# numbered independently of the Velero app version)
helm upgrade velero vmware-tanzu/velero \
  --namespace velero \
  --values my-velero-values.yaml \
  --version 1.11.0 \
  --wait
```
The --wait flag blocks until all pods are running. Check the Velero pod logs after every upgrade.
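For the log check, I pipe the server log through a small filter so only error-level lines surface; velero_log_errors is my own helper name:

```shell
# Post-upgrade log triage (velero_log_errors is my own helper name):
# keep only error/fatal lines from the Velero server log so a clean
# upgrade produces no output at all.
velero_log_errors() {
  grep -iE 'level=(error|fatal)' || true   # exit 0 even when nothing matches
}

# Usage against a live cluster:
#   kubectl logs deploy/velero -n velero | velero_log_errors
```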
The 1.6 Upgrade: Plugin Architecture Change
Before 1.6, the storage provider (in our case, GCS) and the volume snapshot plugin were configured via flags on the Velero deployment itself: --provider gcp, --backup-location-config serviceAccount=....
In 1.6, this moved to first-class plugin resources. You install the GCP plugin as a separate container, and configuration lives in BackupStorageLocation and VolumeSnapshotLocation CRDs.
```yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: velero.io/gcp
  objectStorage:
    bucket: my-velero-backups
  config:
    serviceAccount: velero@my-project.iam.gserviceaccount.com
```
The upgrade involved:
- Removing the old --provider and --backup-location-config flags from the Velero deployment
- Installing the velero-plugin-for-gcp as a separate init container
- Creating the BackupStorageLocation and VolumeSnapshotLocation resources
Existing backups were unaffected — the storage format didn't change, just how Velero accessed GCS. But if you miss the plugin installation step, Velero comes up but can't write backups, which you might not notice immediately.
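That silent failure mode is cheap to detect: after every upgrade, confirm the BackupStorageLocation reports Available. phase_ok below is just my own tiny wrapper around the jsonpath output:

```shell
# phase_ok is my own tiny wrapper: succeeds only when stdin is
# exactly "Available", the healthy BackupStorageLocation phase.
phase_ok() {
  [ "$(cat)" = "Available" ]
}

# Usage against a live cluster:
#   kubectl get backupstoragelocation default -n velero \
#     -o jsonpath='{.status.phase}' | phase_ok || echo "BSL unavailable"
```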
The 1.10 Upgrade: Restic to Kopia
This was the most disruptive upgrade we did.
Before 1.10, Velero's file-system backup (pod volume backup — backing up PV contents by mounting them into a Restic container) used Restic as the uploader. In 1.10, Velero unified this path under the name File System Backup and introduced Kopia, a different deduplication and upload tool, as a second uploader — with Kopia as the project's direction going forward.
The transition is NOT automatic. You have to explicitly opt into Kopia by setting --uploader-type=kopia on your Velero deployment. If you don't set this flag, you continue using Restic.
Here's the catch: pods annotated for volume backup use:
```yaml
backup.velero.io/backup-volumes: my-volume-name
```
This annotation works the same regardless of whether Restic or Kopia is the uploader. But backups created with Restic can only be restored with the Restic uploader. Backups created with Kopia can only be restored with the Kopia uploader. They are not interchangeable.
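For context, here is where that annotation lives on a workload; all names below are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-pv           # illustrative name
  annotations:
    backup.velero.io/backup-volumes: my-volume-name
spec:
  containers:
    - name: app
      image: nginx:1.25
      volumeMounts:
        - name: my-volume-name
          mountPath: /data
  volumes:
    - name: my-volume-name
      persistentVolumeClaim:
        claimName: app-data   # illustrative PVC
```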
Our migration path:
1. Upgrade Velero to 1.10 without enabling Kopia (keep using Restic temporarily)
2. Run new backups with Restic — these are your rollback point
3. Enable Kopia, run a test backup and restore in dev
4. Once confident, enable Kopia in staging and production
5. Keep old Restic backups until they've aged out of your retention window; don't delete them early
The official upgrade docs suggest keeping the Restic uploader running alongside Kopia during the transition. We skipped that and went straight to Kopia after testing in dev. I wouldn't do it that way again — keep both available until all your backups are Kopia-native.
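Step 3's cutover is a one-line change in Helm values. I believe recent versions of the vmware-tanzu/velero chart expose this as configuration.uploaderType, but verify the key against your chart's values.yaml before relying on it:

```yaml
# Helm values sketch for the Kopia cutover. Key name is an assumption
# based on recent chart versions; check your chart's values.yaml.
configuration:
  uploaderType: kopia   # set to "restic" (or omit) to stay on Restic
```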
The 1.11 Upgrade: CSI Snapshots Matured
In 1.11, CSI snapshot support stabilized and became the recommended approach for PV backups on GKE. Instead of copying file contents into GCS via Restic/Kopia, CSI snapshots take a storage-level snapshot of the PV (backed by GCP Persistent Disk snapshots), which is orders of magnitude faster.
The prerequisite is a configured VolumeSnapshotClass:
```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-gce-pd-vsc
  labels:
    velero.io/csi-volumesnapshot-class: "true"
driver: pd.csi.storage.gke.io
deletionPolicy: Retain
```
The velero.io/csi-volumesnapshot-class: "true" label tells Velero to use this class for CSI snapshots. The deletionPolicy: Retain is important — if it's set to Delete, Velero's cleanup process could inadvertently delete the underlying GCP disk snapshot.
After enabling CSI snapshots, PV backups that used to take 20-30 minutes (copying gigabytes of data via Kopia) complete in under a minute. The tradeoff is that the snapshots live as GCP disk snapshots rather than objects in GCS — different cost model, different retention management.
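One step worth spelling out: in 1.11, CSI snapshot support still sits behind the EnableCSI feature flag, and the CSI plugin ships separately from Velero itself. A hedged sketch of the Helm values, assuming recent chart keys (verify against your chart's values.yaml) and with the plugin tag as a placeholder you'd match to your Velero version:

```yaml
# Values sketch for enabling CSI snapshots. Chart key names and the
# plugin image tag are assumptions; verify both for your versions.
configuration:
  features: EnableCSI
initContainers:
  - name: velero-plugin-for-csi
    image: velero/velero-plugin-for-csi:v0.5.0   # placeholder tag
    volumeMounts:
      - mountPath: /target
        name: plugins
```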
Backup Format and Storage
Velero stores each backup as a directory in GCS containing:
- backup.tar.gz: all the Kubernetes resource manifests (namespace, deployments, services, configmaps, etc.)
- velero-backup.json: backup metadata
- <backup-name>-volumesnapshots.json.gz: volume snapshot metadata
The Kubernetes resource format is stable across minor versions. When there's a major format change, Velero provides a migration command; this hasn't happened during any of our upgrades but I've seen it in older upgrade notes.
One thing I check after every upgrade: velero backup describe <recent-backup-name>. Make sure the backup shows Phase: Completed and that the item counts make sense. A backup that says Completed with 0 items backed up is silently broken.
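I've wrapped that check in a small script. backup_looks_healthy is my own helper: it reads velero backup describe output on stdin and fails unless the phase is Completed and at least one item was backed up (adjust the patterns if your Velero version formats describe output differently):

```shell
# backup_looks_healthy is my own helper: reads `velero backup describe`
# output on stdin, succeeds only if the phase is Completed AND at
# least one item was backed up. The "Phase:" / "Items backed up:"
# patterns are assumptions about the describe output format.
backup_looks_healthy() {
  out=$(cat)
  printf '%s\n' "$out" | grep -q 'Phase: *Completed' || return 1
  items=$(printf '%s\n' "$out" \
    | sed -n 's/.*Items backed up: *\([0-9][0-9]*\).*/\1/p')
  [ -n "$items" ] && [ "$items" -gt 0 ]
}

# Usage:
#   velero backup describe <recent-backup-name> | backup_looks_healthy \
#     || echo "backup is not healthy"
```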
What I Do Differently Now
Test every upgrade with a full backup + restore to a separate namespace. This is the only way to actually know the upgrade worked. Not just "is the backup status Completed" but "can I actually restore from it."
```shell
# Create a test backup
velero backup create upgrade-test-$(date +%Y%m%d) \
  --include-namespaces test-workload

# Restore to a different namespace
velero restore create --from-backup upgrade-test-$(date +%Y%m%d) \
  --namespace-mappings test-workload:test-workload-restored

# Verify the restored objects
kubectl get all -n test-workload-restored
```
Schedule upgrade windows. Don't upgrade Velero during a period when you'd need to rely on a backup. If something goes wrong with the upgrade, you want time to investigate before the next scheduled backup runs.
Never upgrade Velero and GKE on the same day. Two things changing at once makes failures hard to attribute. Give each upgrade its own change window with at least a day between them.
Pin plugin versions to match Velero. The velero-plugin-for-gcp has version compatibility requirements. Every time I upgrade Velero I explicitly check the plugin compatibility matrix and pin the plugin version in my Helm values. Letting Helm pull "latest" on the plugin will eventually catch you with an incompatible version.
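What the pinning looks like in our Helm values (the image tag below is an example, not a recommendation; take the correct one from the compatibility matrix for your Velero version):

```yaml
# Helm values fragment pinning the GCP plugin. The tag is an example;
# pin it per the velero-plugin-for-gcp compatibility matrix.
initContainers:
  - name: velero-plugin-for-gcp
    image: velero/velero-plugin-for-gcp:v1.7.0   # example tag
    volumeMounts:
      - mountPath: /target
        name: plugins
```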
The 1.10 Restic-to-Kopia transition was the hardest upgrade by far. If you're on 1.9 or earlier and planning to move to 1.10+, read that section twice and plan your backup transition carefully. Everything else has been manageable with proper preparation.
