May 10, 2022 Marie H.

Kubernetes Backup and Restore with Velero

I've managed Velero across three clusters — production, staging, and dev — through five major version upgrades from 1.2 to 1.11. Here's what I've learned about making it work reliably and what trips you up.

What Velero Actually Does

Velero has two distinct jobs that people often conflate: backing up Kubernetes resources (manifests — Deployments, Services, ConfigMaps, Secrets, PVCs) and backing up the actual data in persistent volumes. These are fundamentally different operations with different speed, reliability, and complexity characteristics.

Resource backup is fast. Velero queries the Kubernetes API, serializes everything to JSON/YAML, and writes it to object storage (S3, GCS, Azure Blob). Backing up a moderately sized namespace with ~200 resources takes under a minute.

Volume backup is slow and complex. Velero either uses file-level backup (via Restic, now Kopia) to copy the actual bytes from a mounted volume, or it triggers a CSI snapshot through your storage driver. File-level backup of a 50GB Postgres data directory can run for a long time — minutes to hours depending on disk and network throughput. CSI snapshots are nearly instant if your storage class supports them.

Know which one you need. If your data lives in an external database (RDS, Cloud SQL), you only need resource backup. The PVC manifests will restore, and your data has its own backup path. If you're running stateful workloads directly in the cluster, you need volume backup too.
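If you do need file-level volume backup, the versions I've run default to opt-in per pod: you annotate the pod with the volumes to copy. A sketch of the shape (the `backup.velero.io/backup-volumes` annotation is Velero's; the pod, volume, and claim names are made up):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: postgres-0   # hypothetical pod name
  annotations:
    # Tell Velero's file-level backup agent which volumes to copy.
    backup.velero.io/backup-volumes: data
spec:
  containers:
    - name: postgres
      image: postgres:14
      volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: postgres-data   # hypothetical claim name
```

Volumes without the annotation get their PVC manifest backed up but not their bytes — which is exactly what you want for externally backed data.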

Installation with Helm

helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
helm repo update

helm install velero vmware-tanzu/velero \
  --namespace velero \
  --create-namespace \
  -f velero-values.yaml

The velero-values.yaml that matters:

configuration:
  backupStorageLocation:
    - name: default
      provider: gcp
      bucket: my-cluster-velero-backups
      config:
        serviceAccount: velero@my-project.iam.gserviceaccount.com

  volumeSnapshotLocation:
    - name: default
      provider: gcp
      config:
        project: my-gcp-project

credentials:
  useSecret: true
  secretContents:
    cloud: |
      [default]
      ...

BackupStorageLocation is where Kubernetes manifests and metadata go — your GCS/S3 bucket. VolumeSnapshotLocation is where PV snapshots go — this is cloud provider disk snapshots, separate from your manifest bucket. They're independent; you can have one without the other.
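After install, it's worth confirming both locations actually registered — an unavailable BackupStorageLocation means backups fail before they start:

```shell
# Check that the BSL shows as Available and the VSL exists.
velero backup-location get
velero snapshot-location get
```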

Creating and Scheduling Backups

A manual backup of specific namespaces:

velero backup create my-backup --include-namespaces=production,monitoring

Without --include-namespaces, it backs up everything. I've been bitten by this — the default catches more than you expect, including system namespaces. Be explicit.
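When you genuinely want a cluster-wide backup, I'd still carve out the namespaces that are pointless to restore. A sketch of the exclusion form (namespace list is illustrative — adjust for your cluster):

```shell
# Back up everything except system and Velero-internal namespaces.
velero backup create cluster-backup \
  --exclude-namespaces kube-system,kube-public,velero
```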

Check backup status:

velero backup describe my-backup --details
velero backup logs my-backup

Scheduled backups are what you actually run in production:

velero schedule create daily-production \
  --schedule="0 3 * * *" \
  --include-namespaces=production \
  --ttl 720h  # 30 days retention

The --ttl sets how long backups are kept before Velero automatically deletes them. I run daily backups with 30-day TTL and weekly backups with 90-day TTL. The schedules stack fine.
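The weekly schedule I run alongside the daily one looks like this (the cron expression and TTL are my own choices; 2160h is 90 days):

```shell
velero schedule create weekly-production \
  --schedule="0 4 * * 0" \
  --include-namespaces=production \
  --ttl 2160h  # 90 days retention
```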

Restoring

velero restore create --from-backup my-backup

Important default behavior: restore skips resources that already exist. If a Deployment is running in the target namespace, Velero will not overwrite it. To change this:

velero restore create --from-backup my-backup \
  --existing-resource-policy=update

To restore to a different namespace (useful for DR drills without touching production):

velero restore create --from-backup my-backup \
  --namespace-mappings production:production-restored

Watch restore progress:

velero restore describe <restore-name> --details

Version Upgrade Lessons: 1.2 to 1.11

The CRD format changed materially in 1.5 (moving from v1beta1 to v1 CRDs), again in 1.10 with the Restic→Kopia migration path, and the BSL/VSL schema has shifted across multiple versions. Hard rules I follow:

Never skip a minor version. 1.2 → 1.11 is not a valid upgrade path. Go 1.2 → 1.3 → ... → 1.11. Each release's upgrade guide documents what CRD migrations are required. Skip one and you risk corrupted backup metadata.

Always read the upgrade guide before upgrading. Not the changelog. The dedicated upgrade guide in the docs. These are different documents. The changelog says what changed; the upgrade guide says what you must do to not break your existing backups.

Back up your Velero CRD state before upgrading Velero itself. Before upgrading the Velero Helm release, I dump the state:

kubectl get backup,restore,schedule,backupstoragelocation -n velero -o yaml > velero-state-before-upgrade.yaml

This has saved me once.

The Restic → Kopia Migration (Velero 1.10)

Velero 1.10 deprecated Restic in favor of Kopia for file-level volume backup. Kopia is meaningfully faster for large volumes and more reliable with concurrent backups. The migration is not automatic.

If you're on 1.9 or older using Restic, your existing backups remain Restic backups and can still be restored. New backups after enabling Kopia use Kopia. You have to explicitly set --uploader-type=kopia during install/upgrade.
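With the velero CLI installer that's the documented --uploader-type=kopia flag. If you manage Velero through Helm instead, my understanding is the chart exposes it as configuration.uploaderType — verify the key against your chart version before applying:

```shell
# Assumes the vmware-tanzu chart accepts configuration.uploaderType;
# check your chart version's values reference first.
helm upgrade velero vmware-tanzu/velero \
  --namespace velero \
  --reuse-values \
  --set configuration.uploaderType=kopia
```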

Do this migration. Kopia handles large volumes (100GB+) significantly better than Restic. We saw backup time drop by ~40% on our largest volumes after switching.

Disaster Recovery Drills

This is the part most teams skip and regret. Quarterly, I restore a recent production backup to a separate production-dr-test namespace and verify that the application actually starts and that the data is intact. The process:

# Create restore into isolated namespace
velero restore create dr-drill-q2-2022 \
  --from-backup daily-production-<latest> \
  --namespace-mappings production:production-dr-test \
  --include-namespaces production

# Watch it complete
kubectl get pods -n production-dr-test -w

# Verify application health
kubectl exec -n production-dr-test deploy/api -- /healthcheck

Then delete the namespace when done. This drill has caught two real issues: once a missing Secret that wasn't included in the backup scope, and once a volume restore that silently failed (status showed "Completed" but the data was wrong). You cannot trust backup status alone. Restore and verify.
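My verification step is app-specific, but the shape is always the same: compare a cheap invariant between production and the restored copy. A sketch for a Postgres-backed app (the deploy/api target and the orders table are hypothetical — substitute your own):

```shell
# Compare a row count between the live and restored databases.
# 'deploy/api' and the 'orders' table are made-up names.
live=$(kubectl exec -n production deploy/api -- \
  psql -tAc "SELECT count(*) FROM orders")
restored=$(kubectl exec -n production-dr-test deploy/api -- \
  psql -tAc "SELECT count(*) FROM orders")
echo "live=$live restored=$restored"
# The counts won't match exactly (the backup is point-in-time), but an
# order-of-magnitude gap means the volume restore silently failed.
```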

What Velero Is Not

Velero is not a database backup tool. It can back up a PVC containing a Postgres data directory, but restoring that into a running Postgres pod is not guaranteed to give you a consistent database state. For databases, use native backup tools (pg_dump, mysqldump, or managed service snapshots) and use Velero for Kubernetes resource recovery only.

Velero is also not fast enough for RPO requirements under 15 minutes. For that you need application-level replication.

For most teams running production workloads on Kubernetes without extremely tight recovery time objectives, Velero with daily scheduled backups and quarterly restore drills is the right solution. It's mature, well-maintained, and the GKE/GCS integration is solid.