Upgrading Clusters on VMware vSphere

This document explains how to upgrade Kubernetes clusters on VMware vSphere after the platform-side distribution upgrade is complete. The documented workflow focuses on updating the control plane and worker nodes through Cluster API resources.

INFO

Where this page fits in the full ACP upgrade flow

This page covers only the Kubernetes step of the upgrade. The full ACP upgrade flow (upgrade artifact synchronization, ACP Core upgrade through CVO, Aligned plugin upgrades, and Agnostic plugin upgrades from Marketplace) is documented in the ACP product documentation. Complete those steps before you start the Kubernetes step on this page.

Use this page when the cluster runs on an immutable operating system, because the Kubernetes step on an immutable OS replaces nodes from a new VM template rather than upgrading binaries in place.

Upgrade Sequence

Upgrade VMware vSphere clusters in the following order:

  1. (Prerequisite) Upgrade the ACP platform on the management cluster first. This brings the cluster-api-provider-vsphere controller and the related CAPI components to versions that understand the new schema. Trigger workload-cluster upgrades only after the management-side controllers have rolled out and become Ready.
  2. Complete the distribution-version upgrade described in Upgrading Clusters.
  3. Verify that the control plane is healthy and the current cluster is stable.
  4. Upgrade the control plane Kubernetes version.
  5. Upgrade worker nodes to the target Kubernetes version.

Prerequisites

Before you begin, ensure the following conditions are met:

  • The distribution-version upgrade is complete.
  • The control plane is healthy and reachable.
  • All nodes are in the Ready state.
  • The target VM template is present in the vSphere environment under the same name as the MicroOS Image Version value in the OS Support Matrix row. The upgrade fails if the template is not present when the new VSphereMachineTemplate is applied.
  • The target Kubernetes version is compatible with your workloads and add-ons.
  • The machine config pools have enough capacity for rolling updates.
  • You have reviewed the Kubernetes upgrade path and version skew policy.
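
The node-readiness prerequisite above can be checked from a script. The following is a minimal sketch, assuming kubectl already points at the workload cluster; the function names are illustrative and not part of the product.

```shell
# Sketch: list nodes whose STATUS column is not exactly "Ready".
# A cordoned node reports "Ready,SchedulingDisabled" and is listed as well,
# which is usually what you want to know before a rolling replacement.
not_ready_nodes() {
  nodes="$(kubectl get nodes --no-headers)" || return 1
  printf '%s\n' "$nodes" | awk 'NF && $2 != "Ready" { print $1 }'
}

# Sketch: succeed only when the node list could be read and every node is Ready.
all_nodes_ready() {
  pending="$(not_ready_nodes)" || { echo "cannot list nodes" >&2; return 1; }
  [ -z "$pending" ] || { echo "Nodes not Ready: $pending" >&2; return 1; }
}

# Usage (hypothetical kubeconfig path):
#   KUBECONFIG=/tmp/my-cluster.kubeconfig all_nodes_ready && echo "nodes OK"
```

The check treats a failed kubectl call as a failure rather than as an empty, healthy node list.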

WARNING

Disk Preservation Model

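Upgrades rely on Cluster API's rolling replacement mechanism, which deletes each node VM and recreates it from the new template. Each cluster has four disk classes. Of the disks attached through the node VM, only the pool-managed class survives a delete-recreate; external CSI volumes are independent of the node lifecycle and are not affected.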

| Disk class | Declared in | Survives upgrade? | Use for |
|---|---|---|---|
| System disk (root volume) | The VM template used for spec.template.spec.template | ❌ Never | OS + kubelet/kubeadm/containerd. Rebuilt from the new template every replacement. |
| Template-local disks | VSphereMachineTemplate.spec.template.spec.* (additional disks declared in the template) | ❌ Never | Ephemeral cache. Destroyed with the old VM. |
| Pool-managed persistent disks | VSphereMachineConfigPool.spec.slot[].persistentDisks | ✅ Detached from old VM and reattached to the new VM at the same slot | Platform state such as /var/cpaas. |
| External CSI volumes (vSphere CSI, etc.) | Workload PVCs / CSI driver | ✅ Unrelated to node lifecycle | Application data. |

"Preserved" means the same disk identity is reattached — it does not mean the disk's contents are time-traveled. Anything written to a pool-managed disk during the upgrade window stays after the upgrade and stays after a rollback.

WARNING

Templates Cannot Be Modified In Place

VSphereMachineTemplate.spec.template.spec is immutable. The vSphere admission webhook rejects any update with the message "VSphereMachineTemplate spec.template.spec field is immutable. Please create a new resource instead." Every upgrade step on this page therefore creates a new VSphereMachineTemplate with a new metadata.name, applies it, and then patches the controlling resource's infrastructureRef.name to the new template. Keep the previous template until the new rollout is healthy in case rollback is required.

INFO

Fleet Essentials UI does not support ACP 4.3 cluster upgrades

vSphere clusters do not currently expose a Fleet Essentials UI upgrade path. Use the YAML procedure documented below, or the two-step upgrade flow built into the ACP Core platform — see Request the upgrade for workload clusters.

Required Values From the OS Support Matrix

The authoritative mapping between an ACP release and its VM template, Kubernetes version, and matching CoreDNS, etcd, and Kube-OVN versions lives in the OS Support Matrix. Locate the row that corresponds to the target ACP version before you start; the row supplies every value the steps below need.

The cells you read from that row map to the upgrade manifests as follows:

| OS Support Matrix column | Used to set | Where it lands |
|---|---|---|
| MicroOS Image Version | VSphereMachineTemplate.spec.template.spec.template (target VM template name) | Control plane and worker VSphereMachineTemplate |
| Kubernetes Version | KubeadmControlPlane.spec.version and MachineDeployment.spec.template.spec.version | Both control plane and worker |
| coredns | KubeadmControlPlane.spec.kubeadmConfigSpec.clusterConfiguration.dns.imageTag | Control plane only |
| etcd | KubeadmControlPlane.spec.kubeadmConfigSpec.clusterConfiguration.etcd.local.imageTag | Control plane only |
| kube-ovn (chart) | Cluster.metadata.annotations["cpaas.io/kube-ovn-version"] | Cluster scope; the vSphere provider reconciles the Kube-OVN AppRelease from this annotation. This is the acp/chart-cpaas-kube-ovn chart version (for example v4.3.3), not the Kube-OVN component version. |

The CoreDNS and etcd image tags are control-plane-only because clusterConfiguration is a KubeadmControlPlane field. Worker nodes inherit container image versions from the new VM template; the MachineDeployment does not carry its own dns/etcd tags. The Kube-OVN annotation lives on the Cluster resource, not on KubeadmControlPlane, because the vSphere provider watches it independently of the Kubernetes control plane rollout.

Steps

Create the target machine templates

Before you start the rolling upgrade, create new VSphereMachineTemplate resources for the control plane and workers.

  1. Export the existing control plane template

    kubectl get vspheremachinetemplate <cluster_name>-control-plane -n <namespace> -o yaml > new-cp-template.yaml
  2. Modify the control plane template

    Edit new-cp-template.yaml:

    • Set metadata.name to a new unique name (for example, <cluster_name>-control-plane-v2)
    • Update spec.template.spec.template to the target VM template name
    • Update CPU, memory, or disk settings if needed
    • Remove server-generated fields: metadata.resourceVersion, metadata.uid, metadata.generation, metadata.creationTimestamp, metadata.managedFields, metadata.annotations["kubectl.kubernetes.io/last-applied-configuration"], and status
    • Leave spec.template.spec.providerID unset. The vSphere provider sets providerID to the VM's BIOS UUID once the VM is created; pre-filling it in the template breaks the controller's identity binding.
  3. Export and modify the worker template

    kubectl get vspheremachinetemplate <cluster_name>-worker -n <namespace> -o yaml > new-worker-template.yaml

    Edit new-worker-template.yaml:

    • Set metadata.name to a new unique name (for example, <cluster_name>-worker-v2)
    • Update spec.template.spec.template to the target VM template name
    • Update CPU, memory, or disk settings if needed
    • Remove the same server-generated fields listed above
  4. Apply both new templates

    kubectl apply -f new-cp-template.yaml
    kubectl apply -f new-worker-template.yaml
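
The export-and-edit steps above can also be scripted. The following is a minimal sketch, assuming jq is installed and the template is exported with -o json instead of -o yaml; the function name clone_machine_template is hypothetical. It removes the server-generated fields listed in step 2, leaves providerID unset, and sets the new name and target VM template.

```shell
# Sketch: rewrite an exported VSphereMachineTemplate (JSON on stdin) into a
# new template manifest (JSON on stdout). Requires jq.
# Arguments: <new-metadata-name> <target-vm-template-name>
clone_machine_template() {
  jq --arg name "$1" --arg vm "$2" '
    del(.metadata.resourceVersion, .metadata.uid, .metadata.generation,
        .metadata.creationTimestamp, .metadata.managedFields,
        .metadata.annotations["kubectl.kubernetes.io/last-applied-configuration"],
        .status,
        .spec.template.spec.providerID)
    | .metadata.name = $name
    | .spec.template.spec.template = $vm
  '
}

# Usage (hypothetical names; VM_TEMPLATE holds the MicroOS Image Version value):
#   kubectl get vspheremachinetemplate my-cluster-worker -n my-ns -o json \
#     | clone_machine_template my-cluster-worker-v2 "$VM_TEMPLATE" \
#     > new-worker-template.json
```

Review the generated file before applying it, in particular if CPU, memory, or disk settings also need to change.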

Upgrade the control plane

Before you start, collect every required value from the target ACP row in the OS Support Matrix as described in Required Values From the OS Support Matrix.

  1. Patch the KubeadmControlPlane with the target Kubernetes values

    Update the KubeadmControlPlane resource in a single edit to keep spec.version, the CoreDNS image tag, the etcd image tag, and the infrastructure template reference consistent with the same VM template:

    • spec.version ← Kubernetes Version from the OS Support Matrix row

    • spec.kubeadmConfigSpec.clusterConfiguration.dns.imageTag ← coredns column from the same row

    • spec.kubeadmConfigSpec.clusterConfiguration.etcd.local.imageTag ← etcd column from the same row

    • spec.machineTemplate.infrastructureRef.name ← the new VSphereMachineTemplate name created above

      kubectl edit kubeadmcontrolplane <cluster_name> -n <namespace>

    Updating only spec.version is not sufficient. The CoreDNS and etcd image tags must move together with the Kubernetes version because they are built from the same release; leaving them at the previous values can result in CoreDNS and etcd pods that do not match the new Kubernetes minor version.

  2. Update the Kube-OVN version annotation on the Cluster resource

    If the target ACP row in the OS Support Matrix shows a different kube-ovn (chart) value than the current cluster, patch the annotation on the Cluster resource so the vSphere provider reconciles the new Kube-OVN AppRelease.

    INFO

    Prerequisite — Kube-OVN reconcile gating annotation (vSphere only)

    On vSphere, the provider reconciles the Kube-OVN AppRelease only when the Cluster resource carries the annotation cpaas.io/network-type: kube-ovn. This annotation is normally set at cluster creation. If it is missing, the steps below will succeed at writing the version annotation but the Kube-OVN AppRelease will not be reconciled. Verify before proceeding:

    kubectl get cluster <cluster_name> -n <namespace> \
      -o jsonpath='{.metadata.annotations.cpaas\.io/network-type}{"\n"}'
    # Expected output: kube-ovn

    This precondition is vSphere-specific. The DCS and Huawei Cloud Stack providers use the DCSCluster / HCSCluster spec.networkType field instead and do not require this annotation.

    After confirming the gating annotation, set the version annotation:

    kubectl annotate cluster <cluster_name> -n <namespace> \
      cpaas.io/kube-ovn-version=<kube-ovn-version-from-matrix> --overwrite

    Kube-OVN is a Core lifecycle component, but on immutable OS the vSphere provider drives its delivery from this annotation; the annotation does not update automatically when spec.version changes.

    The vSphere provider reconciles a single Kube-OVN AppRelease named cni-kube-ovn in the cpaas-system namespace of the workload cluster. Run the following on the workload cluster (not the bootstrap KIND or the global cluster) to follow the reconciliation:

    # Overall AppRelease state — Sync and Health columns must reach a Success-equivalent reason
    kubectl get apprelease cni-kube-ovn -n cpaas-system
    
    # Installed revision and chart phase
    kubectl get apprelease cni-kube-ovn -n cpaas-system \
      -o jsonpath='Installed: {.status.charts.*.installedRevision}{"\n"}Phase: {.status.charts.*.phase}{"\n"}'

    The normal sequence is Upgrading → HealthChecking → Success. On small clusters the full transition typically completes within about one minute. Read the phases as follows:

    | Phase | Meaning | installedRevision |
    |---|---|---|
    | Upgrading | Helm release upgrade in progress. Sync condition is Unknown(Syncing). | Still the previous version |
    | HealthChecking | Helm release applied; controller is verifying Kube-OVN pods. Sync condition is True(Synced). | Already the target version |
    | Success | All three conditions (Validate, Sync, Health) are True. | Target version |

    WARNING

    Do not declare the upgrade complete on installedRevision alone. The field flips to the target value during HealthChecking, before pods have been verified Ready. The chart is only considered upgraded when phase is Success and installedRevision matches the target.

    The AppRelease API also defines Downloading, Installing, Syncing, DownloadFailed, DeployFailed, and NotReady. The first three are transient and the upgrade should converge on its own. The last three indicate a failure that needs manual investigation; start with kubectl describe apprelease cni-kube-ovn -n cpaas-system to read the per-condition message field.

  3. Monitor the control plane rollout

    kubectl -n <namespace> get kubeadmcontrolplane <cluster_name> -w
    kubectl -n <namespace> get machine -l cluster.x-k8s.io/control-plane
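
As a non-interactive alternative to kubectl edit in step 1, the four KubeadmControlPlane fields can be sent as one JSON merge patch so that they change in the same update. The following is a minimal sketch; the function name kcp_upgrade_patch is hypothetical, and the values are assumed to be plain version strings that need no JSON escaping.

```shell
# Sketch: print a JSON merge patch with the four fields that must move together.
# Arguments: <kubernetes-version> <coredns-tag> <etcd-tag> <new-template-name>
kcp_upgrade_patch() {
  cat <<EOF
{
  "spec": {
    "version": "$1",
    "kubeadmConfigSpec": {
      "clusterConfiguration": {
        "dns": { "imageTag": "$2" },
        "etcd": { "local": { "imageTag": "$3" } }
      }
    },
    "machineTemplate": { "infrastructureRef": { "name": "$4" } }
  }
}
EOF
}

# Usage (all variables are placeholders you fill from the OS Support Matrix row):
#   kubectl patch kubeadmcontrolplane "$CLUSTER_NAME" -n "$NAMESPACE" --type=merge \
#     -p "$(kcp_upgrade_patch "$K8S_VERSION" "$COREDNS_TAG" "$ETCD_TAG" "$NEW_CP_TEMPLATE")"
```

Because a merge patch only touches the keys it names, the other fields of infrastructureRef and clusterConfiguration are left unchanged.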

Upgrade the worker nodes

After the control plane upgrade completes, update the MachineDeployment to reference the new worker template and the target Kubernetes version.

Typical changes include:

  • spec.template.spec.version — the target Kubernetes version
  • spec.template.spec.infrastructureRef.name — the new VSphereMachineTemplate name
  • spec.template.spec.bootstrap.configRef.name — the new KubeadmConfigTemplate name, if bootstrap settings must change (see Updating Bootstrap Templates)

Apply the changes:

kubectl patch machinedeployment <cluster_name>-md-0 -n <namespace> \
  --type='merge' -p='{
    "spec": {
      "template": {
        "spec": {
          "version": "<target_kubernetes_version>",
          "infrastructureRef": {
            "name": "<new-worker-template-name>"
          }
        }
      }
    }
  }'

Monitor the worker rollout:

kubectl -n <namespace> get machinedeployment <cluster_name>-md-0 -w
kubectl -n <namespace> get machine
kubectl --kubeconfig=/tmp/<cluster_name>.kubeconfig get nodes -o wide
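
The last command above expects a workload-cluster kubeconfig at /tmp/<cluster_name>.kubeconfig. The following is a minimal sketch for producing that file, assuming the management cluster follows the usual Cluster API convention of a Secret named <cluster_name>-kubeconfig with the data key value; verify that this holds in your environment. The function name is hypothetical.

```shell
# Sketch: write the workload cluster kubeconfig to /tmp/<cluster_name>.kubeconfig.
# Run against the management cluster.
# Arguments: <cluster_name> <namespace>
fetch_workload_kubeconfig() {
  encoded="$(kubectl get secret "$1-kubeconfig" -n "$2" \
    -o jsonpath='{.data.value}')" || return 1
  # An empty result means the Secret exists but has no "value" key.
  [ -n "$encoded" ] || { echo "no kubeconfig data found" >&2; return 1; }
  printf '%s' "$encoded" | base64 -d > "/tmp/$1.kubeconfig"
}

# Usage (hypothetical names):
#   fetch_workload_kubeconfig my-cluster my-ns
#   kubectl --kubeconfig=/tmp/my-cluster.kubeconfig get nodes -o wide
```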

Rolling Back a Failed Upgrade

If the rolling update fails — new VMs fail to boot, nodes do not become Ready, or the new Kubernetes minor version surfaces an incompatibility — revert the template reference and Kubernetes-version fields back to the previous values. Cluster API treats the reversion as a new spec drift and rolls the v2 machines back to the previous template, one at a time.

Three facts to internalize before rolling back:

  • The old VMs are gone. They were destroyed during the upgrade. Rollback uses the old template to build a fresh set of replacement machines; it does not restore the original VMs.
  • The old VSphereMachineTemplate resource must still exist. Do not delete the previous template until the new rollout is healthy. If you already deleted it, recreate it from version control or backup before rolling back.
  • Pool-managed disk identity is preserved, but data state is not. Disks declared in VSphereMachineConfigPool.spec.slot[].persistentDisks reattach to the rolled-back machines at the same slot, but any data written to those disks during the upgrade window (for example, etcd entries in the new Kubernetes minor format) stays. If the new format is unreadable by the older Kubernetes minor version, the rollback may still fail and require manual etcd restoration.

Procedure:

  • Control plane: patch KubeadmControlPlane to restore the previous spec.machineTemplate.infrastructureRef.name, spec.version, spec.kubeadmConfigSpec.clusterConfiguration.dns.imageTag, and spec.kubeadmConfigSpec.clusterConfiguration.etcd.local.imageTag.

  • Workers: patch each MachineDeployment to restore the previous spec.template.spec.infrastructureRef.name and spec.template.spec.version.

  • Kube-OVN: if the kube-ovn annotation was changed, restore the previous value on the Cluster resource:

    kubectl annotate cluster <cluster_name> -n <namespace> \
      cpaas.io/kube-ovn-version=<previous-kube-ovn-version> --overwrite

If the new control plane never reached etcd quorum, the KubeadmControlPlane controller may refuse to roll back any machine because its preflight checks block on an unhealthy etcd. Recover etcd quorum first (operator intervention) before retrying the rollback.
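
A rollback needs the previous values of every field listed in the procedure, so it helps to record them before the upgrade starts. The following is a minimal sketch, assuming the KubeadmControlPlane is named after the cluster (as in the commands on this page) and that MachineDeployments carry the standard cluster.x-k8s.io/cluster-name label; the function name is hypothetical.

```shell
# Sketch: print the fields a rollback has to restore. Run against the
# management cluster BEFORE the upgrade and keep the output.
# Arguments: <cluster_name> <namespace>
snapshot_rollback_values() {
  # Control plane: version, template reference, CoreDNS and etcd image tags.
  kubectl get kubeadmcontrolplane "$1" -n "$2" -o jsonpath='version={.spec.version}{"\n"}template={.spec.machineTemplate.infrastructureRef.name}{"\n"}coredns={.spec.kubeadmConfigSpec.clusterConfiguration.dns.imageTag}{"\n"}etcd={.spec.kubeadmConfigSpec.clusterConfiguration.etcd.local.imageTag}{"\n"}'
  # Cluster: Kube-OVN chart version annotation.
  kubectl get cluster "$1" -n "$2" -o jsonpath='kube-ovn={.metadata.annotations.cpaas\.io/kube-ovn-version}{"\n"}'
  # Workers: one line per MachineDeployment.
  kubectl get machinedeployment -n "$2" -l "cluster.x-k8s.io/cluster-name=$1" -o jsonpath='{range .items[*]}{.metadata.name}{" version="}{.spec.template.spec.version}{" template="}{.spec.template.spec.infrastructureRef.name}{"\n"}{end}'
}

# Usage (hypothetical names):
#   snapshot_rollback_values my-cluster my-ns > rollback-values.txt
```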

Verification

Confirm the following results after the upgrade:

  • KubeadmControlPlane reaches the target version and desired replica count.
  • MachineDeployment reaches the target version and desired replica count.
  • Control plane and worker nodes return to the Ready state.
  • The vSphere CPI daemonset remains available in the workload cluster.
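
The version checks above can be approximated per Machine. The following is a minimal sketch that lists Machines that are not yet at the target version or not in the Running phase; it assumes the standard Cluster API fields Machine.spec.version and Machine.status.phase and the cluster.x-k8s.io/cluster-name label. It is a proxy for the checks in the list, not a replacement for inspecting the KubeadmControlPlane and MachineDeployment status. The function name is hypothetical.

```shell
# Sketch: print the names of Machines that have not converged on the target.
# Run against the management cluster. Empty output means every Machine is at
# the target version and Running.
# Arguments: <cluster_name> <namespace> <target_kubernetes_version>
stale_machines() {
  machines="$(kubectl get machine -n "$2" -l "cluster.x-k8s.io/cluster-name=$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.version}{" "}{.status.phase}{"\n"}{end}')" || return 1
  printf '%s\n' "$machines" \
    | awk -v want="$3" 'NF && ($2 != want || $3 != "Running") { print $1 }'
}

# Usage (hypothetical names; TARGET_VERSION is the Kubernetes Version value
# from the OS Support Matrix row):
#   stale_machines my-cluster my-ns "$TARGET_VERSION"
```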

Next Steps

After the Kubernetes upgrade is complete, continue with routine node operations in Managing Nodes on VMware vSphere.