Upgrading Clusters on Huawei Cloud Stack
This guide explains how to upgrade Kubernetes clusters on Huawei Cloud Stack with minimal downtime, while preserving stability and data integrity.
Where this page fits in the full ACP upgrade flow
This page covers only the Kubernetes step of the upgrade. The full ACP upgrade flow — including upgrade artifact synchronization, ACP Core upgrade through CVO, Aligned plugin upgrades, and Agnostic plugin upgrades from Marketplace — is documented in the ACP product documentation. Complete those steps before you start the Kubernetes step on this page:
- Upgrade Overview (scope and sequencing)
- Pre-Upgrade Preparation
- Upgrade the global cluster (Core, Aligned, Agnostic)
- Upgrade workload clusters (Core, Aligned, Agnostic)
Use this page when the same cluster runs on an immutable operating system, because the Kubernetes step on immutable OS replaces nodes from a new VM template rather than upgrading binaries in place.
Version
HCS provider v1.0.1 is the first release that supports pool-managed persistent disks.
Existing Cluster Migration
If your cluster runs ACP v4.3.1 or later and you are moving to HCS provider v1.0.1 or later, complete the migration procedure in Migrate Existing Huawei Cloud Stack Clusters to Pool-Managed Persistent Disks before you rely on upgrade-time disk preservation.
Overview
Cluster upgrades on HCS encompass multiple components and follow a structured approach to ensure system reliability:
- Control Plane Upgrades: Update Kubernetes control plane components and underlying infrastructure
- Worker Node Upgrades: Upgrade worker nodes with new machine images and Kubernetes versions
- Infrastructure Updates: Modify virtual machine specifications, storage, and network configurations
Upgrade Sequence
Upgrade HCS clusters in the following order:
1. (Prerequisite) Upgrade the ACP platform on the management cluster first. This brings the cluster-api-provider-hcs controller and the related CAPI components to versions that understand the new schema. Trigger workload-cluster upgrades only after the management-side controllers have rolled out and become Ready.
2. Upgrade the Distribution Version on the workload cluster. See Upgrading Clusters.
3. Upgrade the control plane Kubernetes version.
4. Upgrade worker nodes to the target Kubernetes version.
Cluster API orchestrates rolling updates with built-in safety mechanisms to reduce service disruption.
Skipping step 1 risks two failure modes: the old controller silently ignores new schema fields written to HCSMachineConfigPool / HCSMachineTemplate; or a controller image swap mid-rollout interrupts persistent-disk state-machine progression. Always settle the management-side upgrade before touching workload rollout.
Prerequisites
Before you start, ensure all of the following prerequisites are met:
- The Distribution Version upgrade is complete.
- The control plane is reachable.
- All nodes are healthy and in Ready state.
- The target VM image is present in the HCS environment under the same name as the MicroOS Image Version value in the OS Support Matrix row. The upgrade fails if the image is not present when the new HCSMachineTemplate is applied.
- The target Kubernetes version is compatible with your workloads and add-ons.
- Review the Kubernetes upgrade path and version skew policy.
- Any node-local state that must survive replacement is declared in HCSMachineConfigPool.spec.configs[].persistentDisks[], not in HCSMachineTemplate.spec.template.spec.dataVolumes[].
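The node-health prerequisite can be checked from the command line. The following is a minimal sketch, assuming kubectl is already configured for the workload cluster; the function name not_ready_nodes is hypothetical and not part of the product.

```shell
# Print the name of every node whose STATUS column is not exactly "Ready".
# Assumption: the current kubectl context points at the workload cluster.
# Cordoned nodes ("Ready,SchedulingDisabled") are reported as well, so that
# they get reviewed before the rolling replacement starts.
not_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 != "Ready" { print $1 }'
}
```

An empty result means every node reports Ready. Any name printed should be investigated before the upgrade is triggered.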
For initial deployment, see the Create Cluster guide.
Single-Control-Plane Clusters
The upgrade workflow in this document applies to HCS clusters with a highly available control plane. Single-control-plane HCS clusters are supported for creation, but they are not supported for upgrade through this workflow.
Disk Preservation Model
Upgrades rely on Cluster API's rolling replacement mechanism. The HCS provider has four disk classes:
Do not treat node-local data on HCS dataVolumes[] as preserved state. Move /var/cpaas and any other retained node-local paths to pool-managed persistent disks before the rolling replacement.
Templates Cannot Be Modified In Place
HCSMachineTemplate is a Cluster API infrastructure template. Cluster API only triggers rolling replacement when KubeadmControlPlane.spec.machineTemplate.infrastructureRef.name or MachineDeployment.spec.template.spec.infrastructureRef.name points at a different template name. Editing the existing template in place changes the manifest but does not produce a new rollout — the running VMs continue to use the in-memory snapshot of the previous template.
Every upgrade step on this page therefore creates a new HCSMachineTemplate with a new metadata.name, applies it, and then patches the controlling resource's infrastructureRef.name to the new template. Keep the previous template until the new rollout is healthy in case rollback is required.
Fleet Essentials UI does not support ACP 4.3 cluster upgrades
The Fleet Essentials UI workflow has not been adapted to the Cluster Version Operator (CVO) mechanism introduced in ACP 4.3. HCS clusters do not currently expose a Fleet Essentials UI upgrade path; use the YAML procedure documented below, or the two-step upgrade flow built into the ACP Core platform — see Request the upgrade for workload clusters.
Control Plane Upgrades
Control plane upgrades update the Kubernetes API server, etcd, scheduler, and controller manager, along with the underlying VM infrastructure.
For HCS control planes backed by an HCSMachineConfigPool that uses pool-managed persistent disks, keep KubeadmControlPlane.spec.rolloutStrategy.rollingUpdate.maxSurge: 0 during upgrades. Persistent disks are bound to fixed (hostname, slot) identities, so the rollout must remove the old machine before the replacement machine can reuse the same disk.
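As an illustration of the maxSurge setting, the sketch below pins it to 0 with a merge patch. It assumes kubectl targets the management cluster that owns the KubeadmControlPlane; the function name and both arguments are hypothetical placeholders.

```shell
# Pin maxSurge to 0 so the old machine is removed before its replacement
# is created and the (hostname, slot) disk binding can be reused.
# Assumption: kubectl context = management cluster.
pin_max_surge_zero() {   # $1=namespace  $2=KubeadmControlPlane name
  kubectl patch kubeadmcontrolplane "$2" -n "$1" --type merge \
    -p '{"spec":{"rolloutStrategy":{"rollingUpdate":{"maxSurge":0}}}}'
}
```

Running it before any template or version change keeps the setting in place for the whole rollout.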
Infrastructure Image Updates
Upgrading the underlying machine images for control plane nodes provides security patches, performance improvements, and updated system components.
Procedure
1. Create Updated Machine Template
   Copy the existing HCSMachineTemplate referenced by KubeadmControlPlane into a new manifest.
2. Modify Template Specifications
   Modify the new template:
   - Set metadata.name to <new-template-name>.
   - Remove server-generated metadata and status fields from the copied manifest.
   - Leave runtime identity fields unset, including spec.template.spec.providerID and spec.template.spec.serverId. The HCS provider assigns these values when it creates instances.
   - Keep preserved paths such as /var/cpaas out of spec.template.spec.dataVolumes[]. Declare those paths in the referenced HCSMachineConfigPool.spec.configs[].persistentDisks[].
   - Update as needed:
     - spec.template.spec.imageName
     - spec.template.spec.flavorName
     - spec.template.spec.rootVolume.size
     - spec.template.spec.dataVolumes for temporary disks only
3. Deploy Updated Template
   Apply the new machine template.
4. Update Control Plane Reference
   Modify the KubeadmControlPlane resource to reference the new template.
5. Monitor Rolling Update
   The control plane automatically performs a rolling update once the reference changes.
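The steps above can be sketched with kubectl as follows. This is an illustration under stated assumptions, not the product's reference procedure: kubectl is assumed to target the management cluster, the resource name hcsmachinetemplate is assumed to resolve to the HCSMachineTemplate CRD, and every function name and argument is a hypothetical placeholder. Step 2 (editing the copied manifest) stays manual and happens between the first two functions.

```shell
# Step 1: export the template currently referenced by the control plane.
export_template() {                  # $1=namespace  $2=old template  $3=output file
  kubectl get hcsmachinetemplate "$2" -n "$1" -o yaml > "$3"
}

# Steps 3-4: apply the edited copy, then repoint the KubeadmControlPlane.
# The patch only runs if the apply succeeded.
switch_control_plane_template() {    # $1=namespace  $2=KCP name  $3=file  $4=new template name
  kubectl apply -f "$3" &&
  kubectl patch kubeadmcontrolplane "$2" -n "$1" --type merge \
    -p "{\"spec\":{\"machineTemplate\":{\"infrastructureRef\":{\"name\":\"$4\"}}}}"
}

# Step 5: watch machines being replaced (Ctrl-C to stop).
watch_rollout() {                    # $1=namespace
  kubectl get machines -n "$1" -w
}
```

Because the rollout is triggered by the changed infrastructureRef.name, the previous template can stay in the cluster untouched until the new machines are healthy.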
Kubernetes Version Upgrades
Upgrading the Kubernetes version involves updating both the control plane software and the supporting virtual machine images.
Required Values From the OS Support Matrix
The authoritative mapping between an ACP release, its MicroOS image, the Kubernetes version, the matching CoreDNS, etcd, and Kube-OVN versions lives in OS Support Matrix. Locate the row that corresponds to the target ACP version before you start; the row supplies every value the procedure below needs.
The cells you read from that row map to the upgrade manifests as follows:
The CoreDNS and etcd image tags are control-plane-only because clusterConfiguration is a KubeadmControlPlane field. Worker nodes inherit container image versions from the new VM template; the MachineDeployment does not carry its own dns/etcd tags. The Kube-OVN annotation lives on the Cluster resource, not on KubeadmControlPlane, because the HCS provider watches it independently of the Kubernetes control plane rollout.
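To make the control-plane part of that mapping concrete, the sketch below renders a merge patch from the values of one matrix row. The function name is hypothetical, and all four arguments are supplied by the operator; nothing is looked up automatically.

```shell
# Render a merge patch for KubeadmControlPlane from one OS Support Matrix row.
#   $1 = Kubernetes Version   $2 = coredns tag   $3 = etcd tag
#   $4 = name of the new HCSMachineTemplate
render_kcp_patch() {
  cat <<EOF
spec:
  version: "$1"
  machineTemplate:
    infrastructureRef:
      name: "$4"
  kubeadmConfigSpec:
    clusterConfiguration:
      dns:
        imageTag: "$2"
      etcd:
        local:
          imageTag: "$3"
EOF
}
```

The output can be saved to a file and applied in one edit, for example with kubectl patch kubeadmcontrolplane, --type merge, and --patch-file, so that all four fields change together. The Kube-OVN annotation is deliberately absent because it belongs on the Cluster resource.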
Procedure
1. Create a new HCSMachineTemplate for the target Kubernetes version
   Copy the existing control-plane template to new-cp-template.yaml, edit it as described here, and apply it under a new metadata.name with the target imageName. In new-cp-template.yaml:
   - Set metadata.name to <new-template-name>.
   - Set spec.template.spec.imageName to the MicroOS Image Version value from the target row in the OS Support Matrix.
   - Strip server-generated metadata (resourceVersion, uid, generation, creationTimestamp, managedFields, kubectl.kubernetes.io/last-applied-configuration annotation) and the entire status field.
   - Leave runtime identity fields unset, including spec.template.spec.providerID and spec.template.spec.serverId. The HCS provider sets providerID to hcs://<cluster-name>/<machine-name> and serverId to the HCS ECS instance ID after the VM is created; pre-filling them in the template breaks the controller's identity binding.
2. Patch the KubeadmControlPlane with the target Kubernetes values
   Update the KubeadmControlPlane resource in a single edit to keep spec.version, the CoreDNS image tag, the etcd image tag, and the infrastructure template reference consistent with the same MicroOS release:
   - spec.version ← Kubernetes Version from the OS Support Matrix row
   - spec.kubeadmConfigSpec.clusterConfiguration.dns.imageTag ← coredns column from the same row
   - spec.kubeadmConfigSpec.clusterConfiguration.etcd.local.imageTag ← etcd column from the same row
   - spec.machineTemplate.infrastructureRef.name ← the new HCSMachineTemplate name created in step 1
   Updating only spec.version is not sufficient. The CoreDNS and etcd image tags must move together with the Kubernetes version because they are built from the same MicroOS release; leaving them at the previous values can result in CoreDNS and etcd pods that do not match the new Kubernetes minor version.
   Keep spec.rolloutStrategy.rollingUpdate.maxSurge: 0 when the referenced control plane pool uses persistent disks. The replacement machine must reuse the same fixed hostname and disk slot after the old machine is removed.
3. Upgrade the Kube-OVN chart on the workload cluster
   Kube-OVN is a Core lifecycle component, but on immutable OS the HCS provider does not pin its chart version to the cluster's Kubernetes version. The chart version is carried by a separate AppRelease named cni-kube-ovn in the cpaas-system namespace of the workload cluster, and you move it forward in two steps: update the annotation on the Cluster resource for bookkeeping and future re-creation, then patch the existing AppRelease directly to bump the chart revision.
   WARNING
   Why two steps are required on HCS
   The HCS provider creates the cni-kube-ovn AppRelease the first time the cluster is built, and from then on it reconciles only the spec.values block (cluster name, CIDRs, registry, control-plane node list). It does not write to spec.source.charts[0].targetRevision on an AppRelease that already exists. As a result, changing cpaas.io/kube-ovn-version on the Cluster resource alone does not move the chart version on the workload cluster. The annotation must still be updated so the recorded target matches the OS Support Matrix row, but the chart upgrade itself is driven by a direct AppRelease patch.
   3.1. Update the cpaas.io/kube-ovn-version annotation on the Cluster resource
   The annotation does not update automatically when spec.version changes; keep it in step with the kube-ovn (chart) column of the target row.
   3.2. Patch the AppRelease chart revision on the workload cluster
   Run the patch against the workload cluster's API server (not the bootstrap KIND or the global cluster). Use the same value you set in the annotation. The releaseName (cpaas-kube-ovn) and name (acp/chart-cpaas-kube-ovn) are managed by the provider; do not change them.
   3.3. Wait for reconciliation to complete
   Watch the chart phase and the installed revision. The normal sequence is Upgrading → HealthChecking → Success. On small clusters the full transition typically completes within about one minute. Read the phases as follows:
   WARNING
   Do not declare the upgrade complete on installedRevision alone. The field flips to the target value during HealthChecking, before pods have been verified Ready. The chart is only considered upgraded when phase is Success and installedRevision matches the target.
   The AppRelease API also defines Downloading, Installing, Syncing, DownloadFailed, DeployFailed, and NotReady. The first three are transient and the upgrade should converge on its own. The last three indicate a failure that needs manual investigation; start with kubectl describe apprelease cni-kube-ovn -n cpaas-system to read the per-condition message field.
4. Verify Upgrade Progress
   Monitor the rolling upgrade process.
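The Kube-OVN part of the procedure (step 3) can be sketched as below. Several things here are assumptions rather than documented facts: MGMT_CTX and WORKLOAD_CTX are hypothetical kubectl context names for the management and workload clusters, the resource name cluster is assumed to resolve to the Cluster resource that carries the annotation, and the function names are illustrative. The completion check takes the phase and installedRevision values as arguments because the exact status paths are not given on this page.

```shell
# 3.1: record the target chart version on the Cluster resource.
annotate_kube_ovn_version() {   # $1=namespace  $2=cluster name  $3=chart version
  kubectl --context "$MGMT_CTX" annotate cluster "$2" -n "$1" --overwrite \
    "cpaas.io/kube-ovn-version=$3"
}

# 3.2: bump the chart revision on the existing AppRelease (workload cluster).
patch_kube_ovn_apprelease() {   # $1=chart version (same value as the annotation)
  kubectl --context "$WORKLOAD_CTX" patch apprelease cni-kube-ovn -n cpaas-system \
    --type json \
    -p "[{\"op\":\"replace\",\"path\":\"/spec/source/charts/0/targetRevision\",\"value\":\"$1\"}]"
}

# 3.3: the upgrade counts as done only when BOTH conditions hold.
kube_ovn_upgrade_done() {       # $1=phase  $2=installedRevision  $3=target version
  [ "$1" = "Success" ] && [ "$2" = "$3" ]
}
```

The last function encodes the warning from step 3.3: a matching installedRevision during HealthChecking returns a non-zero status, so a script that polls with it does not report success early.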
Worker Node Upgrades
Worker node upgrades are managed via MachineDeployment resources.
For detailed worker node procedures, see the Managing Nodes section.
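As a rough illustration of what a worker rollout looks like at the resource level, the sketch below patches the two MachineDeployment fields that this page names in its rollback procedure. It assumes kubectl targets the management cluster and that the new HCSMachineTemplate already exists; the function name and arguments are hypothetical. The Managing Nodes section remains the authoritative procedure.

```shell
# Point a MachineDeployment at a new template and Kubernetes version.
# Changing infrastructureRef.name is what triggers the rolling replacement.
roll_machine_deployment() {   # $1=namespace  $2=MachineDeployment  $3=new template  $4=Kubernetes version
  kubectl patch machinedeployment "$2" -n "$1" --type merge \
    -p "{\"spec\":{\"template\":{\"spec\":{\"version\":\"$4\",\"infrastructureRef\":{\"name\":\"$3\"}}}}}"
}
```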
Rolling Back a Failed Upgrade
If the rolling update fails (new VMs fail to boot, nodes do not become Ready, or the new Kubernetes minor version surfaces an incompatibility), revert the template reference and Kubernetes-version fields to their previous values. Cluster API treats the reversion as new spec drift and replaces the machines built from the new template with machines built from the previous template, one at a time.
Three facts to internalize before rolling back:
- The old VMs are gone. They were destroyed during the upgrade. Rollback uses the old template to build a fresh set of replacement machines; it does not restore the original VMs.
- The old HCSMachineTemplate resource must still exist. Do not delete the previous template until the new rollout is healthy. If you already deleted it, recreate it from version control or backup before rolling back.
- Only pool-managed persistent disks preserve node-local state. Data written to HCSMachineTemplate.spec.template.spec.dataVolumes[] during the upgrade window is lost when that VM is replaced. Data written to disks declared in HCSMachineConfigPool.spec.configs[].persistentDisks[] is retained and reattached to the replacement VM. Application data should still use external persistent storage such as HCS EVS CSI unless your operational design explicitly depends on node-local state.
Procedure:
1. Control plane: patch KubeadmControlPlane to restore the previous spec.machineTemplate.infrastructureRef.name, spec.version, spec.kubeadmConfigSpec.clusterConfiguration.dns.imageTag, and spec.kubeadmConfigSpec.clusterConfiguration.etcd.local.imageTag.
2. Workers: patch each MachineDeployment to restore the previous spec.template.spec.infrastructureRef.name and spec.template.spec.version.
3. Kube-OVN: if the Kube-OVN chart was upgraded, revert it the same way the upgrade was applied: first restore the annotation, then patch the AppRelease chart revision back. Verify with the same installedRevision + phase=Success check used in step 3 of the Kubernetes version upgrade procedure.
If the new control plane never reached etcd quorum, the KubeadmControlPlane controller may refuse to roll back any machine because its preflight checks block on an unhealthy etcd. Recover etcd quorum first (operator intervention) before retrying the rollback.
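Rollback step 1 depends on knowing the previous values. One way to capture them is to record the four KubeadmControlPlane fields before the upgrade starts. The sketch below assumes kubectl targets the management cluster; the function name and arguments are hypothetical.

```shell
# Print the four values that rollback step 1 restores, one per line:
# template name, Kubernetes version, CoreDNS tag, etcd tag.
# Run this BEFORE the upgrade and keep the output with the change record.
snapshot_kcp_values() {   # $1=namespace  $2=KubeadmControlPlane name
  kubectl get kubeadmcontrolplane "$2" -n "$1" -o jsonpath='{.spec.machineTemplate.infrastructureRef.name}{"\n"}{.spec.version}{"\n"}{.spec.kubeadmConfigSpec.clusterConfiguration.dns.imageTag}{"\n"}{.spec.kubeadmConfigSpec.clusterConfiguration.etcd.local.imageTag}{"\n"}'
}
```

A field that was never set prints as an empty line, which is itself useful to know when restoring the previous state.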