Troubleshoot a Cluster Stuck in Deleting
Scope
This page covers the case where a workload cluster that was built by ACP is being deleted, every child resource (provider-specific Machine CRs and the IaaS VMs they represent) is already gone, but the top-level Cluster CR remains in Phase: Deleting indefinitely.
This page applies to any ACP-built workload cluster, regardless of the underlying IaaS:
- Huawei DCS (DCS Provider)
- Huawei Cloud Stack (HCS Provider)
- VMware vSphere (vSphere Provider)
- Bare-metal on Immutable Infrastructure (baremetal Provider)
- Clusters created through the SSH UPI path managed by cluster-manager
The behavior was observed directly on DCS. The other providers share the same global-cluster control-plane path, so the same behavior is expected on them.
Not in scope:
- A provider-specific Machine CR (such as DCSMachine) stuck in Deleting while the IaaS VM still exists; see Troubleshoot a DCSMachine Stuck in Deleting.
- A workload cluster that was imported (onboarded as a Third-party Cluster) and never reached the ready state; see Troubleshoot a Workload Cluster Stuck in Provisioned.
Symptoms
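The Cluster CR shows the following pattern:
- Phase: Deleting, with no further progress.
- Finalizers: capi.cpaas.io/imported is still present.
- spec.infrastructureRef / spec.controlPlaneRef: both set, which marks the cluster as ACP-built.
- Child resources: the provider-specific Machine CRs and the IaaS VMs they represent are already gone.
The capi.cpaas.io/imported finalizer (and a capi.cpaas.io/alauda-cluster: imported label, if present) does not by itself mean the cluster was imported, because ACP-built clusters carry these too. Use the infrastructureRef / controlPlaneRef item above to tell the two apart.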
Diagnosis
Run the two checks below on the global cluster to confirm the pattern.
Confirm the capi.cpaas.io/imported finalizer is still present:
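A minimal sketch of this check, assuming the Cluster CR is served as the Cluster API resource clusters.cluster.x-k8s.io. The function names are hypothetical helpers, and the cluster name and namespace are placeholders you supply as arguments.

```shell
# Print the Cluster CR's finalizers, one per line.
# $1 = cluster name, $2 = namespace (both placeholders).
# Assumption: the Cluster CR is served as clusters.cluster.x-k8s.io.
list_cluster_finalizers() {
  kubectl get clusters.cluster.x-k8s.io "$1" -n "$2" \
    -o jsonpath='{range .metadata.finalizers[*]}{@}{"\n"}{end}'
}

# Succeed only if capi.cpaas.io/imported is one of the finalizers.
# -x matches the whole line, -F treats the pattern as a fixed string.
has_imported_finalizer() {
  list_cluster_finalizers "$1" "$2" | grep -qxF 'capi.cpaas.io/imported'
}
```

For example, `has_imported_finalizer my-cluster my-namespace && echo "finalizer present"` prints the message only when the finalizer is still on the CR.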
Confirm the cluster is ACP-built. An ACP-built cluster has both spec.infrastructureRef and spec.controlPlaneRef set; a Third-party Cluster (onboarded by import) has neither:
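One way to read both references in a single call is sketched here. It assumes the Cluster CR is served as clusters.cluster.x-k8s.io; `show_cluster_refs` is a hypothetical helper name, and the cluster name and namespace are placeholders.

```shell
# Print the kinds of spec.infrastructureRef and spec.controlPlaneRef on one line.
# $1 = cluster name, $2 = namespace (both placeholders).
# Assumption: the Cluster CR is served as clusters.cluster.x-k8s.io.
# A field that is not set prints as an empty value.
show_cluster_refs() {
  kubectl get clusters.cluster.x-k8s.io "$1" -n "$2" \
    -o jsonpath='infrastructureRef={.spec.infrastructureRef.kind} controlPlaneRef={.spec.controlPlaneRef.kind}{"\n"}'
}
```

Call it as `show_cluster_refs my-cluster my-namespace` on the global cluster.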
If both values are non-empty (for example infrastructureRef=DCSCluster controlPlaneRef=KubeadmControlPlane), the cluster is ACP-built and the workaround on this page applies. If both are empty, the cluster was onboarded as a Third-party Cluster; this page does not apply and the workaround below must not be used.
Why This Happens
The capi.cpaas.io/imported finalizer on a Cluster CR is managed by a platform controller on the global cluster. For a cluster onboarded as a Third-party Cluster, that finalizer is released as part of the platform-side cleanup when the cluster is removed.
An ACP-built cluster carries the same finalizer. When an ACP-built cluster is deleted, its child resources and IaaS VMs are removed correctly, but the capi.cpaas.io/imported finalizer on the top-level Cluster CR is not released — so the CR stays in Deleting.
This is a platform-level behavior, not a fault in any IaaS provider, and it does not affect imported (Third-party) clusters. The exact handling is under investigation by the platform team.
Workaround
Before clearing the finalizer, verify that nothing depends on the Cluster CR's continued presence — every child resource and every IaaS VM must already be gone.
Step 1 — Confirm all provider-specific Machine CRs in the namespace are gone:
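The exact resource names depend on the provider, so the sketch below discovers them instead of hard-coding them. It assumes that every Machine resource type has a plural name ending in "machines" (for example machines.cluster.x-k8s.io plus a provider-specific type); `list_machine_resources` is a hypothetical helper and the namespace is a placeholder.

```shell
# Query every namespaced resource type whose plural name ends in "machines"
# and list its objects in the given namespace.
# $1 = namespace (placeholder).
# Assumption: provider-specific Machine CRDs follow the "<prefix>machines.<group>" naming.
list_machine_resources() {
  ns="$1"
  for res in $(kubectl api-resources --namespaced=true -o name | grep -E 'machines\.'); do
    echo "== $res =="
    kubectl get "$res" -n "$ns"
  done
}
```

Run it as `list_machine_resources my-namespace`. If the namespace holds more than one cluster, check the listed objects against the name of the cluster being deleted.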
Each command must return no resources for this cluster.
Step 2 — Confirm the cluster's VMs are gone on the IaaS platform. Use the provider-specific verification method:
- DCS: list VMs through the DCS portal or the DCS API and verify that none of the cluster's VMs remain. See Creating Clusters on Huawei DCS for the relevant cluster fields.
- HCS: list VMs in the HCS portal and verify that none remain.
- vSphere: list VMs in vCenter under the cluster's folder and verify that none remain.
- Bare-metal: confirm machine deprovisioning has completed in the bare-metal inventory.
Step 3 — Only after Step 1 and Step 2 both confirm zero remaining resources, clear the Cluster CR's finalizers:
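A sketch of the clearing step, using a JSON merge patch that sets metadata.finalizers to null. It assumes the Cluster CR is served as clusters.cluster.x-k8s.io; `clear_cluster_finalizers` is a hypothetical helper, and the cluster name and namespace are placeholders.

```shell
# Remove all finalizers from a Cluster CR with a JSON merge patch.
# Destructive: run only after Step 1 and Step 2 confirm nothing remains.
# $1 = cluster name, $2 = namespace (both placeholders).
# Assumption: the Cluster CR is served as clusters.cluster.x-k8s.io.
clear_cluster_finalizers() {
  kubectl patch clusters.cluster.x-k8s.io "$1" -n "$2" \
    --type=merge -p '{"metadata":{"finalizers":null}}'
}
```

Call it as `clear_cluster_finalizers my-cluster my-namespace` on the global cluster. The merge patch is used here because it removes the whole finalizers list in one request, without needing to know the list index of each entry.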
The Cluster CR disappears immediately. Because every child resource and every IaaS VM was already gone before this step, removing the top-level finalizer cannot leave behind an orphan — it only clears a marker that no longer has any reconciler willing to release it.
Why Not Just Wait?
The finalizer is not released on a timer, and for an ACP-built cluster it is not released by the normal platform cleanup path. Waiting does not change the state: clusters have been observed in Deleting for nine hours or more without progress. Apply the workaround once Step 1 and Step 2 are confirmed.
Permanent Fix
The fix belongs in the global-cluster control plane: the capi.cpaas.io/imported finalizer must be released for an ACP-built cluster once its child resources are gone. This is tracked outside this documentation set; no release date is committed here. Until the fix lands, the workaround above is the supported recovery path.
See Also
- Troubleshoot a DCSMachine Stuck in Deleting — node-level deletion stalls; different scope.
- Troubleshoot a Workload Cluster Stuck in Provisioned — covers true Third-party Cluster bring-up issues; different scope.