Troubleshoot a Cluster Stuck in Deleting

Scope

This page covers the case where a workload cluster built by ACP is being deleted and every child resource (the provider-specific Machine CRs and the IaaS VMs they represent) is already gone, but the top-level Cluster CR remains in the Deleting phase indefinitely.

This page applies to any ACP-built workload cluster, regardless of the underlying IaaS:

  • Huawei DCS (DCS Provider)
  • Huawei Cloud Stack (HCS Provider)
  • VMware vSphere (vSphere Provider)
  • Bare-metal on Immutable Infrastructure (baremetal Provider)
  • Clusters created through the SSH UPI path managed by cluster-manager

The behavior was observed directly on DCS. The other providers share the same global-cluster control-plane path, so the same pattern applies to them.

Not in scope:

Symptoms

The Cluster CR shows the pattern below.

Each entry names where to look, followed by what you see:

  • Cluster.status.phase on the global cluster: Deleting, persisting far longer than expected, typically from one hour to many hours.
  • Cluster.metadata.deletionTimestamp: Set.
  • Cluster.metadata.finalizers: Still contains capi.cpaas.io/imported.
  • Cluster.spec.infrastructureRef and Cluster.spec.controlPlaneRef: Both are set (for example infrastructureRef.kind: DCSCluster, controlPlaneRef.kind: KubeadmControlPlane). This marks an ACP-built cluster; a Third-party Cluster has neither reference.
  • Provider-specific Machine CRs in the same namespace (DCSMachine / HCSMachine / VSphereMachine / etc.), plus the matching infrastructure Cluster, KubeadmControlPlane, MachineDeployment, and Machine resources: None remain; cascade deletion has already completed.
  • The IaaS platform itself (DCS portal, vCenter, HCS portal, bare-metal inventory): None of the cluster's VMs remain.
  • Provider controller logs (for example cluster-api-provider-dcs-manager): Show successful machine deletion entries such as removed DCSMachine finalizer.

The capi.cpaas.io/imported finalizer (and a capi.cpaas.io/alauda-cluster: imported label, if present) does not by itself mean the cluster was imported, because ACP-built clusters carry these too. Use the infrastructureRef / controlPlaneRef check above to tell the two apart.

Diagnosis

Run the two checks below on the global cluster to confirm the pattern.

Confirm the capi.cpaas.io/imported finalizer is still present:

kubectl get cluster <name> -n <ns> -o yaml | grep -A 2 finalizers
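
As an alternative to grep, the finalizer list can be read with jsonpath and tested exactly. The following is a minimal sketch; the helper name has_imported_finalizer is hypothetical and it only inspects the string it is given.

```shell
#!/bin/sh
# Hypothetical helper: prints "present" if the space-separated finalizer
# list passed as $1 contains capi.cpaas.io/imported, otherwise "absent".
has_imported_finalizer() {
  for f in $1; do
    if [ "$f" = "capi.cpaas.io/imported" ]; then
      echo "present"
      return 0
    fi
  done
  echo "absent"
  return 1
}

# Usage on the global cluster (replace <name> and <ns>):
#   finalizers=$(kubectl get cluster <name> -n <ns> \
#     -o jsonpath='{.metadata.finalizers[*]}')
#   has_imported_finalizer "$finalizers"
```

The jsonpath expression `{.metadata.finalizers[*]}` prints the finalizers separated by spaces, which is why the helper splits its argument on whitespace.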

Confirm the cluster is ACP-built. An ACP-built cluster has both spec.infrastructureRef and spec.controlPlaneRef set; a Third-party Cluster (onboarded by import) has neither:

kubectl get cluster <name> -n <ns> \
  -o jsonpath='infrastructureRef={.spec.infrastructureRef.kind} controlPlaneRef={.spec.controlPlaneRef.kind}{"\n"}'

If both values are non-empty (for example infrastructureRef=DCSCluster controlPlaneRef=KubeadmControlPlane), the cluster is ACP-built and the workaround on this page applies. If both are empty, the cluster was onboarded as a Third-party Cluster; this page does not apply and the workaround below must not be used.
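
The decision rule above can be sketched as a small helper. The function name classify_cluster and the "inconclusive" outcome (only one reference set, a case this page does not describe) are assumptions added for illustration.

```shell
#!/bin/sh
# Hypothetical helper: classify a Cluster CR from its two reference kinds.
# $1 = spec.infrastructureRef.kind, $2 = spec.controlPlaneRef.kind
# (either may be an empty string).
classify_cluster() {
  if [ -n "$1" ] && [ -n "$2" ]; then
    echo "acp-built"      # both set: the workaround on this page applies
  elif [ -z "$1" ] && [ -z "$2" ]; then
    echo "third-party"    # neither set: do not use the workaround
  else
    echo "inconclusive"   # only one set: not covered here, stop and investigate
  fi
}

# Usage on the global cluster (replace <name> and <ns>):
#   infra=$(kubectl get cluster <name> -n <ns> \
#     -o jsonpath='{.spec.infrastructureRef.kind}')
#   cp=$(kubectl get cluster <name> -n <ns> \
#     -o jsonpath='{.spec.controlPlaneRef.kind}')
#   classify_cluster "$infra" "$cp"
```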

Why This Happens

The capi.cpaas.io/imported finalizer on a Cluster CR is managed by a platform controller on the global cluster. For a cluster onboarded as a Third-party Cluster, that finalizer is released as part of the platform-side cleanup when the cluster is removed.

An ACP-built cluster carries the same finalizer. When an ACP-built cluster is deleted, its child resources and IaaS VMs are removed correctly, but the capi.cpaas.io/imported finalizer on the top-level Cluster CR is not released — so the CR stays in Deleting.

This is a platform-level behavior, not a fault in any IaaS provider, and it does not affect imported (Third-party) clusters. The exact handling is under investigation by the platform team.

Workaround

Before clearing the finalizer, verify that nothing depends on the Cluster CR's continued presence — every child resource and every IaaS VM must already be gone.

Step 1 — Confirm all provider-specific Machine CRs in the namespace are gone:

# Use the resource name for your provider (dcsmachines, hcsmachines, vspheremachines, ...).
# List only resource types that are installed; kubectl rejects the whole command if any listed type is unknown.
kubectl get dcsmachines -n <ns>

# Also confirm the higher-level CAPI resources are gone
# (replace dcsclusters with your provider's infrastructure cluster resource):
kubectl get machines,machinedeployments,kubeadmcontrolplanes,dcsclusters -n <ns>

Each command must return no resources for this cluster.
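
To make the "no resources" check in Step 1 scriptable, the output of kubectl get with -o name can be counted. This is a minimal sketch; the helper name count_remaining is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads `kubectl get ... -o name` output on stdin and
# prints the number of resources listed (the count of non-empty lines).
count_remaining() {
  # grep -c exits non-zero when the count is 0, so mask that status.
  grep -c . || true
}

# Usage on the global cluster (replace <ns>, and dcsmachines with your
# provider's resource):
#   kubectl get dcsmachines -n <ns> -o name | count_remaining
#
# If the namespace holds more than one cluster, a label selector can narrow
# the result. Whether these resources carry the upstream Cluster API label
# cluster.x-k8s.io/cluster-name is an assumption not confirmed by this page:
#   kubectl get machines -n <ns> -o name \
#     -l cluster.x-k8s.io/cluster-name=<name> | count_remaining
```

A result of 0 for every resource type is the condition for moving on to Step 2.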

Step 2 — Confirm the cluster's VMs are gone on the IaaS platform. Use the provider-specific verification method:

  • DCS: list VMs through the DCS portal or the DCS API and verify that none of the cluster's VMs remain. See Creating Clusters on Huawei DCS for the relevant cluster fields.
  • HCS: list VMs in the HCS portal and verify that none remain.
  • vSphere: list VMs in vCenter under the cluster's folder and verify that none remain.
  • Bare-metal: confirm machine deprovisioning has completed in the bare-metal inventory.

Step 3 — Only after Step 1 and Step 2 both confirm zero remaining resources, clear the Cluster CR's finalizers:

kubectl patch cluster <name> -n <ns> --type=merge \
  -p '{"metadata":{"finalizers":null}}'

The Cluster CR disappears immediately. Because every child resource and every IaaS VM was already gone before this step, removing the top-level finalizer cannot leave an orphan behind. It only clears a marker that no reconciler will release.
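
Note that the merge patch in Step 3 sets metadata.finalizers to null, which removes every finalizer on the Cluster CR, not only capi.cpaas.io/imported. A cautious operator may want to confirm beforehand that it is the only one left. The guard below is a sketch under that assumption; the helper name finalizer_guard is hypothetical.

```shell
#!/bin/sh
# Hypothetical guard: $1 is the space-separated finalizer list of the
# Cluster CR. Prints "ok-to-clear" only when capi.cpaas.io/imported is the
# single remaining finalizer; prints "stop" in every other case.
finalizer_guard() {
  if [ "$1" = "capi.cpaas.io/imported" ]; then
    echo "ok-to-clear"
  else
    echo "stop"
  fi
}

# Usage on the global cluster (replace <name> and <ns>), after Step 1 and
# Step 2 have both confirmed zero remaining resources:
#   finalizers=$(kubectl get cluster <name> -n <ns> \
#     -o jsonpath='{.metadata.finalizers[*]}')
#   if [ "$(finalizer_guard "$finalizers")" = "ok-to-clear" ]; then
#     kubectl patch cluster <name> -n <ns> --type=merge \
#       -p '{"metadata":{"finalizers":null}}'
#   fi
```

If the guard prints stop, another finalizer is still present and something else may still be reconciling the Cluster CR, so investigate before patching.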

Why Not Just Wait?

The finalizer is not released on a timer, and for an ACP-built cluster it is not released by the normal platform cleanup path. Waiting does not change the state: clusters have been observed to stay in Deleting for nine hours or more without progress. Apply the workaround once Step 1 and Step 2 are confirmed.
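
To see how long a deletion has already been pending, the deletionTimestamp can be compared with the current time. This is a rough sketch that assumes GNU date (the -d option) on the workstation; the helper name deleting_hours is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: prints the whole hours elapsed between an RFC 3339
# timestamp ($1, e.g. metadata.deletionTimestamp) and a Unix epoch ($2).
# Assumes GNU date, which accepts -d with an ISO 8601 / RFC 3339 string.
deleting_hours() {
  start=$(date -u -d "$1" +%s) || return 1
  echo $(( ($2 - start) / 3600 ))
}

# Usage on the global cluster (replace <name> and <ns>):
#   ts=$(kubectl get cluster <name> -n <ns> \
#     -o jsonpath='{.metadata.deletionTimestamp}')
#   deleting_hours "$ts" "$(date -u +%s)"
```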

Permanent Fix

The fix belongs in the global-cluster control plane: the capi.cpaas.io/imported finalizer must be released for an ACP-built cluster once its child resources are gone. This is tracked outside this documentation set; no release date is committed here. Until the fix lands, the workaround above is the supported recovery path.

See Also