Troubleshoot a Cluster Stuck in Deleting

Scope

This page covers the case where a workload cluster that was built by ACP is being deleted, every child resource (provider-specific Machine CRs and the IaaS VMs they represent) is already gone, but the top-level Cluster CR remains in Phase: Deleting indefinitely.

This page applies to any ACP-built workload cluster, regardless of the underlying IaaS:

Huawei DCS (DCS Provider)
Huawei Cloud Stack (HCS Provider)
VMware vSphere (vSphere Provider)
Bare-metal on Immutable Infrastructure (baremetal Provider)
Clusters created through the SSH UPI path managed by cluster-manager

The behavior was observed directly on DCS. The other providers share the same global-cluster control-plane path, so the same pattern is expected on them.

Not in scope:

A provider-specific Machine CR (such as DCSMachine) stuck in Deleting while the IaaS VM still exists — see Troubleshoot a DCSMachine Stuck in Deleting.
A workload cluster that was imported (onboarded as a Third-party Cluster) and never reached the ready state — see Troubleshoot a Workload Cluster Stuck in Provisioned.

Symptoms

The Cluster CR shows the pattern below.

Where to look	What you see
`Cluster.status.phase` on the `global` cluster	`Deleting`, persisting far longer than expected — typically from one hour to many hours.
`Cluster.metadata.deletionTimestamp`	Set.
`Cluster.metadata.finalizers`	Still contains `capi.cpaas.io/imported`.
`Cluster.spec.infrastructureRef` and `Cluster.spec.controlPlaneRef`	Both are set (for example `infrastructureRef.kind: DCSCluster`, `controlPlaneRef.kind: KubeadmControlPlane`). This marks an ACP-built cluster — a Third-party Cluster has neither reference.
Provider-specific `Machine` CRs in the same namespace (`DCSMachine` / `HCSMachine` / `VSphereMachine` / etc.), plus the matching infrastructure cluster CR (for example `DCSCluster`), `KubeadmControlPlane`, `MachineDeployment`, and `Machine` resources	None remain; cascade deletion has already completed.
The IaaS platform itself (DCS portal, vCenter, HCS portal, bare-metal inventory)	None of the cluster's VMs remain.
Provider controller logs (for example `cluster-api-provider-dcs-manager`)	Show successful machine deletion entries such as `removed DCSMachine finalizer`.

The capi.cpaas.io/imported finalizer (and a capi.cpaas.io/alauda-cluster: imported label, if present) does not by itself mean the cluster was imported — ACP-built clusters carry these too. Use the infrastructureRef / controlPlaneRef row above to tell the two apart.

Diagnosis

Run the two checks below on the global cluster to confirm the pattern.

Confirm the capi.cpaas.io/imported finalizer is still present:

kubectl get cluster <name> -n <ns> -o yaml | grep -A 2 finalizers

Confirm the cluster is ACP-built. An ACP-built cluster has both spec.infrastructureRef and spec.controlPlaneRef set; a Third-party Cluster (onboarded by import) has neither:

kubectl get cluster <name> -n <ns> \
  -o jsonpath='infrastructureRef={.spec.infrastructureRef.kind} controlPlaneRef={.spec.controlPlaneRef.kind}{"\n"}'

If both values are non-empty (for example infrastructureRef=DCSCluster controlPlaneRef=KubeadmControlPlane), the cluster is ACP-built and the workaround on this page applies. If both are empty, the cluster was onboarded as a Third-party Cluster; this page does not apply and the workaround below must not be used.
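The two checks can be combined into one read-only helper. This is a minimal sketch, not a platform tool: the function names `classify_cluster` and `check_stuck_cluster` are hypothetical, and the "inconclusive" result covers a case (only one reference set) that this page does not describe.

```shell
# Hypothetical helper: classify a Cluster CR from its two reference kinds.
classify_cluster() {
  local infra_kind="$1" cp_kind="$2"
  if [ -n "$infra_kind" ] && [ -n "$cp_kind" ]; then
    echo "acp-built"        # both references set
  elif [ -z "$infra_kind" ] && [ -z "$cp_kind" ]; then
    echo "third-party"      # neither reference set
  else
    echo "inconclusive"     # only one set; not covered by this page
  fi
}

# Hypothetical wrapper: read the fields with kubectl and succeed only when the
# pattern on this page matches (deletion requested, finalizer present, ACP-built).
check_stuck_cluster() {
  local name="$1" ns="$2"
  local deleted finalizers infra cp kind
  deleted=$(kubectl get cluster "$name" -n "$ns" -o jsonpath='{.metadata.deletionTimestamp}') || return 2
  finalizers=$(kubectl get cluster "$name" -n "$ns" -o jsonpath='{.metadata.finalizers[*]}') || return 2
  infra=$(kubectl get cluster "$name" -n "$ns" -o jsonpath='{.spec.infrastructureRef.kind}') || return 2
  cp=$(kubectl get cluster "$name" -n "$ns" -o jsonpath='{.spec.controlPlaneRef.kind}') || return 2
  kind=$(classify_cluster "$infra" "$cp")
  echo "deletionTimestamp=${deleted:-<unset>} finalizers=${finalizers:-<none>} type=$kind"
  # The finalizer list is space-separated; match the exact entry.
  case " $finalizers " in
    *" capi.cpaas.io/imported "*) ;;
    *) return 1 ;;
  esac
  [ -n "$deleted" ] && [ "$kind" = "acp-built" ]
}
```

Run it on the global cluster as `check_stuck_cluster <name> <ns>`. An exit status of 0 means the pattern matches, 1 means it does not, and 2 means a kubectl call failed.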

Why This Happens

The capi.cpaas.io/imported finalizer on a Cluster CR is managed by a platform controller on the global cluster. For a cluster onboarded as a Third-party Cluster, that finalizer is released as part of the platform-side cleanup when the cluster is removed.

An ACP-built cluster carries the same finalizer. When an ACP-built cluster is deleted, its child resources and IaaS VMs are removed correctly, but the capi.cpaas.io/imported finalizer on the top-level Cluster CR is not released — so the CR stays in Deleting.

This is a platform-level behavior, not a fault in any IaaS provider, and it does not affect imported (Third-party) clusters. The exact handling is under investigation by the platform team.

Workaround

Before clearing the finalizer, verify that nothing depends on the Cluster CR's continued presence — every child resource and every IaaS VM must already be gone.

Step 1 — Confirm all provider-specific Machine CRs in the namespace are gone:

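# List only the resource types for the providers installed on your platform
# (dcsmachines, hcsmachines, vspheremachines, ...). kubectl fails on a type whose CRD is not installed.
kubectl get dcsmachines,hcsmachines,vspheremachines -n <ns>

# Also confirm the higher-level CAPI resources are gone.
# Again, keep only the infrastructure cluster types that exist on your platform.
kubectl get machines,machinedeployments,kubeadmcontrolplanes,dcsclusters,hcsclusters,vsphereclusters -n <ns>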

Each command must return no resources for this cluster.
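For repeated checks, the Step 1 commands can be wrapped in a small function. The sketch below is illustrative: `leftover_resources` is a hypothetical name, and the resource types are passed as arguments so that only types installed on your platform are queried. It lists everything of those types in the namespace, so review the output if the namespace holds more than one cluster.

```shell
# Hypothetical helper: print any remaining resources of the given types.
# Exit status: 0 = nothing remains, 1 = leftovers found, 2 = a query failed.
leftover_resources() {
  local ns="$1"; shift
  local rtype out found=0
  for rtype in "$@"; do
    # Capture stdout only; kubectl's own error text still reaches the terminal.
    if ! out=$(kubectl get "$rtype" -n "$ns" -o name); then
      echo "could not query $rtype; treat the check as failed" >&2
      return 2
    fi
    if [ -n "$out" ]; then
      echo "$out"
      found=1
    fi
  done
  return "$found"
}
```

Example for a DCS cluster: `leftover_resources <ns> dcsmachines machines machinedeployments kubeadmcontrolplanes dcsclusters`. A failed query is reported as an error instead of being treated as "nothing found", so a mistyped or uninstalled resource type cannot produce a false all-clear.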

Step 2 — Confirm the cluster's VMs are gone on the IaaS platform. Use the provider-specific verification method:

DCS: list VMs through the DCS portal or the DCS API and verify that none of the cluster's VMs remain. See Creating Clusters on Huawei DCS for the relevant cluster fields.
HCS: list VMs in the HCS portal and verify that none remain.
vSphere: list VMs in vCenter under the cluster's folder and verify that none remain.
Bare-metal: confirm machine deprovisioning has completed in the bare-metal inventory.

Step 3 — Only after Step 1 and Step 2 both confirm zero remaining resources, clear the Cluster CR's finalizers:

kubectl patch cluster <name> -n <ns> --type=merge \
  -p '{"metadata":{"finalizers":null}}'

The patch sets the finalizer list to null, which removes every finalizer on the Cluster CR, so the API server completes the pending deletion and the CR disappears immediately. Because every child resource and every IaaS VM was already gone before this step, removing the finalizer cannot leave behind an orphan; it only clears a marker that no reconciler will release.
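To reduce the chance of patching the wrong object, the Step 3 command can be wrapped in a guard. This is a sketch under stated assumptions: `clear_cluster_finalizers` is a hypothetical name, and the guard only re-checks fields on the Cluster CR itself. It does not replace Step 1 and Step 2, which still have to be confirmed separately.

```shell
# Hypothetical guard around the Step 3 patch. Refuses to patch unless the
# Cluster CR is already marked for deletion and both references are set.
clear_cluster_finalizers() {
  local name="$1" ns="$2" state deleted infra cp
  state=$(kubectl get cluster "$name" -n "$ns" \
    -o jsonpath='{.metadata.deletionTimestamp}|{.spec.infrastructureRef.kind}|{.spec.controlPlaneRef.kind}') || return 2
  # Split the three '|'-separated fields; empty fields stay empty.
  IFS='|' read -r deleted infra cp <<EOF
$state
EOF
  if [ -z "$deleted" ]; then
    echo "refusing: deletionTimestamp is not set" >&2
    return 1
  fi
  if [ -z "$infra" ] || [ -z "$cp" ]; then
    echo "refusing: infrastructureRef/controlPlaneRef not both set (not ACP-built)" >&2
    return 1
  fi
  # Same patch as Step 3.
  kubectl patch cluster "$name" -n "$ns" --type=merge \
    -p '{"metadata":{"finalizers":null}}'
}
```

Call it as `clear_cluster_finalizers <name> <ns>`. The refusal on empty references mirrors the rule in Diagnosis that the workaround must not be used on a Third-party Cluster.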

Why Not Just Wait?

The finalizer is not released on a timer, and for an ACP-built cluster it is not released by the normal platform cleanup path. Waiting does not change the state: clusters have been observed in Deleting for nine hours or more without progress. Apply the workaround once Step 1 and Step 2 are confirmed.
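To see how long a cluster has already been waiting, the deletionTimestamp can be converted to elapsed hours. The helper below is a small sketch: `hours_since` is a hypothetical name, and it assumes GNU date (the `-d` option), which is not available in the same form on BSD or macOS.

```shell
# Hypothetical helper: whole hours elapsed since an RFC 3339 timestamp.
# The optional second argument is "now" as epoch seconds (defaults to the current time).
hours_since() {
  local ts="$1" now="${2:-$(date -u +%s)}" start
  start=$(date -u -d "$ts" +%s) || return 1   # GNU date parses the timestamp
  echo $(( (now - start) / 3600 ))
}
```

Example: `hours_since "$(kubectl get cluster <name> -n <ns> -o jsonpath='{.metadata.deletionTimestamp}')"`.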

Permanent Fix

The fix belongs in the global-cluster control plane: the capi.cpaas.io/imported finalizer must be released for an ACP-built cluster once its child resources are gone. This is tracked outside this documentation set; no release date is committed here. Until the fix lands, the workaround above is the supported recovery path.
