Troubleshoot a Workload Cluster Stuck in Provisioned
Use this guide when a workload cluster on Immutable Infrastructure has reached Cluster.status.phase = Provisioned and the Kubernetes control plane reports ready, but the cluster never becomes fully usable.
The diagnostic flow assumes the workload cluster was created by applying provider Cluster API manifests to the global cluster.
Scope
The diagnostic flow in this guide targets the global cluster's import controller, which is shared across all Immutable Infrastructure providers. The four invariants in Invariants of a Healthy Imported Cluster — clusters.platform.tkestack.io presence, the capi.cpaas.io/imported label, the ClusterCredential count, and the workload-cluster sentry ServiceAccount — apply to every provider.
The log strings in this guide use placeholders for reconciler object names because each infrastructure provider uses its own object types (for example, the provider's own Cluster and Machine infrastructure CRDs). When you read your provider's logs, replace the placeholders with the real object names that appear in your provider's reconciler output. The diagnostic pattern itself does not depend on the provider; only the object names in the log strings do.
If your environment is Huawei Cloud Stack and the symptoms below also include a stalled kubeadm init on a node whose pool config carries a dotted hostname, see Troubleshoot HCS Workload Clusters for the provider-specific pattern after you finish the generic diagnostic flow on this page.
Symptoms
The cluster reaches a partial-success state and stays there: Cluster.status.phase is Provisioned and the control plane reports ready, but the workload nodes stay NotReady and the cluster never becomes fully usable.
A typical secondary signature appears in the infrastructure provider reconciler logs on the global cluster: repeated ServiceAccount "sentry" not found errors paired with a failure to reconcile the CNI AppRelease object.
<CNI-AppRelease-object> and <Infra-Cluster-object> are placeholders for the actual object types your provider uses. For example, the CNI AppRelease object is typically named after the CNI being deployed, and the infrastructure cluster object is an instance of your provider's own *Cluster CRD.
The reconciler tries to apply the CNI through an AppRelease resource that is executed by the workload cluster's own sentry ServiceAccount. If that ServiceAccount is not present, the CNI is never deployed and the workload nodes stay NotReady.
Invariants of a Healthy Imported Cluster
Once a workload cluster reaches a healthy steady state, all of the following must hold. Verify each invariant from the global cluster context unless noted otherwise.
- A clusters.platform.tkestack.io/<cluster-name> object exists.
- The Cluster API Cluster resource carries the label capi.cpaas.io/imported: "". The label value is the empty string "" on a healthy import, so a jsonpath query alone cannot distinguish "label missing" from "label value is empty". Use jq to check label presence explicitly. Expected output: present (value=""). An output of missing means the import has not completed.
- Exactly one clustercredentials.platform.tkestack.io object is labelled for the cluster. Listing these objects should print exactly one name.
- Inside the workload cluster, ServiceAccount sentry exists in cpaas-system.
If any invariant fails, the workload cluster has not been imported by the global cluster's import controller, and the symptom pattern above is expected.
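If the label check is scripted rather than run through jq, the same distinction between a missing label and an empty value can be made on the Cluster object's JSON. The following is a minimal Python sketch, assuming the object was exported as JSON (for example with kubectl get ... -o json). The function name label_state and the sample objects are hypothetical and only illustrate the check.

```python
import json

IMPORTED_LABEL = "capi.cpaas.io/imported"


def label_state(cluster_obj: dict, key: str = IMPORTED_LABEL) -> str:
    """Report whether a label key exists, independent of its value."""
    labels = cluster_obj.get("metadata", {}).get("labels") or {}
    if key in labels:
        # An empty string is still "present"; this is the healthy case.
        return f'present (value="{labels[key]}")'
    return "missing"


# Hypothetical sample objects standing in for exported Cluster JSON.
imported = json.loads('{"metadata": {"labels": {"capi.cpaas.io/imported": ""}}}')
not_imported = json.loads('{"metadata": {"labels": {"other": "x"}}}')

print(label_state(imported))      # present (value="")
print(label_state(not_imported))  # missing
```

Testing for the key itself, rather than comparing the value to an empty string, is what avoids the ambiguity described for jsonpath.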
Diagnostic Steps
Step 1 — Confirm the Cluster API surface is healthy
A failure here is unrelated to the import pattern; investigate the infrastructure provider directly.
Step 2 — Check Machine and Node conditions
If phase=Running but the NodeHealthy condition is False, the underlying workload node is NotReady. Continue to Step 3.
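The Step 2 decision can be sketched in Python against a Machine object exported as JSON. This is an illustration under assumptions: it expects the usual Cluster API layout of status.phase plus a status.conditions list with type and status fields, and the function name node_health and the sample object are hypothetical.

```python
def node_health(machine: dict) -> str:
    """Classify a Machine by its phase and NodeHealthy condition."""
    status = machine.get("status", {})
    phase = status.get("phase")
    # Map condition type to its status string ("True", "False", "Unknown").
    conditions = {
        c.get("type"): c.get("status") for c in status.get("conditions", [])
    }
    node_healthy = conditions.get("NodeHealthy", "Unknown")

    if phase == "Running" and node_healthy == "False":
        return "node-not-ready"  # continue to Step 3
    if phase == "Running" and node_healthy == "True":
        return "healthy"
    return "inconclusive"


# Hypothetical Machine object matching the pattern described in Step 2.
sample_machine = {
    "status": {
        "phase": "Running",
        "conditions": [{"type": "NodeHealthy", "status": "False"}],
    }
}

print(node_health(sample_machine))  # node-not-ready
```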
Step 3 — Inspect the infrastructure provider reconciler logs
If you see repeated ServiceAccount "sentry" not found errors paired with a failure to reconcile the CNI AppRelease object, the workload cluster's CNI is not being deployed because the import flow has not completed. Continue to Step 4.
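To judge whether the error is repeated rather than a one-off, the matching lines in a saved log file can be counted. The sketch below is a minimal Python example; it assumes the error text appears exactly as quoted in this guide, and the sample lines are made-up placeholders, not real reconciler output. Adjust the search string if your provider words the error differently.

```python
SENTRY_ERROR = 'ServiceAccount "sentry" not found'


def count_sentry_errors(log_text: str) -> int:
    """Count log lines that contain the sentry ServiceAccount error."""
    return sum(1 for line in log_text.splitlines() if SENTRY_ERROR in line)


# Placeholder lines only; real reconciler output will look different.
sample_log = "\n".join(
    [
        'placeholder line 1: ServiceAccount "sentry" not found',
        "placeholder line 2: unrelated message",
        'placeholder line 3: ServiceAccount "sentry" not found',
    ]
)

print(count_sentry_errors(sample_log))  # 2
```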
Step 4 — Verify the global cluster has imported the workload cluster
Run the invariant checks from the previous section. The combined diagnostic signature for a failed import is:
- clusters.platform.tkestack.io/<cluster-name> returns NotFound.
- Cluster.metadata.labels does not include capi.cpaas.io/imported.
Step 5 — Verify the ClusterCredential count invariant
A healthy cluster prints 1. A value greater than 1 (for example, 8) indicates that the import flow tried and failed repeatedly, leaving orphan credentials behind.
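The count rule can be expressed as a small Python helper that works on the JSON list of ClusterCredential objects selected for the cluster. This is a sketch under assumptions: it expects the standard Kubernetes list shape with an items array, and the function name credential_verdict is hypothetical. The guide does not describe a count of zero beyond the "exactly one" invariant, so the sketch only reports it as not meeting that invariant.

```python
def credential_verdict(cred_list: dict) -> str:
    """Interpret the number of ClusterCredential objects for one cluster."""
    count = len(cred_list.get("items", []))
    if count == 1:
        return "healthy: exactly one credential"
    if count > 1:
        return f"failed import suspected: {count} credentials, orphans present"
    return "invariant not met: no credential found"


# Hypothetical list objects; item contents are irrelevant to the count.
print(credential_verdict({"items": [{}]}))
print(credential_verdict({"items": [{}] * 8}))
```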
Pattern That Indicates a Failed Import
The diagnostics indicate a failed import when all of the following are true at the same time:
- The Cluster API surface is healthy (Cluster.status.phase=Provisioned, KubeadmControlPlane.status.ready=true).
- Machine.condition.NodeHealthy=False and at least one workload node is NotReady.
- Infrastructure provider logs repeatedly report ServiceAccount "sentry" not found.
- clusters.platform.tkestack.io/<cluster-name> does not exist.
- The Cluster resource does not carry the capi.cpaas.io/imported label.
- More than one clustercredentials.platform.tkestack.io object is labelled for the cluster.
When this pattern holds, the workload cluster's initial import on the global cluster failed and the import controller does not currently retry from scratch.
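Because the pattern requires every signal at once, it can help to record the result of each check and evaluate them together. The following Python sketch shows one way to do that; the Signals class, its field names, and the helper functions are hypothetical and simply mirror the six conditions listed above.

```python
from dataclasses import dataclass, fields


@dataclass
class Signals:
    capi_surface_healthy: bool      # phase=Provisioned, control plane ready
    node_not_ready: bool            # NodeHealthy=False, node NotReady
    sentry_not_found_in_logs: bool  # repeated sentry ServiceAccount errors
    platform_cluster_missing: bool  # clusters.platform.tkestack.io object absent
    imported_label_missing: bool    # capi.cpaas.io/imported label absent
    multiple_credentials: bool      # more than one ClusterCredential


def matches_failed_import(s: Signals) -> bool:
    """True only when all six signals hold at the same time."""
    return all(getattr(s, f.name) for f in fields(s))


def unmet_signals(s: Signals) -> list[str]:
    """Names of the signals that do not hold, in declaration order."""
    return [f.name for f in fields(s) if not getattr(s, f.name)]


observed = Signals(True, True, True, True, True, False)
print(matches_failed_import(observed))  # False
print(unmet_signals(observed))          # ['multiple_credentials']
```

If any signal is unmet, the situation is not the failed-import pattern described here, and the unmet signal indicates where to look next.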
What Does Not Resolve This Pattern
Restarting the cluster-transformer pod on the global cluster does not re-import a workload cluster whose initial import already failed in this pattern. After a manual pod restart, no log entries reference the affected cluster, and the missing clusters.platform.tkestack.io entry is not created.
Do not assume the controller will eventually recover on its own when more than one orphan ClusterCredential is present for the same cluster.
Next Step
Engage platform support with the following information:
- The cluster name and namespace.
- The output of the diagnostic commands in Steps 1 through 5.
- The names and creation timestamps of every clustercredentials.platform.tkestack.io object that shares the same cluster.x-k8s.io/cluster-name label as the affected cluster.
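The credential names and creation timestamps can be gathered from the JSON list of ClusterCredential objects selected by the cluster.x-k8s.io/cluster-name label. The Python sketch below is an assumption-laden illustration: it relies on the standard metadata.name and metadata.creationTimestamp fields, and the function name and the sample object names are hypothetical.

```python
def credential_summary(cred_list: dict) -> list[tuple[str, str]]:
    """Return (creationTimestamp, name) pairs, oldest first."""
    rows = [
        (item["metadata"]["creationTimestamp"], item["metadata"]["name"])
        for item in cred_list.get("items", [])
    ]
    # RFC 3339 UTC timestamps sort chronologically as plain strings.
    return sorted(rows)


# Hypothetical objects; real names and timestamps will differ.
sample = {
    "items": [
        {"metadata": {"name": "cred-b", "creationTimestamp": "2024-05-02T10:00:00Z"}},
        {"metadata": {"name": "cred-a", "creationTimestamp": "2024-05-01T09:30:00Z"}},
    ]
}

for created, name in credential_summary(sample):
    print(created, name)
```

Sorting by creation time makes it easier for support to see the order in which repeated import attempts left credentials behind.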
Recovery currently requires deleting and recreating the workload cluster after cleaning up the orphan ClusterCredential entries. Do not perform this on a production cluster without platform-support guidance: the recreate path triggers deletion of the underlying virtual machines on the IaaS platform, and provider-specific details (for example, DCS IP-pool and hostname planning, or HCS HCSMachineConfigPool re-allocation) must be reviewed before the recreate is applied.
See Also
- Troubleshoot a Cluster Stuck in Deleting: covers ACP-built clusters that stall in Deleting rather than failing to import.
- Troubleshoot a DCSMachine Stuck in Deleting: node-level deletion stalls; different scope.