Troubleshoot a Workload Cluster Stuck in Provisioned

Use this guide when a workload cluster on Immutable Infrastructure has reached Cluster.status.phase = Provisioned and the Kubernetes control plane reports ready, but the cluster never becomes fully usable.

The diagnostic flow assumes the workload cluster was created by applying provider Cluster API manifests to the global cluster.

Scope Symptoms Invariants of a Healthy Imported Cluster Diagnostic Steps Step 1 — Confirm the Cluster API surface is healthy Step 2 — Check Machine and Node conditions Step 3 — Inspect the infrastructure provider reconciler logs Step 4 — Verify the global cluster has imported the workload cluster Step 5 — Verify the ClusterCredential count invariant Pattern That Indicates a Failed Import What Does Not Resolve This Pattern Next Step See Also

Scope

The diagnostic flow in this guide targets the global cluster's import controller, which is shared across all Immutable Infrastructure providers. The four invariants in Invariants of a Healthy Imported Cluster — clusters.platform.tkestack.io presence, the capi.cpaas.io/imported label, the ClusterCredential count, and the workload-cluster sentry ServiceAccount — apply to every provider.

The log strings in this guide use placeholders for reconciler object names because each infrastructure provider uses its own object types (for example, the provider's own Cluster and Machine infrastructure CRDs). When you read your provider's logs, substitute the placeholders for the real object names that appear in your provider's reconciler output. The diagnostic pattern itself does not depend on the provider — only the object names in the log strings do.

If your environment is Huawei Cloud Stack and the symptoms below also include a stalled kubeadm init on a node whose pool config carries a dotted hostname, see Troubleshoot HCS Workload Clusters for the provider-specific pattern after you finish the generic diagnostic flow on this page.

Symptoms

The cluster reaches a partial-success state and stays there. The following indicators appear together:

Where to look	What you see
`Cluster.status.phase` (on the `global` cluster)	`Provisioned`
`KubeadmControlPlane.status.ready`	`true`
`Machine.status.phase`	`Running`
`Machine.status.conditions[?(@.type=="NodeHealthy")]`	`status: False`, `reason: NodeConditionsFailed`, `message: Node condition Ready is False`
`MachineDeployment.status.conditions[?(@.type=="Available")]`	`status: False`, `reason: NotAvailable`, `message: WorkersAvailable: 0 available replicas`
Inside the workload cluster: `kubectl get nodes`	One or more nodes report `STATUS=NotReady` because the CNI is not running

A typical secondary signature appears in the infrastructure provider reconciler logs on the global cluster:

ERROR  failed to reconcile <CNI-AppRelease-object>: ServiceAccount "sentry" not found
ERROR  failed to reconcile <Infra-Cluster-object>: ServiceAccount "sentry" not found

<CNI-AppRelease-object> and <Infra-Cluster-object> are placeholders for the actual object types your provider uses. For example, the CNI AppRelease object is typically named after the CNI being deployed, and the infrastructure cluster object name matches your provider's CRD (your provider's own *Cluster CRD).

The reconciler tries to apply the CNI through an AppRelease resource that is executed by the workload cluster's own sentry ServiceAccount. If that ServiceAccount is not present, the CNI is never deployed and the workload nodes stay NotReady.

Invariants of a Healthy Imported Cluster

Once a workload cluster reaches a healthy steady state, all of the following must hold. Verify each invariant from the global cluster context unless noted otherwise.

A clusters.platform.tkestack.io/<cluster-name> object exists.

kubectl get clusters.platform.tkestack.io <cluster-name>

The Cluster API Cluster resource carries the label capi.cpaas.io/imported: "".

The label value is the empty string "" on a healthy import, so a jsonpath query alone cannot distinguish "label missing" from "label value is empty". Use jq to check label presence explicitly:
kubectl get cluster <cluster-name> -n <namespace> -o json | \ jq -r 'if (.metadata.labels | has("capi.cpaas.io/imported")) then "present (value=\"\(.metadata.labels["capi.cpaas.io/imported"])\")" else "missing" end'
Expected output: present (value=""). missing means the import has not completed.

Exactly one clustercredentials.platform.tkestack.io object is labelled for the cluster.

kubectl get clustercredentials.platform.tkestack.io -o json | \
  jq -r --arg c <cluster-name> \
    '.items[] | select(.metadata.labels["cluster.x-k8s.io/cluster-name"]==$c) | .metadata.name'

Exactly one name should print.

Inside the workload cluster, ServiceAccount sentry exists in cpaas-system.

kubectl --kubeconfig <workload-kubeconfig> -n cpaas-system get serviceaccount sentry

If any invariant fails, the workload cluster has not been imported by the global cluster's import controller, and the symptom pattern above is expected.

Diagnostic Steps

Step 1 — Confirm the Cluster API surface is healthy

kubectl get cluster <cluster-name> -n <namespace> -o wide
kubectl get kubeadmcontrolplane -n <namespace>
kubectl get machinedeployment -n <namespace>

A failure here is unrelated to the import pattern; investigate the infrastructure provider directly.

Step 2 — Check Machine and Node conditions

kubectl get machine -n <namespace> \
  -l cluster.x-k8s.io/cluster-name=<cluster-name> \
  -o jsonpath='{range .items[*]}{.metadata.name}{" phase="}{.status.phase}{" "}{range .status.conditions[*]}{.type}={.status} {end}{"\n"}{end}'

If phase=Running but the NodeHealthy condition is False, the underlying workload node is NotReady. Continue to Step 3.

Step 3 — Inspect the infrastructure provider reconciler logs

kubectl logs -n cpaas-system <infra-provider-manager-pod> --tail=200 | \
  grep -E 'ServiceAccount|sentry|<cluster-name>'

If you see repeated ServiceAccount "sentry" not found errors paired with a failure to reconcile the CNI AppRelease object, the workload cluster's CNI is not being deployed because the import flow has not completed. Continue to Step 4.

Step 4 — Verify the global cluster has imported the workload cluster

Run the invariant checks from the previous section. The combined diagnostic signature for a failed import is:

clusters.platform.tkestack.io/<cluster-name> returns NotFound.
Cluster.metadata.labels does not include capi.cpaas.io/imported.

Step 5 — Verify the ClusterCredential count invariant

kubectl get clustercredentials.platform.tkestack.io -o json | \
  jq -r --arg c <cluster-name> \
    '[.items[] | select(.metadata.labels["cluster.x-k8s.io/cluster-name"]==$c)] | length'

A healthy cluster prints 1. A value greater than 1 (for example, 8) indicates that the import flow tried and failed repeatedly, leaving orphan credentials behind.

Pattern That Indicates a Failed Import

The diagnostics indicate a failed import when all of the following are true at the same time:

The Cluster API surface is healthy (Cluster.status.phase=Provisioned, KubeadmControlPlane.status.ready=true).
Machine.condition.NodeHealthy=False and at least one workload node is NotReady.
Infrastructure provider logs repeatedly report ServiceAccount "sentry" not found.
clusters.platform.tkestack.io/<cluster-name> does not exist.
The Cluster resource does not carry the capi.cpaas.io/imported label.
More than one clustercredentials.platform.tkestack.io object is labelled for the cluster.

When this pattern holds, the workload cluster's initial import on the global cluster failed and the import controller does not currently retry from scratch.

What Does Not Resolve This Pattern

WARNING

Restarting the cluster-transformer pod on the global cluster does not re-import a workload cluster whose initial import already failed in this pattern. After a manual pod restart no log entries reference the affected cluster and the missing clusters.platform.tkestack.io entry is not created.

Do not assume the controller will eventually recover on its own when more than one orphan ClusterCredential is present for the same cluster.

Next Step

Engage platform support with the following information:

The cluster name and namespace.
The output of the diagnostic commands in Steps 1 through 5.
The names and creation timestamps of every clustercredentials.platform.tkestack.io object that shares the same cluster.x-k8s.io/cluster-name label as the affected cluster.

Recovery currently requires deleting and recreating the workload cluster after cleaning up the orphan ClusterCredential entries. Do not perform this on a production cluster without platform-support guidance: the recreate path triggers deletion of the underlying virtual machines on the IaaS platform, and provider-specific details (for example, DCS IP-pool and hostname planning, or HCS HCSMachineConfigPool re-allocation) must be reviewed before the recreate is applied.

#Troubleshoot a Workload Cluster Stuck in Provisioned

#TOC

#Scope

#Symptoms

#Invariants of a Healthy Imported Cluster

#Diagnostic Steps

#Step 1 — Confirm the Cluster API surface is healthy

#Step 2 — Check Machine and Node conditions

#Step 3 — Inspect the infrastructure provider reconciler logs

#Step 4 — Verify the global cluster has imported the workload cluster

#Step 5 — Verify the ClusterCredential count invariant

#Pattern That Indicates a Failed Import

#What Does Not Resolve This Pattern

#Next Step

#See Also