Overview
Kubernetes Engine's node auto-repair feature helps you keep the nodes in your cluster in a healthy, running state. When enabled, Kubernetes Engine makes periodic checks on the health state of each node in your cluster. If a node fails consecutive health checks over an extended time period, Kubernetes Engine initiates a repair process for that node. If you disable node auto-repair at any time during the repair process, the in-progress repairs are not cancelled and still complete for any node currently under repair.
Rationale
Kubernetes Engine uses the node's health status to determine whether a node needs to be repaired. A node reporting a Ready status is considered healthy. Kubernetes Engine triggers a repair action if a node reports an unhealthy status on consecutive checks over a given time threshold. An unhealthy status can mean:
- A node reports a NotReady status on consecutive checks over the given time threshold (approximately 10 minutes).
- A node does not report any status at all over the given time threshold (approximately 10 minutes).
- A node's boot disk is out of disk space for an extended time period (approximately 30 minutes).
You can enable node auto-repair on a per-node pool basis. When you create a cluster, you can enable or disable auto-repair for the cluster's default node pool. If you create additional node pools, you can enable or disable node auto-repair for those node pools, independent of the auto-repair setting for the default node pool. Kubernetes Engine generates an entry in its operation logs for any automated repair event. You can check the logs by using the gcloud container operations list command.
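The log check described above can be sketched as a shell one-liner. This is a dry-run sketch that prints the command rather than executing it (no credentials are assumed here); the zone value is a hypothetical placeholder, and the `AUTO_REPAIR_NODES` operation type should be verified against your environment's operation logs:

```shell
# Hypothetical placeholder zone; substitute your cluster's zone.
ZONE="us-central1-a"
# GKE records automated repair events with operationType AUTO_REPAIR_NODES.
CMD="gcloud container operations list --filter=operationType=AUTO_REPAIR_NODES --zone ${ZONE}"
# Print the command instead of running it (dry-run sketch).
echo "${CMD}"
```

Running the printed command in an authenticated session lists repair operations, each of which corresponds to an entry Kubernetes Engine generated for an automated repair event.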
Remediation guidance
From Console
- Go to the Kubernetes Engine page in the GCP Console at https://console.cloud.google.com/kubernetes/list
- Select the reported Kubernetes cluster for which Automatic node repair is disabled
- Click the EDIT button and set Automatic node repair to Enabled
Using Command line
To enable Automatic node repair for an existing cluster with node pool, run the following command:
gcloud container node-pools update [POOL_NAME] --cluster [CLUSTER_NAME] --zone [COMPUTE_ZONE] --enable-autorepair
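After the update, the setting can be verified with a describe call. The following is a dry-run sketch that only prints the command (the pool, cluster, and zone names are hypothetical placeholders):

```shell
# Hypothetical placeholders; substitute your own resource names.
POOL_NAME="default-pool"
CLUSTER_NAME="my-cluster"
COMPUTE_ZONE="us-central1-a"
# --format extracts just the autoRepair field from the node pool's management block.
CMD="gcloud container node-pools describe ${POOL_NAME} --cluster ${CLUSTER_NAME} --zone ${COMPUTE_ZONE} --format=value(management.autoRepair)"
# Print the command instead of running it (dry-run sketch).
echo "${CMD}"
```

In an authenticated session, the printed command is expected to output True once auto-repair is enabled on the node pool.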
Note: Node auto-repair is only available for nodes that use Container-Optimized OS as their node image. Node auto-repair is not available on Alpha Clusters.
Impact
If multiple nodes require repair, Kubernetes Engine might repair them in parallel. Kubernetes Engine limits the number of parallel repairs based on the size of the cluster (larger clusters have a higher limit) and the number of broken nodes in the cluster (the limit decreases if many nodes are broken).
References:
- https://cloud.google.com/kubernetes-engine/docs/concepts/node-auto-repair
Multiple Remediation Paths
Google Cloud
SERVICE-WIDE (RECOMMENDED when many resources are affected): Enforce Organization Policies at org/folder level so new resources inherit secure defaults.
gcloud org-policies set-policy policy.yaml
ASSET-LEVEL: Use the product-specific remediation steps above for only the impacted project/resources.
PREVENTIVE: Use org policy constraints/custom constraints and enforce checks in deployment pipelines.
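The predefined constraint catalog does not include an auto-repair constraint, so the service-wide path relies on a custom constraint. The following is an illustrative sketch only: the organization ID, constraint name, and condition field path are hypothetical and must be verified against the current GKE custom organization policy documentation before use.

```yaml
# Hypothetical custom constraint sketch; verify resource type and
# field path against current GKE custom org policy documentation.
name: organizations/123456789/customConstraints/custom.gkeNodeAutoRepair
resourceTypes:
- container.googleapis.com/NodePool
methodTypes:
- CREATE
- UPDATE
condition: "resource.management.autoRepair == true"
actionType: ALLOW
displayName: Require node auto-repair on GKE node pools
```

A constraint like this would be registered with gcloud org-policies set-custom-constraint and then enforced at org/folder scope with gcloud org-policies set-policy, per the service-wide pattern above.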
References for Service-Wide Patterns
- GCP Organization Policy overview: https://cloud.google.com/resource-manager/docs/organization-policy/overview
- GCP Organization policy constraints catalog: https://cloud.google.com/resource-manager/docs/organization-policy/org-policy-constraints
- gcloud org-policies: https://cloud.google.com/sdk/gcloud/reference/org-policies
Operational Rollout Workflow
Use this sequence to reduce risk and avoid repeated drift.
1. Contain at Service-Wide Scope First (Recommended)
- Google Cloud: apply organization policy constraints at org/folder scope.
gcloud org-policies set-policy policy.yaml
2. Remediate Existing Affected Assets
- Execute the control-specific Console/CLI steps documented above for each flagged resource.
- Prioritize internet-exposed and production assets first.
3. Validate and Prevent Recurrence
- Re-scan after each remediation batch.
- Track exceptions with owner and expiry date.
- Add preventive checks in IaC/CI pipelines.
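For step 3, a preventive IaC check can pin the setting at definition time. A minimal Terraform sketch, with hypothetical resource names, assuming the standard google_container_node_pool resource:

```hcl
# Hypothetical sketch: a GKE node pool with auto-repair pinned on,
# so drift away from the control is caught at plan/apply time.
resource "google_container_node_pool" "primary_nodes" {
  name     = "primary-pool" # hypothetical
  cluster  = google_container_cluster.primary.name
  location = "us-central1-a" # hypothetical

  management {
    auto_repair = true # the setting this control checks
  }
}
```

A CI policy check (for example, scanning the plan for node pools where auto_repair is false) keeps new node pools from regressing the setting.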
Query logic
These are the stored checks tied to this control.
Automatic node repair is enabled for Kubernetes Clusters
Connectors
Covered asset types
Expected check: eq []
gkeClusters(where:{nodePools_SOME:{managementAutoRepair_NOT:true}}){...AssetFragment}
Google Cloud