Ensure Automatic node repair is enabled for Kubernetes Clusters

Category: Controls, Medium
Applies to: Google Cloud
Coverage: 1 query
Asset types: 1 covered

Overview

Kubernetes Engine's node auto-repair feature helps you keep the nodes in your cluster in a healthy, running state. When enabled, Kubernetes Engine makes periodic checks on the health state of each node in your cluster. If a node fails consecutive health checks over an extended time period, Kubernetes Engine initiates a repair process for that node. If you disable node auto-repair at any time during the repair process, the in-progress repairs are not cancelled and still complete for any node currently under repair.

Rationale

Kubernetes Engine uses the node's health status to determine if a node needs to be repaired. A node reporting a Ready status is considered healthy. Kubernetes Engine triggers a repair action if a node reports consecutive unhealthy status reports for a given time threshold. An unhealthy status can mean:

  • A node reports a NotReady status on consecutive checks over the given time threshold (approximately 10 minutes).
  • A node does not report any status at all over the given time threshold (approximately 10 minutes).
  • A node's boot disk is out of disk space for an extended time period (approximately 30 minutes).
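
The three conditions above can be modelled as a small decision function. This is only a sketch of the documented thresholds, not Kubernetes Engine's actual implementation: the NodeObservation type, its field names, and the returned reason strings are hypothetical, and the thresholds are the approximate values quoted above.

```python
from dataclasses import dataclass
from typing import Optional

# Approximate thresholds taken from the list above (minutes).
NOT_READY_THRESHOLD_MIN = 10
NO_STATUS_THRESHOLD_MIN = 10
DISK_FULL_THRESHOLD_MIN = 30


@dataclass
class NodeObservation:
    """Hypothetical summary of one node's recent health history."""
    minutes_not_ready: float = 0.0       # continuous time reporting NotReady
    minutes_without_status: float = 0.0  # continuous time with no status report
    minutes_boot_disk_full: float = 0.0  # continuous time with a full boot disk


def repair_reason(obs: NodeObservation) -> Optional[str]:
    """Return a reason string if the node would qualify for repair, else None."""
    if obs.minutes_not_ready >= NOT_READY_THRESHOLD_MIN:
        return "NotReady"
    if obs.minutes_without_status >= NO_STATUS_THRESHOLD_MIN:
        return "NoStatus"
    if obs.minutes_boot_disk_full >= DISK_FULL_THRESHOLD_MIN:
        return "BootDiskFull"
    return None
```

A node that has been NotReady for 4 minutes yields None under this model, while one whose boot disk has been full for 45 minutes yields "BootDiskFull".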

You can enable node auto-repair on a per-node pool basis. When you create a cluster, you can enable or disable auto-repair for the cluster's default node pool. If you create additional node pools, you can enable or disable node auto-repair for those node pools, independent of the auto-repair setting for the default node pool. Kubernetes Engine generates an entry in its operation logs for any automated repair event. You can check the logs by using the gcloud container operations list command.
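
To pick the automated repair events out of the operation log, one option is to request JSON output and filter on the operation type. The sketch below assumes the gcloud CLI is installed and authenticated, and that repair events carry the operationType value AUTO_REPAIR_NODES as in the GKE API; the function names are illustrative.

```python
import json
import subprocess


def auto_repair_operations(operations):
    """Keep only the operations that describe an automated node repair."""
    return [op for op in operations
            if op.get("operationType") == "AUTO_REPAIR_NODES"]


def fetch_operations():
    """Run `gcloud container operations list` and parse its JSON output.

    Requires an installed, authenticated gcloud CLI; not called on import.
    """
    result = subprocess.run(
        ["gcloud", "container", "operations", "list", "--format=json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

Calling auto_repair_operations(fetch_operations()) would then return only the repair entries, each with the fields gcloud reports for an operation (for example name, status, and targetLink).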

Remediation guidance

From Console

  1. Open the Kubernetes clusters page in the Google Cloud Console: https://console.cloud.google.com/kubernetes/list
  2. Select each reported Kubernetes cluster for which Automatic node repair is disabled.
  3. Click EDIT and set Automatic node repair to Enabled.

Using Command line

To enable Automatic node repair for a node pool in an existing cluster, run the following command:

gcloud container node-pools update [POOL_NAME] --cluster [CLUSTER_NAME] --zone [COMPUTE_ZONE] --enable-autorepair

Note: Node auto-repair is only available for nodes that use Container-Optimized OS as their node image. Node auto-repair is not available on Alpha Clusters.
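
When several node pools are flagged, the command above can be generated once per pool. The following is a minimal sketch that builds the same gcloud invocation for each pool name; the function names and the dry_run parameter are assumptions made for illustration, and nothing is executed unless dry_run is set to False.

```python
import subprocess


def enable_autorepair_command(pool, cluster, zone):
    """Build the gcloud argument list that enables auto-repair on one pool."""
    return [
        "gcloud", "container", "node-pools", "update", pool,
        "--cluster", cluster,
        "--zone", zone,
        "--enable-autorepair",
    ]


def enable_autorepair(pools, cluster, zone, dry_run=True):
    """Build (and optionally run) the update command for every pool given."""
    commands = [enable_autorepair_command(p, cluster, zone) for p in pools]
    if not dry_run:
        for cmd in commands:
            # check=True stops at the first pool whose update fails.
            subprocess.run(cmd, check=True)
    return commands
```

With the default dry_run=True the function only returns the commands, so they can be reviewed before anything is changed.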

Impact

If multiple nodes require repair, Kubernetes Engine might repair them in parallel. Kubernetes Engine limits the number of concurrent repairs depending on the size of the cluster (bigger clusters have a higher limit) and the number of broken nodes in the cluster (the limit decreases if many nodes are broken).

References:

  1. https://cloud.google.com/kubernetes-engine/docs/concepts/node-auto-repair

Multiple Remediation Paths

Google Cloud

SERVICE-WIDE (RECOMMENDED when many resources are affected): Enforce Organization Policies at org/folder level so new resources inherit secure defaults.

gcloud org-policies set-policy policy.yaml

ASSET-LEVEL: Use the product-specific remediation steps above for only the impacted project/resources.

PREVENTIVE: Use org policy constraints/custom constraints and enforce checks in deployment pipelines.
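
The text does not show what policy.yaml contains. One possible shape, given here as an assumption rather than a verified configuration, is a custom constraint on GKE node pools plus a policy that enforces it. The constraint name custom.enableNodeAutoRepair is hypothetical, and the field path resource.management.autoRepair should be checked against the list of fields that GKE custom constraints support before use.

```python
def custom_constraint_yaml(org_id: str) -> str:
    """Render a custom constraint that allows only node pools with auto-repair.

    Assumption: resource.management.autoRepair is a supported condition field.
    """
    return "\n".join([
        f"name: organizations/{org_id}/customConstraints/custom.enableNodeAutoRepair",
        "resourceTypes:",
        "- container.googleapis.com/NodePool",
        "methodTypes:",
        "- CREATE",
        "- UPDATE",
        'condition: "resource.management.autoRepair == true"',
        "actionType: ALLOW",
        "displayName: Require node auto-repair",
    ]) + "\n"


def policy_yaml(org_id: str) -> str:
    """Render the policy that enforces the custom constraint above."""
    return "\n".join([
        f"name: organizations/{org_id}/policies/custom.enableNodeAutoRepair",
        "spec:",
        "  rules:",
        "  - enforce: true",
    ]) + "\n"
```

Under these assumptions, the first document would be registered with gcloud org-policies set-custom-constraint and the second applied with gcloud org-policies set-policy policy.yaml.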

References for Service-Wide Patterns

  • GCP Organization Policy overview: https://cloud.google.com/resource-manager/docs/organization-policy/overview
  • GCP Organization policy constraints catalog: https://cloud.google.com/resource-manager/docs/organization-policy/org-policy-constraints
  • gcloud org-policies: https://cloud.google.com/sdk/gcloud/reference/org-policies

Operational Rollout Workflow

Use this sequence to reduce risk and avoid repeated drift.

1. Contain at Service-Wide Scope First (Recommended)

  • Google Cloud: apply organization policy constraints at org/folder scope.
gcloud org-policies set-policy policy.yaml

2. Remediate Existing Affected Assets

  • Execute the control-specific Console/CLI steps documented above for each flagged resource.
  • Prioritize internet-exposed and production assets first.

3. Validate and Prevent Recurrence

  • Re-scan after each remediation batch.
  • Track exceptions with owner and expiry date.
  • Add preventive checks in IaC/CI pipelines.

Query logic

These are the stored checks tied to this control.

Automatic node repair is enabled for Kubernetes Clusters

Connectors

Google Cloud

Covered asset types

Cluster

Expected check: eq [] (the check passes when the query below returns no clusters)

gkeClusters(where:{nodePools_SOME:{managementAutoRepair_NOT:true}}){...AssetFragment}
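
The same condition can be approximated locally against the JSON printed by gcloud container clusters describe --format=json. This is a sketch, not the platform's query engine: it assumes that managementAutoRepair corresponds to the management.autoRepair field of each node pool in the GKE API, and the function names are illustrative.

```python
def pools_without_auto_repair(cluster):
    """Return the names of node pools whose auto-repair flag is not true.

    `cluster` is the parsed JSON of a single GKE cluster description.
    A missing `management` block or flag is treated as not enabled.
    """
    return [
        pool.get("name", "<unnamed>")
        for pool in cluster.get("nodePools", [])
        if pool.get("management", {}).get("autoRepair") is not True
    ]


def is_compliant(cluster):
    """A cluster passes when no node pool lacks auto-repair."""
    return not pools_without_auto_repair(cluster)
```

A cluster is flagged as soon as one of its node pools is returned by pools_without_auto_repair, which mirrors the nodePools_SOME condition in the stored query.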