Pod Anomaly Troubleshooting

CCE CCE

  • Function Release Records
  • Common Tools
    • Command Line Scenario Examples
  • API Reference
    • Overview
    • Common Headers and Error Responses
    • General Description
  • Product Announcement
    • Announcement on the Discontinuation of CCE Standalone Clusters
    • CCE New Cluster Management Release Announcement
    • Upgrade Announcement for CCE Cluster Audit Component kube-external-auditor
    • CCE Console Upgrade Announcement
    • Announcement on Management Fees for CCE Managed Clusters
    • Container Runtime Version Release Notes
    • Announcement on the Decommissioning of CCE Image Repository
    • Kubernetes Version Release Notes
      • CCE Release of Kubernetes v1_26 History
      • CCE Kubernetes Version Update Notes
      • CCE Release of Kubernetes v1_24 History
      • CCE Release of Kubernetes v1_30 History
      • CCE Release of Kubernetes v1_22 History
      • CCE Release of Kubernetes v1_18 History
      • CCE Release of Kubernetes v1_20 History
      • CCE Release of Kubernetes v1_28 History
      • Release Notes for CCE Kubernetes 1_31 Version
      • Kubernetes Version Overview and Mechanism
    • Security Vulnerability Fix Announcement
      • Vulnerability CVE-2019-5736 Fix Announcement
      • Vulnerability CVE-2021-30465 Fix Announcement
      • CVE-2025-1097, CVE-2025-1098, and Other Vulnerabilities Fix Announcement
      • CVE-2020-14386 Vulnerability Fix Announcement
      • Impact Statement on runc Security Issue (CVE-2024-21626)
  • Service Level Agreement (SLA)
    • CCE Service Level Agreement SLA (V1_0)
  • Typical Practices
    • Pod Anomaly Troubleshooting
    • Adding CGroup V2 Node
    • Common Linux System Configuration Parameters Description
    • Encrypting etcd Data Using KMS
    • Configuring Container Network Parameters Using CNI
    • CCE - Public Network Access Practice
    • Practice of using private images in CCE clusters
    • Unified Access for Virtual Machines and Container Services via CCE Ingress
    • User Guide for Custom CNI Plugins
    • CCE Cluster Network Description and Planning
    • Cross-Cloud Application Migration to Baidu CCE Using Velero
    • CCE Resource Recommender User Documentation
    • Continuous Deployment with Jenkins in CCE Cluster
    • CCE Best Practice-Guestbook Setup
    • CCE Best Practice-Container Network Mode Selection
    • CCE Usage Checklist
    • VPC-ENI Mode Cluster Public Network Access Practice
    • CCE Container Runtime Selection
    • Cloud-native AI
      • Elastic and Fault-Tolerant Training Using CCE AITraining Operator
      • Deploy the TensorFlow Serving inference service
      • Best Practice for GPU Virtualization with Optimal Isolation
  • FAQs
    • How do business applications use load balancer
    • Using kubectl on Windows
    • Cluster management FAQs
    • Common Questions Overview
    • Auto scaling FAQs
    • Create a simple service via kubectl
  • Operation guide
    • Prerequisites for use
    • Identity and access management
    • Permission Management
      • Configure IAM Tag Permission Policy
      • Permission Overview
      • Configure IAM Custom Permission Policy
      • Configure Predefined RBAC Permission Policy
      • Configure IAM Predefined Permission Policy
      • Configure Cluster OIDC Authentication
    • Configuration Management
      • Configmap Management
      • Secret Management
    • Traffic access
      • BLB ingress annotation description
      • Use K8S_Service via CCE
      • Use K8S_Ingress via CCE
      • Implement Canary Release with CCE Based on Nginx-Ingress
      • Create CCE_Ingress via YAML
      • LoadBalancer Service Annotation Description
      • Service Reuses Existing Load Balancer BLB
      • Use Direct Pod Mode LoadBalancer Service
      • NGINX Ingress Configuration Reference
      • Create LoadBalancer_Service via YAML
      • Use NGINX Ingress
    • Virtual Node
      • Configuring BCIPod
      • Configuring bci-profile
      • Managing virtual nodes
    • Node management
      • Add a node
      • Managing Taints
      • Setting Node Blocking
      • Setting GPU Memory Sharing
      • Remove a node
      • Customizing Kubelet Parameters
      • Kubelet Container Monitor Read-Only Port Risk Warning
      • Managing Node Tag
      • Drain node
    • Component Management
      • CCE CSI CDS Plugin Description
      • CCE Fluid Description
      • CCE CSI PFS L2 Plugin
      • CCE Calico Felix Description
      • CCE Ingress Controller Description
      • CCE QoS Agent Description
      • CCE GPU Manager Description
      • CCE Ingress NGINX Controller Description
      • CCE P2P Accelerator Description
      • CCE Virtual Kubelet Component
      • CoreDNS Description
      • CCE Log Operator Description
      • CCE Node Remedier Description
      • CCE Descheduler Description
      • CCE Dynamic Scheduling Plugin Description
      • Kube Scheduler Documentation
      • CCE NPU Manager Description
      • CCE CronHPA Controller Description
      • CCE LB Controller Description
      • Kube ApiServer Description
      • CCE Backup Controller Description
      • CCE Network Plugin Description
      • CCE CSI PFS Plugin Description
      • CCE Credential Controller Description
      • CCE Deep Learning Frameworks Operator Description
      • Component Overview
      • CCE Image Accelerate Description
      • CCE CSI BOS Plugin Description
      • CCE Onepilot Description
      • Description of Kube Controller Manager
      • CCE_Hybrid_Manager Description
      • CCE NodeLocal DNSCache Description
      • CCE Node Problem Detector Description
      • CCE Ascend Mindx DL Description
      • CCE RDMA Device Plugin Description
      • CCE AI Job Scheduler Description
    • Image registry
      • Image Registry Basic Operations
      • Using Container Image to Build Services
    • Helm Management
      • Helm Template
      • Helm Instance
    • Cluster management
      • Upgrade Cluster Kubernetes Version
      • CCE Node CDS Dilatation
      • Managed Cluster Usage Instructions
      • Create cluster
      • CCE Supports GPUSharing Cluster
      • View Cluster
      • Connect to Cluster via kubectl
      • CCE Security Group
      • CCE Node Resource Reservation Instructions
      • Operate Cluster
      • Cluster Snapshot
    • Serverless Cluster
      • Product overview
      • Using Service in Serverless Cluster
      • Creating a Serverless Cluster
    • Storage Management
      • Using Cloud File System
      • Overview
      • Using Parallel File System PFS
      • Using RapidFS
      • Using Object Storage BOS
      • Using Parallel File System PFS L2
      • Using Local Storage
      • Using Cloud Disk CDS
    • Inspection and Diagnosis
      • Cluster Inspection
      • GPU Runtime Environment Check
      • Fault Diagnosis
    • Cloud-native AI
      • Cloud-Native AI Overview
      • AI Monitoring Dashboard
        • Connecting to a Prometheus Instance and Starting a Job
        • NVIDIA Chip Resource Observation
          • AI Job Scheduler component
          • GPU node resources
          • GPU workload resources
          • GPUManager component
          • GPU resource pool overview
        • Ascend Chip Resource Observation
          • Ascend resource pool overview
          • Ascend node resource
          • Ascend workload resource
      • Task Management
        • View Task Information
        • Create TensorFlow Task
        • Example of RDMA Distributed Training Based on NCCL
        • Create PaddlePaddle Task
        • Create AI Training Task
        • Delete task
        • Create PyTorch Task
        • Create Mxnet Task
      • Queue Management
        • Modify Queue
        • Create Queue
        • Usage Instructions for Logical Queues and Physical Queues
        • Queue deletion
      • Dataset Management
        • Create Dataset
        • Delete dataset
        • View Dataset
        • Operate Dataset
      • AI Acceleration Kit
        • AIAK Introduction
        • Using AIAK-Training PyTorch Edition
        • Deploying Distributed Training Tasks Using AIAK-Training
        • Accelerating Inference Business Using AIAK-Inference
      • GPU Virtualization
        • GPU Exclusive and Shared Usage Instructions
        • Image Build Precautions in Shared GPU Scenarios
        • Instructions for Multi-GPU Usage in Single-GPU Containers
        • GPU Virtualization Adaptation Table
        • GPU Online and Offline Mixed Usage Instructions
        • MPS Best Practices & Precautions
        • Precautions for Disabling Node Video Memory Sharing
    • Elastic Scaling
      • Container Timing Horizontal Scaling (CronHPA)
      • Container Horizontal Scaling (HPA)
      • Implementing Second-Level Elastic Scaling with cce-autoscaling-placeholder
      • CCE Cluster Node Auto-Scaling
    • Network Management
      • How to Continue Dilatation When Container Network Segment Space Is Exhausted (VPC-ENI Mode)
      • Container Access to External Services in CCE Clusters
      • CCE supports dual-stack networks of IPv4 and IPv6
      • Using NetworkPolicy Network Policy
      • Traffic Forwarding Configuration for Containers in Peering Connections Scenarios
      • CCE IP Masquerade Agent User Guide
      • Creating VPC-ENI Mode Cluster
      • How to Continue Dilatation When Container Network Segment Space Is Exhausted (VPC Network Mode)
      • Using NetworkPolicy in CCE Clusters
      • Network Orchestration
        • Container Network QoS Management
        • VPC-ENI Specified Subnet IP Allocation (Container Network v2)
        • Cluster Pod Subnet Topology Distribution (Container Network v2)
      • Network Connectivity
        • Container network accesses the public network via NAT gateway
      • Network Maintenance
        • Common Error Code Table for CCE Container Network
      • DNS
        • CoreDNS Component Manual Dilatation Guide
        • DNS Troubleshooting Guide
        • DNS Principle Overview
    • Namespace Management
      • Set Limit Range
      • Set Resource Quota
      • Basic Namespace Operations
    • Workload
      • CronJob Management
      • Set Workload Auto-Scaling
      • Deployment Management
      • Job Management
      • View the Pod
      • StatefulSet Management
      • Password-Free Pull of Container Image
      • Create Workload Using Private Image
      • DaemonSet Management
    • Monitor Logs
      • Monitor Cluster with Prometheus
      • CCE Event Center
      • Cluster Service Profiling
      • CCE Cluster Anomaly Event Alerts
      • Java Application Monitor
      • Cluster Audit Dashboard
      • Logging
      • Cluster Audit
      • Log Center
        • Configure Collection Rules Using CRD
        • View Cluster Control Plane Logs
        • View Business Logs
        • Log Overview
        • Configure Collection Rules in Cloud Container Engine Console
    • Application management
      • Overview
      • Secret
      • Configuration dictionary
      • Deployment
      • Service
      • Pod
    • NodeGroup Management
      • NodeGroup Management
      • NodeGroup Node Fault Detection and Self-Healing
      • Configuring Scaling Policies
      • NodeGroup Introduction
      • Adding Existing External Nodes
      • Custom NodeGroup Kubelet Configuration
      • Adding Alternative Models
      • Dilatation NodeGroup
    • Backup Center
      • Restore Management
      • Backup Overview
      • Backup Management
      • Backup repository
  • Quick Start
    • Quick Deployment of Nginx Application
    • CCE Container Engine Usage Process Overview
  • Product pricing
    • Product pricing
  • Product Description
    • Application scenarios
    • Introduction
    • Usage restrictions
    • Features
    • Advantages
    • Core concepts
  • Solution-Fabric
    • Fabric Solution
  • Development Guide
    • EFK Log Collection System Deployment Guide
    • Using Network Policy in CCE Cluster
    • Creating a LoadBalancer-Type Service
    • Prometheus Monitoring System Deployment Guide
    • kubectl Management Configuration
  • API_V2 Reference
    • Overview
    • Common Headers and Error Responses
    • Cluster Related Interfaces
    • Instance Related Interfaces
    • Service domain
    • General Description
    • Kubeconfig Related Interfaces
    • RBAC Related Interfaces
    • Autoscaler Related Interfaces
    • Network Related Interfaces
    • InstanceGroup Related Interfaces
    • Appendix
    • Component management-related APIs
    • Package adaptation-related APIs
    • Task Related Interfaces
  • Solution-Xchain
    • Hyperchain Solution
  • SDK
    • Go-SDK
      • Overview
      • NodeGroup Management
      • Initialization
      • Install the SDK Package
      • Cluster management
      • Node management
All documents
menu
No results found, please re-enter

CCE CCE

  • Function Release Records
  • Common Tools
    • Command Line Scenario Examples
  • API Reference
    • Overview
    • Common Headers and Error Responses
    • General Description
  • Product Announcement
    • Announcement on the Discontinuation of CCE Standalone Clusters
    • CCE New Cluster Management Release Announcement
    • Upgrade Announcement for CCE Cluster Audit Component kube-external-auditor
    • CCE Console Upgrade Announcement
    • Announcement on Management Fees for CCE Managed Clusters
    • Container Runtime Version Release Notes
    • Announcement on the Decommissioning of CCE Image Repository
    • Kubernetes Version Release Notes
      • CCE Release of Kubernetes v1_26 History
      • CCE Kubernetes Version Update Notes
      • CCE Release of Kubernetes v1_24 History
      • CCE Release of Kubernetes v1_30 History
      • CCE Release of Kubernetes v1_22 History
      • CCE Release of Kubernetes v1_18 History
      • CCE Release of Kubernetes v1_20 History
      • CCE Release of Kubernetes v1_28 History
      • Release Notes for CCE Kubernetes 1_31 Version
      • Kubernetes Version Overview and Mechanism
    • Security Vulnerability Fix Announcement
      • Vulnerability CVE-2019-5736 Fix Announcement
      • Vulnerability CVE-2021-30465 Fix Announcement
      • CVE-2025-1097, CVE-2025-1098, and Other Vulnerabilities Fix Announcement
      • CVE-2020-14386 Vulnerability Fix Announcement
      • Impact Statement on runc Security Issue (CVE-2024-21626)
  • Service Level Agreement (SLA)
    • CCE Service Level Agreement SLA (V1_0)
  • Typical Practices
    • Pod Anomaly Troubleshooting
    • Adding CGroup V2 Node
    • Common Linux System Configuration Parameters Description
    • Encrypting etcd Data Using KMS
    • Configuring Container Network Parameters Using CNI
    • CCE - Public Network Access Practice
    • Practice of using private images in CCE clusters
    • Unified Access for Virtual Machines and Container Services via CCE Ingress
    • User Guide for Custom CNI Plugins
    • CCE Cluster Network Description and Planning
    • Cross-Cloud Application Migration to Baidu CCE Using Velero
    • CCE Resource Recommender User Documentation
    • Continuous Deployment with Jenkins in CCE Cluster
    • CCE Best Practice-Guestbook Setup
    • CCE Best Practice-Container Network Mode Selection
    • CCE Usage Checklist
    • VPC-ENI Mode Cluster Public Network Access Practice
    • CCE Container Runtime Selection
    • Cloud-native AI
      • Elastic and Fault-Tolerant Training Using CCE AITraining Operator
      • Deploy the TensorFlow Serving inference service
      • Best Practice for GPU Virtualization with Optimal Isolation
  • FAQs
    • How do business applications use load balancer
    • Using kubectl on Windows
    • Cluster management FAQs
    • Common Questions Overview
    • Auto scaling FAQs
    • Create a simple service via kubectl
  • Operation guide
    • Prerequisites for use
    • Identity and access management
    • Permission Management
      • Configure IAM Tag Permission Policy
      • Permission Overview
      • Configure IAM Custom Permission Policy
      • Configure Predefined RBAC Permission Policy
      • Configure IAM Predefined Permission Policy
      • Configure Cluster OIDC Authentication
    • Configuration Management
      • Configmap Management
      • Secret Management
    • Traffic access
      • BLB ingress annotation description
      • Use K8S_Service via CCE
      • Use K8S_Ingress via CCE
      • Implement Canary Release with CCE Based on Nginx-Ingress
      • Create CCE_Ingress via YAML
      • LoadBalancer Service Annotation Description
      • Service Reuses Existing Load Balancer BLB
      • Use Direct Pod Mode LoadBalancer Service
      • NGINX Ingress Configuration Reference
      • Create LoadBalancer_Service via YAML
      • Use NGINX Ingress
    • Virtual Node
      • Configuring BCIPod
      • Configuring bci-profile
      • Managing virtual nodes
    • Node management
      • Add a node
      • Managing Taints
      • Setting Node Blocking
      • Setting GPU Memory Sharing
      • Remove a node
      • Customizing Kubelet Parameters
      • Kubelet Container Monitor Read-Only Port Risk Warning
      • Managing Node Tag
      • Drain node
    • Component Management
      • CCE CSI CDS Plugin Description
      • CCE Fluid Description
      • CCE CSI PFS L2 Plugin
      • CCE Calico Felix Description
      • CCE Ingress Controller Description
      • CCE QoS Agent Description
      • CCE GPU Manager Description
      • CCE Ingress NGINX Controller Description
      • CCE P2P Accelerator Description
      • CCE Virtual Kubelet Component
      • CoreDNS Description
      • CCE Log Operator Description
      • CCE Node Remedier Description
      • CCE Descheduler Description
      • CCE Dynamic Scheduling Plugin Description
      • Kube Scheduler Documentation
      • CCE NPU Manager Description
      • CCE CronHPA Controller Description
      • CCE LB Controller Description
      • Kube ApiServer Description
      • CCE Backup Controller Description
      • CCE Network Plugin Description
      • CCE CSI PFS Plugin Description
      • CCE Credential Controller Description
      • CCE Deep Learning Frameworks Operator Description
      • Component Overview
      • CCE Image Accelerate Description
      • CCE CSI BOS Plugin Description
      • CCE Onepilot Description
      • Description of Kube Controller Manager
      • CCE_Hybrid_Manager Description
      • CCE NodeLocal DNSCache Description
      • CCE Node Problem Detector Description
      • CCE Ascend Mindx DL Description
      • CCE RDMA Device Plugin Description
      • CCE AI Job Scheduler Description
    • Image registry
      • Image Registry Basic Operations
      • Using Container Image to Build Services
    • Helm Management
      • Helm Template
      • Helm Instance
    • Cluster management
      • Upgrade Cluster Kubernetes Version
      • CCE Node CDS Dilatation
      • Managed Cluster Usage Instructions
      • Create cluster
      • CCE Supports GPUSharing Cluster
      • View Cluster
      • Connect to Cluster via kubectl
      • CCE Security Group
      • CCE Node Resource Reservation Instructions
      • Operate Cluster
      • Cluster Snapshot
    • Serverless Cluster
      • Product overview
      • Using Service in Serverless Cluster
      • Creating a Serverless Cluster
    • Storage Management
      • Using Cloud File System
      • Overview
      • Using Parallel File System PFS
      • Using RapidFS
      • Using Object Storage BOS
      • Using Parallel File System PFS L2
      • Using Local Storage
      • Using Cloud Disk CDS
    • Inspection and Diagnosis
      • Cluster Inspection
      • GPU Runtime Environment Check
      • Fault Diagnosis
    • Cloud-native AI
      • Cloud-Native AI Overview
      • AI Monitoring Dashboard
        • Connecting to a Prometheus Instance and Starting a Job
        • NVIDIA Chip Resource Observation
          • AI Job Scheduler component
          • GPU node resources
          • GPU workload resources
          • GPUManager component
          • GPU resource pool overview
        • Ascend Chip Resource Observation
          • Ascend resource pool overview
          • Ascend node resource
          • Ascend workload resource
      • Task Management
        • View Task Information
        • Create TensorFlow Task
        • Example of RDMA Distributed Training Based on NCCL
        • Create PaddlePaddle Task
        • Create AI Training Task
        • Delete task
        • Create PyTorch Task
        • Create Mxnet Task
      • Queue Management
        • Modify Queue
        • Create Queue
        • Usage Instructions for Logical Queues and Physical Queues
        • Queue deletion
      • Dataset Management
        • Create Dataset
        • Delete dataset
        • View Dataset
        • Operate Dataset
      • AI Acceleration Kit
        • AIAK Introduction
        • Using AIAK-Training PyTorch Edition
        • Deploying Distributed Training Tasks Using AIAK-Training
        • Accelerating Inference Business Using AIAK-Inference
      • GPU Virtualization
        • GPU Exclusive and Shared Usage Instructions
        • Image Build Precautions in Shared GPU Scenarios
        • Instructions for Multi-GPU Usage in Single-GPU Containers
        • GPU Virtualization Adaptation Table
        • GPU Online and Offline Mixed Usage Instructions
        • MPS Best Practices & Precautions
        • Precautions for Disabling Node Video Memory Sharing
    • Elastic Scaling
      • Container Timing Horizontal Scaling (CronHPA)
      • Container Horizontal Scaling (HPA)
      • Implementing Second-Level Elastic Scaling with cce-autoscaling-placeholder
      • CCE Cluster Node Auto-Scaling
    • Network Management
      • How to Continue Dilatation When Container Network Segment Space Is Exhausted (VPC-ENI Mode)
      • Container Access to External Services in CCE Clusters
      • CCE supports dual-stack networks of IPv4 and IPv6
      • Using NetworkPolicy Network Policy
      • Traffic Forwarding Configuration for Containers in Peering Connections Scenarios
      • CCE IP Masquerade Agent User Guide
      • Creating VPC-ENI Mode Cluster
      • How to Continue Dilatation When Container Network Segment Space Is Exhausted (VPC Network Mode)
      • Using NetworkPolicy in CCE Clusters
      • Network Orchestration
        • Container Network QoS Management
        • VPC-ENI Specified Subnet IP Allocation (Container Network v2)
        • Cluster Pod Subnet Topology Distribution (Container Network v2)
      • Network Connectivity
        • Container network accesses the public network via NAT gateway
      • Network Maintenance
        • Common Error Code Table for CCE Container Network
      • DNS
        • CoreDNS Component Manual Dilatation Guide
        • DNS Troubleshooting Guide
        • DNS Principle Overview
    • Namespace Management
      • Set Limit Range
      • Set Resource Quota
      • Basic Namespace Operations
    • Workload
      • CronJob Management
      • Set Workload Auto-Scaling
      • Deployment Management
      • Job Management
      • View the Pod
      • StatefulSet Management
      • Password-Free Pull of Container Image
      • Create Workload Using Private Image
      • DaemonSet Management
    • Monitor Logs
      • Monitor Cluster with Prometheus
      • CCE Event Center
      • Cluster Service Profiling
      • CCE Cluster Anomaly Event Alerts
      • Java Application Monitor
      • Cluster Audit Dashboard
      • Logging
      • Cluster Audit
      • Log Center
        • Configure Collection Rules Using CRD
        • View Cluster Control Plane Logs
        • View Business Logs
        • Log Overview
        • Configure Collection Rules in Cloud Container Engine Console
    • Application management
      • Overview
      • Secret
      • Configuration dictionary
      • Deployment
      • Service
      • Pod
    • NodeGroup Management
      • NodeGroup Management
      • NodeGroup Node Fault Detection and Self-Healing
      • Configuring Scaling Policies
      • NodeGroup Introduction
      • Adding Existing External Nodes
      • Custom NodeGroup Kubelet Configuration
      • Adding Alternative Models
      • Dilatation NodeGroup
    • Backup Center
      • Restore Management
      • Backup Overview
      • Backup Management
      • Backup repository
  • Quick Start
    • Quick Deployment of Nginx Application
    • CCE Container Engine Usage Process Overview
  • Product pricing
    • Product pricing
  • Product Description
    • Application scenarios
    • Introduction
    • Usage restrictions
    • Features
    • Advantages
    • Core concepts
  • Solution-Fabric
    • Fabric Solution
  • Development Guide
    • EFK Log Collection System Deployment Guide
    • Using Network Policy in CCE Cluster
    • Creating a LoadBalancer-Type Service
    • Prometheus Monitoring System Deployment Guide
    • kubectl Management Configuration
  • API_V2 Reference
    • Overview
    • Common Headers and Error Responses
    • Cluster Related Interfaces
    • Instance Related Interfaces
    • Service domain
    • General Description
    • Kubeconfig Related Interfaces
    • RBAC Related Interfaces
    • Autoscaler Related Interfaces
    • Network Related Interfaces
    • InstanceGroup Related Interfaces
    • Appendix
    • Component management-related APIs
    • Package adaptation-related APIs
    • Task Related Interfaces
  • Solution-Xchain
    • Hyperchain Solution
  • SDK
    • Go-SDK
      • Overview
      • NodeGroup Management
      • Initialization
      • Install the SDK Package
      • Cluster management
      • Node management
  • Document center
  • arrow
  • CCECCE
  • arrow
  • Typical Practices
  • arrow
  • Pod Anomaly Troubleshooting
Table of contents on this page
  • 1. Scheduling issues
  • Pod not scheduled to a node
  • Pod scheduled to a node
  • 2. Image pull issues
  • 3. Startup issues
  • Pod in init state
  • Pod being created (creating)
  • Pod startup failure (CrashLoopBackOff)
  • 4. Pod running issues
  • OOM
  • Terminating
  • Evicted
  • Completed
  • 5. Other common issues
  • Pod is in running state but not working properly

Pod Anomaly Troubleshooting

Updated at:2025-10-27

1. Scheduling issues

Pod not scheduled to a node

If a pod stays in the pending state for a long time and isn't assigned to any node for execution, the issue might arise from the following reasons.

Error message Description Recommended solution
no nodes available to schedule pods. There are no available nodes in the cluster for scheduling. 1. Check if any cluster nodes are in the NotReady state. If so, inspect and resolve issues on those problematic nodes.
2. Verify if the pod has declared nodeSelector, nodeAffinity, or taint tolerations. If no affinity policies exist, add additional nodes to the cluster.
0/x nodes are available: x Insufficient cpu.
0/x nodes are available: x Insufficient memory.
No nodes in the cluster meet the pod's CPU or memory resource requirements. Check the usage of pods, CPU, and memory on the nodes page to determine the cluster’s resource utilization.
Note:
Even if adding a new pod does not immediately reach the resource limit when a node’s CPU and memory utilization is low, the scheduler will still consider it cautiously. This prevents potential resource shortages on the node during future peak hours caused by improper resource allocation.
If the cluster’s CPU or memory is exhausted, resolve it using one of the following methods.
1. Delete or reduce unnecessary pods
2. Adjust the pod’s resource configuration (reconfigure container request and limit) based on your business needs.
3. Add new nodes to the cluster that match the pod’s declared affinity policies.
x node(s) didn't match node selector.
x node(s) didn't match pod affinity/anti-affinity.
Existing cluster nodes do not align with the pod's specified nodeSelector (node affinity) or podAffinity/podAntiAffinity (Pod affinity) requirements. 1. Inspect and adjust the pod’s node affinity policies, including node labels, nodeSelector, nodeAffinity, taints, and tolerations.
2. Inspect and adjust the pod’s pod affinity policies, and evaluate if nodes meet the requirements: If podAffinity is configured, check if matching pods exist on the target node. If podAntiAffinity is configured, ensure no conflicting pods (that should not coexist) are present on the target node.
0/x nodes are available: x node(s) had volume node affinity conflict. There's an affinity conflict between the persistent volume used by the pod and the node being scheduled (e.g., cloud disks cannot be mounted across different availability zones), causing scheduling to fail. Review and modify the node affinity policies of the pod and persistent volume, including node labels, nodeSelector, nodeAffinity, taints, and tolerations.
0/x nodes are available: x node(s) had taints that the pod didn't tolerate. The target node has a taint that prevents the pod from being scheduled. 1. If the taint was manually added, delete the unintended taint.
2. If the taint cannot be deleted, configure a corresponding toleration for the pod.
3. If the taint was automatically added by the system, troubleshoot and resolve the issue based on the taint details, then wait for the pod to be rescheduled.
0/x nodes are available: pod has unbound immediate PersistentVolumeClaims. The pod failed to bind to the PVC. Check if the PVC or PV specified by the pod has been created. Use kubectl describe pvc or kubectl describe pv commands to inspect the event logs of the PVC/PV for further troubleshooting.

Pod scheduled to a node

If the pod is scheduled to a node but still remains in the pending state, resolve it using the following steps.

  1. Check if the pod is configured with hostPort: If hostPort is used, only one pod instance with that hostPort can run on each node. Thus, the replicas value in a deployment or ReplicationController cannot exceed the number of nodes in the cluster. If the port is occupied by another application, the pod will fail to schedule. hostPort introduces management and scheduling complexity; it is recommended to use service to access the pod instead.
  2. If hostPort is not configured, refer to the following steps.
     a. Use kubectl describe pod to check the pod’s event logs and resolve the identified issue. The event may describe the reason, such as image pull failure, insufficient resources, security policy restrictions, configuration errors.
      b. If events lack useful information, check the kubelet logs on the node for further troubleshooting. Search for log entries related to the pod using: grep -i /var/log/messages* | less.

2. Image pull issues

| Error message | Description | Recommended solution |
| --- | --- | --- |
| Failed to pull image "xxx": rpc error: code = Unknown desc = Error response from daemon: Get xxx: denied | Access to the image registry is denied (no imagePullSecret was specified when the pod was created). | Verify that the secret specified in spec.template.imagePullSecrets in the pod YAML exists.<br>When using CCR Enterprise Edition, you can pull images using the password-free component. For details, see: https://cloud.baidu.com/doc/CCE/s/4m0kru8g5. |
| Failed to pull image "xxxx:xxx": rpc error: code = Unknown desc = Error response from daemon: Get https://xxxxxx/xxxxx/: dial tcp: lookup xxxxxxx.xxxxx: no such host | The image registry's address could not be resolved when pulling over HTTPS. | 1. Check whether the image registry address configured in spec.containers.image in the pod YAML is correct. Modify it if incorrect.<br>2. If the address is correct, verify network connectivity from the pod's node to the image registry. |
| Failed create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "xxxxxxxxx": Error response from daemon: mkdir xxxxx: no space left on device | The node has insufficient disk space. | Log in to the node hosting the pod and run df -h to check disk space. If the disk is full, expand its capacity. |
| Failed to pull image "XXX": rpc error: code = Unknown desc = context canceled | The operation was canceled, possibly because the image file is too large. Kubernetes enforces a default timeout for image pulling; if no progress is made within that period, the task is canceled. | 1. Check whether imagePullPolicy in the pod YAML is set to IfNotPresent.<br>2. Sign in to the pod's node and run docker pull or ctr images pull to verify the image can be pulled manually. |
| Failed to pull image "xxxxx": rpc error: code = Unknown desc = Error response from daemon: Get https://xxxxxxx: xxxxx/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers) | Failed to connect to the image registry (network unreachable). | 1. Sign in to the pod's node and run curl https://xxxxxx/xxxxx/ to check whether the address is accessible. If errors occur, troubleshoot the network (e.g., network configuration, firewall rules, DNS resolution).<br>2. Verify that the node's public network policies are normal (e.g., Baidu Load Balance (BLB), security groups, bound EIP). |
| Stuck in "Pulling image" state. | kubelet's image pull rate-limiting mechanism may have been triggered. | Adjust registryPullQPS (maximum QPS for the image registry) and registryBurst (maximum number of burst image pulls) via the Custom Kubelet Parameters feature. |
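For the imagePullSecret case above, a minimal pod manifest referencing a registry secret might look like the following sketch. The secret name `my-registry-secret` and the image path are placeholders, not CCE defaults:

```yaml
# Hypothetical secret name; create it first, e.g.:
#   kubectl create secret docker-registry my-registry-secret \
#     --docker-server=<registry> --docker-username=<user> --docker-password=<password>
apiVersion: v1
kind: Pod
metadata:
  name: private-image-demo
spec:
  containers:
    - name: app
      image: <registry>/<namespace>/app:latest
  imagePullSecrets:
    - name: my-registry-secret   # must match an existing secret in the same namespace
```

If the name under imagePullSecrets does not match an existing secret in the pod's namespace, pulls from a private registry fail with the "denied" error shown above.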

3. Startup issues

Pod in init state

| Error message | Description | Recommended solution |
| --- | --- | --- |
| Stuck in Init:N/M | The pod contains M init containers; N have completed successfully, and the remaining M − N have not. | 1. Run kubectl describe pod <pod_name> -n <namespace> to check pod events and identify anomalies in the unstarted init containers.<br>2. Run kubectl logs <pod_name> -n <namespace> -c <init_container_name> to view the logs of the unstarted init containers for troubleshooting.<br>3. Inspect the pod configuration (e.g., health check settings) to confirm the unstarted init containers are configured correctly.<br>For more information about init containers, see: Debugging Init Containers. |
| Stuck in Init:Error | An init container within the pod failed to start. | Same as above. |
| Stuck in Init:CrashLoopBackOff | An init container in the pod could not start and is stuck in a restart loop. | Same as above. |
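For reference, a minimal pod with one init container is sketched below (names and images are illustrative). Init containers run to completion, in order, before the app containers start; if the init command exits non-zero, the pod shows Init:0/1 and then Init:Error or Init:CrashLoopBackOff:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: init-demo
spec:
  initContainers:
    - name: wait-for-dependency
      image: busybox
      # A non-zero exit code here would leave the pod stuck in Init:0/1.
      command: ["sh", "-c", "exit 0"]
  containers:
    - name: app
      image: nginx
```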

Pod being created (creating)

| Error message | Description | Recommended solution |
| --- | --- | --- |
| Network plugin returns error: cni plugin not initialized | The CNI network components are not initialized, preventing pod creation. | Review the statuses of cce-network-agent and cce-network-operator on the pod's node. Gather logs for troubleshooting or request support with a ticket. |
| cce-network-agent is not running or ready on this node | An abnormal cce-network-agent is blocking pod creation. | Inspect the status of cce-network-agent on the pod's node. Gather logs for troubleshooting or escalate with a support ticket. |

Pod startup failure (CrashLoopBackOff)

| Error message | Description | Recommended solution |
| --- | --- | --- |
| exit(0) observed in logs. | The container's main process exited normally (exit code 0), so Kubernetes keeps restarting it. | 1. Sign in to the node where the abnormal workload resides.<br>2. Run docker ps -a to locate the exited container, then inspect its logs (e.g., with docker logs) to determine why the process exited. |
| Liveness probe failed: Get http… reported in events. | The health check failed. | Ensure the container's liveness probe policy aligns with the expected health check parameters and reflects the actual application state inside the container. |
| "No space left" error found in pod logs. | Disk space is insufficient. | 1. Expand the node's disk capacity.<br>2. Clean up unnecessary images to free disk space. Configure imageGCHighThresholdPercent as needed to control the threshold that triggers image garbage collection (image GC) on the node. |
| Startup failed without any event logs. | The resource limits declared in the pod are below the container's actual needs, so the container cannot start. | Confirm the pod's resource configuration is accurate. Adjust the container's request and limit values if necessary. |
| "Address already in use" detected in pod logs. | Port conflicts exist between containers in the same pod. | 1. Check whether the pod is configured with hostNetwork: true (containers in the pod then share the host's network namespace and port space). Set hostNetwork: false if it is not required.<br>2. If hostNetwork: true is necessary, configure pod anti-affinity so that pods in the same replica set are scheduled to different nodes.<br>3. Ensure no two or more pods with the same port requirements run on the same node. |
| Container initialization failed due to "setenv: invalid argument": unknown error in pod logs. | A secret is mounted in the workload, but its value has not been Base64-encoded. | 1. Create the secret via the console (values are automatically Base64-encoded).<br>2. Create the secret via YAML after Base64-encoding the value, e.g. with echo -n "xxxxx" \| base64. |
| Issue specific to the business logic. | - | Check the pod logs to diagnose issues based on the log content. |
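For the Base64 row above: when writing a secret by YAML, every value under data must be Base64-encoded first. A quick way to encode a value (the password shown is purely illustrative):

```shell
# Encode the raw value; printf (or echo -n) avoids encoding a trailing newline.
printf '%s' 'mypassword' | base64
# → bXlwYXNzd29yZA==
```

Alternatively, you can place the raw value under the secret's stringData field, and the API server encodes it for you.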

4. Pod running issues

OOM

The container terminated because it exceeded the memory limit, triggering an OOM (Out Of Memory) event and causing an abnormal exit. For more information on OOM events, refer to Assign Memory Resources to Containers and Pods.

  • If the container’s main process is terminated, the container might restart unexpectedly.
  • If the cluster is configured to send abnormal replica alerts, you will receive notifications when OOM events occur.
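A minimal sketch of the memory settings involved in OOM kills is shown below; the values are illustrative and should be sized from the workload's observed usage:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: memory-demo
spec:
  containers:
    - name: app
      image: nginx
      resources:
        requests:
          memory: "256Mi"   # used for scheduling decisions
        limits:
          memory: "512Mi"   # exceeding this triggers an OOM kill of the container
```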

Terminating

| Potential reason | Description | Recommended solution |
| --- | --- | --- |
| The node is in an abnormal state (NotReady). | - | A NotReady node will be automatically re-added to the cluster once it recovers. |
| The pod has finalizers configured. | Kubernetes performs the cleanup operations defined by the finalizers before deleting the pod. If those operations do not respond properly, the pod remains in the Terminating state. | Run kubectl get pod <pod_name> -n <namespace> -o yaml to check the pod's finalizers configuration and resolve any issues. |
| The pod has an abnormal preStop configuration. | If a preStop hook is configured, Kubernetes executes the specified operation before terminating the container. The pod stays in the Terminating state while in the preStop phase. | Run kubectl get pod <pod_name> -n <namespace> -o yaml to inspect the pod's preStop configuration and troubleshoot any issues. |
| The pod is configured with a termination grace period. | When terminationGracePeriodSeconds is set, the pod transitions to the Terminating state after receiving a termination command (e.g., kubectl delete pod <pod_name>). Kubernetes considers the pod successfully shut down either after the grace period ends or when the container exits earlier. | The pod will be automatically deleted once the container exits cleanly. |
| The container is unresponsive. | When a request to stop or delete the pod is made, Kubernetes sends a SIGTERM signal to the container. If the container does not respond to SIGTERM, the pod may become stuck in the Terminating state. | 1. Force delete the pod to free resources: kubectl delete pod <pod_name> --grace-period=0 --force.<br>2. Check the containerd or docker logs on the pod's node for further troubleshooting. |
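The preStop and grace-period mechanics above can be sketched in a single manifest (names and values are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: graceful-demo
spec:
  # The pod stays in Terminating for at most 30s after deletion is requested.
  terminationGracePeriodSeconds: 30
  containers:
    - name: app
      image: nginx
      lifecycle:
        preStop:
          exec:
            # Runs before SIGTERM; a hung command here keeps the pod Terminating.
            command: ["sh", "-c", "sleep 5"]
```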

Evicted

| Potential reason | Description | Recommended solution |
| --- | --- | --- |
| Node resource constraints (e.g., low memory or disk space) cause the kubelet to evict one or more pods to reclaim resources. | Possible pressures include memory pressure, disk pressure, and PID pressure. Check with kubectl describe node <node_name> \| grep Taints:<br>1. Memory pressure: the taint node.kubernetes.io/memory-pressure exists.<br>2. Disk pressure: the taint node.kubernetes.io/disk-pressure exists.<br>3. PID pressure: the taint node.kubernetes.io/pid-pressure exists. | 1. Memory pressure: adjust the pod resource configuration based on business needs, or add more nodes.<br>2. Disk pressure: regularly clean business pod logs on the node to prevent disk exhaustion, or expand the node's disk.<br>3. PID pressure: adjust the pod resource configuration based on business needs. See Process ID Constraints and Reservations. |
| Unexpected eviction occurred. | The node hosting the pod was manually tainted with NoExecute, leading to unexpected evictions. | Run kubectl describe node <node_name> \| grep Taints to check the node's taints, and remove the unintended NoExecute taint. |
| The eviction process did not occur as expected. | Eviction behavior is controlled by kube-controller-manager parameters:<br>--pod-eviction-timeout (before 1.26): starts evicting pods from a downed node if the node is unresponsive for longer than the set time (default: 5 minutes).<br>--node-eviction-rate: number of nodes per second from which pods are evicted when nodes are unhealthy (default: 0.1, i.e., at most one node is drained per 10 seconds).<br>--secondary-node-eviction-rate: secondary node eviction rate; if too many nodes are down, the eviction rate decreases to this level (default: 0.01).<br>--unhealthy-zone-threshold: unhealthy threshold for an availability zone (default: 0.55, i.e., a zone is marked unhealthy if downed nodes exceed 55% of its nodes).<br>--large-cluster-size-threshold: threshold for a large cluster (default: 50 nodes, i.e., clusters with more than 50 nodes are considered large). | For small clusters (≤50 nodes): if failed nodes exceed 55% of total nodes, pod eviction stops. See Node Eviction Rate Limits for details.<br>For large clusters (>50 nodes): if unhealthy nodes exceed --unhealthy-zone-threshold (default: 0.55), the eviction rate is reduced to --secondary-node-eviction-rate (default: 0.01). See Node Eviction Rate Limits for details. |
| The evicted pod keeps getting rescheduled onto the original node. | Eviction is triggered by actual resource usage on the node, while scheduling depends on the node's resource allocation (requests). An evicted pod can therefore be scheduled back to the same node. | Check whether the pod's resource request configuration is reasonable given the cluster nodes' allocatable resources. |
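Per the last row above, setting requests close to the container's real usage keeps the scheduler's "allocated" view honest, so an evicted pod is less likely to be packed back onto the pressured node. A container-level fragment with illustrative values:

```yaml
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"            # request ≈ real usage so allocation reflects reality
    ephemeral-storage: "1Gi"   # also counted against node disk pressure
  limits:
    memory: "1Gi"
```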

Completed

In the "Completed" state, the container's startup command in the pod has finished running, and all processes within the container have exited successfully. This state is commonly associated with Jobs, Init containers, and similar use cases.
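A pod whose startup command runs to completion and exits with code 0 is expected to show Completed, as in this illustrative Job:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: completed-demo
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: once
          image: busybox
          command: ["sh", "-c", "echo done"]  # exits 0, so the pod shows Completed
```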

5. Other common issues

Pod is in running state but not working properly

If there is an issue with your business YAML, the pod may be in running state but not function correctly. Resolve it using the following steps.

  1. Inspect the pod configuration to confirm the container configuration meets expectations.
  2. Use the following methods to check if there is a spelling error in any key of the environment variables. When creating a pod, if a key in the environment variables is misspelled, the cluster will ignore this error and successfully create resources using the YAML file. However, during the container’s operation, the system will be unable to execute the commands specified in the YAML file.
      a. Run kubectl apply with the --validate flag: kubectl apply --validate -f XXX.yaml. If a spelling error exists, an error message will be prompted.
      b. Execute the following command, then compare the output pod.yaml file with the YAML file you used to create the pod.
kubectl get pods [$Pod] -o yaml > pod.yaml

      • If the pod.yaml file has more lines than the YAML file you used to create the pod, the created pod meets expectations (the cluster appends default fields when creating resources).
      • If any lines from your pod-creation YAML file are missing in pod.yaml, there is a spelling issue in the original YAML.

  3. Check the pod logs to troubleshoot based on log content.
  4. Access the container via the terminal and verify whether the local files inside the container meet expectations.
