DNS Troubleshooting Guide
This document outlines common DNS-related problems and errors, along with corresponding troubleshooting methods and solutions.
DNS troubleshooting process
When DNS resolution fails, troubleshoot as follows:
- First, determine the problem type from the resolution failure error. For judgment criteria, refer to Category of Common Client Errors.
  - If the failure is caused by a network connectivity problem, refer to Troubleshoot by Domain Name Type with Resolution Exception under DNS Troubleshooting Approach.
  - If the failure is caused by a domain name resolution failure, refer to Troubleshoot by Frequency of Resolution Exception under DNS Troubleshooting Approach.
- If the above steps do not resolve the problem, troubleshoot by the following steps:
  - Check the dnsPolicy field in the pod configuration to confirm whether CoreDNS is used. For details, refer to Pod DNS Configuration Check.
    - If CoreDNS is not used, the node DNS configuration is inherited by default. Refer to Node DNS Troubleshooting.
    - If CoreDNS is used, troubleshoot as follows:
      - To check the CoreDNS load status, refer to CoreDNS Pod Running Status Troubleshooting.
      - To check the CoreDNS logs, refer to CoreDNS Pod Log Troubleshooting.
    - If NodeLocal DNS is used, refer to NodeLocal DNS Troubleshooting.
  - If the above steps fail to resolve the problem, submit a ticket.
Category of common client errors
| Category | Error keywords |
|---|---|
| Network connectivity problem | network unreachable, connection timeout, connection reset by peer, no route to host |
| Domain name resolution failure | no such host, could not resolve host, name or service not known, NXDOMAIN |
Note: Different SDKs might display varying error messages. The error keywords mentioned might not be exhaustive; refer to the actual error message when making a determination.
DNS troubleshooting approach
Troubleshoot by domain name type with resolution exception
| Exception type | Solution |
|---|---|
| Intra-cluster and extra-cluster domain names are both abnormal | Container Network Connectivity Failure, High CoreDNS Pod Load, Node Conntrack Table Full, IPVS Defect Causing Resolution Exception |
| Only extra-cluster domain name is abnormal | Extra-cluster domain name resolution exception |
| Only the Headless service domain name is abnormal | Headless Service Domain Name Resolution Failure |
Troubleshoot by frequency of resolution exception
| Frequency of occurrence | Solution |
|---|---|
| Exception occurs only during application peak hours | High CoreDNS Pod Load, Node Conntrack Table Full |
| High frequency of exception occurrence | IPVS defect causing resolution exception |
| Low frequency of exception occurrence | Concurrent resolution exception for A and AAAA records |
| Exception occurs only during node scaling or CoreDNS scale-down | IPVS defect causing resolution exception |
Troubleshooting methods
Pod DNS configuration check
To view the dnsPolicy field of a business pod, execute the following command:
```
kubectl -n <ns> get pod <pod-name> -o yaml | grep dnsPolicy
```
The value of this field is as follows:
- "Default": The pod inherits the name resolution configuration from the node it runs on, i.e. using the cloud-based DNS server for domain name resolution services
- "ClusterFirst": Uses CoreDNS for domain name resolution service. In pod, the nameserver in /etc/resolv.conf points to the ClusterIP of the kube-dns service
- "ClusterFirstWithHostNet": Pods running in hostNetwork mode require the DNS policy to be explicitly set to "ClusterFirstWithHostNet"; otherwise, they will fall back to the "Default" policy while using the "ClusterFirst" policy.
- "None": This setting allows pods to ignore DNS setting in the Kubernetes environment, and pod will use the DNS settings provided by their dnsConfig field
You can dive into the business pod to verify whether its DNS configuration file aligns with expectations.
First, enter the container:
```
kubectl -n <ns> exec -it <pod-name> -- bash
```
If the bash command does not exist, try sh
Then check the DNS configuration file:
```
cat /etc/resolv.conf
```
If the dnsPolicy field is set to ClusterFirst or ClusterFirstWithHostNet, the nameserver in resolv.conf should be the ClusterIP of kube-dns in the cluster.
If the dnsPolicy field is set to "Default," the nameserver in resolv.conf should match that in /etc/resolv.conf on the node.
If the dnsPolicy field is set to "None", the configuration in resolv.conf should match the dnsConfig field in the pod's YAML.
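To cross-check the "ClusterFirst" case, you can compare the kube-dns ClusterIP with the nameserver the pod actually uses (a minimal sketch; <ns> and <pod-name> are placeholders):
```
# ClusterIP of the kube-dns service
kubectl -n kube-system get svc kube-dns -o jsonpath='{.spec.clusterIP}'
# nameserver configured inside the business pod
kubectl -n <ns> exec <pod-name> -- grep nameserver /etc/resolv.conf
```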
Connectivity exception troubleshooting
Start by verifying the connectivity between the pod and the ClusterIP of CoreDNS.
You can enter the pod using the following command:
```
kubectl -n <ns> exec -it <pod-name> -- bash
```
If the bash command does not exist, try sh
If the pod cannot be entered via bash or sh, log in to the node first and execute the following command on the node to obtain the container ID:
```
docker ps | grep <container-name>
```
Then execute the following command to obtain the pid:
```
docker inspect <container-id> | grep -i pid
```
Finally, enter the pod network namespace via the following command:
```
nsenter -t <pid> -n bash
```
Note: If the container runtime is containerd, replace the docker command with crictl.
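For containerd nodes, a roughly equivalent sequence with crictl looks like this (a sketch; the container name is a placeholder):
```
# Find the container ID
crictl ps | grep <container-name>
# Obtain the PID of the container's main process
crictl inspect <container-id> | grep -i pid
# Enter the pod network namespace
nsenter -t <pid> -n bash
```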
First, obtain the ClusterIP of CoreDNS via the following command:
```
kubectl -n kube-system get svc kube-dns
```
Then test connectivity to ClusterIP and pod-ip sequentially via the following command:
```
telnet <ClusterIP> 53
telnet <pod-ip> 53
```
If the ClusterIP is unreachable but the CoreDNS pod IP is accessible, this indicates issues with cluster service connectivity. Check components like kube-proxy related to service load balancing.
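For example, a quick kube-proxy check might look like this (a sketch; the k8s-app=kube-proxy label is a common convention and may differ in your cluster):
```
# Confirm the kube-proxy pods are running
kubectl -n kube-system get pod -l k8s-app=kube-proxy -o wide
# Inspect recent logs of the kube-proxy pod on the affected node
kubectl -n kube-system logs <kube-proxy-pod-name> --tail=100
```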
If there’s no connectivity between the pod and the CoreDNS pod IP, first ensure the cluster pod network is functioning correctly. Resolve any pod network issues before proceeding, as common problems include node network outages or incorrect security group settings.
If there is no problem with the cluster pod network, the problem may lie with the CoreDNS pod. Resolution can be tested via the following command:
```
dig <domain> @<ClusterIP>
```
Based on your findings, consult other sections of this document for further troubleshooting guidance.
CoreDNS pod running status troubleshooting
First, check the running status of the CoreDNS pod by executing the following command:
```
kubectl -n kube-system get pod -o wide | grep coredns
```
All pods are expected to be in the running status. For non-running pods, the following command can be executed to investigate the cause:
```
kubectl -n kube-system describe pod <pod-name>
```
Or use the following command to check CoreDNS pod resource usage and determine if resources are exceeded:
```
kubectl -n kube-system top pod -l k8s-app=kube-dns
```
If the CoreDNS pod's load is excessive, increase the number of CoreDNS replicas.
If the CoreDNS pod's status appears normal, analyze the CoreDNS pod logs for additional insights.
CoreDNS pod log troubleshooting
The command is as follows:
```
kubectl -n kube-system logs <pod-name>
```
Because the kube-dns ClusterIP load-balances requests across multiple CoreDNS pods, you may need to spot-check the logs of each CoreDNS pod.
To precisely direct DNS requests to a certain CoreDNS pod, use the dig command to specify the DNS server. The command is as follows:
```
dig baidu.com @<pod-ip>
```
Then check the logs of the CoreDNS pod corresponding to that pod IP.
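For example, assuming query logging (the log plugin) is enabled in the Corefile, the matching entries can be filtered like this (a sketch; the domain is a placeholder):
```
# Look for the test query in the logs of the CoreDNS pod that received it
kubectl -n kube-system logs <coredns-pod-name> --tail=200 | grep <domain>
```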
Example of CoreDNS logs:
```
[INFO] 192.168.2.90:30639 - 8870 "A IN nfd-master.kube-system.svc.cluster.local. udp 69 false 1232" NOERROR qr,aa,rd 114 0.00010245s
```
The keyword NOERROR indicates a successful resolution return code. Common return codes are as follows:
- NOERROR: Resolution succeeded
- NXDOMAIN: Domain name does not exist
- SERVFAIL: Resolution error returned from the upstream DNS server, etc.
Moreover, attention should be paid to whether other errors exist in the logs. Some common errors include:
- Errors connecting to the API server: check the status of the API server.
- Kubernetes API compatibility errors: generally caused by a version mismatch between CoreDNS and Kubernetes, typically requiring a CoreDNS or Kubernetes version upgrade.
Troubleshooting for domain name resolution
Scenario 1: Intra-cluster service resolution succeeded, but public network domain name resolution failed. In such situations, CoreDNS pod logs often display return codes like NXDOMAIN or SERVFAIL. You can reach out to the cloud DNS team for assistance.
Scenario 2: Private domain name resolution failed. Verify whether the private domain name is registered in the cloud DNS. To enable CoreDNS to resolve private domain names, include the configuration of the private domain name resolution server in the CoreDNS configuration file.
Scenario 3: Headless service resolution failure. For Headless services, the resolution directly returns all pod IPs. Check whether the pods corresponding to the service are in a running state.
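For Scenario 2, one common way to let CoreDNS resolve a private zone is to add a forward stanza for that zone to the Corefile in the coredns ConfigMap (a minimal sketch; the zone name and DNS server IP are assumptions, not values from this guide):
```
# Edit the CoreDNS configuration
kubectl -n kube-system edit configmap coredns
# Then add a server block for the private zone, for example:
#   corp.example.com:53 {
#       errors
#       cache 30
#       forward . 10.0.0.10    # assumed private DNS server
#   }
```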
Node DNS troubleshooting
If the workload's dnsPolicy is set to "Default", or the pod uses hostNetwork without "ClusterFirstWithHostNet", the node DNS configuration is used.
The dig command can be used on the node for reproduction and troubleshooting, as shown in the following example:
```
dig baidu.com
```
Check node kernel logs by executing the following command:
```
dmesg
```
Check for network-related errors, such as:
- queue failure
- conntrack full
If no anomalies are found in the node kernel logs, then because the default /etc/resolv.conf on the node points to the cloud DNS server, you can contact the cloud DNS team to address the issue.
NodeLocal DNS troubleshooting
Please first read CCE Node Local DNS Description to understand the operating principle of NodeLocal DNS.
Then verify that the relevant NodeLocal DNS configurations have taken effect as described there. For troubleshooting the other links in the resolution path, refer to this document.
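A quick sanity check might look like this (a sketch; node-local-dns as the DaemonSet name and 169.254.20.10 as the local listen address are common defaults that may differ in your cluster):
```
# Confirm the NodeLocal DNS DaemonSet is running on every node
kubectl -n kube-system get ds node-local-dns
# From a node, test resolution against the local cache address
dig kubernetes.default.svc.cluster.local @169.254.20.10
```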
Common DNS problems and solutions
Container network connectivity failure
Problem phenomenon:
Persistent DNS resolution failures for business pods
Root cause:
Network connectivity failure between business pods and CoreDNS pod containers
Solution:
Troubleshoot and ensure container network connectivity is normal. Refer to Connectivity Exception Troubleshooting in the troubleshooting methods
Extra-cluster domain name resolution is abnormal
Problem phenomenon:
Business pod can resolve the intra-cluster domain name normally but fails to resolve certain extra-cluster domain names
Root cause:
The upstream DNS server returns a resolution exception
Solution:
Check CoreDNS pod request logs to identify error codes. Example logs are as follows:
```
[INFO] 192.168.2.90:30639 - 8870 "A x.y.com. udp 69 false 1232" NOERROR qr,aa,rd 114 0.00010245s
```
If the return code is not NOERROR, it indicates an upstream server error. Common errors include:
- NXDOMAIN: Domain name does not exist
- SERVFAIL: Typically indicates upstream server failure or inability to connect to upstream servers
If the problem is confirmed to be with the upstream server, you can submit a ticket for resolution
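Before submitting a ticket, you can also query the upstream server directly to confirm where the failure occurs (a sketch; the upstream IP is whatever the node's /etc/resolv.conf or the Corefile forward target points to):
```
# Query the failing domain against the upstream DNS server directly
dig <domain> @<upstream-dns-ip>
```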
Headless Service Domain Name Resolution Failure
Problem phenomenon: The Headless service domain name cannot be resolved.
Root cause:
For Headless services, the resolution result directly returns all pod IPs. If the backing pods are not in the running status, the resolution cannot succeed
Solution:
Please check and ensure that the pod corresponding to the service is in the running status
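For example, you can confirm that the Headless service has ready endpoints and that its pods are running (a sketch; <ns>, <svc-name>, and the selector are placeholders):
```
# List the endpoints backing the Headless service
kubectl -n <ns> get endpoints <svc-name>
# Check the pods selected by the service
kubectl -n <ns> get pod -l <service-selector> -o wide
```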
High load on CoreDNS pods
Problem phenomenon: Business pods experience high DNS request latency, intermittent failures, or persistent failures
Root cause:
High load on CoreDNS pods, and insufficient processing capacity, leading to increased request latency or failures
Solution:
Increase CoreDNS replica count or allocate more resources to CoreDNS
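For example, a minimal scaling sketch (the Deployment name coredns is the common default and may differ in your cluster):
```
# Check the current replica count
kubectl -n kube-system get deployment coredns
# Increase the number of replicas
kubectl -n kube-system scale deployment coredns --replicas=4
```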
Node Conntrack table full
Problem phenomenon: Business pod DNS requests fail intermittently or persistently during peak traffic. Running dmesg -T on the node shows errors containing the keyword "conntrack full" in the logs of the corresponding period
Root cause:
During application peak hours, the kernel Conntrack table becomes full, preventing new TCP or UDP requests
Solution:
To increase the kernel Conntrack table limit on nodes, submit a ticket for resolution
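To confirm whether the table is actually saturated, you can compare the current entry count with the configured maximum on the node (a sketch):
```
# Current number of tracked connections vs. the configured maximum
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
# Kernel messages emitted when the table overflows
dmesg -T | grep -i "table full"
```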
Concurrent resolution exception of A and AAAA records
Problem phenomenon: Business pod experiences intermittent failures in domain name resolution
Root cause:
Concurrent A and AAAA DNS requests trigger a defect in the Conntrack module of Linux kernel, causing UDP message loss
Solution:
- If the container image is based on Alpine, it is recommended to replace the base image
- Consider adopting the NodeLocal DNS cache solution to enhance DNS resolution performance and reduce CoreDNS load
- For base images such as CentOS and Ubuntu, the issue can be mitigated with resolv.conf parameters such as options timeout:2 attempts:3 rotate single-request-reopen; a dnsConfig sketch is shown after this list
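For non-Alpine images, these resolver options can also be set per pod through dnsConfig rather than baking them into the image (a minimal sketch; the pod name and image are illustrative placeholders):
```
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: dns-options-demo           # hypothetical pod name
spec:
  dnsPolicy: ClusterFirst
  dnsConfig:
    options:
      - name: timeout
        value: "2"
      - name: attempts
        value: "3"
      - name: rotate
      - name: single-request-reopen
  containers:
    - name: app
      image: nginx:1.25
EOF
```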
IPVS defects cause resolution exception
Problem phenomenon: During cluster node scaling or CoreDNS scale-down, intermittent resolution failures may occur, typically lasting for about five minutes
Root cause:
If the kube-proxy load balancing mode of the cluster is IPVS, then on nodes with kernel versions below 4.19 (e.g., CentOS), after an IPVS UDP backend is removed, newly initiated UDP packets that still match a stale connection entry for the removed backend are discarded
Solution:
Consider adopting the NodeLocal DNS solution. Since TCP is used between the NodeLocal DNS pod and the CoreDNS pod, it can tolerate the packet loss caused by this IPVS defect
