DNS Troubleshooting Guide
This document outlines common DNS-related problems and errors, along with corresponding troubleshooting methods and solutions.
DNS troubleshooting process
When DNS resolution fails, troubleshoot as follows:
- First, determine the problem type from the resolution failure error. For judgment criteria, refer to Category of Common Client Errors.
  - If the failure is caused by a network connectivity problem, refer to Troubleshoot by Domain Name Type with Resolution Exception under DNS Troubleshooting Approach.
  - If the failure is caused by a domain name resolution failure, refer to Troubleshoot by Frequency of Resolution Exception under DNS Troubleshooting Approach.
- If the above steps do not resolve the problem, troubleshoot by the following steps:
  - Check the dnsPolicy field in the pod configuration to confirm whether CoreDNS is used. For details, refer to Pod DNS Configuration Check.
    - If CoreDNS is not used, the node DNS configuration is inherited by default. Refer to Node DNS Troubleshooting.
    - If CoreDNS is used, troubleshoot as follows:
      - To check the CoreDNS load status, refer to CoreDNS Pod Running Status Troubleshooting.
      - To check the CoreDNS logs, refer to CoreDNS Pod Log Troubleshooting.
    - If NodeLocal DNS is used, refer to NodeLocal DNS Troubleshooting.
  - If the above steps fail to resolve the problem, submit a ticket.
Category of common client errors
| Category | Error keywords |
|---|---|
| Network connectivity problem | network unreachable, connection timeout, connection reset by peer, no route to host |
| Domain name resolution failure | no such host, could not resolve host, name or service not known, NXDOMAIN |
Note: Different SDKs might display varying error messages. The error keywords mentioned might not be exhaustive; refer to the actual error message when making a determination.
DNS troubleshooting approach
Troubleshoot by domain name type with resolution exception
| Exception type | Solution |
|---|---|
| Intra-cluster and extra-cluster domain names are both abnormal | Container Network Connectivity Failure, High CoreDNS Pod Load, Node Conntrack Table Full, IPVS Defect Causing Resolution Exception |
| Only extra-cluster domain name is abnormal | Extra-cluster domain name resolution exception |
| Only the Headless service domain name is abnormal | Headless Service Domain Name Resolution Failure |
Troubleshoot by frequency of resolution exception
| Frequency of occurrence | Solution |
|---|---|
| Exception occurs only during application peak hours | High CoreDNS Pod Load, Node Conntrack Table Full |
| High frequency of exception occurrence | IPVS defect causing resolution exception |
| Low frequency of exception occurrence | Concurrent resolution exception for A and AAAA records |
| Exception occurs only during node scaling or CoreDNS scale-down | IPVS defect causing resolution exception |
Troubleshooting methods
Pod DNS configuration check
To view the dnsPolicy field of a business pod, execute the following command:
```
kubectl -n <ns> get pod <pod-name> -o yaml | grep dnsPolicy
```
The value of this field is as follows:
- "Default": The pod inherits the name resolution configuration from the node it runs on, i.e. using the cloud-based DNS server for domain name resolution services
- "ClusterFirst": Uses CoreDNS for domain name resolution service. In pod, the nameserver in /etc/resolv.conf points to the ClusterIP of the kube-dns service
- "ClusterFirstWithHostNet": Pods running in hostNetwork mode require the DNS policy to be explicitly set to "ClusterFirstWithHostNet"; otherwise, they will fall back to the "Default" policy while using the "ClusterFirst" policy.
- "None": This setting allows pods to ignore DNS setting in the Kubernetes environment, and pod will use the DNS settings provided by their dnsConfig field
You can dive into the business pod to verify whether its DNS configuration file aligns with expectations.
First, enter the container:
```
kubectl -n <ns> exec -it <pod-name> -- bash
```
If the bash command does not exist, try sh
Then check the DNS configuration file:
```
cat /etc/resolv.conf
```
If the dnsPolicy field is set to ClusterFirst or ClusterFirstWithHostNet, the nameserver in resolv.conf should be the ClusterIP of kube-dns in the cluster.
If the dnsPolicy field is set to "Default," the nameserver in resolv.conf should match that in /etc/resolv.conf on the node.
If the dnsPolicy field is set to "None", the configuration in resolv.conf should match the dnsConfig field in the pod's YAML.
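To cross-check the "ClusterFirst" case, you can compare the kube-dns ClusterIP with the nameserver the pod actually uses (a minimal sketch; <ns> and <pod-name> are placeholders):
```
# ClusterIP of the kube-dns service
kubectl -n kube-system get svc kube-dns -o jsonpath='{.spec.clusterIP}'
# nameserver configured inside the business pod
kubectl -n <ns> exec <pod-name> -- grep nameserver /etc/resolv.conf
```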
Connectivity exception troubleshooting
Start by verifying the connectivity between the pod and the ClusterIP of CoreDNS.
You can enter the pod using the following command:
```
kubectl -n <ns> exec -it <pod-name> -- bash
```
If the bash command does not exist, try sh
If the pod cannot be entered via bash or sh, log in to the node first and execute the following command on the node to obtain the container ID:
```
docker ps | grep <container-name>
```
Then execute the following command to obtain the pid:
```
docker inspect <container-id> | grep -i pid
```
Finally, enter the pod network namespace via the following command:
```
nsenter -t <pid> -n bash
```
Note: If the container runtime is containerd, replace the docker command with crictl.
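For containerd nodes, a roughly equivalent sequence with crictl looks like this (a sketch; the container name is a placeholder):
```
# Find the container ID
crictl ps | grep <container-name>
# Obtain the PID of the container's main process
crictl inspect <container-id> | grep -i pid
# Enter the pod network namespace
nsenter -t <pid> -n bash
```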
First, obtain the ClusterIP of CoreDNS via the following command:
```
kubectl -n kube-system get svc kube-dns
```
Then test connectivity to ClusterIP and pod-ip sequentially via the following command:
```
telnet <ClusterIP> 53
telnet <pod-ip> 53
```
If the ClusterIP is unreachable but the CoreDNS pod IP is accessible, this indicates issues with cluster service connectivity. Check components like kube-proxy related to service load balancing.
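For example, a quick kube-proxy check might look like this (a sketch; the k8s-app=kube-proxy label is a common convention and may differ in your cluster):
```
# Confirm the kube-proxy pods are running
kubectl -n kube-system get pod -l k8s-app=kube-proxy -o wide
# Inspect recent logs of the kube-proxy pod on the affected node
kubectl -n kube-system logs <kube-proxy-pod-name> --tail=100
```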
If there’s no connectivity between the pod and the CoreDNS pod IP, first ensure the cluster pod network is functioning correctly. Resolve any pod network issues before proceeding, as common problems include node network outages or incorrect security group settings.
If there is no problem with the cluster pod network, the problem may lie with the CoreDNS pod. Resolution can be tested via the following command:
```
dig <domain> @<ClusterIP>
```
Based on your findings, consult other sections of this document for further troubleshooting guidance.
CoreDNS pod running status troubleshooting
First, check the running status of the CoreDNS pod by executing the following command:
```
kubectl -n kube-system get pod -o wide | grep coredns
```
All pods are expected to be in the running status. For non-running pods, the following command can be executed to investigate the cause:
```
kubectl -n kube-system describe pod <pod-name>
```
Or use the following command to check CoreDNS pod resource usage and determine if resources are exceeded:
```
kubectl -n kube-system top pod -l k8s-app=kube-dns
```
If the CoreDNS pod's load is excessive, increase the number of CoreDNS replicas.
If the CoreDNS pod's status appears normal, analyze the CoreDNS pod logs for additional insights.
CoreDNS pod log troubleshooting
The command is as follows:
```
kubectl -n kube-system logs <pod-name>
```
Because the kube-dns ClusterIP load-balances requests across multiple CoreDNS pods, you may need to spot-check the logs of each CoreDNS pod.
To precisely direct DNS requests to a certain CoreDNS pod, use the dig command to specify the DNS server. The command is as follows:
```
dig baidu.com @<pod-ip>
```
Then check the logs of the CoreDNS pod corresponding to that pod IP.
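For example, assuming query logging (the log plugin) is enabled in the Corefile, the matching entries can be filtered like this (a sketch; the domain is a placeholder):
```
# Look for the test query in the logs of the CoreDNS pod that received it
kubectl -n kube-system logs <coredns-pod-name> --tail=200 | grep <domain>
```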
Example of CoreDNS logs:
```
[INFO] 192.168.2.90:30639 - 8870 "A IN nfd-master.kube-system.svc.cluster.local. udp 69 false 1232" NOERROR qr,aa,rd 114 0.00010245s
```
The keyword NOERROR indicates a successful resolution return code. Common return codes are as follows:
- NOERROR: Resolution succeeded
- NXDOMAIN: Domain name does not exist
- SERVFAIL: Resolution error returned from the upstream DNS server, etc.
Moreover, attention should be paid to whether other errors exist in the logs. Some common errors include:
- Errors connecting to the API server: check the status of the API server.
- Kubernetes API compatibility errors: generally caused by a version mismatch between CoreDNS and Kubernetes, typically requiring a CoreDNS or Kubernetes version upgrade.
Troubleshooting for domain name resolution
Scenario 1: Intra-cluster service resolution succeeded, but public network domain name resolution failed. In such situations, CoreDNS pod logs often display return codes like NXDOMAIN or SERVFAIL. You can reach out to the cloud DNS team for assistance.
Scenario 2: Private domain name resolution failed. Verify whether the private domain name is registered in the cloud DNS. To enable CoreDNS to resolve private domain names, include the configuration of the private domain name resolution server in the CoreDNS configuration file.
Scenario 3: Headless service resolution failure. For Headless services, the resolution directly returns all pod IPs. Check whether the pods corresponding to the service are in a running state.
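For Scenario 2, one common way to let CoreDNS resolve a private zone is to add a forward stanza for that zone to the Corefile in the coredns ConfigMap (a minimal sketch; the zone name and DNS server IP are assumptions, not values from this guide):
```
# Edit the CoreDNS configuration
kubectl -n kube-system edit configmap coredns
# Then add a server block for the private zone, for example:
#   corp.example.com:53 {
#       errors
#       cache 30
#       forward . 10.0.0.10    # assumed private DNS server
#   }
```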
Node DNS troubleshooting
If the workload's dnsPolicy is set to "Default", or the pod uses hostNetwork without "ClusterFirstWithHostNet", the node DNS configuration is used.
The dig command can be used on the node for reproduction and troubleshooting, as shown in the following example:
```
dig baidu.com
```
Check node kernel logs by executing the following command:
```
dmesg
```
Check for network-related errors, such as:
- queue failure
- conntrack full
If no anomalies are found in the node kernel logs, then because the default /etc/resolv.conf on the node points to the cloud DNS server, you can contact the cloud DNS team to address the issue.
NodeLocal DNS troubleshooting
Please first read CCE Node Local DNS Description to understand the operating principle of NodeLocal DNS.
Then verify that the relevant NodeLocal DNS configurations have taken effect as described there. For troubleshooting the other links in the resolution path, refer to this document.
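A quick sanity check might look like this (a sketch; node-local-dns as the DaemonSet name and 169.254.20.10 as the local listen address are common defaults that may differ in your cluster):
```
# Confirm the NodeLocal DNS DaemonSet is running on every node
kubectl -n kube-system get ds node-local-dns
# From a node, test resolution against the local cache address
dig kubernetes.default.svc.cluster.local @169.254.20.10
```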
Common DNS problems and solutions
Container network connectivity failure
Problem phenomenon:
Persistent DNS resolution failures for business pods
Root cause:
Network connectivity failure between business pods and CoreDNS pod containers
Solution:
Troubleshoot and ensure container network connectivity is normal. Refer to Connectivity Exception Troubleshooting in the troubleshooting methods
Extra-cluster domain name resolution is abnormal
Problem phenomenon:
Business pod can resolve the intra-cluster domain name normally but fails to resolve certain extra-cluster domain names
Root cause:
The upstream DNS server returns a resolution exception
Solution:
Check CoreDNS pod request logs to identify error codes. Example logs are as follows:
```
[INFO] 192.168.2.90:30639 - 8870 "A x.y.com. udp 69 false 1232" NOERROR qr,aa,rd 114 0.00010245s
```
If the return code is not NOERROR, it indicates an upstream server error. Common errors include:
- NXDOMAIN: Domain name does not exist
- SERVFAIL: Typically indicates upstream server failure or inability to connect to upstream servers
If the problem is confirmed to be with the upstream server, you can submit a ticket for resolution
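Before submitting a ticket, you can also query the upstream server directly to confirm where the failure occurs (a sketch; the upstream IP is whatever the node's /etc/resolv.conf or the Corefile forward target points to):
```
# Query the failing domain against the upstream DNS server directly
dig <domain> @<upstream-dns-ip>
```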
Headless Service Domain Name Resolution Failure
Problem phenomenon: The Headless service domain name cannot be resolved.
Root cause:
For Headless services, the resolution result directly returns all pod IPs. If the backing pods are not in the running status, the resolution cannot succeed
Solution:
Please check and ensure that the pod corresponding to the service is in the running status
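For example, you can confirm that the Headless service has ready endpoints and that its pods are running (a sketch; <ns>, <svc-name>, and the selector are placeholders):
```
# List the endpoints backing the Headless service
kubectl -n <ns> get endpoints <svc-name>
# Check the pods selected by the service
kubectl -n <ns> get pod -l <service-selector> -o wide
```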
High load on CoreDNS pods
Problem phenomenon: Business pods experience high DNS request latency, intermittent failures, or persistent failures
Root cause:
High load on CoreDNS pods, and insufficient processing capacity, leading to increased request latency or failures
Solution:
Increase CoreDNS replica count or allocate more resources to CoreDNS
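For example, a minimal scaling sketch (the Deployment name coredns is the common default and may differ in your cluster):
```
# Check the current replica count
kubectl -n kube-system get deployment coredns
# Increase the number of replicas
kubectl -n kube-system scale deployment coredns --replicas=4
```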
Node Conntrack table full
Problem phenomenon: Business pod DNS requests fail intermittently or persistently during peak traffic. Running dmesg -T on the node shows errors containing the keyword "conntrack full" in the logs of the corresponding period
Root cause:
During application peak hours, the kernel Conntrack table becomes full, preventing new TCP or UDP requests
Solution:
To increase the kernel Conntrack table limit on nodes, submit a ticket for resolution
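To confirm whether the table is actually saturated, you can compare the current entry count with the configured maximum on the node (a sketch):
```
# Current number of tracked connections vs. the configured maximum
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
# Kernel messages emitted when the table overflows
dmesg -T | grep -i "table full"
```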
Concurrent resolution exception of A and AAAA records
Problem phenomenon: Business pod experiences intermittent failures in domain name resolution
Root cause:
Concurrent A and AAAA DNS requests trigger a defect in the Conntrack module of Linux kernel, causing UDP message loss
Solution:
- If the container image is based on Alpine, it is recommended to replace the base image
- Consider adopting the NodeLocal DNS cache solution to enhance DNS resolution performance and reduce CoreDNS load
- For base images such as CentOS and Ubuntu, the issue can be mitigated with resolv.conf parameters such as options timeout:2 attempts:3 rotate single-request-reopen; a dnsConfig sketch is shown after this list
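For non-Alpine images, these resolver options can also be set per pod through dnsConfig rather than baking them into the image (a minimal sketch; the pod name and image are illustrative placeholders):
```
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: dns-options-demo           # hypothetical pod name
spec:
  dnsPolicy: ClusterFirst
  dnsConfig:
    options:
      - name: timeout
        value: "2"
      - name: attempts
        value: "3"
      - name: rotate
      - name: single-request-reopen
  containers:
    - name: app
      image: nginx:1.25
EOF
```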
IPVS defects cause resolution exception
Problem phenomenon: During cluster node scaling or CoreDNS scale-down, intermittent resolution failures may occur, typically lasting for about five minutes
Root cause:
If the kube-proxy load balancing mode of the cluster is IPVS, then on nodes with kernel versions below 4.19 (e.g., CentOS), after an IPVS UDP backend is removed, newly initiated UDP packets that still match a stale connection entry for the removed backend are discarded
Solution:
Consider adopting the NodeLocal DNS solution. Since TCP is used between the NodeLocal DNS pod and the CoreDNS pod, it can tolerate the packet loss caused by this IPVS defect
