Table of contents on this page
  • Introduction to Prometheus
  • Pre-deployment preparation
  • Create a Prometheus user
  • Create a configuration object (ConfigMap)
  • Create Node Exporter
  • Create Prometheus and its associated service

Prometheus Monitoring System Deployment Guide

Updated at: 2025-10-27

Introduction to Prometheus

Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud in 2012. Since then it has been adopted by many companies and organizations, and it has a very active developer and user community. It is now a standalone open-source project maintained independently of any single company. To emphasize this independence and formalize its governance, Prometheus joined the Cloud Native Computing Foundation (CNCF) in 2016 as its second hosted project, after Kubernetes. Its key features include:

  • A multi-dimensional data model: time series are identified by a metric name and a set of key/value labels.
  • A flexible query language, PromQL, for selecting and aggregating this data.
  • Storage without external dependencies: single server nodes are autonomous, and both local and remote storage are supported.
  • Pull-based collection over HTTP, which keeps the data flow simple and easy to reason about.
  • Monitoring targets configured through service discovery or static configuration.
  • Support for multiple modes of graphing and dashboarding.
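
To make the pull model and PromQL concrete, here is a minimal sketch that asks a Prometheus server's HTTP API to evaluate a query; the address prometheus.example.com:9090 is a placeholder, and the metric shown is the standard cAdvisor CPU counter:

Bash
# Evaluate a PromQL expression over the HTTP API:
# per-pod CPU usage rate, averaged over the last 5 minutes.
$ curl -sG 'http://prometheus.example.com:9090/api/v1/query' \
    --data-urlencode 'query=sum(rate(container_cpu_usage_seconds_total[5m])) by (pod_name)'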

Pre-deployment preparation

To successfully deploy Prometheus in a Kubernetes cluster provided by the CCE service, first complete the following prerequisites:

  • You have created a Kubernetes cluster on CCE and it has finished initializing.
  • You can access the cluster with kubectl by following the [guide document](CCE/Operation guide/Operation process.md).

Create a Prometheus user

To keep user roles and permissions cleanly separated, we create a dedicated user for the monitoring system and bind it to an appropriate cluster role. First, create a configuration file named rbac-setup.yml with the following content:

YAML
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: default

When you have finished editing, execute:

Bash
$ kubectl create -f rbac-setup.yml
$ kubectl get sa

A response similar to the following is returned:

Plain Text
NAME         SECRETS   AGE
default      1         1d
prometheus   1         8h
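
Optionally, you can verify that the binding took effect by impersonating the new ServiceAccount with kubectl auth can-i (available in recent kubectl versions; this requires that your own account is allowed to impersonate):

Bash
# Confirm the ServiceAccount can read the resources Prometheus will scrape
$ kubectl auth can-i list nodes --as=system:serviceaccount:default:prometheus
yes
$ kubectl auth can-i watch pods --as=system:serviceaccount:default:prometheus
yes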

Create a configuration object (ConfigMap)

After creating the roles, we need to create a configuration object (ConfigMap) for Prometheus. Edit the file prometheus-kubernetes-configmap.yml with the content below. The alerting.rules key contains user-defined alerting rules; as examples, we define alerts for "container memory usage exceeding 90%" and "node down". For the alerting rule syntax, refer to the Prometheus documentation:

YAML
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus
data:
  alerting.rules: |-
      # ALERT when container memory usage exceeds 90%
      ALERT container_mem_over_90
        IF (sum(container_memory_working_set_bytes{image!="",name=~"^k8s_.*", pod_name!=""}) by (pod_name)) / (sum (container_spec_memory_limit_bytes{image!="",name=~"^k8s_.*", pod_name!=""}) by (pod_name)) > 0.9 and (sum(container_memory_working_set_bytes{image!="",name=~"^k8s_.*", pod_name!=""}) by (pod_name)) / (sum (container_spec_memory_limit_bytes{image!="",name=~"^k8s_.*", pod_name!=""}) by (pod_name)) < 2
        FOR 30s
        ANNOTATIONS {
          description = "Memory Usage of Pod {{ $labels.pod_name }} on {{ $labels.kubernetes_io_hostname }} has exceeded 90%",
        }
      # ALERT when node is down
      ALERT node_down
        IF up == 0
        FOR 30s
        ANNOTATIONS {
          description = "Node {{ $labels.kubernetes_io_hostname }} is down",
        }
  prometheus.yml: |-
      rule_files:
        # alerting rules
        - /etc/prometheus/alerting.rules
      alerting:
        alertmanagers:
        - scheme: http
          static_configs:
          - targets:
            - "localhost:9093"
      # A scrape configuration for running Prometheus on a Kubernetes cluster.
      # This uses separate scrape configs for cluster components (i.e. API server, node)
      # and services to allow each to use different authentication configs.
      #
      # Kubernetes labels will be added as Prometheus labels on metrics via the
      # `labelmap` relabeling action.
      #
      # If you are using Kubernetes 1.7.2 or earlier, please take note of the comments
      # for the kubernetes-cadvisor job; you will need to edit or remove this job.

      # Scrape config for API servers.
      #
      # Kubernetes exposes API servers as endpoints to the default/kubernetes
      # service so this uses `endpoints` role and uses relabelling to only keep
      # the endpoints associated with the default/kubernetes service using the
      # default named port `https`. This works for single API server deployments as
      # well as HA API server deployments.
      scrape_configs:
      - job_name: 'kubernetes-apiservers'

        kubernetes_sd_configs:
        - role: endpoints

        # Default to scraping over https. If required, just disable this or change to
        # `http`.
        scheme: https

        # This TLS & bearer token file config is used to connect to the actual scrape
        # endpoints for cluster components. This is separate to discovery auth
        # configuration because discovery & scraping are two separate concerns in
        # Prometheus. The discovery auth config is automatic if Prometheus runs inside
        # the cluster. Otherwise, more config options have to be provided within the
        # <kubernetes_sd_config>.
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          # If your node certificates are self-signed or use a different CA to the
          # master CA, then disable certificate verification below. Note that
          # certificate verification is an integral part of a secure infrastructure
          # so this should only be disabled in a controlled environment. You can
          # disable certificate verification by uncommenting the line below.
          #
          insecure_skip_verify: true
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

        # Keep only the default/kubernetes service endpoints for the https port. This
        # will add targets for each API server which Kubernetes adds an endpoint to
        # the default/kubernetes service.
        relabel_configs:
        - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
          action: keep
          regex: default;kubernetes;https

      # Scrape config for nodes (kubelet).
      #
      # Rather than connecting directly to the node, the scrape is proxied through the
      # Kubernetes apiserver.  This means it will work if Prometheus is running out of
      # cluster, or can't connect to nodes for some other reason (e.g. because of
      # firewalling).
      - job_name: 'kubernetes-nodes'

        # Default to scraping over https. If required, just disable this or change to
        # `http`.
        scheme: https

        # This TLS & bearer token file config is used to connect to the actual scrape
        # endpoints for cluster components. This is separate to discovery auth
        # configuration because discovery & scraping are two separate concerns in
        # Prometheus. The discovery auth config is automatic if Prometheus runs inside
        # the cluster. Otherwise, more config options have to be provided within the
        # <kubernetes_sd_config>.
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          insecure_skip_verify: true
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

        kubernetes_sd_configs:
        - role: node

        relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - target_label: __address__
          replacement: kubernetes.default.svc:443
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/${1}/proxy/metrics

      # Scrape config for Kubelet cAdvisor.
      #
      # This is required for Kubernetes 1.7.3 and later, where cAdvisor metrics
      # (those whose names begin with 'container_') have been removed from the
      # Kubelet metrics endpoint.  This job scrapes the cAdvisor endpoint to
      # retrieve those metrics.
      #
      # In Kubernetes 1.7.0-1.7.2, these metrics are only exposed on the cAdvisor
      # HTTP endpoint; use "replacement: /api/v1/nodes/${1}:4194/proxy/metrics"
      # in that case (and ensure cAdvisor's HTTP server hasn't been disabled with
      # the --cadvisor-port=0 Kubelet flag).
      #
      # This job is not necessary and should be removed in Kubernetes 1.6 and
      # earlier versions, or it will cause the metrics to be scraped twice.
      - job_name: 'kubernetes-cadvisor'

        # Default to scraping over https. If required, just disable this or change to
        # `http`.
        scheme: https

        # This TLS & bearer token file config is used to connect to the actual scrape
        # endpoints for cluster components. This is separate to discovery auth
        # configuration because discovery & scraping are two separate concerns in
        # Prometheus. The discovery auth config is automatic if Prometheus runs inside
        # the cluster. Otherwise, more config options have to be provided within the
        # <kubernetes_sd_config>.
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

        kubernetes_sd_configs:
        - role: node

        relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - target_label: __address__
          replacement: kubernetes.default.svc:443
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor

      # Scrape config for service endpoints.
      #
      # The relabeling allows the actual service scrape endpoint to be configured
      # via the following annotations:
      #
      # * `prometheus.io/scrape`: Only scrape services that have a value of `true`
      # * `prometheus.io/scheme`: If the metrics endpoint is secured then you will need
      # to set this to `https` & most likely set the `tls_config` of the scrape config.
      # * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
      # * `prometheus.io/port`: If the metrics are exposed on a different port to the
      # service then set this appropriately.
      - job_name: 'kubernetes-service-endpoints'

        kubernetes_sd_configs:
        - role: endpoints

        relabel_configs:
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
          action: replace
          target_label: __scheme__
          regex: (https?)
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
          action: replace
          target_label: __address__
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:$2
        - action: labelmap
          regex: __meta_kubernetes_service_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_service_name]
          action: replace
          target_label: kubernetes_name

      # Example scrape config for probing services via the Blackbox Exporter.
      #
      # The relabeling allows the actual service scrape endpoint to be configured
      # via the following annotations:
      #
      # * `prometheus.io/probe`: Only probe services that have a value of `true`
      - job_name: 'kubernetes-services'

        metrics_path: /probe
        params:
          module: [http_2xx]

        kubernetes_sd_configs:
        - role: service

        relabel_configs:
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
          action: keep
          regex: true
        - source_labels: [__address__]
          target_label: __param_target
        - target_label: __address__
          replacement: blackbox-exporter.example.com:9115
        - source_labels: [__param_target]
          target_label: instance
        - action: labelmap
          regex: __meta_kubernetes_service_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_service_name]
          target_label: kubernetes_name

      # Example scrape config for probing ingresses via the Blackbox Exporter.
      #
      # The relabeling allows the actual ingress scrape endpoint to be configured
      # via the following annotations:
      #
      # * `prometheus.io/probe`: Only probe ingresses that have a value of `true`
      - job_name: 'kubernetes-ingresses'

        metrics_path: /probe
        params:
          module: [http_2xx]

        kubernetes_sd_configs:
          - role: ingress

        relabel_configs:
          - source_labels: [__meta_kubernetes_ingress_annotation_prometheus_io_probe]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_ingress_scheme,__address__,__meta_kubernetes_ingress_path]
            regex: (.+);(.+);(.+)
            replacement: ${1}://${2}${3}
            target_label: __param_target
          - target_label: __address__
            replacement: blackbox-exporter.example.com:9115
          - source_labels: [__param_target]
            target_label: instance
          - action: labelmap
            regex: __meta_kubernetes_ingress_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_ingress_name]
            target_label: kubernetes_name

      # Example scrape config for pods
      #
      # The relabeling allows the actual pod scrape endpoint to be configured via the
      # following annotations:
      #
      # * `prometheus.io/scrape`: Only scrape pods that have a value of `true`
      # * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
      # * `prometheus.io/port`: Scrape the pod on the indicated port instead of the
      # pod's declared ports (default is a port-free target if none are declared).
      - job_name: 'kubernetes-pods'

        kubernetes_sd_configs:
        - role: pod

        relabel_configs:
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
          action: replace
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:$2
          target_label: __address__
        - action: labelmap
          regex: __meta_kubernetes_pod_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_pod_name]
          action: replace
          target_label: kubernetes_pod_name

Next, create a configuration object for the alerting component (Alertmanager). Edit the file alertmanager-kubernetes-configmap.yml with the following content, replacing the SMTP settings and email recipients with valid values:

YAML
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager
data:
  alertmanager.yml: |-
      global:
        # Set your own SMTP server and authentication parameters
        smtp_smarthost: 'localhost:25'
        smtp_from: 'addr@domain.com'
        smtp_auth_username: 'username@domain.com'
        smtp_auth_password: 'password'
      # The directory from which notification templates are read.
      templates:
      - '/etc/alertmanager/template/*.tmpl'
      # The root route on which each incoming alert enters.
      route:
        # The labels by which incoming alerts are grouped together. For example,
        # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
        # be batched into a single group.
        group_by: ['alertname', 'pod_name']
        # When a new group of alerts is created by an incoming alert, wait at
        # least 'group_wait' to send the initial notification.
        # This way ensures that you get multiple alerts for the same group that start
        # firing shortly after another are batched together on the first
        # notification.
        group_wait: 30s
        # When the first notification was sent, wait 'group_interval' to send a batch
        # of new alerts that started firing for that group.
        group_interval: 5m
        # If an alert has successfully been sent, wait 'repeat_interval' to
        # resend them.
        repeat_interval: 3h
        # A default receiver
        receiver: AlertMail
      receivers:
      - name: 'AlertMail'
        email_configs:
        - to: 'receiver@domain.com' # Replace with the alert recipient's email address

After editing, use kubectl to create the corresponding ConfigMap objects:

Bash
$ kubectl create -f prometheus-kubernetes-configmap.yml
$ kubectl create -f alertmanager-kubernetes-configmap.yml
$ kubectl get configmaps

A response similar to the following is returned:

Plain Text
NAME           DATA      AGE
alertmanager   1         29s
prometheus     2         36s
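
When a ConfigMap is mounted into a pod as a volume (as we do below), each key becomes a file under the mount path, so the prometheus ConfigMap will surface as /etc/prometheus/prometheus.yml and /etc/prometheus/alerting.rules. As a quick sanity check before deploying, you can list the keys; a sketch using a go-template:

Bash
# Print each data key in the ConfigMap, one per line
$ kubectl get configmap prometheus -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}'
alerting.rules
prometheus.yml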

Create Node Exporter

By default, only limited resource information is collected from cluster nodes. To obtain more detailed node metrics, we deploy the Node Exporter service on every node in the Kubernetes cluster, using a Kubernetes DaemonSet so that one instance runs on each node. The node-exporter.yaml file that creates the DaemonSet and its Service is as follows:

YAML
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: 'true'
  labels:
    app: node-exporter
    name: node-exporter
  name: node-exporter
spec:
  clusterIP: None
  ports:
  - name: scrape
    port: 9100
    protocol: TCP
  selector:
    app: node-exporter
  type: ClusterIP

---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
      name: node-exporter
    spec:
      containers:
      - image: hub.baidubce.com/public/node-exporter:latest
        name: node-exporter
        ports:
        - containerPort: 9100
          hostPort: 9100
          name: scrape
      hostNetwork: true
      hostPID: true

Then create the relevant objects:

Bash
$ kubectl create -f node-exporter.yaml
$ kubectl get daemonsets
NAME            DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE-SELECTOR   AGE
node-exporter   2         2         2         2            2           <none>          8h
$ kubectl get services
NAME            CLUSTER-IP       EXTERNAL-IP      PORT(S)          AGE
kubernetes      172.18.0.1       <none>           443/TCP          1d
node-exporter   None             <none>           9100/TCP         8h
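
Because the DaemonSet runs with hostNetwork: true, each node now serves metrics on port 9100 of its own address. A quick spot check (replace <node-ip> with any node's IP from the kubectl output):

Bash
# Find where the exporter pods landed, then pull a few raw metrics from one node
$ kubectl get pods -l app=node-exporter -o wide
$ curl -s http://<node-ip>:9100/metrics | head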

Create Prometheus and its associated service

Finally, we create Prometheus itself, along with the Service that exposes it for data aggregation and display. Create the deployment file prometheus-deployment.yaml with the following content:

YAML
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: 'true'
  labels:
    name: prometheus
  name: prometheus
spec:
  selector:
    app: prometheus
  type: LoadBalancer
  ports:
  - name: prometheus
    protocol: TCP
    port: 9090
    nodePort: 30900

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      name: prometheus
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      containers:
      - name: prometheus
        image: hub.baidubce.com/public/prometheus:latest
        args:
          - '-storage.local.retention=6h'
          - '-storage.local.memory-chunks=500000'
          - '-config.file=/etc/prometheus/prometheus.yml'
        ports:
        - name: web
          containerPort: 9090
        volumeMounts:
        - name: prometheus-config-volume
          mountPath: /etc/prometheus
      - name: alertmanager
        image: hub.baidubce.com/public/alertmanager:latest
        args:
          - '-config.file=/etc/alertmanager/alertmanager.yml'
        ports:
        - name: web
          containerPort: 9093
        volumeMounts:
        - name: alertmanager-config-volume
          mountPath: /etc/alertmanager
      #imagePullSecrets:
      #- name: myregistrykey
      volumes:
      - name: prometheus-config-volume
        configMap:
          name: prometheus
      - name: alertmanager-config-volume
        configMap:
          name: alertmanager

Execute the following command:

Bash
$ kubectl create -f prometheus-deployment.yaml
$ kubectl get deployments

A response similar to the following is returned:

Plain Text
NAME         DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
prometheus   1         1         1            1           8h
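
You can also confirm that both containers are running and that the ConfigMap keys were mounted as files; the pod name below is a placeholder, and the expected listing assumes the ConfigMap shown earlier:

Bash
$ kubectl get pods -l app=prometheus
# Both the prometheus and alertmanager containers should be ready (2/2)
$ kubectl exec <prometheus-pod-name> -c prometheus -- ls /etc/prometheus
alerting.rules
prometheus.yml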

Execute the following command:

Bash
$ kubectl get services

A response similar to the following is returned:

Plain Text
NAME            CLUSTER-IP       EXTERNAL-IP      PORT(S)          AGE
kubernetes      172.18.0.1       <none>           443/TCP          1d
node-exporter   None             <none>           9100/TCP         8h
prometheus      172.18.164.101   180.72.136.254   9090:30900/TCP   8h

As shown above, the Prometheus web UI is now reachable at 180.72.136.254:9090 (the Service's external IP), where you can browse metrics and monitor the cluster.
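
As a final check, a sketch that queries the HTTP API for the up series (every healthy scrape target should report 1) and optionally fires a synthetic alert at Alertmanager to exercise the email pipeline; the external IP matches the example output above, and the pod name is a placeholder:

Bash
# Every healthy target should report up == 1
$ curl -sG 'http://180.72.136.254:9090/api/v1/query' --data-urlencode 'query=up'

# Send a test alert to Alertmanager through a local port-forward
$ kubectl port-forward <prometheus-pod-name> 9093:9093 &
$ curl -XPOST -H 'Content-Type: application/json' http://localhost:9093/api/v1/alerts \
    -d '[{"labels":{"alertname":"TestAlert","severity":"info"}}]'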
