Prometheus Monitoring System Deployment Guide
Introduction to Prometheus
Prometheus is an open-source monitoring system that began as an alerting toolkit at SoundCloud. Since 2012 it has been adopted by many companies and organizations, and its developer and user community is very active, with more and more contributors joining the project. Today it is an independent open-source project that does not depend on any single company. To emphasize this and to formalize the project's governance, Prometheus joined the Cloud Native Computing Foundation in 2016 as its second hosted project, after Kubernetes. Its main features are:
- A multi-dimensional data model (a time series is identified by a metric name and a set of key/value labels).
- A flexible query language (PromQL).
- No dependency on distributed storage; both local and remote storage models are supported.
- Metrics are pulled over HTTP, which is simple and easy to reason about.
- Monitoring targets can be discovered via service discovery or configured statically.
- Support for multiple statistical data models, with graphing-friendly output.
Pre-deployment Preparation
To deploy the Prometheus monitoring system on a Kubernetes cluster provided by the CCE service, the following prerequisites must be met:
- You have an initialized Kubernetes cluster on CCE.
- You can access the cluster with kubectl as described in the user guide (a quick way to verify this is shown below).
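Before proceeding, it is worth confirming that kubectl really can reach the cluster. A minimal check, assuming kubectl is already configured with the cluster's credentials, is to query the control plane and list the nodes:
$ kubectl cluster-info
$ kubectl get nodes
If both commands return without errors and every node reports a Ready status, the prerequisites are satisfied.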
Creating the Prometheus User
To keep the duties and permissions of different users properly separated, we create a dedicated service account for the monitoring system and bind it to the required cluster role. First create a configuration file named rbac-setup.yml with the following content:
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: default
After saving the file, run:
$ kubectl create -f rbac-setup.yml
$ kubectl get sa
The output should look similar to:
NAME         SECRETS   AGE
default      1         1d
prometheus   1         8h
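Optionally, you can also confirm that the ClusterRole and ClusterRoleBinding were created and that the new service account has the expected read permissions. The impersonation check below is only a quick sanity test and requires a kubectl version that supports "kubectl auth can-i" with --as:
$ kubectl get clusterrole prometheus
$ kubectl get clusterrolebinding prometheus
$ kubectl auth can-i list nodes --as=system:serviceaccount:default:prometheus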
Creating the Configuration Object (ConfigMap)
After the role is in place, we create the ConfigMap that holds the Prometheus configuration. Create a file named prometheus-kubernetes-configmap.yml with the content below. The alerting.rules section contains user-defined alerting rules; here we use two examples, a container whose memory usage exceeds 90% and a node that becomes unavailable. See the Prometheus documentation for the alerting rule syntax:
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus
data:
  alerting.rules: |-
    # ALERT when container memory usage exceeds 90%
    ALERT container_mem_over_90
      IF (sum(container_memory_working_set_bytes{image!="",name=~"^k8s_.*", pod_name!=""}) by (pod_name)) / (sum (container_spec_memory_limit_bytes{image!="",name=~"^k8s_.*", pod_name!=""}) by (pod_name)) > 0.9 and (sum(container_memory_working_set_bytes{image!="",name=~"^k8s_.*", pod_name!=""}) by (pod_name)) / (sum (container_spec_memory_limit_bytes{image!="",name=~"^k8s_.*", pod_name!=""}) by (pod_name)) < 2
      FOR 30s
      ANNOTATIONS {
        description = "Memory Usage of Pod {{ $labels.pod_name }} on {{ $labels.kubernetes_io_hostname }} has exceeded 90%",
      }
    # ALERT when node is down
    ALERT node_down
      IF up == 0
      FOR 30s
      ANNOTATIONS {
        description = "Node {{ $labels.kubernetes_io_hostname }} is down",
      }
  prometheus.yml: |-
    rule_files:
      # alerting rules (this ConfigMap is mounted at /etc/prometheus, so the
      # "alerting.rules" key above becomes the file referenced here)
      - /etc/prometheus/alerting.rules
    alerting:
      alertmanagers:
      - scheme: http
        static_configs:
        - targets:
          - "localhost:9093"
    # A scrape configuration for running Prometheus on a Kubernetes cluster.
    # This uses separate scrape configs for cluster components (i.e. API server, node)
    # and services to allow each to use different authentication configs.
    #
    # Kubernetes labels will be added as Prometheus labels on metrics via the
    # `labelmap` relabeling action.
    #
    # If you are using Kubernetes 1.7.2 or earlier, please take note of the comments
    # for the kubernetes-cadvisor job; you will need to edit or remove this job.

    # Scrape config for API servers.
    #
    # Kubernetes exposes API servers as endpoints to the default/kubernetes
    # service so this uses `endpoints` role and uses relabelling to only keep
    # the endpoints associated with the default/kubernetes service using the
    # default named port `https`. This works for single API server deployments as
    # well as HA API server deployments.
    scrape_configs:
    - job_name: 'kubernetes-apiservers'

      kubernetes_sd_configs:
      - role: endpoints

      # Default to scraping over https. If required, just disable this or change to
      # `http`.
      scheme: https

      # This TLS & bearer token file config is used to connect to the actual scrape
      # endpoints for cluster components. This is separate to discovery auth
      # configuration because discovery & scraping are two separate concerns in
      # Prometheus. The discovery auth config is automatic if Prometheus runs inside
      # the cluster. Otherwise, more config options have to be provided within the
      # <kubernetes_sd_config>.
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        # If your node certificates are self-signed or use a different CA to the
        # master CA, then disable certificate verification below. Note that
        # certificate verification is an integral part of a secure infrastructure
        # so this should only be disabled in a controlled environment. You can
        # disable certificate verification by uncommenting the line below.
        #
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

      # Keep only the default/kubernetes service endpoints for the https port. This
      # will add targets for each API server which Kubernetes adds an endpoint to
      # the default/kubernetes service.
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

    # Scrape config for nodes (kubelet).
    #
    # Rather than connecting directly to the node, the scrape is proxied through the
    # Kubernetes apiserver. This means it will work if Prometheus is running out of
    # cluster, or can't connect to nodes for some other reason (e.g. because of
    # firewalling).
    - job_name: 'kubernetes-nodes'

      # Default to scraping over https. If required, just disable this or change to
      # `http`.
      scheme: https

      # This TLS & bearer token file config is used to connect to the actual scrape
      # endpoints for cluster components. This is separate to discovery auth
      # configuration because discovery & scraping are two separate concerns in
      # Prometheus. The discovery auth config is automatic if Prometheus runs inside
      # the cluster. Otherwise, more config options have to be provided within the
      # <kubernetes_sd_config>.
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

      kubernetes_sd_configs:
      - role: node

      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics

    # Scrape config for Kubelet cAdvisor.
    #
    # This is required for Kubernetes 1.7.3 and later, where cAdvisor metrics
    # (those whose names begin with 'container_') have been removed from the
    # Kubelet metrics endpoint. This job scrapes the cAdvisor endpoint to
    # retrieve those metrics.
    #
    # In Kubernetes 1.7.0-1.7.2, these metrics are only exposed on the cAdvisor
    # HTTP endpoint; use "replacement: /api/v1/nodes/${1}:4194/proxy/metrics"
    # in that case (and ensure cAdvisor's HTTP server hasn't been disabled with
    # the --cadvisor-port=0 Kubelet flag).
    #
    # This job is not necessary and should be removed in Kubernetes 1.6 and
    # earlier versions, or it will cause the metrics to be scraped twice.
    - job_name: 'kubernetes-cadvisor'

      # Default to scraping over https. If required, just disable this or change to
      # `http`.
      scheme: https

      # This TLS & bearer token file config is used to connect to the actual scrape
      # endpoints for cluster components. This is separate to discovery auth
      # configuration because discovery & scraping are two separate concerns in
      # Prometheus. The discovery auth config is automatic if Prometheus runs inside
      # the cluster. Otherwise, more config options have to be provided within the
      # <kubernetes_sd_config>.
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

      kubernetes_sd_configs:
      - role: node

      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor

    # Scrape config for service endpoints.
    #
    # The relabeling allows the actual service scrape endpoint to be configured
    # via the following annotations:
    #
    # * `prometheus.io/scrape`: Only scrape services that have a value of `true`
    # * `prometheus.io/scheme`: If the metrics endpoint is secured then you will need
    #   to set this to `https` & most likely set the `tls_config` of the scrape config.
    # * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
    # * `prometheus.io/port`: If the metrics are exposed on a different port to the
    #   service then set this appropriately.
    - job_name: 'kubernetes-service-endpoints'

      kubernetes_sd_configs:
      - role: endpoints

      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name

    # Example scrape config for probing services via the Blackbox Exporter.
    #
    # The relabeling allows the actual service scrape endpoint to be configured
    # via the following annotations:
    #
    # * `prometheus.io/probe`: Only probe services that have a value of `true`
    - job_name: 'kubernetes-services'

      metrics_path: /probe
      params:
        module: [http_2xx]

      kubernetes_sd_configs:
      - role: service

      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
        action: keep
        regex: true
      - source_labels: [__address__]
        target_label: __param_target
      - target_label: __address__
        replacement: blackbox-exporter.example.com:9115
      - source_labels: [__param_target]
        target_label: instance
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        target_label: kubernetes_name

    # Example scrape config for probing ingresses via the Blackbox Exporter.
    #
    # The relabeling allows the actual ingress scrape endpoint to be configured
    # via the following annotations:
    #
    # * `prometheus.io/probe`: Only probe ingresses that have a value of `true`
    - job_name: 'kubernetes-ingresses'

      metrics_path: /probe
      params:
        module: [http_2xx]

      kubernetes_sd_configs:
      - role: ingress

      relabel_configs:
      - source_labels: [__meta_kubernetes_ingress_annotation_prometheus_io_probe]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_ingress_scheme,__address__,__meta_kubernetes_ingress_path]
        regex: (.+);(.+);(.+)
        replacement: ${1}://${2}${3}
        target_label: __param_target
      - target_label: __address__
        replacement: blackbox-exporter.example.com:9115
      - source_labels: [__param_target]
        target_label: instance
      - action: labelmap
        regex: __meta_kubernetes_ingress_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_ingress_name]
        target_label: kubernetes_name

    # Example scrape config for pods.
    #
    # The relabeling allows the actual pod scrape endpoint to be configured via the
    # following annotations:
    #
    # * `prometheus.io/scrape`: Only scrape pods that have a value of `true`
    # * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
    # * `prometheus.io/port`: Scrape the pod on the indicated port instead of the
    #   pod's declared ports (default is a port-free target if none are declared).
    - job_name: 'kubernetes-pods'

      kubernetes_sd_configs:
      - role: pod

      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
Next, create the configuration object for the AlertManager alerting component. The file alertmanager-kubernetes-configmap.yml is shown below; replace the SMTP settings and the mail recipient with values that work in your environment:
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager
data:
  alertmanager.yml: |-
    global:
      # Set your own SMTP server and authentication parameters here
      smtp_smarthost: 'localhost:25'
      smtp_from: 'addr@domain.com'
      smtp_auth_username: 'username@domain.com'
      smtp_auth_password: 'password'
    # The directory from which notification templates are read.
    templates:
    - '/etc/alertmanager/template/*.tmpl'
    # The root route on which each incoming alert enters.
    route:
      # The labels by which incoming alerts are grouped together. For example,
      # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
      # be batched into a single group.
      group_by: ['alertname', 'pod_name']
      # When a new group of alerts is created by an incoming alert, wait at
      # least 'group_wait' to send the initial notification.
      # This way ensures that you get multiple alerts for the same group that start
      # firing shortly after another are batched together on the first
      # notification.
      group_wait: 30s
      # When the first notification was sent, wait 'group_interval' to send a batch
      # of new alerts that started firing for that group.
      group_interval: 5m
      # If an alert has successfully been sent, wait 'repeat_interval' to
      # resend them.
      repeat_interval: 3h
      # A default receiver
      receiver: AlertMail
    receivers:
    - name: 'AlertMail'
      email_configs:
      - to: 'receiver@domain.com'  # change this to the alert recipient's email address
After editing, create the corresponding ConfigMap objects with kubectl:
$ kubectl create -f prometheus-kubernetes-configmap.yml
$ kubectl create -f alertmanager-kubernetes-configmap.yml
$ kubectl get configmaps
The output should look similar to:
NAME           DATA      AGE
alertmanager   1         29s
prometheus     2         36s
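If you want to double-check what was actually stored, you can print the ConfigMap back out of the cluster. This is only an optional sanity check; note that the jsonpath expression escapes the dot in the data key name:
$ kubectl get configmap prometheus -o yaml
$ kubectl get configmap prometheus -o jsonpath='{.data.prometheus\.yml}'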
Creating the Node Exporter
The default monitoring configuration collects relatively coarse information about the cluster nodes. To obtain more fine-grained node metrics, we deploy the Node Exporter service on every node of the Kubernetes cluster; a Kubernetes DaemonSet is a natural fit for running the Node Exporter on each node. The node-exporter.yaml file below creates the DaemonSet and its associated Service:
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: 'true'
  labels:
    app: node-exporter
    name: node-exporter
  name: node-exporter
spec:
  clusterIP: None
  ports:
  - name: scrape
    port: 9100
    protocol: TCP
  selector:
    app: node-exporter
  type: ClusterIP

---
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: node-exporter
spec:
  template:
    metadata:
      labels:
        app: node-exporter
        name: node-exporter
    spec:
      containers:
      - image: hub.baidubce.com/public/node-exporter:latest
        name: node-exporter
        ports:
        - containerPort: 9100
          hostPort: 9100
          name: scrape
      hostNetwork: true
      hostPID: true
Then create the objects:
$ kubectl create -f node-exporter.yaml
$ kubectl get daemonsets
NAME            DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE-SELECTOR   AGE
node-exporter   2         2         2         2            2           <none>          8h
$ kubectl get services
NAME            CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
kubernetes      172.18.0.1   <none>        443/TCP    1d
node-exporter   None         <none>        9100/TCP   8h
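Because the DaemonSet runs with hostNetwork and hostPort 9100, each node exposes the exporter on its own address. As a quick check from a machine that can reach the node network, you can fetch a few metrics directly; <node-ip> below is a placeholder for the internal IP of any cluster node (for example, one listed by kubectl get nodes -o wide):
$ curl -s http://<node-ip>:9100/metrics | head -n 20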
Creating Prometheus and Its Service
Finally, we create the Prometheus Deployment and its Service, which aggregate and display the monitoring data. Create the file prometheus-deployment.yaml with the following content and deploy it:
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: 'true'
  labels:
    name: prometheus
  name: prometheus
spec:
  selector:
    app: prometheus
  type: LoadBalancer
  ports:
  - name: prometheus
    protocol: TCP
    port: 9090
    nodePort: 30900

---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      name: prometheus
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      containers:
      - name: prometheus
        image: hub.baidubce.com/public/prometheus:latest
        args:
        - '-storage.local.retention=6h'
        - '-storage.local.memory-chunks=500000'
        - '-config.file=/etc/prometheus/prometheus.yml'
        ports:
        - name: web
          containerPort: 9090
        volumeMounts:
        - name: prometheus-config-volume
          mountPath: /etc/prometheus
      - name: alertmanager
        image: hub.baidubce.com/public/alertmanager:latest
        args:
        - '-config.file=/etc/alertmanager/alertmanager.yml'
        ports:
        - name: web
          containerPort: 9093
        volumeMounts:
        - name: alertmanager-config-volume
          mountPath: /etc/alertmanager
      #imagePullSecrets:
      #- name: myregistrykey
      volumes:
      - name: prometheus-config-volume
        configMap:
          name: prometheus
      - name: alertmanager-config-volume
        configMap:
          name: alertmanager
Run the following commands:
$ kubectl create -f prometheus-deployment.yaml
$ kubectl get deployments
The output should look similar to:
NAME         DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
prometheus   1         1         1            1           8h
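Since the Prometheus pod runs two containers (prometheus and alertmanager), it is worth confirming that both started cleanly. A minimal check, with <pod-name> standing in for the pod name printed by the first command:
$ kubectl get pods -l app=prometheus
$ kubectl describe pod <pod-name>
$ kubectl logs <pod-name> -c prometheus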
Then run the following command:
$ kubectl get services
The output should look similar to:
NAME            CLUSTER-IP       EXTERNAL-IP      PORT(S)          AGE
kubernetes      172.18.0.1       <none>           443/TCP          1d
node-exporter   None             <none>           9100/TCP         8h
prometheus      172.18.164.101   180.72.136.254   9090:30900/TCP   8h
As shown above, Prometheus can now be reached at 180.72.136.254:9090 to monitor the cluster. The monitoring page looks like the figure below:
(Figure: Prometheus web UI monitoring page)
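If you prefer the command line, the same data is available through the Prometheus HTTP API. The address below is the EXTERNAL-IP from the example service listing above; substitute the value reported by your own cluster:
$ curl -s 'http://180.72.136.254:9090/api/v1/query?query=up'
Any PromQL expression, including the ones used in alerting.rules, can also be entered interactively in the web UI's expression browser.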