Home | 简体中文 | 繁体中文 | 杂文 | Github | 知乎专栏 | 51CTO学院 | CSDN程序员研修院 | OSChina 博客 | 腾讯云社区 | 阿里云栖社区 | Facebook | Linkedin | Youtube | 打赏(Donations) | About
知乎专栏多维度架构

69.2. Prometheus 配置

69.2.1. Prometheus 命令行工具

69.2.1.1. 刷新配置文件

			
#方式1: 
kill -HUP ${prometheus_pid} 

docker kill -s HUP <容器ID> 
 
#方式2: 
# 需要 --web.enable-lifecycle 参数为true
curl -X POST http://10.0.209.140:9090/-/reload			
			
			

69.2.1.2. promtool 配置文件校验工具

安装 promtool

		
go get github.com/prometheus/prometheus/cmd/promtool
promtool check rules /path/to/example.rules.yml		
		
			

		
promtool check config /etc/prometheus/prometheus.yml
		
			

69.2.2. rules 规则配置

prometheus.yml 配置文件

			
rule_files:
  - "rules/node.yml"	# 载入单个配置文件
  - "rules/*.rules"		# 通过通配符载入文件
			
		

prometheus 支持两种 rules

  • recording rules
  • alerting rules

69.2.2.1. recording rules

			
groups:
- name: cpu-node
  rules:
  - record: job_instance_mode:node_cpu_seconds:avg_rate5m
    expr: avg by (job, instance, mode) (rate(node_cpu_seconds_total[5m]))			
			
			

69.2.2.2. alerting rules

			
groups:
- name: example
  rules:

  # Alert for any instance that is unreachable for >5 minutes.
  - alert: InstanceDown
    expr: up == 0
    for: 5m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."

  # Alert for any instance that has a median request latency >1s.
  - alert: APIHighRequestLatency
    expr: api_http_request_latencies_second{quantile="0.5"} > 1
    for: 10m
    annotations:
      summary: "High request latency on {{ $labels.instance }}"
      description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"			
			
			

69.2.3. SpringBoot

Maven pom.xml 文件中增加依赖

		
        <dependency>
            <groupId>io.micrometer</groupId>
            <artifactId>micrometer-registry-prometheus</artifactId>
        </dependency>		
		
		

打包后运行 Springboot 项目,然后使用 /actuator/prometheus 地址测试是否有监控数据输出。 https://api.netkiller.cn/actuator/prometheus

/etc/prometheus/prometheus.yml 增加如下配置:

		
  - job_name: 'springboot'
    scrape_interval: 5s
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['127.0.0.1:8080']	
		
		

Grafana 面板ID:4701

69.2.4. PromQL 自定义查询语言

69.2.4.1. Metrics 格式

Metric 的格式: metric 名称 {标签名=标签值} 监控样本

			
<metric name>{<label name>=<label value>, ...} <sample>
			
			

指标的名称(metric name)用于定义监控样本的含义,名称只能由ASCII字符、数字、下划线以及冒号组成并必须符合正则表达式[a-zA-Z_:][a-zA-Z0-9_:]*

标签(label)反映了当前样本的特征维度,通过这些维度Prometheus可以对样本数据进行过滤,聚合等。标签的名称只能由ASCII字符、数字以及下划线组成并满足正则表达式[a-zA-Z_][a-zA-Z0-9_]*

			
neo@MacBook-Pro-Neo ~ % curl -s http://localhost:9100/metrics | grep node_cpu_seconds_total
# HELP node_cpu_seconds_total Seconds the cpus spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 16761.9
node_cpu_seconds_total{cpu="0",mode="iowait"} 2.91
node_cpu_seconds_total{cpu="0",mode="irq"} 0
node_cpu_seconds_total{cpu="0",mode="nice"} 0
node_cpu_seconds_total{cpu="0",mode="softirq"} 5.76
node_cpu_seconds_total{cpu="0",mode="steal"} 0
node_cpu_seconds_total{cpu="0",mode="system"} 440.28
node_cpu_seconds_total{cpu="0",mode="user"} 135.58
node_cpu_seconds_total{cpu="1",mode="idle"} 16851.16
node_cpu_seconds_total{cpu="1",mode="iowait"} 1.81
node_cpu_seconds_total{cpu="1",mode="irq"} 0
node_cpu_seconds_total{cpu="1",mode="nice"} 0
node_cpu_seconds_total{cpu="1",mode="softirq"} 1.33
node_cpu_seconds_total{cpu="1",mode="steal"} 0
node_cpu_seconds_total{cpu="1",mode="system"} 440.52
node_cpu_seconds_total{cpu="1",mode="user"} 125.7
node_cpu_seconds_total{cpu="2",mode="idle"} 16792.57
node_cpu_seconds_total{cpu="2",mode="iowait"} 2.52
node_cpu_seconds_total{cpu="2",mode="irq"} 0
node_cpu_seconds_total{cpu="2",mode="nice"} 0
node_cpu_seconds_total{cpu="2",mode="softirq"} 1.36
node_cpu_seconds_total{cpu="2",mode="steal"} 0
node_cpu_seconds_total{cpu="2",mode="system"} 445.29
node_cpu_seconds_total{cpu="2",mode="user"} 129.73
node_cpu_seconds_total{cpu="3",mode="idle"} 16844.57
node_cpu_seconds_total{cpu="3",mode="iowait"} 1.16
node_cpu_seconds_total{cpu="3",mode="irq"} 0
node_cpu_seconds_total{cpu="3",mode="nice"} 0
node_cpu_seconds_total{cpu="3",mode="softirq"} 1.24
node_cpu_seconds_total{cpu="3",mode="steal"} 0
node_cpu_seconds_total{cpu="3",mode="system"} 430.82
node_cpu_seconds_total{cpu="3",mode="user"} 135.15
			
			

69.2.4.2. metric 类型

Prometheus 定义了4种不同的指标类型(metric type):

  • Counter(计数器)
  • Gauge(仪表盘)
  • Histogram(直方图)
  • Summary(摘要)
Counter:只增不减的计数器

Counter 例子

			
neo@MacBook-Pro-Neo ~ % curl -s http://localhost:9100/metrics | grep node_cpu_seconds_total
# HELP node_cpu_seconds_total Seconds the cpus spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 16761.9			
			
				
Gauge:可增可减的仪表盘

Gauge 类型的指标侧重于反应系统的当前状态,指标的样本数据可增可减。常用于内存容量的监控。

				
neo@MacBook-Pro-Neo ~ % curl -s http://localhost:9100/metrics | grep node_memory_MemFree
# HELP node_memory_MemFree_bytes Memory information field MemFree_bytes.
# TYPE node_memory_MemFree_bytes gauge
node_memory_MemFree_bytes 2.933243904e+09				
				
				
Histogram
				
neo@MacBook-Pro-Neo ~ % curl -s http://localhost:9090/metrics | grep prometheus_tsdb_compaction_chunk_range
# HELP prometheus_tsdb_compaction_chunk_range_seconds Final time range of chunks on their first compaction
# TYPE prometheus_tsdb_compaction_chunk_range_seconds histogram
prometheus_tsdb_compaction_chunk_range_seconds_bucket{le="100"} 2
prometheus_tsdb_compaction_chunk_range_seconds_bucket{le="400"} 2
prometheus_tsdb_compaction_chunk_range_seconds_bucket{le="1600"} 2
prometheus_tsdb_compaction_chunk_range_seconds_bucket{le="6400"} 2
prometheus_tsdb_compaction_chunk_range_seconds_bucket{le="25600"} 2
prometheus_tsdb_compaction_chunk_range_seconds_bucket{le="102400"} 3
prometheus_tsdb_compaction_chunk_range_seconds_bucket{le="409600"} 1506
prometheus_tsdb_compaction_chunk_range_seconds_bucket{le="1.6384e+06"} 1558
prometheus_tsdb_compaction_chunk_range_seconds_bucket{le="6.5536e+06"} 4564
prometheus_tsdb_compaction_chunk_range_seconds_bucket{le="2.62144e+07"} 4564
prometheus_tsdb_compaction_chunk_range_seconds_bucket{le="+Inf"} 4564
prometheus_tsdb_compaction_chunk_range_seconds_sum 5.85524936e+09
prometheus_tsdb_compaction_chunk_range_seconds_count 4564				
				
				
Summary
				
neo@MacBook-Pro-Neo ~ % curl -s http://localhost:9090/metrics | grep prometheus_tsdb_wal_fsync_duration_seconds
# HELP prometheus_tsdb_wal_fsync_duration_seconds Duration of WAL fsync.
# TYPE prometheus_tsdb_wal_fsync_duration_seconds summary
prometheus_tsdb_wal_fsync_duration_seconds{quantile="0.5"} NaN
prometheus_tsdb_wal_fsync_duration_seconds{quantile="0.9"} NaN
prometheus_tsdb_wal_fsync_duration_seconds{quantile="0.99"} NaN
prometheus_tsdb_wal_fsync_duration_seconds_sum 1.63e-05
prometheus_tsdb_wal_fsync_duration_seconds_count 1				
				
				

69.2.4.3. 查询时间序列

标签查询

查询 instance="node-exporter:9100"

			
node_cpu_seconds_total{instance="node-exporter:9100"}			
			
				

mode!="irq" 排出 irq

			
node_cpu_seconds_total{mode!="irq"}			
			
				

查询所有 mode="user"

			
{mode="user"}	
			
				

正则查询

			
node_cpu_seconds_total{mode=~"user|system|nice"}			
restful_api_requests_total{environment=~"staging|testing|development",method!="GET"}	

{instance =~"n.*"} 		
			
				

正则排除

			
node_cpu_seconds_total{mode!~"steal|softirq|irq|iowait|idle"}			
			
				
范围查询

PromQL的时间范围选择器支持时间单位:

  1. s - 秒
  2. m - 分钟
  3. h - 小时
  4. d - 天
  5. w - 周
  6. y - 年

该表达式将会查询返回时间序列中最近5分钟的所有样本数据:

			
rate(node_memory_MemAvailable_bytes{}[5m])
			
				

可以使用offset时间位移操作:

			
node_memory_MemAvailable_bytes{} offset 5m
rate(node_load1{}[5m] offset 1m)	
			
				
数学运算

PromQL 支持:数学运算符,逻辑运算符,布尔运算符

PromQL操作符中优先级由高到低依次为:

  • ^
  • *, /, %
  • +, -
  • ==, !=, <=, <, >=, >
  • and, unless
  • or

Bytes 转 MB 的例子

			
node_memory_MemFree_bytes /  (1024 * 1024)			
			
				

计算磁盘读写总量

			
(node_disk_read_bytes_total{device="vda"} + node_disk_written_bytes_total{device="vda"}) / (1024 * 1024)			
			
				

内存使用率计算

			
(node_memory_MemTotal_bytes - node_memory_MemFree_bytes) / node_memory_MemTotal_bytes * 100

# 查询出内存使用率到达 80% 的节点			
(node_memory_MemTotal_bytes - node_memory_MemFree_bytes) / node_memory_MemTotal_bytes > 0.8

node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 > 80
			
				

69.2.4.4. 聚合操作

PromQL内置的聚合操作和函数可以让用户对这些数据进行进一步的分析

rate()

通过rate()函数计算HTTP请求量的增长率:

				
rate(http_requests_total[5m])				
				
				
topk() 和 bottomk()

查询当前访问量前10的HTTP地址:

				
topk(10, http_requests_total)
				
				
delta()

通过PromQL内置函数delta()可以获取样本在一段时间返回内的变化情况。例如,计算CPU温度在两个小时内的差异:

				
delta(cpu_temp_celsius{host="zeus"}[2h])
				
				

delta 适用于 Gauge 类型的监控指标

predict_linear()

使用predict_linear()对数据的变化趋势进行预测。例如,预测系统磁盘空间在4个小时之后的剩余情况:

				
predict_linear(node_filesystem_free{job="node"}[1h], 4 * 3600)				
				
				
deriv()

deriv()计算样本的线性回归模型

				
				
				
				
sum()

求和操作

			
sum(node_cpu_seconds_total) 
sum(node_cpu_seconds_total) by (mode)	
			
				
			
Element			Value
{mode="steal"}	0
{mode="system"}	2632.2400000000002
{mode="user"}	768.49
{mode="idle"}	93899.19
{mode="iowait"}	8.85
{mode="irq"}	0
{mode="nice"}	0
{mode="softirq"}	13.35			
			
				

			
sum(node_cpu_seconds_total) without (instance)
			
				
			
sum(node_cpu_seconds_total) by (mode,cpu)
			
				
			
sum(sum(irate(node_cpu{mode!='idle'}[5m]))  / sum(irate(node_cpu[5m]))) by (instance)			
			
				
avg()

计算平均数

			
avg(node_cpu_seconds_total) by (mode)
			
				

			

Element			Value
{mode="nice"}	0
{mode="softirq"}	3.3374999999999995
{mode="steal"}	0
{mode="system"}	658.06
{mode="user"}	192.1225
{mode="idle"}	23474.7975
{mode="iowait"}	2.2125
{mode="irq"}	0			
			
				
min (最小值),max (最大值)

			
			
			
				
count_values()
quantile()