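# Observability: Monitoring, Logging, and Tracing
## The Three Pillars of Observability
### 1. Metrics (What happened?)
- **Definition**: Numeric measurements tracked over time
- **Examples**: CPU utilization, request rate, error rate, latency
- **Tools**: Prometheus, Datadog, CloudWatch, New Relic
### 2. Logs (Why did it happen?)
- **Definition**: Timestamped records of events
- **Examples**: Application logs, access logs, error logs
- **Tools**: ELK Stack, Splunk, CloudWatch Logs, Loki
### 3. Traces (Where did it happen?)
- **Definition**: The path a request takes through a distributed system
- **Examples**: Service call chains, database queries, external API calls
- **Tools**: Jaeger, Zipkin, AWS X-Ray, Datadog APM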
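To make the three signals concrete, here is a minimal sketch in which one simulated request emits a metric, a log record, and a trace span that share a trace ID. Everything in it is a hypothetical stand-in: the in-memory `metrics`, `logs`, and `spans` containers and the `handle_request` function are illustrative only and do not represent the API of any of the tools listed above.

```python
import json
import time
import uuid
from collections import Counter

# Hypothetical in-memory stand-ins for a metrics backend,
# a log pipeline, and a trace collector.
metrics = Counter()
logs = []
spans = []


def handle_request(path, fail=False):
    """Simulate one request and emit a metric, a log record, and a span."""
    trace_id = uuid.uuid4().hex
    start = time.perf_counter()
    status = 500 if fail else 200
    duration_ms = (time.perf_counter() - start) * 1000

    # Metric: what happened, as a number aggregated over time.
    metrics[("http_requests_total", path, status)] += 1

    # Log: why it happened, as a timestamped event record.
    logs.append(json.dumps({
        "ts": time.time(),
        "level": "ERROR" if fail else "INFO",
        "msg": "request failed" if fail else "request served",
        "path": path,
        "trace_id": trace_id,
    }))

    # Trace span: where it happened, linked to the log by trace_id.
    spans.append({
        "trace_id": trace_id,
        "name": f"GET {path}",
        "duration_ms": duration_ms,
    })
    return trace_id
```

Carrying the same `trace_id` in the log record and the span is what lets an investigation move from an alert on a metric, to the relevant logs, to the exact call path.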
## The SLI/SLO/SLA Framework
### Service Level Indicators (SLIs)
**Quantitative measures of service quality**
yaml
# Common SLIs
availability:
  definition: "Percentage of requests that succeed"
  measurement: "(successful_requests / total_requests) * 100"
latency:
  definition: "Time taken to process a request"
  measurement: "p95 response time < 200ms"
error_rate:
  definition: "Percentage of requests that fail"
  measurement: "(failed_requests / total_requests) * 100"
throughput:
  definition: "Number of requests processed per second"
  measurement: "requests_per_second"
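The four SLI formulas above can be sketched as a single function over a window of observed requests. This is an illustration under stated assumptions, not a production implementation: the `compute_slis` name, the `(status_code, latency_ms)` tuple shape, the rule that any status below 500 counts as a success, and the nearest-rank percentile method are all choices made for this example.

```python
def compute_slis(requests, window_seconds):
    """Compute the four SLIs from (status_code, latency_ms) tuples.

    `requests` holds every request observed during a window that
    lasted `window_seconds` seconds.
    """
    total = len(requests)
    if total == 0 or window_seconds <= 0:
        raise ValueError("need at least one request and a positive window")

    # Assumption: any status below 500 counts as a successful request.
    successful = sum(1 for status, _ in requests if status < 500)
    failed = total - successful

    # Nearest-rank p95: the smallest latency with at least 95% of
    # samples at or below it. Integer math avoids float rounding.
    latencies = sorted(latency for _, latency in requests)
    p95_index = (95 * total + 99) // 100 - 1

    return {
        "availability": 100 * successful / total,
        "error_rate": 100 * failed / total,
        "latency_p95_ms": latencies[p95_index],
        "throughput_rps": total / window_seconds,
    }
```

In a real deployment these values would normally be computed by the monitoring backend from counters and histograms rather than from raw request lists; the function only shows what each formula means.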
### Service Level Objectives (SLOs)
**Target values for SLIs**
yaml
# SLO examples
availability_slo:
  target: 99.9%
  measurement_window: 30 days
  error_budget: 0.1% (about 43 minutes per month)
latency_slo:
  target: "95% of requests < 200ms"
  measurement_window: 7 days
error_rate_slo:
  target: "< 0.1%"
  measurement_window: 24 hours
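The error budget figure follows directly from the target and the window: a 30-day window has 30 × 24 × 60 = 43,200 minutes, and 0.1% of that is about 43.2 minutes. The sketch below works through that arithmetic and adds a request-based view of how much budget remains. The function names are hypothetical, chosen for this example.

```python
def error_budget_minutes(slo_target_percent, window_days):
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_target_percent) / 100


def budget_remaining(slo_target_percent, total_requests, failed_requests):
    """Fraction of a request-based error budget still unspent.

    1.0 means nothing is spent, 0.0 means the budget is exhausted,
    and a negative value means the SLO has been violated.
    """
    allowed_failures = total_requests * (100 - slo_target_percent) / 100
    if allowed_failures <= 0:
        raise ValueError("no error budget: target is 100% or there is no traffic")
    return 1 - failed_requests / allowed_failures


if __name__ == "__main__":
    # 99.9% over 30 days allows roughly 43.2 minutes of downtime.
    print(round(error_budget_minutes(99.9, 30), 1))
    # 250 failures out of 1,000,000 requests spends about a quarter
    # of a 99.9% budget, leaving roughly 0.75.
    print(round(budget_remaining(99.9, 1_000_000, 250), 2))
```

Teams commonly use the remaining fraction as a release gate: ship freely while budget remains, and prioritize reliability work once it runs low.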
### Service Level Agreements (SLAs)
**Business contracts with consequences**
yaml
# SLA example
web_application_sla:
  availability: 99.9%
  latency_p95: 300ms
  consequences:
    - availability < 99.9%: 10% service credit
    - availability < 99.0%: 25% service credit
    - availability < 95.0%: 50% service credit
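The credit tiers above can be read as a lookup from measured availability to a credit percentage. The sketch below assumes the tiers are not cumulative, so only the most severe matching tier applies; the source does not state this, and a real contract may define it differently. `CREDIT_TIERS` and `service_credit_percent` are hypothetical names for this example.

```python
# Tiers from the SLA example as (availability threshold, credit percent),
# ordered from most to least severe so the first match wins.
CREDIT_TIERS = [
    (95.0, 50),
    (99.0, 25),
    (99.9, 10),
]


def service_credit_percent(measured_availability):
    """Return the service credit owed for a measured availability.

    Assumption: tiers are not cumulative; only the most severe
    tier whose threshold was missed applies.
    """
    for threshold, credit in CREDIT_TIERS:
        if measured_availability < threshold:
            return credit
    return 0  # SLA met, no credit owed
```

For example, a month measured at 98.0% availability falls below the 99.0% threshold but not the 95.0% one, so under this reading it earns a 25% credit.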
## Prometheus Setup
### Prometheus Configuration
yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    environment: 'prod'
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
# Load alerting and recording rules
rule_files:
  - "/etc/prometheus/rules/*.yml"
# Scrape configuration
scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']