# Monitoring and Observability
Comprehensive guide to implementing observability, metrics collection, alerting strategies, and dashboard design for IT operations.
## Table of Contents
- [Observability Principles](#observability-principles)
- [The Three Pillars](#the-three-pillars)
- [Metrics Strategy](#metrics-strategy)
- [Alerting Best Practices](#alerting-best-practices)
- [Dashboard Design](#dashboard-design)
- [SLI/SLO/SLA Framework](#slislosla-framework)
- [Monitoring Tools](#monitoring-tools)
- [Implementation Examples](#implementation-examples)
## Observability Principles
### Definition
**Observability**: The ability to understand the internal state of a system by examining its external outputs (metrics, logs, traces).
**Monitoring vs Observability**:
| Monitoring | Observability |
|------------|---------------|
| Known unknowns | Unknown unknowns |
| Predefined dashboards | Exploratory analysis |
| Threshold-based alerts | Context-aware investigation |
| "Is the system up?" | "Why is the system behaving this way?" |
### Key Principles
```yaml
1. Instrument Everything:
- Application code (business metrics, errors, latency)
- Infrastructure (CPU, memory, disk, network)
- Dependencies (databases, APIs, queues)
- User experience (frontend performance, transactions)
2. High Cardinality Data:
- Enable filtering by user_id, region, version, etc.
- Support arbitrary dimensional queries
- Example: "Show me errors for user_id=123 in us-west-2 for version 2.3.1"
3. Context and Correlation:
- Link metrics, logs, and traces together
- Use consistent labels and tags across telemetry
- Include trace IDs in logs and metrics
4. Real-Time and Historical:
- Real-time for incident response (< 1 min delay)
- Historical for trend analysis (retain 13+ months)
- Different retention policies by data type
5. Self-Service:
- Empower teams to create their own dashboards
- Provide query language training
- Build reusable dashboard templates
```
## The Three Pillars
### 1. Metrics (What)
**Definition**: Numeric measurements over time (counters, gauges, histograms).
**Types**:
```yaml
Counter:
Description: Monotonically increasing value
Examples:
- http_requests_total
- errors_total
- bytes_sent_total
Operations: Rate, increase over time
Gauge:
Description: Value that can go up or down
Examples:
- cpu_usage_percent
- memory_available_bytes
- queue_depth
Operations: Current value, average, min, max
Histogram:
Description: Distribution of values in buckets
Examples:
- http_request_duration_seconds
- database_query_duration_seconds
Operations: Percentiles (p50, p95, p99), averages
Summary:
Description: Pre-computed percentiles
Examples:
- request_latency_summary
Operations: Pre-defined percentiles
```
**Metric Naming Convention**:
```
{namespace}_{component}_{metric}_{unit}
Examples:
- api_http_requests_total
- db_postgres_connections_active
- cache_redis_hits_total
- queue_sqs_messages_received_total
```
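For reference, the sketch below declares one metric of each type with the Python `prometheus_client` library, using names that follow this convention (the specific metric names are illustrative examples, not part of this guide's canon):
```python
from prometheus_client import Counter, Gauge, Histogram, Summary

# Counter: monotonically increasing; query with rate() / increase()
HTTP_REQUESTS = Counter("api_http_requests_total", "Total HTTP requests", ["method", "status"])

# Gauge: can go up or down; read the current value directly
QUEUE_DEPTH = Gauge("queue_sqs_messages_visible", "Messages currently visible in the queue")

# Histogram: observations bucketed for percentile estimation via histogram_quantile()
REQUEST_DURATION = Histogram("api_http_request_duration_seconds", "HTTP request duration in seconds")

# Summary: client-side aggregation of observations (count and sum by default in the Python client)
DB_QUERY_LATENCY = Summary("db_postgres_query_duration_seconds", "Database query duration in seconds")

HTTP_REQUESTS.labels(method="GET", status="200").inc()
QUEUE_DEPTH.set(42)
REQUEST_DURATION.observe(0.253)
DB_QUERY_LATENCY.observe(0.012)
```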
### 2. Logs (Why)
**Definition**: Timestamped text records of discrete events.
**Log Levels**:
```yaml
ERROR:
When: Failures requiring immediate attention
Example: "Database connection failed after 3 retries"
WARN:
When: Unexpected but handled situations
Example: "API rate limit approaching (85% of quota)"
INFO:
When: Important business events
Example: "User 12345 completed checkout for $150.00"
DEBUG:
When: Detailed diagnostic information
Example: "Loaded configuration from /etc/app/config.yaml"
```
**Structured Logging Format**:
```json
{
"timestamp": "2025-01-15T14:32:10.123Z",
"level": "ERROR",
"service": "payment-api",
"version": "2.3.1",
"environment": "production",
"trace_id": "a1b2c3d4e5f6",
"span_id": "1234567890",
"user_id": "user-789",
"message": "Payment processing failed",
"error": {
"type": "StripeAPIException",
"message": "Card declined: insufficient funds",
"stack_trace": "..."
},
"context": {
"amount": 150.00,
"currency": "USD",
"payment_method": "card_****1234"
}
}
```
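A minimal way to emit logs in this shape from Python is a custom JSON formatter on the standard `logging` module; this is a sketch, with field names mirroring the example above and the service name hard-coded where real code would read it from configuration:
```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "payment-api",  # usually injected from config
            "message": record.getMessage(),
        }
        # Attach any extra context passed via logger.*(..., extra={...})
        for key in ("trace_id", "user_id", "context"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        if record.exc_info:
            entry["error"] = self.formatException(record.exc_info)
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "Payment processing failed",
    extra={"trace_id": "a1b2c3d4e5f6", "user_id": "user-789",
           "context": {"amount": 150.00, "currency": "USD"}},
)
```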
**Log Aggregation Best Practices**:
```yaml
Collection:
- Use lightweight agents (Fluentd, Filebeat, Vector)
- Buffer locally to handle backend outages
- Compress during transmission
- Sample debug logs in high-volume scenarios
Storage:
- Hot tier (last 7 days): Fast SSD for queries
- Warm tier (8-90 days): Standard storage
- Cold tier (90+ days): Archive storage (S3, Glacier)
Indexing:
- Index critical fields: timestamp, level, service, trace_id, user_id
- Full-text search on message field
- Use field extraction for structured logs
```
### 3. Traces (Where)
**Definition**: End-to-end request flow across distributed systems.
**Trace Anatomy**:
```
Trace (entire request)
├─ Span 1: API Gateway (50ms)
│ ├─ Span 2: Auth Service (10ms)
│ └─ Span 3: User Service (35ms)
│ ├─ Span 4: Database Query (20ms)
│ └─ Span 5: Cache Lookup (5ms)
└─ Span 6: Response Serialization (5ms)
Total Trace Duration: 50ms
Critical Path: Span 1 → Span 3 → Span 4
```
**Trace Context Propagation**:
```python
# OpenTelemetry Python example
# (assumes order_id and amount come from the incoming request)
import requests

from opentelemetry import trace
from opentelemetry.propagate import inject
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

# Starting a trace
with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", order_id)
    span.set_attribute("order.amount", amount)

    # Propagate context to downstream service
    headers = {}
    inject(headers)  # Adds traceparent header

    response = requests.post(
        "https://payment-service/charge",
        headers=headers,
        json={"amount": amount},
    )

    if response.status_code != 200:
        span.set_status(Status(StatusCode.ERROR))
        span.record_exception(Exception("Payment failed"))
```
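On the receiving side, the downstream service can continue the same trace by extracting the propagated context from the incoming headers. A sketch follows; the Flask framing and endpoint are illustrative assumptions (auto-instrumentation libraries usually do this step for you):
```python
from flask import Flask, request
from opentelemetry import trace
from opentelemetry.propagate import extract

app = Flask(__name__)
tracer = trace.get_tracer(__name__)

@app.route("/charge", methods=["POST"])
def charge():
    # Rebuild the trace context from the incoming traceparent header,
    # so this span becomes a child of the caller's "process_order" span.
    ctx = extract(request.headers)
    with tracer.start_as_current_span("charge_payment", context=ctx) as span:
        amount = request.json["amount"]
        span.set_attribute("payment.amount", amount)
        return {"status": "charged"}, 200
```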
**Sampling Strategies**:
```yaml
Always Sample:
- Errors and exceptions (100%)
- Slow requests (p95+, 100%)
- Specific user_ids (for debugging, 100%)
Head Sampling (at trace start):
- Random sampling (1% of all traces)
- Rate limiting (max 1000 traces/second)
Tail Sampling (after trace completion):
- Sample interesting traces (errors, slow, specific attributes)
- Requires buffering and additional processing
- More accurate but higher resource cost
```
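In OpenTelemetry's Python SDK, head sampling is configured on the tracer provider. The sketch below applies 1% ratio-based sampling while respecting the parent's decision; tail sampling is normally done in a collector and is not shown here:
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Head sampling: keep ~1% of new traces, but always follow the parent's
# sampled / not-sampled decision for downstream spans.
sampler = ParentBased(root=TraceIdRatioBased(0.01))
provider = TracerProvider(sampler=sampler)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("sampled_operation"):
    pass  # roughly 1 in 100 of these traces will be recorded and exported
```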
## Metrics Strategy
### The Four Golden Signals (Google SRE)
```yaml
1. Latency:
   Definition: Time to service a request
   Metrics:
     - http_request_duration_seconds (histogram)
     - Percentiles: p50, p90, p95, p99
   Thresholds:
     - p50 < 100ms
     - p95 < 500ms
     - p99 < 1000ms

2. Traffic:
   Definition: Demand on your system
   Metrics:
     - http_requests_per_second (counter rate)
     - active_connections (gauge)
   Analysis:
     - Daily patterns
     - Growth trends
     - Capacity planning

3. Errors:
   Definition: Rate of failed requests
   Metrics:
     - http_requests_total{status=~"5.."} (counter)
     - error_rate = errors / total_requests
   Thresholds:
     - Error rate < 1% (tune per service and its SLO)

4. Saturation:
   Definition: How "full" the service is (utilization of its most constrained resources)
   Metrics:
     - CPU, memory, disk, and network utilization
     - Connection pool and queue utilization
   Analysis:
     - Alert before resources are exhausted
     - Feed into capacity planning
```
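The queries below are one way to express the four signals in PromQL against the metric names used above; the `service` label and 5-minute windows are illustrative assumptions rather than fixed requirements:
```promql
# Latency: p95 request duration per service
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))

# Traffic: requests per second per service
sum(rate(http_requests_total[5m])) by (service)

# Errors: fraction of requests returning 5xx
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
  / sum(rate(http_requests_total[5m])) by (service)

# Saturation: CPU utilization per instance (node_exporter metrics)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```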
## Alerting Best Practices
### Alert Design Principles
```yaml
1. Make Alerts Actionable:
   Example: "CPU usage > 85% for 10 minutes on app-server-3"
   Every alert should answer:
   - What is wrong?
   - Which component is affected?
   - What should I do about it?

2. Reduce False Positives:
   - Use sustained thresholds (not instantaneous spikes)
   - Example: Alert after 5 minutes above the threshold, not on the first breach
   - Avoid alerting on symptoms if the root cause is already alerting

3. Alert on Symptoms, Not Causes:
   BETTER: "API error rate > 1%" (user-facing symptom)
   WORSE: "Redis connection count below threshold" (internal cause users may never notice)
```

### Reducing Alert Fatigue
```yaml
1. Review Alerts Regularly:
   - Remove or retune alerts with a > 20% false positive rate
   - Track alert effectiveness metrics

2. Alert Grouping:
   - Group related alerts (same root cause)
   - Example: Don't alert on every pod failure if the deployment alert is already firing

3. Dynamic Thresholds:
   - Use anomaly detection instead of static thresholds
   - Adjust thresholds based on time of day/week

4. Escalation Policies:
   - Primary on-call: 5 min
   - Secondary on-call: 15 min
   - Team lead: 30 min
   - Engineering manager: 60 min

5. Maintenance Windows:
   - Silence alerts during planned maintenance
   - Auto-create maintenance windows from change tickets
```
### Prometheus Alerting Rules
```yaml
# alerts.yml
groups:
  - name: api_alerts
    interval: 30s
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            /
          sum(rate(http_requests_total[5m])) by (service)
          > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} on {{ $labels.service }}"
          runbook: "https://wiki.example.com/runbooks/high-error-rate"
          dashboard: "https://grafana.example.com/d/api-dashboard"

      # High latency (p95)
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 1.0
        for: 10m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "High p95 latency on {{ $labels.service }}"
          description: "p95 latency is {{ $value }}s on {{ $labels.service }}"

      # Saturation (CPU)
      - alert: HighCPUUsage
        expr: |
          100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 10m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | humanize }}% on {{ $labels.instance }}"

      # Disk space prediction
      - alert: DiskWillFillIn4Hours
        expr: |
          predict_linear(node_filesystem_free_bytes{fstype!~"tmpfs|fuse.lxcfs"}[1h], 4*3600) < 0
        for: 5m
        labels:
          severity: critical
          team: infrastructure
        annotations:
          summary: "Disk will fill on {{ $labels.instance }}"
          description: "Filesystem {{ $labels.mountpoint }} will fill in approximately 4 hours"

      # Service down
      - alert: ServiceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
          team: infrastructure
        annotations:
          summary: "Service {{ $labels.job }} is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes"

      # Certificate expiration
      - alert: CertificateExpiringSoon
        expr: |
          (probe_ssl_earliest_cert_expiry - time()) / 86400 < 14
        for: 1h
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "SSL certificate expiring soon"
          description: "Certificate for {{ $labels.instance }} expires in {{ $value | humanize }} days"
```
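Before deploying rule changes, the file can be syntax-checked with promtool, which ships with Prometheus (the file path here is illustrative):
```
promtool check rules alerts.yml
```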
### PagerDuty Integration
```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Critical alerts go to PagerDuty
    - match:
        severity: critical
      receiver: pagerduty-critical
      continue: true
    # Warnings go to Slack
    - match:
        severity: warning
      receiver: slack-warnings
    # Infrastructure team alerts
    - match:
        team: infrastructure
      receiver: slack-infrastructure
      routes:
        - match:
            severity: critical
          receiver: pagerduty-infrastructure

receivers:
  - name: 'default-receiver'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX'
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
        description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
        details:
          firing: '{{ .Alerts.Firing | len }}'
          resolved: '{{ .Alerts.Resolved | len }}'
          num_alerts: '{{ .Alerts | len }}'
        links:
          - href: '{{ .CommonAnnotations.runbook }}'
            text: 'Runbook'
          - href: '{{ .CommonAnnotations.dashboard }}'
            text: 'Dashboard'

  - name: 'slack-warnings'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YYY'
        channel: '#alerts-warnings'
        color: 'warning'

inhibit_rules:
  # Inhibit warning if critical is firing
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'service', 'instance']
```
## Dashboard Design
### Dashboard Principles
```yaml
1. Audience-Specific Dashboards:
- Executive Dashboard: Business metrics, SLAs, revenue impact
- Operations Dashboard: System health, alerts, capacity
- Development Dashboard: Deployment status, error rates, traces
- Service Dashboard: Detailed metrics for specific service
2. Information Hierarchy:
Top: Most critical information (current status)
Middle: Supporting metrics and trends
Bottom: Detailed breakdowns and diagnostics
3. Visual Best Practices:
- Use color purposefully (red=bad, green=good, yellow=warning)
- Avoid more than 6-8 panels per row
- Consistent time ranges across panels
- Include units in axis labels
- Use logarithmic scale for wide-ranging data
4. Dashboard Variables:
- Environment (production, staging, dev)
- Service/Component
- Time range
- Region/Datacenter
5. Actionable Context:
- Link panels to detailed views
- Include threshold lines on graphs
- Add annotations for deployments/incidents
```
### Grafana Dashboard Structure
```yaml
Executive Dashboard (Business Metrics):
Row 1: Key Business Metrics
- Revenue (last hour, today, this month)
- Active Users (gauge)
- Transaction Volume (time series)
- Conversion Rate (percentage)
Row 2: System Health Overview
- Overall Availability (SLA compliance)
- P95 Latency Across All Services
- Error Budget Remaining
- Active Incidents (count)
Row 3: Trends
- Revenue Trend (7 days)
- User Growth (30 days)
- Error Rate Trend (7 days)
Operations Dashboard (System Health):
Row 1: Traffic Light Status
- All Services Status (red/yellow/green stat panels)
- Active Alerts Count
- On-Call Engineer
Row 2: Golden Signals
- Request Rate (requests/sec across all services)
- Error Rate (% errors)
- P50/P95/P99 Latency
- Saturation (CPU, Memory, Disk across fleet)
Row 3: Infrastructure Health
- CPU Usage by Host (heatmap)
- Memory Usage by Host
- Disk Usage by Host
- Network Traffic
Row 4: Recent Changes
- Deployments (annotations)
- Configuration Changes
- Infrastructure Changes
Service-Specific Dashboard:
Row 1: Service Overview
- Request Rate
- Error Rate
- Latency (p50, p95, p99)
- Active Instances
Row 2: RED Metrics Breakdown
- Requests by Endpoint
- Errors by Type
- Latency Distribution (histogram)
Row 3: Dependencies
- Database Query Performance
- External API Call Performance
- Cache Hit Rate
- Queue Depth
Row 4: Resource Usage
- CPU per Instance
- Memory per Instance
- JVM/Runtime Metrics (if applicable)
```
### Grafana JSON Dashboard Example
```json
{
"dashboard": {
"title": "API Service Dashboard",
"tags": ["api", "production"],
"timezone": "browser",
"templating": {
"list": [
{
"name": "environment",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(http_requests_total, environment)",
"current": {
"text": "production",
"value": "production"
}
},
{
"name": "service",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(http_requests_total{environment="$environment"}, service)",
"multi": true
}
]
},
"annotations": {
"list": [
{
"name": "Deployments",
"datasource": "Prometheus",
"expr": "deployment_events{service="$service"}",
"iconColor": "green"
}
]
},
"panels": [
{
"id": 1,
"title": "Request Rate",
"type": "graph",
"gridPos": {"x": 0, "y": 0, "w": 12, "h": 8},
"targets": [
{
"expr": "sum(rate(http_requests_total{service="$service", environment="$environment"}[5m])) by (service)",
"legendFormat": "{{service}}"
}
],
"yaxes": [
{"format": "reqps", "label": "Requests/sec"}
]
},
{
"id": 2,
"title": "Error Rate",
"type": "graph",
"gridPos": {"x": 12, "y": 0, "w": 12, "h": 8},
"targets": [
{
"expr": "sum(rate(http_requests_total{service="$service", status=~"5.."}[5m])) / sum(rate(http_requests_total{service="$service"}[5m]))",
"legendFormat": "Error Rate"
}
],
"thresholds": [
{
"value": 0.01,
"colorMode": "critical",
"op": "gt",
"line": true,
"fill": true
}
],
"yaxes": [
{"format": "percentunit", "max": 0.05}
]
},
{
"id": 3,
"title": "Latency (p95)",
"type": "graph",
"gridPos": {"x": 0, "y": 8, "w": 12, "h": 8},
"targets": [
{
"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service="$service"}[5m])) by (le, endpoint))",
"legendFormat": "{{endpoint}}"
}
],
"yaxes": [
{"format": "s", "label": "Duration"}
]
}
]
}
}
```
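Dashboards like the one above can also be managed as code. The sketch below pushes a dashboard JSON file to Grafana's HTTP API (`POST /api/dashboards/db`); the `GRAFANA_URL`, token variable, and file name are assumptions to adapt to your deployment:
```python
import json
import os

import requests

# Assumed environment variables; adjust to your Grafana deployment.
GRAFANA_URL = os.environ.get("GRAFANA_URL", "https://grafana.example.com")
GRAFANA_TOKEN = os.environ["GRAFANA_TOKEN"]

def push_dashboard(path: str) -> None:
    """Upload (create or update) a dashboard definition from a JSON file."""
    with open(path) as f:
        dashboard = json.load(f)["dashboard"]

    payload = {
        "dashboard": dashboard,
        "overwrite": True,  # replace an existing dashboard with the same uid/title
        "message": f"Provisioned from {path}",
    }
    resp = requests.post(
        f"{GRAFANA_URL}/api/dashboards/db",
        headers={"Authorization": f"Bearer {GRAFANA_TOKEN}"},
        json=payload,
        timeout=10,
    )
    resp.raise_for_status()
    print(f"Dashboard pushed: {resp.json().get('url')}")

if __name__ == "__main__":
    push_dashboard("api-service-dashboard.json")
```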
## SLI/SLO/SLA Framework
### Definitions
```yaml
SLI (Service Level Indicator):
Definition: Quantitative measure of service level
Examples:
- Request latency (95th percentile < 500ms)
- Availability (% of successful requests)
- Throughput (requests per second)
- Data freshness (lag in minutes)
SLO (Service Level Objective):
Definition: Target value or range for an SLI
Examples:
- 99.9% of requests complete in < 500ms
- 99.95% availability over 30 days
- Data lag < 5 minutes for 99% of data
SLA (Service Level Agreement):
Definition: Contractual commitment with consequences
Examples:
- 99.9% uptime or customer gets credit
- <500ms p95 latency or penalty payment
```
### SLO Design
```yaml
1. Choose Meaningful SLIs:
User-Facing:
- Availability: Can users access the service?
- Latency: How fast do requests complete?
- Quality: Are results correct/fresh?
Behind-the-Scenes:
- Throughput: Can system handle load?
- Durability: Is data safe?
- Correctness: Are computations accurate?
2. Set Realistic SLOs:
- Start with current performance baseline
- Add buffer for improvement (don't set SLO = current performance)
- Consider user expectations and business requirements
- Remember: 100% is the wrong SLO (no room for changes)
Example:
Current p95 latency: 300ms
User expectation: < 1 second
Set SLO: 500ms (between current and user max tolerance)
3. Error Budget:
Formula: Error Budget = 100% - SLO
Example:
SLO: 99.9% availability
Error Budget: 0.1% = 43.2 minutes/month
Use:
- Budget consumed = Actual downtime / Error budget
- If budget exhausted: Freeze deployments, focus on reliability
- If budget remaining: Safe to take risks (new features, refactors)
4. Multi-Window SLOs:
- Short window (7 days): Detect immediate issues
- Long window (30 days): Track trends
- Rolling window: Continuous monitoring
Example:
7-day SLO: 99.5% (allows 50 minutes downtime)
30-day SLO: 99.9% (allows 43 minutes downtime)
```
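To make the error-budget arithmetic concrete, here is a small illustrative calculation; the SLO targets and downtime figure are examples, not prescriptions:
```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Allowed downtime in minutes for a given SLO over a window."""
    return (1.0 - slo) * window_days * 24 * 60

def budget_consumed(actual_downtime_min: float, slo: float, window_days: int) -> float:
    """Fraction of the error budget already spent (may exceed 1.0)."""
    return actual_downtime_min / error_budget_minutes(slo, window_days)

if __name__ == "__main__":
    # 99.9% over 30 days -> 43.2 minutes of budget, as in the example above.
    print(round(error_budget_minutes(0.999, 30), 1))   # 43.2
    # 20 minutes of downtime so far consumes ~46% of that budget.
    print(round(budget_consumed(20, 0.999, 30), 2))    # 0.46
```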
### SLO Monitoring with Prometheus
```yaml
# SLO recording rules
groups:
  - name: slo_recording_rules
    interval: 30s
    rules:
      # Total requests
      - record: slo:http_requests:total
        expr: sum(rate(http_requests_total[5m]))

      # Successful requests (not 5xx)
      - record: slo:http_requests:success
        expr: sum(rate(http_requests_total{status!~"5.."}[5m]))

      # Availability SLI (success rate)
      - record: slo:availability:ratio
        expr: slo:http_requests:success / slo:http_requests:total

      # Latency SLI (% of requests under threshold)
      - record: slo:latency:good_requests
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
      - record: slo:latency:ratio
        expr: slo:latency:good_requests / slo:http_requests:total

      # Error budget remaining (30-day window): 1 - (actual error rate / allowed error rate)
      - record: slo:error_budget:availability:30d
        expr: |
          1 - (
            (1 - avg_over_time(slo:availability:ratio[30d]))
            /
            (1 - 0.999) # SLO target
          )

  # SLO alerting rules
  - name: slo_alerts
    rules:
      # Availability SLO burn rate alerts
      - alert: AvailabilitySLOBurnRateCritical
        expr: |
          (1 - slo:availability:ratio) > 14.4 * (1 - 0.999) # Burn rate > 14.4x (will exhaust budget in 2 days)
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Critical SLO burn rate"
          description: "At current rate, 30-day error budget will be exhausted in 2 days"

      - alert: AvailabilitySLOBurnRateWarning
        expr: |
          (1 - slo:availability:ratio) > 6 * (1 - 0.999) # Burn rate > 6x
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Elevated SLO burn rate"
          description: "Error budget consumption is higher than expected"

      # Error budget exhausted
      - alert: ErrorBudgetExhausted
        expr: slo:error_budget:availability:30d < 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error budget nearly exhausted"
          description: "Less than 5% of the 30-day availability error budget remains"
```
This guide covers the core practices needed to implement robust observability for IT operations: telemetry collection across metrics, logs, and traces; alerting design; dashboard construction; and SLO management.
Source: claude-code-templates (MIT).