Prometheus is an open-source monitoring and alerting system originally developed at SoundCloud in 2012; it is now a graduated project of the Cloud Native Computing Foundation (CNCF). It has become the de facto standard for monitoring cloud-native applications, containerized workloads, and distributed systems in dynamic environments.
Overview
Prometheus fundamentally changes how you approach monitoring by using a pull-based model to collect metrics from instrumented jobs. It stores all data as time series, identified by metric names and key/value pairs called labels. This approach makes it particularly well-suited for dynamic, containerized environments where services come and go frequently.
Core Concepts
- Metrics: Numerical measurements over time (CPU usage, request count, etc.)
- Time Series: A stream of timestamped values belonging to the same metric and label set
- Labels: Key-value pairs that identify different dimensions of a metric
- Targets: Endpoints that Prometheus scrapes for metrics
- Jobs: Collections of targets with the same purpose
- Instances: Individual endpoints of a job
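For example, every target Prometheus scrapes automatically produces an up series whose job and instance labels identify where it came from, and application metrics carry those same target labels plus their own:
up{job="node-exporter", instance="server1:9100"} 1
http_requests_total{job="api", instance="app1:8080", method="GET"} 1027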
Key Characteristics
- Pull-based model: Prometheus scrapes metrics from HTTP endpoints
- Time-series data: All data is stored as time-series with timestamps
- PromQL: Powerful functional query language for data analysis
- Multi-dimensional data: Metrics can have multiple labels for flexible querying
- No dependencies: Single binary with local storage
- Service discovery: Automatic discovery of monitoring targets
When to Use Prometheus
Prometheus is ideal for:
- Microservices Monitoring: Track distributed system performance
- Container Orchestration: Monitor Kubernetes, Docker Swarm, and similar platforms
- Infrastructure Monitoring: System metrics, network performance, storage usage
- Application Monitoring: Custom metrics, business KPIs, performance indicators
- Real-time Alerting: Proactive notifications based on metric thresholds
Data Model and Metric Types
Data Model
Prometheus stores all data as time-series, identified by:
- Metric name: Describes the feature being measured
- Labels: Key-value pairs for multi-dimensional data
- Timestamp: When the measurement was taken
- Value: The numeric measurement
Example metric:
http_requests_total{method="GET", handler="/api/users", status="200"} 1027
Metric Types
Counter
Cumulative metric that only increases (or resets to zero):
http_requests_total
process_cpu_seconds_total
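Because only the rate of change of a counter is usually interesting, counters are almost always queried through rate() or increase() rather than read raw:
# Per-second request rate averaged over the last 5 minutes
rate(http_requests_total[5m])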
Gauge
Metric that can go up and down:
memory_usage_bytes
cpu_temperature_celsius
active_connections
Histogram
Samples observations and counts them in configurable buckets:
http_request_duration_seconds
response_size_bytes
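A histogram is exposed as a set of cumulative _bucket series plus _sum and _count, and quantiles are estimated at query time:
# Approximate 95th-percentile request duration computed from histogram buckets
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))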
Summary
Similar to histogram but calculates quantiles over a sliding time window:
request_duration_seconds{quantile="0.5"}
request_duration_seconds{quantile="0.9"}
Key Features
Monitoring Capabilities
- Multi-dimensional Data Model: Time series identified by metric name and labels
- Flexible Query Language: PromQL for querying and aggregating data
- Pull-based Collection: Scrapes targets over HTTP for metrics
- Service Discovery: Automatic discovery of monitoring targets
- Efficient Storage: Custom time-series database optimized for monitoring data
Alerting and Notification
- Alertmanager Integration: Sophisticated alerting with routing and notification
- Alert Rules: Define conditions that trigger alerts
- Notification Channels: Email, Slack, PagerDuty, webhooks, and more
- Alert Grouping: Intelligent grouping and deduplication of alerts
- Silencing: Temporarily suppress alerts during maintenance
Ecosystem Integration
- Visualization: Grafana integration for rich dashboards
- Client Libraries: SDKs for major programming languages
- Exporters: Third-party integrations for databases, systems, and services
- Federation: Hierarchical aggregation of metrics across multiple instances
- Remote Storage: Integration with long-term storage solutions
Architecture
Core Components
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Prometheus    │    │  Alertmanager   │    │     Grafana     │
│     Server      │◄──►│                 │    │   (Optional)    │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         ▲                      │                      ▲
         │                      ▼                      │
         │             ┌─────────────────┐             │
         │             │  Notification   │             │
         │             │    Channels     │             │
         │             │ (Email, Slack)  │             │
         │             └─────────────────┘             │
         ▼                                             │
┌─────────────────┐    ┌─────────────────┐             │
│     Targets     │    │    Exporters    │             │
│ (Applications,  │    │ (Node, cAdvisor,│◄────────────┘
│    Services)    │    │   MySQL, etc.)  │
└─────────────────┘    └─────────────────┘
Component Descriptions
Prometheus Server: The main component that scrapes and stores time-series data, and serves queries via PromQL.
Client Libraries: Libraries for instrumenting application code in various programming languages (Go, Java, Python, .NET, etc.).
Pushgateway: An intermediary that ephemeral and batch jobs can push their metrics to; Prometheus then scrapes the Pushgateway like any other target.
Exporters: Third-party tools that export metrics from existing systems (databases, hardware, messaging systems, etc.).
Alertmanager: Handles alerts sent by Prometheus server and routes them to various notification channels.
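Short-lived batch jobs may finish before the next scrape, so they push their final metrics to the Pushgateway and Prometheus scrapes the gateway instead. A minimal sketch, assuming a Pushgateway reachable at pushgateway:9091 (the metric name is illustrative):
# Push one metric for the job "nightly_backup"; Prometheus later scrapes it from the Pushgateway
echo "backup_last_success_timestamp_seconds $(date +%s)" | \
  curl --data-binary @- http://pushgateway:9091/metrics/job/nightly_backup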
Prometheus Server Components
- Retrieval: Scrapes metrics from configured targets
- Storage: Time-series database for storing metrics
- PromQL Engine: Query language processor
- Web UI: Built-in expression browser and graph interface
- API: HTTP API for querying data and managing configuration
Data Flow
- Service Discovery: Identifies targets to monitor
- Scraping: Pulls metrics from targets via HTTP
- Storage: Stores time-series data locally
- Rule Evaluation: Processes recording and alerting rules
- Alert Generation: Sends alerts to Alertmanager
- Querying: Serves queries via API or web interface
Installation and Deployment
Binary Installation
# Download and extract Prometheus binary
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xvfz prometheus-*.tar.gz
cd prometheus-*
# Run Prometheus with default configuration
./prometheus --config.file=prometheus.yml
# Run with custom configuration and storage options
./prometheus \
--config.file=/path/to/prometheus.yml \
--storage.tsdb.path=/path/to/data \
--storage.tsdb.retention.time=30d \
--web.enable-lifecycle
Docker Container
# Basic Prometheus container
docker run -d \
--name prometheus \
-p 9090:9090 \
-v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
prom/prometheus:latest
# With additional configuration
docker run -d \
--name prometheus \
-p 9090:9090 \
-v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
-v $(pwd)/alert_rules:/etc/prometheus/rules \
-v prometheus-data:/prometheus \
prom/prometheus:latest \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/prometheus \
--web.console.libraries=/etc/prometheus/console_libraries \
--web.console.templates=/etc/prometheus/consoles \
--web.enable-lifecycle \
--storage.tsdb.retention.time=30d
Docker Compose
version: '3.8'
services:
prometheus:
image: prom/prometheus:latest
container_name: prometheus
ports:
- "9090:9090"
volumes:
- ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
- ./prometheus/alert_rules:/etc/prometheus/rules
- prometheus-data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--web.console.libraries=/etc/prometheus/console_libraries'
- '--web.console.templates=/etc/prometheus/consoles'
- '--web.enable-lifecycle'
- '--storage.tsdb.retention.time=30d'
- '--storage.tsdb.retention.size=10GB'
networks:
- monitoring
restart: unless-stopped
node-exporter:
image: prom/node-exporter:latest
container_name: node-exporter
ports:
- "9100:9100"
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
command:
- '--path.procfs=/host/proc'
- '--path.rootfs=/rootfs'
- '--path.sysfs=/host/sys'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
networks:
- monitoring
restart: unless-stopped
cadvisor:
image: gcr.io/cadvisor/cadvisor:latest
container_name: cadvisor
ports:
- "8080:8080"
volumes:
- /:/rootfs:ro
- /var/run:/var/run:rw
- /sys:/sys:ro
- /var/lib/docker:/var/lib/docker:ro
- /dev/disk:/dev/disk:ro
networks:
- monitoring
restart: unless-stopped
alertmanager:
image: prom/alertmanager:latest
container_name: alertmanager
ports:
- "9093:9093"
volumes:
- ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
- alertmanager-data:/alertmanager
command:
- '--config.file=/etc/alertmanager/alertmanager.yml'
- '--storage.path=/alertmanager'
networks:
- monitoring
restart: unless-stopped
volumes:
prometheus-data:
alertmanager-data:
networks:
monitoring:
driver: bridge
Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: prometheus
namespace: monitoring
labels:
app: prometheus
spec:
replicas: 1
selector:
matchLabels:
app: prometheus
template:
metadata:
labels:
app: prometheus
spec:
serviceAccountName: prometheus
containers:
- name: prometheus
image: prom/prometheus:latest
ports:
- containerPort: 9090
args:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--web.console.libraries=/etc/prometheus/console_libraries'
- '--web.console.templates=/etc/prometheus/consoles'
- '--web.enable-lifecycle'
- '--storage.tsdb.retention.time=30d'
volumeMounts:
- name: config-volume
mountPath: /etc/prometheus
- name: storage-volume
mountPath: /prometheus
resources:
requests:
memory: "1Gi"
cpu: "500m"
limits:
memory: "2Gi"
cpu: "1000m"
volumes:
- name: config-volume
configMap:
name: prometheus-config
- name: storage-volume
persistentVolumeClaim:
claimName: prometheus-pvc
---
apiVersion: v1
kind: Service
metadata:
name: prometheus-service
namespace: monitoring
spec:
selector:
app: prometheus
ports:
- port: 9090
targetPort: 9090
type: LoadBalancer
Helm Installation
# Add Prometheus Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install kube-prometheus-stack (includes Prometheus, Alertmanager, and Grafana)
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
--set prometheus.prometheusSpec.retention=30d \
--set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi \
--set alertmanager.alertmanagerSpec.retention=120h
# Install standalone Prometheus
helm install prometheus prometheus-community/prometheus \
--namespace monitoring \
--create-namespace \
--set server.persistentVolume.size=50Gi \
--set server.retention=30d
Configuration
Basic Configuration File
The main configuration file (prometheus.yml):
global:
# How frequently to scrape targets by default
scrape_interval: 15s
# How long until a scrape request times out
scrape_timeout: 10s
# How frequently to evaluate rules
evaluation_interval: 15s
# Attach labels to any time series or alerts when communicating with external systems
external_labels:
cluster: 'production'
replica: '1'
# Rule files specify a list of globs
rule_files:
- "alert_rules/*.yml"
- "recording_rules/*.yml"
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
scheme: http
timeout: 10s
api_version: v2
# Scrape configuration
scrape_configs:
# Prometheus itself
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
scrape_interval: 5s
metrics_path: /metrics
# Node Exporter
- job_name: 'node-exporter'
static_configs:
- targets: ['node-exporter:9100']
scrape_interval: 15s
# cAdvisor for container metrics
- job_name: 'cadvisor'
static_configs:
- targets: ['cadvisor:8080']
scrape_interval: 15s
# Application metrics
- job_name: 'app-metrics'
static_configs:
- targets: ['app1:8080', 'app2:8080']
metrics_path: /actuator/prometheus
scrape_interval: 30s
# Remote write configuration (optional)
remote_write:
- url: "https://remote-storage-endpoint/api/v1/write"
basic_auth:
username: "user"
password: "password"
write_relabel_configs:
- source_labels: [__name__]
regex: 'expensive_metric.*'
action: drop
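Since the deployment examples in this guide start Prometheus with --web.enable-lifecycle, configuration changes can be validated and applied without restarting the server:
# Validate the configuration file, then ask the running server to reload it
promtool check config prometheus.yml
curl -X POST http://localhost:9090/-/reload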
Advanced Configuration Options
global:
scrape_interval: 15s
scrape_timeout: 10s
evaluation_interval: 15s
# Query log file
query_log_file: /var/log/prometheus_queries.log
# External labels for federation and remote storage
external_labels:
cluster: 'prod-cluster'
region: 'us-west-2'
environment: 'production'
# Recording rules for pre-computing expensive queries
rule_files:
- "/etc/prometheus/rules/*.yml"
# Multiple Alertmanager instances for HA
alerting:
alert_relabel_configs:
- source_labels: [severity]
target_label: priority
regex: critical
replacement: high
alertmanagers:
- static_configs:
- targets:
- alertmanager-1:9093
- alertmanager-2:9093
timeout: 10s
path_prefix: /alertmanager
# Federation configuration
scrape_configs:
- job_name: 'federate'
scrape_interval: 15s
honor_labels: true
metrics_path: '/federate'
params:
'match[]':
- '{job=~"prometheus"}'
- '{__name__=~"up|instance:.*"}'
static_configs:
- targets:
- 'prometheus-1:9090'
- 'prometheus-2:9090'
# Kubernetes service discovery
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
Service Discovery
Static Configuration
scrape_configs:
- job_name: 'static-targets'
static_configs:
- targets: ['server1:9100', 'server2:9100']
labels:
environment: production
team: infrastructure
- targets: ['app1:8080', 'app2:8080']
labels:
environment: production
team: backend
File-based Service Discovery
scrape_configs:
- job_name: 'file-discovery'
file_sd_configs:
- files:
- '/etc/prometheus/targets/*.json'
- '/etc/prometheus/targets/*.yml'
refresh_interval: 30s
Target file example (/etc/prometheus/targets/web-servers.json):
[
{
"targets": ["web1:9100", "web2:9100", "web3:9100"],
"labels": {
"job": "web-servers",
"environment": "production",
"datacenter": "us-west-1"
}
},
{
"targets": ["db1:9100", "db2:9100"],
"labels": {
"job": "database-servers",
"environment": "production",
"datacenter": "us-west-1"
}
}
]
Kubernetes Service Discovery
scrape_configs:
# Discover Kubernetes nodes
- job_name: 'kubernetes-nodes'
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
# Discover Kubernetes pods
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
action: replace
target_label: __scheme__
regex: (https?)
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
# Discover Kubernetes services
- job_name: 'kubernetes-services'
kubernetes_sd_configs:
- role: service
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
action: replace
target_label: __scheme__
regex: (https?)
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
Docker Service Discovery
scrape_configs:
- job_name: 'docker-containers'
docker_sd_configs:
- host: unix:///var/run/docker.sock
refresh_interval: 30s
relabel_configs:
- source_labels: [__meta_docker_container_label_prometheus_job]
target_label: job
- source_labels: [__meta_docker_container_name]
target_label: container_name
- source_labels: [__meta_docker_container_label_prometheus_port]
target_label: __address__
regex: (.+)
replacement: ${1}
PromQL Query Language
PromQL is a functional query language that allows you to select and aggregate time-series data in real-time. It's designed specifically for monitoring use cases.
Basic Queries
# Simple metric selection
http_requests_total
# Filter by labels
http_requests_total{method="GET"}
# Multiple label filters
http_requests_total{method="GET", status="200"}
# Regular expression matching
http_requests_total{handler=~"/api/.*"}
# Negative matching
http_requests_total{status!="200"}
# Time range selection
cpu_usage_seconds_total[5m]
Rate and Range Queries
# Rate of requests per second over 5 minutes
rate(http_requests_total[5m])
# Average CPU usage over 10 minutes
avg_over_time(cpu_usage_percent[10m])
# 95th percentile response time
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Increase function (total increase over time range)
increase(http_requests_total[1h])
# Delta function (difference between first and last values)
delta(memory_usage_bytes[10m])
Time Series Functions
# Rate function (per-second rate of a counter)
rate(cpu_usage_seconds_total[5m])
# Derivative function (per-second derivative of a gauge)
deriv(disk_usage_bytes[5m])
Aggregation Functions
# Sum across all series
sum(rate(http_requests_total[5m]))
# Sum by specific labels
sum by (job) (rate(http_requests_total[5m]))
# Average CPU usage per instance
avg by (instance) (rate(cpu_usage_seconds_total[5m]))
# Maximum memory usage
max(memory_usage_bytes)
# Count number of instances
count(up == 1)
# Top 5 instances by CPU usage
topk(5, rate(cpu_usage_seconds_total[5m]))
# Bottom 3 instances by memory
bottomk(3, memory_available_bytes)
Mathematical Operations
# Calculate CPU utilization percentage
100 - (avg by (instance) (rate(cpu_usage_seconds_total{mode="idle"}[5m])) * 100)
# Memory utilization percentage
(1 - (memory_available_bytes / memory_total_bytes)) * 100
# Network bandwidth utilization
rate(network_receive_bytes_total[5m]) + rate(network_transmit_bytes_total[5m])
# Disk space usage percentage
(1 - (filesystem_free_bytes / filesystem_size_bytes)) * 100
Advanced PromQL Examples
# Service availability over time
avg_over_time(up[1h])
# 99th percentile response time
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
# Error rate percentage
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100
# Predict disk full time (linear regression)
predict_linear(filesystem_free_bytes[1h], 3600 * 24)
# Detect when no 'up' samples at all have been received in the last 5 minutes
absent_over_time(up[5m])
# Container CPU throttling
rate(container_cpu_cfs_throttled_seconds_total[5m])
# Memory pressure detection
(container_memory_usage_bytes / container_spec_memory_limit_bytes) > 0.8
Alerting
Alert Rules Configuration
Create alert rules file (/etc/prometheus/rules/alerts.yml):
groups:
- name: infrastructure.rules
rules:
- alert: InstanceDown
expr: up == 0
for: 5m
labels:
severity: critical
team: infrastructure
annotations:
summary: "Instance {{ $labels.instance }} is down"
description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
runbook_url: "https://runbooks.example.com/InstanceDown"
- alert: HighCPUUsage
expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
for: 10m
labels:
severity: warning
team: infrastructure
annotations:
summary: "High CPU usage detected"
description: "CPU usage is above 80% on {{ $labels.instance }} for more than 10 minutes."
current_value: "{{ $value }}%"
- alert: HighMemoryUsage
expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
for: 5m
labels:
severity: warning
team: infrastructure
annotations:
summary: "High memory usage detected"
description: "Memory usage is above 85% on {{ $labels.instance }}."
current_value: "{{ $value }}%"
- alert: DiskSpaceLow
expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
for: 5m
labels:
severity: critical
team: infrastructure
annotations:
summary: "Low disk space"
description: "Disk space is below 10% on {{ $labels.instance }}."
current_value: "{{ $value }}%"
- name: application.rules
rules:
- alert: HighErrorRate
expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100 > 5
for: 5m
labels:
severity: critical
team: backend
annotations:
summary: "High error rate detected"
description: "Error rate is above 5% for {{ $labels.job }} service."
current_value: "{{ $value }}%"
- alert: HighResponseTime
expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
for: 10m
labels:
severity: warning
team: backend
annotations:
summary: "High response time"
description: "95th percentile response time is above 2 seconds for {{ $labels.job }}."
current_value: "{{ $value }}s"
- alert: ServiceUnavailable
expr: absent(up{job="critical-service"})
for: 1m
labels:
severity: critical
team: backend
annotations:
summary: "Critical service is not available"
description: "The critical-service job is not reporting any metrics."
Alertmanager Configuration
Create Alertmanager configuration (alertmanager.yml):
global:
smtp_smarthost: 'smtp.example.com:587'
smtp_from: 'alerts@example.com'
smtp_auth_username: 'alerts@example.com'
smtp_auth_password: 'password'
# Template files
templates:
- '/etc/alertmanager/templates/*.tmpl'
# Route tree
route:
group_by: ['alertname', 'cluster']
group_wait: 10s
group_interval: 10s
repeat_interval: 1h
receiver: 'default'
routes:
# Critical alerts go to PagerDuty and Slack
- match:
severity: critical
receiver: 'critical-alerts'
continue: true
# Infrastructure team alerts
- match:
team: infrastructure
receiver: 'infrastructure-team'
# Backend team alerts
- match:
team: backend
receiver: 'backend-team'
# Maintenance window silencing
- match:
alertname: 'MaintenanceMode'
receiver: 'null'
# Inhibit rules
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'instance']
# Notification receivers
receivers:
- name: 'default'
email_configs:
- to: 'admin@example.com'
subject: 'Prometheus Alert: {{ .GroupLabels.alertname }}'
body: |
{{ range .Alerts }}
Alert: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
Instance: {{ .Labels.instance }}
Severity: {{ .Labels.severity }}
{{ end }}
- name: 'critical-alerts'
pagerduty_configs:
- service_key: 'your-pagerduty-service-key'
description: '{{ .GroupLabels.alertname }}: {{ .GroupLabels.instance }}'
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
channel: '#critical-alerts'
title: 'Critical Alert: {{ .GroupLabels.alertname }}'
text: |
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Instance:* {{ .Labels.instance }}
*Severity:* {{ .Labels.severity }}
{{ end }}
- name: 'infrastructure-team'
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
channel: '#infrastructure'
username: 'Prometheus'
title: 'Infrastructure Alert'
text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
- name: 'backend-team'
email_configs:
- to: 'backend-team@example.com'
subject: 'Backend Service Alert'
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
channel: '#backend-alerts'
- name: 'null'
# Silent receiver for maintenance windows
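The file can be checked with amtool, which ships with Alertmanager; silences for maintenance windows can be managed the same way. A sketch, assuming Alertmanager is reachable at localhost:9093:
# Validate the Alertmanager configuration
amtool check-config alertmanager.yml
# Silence a specific alert for two hours during planned maintenance
amtool silence add alertname=HighCPUUsage --duration=2h --comment="planned maintenance" \
  --alertmanager.url=http://localhost:9093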
Storage
Local Storage Configuration
# Prometheus startup flags for storage configuration
command:
- '--storage.tsdb.path=/prometheus'
- '--storage.tsdb.retention.time=30d'
- '--storage.tsdb.retention.size=50GB'
- '--storage.tsdb.min-block-duration=2h'
- '--storage.tsdb.max-block-duration=36h'
- '--storage.tsdb.wal-compression'
Remote Storage Integration
# Remote write configuration
remote_write:
- url: "https://cortex.example.com/api/prom/push"
basic_auth:
username: "user"
password: "password"
write_relabel_configs:
- source_labels: [__name__]
regex: 'expensive_metric_.*'
action: drop
queue_config:
capacity: 10000
max_shards: 200
min_shards: 1
max_samples_per_send: 5000
batch_send_deadline: 5s
# Remote read configuration
remote_read:
- url: "https://cortex.example.com/api/prom/read"
basic_auth:
username: "user"
password: "password"
read_recent: true
Storage Optimization
# Recording rules for pre-aggregation
groups:
- name: recording.rules
interval: 30s
rules:
# Pre-calculate CPU usage
- record: instance:cpu_usage:rate5m
expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Pre-calculate memory usage
- record: instance:memory_usage:percentage
expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Pre-calculate disk usage
- record: instance:disk_usage:percentage
expr: (1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100
# Application-specific recording rules
- record: job:http_requests:rate5m
expr: sum by (job) (rate(http_requests_total[5m]))
- record: job:http_request_duration:p95
expr: histogram_quantile(0.95, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
Security
Basic Authentication
# Scrape a target that requires basic authentication
scrape_configs:
- job_name: 'secure-app'
static_configs:
- targets: ['app:8080']
basic_auth:
username: 'prometheus'
password: 'secure_password'
scheme: https
tls_config:
ca_file: /etc/prometheus/ca.pem
cert_file: /etc/prometheus/prometheus.pem
key_file: /etc/prometheus/prometheus-key.pem
insecure_skip_verify: false
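The snippet above authenticates Prometheus against its targets; protecting Prometheus' own HTTP endpoints is done with a separate web configuration file passed via --web.config.file. A sketch, where the value must be a bcrypt hash of the password (generated, for example, with htpasswd -nBC 10 admin):
# web-config.yml, used as: prometheus --web.config.file=web-config.yml
basic_auth_users:
  admin: <bcrypt-hash-of-password>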
TLS Configuration
# TLS configuration for secure communication
scrape_configs:
- job_name: 'tls-enabled-service'
static_configs:
- targets: ['secure-service:8443']
scheme: https
tls_config:
ca_file: /etc/prometheus/certs/ca.crt
cert_file: /etc/prometheus/certs/prometheus.crt
key_file: /etc/prometheus/certs/prometheus.key
server_name: secure-service.example.com
insecure_skip_verify: false
Network Security
# Network policies for Kubernetes
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: prometheus-network-policy
namespace: monitoring
spec:
podSelector:
matchLabels:
app: prometheus
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
name: monitoring
- podSelector:
matchLabels:
app: grafana
ports:
- protocol: TCP
port: 9090
egress:
- to: []
ports:
- protocol: TCP
port: 53
- protocol: UDP
port: 53
- to:
- namespaceSelector: {}
ports:
- protocol: TCP
port: 9100 # Node exporter
- protocol: TCP
port: 8080 # cAdvisor
RBAC Configuration
# ServiceAccount for Prometheus
apiVersion: v1
kind: ServiceAccount
metadata:
name: prometheus
namespace: monitoring
---
# ClusterRole with necessary permissions
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: prometheus
rules:
- apiGroups: [""]
resources:
- nodes
- nodes/proxy
- services
- endpoints
- pods
verbs: ["get", "list", "watch"]
- apiGroups:
- networking.k8s.io
resources:
- ingresses
verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
verbs: ["get"]
---
# ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: prometheus
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: prometheus
subjects:
- kind: ServiceAccount
name: prometheus
namespace: monitoring
Container Monitoring
Docker Container Metrics
Key metrics for container monitoring:
# Container CPU usage
rate(container_cpu_usage_seconds_total[5m])
# Container memory usage
container_memory_usage_bytes
# Container memory limit
container_spec_memory_limit_bytes
# Memory utilization percentage
(container_memory_usage_bytes / container_spec_memory_limit_bytes) * 100
# Network I/O
rate(container_network_receive_bytes_total[5m])
rate(container_network_transmit_bytes_total[5m])
# Disk I/O
rate(container_fs_reads_bytes_total[5m])
rate(container_fs_writes_bytes_total[5m])
# Container restart count
increase(container_restart_count[1h])
Kubernetes Monitoring
# Pod CPU usage
sum by (pod) (rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[5m]))
# Pod memory usage
sum by (pod) (container_memory_usage_bytes{container!="POD",container!=""})
# Node resource utilization
(1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) * 100
# Cluster capacity
sum(node_memory_MemTotal_bytes) / 1024 / 1024 / 1024
# Pod status
kube_pod_status_phase{phase!="Running"}
# Deployment replica status
kube_deployment_status_replicas_available / kube_deployment_spec_replicas
# Persistent volume usage
(kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) * 100
Application Instrumentation
Example Go application with Prometheus metrics:
package main
import (
"net/http"
"time"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promauto"
"github.com/prometheus/client_golang/prometheus/promhttp"
)
var (
httpRequestsTotal = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: "http_requests_total",
Help: "The total number of processed HTTP requests",
},
[]string{"method", "endpoint", "status_code"},
)
httpRequestDuration = promauto.NewHistogramVec(
prometheus.HistogramOpts{
Name: "http_request_duration_seconds",
Help: "The HTTP request latencies in seconds",
Buckets: prometheus.DefBuckets,
},
[]string{"method", "endpoint"},
)
activeConnections = promauto.NewGauge(
prometheus.GaugeOpts{
Name: "active_connections",
Help: "The current number of active connections",
},
)
)
func instrumentHandler(endpoint string, handler http.HandlerFunc) http.HandlerFunc {
return func(w http.ResponseWriter, r *http.Request) {
start := time.Now()
activeConnections.Inc()
defer activeConnections.Dec()
handler(w, r)
duration := time.Since(start).Seconds()
httpRequestDuration.WithLabelValues(r.Method, endpoint).Observe(duration)
httpRequestsTotal.WithLabelValues(r.Method, endpoint, "200").Inc()
}
}
// Example handlers referenced below
func homeHandler(w http.ResponseWriter, r *http.Request) {
	w.Write([]byte("hello"))
}

func healthHandler(w http.ResponseWriter, r *http.Request) {
	w.Write([]byte("ok"))
}

func main() {
	http.Handle("/metrics", promhttp.Handler())
	http.HandleFunc("/", instrumentHandler("/", homeHandler))
	http.HandleFunc("/api/health", instrumentHandler("/api/health", healthHandler))
	http.ListenAndServe(":8080", nil)
}
Best Practices
Configuration Management
- Version Control: Store all configuration files in version control
- Environment Separation: Use different configurations for different environments
- Validation: Validate configuration syntax before deployment
- Documentation: Document custom metrics and alert rules
- Standardization: Use consistent naming conventions and labels
Metric Design
- Naming Conventions: Use clear, consistent metric names
- Label Usage: Use labels wisely - avoid high cardinality
- Metric Types: Choose appropriate metric types (counter, gauge, histogram, summary)
- Documentation: Include help text for all custom metrics
- Aggregation: Design metrics for efficient aggregation
Query Optimization
# Good: Efficient query with specific labels
rate(http_requests_total{job="api", method="GET"}[5m])
# Avoid: High cardinality queries
rate(http_requests_total{user_id=~".*"}[5m])
# Good: Use recording rules for expensive queries
instance:cpu_usage:rate5m
# Avoid: Complex calculations in dashboards
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Alert Design
- Clear Criteria: Define specific, actionable alert conditions
- Appropriate Thresholds: Set realistic thresholds based on historical data
- Timing: Use appropriate "for" clauses to avoid flapping
- Context: Include relevant context in alert annotations
- Escalation: Design multi-level alert escalation
Storage Management
- Retention Policies: Set appropriate retention based on requirements
- Disk Space: Monitor disk usage and set size limits
- Backup Strategy: Implement regular backup procedures (see the snapshot example after this list)
- Compaction: Understand TSDB compaction behavior
- Remote Storage: Consider remote storage for long-term retention
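Consistent backups can be taken through the snapshot endpoint of the admin API; a minimal sketch, assuming Prometheus was started with --web.enable-admin-api and stores its data under /prometheus:
# Trigger a TSDB snapshot; the response contains the snapshot's directory name
curl -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot
# Snapshots are written under <data-dir>/snapshots/ and can be copied off-host
ls /prometheus/snapshots/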
High Availability
# Prometheus HA setup: two independent replicas scraping the same targets
version: '3.8'
services:
prometheus-1:
image: prom/prometheus:latest
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--web.enable-lifecycle'
- '--web.external-url=http://prometheus-1:9090'
- '--storage.tsdb.min-block-duration=2h'
- '--storage.tsdb.max-block-duration=36h'
    # external_labels (cluster: 'production', replica: '1') must be set in this
    # replica's prometheus.yml under "global", not in the Compose file
prometheus-2:
image: prom/prometheus:latest
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--web.enable-lifecycle'
- '--web.external-url=http://prometheus-2:9090'
- '--storage.tsdb.min-block-duration=2h'
- '--storage.tsdb.max-block-duration=36h'
    # external_labels (cluster: 'production', replica: '2') must be set in this
    # replica's prometheus.yml under "global", not in the Compose file
Troubleshooting
Common Issues
Target Discovery Issues
# Check service discovery
curl http://prometheus:9090/api/v1/targets
# Verify DNS resolution
nslookup target-service
# Test connectivity
telnet target-service 9100
# Check logs
docker logs prometheus 2>&1 | grep -i "discovery\|scrape"
Query Performance
# Enable query logging by setting query_log_file under "global" in prometheus.yml
#   query_log_file: /var/log/prometheus_queries.log
# Monitor query performance
tail -f /var/log/prometheus_queries.log | grep -E "took|ms"
# Check for expensive queries
curl "http://prometheus:9090/api/v1/query?query=topk(10,count by (__name__)({__name__=~\".+\"}))"
Storage Issues
# Check TSDB stats
curl http://prometheus:9090/api/v1/status/tsdb
# Monitor disk usage
df -h /prometheus
# Check WAL status
ls -la /prometheus/wal/
# Verify block integrity
promtool tsdb analyze /prometheus
Memory Issues
# Monitor memory usage
ps aux | grep prometheus
# Check for memory leaks
curl http://prometheus:9090/debug/pprof/heap
# Optimize memory usage
--storage.tsdb.head-chunks-write-queue-size=10000
--query.max-concurrency=20
--query.max-samples=50000000
Debugging Tools
# Validate configuration
promtool check config prometheus.yml
# Check rules syntax
promtool check rules /etc/prometheus/rules/*.yml
# Query Prometheus API
curl "http://prometheus:9090/api/v1/query?query=up"
# Test alerts
curl "http://prometheus:9090/api/v1/alerts"
# Check metrics metadata
curl "http://prometheus:9090/api/v1/metadata"
Log Analysis
# Important log patterns
tail -f /var/log/prometheus.log | grep -E "(error|Error|ERROR)"
tail -f /var/log/prometheus.log | grep -E "(scrape|discovery)"
tail -f /var/log/prometheus.log | grep -E "(rule|alert)"
tail -f /var/log/prometheus.log | grep -E "(storage|tsdb)"
Resources
Official Documentation
- Prometheus Official Documentation
- PromQL Documentation
- Alertmanager Documentation
- Configuration Reference
Exporters
- Node Exporter - Hardware and OS metrics
- Blackbox Exporter - HTTP, DNS, TCP probing
- MySQL Exporter - MySQL metrics
- PostgreSQL Exporter - PostgreSQL metrics
- Redis Exporter - Redis metrics
- NGINX Exporter - NGINX metrics
Application Monitoring
Instrument your applications to expose custom metrics for comprehensive monitoring.
Go Application Example
package main
import (
"net/http"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promhttp"
)
var (
httpRequestsTotal = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "http_requests_total",
Help: "Total number of HTTP requests",
},
[]string{"method", "endpoint", "status"},
)
httpRequestDuration = prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Name: "http_request_duration_seconds",
Help: "Duration of HTTP requests in seconds",
Buckets: prometheus.DefBuckets,
},
[]string{"method", "endpoint"},
)
)
func init() {
prometheus.MustRegister(httpRequestsTotal)
prometheus.MustRegister(httpRequestDuration)
}
func instrumentedHandler(w http.ResponseWriter, r *http.Request) {
timer := prometheus.NewTimer(httpRequestDuration.WithLabelValues(r.Method, r.URL.Path))
defer timer.ObserveDuration()
// Your application logic here
w.WriteHeader(http.StatusOK)
httpRequestsTotal.WithLabelValues(r.Method, r.URL.Path, "200").Inc()
}
func main() {
http.Handle("/metrics", promhttp.Handler())
http.HandleFunc("/", instrumentedHandler)
http.ListenAndServe(":8080", nil)
}
Key Application Metrics
- Request rate: http_requests_total
- Request duration: http_request_duration_seconds
- Active connections: active_connections
- Queue depth: queue_depth
- Business metrics: Custom counters/gauges for domain-specific events
Integration with Grafana
Important
While Prometheus provides basic graphing capabilities, Grafana is the preferred tool for creating rich, interactive dashboards.
Setting up Grafana with Prometheus
# Run Grafana with Docker
docker run -d -p 3000:3000 --name grafana grafana/grafana
# Or with Docker Compose
version: '3'
services:
grafana:
image: grafana/grafana
ports:
- "3000:3000"
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin
volumes:
- grafana-storage:/var/lib/grafana
volumes:
grafana-storage:
Adding Prometheus as Data Source
- Access Grafana at http://localhost:3000 (admin/admin)
- Go to Configuration → Data Sources
- Click Add data source and select Prometheus
- Set URL to http://prometheus:9090 (or your Prometheus URL)
- Click Save & Test
Sample Dashboard Queries
# CPU Usage Panel
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory Usage Panel
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Request Rate Panel
rate(http_requests_total[5m])
# Response Time Percentiles
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))
Advanced Monitoring Patterns
Recording Rules for Performance
# /etc/prometheus/rules/recording.yml
groups:
- name: performance_rules
interval: 30s
rules:
- record: node:cpu_utilization:rate5m
expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
- record: node:memory_utilization:ratio
expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))
- record: instance:request_rate:rate5m
expr: rate(http_requests_total[5m])
Metric Naming Best Practices
Tip
Follow Prometheus naming conventions for consistency and clarity across your monitoring infrastructure.
Naming Guidelines:
- Use snake_case: http_requests_total
- Include units in the name: process_cpu_seconds_total
- Use descriptive, unambiguous names: database_connection_pool_size
- End counters with _total: api_requests_total
- End gauges with descriptive units: memory_usage_bytes
Examples:
# Good naming
http_requests_total{method="GET", status="200"}
database_connections_active
memory_usage_bytes
disk_write_bytes_total
# Avoid these patterns
requests # Too generic, missing _total
db_conn # Abbreviated, unclear
mem # Too short, no units
Monitoring Prometheus Self-Health
Essential self-monitoring queries:
# Prometheus health
prometheus_config_last_reload_successful
prometheus_tsdb_reloads_total
prometheus_build_info
# Performance metrics
rate(prometheus_http_requests_total[5m])
prometheus_tsdb_head_samples_appended_total
prometheus_tsdb_compaction_duration_seconds
# Storage metrics
prometheus_tsdb_symbol_table_size_bytes
prometheus_tsdb_head_series
prometheus_tsdb_retention_limit_bytes
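These metrics can back a simple self-monitoring alert; a sketch using the same rule-file format shown earlier:
groups:
  - name: prometheus-self.rules
    rules:
      - alert: PrometheusConfigReloadFailed
        expr: prometheus_config_last_reload_successful == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus on {{ $labels.instance }} failed to reload its configuration"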