Prometheus is an open-source monitoring and alerting system originally developed by SoundCloud in 2012. It has become the de facto standard for monitoring cloud-native applications and is a graduated project of the Cloud Native Computing Foundation (CNCF).
Note
Prometheus is designed for reliability and scalability, making it ideal for monitoring microservices, containers, and distributed systems in dynamic environments.
What is Prometheus?
Prometheus is a time-series database and monitoring system that collects metrics from configured targets at given intervals, evaluates rule expressions, displays results, and can trigger alerts when specified conditions are met.
Key Characteristics
- Pull-based model: Prometheus scrapes metrics from HTTP endpoints (see the sample scrape output after this list)
- Time-series data: All data is stored as time-series with timestamps
- PromQL: Powerful functional query language for data analysis
- Multi-dimensional data: Metrics can have multiple labels for flexible querying
- No dependencies: Single binary with local storage
- Service discovery: Automatic discovery of monitoring targets
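Because of the pull model, a target only needs to serve its current metric values as plain text over HTTP. A GET against an instrumented service's /metrics endpoint returns lines in the Prometheus exposition format, roughly like the following (values are illustrative; the first metric matches the example used later in this article):
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1027
http_requests_total{method="POST",status="500"} 3
Prometheus fetches this output on every scrape interval and stores each line as a sample in its time-series database.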
Core Components
Prometheus Server
The main component that scrapes and stores time-series data, and serves queries via PromQL.
Client Libraries
Libraries for instrumenting application code in various programming languages (Go, Java, Python, .NET, etc.).
Pushgateway
Allows ephemeral and batch jobs to push metrics to Prometheus.
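For example, a short-lived batch job cannot wait around to be scraped, so it pushes its final metrics to the Pushgateway instead. The sketch below uses the push package from the same client_golang library shown later in this article; the Pushgateway address and job name are assumptions for illustration.
package main

import (
    "log"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/push"
)

func main() {
    // Record when this (hypothetical) backup job finished.
    completionTime := prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "db_backup_last_completion_timestamp_seconds",
        Help: "Timestamp of the last successful DB backup.",
    })
    completionTime.SetToCurrentTime()

    // Push the metric to an assumed Pushgateway at localhost:9091
    // under the job name "db_backup".
    if err := push.New("http://localhost:9091", "db_backup").
        Collector(completionTime).
        Push(); err != nil {
        log.Fatal("could not push to Pushgateway: ", err)
    }
}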
Exporters
Third-party tools that export metrics from existing systems (databases, hardware, messaging systems, etc.).
Alertmanager
Handles alerts sent by Prometheus server and routes them to various notification channels.
How Prometheus Works
1. Metrics Collection
Prometheus uses a pull model to collect metrics:
# prometheus.yml configuration
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'application'
    static_configs:
      - targets: ['app1:8080', 'app2:8080']
2. Data Model
Prometheus stores all data as time-series, identified by:
- Metric name: Describes the feature being measured
- Labels: Key-value pairs for multi-dimensional data
- Timestamp: When the measurement was taken
- Value: The numeric measurement
Example metric:
http_requests_total{method="GET", handler="/api/users", status="200"} 1027
3. Metric Types
Counter
Cumulative metric that only increases (or resets to zero):
http_requests_total
process_cpu_seconds_total
Gauge
Metric that can go up and down:
memory_usage_bytes
cpu_temperature_celsius
active_connections
Histogram
Samples observations and counts them in configurable buckets:
http_request_duration_seconds
response_size_bytes
Summary
Similar to a histogram, but calculates configurable quantiles on the client side over a sliding time window:
request_duration_seconds{quantile="0.5"}
request_duration_seconds{quantile="0.9"}
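To make the four types concrete, here is a hedged Go sketch (using the same client_golang library as the application example later in this article) that declares and updates a gauge, a histogram, and a summary; a counter is shown in the full application example below. Metric names and bucket/quantile choices are illustrative.
package main

import (
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    // Gauge: a current value that can go up and down.
    activeConnections = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "active_connections",
        Help: "Number of currently active connections.",
    })

    // Histogram: observations counted into fixed buckets.
    requestDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "HTTP request latency in seconds.",
        Buckets: prometheus.DefBuckets,
    })

    // Summary: client-side quantiles over a sliding window.
    responseSize = prometheus.NewSummary(prometheus.SummaryOpts{
        Name:       "response_size_bytes",
        Help:       "HTTP response size in bytes.",
        Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
    })
)

func main() {
    prometheus.MustRegister(activeConnections, requestDuration, responseSize)

    http.Handle("/metrics", promhttp.Handler())
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        activeConnections.Inc()
        defer activeConnections.Dec()

        start := time.Now()
        // ... application logic would run here ...
        requestDuration.Observe(time.Since(start).Seconds())
        responseSize.Observe(128) // illustrative response size
    })
    http.ListenAndServe(":8080", nil)
}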
PromQL - Prometheus Query Language
PromQL is a functional query language that allows you to select and aggregate time-series data.
Basic Queries
# Select all time series with a specific metric name
http_requests_total
# Filter by labels
http_requests_total{method="GET"}
# Multiple label filters
http_requests_total{method="GET", status="200"}
# Regular expression matching
http_requests_total{handler=~"/api/.*"}
Rate and Range Queries
# Rate of requests per second over 5 minutes
rate(http_requests_total[5m])
# Average CPU usage over 10 minutes
avg_over_time(cpu_usage_percent[10m])
# 95th percentile response time
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
Aggregation
# Sum of requests across all instances
sum(http_requests_total)
# Average memory usage by job
avg by (job) (memory_usage_bytes)
# Top 5 endpoints by request rate
topk(5, rate(http_requests_total[5m]))
Setting Up Prometheus
Installation Options
Docker
# Run Prometheus in Docker
docker run -p 9090:9090 \
-v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
prom/prometheus
Binary Installation
# Download and extract
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xvfz prometheus-*.tar.gz
cd prometheus-*
# Run Prometheus
./prometheus --config.file=prometheus.yml
Kubernetes (Helm)
# Add Prometheus Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install Prometheus
helm install prometheus prometheus-community/kube-prometheus-stack
Basic Configuration
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']
    scrape_interval: 30s
    metrics_path: /metrics
Common Use Cases
Infrastructure Monitoring
Tip
Use Node Exporter to monitor Linux/Unix system metrics like CPU, memory, disk, and network.
# Install Node Exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.0/node_exporter-1.6.0.linux-amd64.tar.gz
tar xvfz node_exporter-*.tar.gz
cd node_exporter-*
./node_exporter
Key infrastructure metrics:
- CPU usage: node_cpu_seconds_total
- Memory usage: node_memory_MemAvailable_bytes
- Disk usage: node_filesystem_avail_bytes
- Network traffic: node_network_receive_bytes_total
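The raw counters and gauges above are usually combined into percentages with PromQL. The queries below follow the same patterns used in the alert rules later in this article; adjust label filters (for example on mountpoint or device) to your environment:
# CPU usage percentage per instance
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
# Disk usage percentage per filesystem
100 * (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes)
# Network receive throughput in bytes per second
rate(node_network_receive_bytes_total[5m])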
Application Monitoring
Instrument your applications to expose custom metrics:
Go Example
package main

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    httpRequestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests",
        },
        []string{"method", "endpoint", "status"},
    )
)

func init() {
    prometheus.MustRegister(httpRequestsTotal)
}

func handler(w http.ResponseWriter, r *http.Request) {
    // Your application logic here
    httpRequestsTotal.WithLabelValues(r.Method, r.URL.Path, "200").Inc()
}

func main() {
    http.Handle("/metrics", promhttp.Handler())
    http.HandleFunc("/", handler)
    http.ListenAndServe(":8080", nil)
}
Container Monitoring
Monitor Docker containers and Kubernetes:
# Docker Compose with Prometheus
version: '3'
services:
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  cadvisor:
    image: google/cadvisor
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
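cAdvisor only exposes container metrics; Prometheus still has to be told to scrape it. A hedged addition to the prometheus.yml used by this Compose file is shown below, assuming both containers share the default Compose network so the cadvisor service name resolves. (Note that the google/cadvisor image is no longer updated; newer releases are published as gcr.io/cadvisor/cadvisor.)
scrape_configs:
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']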
Alerting with Prometheus
Alert Rules
# alert_rules.yml
groups:
  - name: example_alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
          description: "CPU usage is above 80% for more than 5 minutes"

      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
          description: "{{ $labels.instance }} has been down for more than 1 minute"

      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Memory usage is above 90%"
Alertmanager Configuration
# alertmanager.yml
global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alerts@company.com'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'email'

receivers:
  - name: 'email'
    email_configs:
      - to: 'admin@company.com'
        headers:
          subject: 'Prometheus Alert'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}
Integration with Grafana
Important
While Prometheus provides basic graphing capabilities, Grafana is the preferred tool for creating rich, interactive dashboards.
Setting up Grafana with Prometheus
# Run Grafana
docker run -d -p 3000:3000 grafana/grafana
# Access Grafana at http://localhost:3000
# Default credentials: admin/admin
Adding Prometheus as Data Source
- Go to Configuration → Data Sources
- Add Prometheus data source
- Set URL to http://prometheus:9090
- Click "Save & Test"
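Instead of clicking through the UI, the data source can also be provisioned from a file. This is a hedged sketch of Grafana's data source provisioning format; the file path and data source name are assumptions:
# /etc/grafana/provisioning/datasources/prometheus.yml (assumed path)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true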
Best Practices
Metric Naming
Tip
Follow Prometheus naming conventions for consistency and clarity.
- Use snake_case: http_requests_total
- Include units: process_cpu_seconds_total
- Use descriptive names: database_connection_pool_size
Label Usage
- Keep labels finite and low-cardinality
- Avoid putting IDs or high-cardinality data in labels
- Use consistent label names across metrics
# Good
http_requests_total{method="GET", status="200"}
# Avoid - high cardinality
http_requests_total{user_id="12345", session_id="abc123"}
Performance Optimization
- Set appropriate scrape intervals (15s-1m for most cases)
- Use recording rules for expensive queries (see the sketch after this list)
- Monitor Prometheus itself
- Consider federation for large deployments
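A recording rule precomputes an expensive expression on a schedule and stores the result as a new series, so dashboards and alerts can query the cheap precomputed metric instead. The sketch below uses illustrative file, group, and metric names; the file must be listed under rule_files in prometheus.yml:
# recording_rules.yml
groups:
  - name: http_recording_rules
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))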
Troubleshooting
Common Issues
Warning
High cardinality metrics can cause memory issues and slow queries. Monitor your metrics carefully.
- High memory usage: Check for high-cardinality metrics
- Slow queries: Enable the query log (query_log_file) to identify expensive queries
- Missing metrics: Verify target discovery and scrape configs
- Alert fatigue: Fine-tune alert thresholds and conditions
Monitoring Prometheus
# Prometheus self-monitoring queries
prometheus_tsdb_symbol_table_size_bytes
prometheus_config_last_reload_successful
rate(prometheus_http_requests_total[5m])
prometheus_notifications_total
Next Steps
- Learn advanced PromQL functions and operators
- Explore Grafana dashboard creation
- Study alerting best practices
- Set up high availability Prometheus