Grafana and Prometheus Monitoring Stack

This guide covers deploying a production-ready monitoring stack with Grafana and Prometheus, including system metrics, container monitoring, network device monitoring, alerting, and high availability.

Overview

Prometheus scrapes and stores time-series metrics and evaluates alert rules; Grafana queries that data to build dashboards. Together they form an open-source monitoring stack for infrastructure, applications, and services.

Why This Stack?

Open Source: No licensing costs, community-driven development
Scalable: A single Prometheus server can handle millions of active time series
Flexible: Extensive ecosystem of exporters and integrations
Powerful Query Language: PromQL for complex metric analysis
Active Community: Large user base and extensive documentation
Cloud Native: Kubernetes-native with Helm charts and operators

Key Components

Prometheus: Time-series database and monitoring system
Grafana: Visualization and dashboarding platform
Alertmanager: Alert routing and notification management
Node Exporter: System metrics collection (CPU, memory, disk, network)
cAdvisor: Container metrics and resource usage
Blackbox Exporter: Endpoint monitoring and network probing
Exporters: Application-specific and database metrics
Thanos/Cortex: Long-term storage and high availability (optional)

Production Considerations

High Availability: Deploy multiple Prometheus instances with Thanos or federation
Security: Implement TLS/mTLS, authentication, and secrets management
Scalability: Use remote storage, recording rules, and efficient scrape configs
Backup: Automated backup procedures for Prometheus and Grafana data
Self-Monitoring: Scrape and alert on the monitoring stack's own metrics
Performance: Optimize retention, cardinality, and query performance
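
The recording rules mentioned under Scalability can be sketched as follows. This is a minimal example, not a tuned production rule set: the file name node_recording_rules.yml, the group name, and the rule name are illustrative assumptions, and the expression assumes Node Exporter's node_cpu_seconds_total metric is being scraped. The file is validated only if promtool (shipped with Prometheus) is on PATH.

```shell
# Write an example recording-rules file (file and rule names are illustrative).
cat > node_recording_rules.yml <<'EOF'
groups:
  - name: node_recording_rules
    interval: 1m
    rules:
      # Precompute per-instance CPU utilisation (0..1) from Node Exporter counters.
      - record: instance:node_cpu_utilisation:rate5m
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
EOF

# Validate the file if promtool is available; skip silently otherwise.
if command -v promtool >/dev/null 2>&1; then
  promtool check rules node_recording_rules.yml
fi
```

Prometheus only loads the file once it is listed under rule_files in prometheus.yml. Dashboards can then query the precomputed series instead of re-evaluating the rate() expression on every refresh.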

Quick Start

# Clone example configuration
git clone https://github.com/example/prometheus-stack-example
cd prometheus-stack-example

# Review and customize .env file
cp .env.example .env
vim .env

# Deploy stack (Compose v2 syntax; use "docker-compose" on older installs)
docker compose up -d

# Access services
# Grafana: http://localhost:3000 (see .env for credentials)
# Prometheus: http://localhost:9090
# Alertmanager: http://localhost:9093

Warning: This quick start uses default configurations that are unsuitable for production. Apply the Security Configuration guide before exposing any of these services to a network.

Documentation Structure

For comprehensive configuration details, see:

Installation Guide: Docker Compose setup, native installation, secrets management
Configuration Guide: Prometheus scrape configs, service discovery, recording rules, Grafana provisioning
Security Configuration: TLS/mTLS, authentication, secrets management, network security
Exporters Guide: Node Exporter, cAdvisor, Blackbox, database exporters, custom metrics
Alerting Guide: Alert rules, Alertmanager, notification channels, runbooks
High Availability: HA architecture, Thanos, Grafana clustering, federation
Backup and Recovery: Automated backups, restore procedures, disaster recovery

Quick Example: Node Exporter

Basic Node Exporter installation:

# Download and install Node Exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar -zxvf node_exporter-1.7.0.linux-amd64.tar.gz
sudo cp node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/

# Create the unprivileged system user that the service runs as
sudo useradd --system --no-create-home --shell /usr/sbin/nologin node_exporter

# Create service (quoted 'EOF' keeps the unit file literal)
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<'EOF'
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

# Start service
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter

Add to Prometheus configuration:

scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']
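
After editing the scrape configuration, it is worth validating the file before Prometheus picks it up. The helpers below are a hedged sketch: the function names and the default config path /etc/prometheus/prometheus.yml are assumptions, and the reload call only works if Prometheus was started with --web.enable-lifecycle (otherwise send SIGHUP to the process or restart it).

```shell
# Path is an assumption; override PROM_CONFIG to match your installation.
PROM_CONFIG="${PROM_CONFIG:-/etc/prometheus/prometheus.yml}"

# Validate the configuration with promtool (shipped with Prometheus).
check_config() {
  if ! command -v promtool >/dev/null 2>&1; then
    echo "promtool not found on PATH" >&2
    return 127
  fi
  promtool check config "$PROM_CONFIG"
}

# Ask a running Prometheus to reload its configuration.
# Requires Prometheus to be started with --web.enable-lifecycle.
reload_prometheus() {
  curl -fsS -X POST http://localhost:9090/-/reload
}
```

A typical invocation would be: check_config && reload_prometheus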

Recommended Grafana Dashboards

Popular dashboards for import:

Node Exporter Full (ID: 1860) - System metrics
Docker Container & Host Metrics (ID: 10619) - Container monitoring
UniFi Poller (ID: 11315) - UniFi network devices
Blackbox Exporter (ID: 13659) - Network probing

Quick Troubleshooting

Check Prometheus targets:

curl http://localhost:9090/api/v1/targets
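
The raw targets response is verbose, so a small filter helps when checking it. The function below is a sketch that uses only grep, wc, and tr; the name count_targets is hypothetical, and it assumes the API returns compact JSON with a "health" field per active target (values "up", "down", or "unknown"). A JSON-aware tool is more robust if one is installed.

```shell
# Count targets in a given health state ("up", "down", or "unknown").
# Reads a /api/v1/targets JSON response on stdin.
count_targets() {
  grep -o "\"health\":\"$1\"" | wc -l | tr -d ' '
}
```

Example: curl -s http://localhost:9090/api/v1/targets | count_targets down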

Verify metrics collection:

curl http://localhost:9100/metrics
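
The metrics endpoint returns every series the exporter exposes, so it is often easier to look up a single value. The helper below is a minimal sketch: the name metric_value is hypothetical, and it only matches metrics without labels (such as node_load1), since labelled series carry the labels as part of the first field.

```shell
# Print the value of an unlabelled metric from Prometheus text exposition
# format read on stdin. Comment lines (# HELP, # TYPE) never match.
metric_value() {
  awk -v name="$1" '$1 == name { print $2 }'
}
```

Example: curl -s http://localhost:9100/metrics | metric_value node_load1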

Test Grafana connection (the health endpoint does not require credentials):

curl http://localhost:3000/api/health

Best Practices Summary

Security: Enable TLS, use strong passwords, implement authentication
Performance: Use recording rules, set appropriate retention, optimize queries
Reliability: Implement HA, set up backups, monitor the monitoring stack
Alerting: Create meaningful alerts, avoid alert fatigue, document runbooks
Maintenance: Keep components updated, review configurations regularly
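
As an illustration of the alerting practices above, the sketch below writes one alert rule that fires when a Node Exporter target has been unreachable for five minutes. The file name, alert name, severity label, and runbook URL are illustrative assumptions; the job label matches the node-exporter scrape job shown earlier in this guide.

```shell
# Write an example alert-rules file (names and runbook URL are placeholders).
cat > node_alert_rules.yml <<'EOF'
groups:
  - name: node_alerts
    rules:
      - alert: NodeExporterDown
        expr: up{job="node-exporter"} == 0
        # Wait 5 minutes before firing to avoid alerting on brief blips.
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node Exporter on {{ $labels.instance }} is unreachable"
          runbook_url: "https://example.com/runbooks/node-exporter-down"
EOF

# Validate the file if promtool is available; skip silently otherwise.
if command -v promtool >/dev/null 2>&1; then
  promtool check rules node_alert_rules.yml
fi
```

The for clause and the runbook annotation are two simple ways to reduce alert fatigue: short outages never page anyone, and every page links to a documented response.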

For detailed information on each topic, refer to the specific guides listed above.

Next Steps

Install: Follow the Installation Guide to set up the stack
Configure: Review the Configuration Guide for advanced setups
Secure: Implement the recommendations in Security Configuration
Monitor: Add exporters using the Exporters Guide
Alert: Set up alerting with the Alerting Guide
Scale: Implement High Availability for production
Protect: Configure Backup and Recovery procedures
