This guide covers deploying a production-ready monitoring stack with Grafana and Prometheus, including system metrics, container monitoring, network device monitoring, alerting, and high availability.
Overview
Prometheus scrapes and stores time-series metrics, and Grafana queries them to build dashboards. Together they form an open-source monitoring stack for infrastructure, applications, and services.
Why This Stack?
- Open Source: No licensing costs, community-driven development
- Scalable: Handles millions of time-series metrics efficiently
- Flexible: Extensive ecosystem of exporters and integrations
- Powerful Query Language: PromQL for complex metric analysis
- Active Community: Large community, extensive documentation
- Cloud Native: Kubernetes-native with Helm charts and operators
Key Components
- Prometheus: Time-series database and monitoring system
- Grafana: Visualization and dashboarding platform
- Alertmanager: Alert routing and notification management
- Node Exporter: System metrics collection (CPU, memory, disk, network)
- cAdvisor: Container metrics and resource usage
- Blackbox Exporter: Endpoint monitoring and network probing
- Exporters: Application-specific and database metrics
- Thanos/Cortex: Long-term storage and high availability (optional)
Production Considerations
- High Availability: Deploy multiple Prometheus instances with Thanos or federation
- Security: Implement TLS/mTLS, authentication, and secrets management
- Scalability: Use remote storage, recording rules, and efficient scrape configs
- Backup: Automated backup procedures for Prometheus and Grafana data
- Self-Monitoring: Scrape and alert on the monitoring stack's own components (Prometheus, Alertmanager, Grafana)
- Performance: Optimize retention, cardinality, and query performance
Quick Start
# Clone example configuration
git clone https://github.com/example/prometheus-stack-example
cd prometheus-stack-example
# Review and customize .env file
cp .env.example .env
vim .env
# Deploy stack (Compose v2 plugin syntax; use "docker-compose up -d" with the legacy v1 binary)
docker compose up -d
# Access services
# Grafana: http://localhost:3000 (see .env for credentials)
# Prometheus: http://localhost:9090
# Alertmanager: http://localhost:9093
Warning: This quick start uses default configurations that are unsuitable for production. Follow the Security Configuration guide before exposing any of these services to a network.
Documentation Structure
For comprehensive configuration details, see:
- Installation Guide: Docker Compose setup, native installation, secrets management
- Configuration Guide: Prometheus scrape configs, service discovery, recording rules, Grafana provisioning
- Security Configuration: TLS/mTLS, authentication, secrets management, network security
- Exporters Guide: Node Exporter, cAdvisor, Blackbox, database exporters, custom metrics
- Alerting Guide: Alert rules, Alertmanager, notification channels, runbooks
- High Availability: HA architecture, Thanos, Grafana clustering, federation
- Backup and Recovery: Automated backups, restore procedures, disaster recovery
Quick Example: Node Exporter
Basic Node Exporter installation:
# Download and install Node Exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar -zxvf node_exporter-1.7.0.linux-amd64.tar.gz
sudo cp node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/
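# Create the unprivileged system user that the service runs as
sudo useradd --system --no-create-home --shell /bin/false node_exporter
# Create service
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<EOF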
[Unit]
Description=Node Exporter
[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter
[Install]
WantedBy=multi-user.target
EOF
# Start service
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
Add to Prometheus configuration:
scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']
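After editing the scrape configuration, it is usually worth validating the file before Prometheus loads it. The sketch below is one possible approach, not part of the example repository: `check_and_reload` is a hypothetical helper name, the config path in the usage line is an assumed location, and the reload call only works if Prometheus was started with `--web.enable-lifecycle`.

```shell
# Validate a Prometheus config file, then ask a running server to reload it.
# Assumptions: promtool is on PATH, and Prometheus was started with
# --web.enable-lifecycle (the reload endpoint is disabled otherwise).
# check_and_reload is a hypothetical helper name.
check_and_reload() {
  config_file="$1"                         # path to prometheus.yml
  prom_url="${2:-http://localhost:9090}"   # base URL from the quick start

  # promtool exits non-zero on invalid configs, so the reload is skipped.
  promtool check config "$config_file" || return 1

  # -f makes curl return an error when the server answers with an HTTP error.
  curl -fsS -X POST "$prom_url/-/reload"
}
```

Usage, assuming the config lives at /etc/prometheus/prometheus.yml: `check_and_reload /etc/prometheus/prometheus.yml`. If the lifecycle endpoint is not enabled, sending SIGHUP to the Prometheus process also triggers a reload.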
Recommended Grafana Dashboards
Popular dashboards for import:
- Node Exporter Full (ID: 1860) - System metrics
- Docker Container & Host Metrics (ID: 10619) - Container monitoring
- UniFi Poller (ID: 11315) - UniFi network devices
- Blackbox Exporter (ID: 13659) - Network probing
Quick Troubleshooting
Check Prometheus targets:
curl http://localhost:9090/api/v1/targets
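The targets endpoint returns a large JSON document, so a quick count per health state can be easier to read. The following is a rough sketch that uses only grep, cut, sort, uniq, and awk: `target_health_summary` is a hypothetical helper, and it assumes the API emits compact JSON (`"health":"up"` with no spaces). A target label that is itself named `health` would also be counted, so a JSON-aware tool such as jq is the more robust choice when available.

```shell
# Count scrape targets by health state from /api/v1/targets JSON on stdin.
# target_health_summary is a hypothetical helper, not a Prometheus tool.
# Assumption: the API emits compact JSON ("health":"up" with no spaces).
target_health_summary() {
  grep -o '"health":"[a-z]*"' |   # one match per active target
    cut -d '"' -f 4 |             # keep only the state: up, down, or unknown
    sort | uniq -c |
    awk '{ print $2 "=" $1 }'     # print as state=count
}
```

Usage: `curl -s http://localhost:9090/api/v1/targets | target_health_summary`. Any non-zero `down` count points at a target whose `lastError` field in the full response is worth reading.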
Verify metrics collection:
curl http://localhost:9100/metrics
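The metrics endpoint prints every metric family the exporter knows about, which can run to thousands of lines. A small filter, sketched below, narrows the output to one metric by exact name. `metric_samples` is a hypothetical helper, and the sketch assumes the standard Prometheus text exposition format (comment lines starting with `#`, one sample per line).

```shell
# Print the samples of one metric from Prometheus text-format input on stdin.
# metric_samples is a hypothetical helper, not part of Node Exporter.
metric_samples() {
  # $1: exact metric name, e.g. node_load1
  awk -v name="$1" '
    /^#/ { next }                  # skip HELP/TYPE comment lines
    {
      metric = $1
      sub(/[{].*/, "", metric)     # strip the label set, if any
      if (metric == name) print    # exact match, so node_load1 excludes node_load15
    }
  '
}
```

Usage: `curl -s http://localhost:9100/metrics | metric_samples node_load1`. Empty output for a metric that should exist usually means the corresponding collector is disabled or the name is misspelled.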
Test Grafana connection (the health endpoint does not require credentials):
curl http://localhost:3000/api/health
Best Practices Summary
- Security: Enable TLS, use strong passwords, implement authentication
- Performance: Use recording rules, set appropriate retention, optimize queries
- Reliability: Implement HA, set up backups, monitor the monitoring stack
- Alerting: Create meaningful alerts, avoid alert fatigue, document runbooks
- Maintenance: Keep components updated, review configurations regularly
For detailed information on each topic, refer to the specific guides listed above.
Next Steps
- Install: Follow the Installation Guide to set up the stack
- Configure: Review Configuration Guide for advanced setups
- Secure: Implement recommendations from Security Configuration
- Monitor: Add exporters using Exporters Guide
- Alert: Set up alerting with Alerting Guide
- Scale: Implement High Availability for production
- Protect: Configure Backup and Recovery procedures