Infrastructure Monitoring
Infrastructure monitoring is essential for maintaining system reliability, performance, and availability. This guide covers monitoring solutions from basic system metrics to advanced observability platforms.
Overview
Effective infrastructure monitoring provides:
- Real-time visibility into system health and performance
- Proactive alerting for potential issues before they become problems
- Historical data for capacity planning and trend analysis
- Root cause analysis capabilities for faster incident resolution
- Compliance reporting for regulatory requirements
Monitoring Stack Components
Metrics Collection
- System Metrics: CPU, memory, disk, network utilization
- Application Metrics: Response times, error rates, throughput
- Custom Metrics: Business-specific measurements
- Infrastructure Metrics: Database, web server, load balancer performance
Time Series Databases
- Prometheus - Open-source monitoring and alerting toolkit
- InfluxDB - Purpose-built time series database
- Grafana Cloud - Managed observability platform
- Azure Monitor - Cloud-native monitoring solution
Visualization and Dashboards
- Grafana - Feature-rich visualization and analytics platform
- Kibana - Data visualization for Elasticsearch
- Azure Monitor Workbooks - Interactive reports and dashboards
- Custom Dashboards - Purpose-built monitoring interfaces
Prometheus Monitoring
Prometheus Setup
Prometheus is the de facto standard for metrics collection in modern infrastructure:
# prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
rule_files:
- "alert_rules.yml"
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
- job_name: 'node-exporter'
static_configs:
- targets: ['localhost:9100']
- job_name: 'docker'
static_configs:
- targets: ['localhost:9323']
alerting:
alertmanagers:
- static_configs:
- targets:
- localhost:9093
Node Exporter for System Metrics
# Install and run node_exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.6.1.linux-amd64.tar.gz
cd node_exporter-1.6.1.linux-amd64
./node_exporter
Docker Container Monitoring
# docker-compose.yml for monitoring stack
version: '3.8'
services:
prometheus:
image: prom/prometheus:latest
container_name: prometheus
ports:
- "9090:9090"
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- prometheus_data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--web.console.libraries=/etc/prometheus/console_libraries'
- '--web.console.templates=/etc/prometheus/consoles'
grafana:
image: grafana/grafana:latest
container_name: grafana
ports:
- "3000:3000"
volumes:
- grafana_data:/var/lib/grafana
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin
node-exporter:
image: prom/node-exporter:latest
container_name: node-exporter
ports:
- "9100:9100"
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
command:
- '--path.procfs=/host/proc'
- '--path.rootfs=/rootfs'
- '--path.sysfs=/host/sys'
volumes:
prometheus_data:
grafana_data:
Grafana Dashboards
System Overview Dashboard
Create comprehensive system dashboards for infrastructure monitoring:
{
"dashboard": {
"id": null,
"title": "Infrastructure Overview",
"tags": ["infrastructure", "monitoring"],
"panels": [
{
"title": "CPU Usage",
"type": "stat",
"targets": [
{
"expr": "100 - (avg(rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"refId": "A"
}
]
},
{
"title": "Memory Usage",
"type": "stat",
"targets": [
{
"expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100",
"refId": "A"
}
]
}
]
}
}
Container Monitoring Dashboard
Monitor Docker containers and Kubernetes pods:
# Container metrics configuration
- name: container_cpu_usage
query: rate(container_cpu_usage_seconds_total[5m])
- name: container_memory_usage
query: container_memory_working_set_bytes / container_spec_memory_limit_bytes
- name: container_network_io
query: rate(container_network_receive_bytes_total[5m])
Alerting Configuration
Prometheus Alerting Rules
# alert_rules.yml
groups:
- name: infrastructure_alerts
rules:
- alert: HighCPUUsage
expr: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 5m
labels:
severity: warning
annotations:
summary: "High CPU usage detected"
description: "CPU usage is above 80% for more than 5 minutes"
- alert: HighMemoryUsage
expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
for: 5m
labels:
severity: critical
annotations:
summary: "High memory usage detected"
description: "Memory usage is above 85% for more than 5 minutes"
- alert: DiskSpaceLow
expr: (1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100 > 90
for: 2m
labels:
severity: critical
annotations:
summary: "Disk space is running low"
description: "Disk usage is above 90% on {{ $labels.device }}"
- alert: ServiceDown
expr: up == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Service is down"
description: "{{ $labels.job }} service is down"
Alertmanager Configuration
# alertmanager.yml
global:
smtp_smarthost: 'localhost:587'
smtp_from: 'alerts@example.com'
route:
group_by: ['alertname']
group_wait: 10s
group_interval: 10s
repeat_interval: 1h
receiver: 'web.hook'
receivers:
- name: 'web.hook'
email_configs:
- to: 'admin@example.com'
subject: 'Alert: {{ .GroupLabels.alertname }}'
body: |
{{ range .Alerts }}
Alert: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
{{ end }}
slack_configs:
- api_url: 'YOUR_SLACK_WEBHOOK_URL'
channel: '#alerts'
title: 'Infrastructure Alert'
text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
Cloud Monitoring Solutions
Azure Monitor
# Install Azure Monitor agent
wget https://aka.ms/azcmagent
sudo chmod +x azcmagent
sudo ./azcmagent connect --resource-group "rg-monitoring" --tenant-id "your-tenant-id"
AWS CloudWatch
# CloudWatch configuration
Resources:
LogGroup:
Type: AWS::Logs::LogGroup
Properties:
LogGroupName: /aws/infrastructure/monitoring
RetentionInDays: 30
MetricFilter:
Type: AWS::Logs::MetricFilter
Properties:
LogGroupName: !Ref LogGroup
FilterPattern: "[timestamp, request_id, level=ERROR]"
MetricTransformations:
- MetricNamespace: "Custom/Application"
MetricName: "ErrorCount"
MetricValue: "1"
Log Management
Centralized Logging with ELK Stack
# docker-compose-elk.yml
version: '3.8'
services:
elasticsearch:
image: docker.elastic.co/elasticsearch/elasticsearch:8.10.2
environment:
- discovery.type=single-node
- xpack.security.enabled=false
ports:
- "9200:9200"
logstash:
image: docker.elastic.co/logstash/logstash:8.10.2
volumes:
- ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf
ports:
- "5044:5044"
kibana:
image: docker.elastic.co/kibana/kibana:8.10.2
ports:
- "5601:5601"
environment:
- ELASTICSEARCH_HOSTS=http://elasticsearch:9200
Fluentd for Log Collection
# fluentd.conf
<source>
@type tail
path /var/log/containers/*.log
pos_file /var/log/fluentd-containers.log.pos
tag kubernetes.*
format json
</source>
<match kubernetes.**>
@type elasticsearch
host elasticsearch
port 9200
index_name kubernetes
type_name _doc
</match>
Performance Monitoring
Application Performance Monitoring (APM)
Monitor application performance and user experience:
# APM configuration
apm:
enabled: true
service_name: "my-application"
environment: "production"
instrumentation:
- http_requests
- database_queries
- cache_operations
- external_services
sampling:
rate: 0.1 # 10% sampling rate
alerts:
response_time_threshold: 500ms
error_rate_threshold: 5%
Database Monitoring
-- Database performance queries
SELECT
query,
mean_time,
calls,
total_time
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;
Container and Kubernetes Monitoring
Kubernetes Monitoring Stack
# monitoring-namespace.yml
apiVersion: v1
kind: Namespace
metadata:
name: monitoring
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: prometheus
namespace: monitoring
spec:
replicas: 1
selector:
matchLabels:
app: prometheus
template:
metadata:
labels:
app: prometheus
spec:
containers:
- name: prometheus
image: prom/prometheus:latest
ports:
- containerPort: 9090
volumeMounts:
- name: config
mountPath: /etc/prometheus
volumes:
- name: config
configMap:
name: prometheus-config
Best Practices
Monitoring Strategy
- Start with the basics: CPU, memory, disk, network
- Monitor what matters: Focus on business-critical metrics
- Set meaningful alerts: Avoid alert fatigue with proper thresholds
- Document your monitoring: Maintain runbooks for common alerts
- Regular review: Continuously refine and improve monitoring
Alert Design
- Clear alert names: Use descriptive alert names and summaries
- Actionable alerts: Every alert should have a clear resolution path
- Severity levels: Use appropriate severity levels (info, warning, critical)
- Alert grouping: Group related alerts to reduce noise
- Escalation policies: Define clear escalation procedures
Dashboard Design
- User-focused: Design dashboards for specific audiences
- Key metrics first: Most important metrics should be prominently displayed
- Consistent layout: Use consistent colors, fonts, and layouts
- Drill-down capability: Enable users to explore details
- Regular updates: Keep dashboards current and relevant
Troubleshooting
Common Monitoring Issues
High cardinality metrics
# Check metric cardinality curl http://localhost:9090/api/v1/label/__name__/values | jq '.data | length'
Missing metrics
# Verify scrape targets curl http://localhost:9090/api/v1/targets
Alert fatigue
# Review alert frequency - alert: HighAlertFrequency expr: increase(prometheus_notifications_total[1h]) > 10
Performance Optimization
# Prometheus optimization
global:
scrape_interval: 30s # Increase interval for less critical metrics
evaluation_interval: 30s # Match scrape interval
storage:
tsdb:
retention.time: 30d # Adjust retention based on needs
retention.size: 10GB # Set size limits
Related Documentation
- Grafana Configuration - Dashboard and visualization setup
- Container Monitoring - Docker-specific monitoring
- Kubernetes Monitoring - K8s cluster monitoring
- Infrastructure Security - Securing monitoring infrastructure
This guide provides comprehensive coverage of infrastructure monitoring from basic system metrics to enterprise-scale observability solutions. Choose the tools and approaches that best fit your infrastructure requirements and operational complexity.