Infrastructure Monitoring

Infrastructure monitoring is essential for maintaining system reliability, performance, and availability. This guide covers monitoring solutions from basic system metrics to advanced observability platforms.

Overview

Effective infrastructure monitoring provides:

Real-time visibility into system health and performance
Proactive alerting for potential issues before they become problems
Historical data for capacity planning and trend analysis
Root cause analysis capabilities for faster incident resolution
Compliance reporting for regulatory requirements

Monitoring Stack Components

Metrics Collection

System Metrics: CPU, memory, disk, network utilization
Application Metrics: Response times, error rates, throughput
Custom Metrics: Business-specific measurements
Infrastructure Metrics: Database, web server, load balancer performance
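
As a rough illustration of what a custom metric looks like on the wire, the sketch below renders one sample in the Prometheus text exposition format using only the standard library. The function name `render_metric` and the metric `orders_processed_total` are hypothetical; in practice an official client library would normally do this for you.

```python
def render_metric(name, value, labels=None, help_text="", metric_type="gauge"):
    """Render a single sample in the Prometheus text exposition format."""

    def escape(label_value):
        # Label values are double-quoted; backslash, quote and newline are escaped.
        return (
            str(label_value)
            .replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
        )

    lines = []
    if help_text:
        lines.append(f"# HELP {name} {help_text}")
    lines.append(f"# TYPE {name} {metric_type}")
    if labels:
        pairs = ",".join(f'{key}="{escape(val)}"' for key, val in sorted(labels.items()))
        lines.append(f"{name}{{{pairs}}} {value}")
    else:
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical business metric: orders processed in one region.
sample = render_metric(
    "orders_processed_total", 42, {"region": "eu"}, "Orders processed", "counter"
)
```

Serving text like this from an HTTP endpoint is what lets a scraper such as Prometheus collect business-specific measurements alongside system metrics.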

Time Series Databases

Prometheus - Open-source monitoring and alerting toolkit
InfluxDB - Purpose-built time series database
Grafana Cloud - Managed observability platform
Azure Monitor - Cloud-native monitoring solution

Visualization and Dashboards

Grafana - Feature-rich visualization and analytics platform
Kibana - Data visualization for Elasticsearch
Azure Monitor Workbooks - Interactive reports and dashboards
Custom Dashboards - Purpose-built monitoring interfaces

Prometheus Monitoring

Prometheus Setup

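Prometheus is a widely used open-source system for metrics collection. The configuration below scrapes Prometheus itself, node_exporter, and the Docker daemon, loads alerting rules, and forwards alerts to Alertmanager: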

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']

  # Requires the Docker daemon to expose metrics (metrics-addr in daemon.json)
  - job_name: 'docker'
    static_configs:
      - targets: ['localhost:9323']

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - localhost:9093

Node Exporter for System Metrics

# Install and run node_exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.6.1.linux-amd64.tar.gz
cd node_exporter-1.6.1.linux-amd64
./node_exporter

Docker Container Monitoring

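# docker-compose.yml for monitoring stack
# Inside this Compose network, Prometheus reaches the exporter at node-exporter:9100;
# localhost:9100 in prometheus.yml would point at the Prometheus container itself.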
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
    environment:
      # Default password for local testing only; change it before exposing Grafana
      - GF_SECURITY_ADMIN_PASSWORD=admin

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'

volumes:
  prometheus_data:
  grafana_data:

Grafana Dashboards

System Overview Dashboard

Create comprehensive system dashboards for infrastructure monitoring:

{
  "dashboard": {
    "id": null,
    "title": "Infrastructure Overview",
    "tags": ["infrastructure", "monitoring"],
    "panels": [
      {
        "title": "CPU Usage",
        "type": "stat",
        "targets": [
          {
            "expr": "100 - (avg(rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "refId": "A"
          }
        ]
      },
      {
        "title": "Memory Usage",
        "type": "stat",
        "targets": [
          {
            "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100",
            "refId": "A"
          }
        ]
      }
    ]
  }
}
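
To make the CPU panel's expression easier to reason about, here is a simplified sketch of the arithmetic behind `100 - avg(rate(idle)) * 100`. It assumes each core's idle counter is given as a list of `(timestamp_seconds, value)` samples. The helper names are hypothetical, and unlike Prometheus's real `rate()` this version does no extrapolation to the window boundaries.

```python
def simple_rate(samples):
    """Per-second increase of a counter across (timestamp, value) samples.

    A drop in value is treated as a counter reset, so the new value is
    counted as the increase since the reset.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


def cpu_usage_percent(idle_samples_per_core):
    """Average the idle rate over all cores and convert it to a busy percentage."""
    rates = [simple_rate(samples) for samples in idle_samples_per_core]
    rates = [rate for rate in rates if rate is not None]
    if not rates:
        return None
    return 100 - (sum(rates) / len(rates)) * 100


# Two cores over a 5-minute window: idle for 240 s and 180 s of 300 s.
usage = cpu_usage_percent([
    [(0, 1000.0), (300, 1240.0)],
    [(0, 500.0), (300, 680.0)],
])
```

Because the idle counter grows by at most one second per second per core, the idle rate is a fraction between 0 and 1, which is why multiplying by 100 yields a percentage.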

Container Monitoring Dashboard

Monitor Docker containers and Kubernetes pods:

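# Container metrics configuration (PromQL over cAdvisor metrics;
# the memory ratio is only meaningful for containers with a memory limit set)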
- name: container_cpu_usage
  query: rate(container_cpu_usage_seconds_total[5m])
  
- name: container_memory_usage
  query: container_memory_working_set_bytes / container_spec_memory_limit_bytes

- name: container_network_io
  query: rate(container_network_receive_bytes_total[5m])

Alerting Configuration

Prometheus Alerting Rules

# alert_rules.yml
groups:
- name: infrastructure_alerts
  rules:
  - alert: HighCPUUsage
    expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage detected"
      description: "CPU usage on {{ $labels.instance }} is above 80% for more than 5 minutes"

  - alert: HighMemoryUsage
    expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High memory usage detected"
      description: "Memory usage is above 85% for more than 5 minutes"

  - alert: DiskSpaceLow
    expr: (1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100 > 90
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Disk space is running low"
      description: "Disk usage is above 90% on {{ $labels.device }}"

  - alert: ServiceDown
    expr: up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Service is down"
      description: "{{ $labels.job }} service is down"
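
The `for:` clause in the rules above delays firing until the expression has stayed true across consecutive evaluations. The sketch below models that behaviour under simplifying assumptions: evaluations arrive as `(timestamp_seconds, condition)` pairs, and any false evaluation resets the alert. The function name `alert_states` is hypothetical.

```python
def alert_states(evaluations, for_seconds):
    """Return the alert state after each rule evaluation.

    The alert is "pending" from the first true evaluation, becomes "firing"
    once the condition has held for at least for_seconds, and returns to
    "inactive" as soon as one evaluation is false.
    """
    states = []
    active_since = None
    for timestamp, condition in evaluations:
        if not condition:
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held_for = timestamp - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


# Evaluations every 60 s with a short dip at t=120; the rule uses for: 2m.
history = alert_states(
    [(0, True), (60, True), (120, False), (180, True),
     (240, True), (300, True), (360, True)],
    for_seconds=120,
)
```

The dip at 120 seconds restarts the timer, so the alert only fires at 300 seconds. This is how a `for:` duration filters out brief spikes.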

Alertmanager Configuration

# alertmanager.yml
global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alerts@example.com'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'

receivers:
- name: 'web.hook'
  email_configs:
  - to: 'admin@example.com'
    headers:
      Subject: 'Alert: {{ .GroupLabels.alertname }}'
    # Plain-text body; email_configs has no subject or body field
    text: |
      {{ range .Alerts }}
      Alert: {{ .Annotations.summary }}
      Description: {{ .Annotations.description }}
      {{ end }}

  slack_configs:
  - api_url: 'YOUR_SLACK_WEBHOOK_URL'
    channel: '#alerts'
    title: 'Infrastructure Alert'
    text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
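
The `group_by: ['alertname']` setting in the route means alerts sharing the same `alertname` are batched into one notification. A minimal sketch of that grouping idea follows, assuming each alert is a dict with a `labels` mapping. The function name and sample alerts are hypothetical, and real Alertmanager additionally applies the `group_wait`, `group_interval`, and `repeat_interval` timers.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the labels listed in group_by."""
    groups = defaultdict(list)
    for alert in alerts:
        # A missing label is treated as an empty value.
        key = tuple((label, alert["labels"].get(label, "")) for label in group_by)
        groups[key].append(alert)
    return dict(groups)


alerts = [
    {"labels": {"alertname": "HighCPUUsage", "instance": "web-1"}},
    {"labels": {"alertname": "HighCPUUsage", "instance": "web-2"}},
    {"labels": {"alertname": "DiskSpaceLow", "instance": "db-1"}},
]
grouped = group_alerts(alerts, ["alertname"])
```

Adding `instance` to `group_by` would split the two CPU alerts into separate notifications, trading less batching for more targeted messages.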

Cloud Monitoring Solutions

Azure Monitor

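# Install the Azure Connected Machine (Azure Arc) agent, which non-Azure hosts need before Azure Monitor can collect from them
wget https://aka.ms/azcmagent -O ~/install_linux_azcmagent.sh
bash ~/install_linux_azcmagent.sh
# Connect the machine; all quoted values are placeholders
sudo azcmagent connect --resource-group "rg-monitoring" --tenant-id "your-tenant-id" --location "your-region" --subscription-id "your-subscription-id"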

AWS CloudWatch

# CloudWatch configuration
Resources:
  LogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: /aws/infrastructure/monitoring
      RetentionInDays: 30

  MetricFilter:
    Type: AWS::Logs::MetricFilter
    Properties:
      LogGroupName: !Ref LogGroup
      FilterPattern: "[timestamp, request_id, level=ERROR]"
      MetricTransformations:
        - MetricNamespace: "Custom/Application"
          MetricName: "ErrorCount"
          MetricValue: "1"
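
The filter pattern `[timestamp, request_id, level=ERROR]` treats each log event as space-delimited fields and matches when the third field equals `ERROR`. The sketch below is a loose local approximation for reasoning about which lines would be counted. It assumes exactly three whitespace-separated fields and ignores the quoting and bracket handling that CloudWatch itself performs. The function names and log lines are hypothetical.

```python
def matches_error_pattern(line):
    """Approximate the pattern [timestamp, request_id, level=ERROR]."""
    fields = line.split()
    return len(fields) == 3 and fields[2] == "ERROR"


def error_count(lines):
    """Each matching line contributes 1, mirroring MetricValue: "1"."""
    return sum(1 for line in lines if matches_error_pattern(line))


log_lines = [
    "2024-01-01T00:00:00Z req-1 INFO",
    "2024-01-01T00:00:01Z req-2 ERROR",
    "2024-01-01T00:00:02Z req-3 ERROR",
    "malformed line",
]
count = error_count(log_lines)
```

Running sample log lines through a check like this before deploying the template can catch a pattern that would silently never match.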

Log Management

Centralized Logging with ELK Stack

# docker-compose-elk.yml
version: '3.8'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.10.2
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
    ports:
      - "9200:9200"

  logstash:
    image: docker.elastic.co/logstash/logstash:8.10.2
    volumes:
      - ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf
    ports:
      - "5044:5044"

  kibana:
    image: docker.elastic.co/kibana/kibana:8.10.2
    ports:
      - "5601:5601"
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200

Fluentd for Log Collection

# fluentd.conf
<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  format json
</source>

<match kubernetes.**>
  @type elasticsearch
  host elasticsearch
  port 9200
  index_name kubernetes
  type_name _doc
</match>

Performance Monitoring

Application Performance Monitoring (APM)

Monitor application performance and user experience:

# APM configuration
apm:
  enabled: true
  service_name: "my-application"
  environment: "production"
  
  instrumentation:
    - http_requests
    - database_queries
    - cache_operations
    - external_services

  sampling:
    rate: 0.1  # 10% sampling rate
    
  alerts:
    response_time_threshold: 500ms
    error_rate_threshold: 5%
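
A sampling rate of 0.1 means roughly one request in ten is traced. One common way to get this, sketched below, is to hash the trace ID so every service makes the same keep-or-drop decision for a given trace. This is an illustrative technique, not the behaviour of any particular APM product, and the function name `should_sample` is hypothetical.

```python
import hashlib


def should_sample(trace_id, rate):
    """Deterministically keep about `rate` of all traces, based on the trace ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < rate * 2**64


# Count how many of 10,000 hypothetical trace IDs are kept at a 10% rate.
kept = sum(should_sample(f"trace-{i}", 0.1) for i in range(10000))
```

Hash-based sampling keeps whole traces intact across services, whereas an independent random choice per service would leave many traces with missing spans.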

Database Monitoring

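-- Database performance queries (requires the pg_stat_statements extension)
-- Column names are for PostgreSQL 13+; versions 12 and earlier use mean_time and total_time
SELECT
    query,
    mean_exec_time,
    calls,
    total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;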

Container and Kubernetes Monitoring

Kubernetes Monitoring Stack

# monitoring-namespace.yml
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
      - name: prometheus
        image: prom/prometheus:latest
        ports:
        - containerPort: 9090
        volumeMounts:
        - name: config
          mountPath: /etc/prometheus
      volumes:
      - name: config
        configMap:
          name: prometheus-config

Best Practices

Monitoring Strategy

Start with the basics: CPU, memory, disk, network
Monitor what matters: Focus on business-critical metrics
Set meaningful alerts: Avoid alert fatigue with proper thresholds
Document your monitoring: Maintain runbooks for common alerts
Regular review: Continuously refine and improve monitoring

Alert Design

Clear alert names: Use descriptive alert names and summaries
Actionable alerts: Every alert should have a clear resolution path
Severity levels: Use appropriate severity levels (info, warning, critical)
Alert grouping: Group related alerts to reduce noise
Escalation policies: Define clear escalation procedures

Dashboard Design

User-focused: Design dashboards for specific audiences
Key metrics first: Most important metrics should be prominently displayed
Consistent layout: Use consistent colors, fonts, and layouts
Drill-down capability: Enable users to explore details
Regular updates: Keep dashboards current and relevant

Troubleshooting

Common Monitoring Issues

High cardinality metrics

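# Count distinct metric names (a rough first check)
curl -s http://localhost:9090/api/v1/label/__name__/values | jq '.data | length'
# List the metrics with the most series, which is where cardinality problems show up
curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName'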

Missing metrics

# Verify scrape targets
curl http://localhost:9090/api/v1/targets
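
The targets endpoint returns JSON that is tedious to read by eye. The sketch below filters an already-parsed response down to the targets that are not healthy. It assumes the usual response shape with `data.activeTargets` entries carrying `health`, `scrapeUrl`, `lastError`, and `labels`. The function name and sample payload are hypothetical.

```python
def unhealthy_targets(response):
    """Return (job, scrape_url, last_error) for every target whose health is not "up"."""
    targets = response.get("data", {}).get("activeTargets", [])
    return [
        (
            target.get("labels", {}).get("job", ""),
            target.get("scrapeUrl", ""),
            target.get("lastError", ""),
        )
        for target in targets
        if target.get("health") != "up"
    ]


# Hypothetical payload shaped like the /api/v1/targets response.
sample_response = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "prometheus"}, "health": "up",
             "scrapeUrl": "http://localhost:9090/metrics", "lastError": ""},
            {"labels": {"job": "node-exporter"}, "health": "down",
             "scrapeUrl": "http://localhost:9100/metrics",
             "lastError": "connection refused"},
        ]
    },
}
problems = unhealthy_targets(sample_response)
```

To use it against a live server, the response could be loaded with `json.load(urllib.request.urlopen("http://localhost:9090/api/v1/targets"))` and passed in. The `lastError` text usually points at the cause of a missing metric.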

Alert fatigue

# Review alert frequency
- alert: HighAlertFrequency
  expr: increase(prometheus_notifications_sent_total[1h]) > 10

Performance Optimization

# Prometheus optimization (prometheus.yml)
global:
  scrape_interval: 30s       # Increase interval for less critical metrics
  evaluation_interval: 30s   # Match scrape interval

# Retention is set with command-line flags, not with keys in prometheus.yml:
#   --storage.tsdb.retention.time=30d    # Adjust retention based on needs
#   --storage.tsdb.retention.size=10GB   # Set size limits

Related Topics

Grafana Configuration - Dashboard and visualization setup
Container Monitoring - Docker-specific monitoring
Kubernetes Monitoring - K8s cluster monitoring
Infrastructure Security - Securing monitoring infrastructure

This guide covers infrastructure monitoring from basic system metrics to enterprise-scale observability. Choose the tools and approaches that fit your infrastructure and operational complexity.
