This guide covers creating, managing, and optimizing Grafana dashboards for effective monitoring and visualization of metrics from Prometheus and other data sources.
Overview
Grafana dashboards provide a visual interface for monitoring metrics, logs, and traces. Effective dashboards enable quick identification of issues and understanding of system behavior.
Key Concepts:
- Panels: Individual visualizations (graphs, tables, gauges)
- Rows: Horizontal containers for organizing panels
- Variables: Dynamic values for filtering and templating
- Time Range: Control the data window being displayed
- Annotations: Mark events on time series graphs
- Links: Navigate between dashboards
Dashboard Design Principles
The Four Golden Signals
Monitor these key signals for any system (a query sketch for each follows the list):
- Latency: Time to service requests
- Traffic: Demand on your system
- Errors: Rate of failed requests
- Saturation: Resource utilization
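Each signal maps to one or two PromQL expressions. The sketch below spot-checks all four from the command line; the http_requests_total and http_request_duration_seconds metric names, the node_exporter CPU metric used as a saturation proxy, and the prometheus:9090 address are assumptions to swap for your own.
# Spot-check the four golden signals against Prometheus (metric names are assumptions)
PROM="http://prometheus:9090/api/v1/query"
q() { curl -sG "$PROM" --data-urlencode "query=$1" | jq '.data.result'; }
# Latency: 95th percentile request duration
q 'histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))'
# Traffic: requests per second
q 'sum(rate(http_requests_total[5m]))'
# Errors: ratio of 5xx responses
q 'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'
# Saturation: CPU utilization as a proxy
q '100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'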
Dashboard Best Practices
Organization:
- One dashboard per service or component
- Group related metrics together
- Use consistent naming conventions
- Arrange panels logically (top-to-bottom, left-to-right)
Visualization:
- Choose appropriate visualization types
- Use consistent time ranges
- Set meaningful Y-axis ranges
- Add units to metrics
- Use color coding consistently
Performance:
- Limit panels to 15-20 per dashboard
- Use recording rules for expensive queries
- Set appropriate refresh intervals
- Use template variables to reduce query count
Creating Dashboards
Manual Creation
- Create New Dashboard
1. Click "+" icon → Dashboard
2. Click "Add new panel"
3. Select data source (Prometheus)
4. Write query
5. Choose visualization
6. Configure panel options
7. Click "Apply"
- Dashboard Settings
{
"title": "Service Overview",
"tags": ["production", "service"],
"timezone": "browser",
"refresh": "30s",
"time": {
"from": "now-6h",
"to": "now"
}
}
Dashboard as Code
Create dashboard JSON for version control:
{
"dashboard": {
"title": "Node Exporter System Metrics",
"uid": "node-exporter-system",
"tags": ["infrastructure", "linux"],
"timezone": "browser",
"schemaVersion": 38,
"refresh": "30s",
"time": {
"from": "now-6h",
"to": "now"
},
"panels": [
{
"id": 1,
"title": "CPU Usage",
"type": "graph",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
"datasource": {
"type": "prometheus",
"uid": "prometheus-uid"
},
"targets": [
{
"expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"legendFormat": "{{instance}}",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"color": {"mode": "palette-classic"}
}
}
}
]
}
}
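A dashboard kept as JSON like the one above can be pushed straight to the Grafana HTTP API. A minimal sketch, assuming an API token in API_KEY, Grafana at grafana:3000, and the JSON saved as node-exporter-system.json (the filename is an assumption):
# Wrap the dashboard object in the payload the API expects and post it
jq -n --slurpfile d node-exporter-system.json \
  '{dashboard: $d[0].dashboard, overwrite: true}' \
  | curl -X POST \
    -H "Authorization: Bearer ${API_KEY}" \
    -H "Content-Type: application/json" \
    -d @- \
    "http://grafana:3000/api/dashboards/db"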
Provisioning Dashboards
Create provisioning configuration:
# /etc/grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
  - name: 'Infrastructure'
    orgId: 1
    folder: 'Infrastructure'
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    allowUiUpdates: true
    options:
      path: /var/lib/grafana/dashboards/infrastructure
Place dashboard JSON files in the specified path:
# Store dashboards in version control
mkdir -p /var/lib/grafana/dashboards/infrastructure
cp node-exporter.json /var/lib/grafana/dashboards/infrastructure/
cp postgres-metrics.json /var/lib/grafana/dashboards/infrastructure/
chown -R grafana:grafana /var/lib/grafana/dashboards
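Once Grafana picks up the files (within updateIntervalSeconds), the dashboards should appear in the search API. A quick check, assuming an API token in API_KEY and the default package-install log path:
# List dashboards tagged "infrastructure" to confirm provisioning worked
curl -s -H "Authorization: Bearer ${API_KEY}" \
  "http://grafana:3000/api/search?tag=infrastructure" | jq '.[].title'
# Provisioning errors, if any, show up in the Grafana log
grep -i 'provisioning' /var/log/grafana/grafana.log | tail -n 20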
Popular Pre-Built Dashboards
Node Exporter Full (ID: 1860)
Comprehensive Linux system metrics:
Import from grafana.com/dashboards/1860
Metrics:
- CPU utilization per core
- Memory usage and swap
- Disk I/O and utilization
- Network traffic
- System load
- Filesystem usage
Customization Example:
# Scope the CPU query to the selected instance(s) (busy = 100 - idle)
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle",instance=~"$instance"}[5m])) * 100)
# Add temperature monitoring
node_hwmon_temp_celsius{instance=~"$instance"}
# Add custom filesystem filtering
node_filesystem_avail_bytes{instance=~"$instance",fstype!~"tmpfs|fuse.lxcfs"}
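Instead of importing by hand, community dashboards can be fetched from the grafana.com API and dropped into the provisioning path. A sketch for dashboard 1860; the download endpoint and file paths are the commonly used ones, so verify them against your setup:
# Download the latest revision of dashboard 1860 and let provisioning load it
curl -s "https://grafana.com/api/dashboards/1860/revisions/latest/download" \
  -o /var/lib/grafana/dashboards/infrastructure/node-exporter-full.json
chown grafana:grafana /var/lib/grafana/dashboards/infrastructure/node-exporter-full.json
# Note: downloaded JSON may contain a ${DS_PROMETHEUS}-style datasource input
# that needs replacing with your actual datasource UID before provisioning.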
Docker Container & Host Metrics (ID: 10619)
Container resource monitoring:
Import from grafana.com/dashboards/10619
Metrics:
- Container CPU usage
- Container memory usage
- Container network I/O
- Container filesystem I/O
- Host metrics
Kubernetes Cluster Monitoring (ID: 7249)
Kubernetes cluster overview:
Import from grafana.com/dashboards/7249
Metrics:
- Cluster CPU/Memory usage
- Pod status and restarts
- Node status
- Namespace resource usage
- Persistent volume usage
PostgreSQL Database (ID: 9628)
Database performance metrics:
Import from grafana.com/dashboards/9628
Metrics:
- Connections and sessions
- Transaction rates
- Query performance
- Cache hit ratio
- Disk I/O
PromQL Query Examples
System Metrics
CPU Usage by Core:
# Per-core CPU usage
100 - (avg by(instance, cpu) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Average CPU usage
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# CPU usage percentage by mode (averaged across cores)
avg by(mode) (irate(node_cpu_seconds_total[5m])) * 100
Memory Usage:
# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Memory used in GB
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / 1024 / 1024 / 1024
# Swap usage
(node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / node_memory_SwapTotal_bytes * 100
Disk Usage:
# Disk usage percentage
100 - ((node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs"} * 100) / node_filesystem_size_bytes{fstype!~"tmpfs|fuse.lxcfs"})
# Disk utilization (fraction of each second spent doing I/O)
rate(node_disk_io_time_seconds_total[5m])
# Disk I/O operations per second
rate(node_disk_reads_completed_total[5m]) + rate(node_disk_writes_completed_total[5m])
# Disk read/write bytes per second
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])
Network Traffic:
# Network receive rate (MB/s)
rate(node_network_receive_bytes_total[5m]) / 1024 / 1024
# Network transmit rate (MB/s)
rate(node_network_transmit_bytes_total[5m]) / 1024 / 1024
# Network errors
rate(node_network_receive_errs_total[5m])
rate(node_network_transmit_errs_total[5m])
Container Metrics
Container CPU:
# Container CPU usage percentage
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (container, pod) * 100
# Container CPU throttling
rate(container_cpu_cfs_throttled_seconds_total[5m])
Container Memory:
# Container memory usage
container_memory_usage_bytes{container!=""}
# Container memory percentage
(container_memory_usage_bytes{container!=""} / container_spec_memory_limit_bytes{container!=""}) * 100
# Container memory cache
container_memory_cache{container!=""}
Container Network:
# Container network receive
rate(container_network_receive_bytes_total[5m])
# Container network transmit
rate(container_network_transmit_bytes_total[5m])
Application Metrics
HTTP Request Rate:
# Requests per second
sum(rate(http_requests_total[5m])) by (method, status)
# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# Success rate percentage
(sum(rate(http_requests_total{status=~"2.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100
Request Latency:
# 95th percentile latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# 99th percentile latency
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Average latency
sum(rate(http_request_duration_seconds_sum[5m])) / sum(rate(http_request_duration_seconds_count[5m]))
Database Query Performance:
# Query rate
rate(pg_stat_database_xact_commit[5m])
# Active connections
pg_stat_database_numbackends
# Cache hit ratio
sum(rate(pg_stat_database_blks_hit[5m])) / (sum(rate(pg_stat_database_blks_hit[5m])) + sum(rate(pg_stat_database_blks_read[5m]))) * 100
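Before wiring any of these expressions into panels, it helps to run them against the Prometheus HTTP API; --data-urlencode handles the special characters PromQL is full of. A sketch assuming Prometheus at prometheus:9090:
# Instant evaluation of a candidate panel query
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))' \
  | jq '.data.result'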
Template Variables
Variable Types
Query Variable:
{
"name": "instance",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(up, instance)",
"refresh": 1,
"multi": true,
"includeAll": true
}
Custom Variable:
{
"name": "environment",
"type": "custom",
"options": [
{"text": "Production", "value": "prod"},
{"text": "Staging", "value": "staging"},
{"text": "Development", "value": "dev"}
]
}
Interval Variable:
{
  "name": "interval",
  "type": "interval",
  "query": "1m,5m,15m,30m,1h",
  "auto": true,
  "auto_count": 30,
  "auto_min": "10s"
}
Using Variables in Queries
# Filter by instance variable
node_cpu_seconds_total{instance=~"$instance"}
# Filter by multiple variables
http_requests_total{instance=~"$instance",environment="$environment"}
# Use interval variable for rate
rate(http_requests_total[$interval])
# Use regex for filtering
node_filesystem_avail_bytes{mountpoint=~"$mountpoint",fstype!~"tmpfs|fuse.*"}
Chained Variables
[
{
"name": "datacenter",
"query": "label_values(datacenter)"
},
{
"name": "cluster",
"query": "label_values(up{datacenter=\"$datacenter\"}, cluster)"
},
{
"name": "instance",
"query": "label_values(up{datacenter=\"$datacenter\",cluster=\"$cluster\"}, instance)"
}
]
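Grafana resolves label_values() through the Prometheus series and label APIs, so a chained variable can be debugged by issuing the equivalent request directly. A sketch for the cluster variable above, assuming a datacenter label value of dc1 exists in your data:
# Values the "cluster" variable would offer once datacenter=dc1 is selected
curl -sG "http://prometheus:9090/api/v1/series" \
  --data-urlencode 'match[]=up{datacenter="dc1"}' \
  | jq -r '.data[].cluster' | sort -u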
Panel Configuration
Time Series Graph
Configuration:
{
"type": "timeseries",
"title": "CPU Usage",
"targets": [
{
"expr": "100 - (avg(irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"legendFormat": "CPU Usage"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"color": {
"mode": "thresholds"
},
"thresholds": {
"steps": [
{"value": 0, "color": "green"},
{"value": 70, "color": "yellow"},
{"value": 90, "color": "red"}
]
}
}
},
"options": {
"legend": {"displayMode": "list", "placement": "bottom"},
"tooltip": {"mode": "multi"}
}
}
Stat Panel
Single Value Display:
{
"type": "stat",
"title": "Total Requests",
"targets": [
{
"expr": "sum(http_requests_total)"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"color": {"mode": "thresholds"},
"mappings": [],
"thresholds": {
"steps": [
{"value": 0, "color": "blue"}
]
}
}
},
"options": {
"graphMode": "area",
"colorMode": "value",
"textMode": "auto"
}
}
Gauge Panel
Progress Indicator:
{
"type": "gauge",
"title": "Disk Usage",
"targets": [
{
"expr": "100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"thresholds": {
"steps": [
{"value": 0, "color": "green"},
{"value": 70, "color": "yellow"},
{"value": 85, "color": "red"}
]
}
}
},
"options": {
"showThresholdLabels": true,
"showThresholdMarkers": true
}
}
Table Panel
Tabular Data:
{
"type": "table",
"title": "Instance Status",
"targets": [
{
"expr": "up",
"format": "table",
"instant": true
}
],
"fieldConfig": {
"overrides": [
{
"matcher": {"id": "byName", "options": "Value"},
"properties": [
{
"id": "custom.displayMode",
"value": "color-background"
},
{
"id": "mappings",
"value": [
{"type": "value", "value": "0", "text": "Down", "color": "red"},
{"type": "value", "value": "1", "text": "Up", "color": "green"}
]
}
]
}
]
},
"transformations": [
{
"id": "organize",
"options": {
"excludeByName": {"Time": true},
"renameByName": {"instance": "Instance", "Value": "Status"}
}
}
]
}
Heatmap Panel
Distribution Visualization:
{
"type": "heatmap",
"title": "Request Latency Distribution",
"targets": [
{
"expr": "sum(rate(http_request_duration_seconds_bucket[5m])) by (le)",
"format": "heatmap",
"legendFormat": "{{le}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "s"
}
},
"options": {
"calculate": true,
"cellGap": 2,
"color": {
"mode": "scheme",
"scheme": "Spectral"
}
}
}
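Because every panel is plain JSON, panel configuration can be audited or bulk-edited with jq instead of clicking through the UI. A small sketch that lists each top-level panel's title, type, and unit from an exported dashboard.json (field names follow the examples above):
# Summarize the panels in an exported dashboard
jq -r '.panels[] | [.title, .type, (.fieldConfig.defaults.unit // "-")] | @tsv' dashboard.json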
Dashboard Organization
Folder Structure
Infrastructure/
├── System Metrics/
│ ├── Node Exporter Overview
│ ├── CPU Analysis
│ ├── Memory Analysis
│ └── Disk Performance
├── Container Metrics/
│ ├── Docker Overview
│ ├── Kubernetes Cluster
│ └── Pod Performance
└── Network/
├── Network Traffic
├── UniFi Devices
└── Blackbox Probes
Applications/
├── Backend Services/
│ ├── API Gateway
│ ├── Authentication Service
│ └── Database Connections
├── Frontend/
│ ├── Web Application
│ └── Mobile API
└── Batch Jobs/
└── Job Monitoring
Business Metrics/
├── User Analytics
├── Revenue Metrics
└── SLA Compliance
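The folder structure itself can be created through the Grafana API so it stays reproducible alongside the dashboard JSON. A minimal sketch, assuming an API token in API_KEY; note that nested folders require Grafana 10 or later, so on older versions the second level is usually expressed with tags or dashboard naming:
# Create the top-level folders used above
for folder in Infrastructure Applications "Business Metrics"; do
  curl -s -X POST \
    -H "Authorization: Bearer ${API_KEY}" \
    -H "Content-Type: application/json" \
    -d "{\"title\": \"${folder}\"}" \
    "http://grafana:3000/api/folders"
done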
Row Organization
{
"panels": [
{
"type": "row",
"title": "System Overview",
"collapsed": false,
"panels": [
/* CPU, Memory, Disk panels */
]
},
{
"type": "row",
"title": "Network Performance",
"collapsed": true,
"panels": [
/* Network panels */
]
},
{
"type": "row",
"title": "Application Metrics",
"collapsed": true,
"panels": [
/* Application panels */
]
}
]
}
Advanced Features
Annotations
Query-Based Annotations:
{
"annotations": {
"list": [
{
"datasource": "Prometheus",
"name": "Deployments",
"expr": "changes(process_start_time_seconds[1m]) > 0",
"step": "60s",
"tagKeys": "instance,version",
"titleFormat": "Deployment",
"textFormat": "{{instance}}"
},
{
"datasource": "Prometheus",
"name": "Alerts",
"expr": "ALERTS{alertstate=\"firing\"}",
"tagKeys": "alertname,severity",
"titleFormat": "{{alertname}}",
"textFormat": "{{severity}}: {{alertname}}"
}
]
}
}
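Besides query-based annotations, one-off events such as deployments can be written through the annotations API and will appear on any dashboard whose annotation settings match the tags. A sketch, assuming an API token in API_KEY; the service name and version in the text are placeholders:
# Record a deployment event as a global annotation tagged "deployment"
curl -s -X POST \
  -H "Authorization: Bearer ${API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"tags": ["deployment", "api-gateway"], "text": "Deployed api-gateway v1.4.2"}' \
  "http://grafana:3000/api/annotations"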
Dashboard Links
Navigation Links:
{
"links": [
{
"title": "System Dashboards",
"type": "dashboards",
"tags": ["system"],
"icon": "external link"
},
{
"title": "Related Dashboard",
"type": "link",
"url": "/d/xyz/other-dashboard",
"targetBlank": false
},
{
"title": "Prometheus",
"type": "link",
"url": "http://prometheus:9090",
"targetBlank": true
}
]
}
Transformations
Data Transformations:
{
"transformations": [
{
"id": "merge",
"options": {}
},
{
"id": "organize",
"options": {
"excludeByName": {"Time": true},
"indexByName": {"instance": 0, "Value": 1},
"renameByName": {"Value": "CPU Usage"}
}
},
{
"id": "calculateField",
"options": {
"alias": "Usage %",
"binary": {
"left": "Value",
"operator": "*",
"right": "100"
},
"mode": "binary"
}
}
]
}
Performance Optimization
Query Optimization
Use Recording Rules:
# Prometheus recording rules
groups:
- name: dashboard_rules
interval: 30s
rules:
- record: instance:node_cpu_utilization:rate5m
expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Use in dashboard:
# Instead of complex query
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Use recording rule
instance:node_cpu_utilization:rate5m
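Recording rules only help if they are syntactically valid and actually loaded, so check both before pointing dashboards at them. A sketch using promtool and the Prometheus rules API; the rule file path is an assumption:
# Validate the rule file before reloading Prometheus
promtool check rules /etc/prometheus/rules/dashboard_rules.yml
# Confirm the rule group was loaded and is being evaluated
curl -s "http://prometheus:9090/api/v1/rules" \
  | jq '.data.groups[] | select(.name == "dashboard_rules") | .rules[].name'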
Limit Time Range:
{
"targets": [
{
"expr": "rate(http_requests_total[5m])",
"maxDataPoints": 1000
}
]
}
Use Appropriate Step:
# Let Grafana size the range window from the dashboard resolution and scrape interval
rate(metric[$__rate_interval])
# Or fix the range explicitly when a specific resolution is required
rate(metric[5m])
Dashboard Settings
Optimize Refresh Rate (prefer "30s" or slower over "5s" for production dashboards):
{
"refresh": "30s",
"time": {
"from": "now-6h",
"to": "now"
}
}
Limit Panel Count:
- Maximum 15-20 panels per dashboard
- Use rows to collapse less important metrics
- Create separate dashboards for detailed views
Query Interval Settings:
{
"targets": [
{
"expr": "up",
"interval": "30s",
"intervalFactor": 2
}
]
}
Dashboard Export and Import
Export Dashboard
# Export via API (clear the numeric id so the JSON imports cleanly elsewhere)
curl -s -H "Authorization: Bearer ${API_KEY}" \
  "http://grafana:3000/api/dashboards/uid/${DASHBOARD_UID}" \
  | jq '.dashboard | .id = null' > dashboard.json
# Export via UI
# Dashboard → Settings → JSON Model → Copy to clipboard
Import Dashboard
# Import via API (the payload must wrap the dashboard under a "dashboard" key)
jq '{dashboard: ., overwrite: true}' dashboard.json \
  | curl -X POST \
    -H "Authorization: Bearer ${API_KEY}" \
    -H "Content-Type: application/json" \
    -d @- \
    "http://grafana:3000/api/dashboards/db"
# Import via UI
# + → Import → Upload JSON file
Version Control
# Store dashboards in Git
mkdir -p dashboards/{infrastructure,applications,business}
# Export all dashboards
./scripts/export-dashboards.sh
# Commit to version control
git add dashboards/
git commit -m "Update dashboards"
git push
# Restore from version control
git pull
./scripts/import-dashboards.sh
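The export-dashboards.sh helper referenced above is not shown here; a minimal sketch of what such a script might look like, using the search and dashboards APIs and assuming API_KEY and GRAFANA_URL are set:
#!/usr/bin/env bash
# Hypothetical export script: dump every dashboard to dashboards/<uid>.json
set -euo pipefail
for uid in $(curl -s -H "Authorization: Bearer ${API_KEY}" \
    "${GRAFANA_URL}/api/search?type=dash-db" | jq -r '.[].uid'); do
  curl -s -H "Authorization: Bearer ${API_KEY}" \
    "${GRAFANA_URL}/api/dashboards/uid/${uid}" \
    | jq '.dashboard | .id = null' > "dashboards/${uid}.json"
done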
Troubleshooting
Common Issues
No Data in Panels:
# Check data source connectivity
curl "http://grafana:3000/api/datasources/proxy/1/api/v1/query?query=up"
# Verify Prometheus has data
curl "http://prometheus:9090/api/v1/query?query=up"
# Check query syntax
# Use Prometheus UI to test queries first
Slow Dashboard Loading:
# Check query performance with the panel's query inspector
# (panel menu → Inspect → Query), or enable Prometheus query logging (query_log_file)
# Reduce time range
# Use smaller intervals
# Optimize queries with recording rules
Variables Not Loading:
# Check variable query syntax
label_values(metric_name, label_name)
# Verify data source is selected
# Check variable refresh settings
Best Practices Checklist
- ✅ Use meaningful dashboard titles and descriptions
- ✅ Add tags for organization
- ✅ Use template variables for flexibility
- ✅ Set appropriate time ranges and refresh intervals
- ✅ Add units to all metrics
- ✅ Use thresholds and color coding
- ✅ Group related panels in rows
- ✅ Add annotations for deployments and incidents
- ✅ Use recording rules for expensive queries
- ✅ Version control dashboard JSON
- ✅ Document custom queries and transformations
- ✅ Test dashboards before deploying
- ✅ Set up alerts for critical metrics
- ✅ Regularly review and optimize