Kubernetes Monitoring and Logging
Effective monitoring and logging are essential for maintaining healthy Kubernetes clusters and applications. This guide covers comprehensive observability strategies, tools, and best practices for Kubernetes environments.
Why Monitor Kubernetes?
Kubernetes environments present unique monitoring challenges:
- Ephemeral nature - Pods can be created and destroyed frequently
- Dynamic scaling - Resources scale up and down automatically
- Distributed architecture - Applications span multiple nodes and namespaces
- Complex networking - Service mesh and network policies add complexity
- Resource optimization - Need to track resource usage and costs
Observability Pillars
1. Metrics
Quantitative measurements of system behavior:
- Infrastructure metrics - CPU, memory, disk, network
- Application metrics - Request rates, latency, error rates
- Business metrics - User registrations, transactions, revenue
2. Logs
Detailed records of events and operations:
- Application logs - Business logic events and errors
- System logs - Kubernetes events and component logs
- Access logs - HTTP requests and responses
3. Traces
Request flow through distributed systems:
- Distributed tracing - Track requests across microservices
- Performance analysis - Identify bottlenecks and latency issues
- Dependency mapping - Understand service relationships
Core Monitoring Stack
Prometheus + Grafana + AlertManager
The de facto standard for Kubernetes monitoring:
```yaml
# prometheus-namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
# prometheus-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      containers:
        - name: prometheus
          image: prom/prometheus:v2.40.0
          ports:
            - containerPort: 9090
          volumeMounts:
            - name: config-volume
              mountPath: /etc/prometheus
            - name: storage-volume
              mountPath: /prometheus
          args:
            - '--config.file=/etc/prometheus/prometheus.yml'
            - '--storage.tsdb.path=/prometheus'
            - '--web.console.libraries=/etc/prometheus/console_libraries'
            - '--web.console.templates=/etc/prometheus/consoles'
            - '--storage.tsdb.retention.time=30d'
            - '--web.enable-lifecycle'
          resources:
            requests:
              memory: "1Gi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "1000m"
      volumes:
        - name: config-volume
          configMap:
            name: prometheus-config
        - name: storage-volume
          persistentVolumeClaim:
            claimName: prometheus-pvc
---
# prometheus-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: prometheus-service
  namespace: monitoring
spec:
  selector:
    app: prometheus
  ports:
    - port: 9090
      targetPort: 9090
  type: ClusterIP
---
# prometheus-serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring
---
# prometheus-clusterrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - nodes/proxy
      - services
      - endpoints
      - pods
    verbs: ["get", "list", "watch"]
  - apiGroups:
      - networking.k8s.io  # "extensions" is removed in modern clusters
    resources:
      - ingresses
    verbs: ["get", "list", "watch"]
---
# prometheus-clusterrolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: monitoring
```
Prometheus Configuration
```yaml
# prometheus-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    rule_files:
      - "kubernetes.rules"
    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - alertmanager:9093
    scrape_configs:
      - job_name: 'kubernetes-apiservers'
        kubernetes_sd_configs:
          - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
            action: keep
            regex: default;kubernetes;https
      - job_name: 'kubernetes-nodes'
        kubernetes_sd_configs:
          - role: node
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
      - job_name: 'kubernetes-cadvisor'
        kubernetes_sd_configs:
          - role: node
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
          - target_label: __address__
            replacement: kubernetes.default.svc:443
          - source_labels: [__meta_kubernetes_node_name]
            regex: (.+)
            target_label: __metrics_path__
            replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
      - job_name: 'kubernetes-service-endpoints'
        kubernetes_sd_configs:
          - role: endpoints
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
          - action: labelmap
            regex: __meta_kubernetes_pod_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: kubernetes_pod_name
  kubernetes.rules: |
    groups:
      - name: kubernetes.rules
        rules:
          - alert: KubernetesNodeNotReady
            expr: kube_node_status_condition{condition="Ready",status="true"} == 0
            for: 10m
            labels:
              severity: critical
            annotations:
              summary: Kubernetes node not ready (instance {{ $labels.instance }})
              description: "Node {{ $labels.node }} has been unready for more than 10 minutes"
          - alert: KubernetesPodCrashLooping
            expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
            for: 2m
            labels:
              severity: warning
            annotations:
              summary: Kubernetes pod crash looping (instance {{ $labels.instance }})
              description: "Pod {{ $labels.pod }} is crash looping"
```
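With this configuration in place, the labels added by the relabel rules (`kubernetes_namespace`, `kubernetes_pod_name`) can be used directly in queries. Two illustrative PromQL queries — the metric names assume cAdvisor and the annotation-based application scraping shown above:

```promql
# CPU usage per namespace, from cAdvisor container metrics
sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)

# 95th-percentile request latency for scraped application pods
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket{kubernetes_namespace="default"}[5m])) by (le))
```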
Grafana Deployment
```yaml
# grafana-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
        - name: grafana
          image: grafana/grafana:9.3.0
          ports:
            - containerPort: 3000
          env:
            - name: GF_SECURITY_ADMIN_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: grafana-secret
                  key: admin-password
          volumeMounts:
            - name: grafana-storage
              mountPath: /var/lib/grafana
            - name: grafana-config
              mountPath: /etc/grafana/provisioning
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "1Gi"
              cpu: "500m"
      volumes:
        - name: grafana-storage
          persistentVolumeClaim:
            claimName: grafana-pvc
        - name: grafana-config
          configMap:
            name: grafana-config
---
# grafana-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: grafana-service
  namespace: monitoring
spec:
  selector:
    app: grafana
  ports:
    - port: 3000
      targetPort: 3000
  type: LoadBalancer
---
# grafana-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: grafana-secret
  namespace: monitoring
type: Opaque
data:
  admin-password: YWRtaW4xMjM=  # "admin123" base64-encoded; demo value only
```
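The `data` field of a Secret holds base64-encoded values, which can be generated or verified on the command line. For real deployments, prefer `kubectl create secret generic grafana-secret --from-literal=admin-password=...` over committing encoded passwords to version control. A quick sketch, using the demo value from the manifest above:

```shell
# Encode a Secret value; printf avoids a trailing newline, which would
# otherwise end up inside the encoded password
encoded=$(printf '%s' 'admin123' | base64)
echo "$encoded"

# Decode it back to confirm the round trip
printf '%s' "$encoded" | base64 -d
```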
Logging Architecture
Centralized Logging with ELK Stack
Elasticsearch Deployment
```yaml
# elasticsearch-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: elasticsearch
  namespace: logging
spec:
  replicas: 1
  selector:
    matchLabels:
      app: elasticsearch
  template:
    metadata:
      labels:
        app: elasticsearch
    spec:
      containers:
        - name: elasticsearch
          image: docker.elastic.co/elasticsearch/elasticsearch:8.5.0
          ports:
            - containerPort: 9200
          env:
            - name: discovery.type
              value: single-node
            - name: ES_JAVA_OPTS
              value: "-Xms1g -Xmx1g"
            - name: xpack.security.enabled
              value: "false"
          volumeMounts:
            - name: elasticsearch-storage
              mountPath: /usr/share/elasticsearch/data
          resources:
            requests:
              memory: "2Gi"
              cpu: "1000m"
            limits:
              memory: "4Gi"
              cpu: "2000m"
      volumes:
        - name: elasticsearch-storage
          persistentVolumeClaim:
            claimName: elasticsearch-pvc
---
# elasticsearch-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: elasticsearch-service
  namespace: logging
spec:
  selector:
    app: elasticsearch
  ports:
    - port: 9200
      targetPort: 9200
```
Fluentd DaemonSet
```yaml
# fluentd-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      serviceAccountName: fluentd
      tolerations:
        # Cover both the current "control-plane" and the pre-1.24 "master" taints
        - key: node-role.kubernetes.io/control-plane
          effect: NoSchedule
        - key: node-role.kubernetes.io/master
          effect: NoSchedule
      containers:
        - name: fluentd
          image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch
          env:
            - name: FLUENT_ELASTICSEARCH_HOST
              value: "elasticsearch-service"
            - name: FLUENT_ELASTICSEARCH_PORT
              value: "9200"
            - name: FLUENT_ELASTICSEARCH_SCHEME
              value: "http"
            - name: FLUENTD_SYSTEMD_CONF
              value: disable
          resources:
            requests:
              memory: "200Mi"
              cpu: "100m"
            limits:
              memory: "400Mi"
              cpu: "200m"
          volumeMounts:
            - name: varlog
              mountPath: /var/log
            # Only relevant on nodes using the Docker runtime; containerd
            # exposes container logs under /var/log/containers and /var/log/pods
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
            - name: fluentd-config
              mountPath: /fluentd/etc/conf.d
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
        - name: fluentd-config
          configMap:
            name: fluentd-config
---
# fluentd-serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluentd
  namespace: logging
---
# fluentd-clusterrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluentd
rules:
  - apiGroups:
      - ""
    resources:
      - pods
      - namespaces
    verbs:
      - get
      - list
      - watch
---
# fluentd-clusterrolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fluentd
roleRef:
  kind: ClusterRole
  name: fluentd
  apiGroup: rbac.authorization.k8s.io
subjects:
  - kind: ServiceAccount
    name: fluentd
    namespace: logging
```
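The `fluentd-config` ConfigMap mounted at `/fluentd/etc/conf.d` above is not shown in full. As a hedged sketch, an override routing container logs to Elasticsearch might look like the following — the tag pattern and index prefix are illustrative, and the image's built-in configuration already covers the common case:

```
<match kubernetes.**>
  @type elasticsearch
  host elasticsearch-service
  port 9200
  logstash_format true
  logstash_prefix k8s-logs
  <buffer>
    @type file
    path /var/log/fluentd-buffers/kubernetes.buffer
    flush_interval 5s
  </buffer>
</match>
```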
Kibana Deployment
```yaml
# kibana-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kibana
  namespace: logging
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kibana
  template:
    metadata:
      labels:
        app: kibana
    spec:
      containers:
        - name: kibana
          image: docker.elastic.co/kibana/kibana:8.5.0
          ports:
            - containerPort: 5601
          env:
            # Security is disabled on the Elasticsearch side above; Kibana 8
            # no longer accepts an xpack.security.enabled setting of its own
            - name: ELASTICSEARCH_HOSTS
              value: "http://elasticsearch-service:9200"
          resources:
            requests:
              memory: "1Gi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "1000m"
---
# kibana-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: kibana-service
  namespace: logging
spec:
  selector:
    app: kibana
  ports:
    - port: 5601
      targetPort: 5601
  type: LoadBalancer
```
Application Metrics
Instrumenting Applications
Go Application Example
```go
// main.go
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	httpRequestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests",
		},
		[]string{"method", "endpoint", "status_code"},
	)
	httpRequestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name: "http_request_duration_seconds",
			Help: "Duration of HTTP requests",
		},
		[]string{"method", "endpoint"},
	)
)

func init() {
	prometheus.MustRegister(httpRequestsTotal)
	prometheus.MustRegister(httpRequestDuration)
}

// instrumentedHandler wraps a handler with duration and request counting.
func instrumentedHandler(next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next.ServeHTTP(w, r)
		duration := time.Since(start)
		httpRequestDuration.WithLabelValues(r.Method, r.URL.Path).Observe(duration.Seconds())
		// The status code is hardcoded for brevity; in production, wrap the
		// ResponseWriter to capture the real status.
		httpRequestsTotal.WithLabelValues(r.Method, r.URL.Path, "200").Inc()
	}
}

func healthHandler(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
	w.Write([]byte("OK"))
}

func main() {
	http.Handle("/metrics", promhttp.Handler())
	http.HandleFunc("/health", instrumentedHandler(healthHandler))
	log.Println("Server starting on :8080")
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```
Deployment with Metrics Annotations
```yaml
# app-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: sample-app
  template:
    metadata:
      labels:
        app: sample-app
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: sample-app
          image: sample-app:latest
          ports:
            - containerPort: 8080
          resources:
            requests:
              memory: "128Mi"
              cpu: "100m"
            limits:
              memory: "256Mi"
              cpu: "200m"
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
```
Kubernetes Event Monitoring
Event Exporter
```yaml
# event-exporter-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: event-exporter
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: event-exporter
  template:
    metadata:
      labels:
        app: event-exporter
    spec:
      serviceAccountName: event-exporter
      containers:
        - name: event-exporter
          image: opsgenie/kubernetes-event-exporter:0.11
          args:
            - -conf=/data/config.yaml
          volumeMounts:
            - name: config-volume
              mountPath: /data
          resources:
            requests:
              memory: "128Mi"
              cpu: "100m"
            limits:
              memory: "256Mi"
              cpu: "200m"
      volumes:
        - name: config-volume
          configMap:
            name: event-exporter-config
---
# event-exporter-serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: event-exporter
  namespace: monitoring
---
# event-exporter-clusterrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: event-exporter
rules:
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["get", "watch", "list"]
---
# event-exporter-clusterrolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: event-exporter
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: event-exporter
subjects:
  - kind: ServiceAccount
    name: event-exporter
    namespace: monitoring
```
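The `event-exporter-config` ConfigMap referenced by `-conf=/data/config.yaml` is not shown above. A minimal sketch of its `config.yaml` key, routing all events to stdout (sink names and routing are illustrative; the exporter also supports Elasticsearch, Slack, webhook, and other receivers):

```yaml
config.yaml: |
  logLevel: error
  route:
    routes:
      - match:
          - receiver: dump
  receivers:
    - name: dump
      stdout: {}
```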
Alerting Rules
Comprehensive Alert Rules
```yaml
# alert-rules-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-alert-rules
  namespace: monitoring
data:
  alert.rules: |
    groups:
      - name: kubernetes-apps
        rules:
          - alert: KubernetesDeploymentReplicasMismatch
            expr: kube_deployment_spec_replicas != kube_deployment_status_replicas_available
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: Kubernetes deployment replicas mismatch (instance {{ $labels.instance }})
              description: "Deployment replicas mismatch\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
          - alert: KubernetesStatefulsetReplicasMismatch
            expr: kube_statefulset_status_replicas_ready != kube_statefulset_status_replicas
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: Kubernetes StatefulSet replicas mismatch (instance {{ $labels.instance }})
              description: "A StatefulSet does not match the expected number of replicas."
          - alert: KubernetesPodCrashLooping
            expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
            for: 2m
            labels:
              severity: warning
            annotations:
              summary: Kubernetes pod crash looping (instance {{ $labels.instance }})
              description: "Pod {{ $labels.pod }} is crash looping"
          - alert: KubernetesPodNotReady
            expr: kube_pod_status_phase{phase=~"Pending|Unknown"} > 0
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: Kubernetes pod not ready (instance {{ $labels.instance }})
              description: "Pod {{ $labels.pod }} has been in a non-ready state for longer than 15 minutes."
      - name: kubernetes-resources
        rules:
          - alert: KubernetesNodeMemoryPressure
            expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: Kubernetes node memory pressure (instance {{ $labels.instance }})
              description: "Node {{ $labels.node }} has MemoryPressure condition"
          - alert: KubernetesNodeDiskPressure
            expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: Kubernetes node disk pressure (instance {{ $labels.instance }})
              description: "Node {{ $labels.node }} has DiskPressure condition"
          - alert: KubernetesNodeNetworkUnavailable
            expr: kube_node_status_condition{condition="NetworkUnavailable",status="true"} == 1
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: Kubernetes node network unavailable (instance {{ $labels.instance }})
              description: "Node {{ $labels.node }} has NetworkUnavailable condition"
          - alert: KubernetesContainerOomKilled
            expr: (kube_pod_container_status_restarts_total - kube_pod_container_status_restarts_total offset 10m >= 1) and ignoring (reason) min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[10m]) == 1
            for: 0m
            labels:
              severity: warning
            annotations:
              summary: Kubernetes container OOM killed (instance {{ $labels.instance }})
              description: "Container {{ $labels.container }} in pod {{ $labels.pod }} has been OOMKilled"
      - name: kubernetes-system
        rules:
          - alert: KubernetesApiServerDown
            expr: absent(up{job=~"kubernetes-apiservers"})
            for: 0m
            labels:
              severity: critical
            annotations:
              summary: Kubernetes API server down (instance {{ $labels.instance }})
              description: "No metrics from the kubernetes-apiservers job; the API server may be down"
          - alert: KubernetesEtcdDown
            expr: absent(up{job=~"kubernetes-etcd"})
            for: 0m
            labels:
              severity: critical
            annotations:
              summary: Kubernetes etcd down (instance {{ $labels.instance }})
              description: "No metrics from the kubernetes-etcd job; etcd may be down"
          - alert: KubernetesNodeNotReady
            expr: kube_node_status_condition{condition="Ready",status="true"} == 0
            for: 10m
            labels:
              severity: critical
            annotations:
              summary: Kubernetes node not ready (instance {{ $labels.instance }})
              description: "Node {{ $labels.node }} has been unready for more than 10 minutes"
```
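Alert expressions are easy to get subtly wrong, so it is worth unit-testing them with `promtool test rules` before shipping. A sketch of a test file for the node-readiness alert above — the filename, node name, and series values are illustrative:

```yaml
# alert-rules-test.yaml — run with: promtool test rules alert-rules-test.yaml
rule_files:
  - alert.rules
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # The Ready=true condition series reads 0 for 20 minutes (node unready)
      - series: 'kube_node_status_condition{condition="Ready",status="true",node="node-1",instance="node-1"}'
        values: '0x20'
    alert_rule_test:
      - eval_time: 15m
        alertname: KubernetesNodeNotReady
        exp_alerts:
          - exp_labels:
              severity: critical
              condition: Ready
              status: "true"
              node: node-1
              instance: node-1
```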
Grafana Dashboards
Kubernetes Cluster Dashboard
```json
{
  "dashboard": {
    "id": null,
    "title": "Kubernetes Cluster Monitoring",
    "tags": ["kubernetes"],
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "Cluster CPU Usage",
        "type": "stat",
        "targets": [
          {
            "expr": "100 - (avg(irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "CPU Usage %"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent"
          }
        }
      },
      {
        "id": 2,
        "title": "Cluster Memory Usage",
        "type": "stat",
        "targets": [
          {
            "expr": "100 * (1 - ((avg_over_time(node_memory_MemFree_bytes[10m]) + avg_over_time(node_memory_Cached_bytes[10m]) + avg_over_time(node_memory_Buffers_bytes[10m])) / avg_over_time(node_memory_MemTotal_bytes[10m])))",
            "legendFormat": "Memory Usage %"
          }
        ]
      },
      {
        "id": 3,
        "title": "Pod Status",
        "type": "piechart",
        "targets": [
          {
            "expr": "sum(kube_pod_status_phase{phase=\"Running\"}) by (phase)",
            "legendFormat": "{{phase}}"
          },
          {
            "expr": "sum(kube_pod_status_phase{phase=\"Pending\"}) by (phase)",
            "legendFormat": "{{phase}}"
          },
          {
            "expr": "sum(kube_pod_status_phase{phase=\"Failed\"}) by (phase)",
            "legendFormat": "{{phase}}"
          }
        ]
      }
    ],
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "refresh": "30s"
  }
}
```
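Dashboards are only useful once Grafana knows where Prometheus lives. Since the Grafana deployment earlier mounts a ConfigMap at `/etc/grafana/provisioning`, the datasource can be provisioned declaratively rather than clicked together in the UI. A sketch — the file path inside the ConfigMap is an assumption, and the URL uses the in-cluster `prometheus-service` defined above:

```yaml
# datasources/prometheus.yaml (a key inside the grafana-config ConfigMap)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus-service.monitoring.svc:9090
    isDefault: true
```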
Best Practices
Security Considerations
```yaml
# security-context.yaml
apiVersion: v1
kind: Pod
metadata:
  name: secure-monitoring-pod
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000
  containers:
    - name: prometheus
      image: prom/prometheus:v2.40.0  # pin a version rather than :latest
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop:
            - ALL
      volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: prometheus-data
          mountPath: /prometheus
  volumes:
    - name: tmp
      emptyDir: {}
    - name: prometheus-data
      persistentVolumeClaim:
        claimName: prometheus-pvc
```
Resource Management
```yaml
# resource-limits.yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: monitoring-limits
  namespace: monitoring
spec:
  limits:
    - default:
        cpu: "500m"
        memory: "512Mi"
      defaultRequest:
        cpu: "100m"
        memory: "128Mi"
      type: Container
```
Data Retention and Storage
```yaml
# storage-class.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
  replication-type: none
allowVolumeExpansion: true
---
# prometheus-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-pvc
  namespace: monitoring
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 100Gi
```
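The 100Gi request above should be sized against the actual ingest rate: needed disk is roughly retention time × samples ingested per second × bytes per sample (typically 1–2 bytes after compression). A back-of-the-envelope sketch — the sample rate here is an assumed figure; measure yours with `rate(prometheus_tsdb_head_samples_appended_total[5m])`:

```shell
retention_days=30
samples_per_second=10000  # assumed ingest rate; measure on your cluster
bytes_per_sample=2        # conservative end of the typical 1-2 byte range

# retention seconds * samples/sec * bytes/sample
needed_bytes=$((retention_days * 86400 * samples_per_second * bytes_per_sample))
echo "~$((needed_bytes / 1024 / 1024 / 1024)) GiB needed (plus headroom)"
```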
Troubleshooting
Common Issues
High Cardinality Metrics
```shell
# Count distinct metric names
curl -s http://prometheus:9090/api/v1/label/__name__/values | jq '.data[]' | wc -l

# Find the ten highest-cardinality metrics (-G with --data-urlencode keeps
# the PromQL expression properly escaped in the query string)
curl -sG http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=topk(10, count by (__name__)({__name__=~".+"}))' | jq '.data.result'
```
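Once an offending metric is identified, it can be dropped at scrape time with `metric_relabel_configs`, before it ever reaches the TSDB. A sketch against the pod scrape job defined earlier — the metric name here is a placeholder:

```yaml
scrape_configs:
  - job_name: 'kubernetes-pods'
    # ...existing kubernetes_sd_configs and relabel_configs...
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'debug_handler_duration_seconds.*'  # placeholder high-cardinality metric
        action: drop
```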
Memory Issues
```shell
# Check Prometheus memory usage
kubectl top pod prometheus-xxx -n monitoring

# Prometheus flags that reduce memory and disk pressure:
# shorten the retention window...
--storage.tsdb.retention.time=15d
# ...or cap the total size of the TSDB on disk
--storage.tsdb.retention.size=50GB
```
Monitoring Commands
```shell
# Check the monitoring stack
kubectl get all -n monitoring

# View Prometheus targets
kubectl port-forward svc/prometheus-service 9090:9090 -n monitoring
# Open http://localhost:9090/targets

# Check Grafana dashboards
kubectl port-forward svc/grafana-service 3000:3000 -n monitoring
# Open http://localhost:3000

# View logs
kubectl logs -f deployment/prometheus -n monitoring
kubectl logs -f deployment/grafana -n monitoring

# Check events
kubectl get events -n monitoring --sort-by='.lastTimestamp'
```
Advanced Topics
Service Mesh Monitoring
- Istio metrics - Service mesh observability
- Envoy proxy metrics - Sidecar proxy monitoring
- Distributed tracing - Request flow tracking
Cost Monitoring
- Resource utilization - Cost per pod/namespace
- Waste detection - Unused resources
- Optimization recommendations - Right-sizing suggestions
AI/ML Monitoring
- Anomaly detection - Machine learning for alerting
- Predictive scaling - Forecast-based scaling
- Performance optimization - AI-driven recommendations
This guide has covered the core building blocks of Kubernetes observability — metrics with Prometheus and Grafana, centralized logging with the ELK stack, event monitoring, and alerting — along with the security, storage, and troubleshooting practices needed to run them reliably.