DevOps represents a set of practices that combines software development (Dev) and IT operations (Ops) to shorten the systems development life cycle and provide continuous delivery with high software quality. This guide covers essential DevOps practices and implementation strategies.

DevOps Fundamentals

Core Principles

Culture and Collaboration

  • Shared responsibility for the entire application lifecycle
  • Cross-functional teams working together
  • Breaking down silos between development and operations
  • Continuous learning and improvement mindset

Automation

  • Infrastructure as Code (IaC)
  • Automated testing at all levels
  • Continuous integration and deployment
  • Configuration management

Measurement and Monitoring

  • Application performance monitoring (APM)
  • Infrastructure monitoring
  • Business metrics and KPIs
  • Feedback loops for continuous improvement

┌───────────────────────────────────────────────────────────────────┐
│                         DevOps Lifecycle                          │
├───────────────────────────────────────────────────────────────────┤
│ Plan → Code → Build → Test → Release → Deploy → Operate → Monitor │
│   ↑                                                           ↓   │
│   └──────────────────────── Feedback ─────────────────────────┘   │
└───────────────────────────────────────────────────────────────────┘

Continuous Integration/Continuous Deployment (CI/CD)

CI/CD Pipeline Design

Basic Pipeline Structure

# .github/workflows/ci-cd.yml (GitHub Actions)
name: CI/CD Pipeline

on:
  push:
    branches: [ main, develop ]
  pull_request:
    branches: [ main ]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        node-version: [16.x, 18.x, 20.x]
    
    steps:
    - uses: actions/checkout@v3
    
    - name: Setup Node.js
      uses: actions/setup-node@v3
      with:
        node-version: ${{ matrix.node-version }}
        cache: 'npm'
    
    - name: Install dependencies
      run: npm ci
    
    - name: Run linting
      run: npm run lint
    
    - name: Run unit tests
      run: npm test -- --coverage
    
    - name: Run integration tests
      run: npm run test:integration
    
    - name: Upload coverage reports
      uses: codecov/codecov-action@v3

  security:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    
    - name: Run security audit
      run: npm audit --audit-level=high
    
    - name: SAST with Semgrep
      uses: returntocorp/semgrep-action@v1
      with:
        config: auto

  build:
    needs: [test, security]
    runs-on: ubuntu-latest
    outputs:
      image: ${{ steps.image.outputs.image }}
      digest: ${{ steps.build.outputs.digest }}
    
    steps:
    - uses: actions/checkout@v3
    
    - name: Setup Docker Buildx
      uses: docker/setup-buildx-action@v2
    
    - name: Login to Container Registry
      uses: docker/login-action@v2
      with:
        registry: ${{ env.REGISTRY }}
        username: ${{ github.actor }}
        password: ${{ secrets.GITHUB_TOKEN }}
    
    - name: Extract metadata
      id: meta
      uses: docker/metadata-action@v4
      with:
        images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
        tags: |
          type=ref,event=branch
          type=ref,event=pr
          type=sha,prefix={{branch}}-
    
    - name: Build and push
      id: build
      uses: docker/build-push-action@v4
      with:
        context: .
        push: true
        tags: ${{ steps.meta.outputs.tags }}
        labels: ${{ steps.meta.outputs.labels }}
        cache-from: type=gha
        cache-to: type=gha,mode=max

  deploy-staging:
    needs: build
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/develop'
    environment: staging
    
    steps:
    - name: Deploy to staging
      run: |
        echo "Deploying ${{ needs.build.outputs.image }} to staging"
        # Add deployment commands here

  deploy-production:
    needs: build
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    environment: production
    
    steps:
    - name: Deploy to production
      run: |
        echo "Deploying ${{ needs.build.outputs.image }} to production"
        # Add deployment commands here

Advanced Pipeline Features

Multi-Environment Deployment

# Advanced deployment with approvals
deploy:
  strategy:
    matrix:
      environment: [staging, production]
  runs-on: ubuntu-latest
  environment: ${{ matrix.environment }}
  
  steps:
  - name: Configure environment
    run: |
      case "${{ matrix.environment }}" in
        staging)
          echo "CLUSTER_NAME=staging-cluster" >> $GITHUB_ENV
          echo "NAMESPACE=staging" >> $GITHUB_ENV
          ;;
        production)
          echo "CLUSTER_NAME=prod-cluster" >> $GITHUB_ENV
          echo "NAMESPACE=production" >> $GITHUB_ENV
          ;;
      esac
  
  - name: Deploy with Helm
    run: |
      helm upgrade --install myapp ./helm-chart \
        --namespace ${{ env.NAMESPACE }} \
        --set image.tag=${{ github.sha }} \
        --set environment=${{ matrix.environment }} \
        --wait --timeout=300s

Blue-Green Deployment

#!/bin/bash
# Blue-Green deployment script
set -euo pipefail

APP_NAME="myapp"
NAMESPACE="production"
NEW_VERSION="${1:?Usage: $0 <new-version>}"
CURRENT_SLOT=$(kubectl -n "$NAMESPACE" get service "$APP_NAME" -o jsonpath='{.spec.selector.slot}')

if [ "$CURRENT_SLOT" = "blue" ]; then
    DEPLOY_SLOT="green"
else
    DEPLOY_SLOT="blue"
fi

echo "Current slot: $CURRENT_SLOT"
echo "Deploying to slot: $DEPLOY_SLOT"

# Deploy to the inactive slot
kubectl -n "$NAMESPACE" set image "deployment/$APP_NAME-$DEPLOY_SLOT" app="$APP_NAME:$NEW_VERSION"
kubectl -n "$NAMESPACE" rollout status "deployment/$APP_NAME-$DEPLOY_SLOT" --timeout=300s

# Wait for the new pods to become ready
kubectl -n "$NAMESPACE" wait --for=condition=ready pod \
  -l "app=$APP_NAME,slot=$DEPLOY_SLOT" --timeout=300s

# Smoke test (cluster-internal DNS: run from inside the cluster or via port-forward)
if curl -fsS "http://$APP_NAME-$DEPLOY_SLOT.$NAMESPACE.svc.cluster.local/health"; then
    echo "Health check passed. Switching traffic to $DEPLOY_SLOT"
    kubectl -n "$NAMESPACE" patch service "$APP_NAME" \
      -p '{"spec":{"selector":{"slot":"'"$DEPLOY_SLOT"'"}}}'
    echo "Traffic switched successfully"
else
    echo "Health check failed. Leaving traffic on $CURRENT_SLOT"
    exit 1
fi

Infrastructure as Code (IaC)

Terraform Best Practices

Project Structure

terraform/
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── terraform.tfvars
│   │   └── backend.tf
│   ├── staging/
│   └── production/
├── modules/
│   ├── vpc/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── eks/
│   └── rds/
└── shared/
    ├── variables.tf
    └── outputs.tf

Infrastructure Module Example

# modules/vpc/main.tf
resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = merge(var.common_tags, {
    Name = "${var.project_name}-vpc"
  })
}

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id

  tags = merge(var.common_tags, {
    Name = "${var.project_name}-igw"
  })
}

resource "aws_subnet" "public" {
  count = length(var.public_subnet_cidrs)

  vpc_id                  = aws_vpc.main.id
  cidr_block              = var.public_subnet_cidrs[count.index]
  availability_zone       = data.aws_availability_zones.available.names[count.index]
  map_public_ip_on_launch = true

  tags = merge(var.common_tags, {
    Name = "${var.project_name}-public-${count.index + 1}"
    Type = "public"
  })
}

resource "aws_subnet" "private" {
  count = length(var.private_subnet_cidrs)

  vpc_id            = aws_vpc.main.id
  cidr_block        = var.private_subnet_cidrs[count.index]
  availability_zone = data.aws_availability_zones.available.names[count.index]

  tags = merge(var.common_tags, {
    Name = "${var.project_name}-private-${count.index + 1}"
    Type = "private"
  })
}

# NAT Gateways for private subnet internet access
resource "aws_eip" "nat" {
  count = length(var.public_subnet_cidrs)

  domain = "vpc"
  depends_on = [aws_internet_gateway.main]

  tags = merge(var.common_tags, {
    Name = "${var.project_name}-nat-eip-${count.index + 1}"
  })
}

resource "aws_nat_gateway" "main" {
  count = length(var.public_subnet_cidrs)

  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id

  tags = merge(var.common_tags, {
    Name = "${var.project_name}-nat-${count.index + 1}"
  })
}

# Route tables
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }

  tags = merge(var.common_tags, {
    Name = "${var.project_name}-public-rt"
  })
}

resource "aws_route_table" "private" {
  count = length(var.private_subnet_cidrs)

  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.main[count.index].id
  }

  tags = merge(var.common_tags, {
    Name = "${var.project_name}-private-rt-${count.index + 1}"
  })
}

# Route table associations
resource "aws_route_table_association" "public" {
  count = length(var.public_subnet_cidrs)

  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "private" {
  count = length(var.private_subnet_cidrs)

  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private[count.index].id
}

data "aws_availability_zones" "available" {
  state = "available"
}

Environment Configuration

# environments/production/main.tf
terraform {
  required_version = ">= 1.0"
  
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "production/terraform.tfstate"
    region         = "us-west-2"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

provider "aws" {
  region = var.aws_region

  default_tags {
    tags = {
      Environment = "production"
      Project     = var.project_name
      ManagedBy   = "terraform"
    }
  }
}

module "vpc" {
  source = "../../modules/vpc"

  project_name = var.project_name
  vpc_cidr     = "10.0.0.0/16"
  
  public_subnet_cidrs  = ["10.0.1.0/24", "10.0.2.0/24"]
  private_subnet_cidrs = ["10.0.10.0/24", "10.0.20.0/24"]

  common_tags = {
    Environment = "production"
    Project     = var.project_name
  }
}

module "eks" {
  source = "../../modules/eks"

  cluster_name     = "${var.project_name}-prod"
  cluster_version  = "1.27"
  vpc_id           = module.vpc.vpc_id
  subnet_ids       = module.vpc.private_subnet_ids

  node_groups = {
    main = {
      instance_types = ["t3.large"]
      min_size       = 1
      max_size       = 10
      desired_size   = 3
    }
  }
}

Ansible for Configuration Management

Playbook Structure

# site.yml
---
- name: Configure web servers
  hosts: webservers
  become: yes
  vars:
    app_name: mywebapp
    app_port: 3000
    
  roles:
    - common
    - nginx
    - nodejs
    - application

- name: Configure database servers  
  hosts: dbservers
  become: yes
  vars:
    postgres_version: 13
    
  roles:
    - common
    - postgresql
    - monitoring

Application Deployment Role

# roles/application/tasks/main.yml
---
- name: Create application user
  user:
    name: "{{ app_user }}"
    system: yes
    shell: /bin/false
    home: "{{ app_home }}"
    create_home: yes

- name: Create application directories
  file:
    path: "{{ item }}"
    state: directory
    owner: "{{ app_user }}"
    group: "{{ app_user }}"
    mode: '0755'
  loop:
    - "{{ app_home }}"
    - "{{ app_home }}/releases"
    - "{{ app_home }}/shared"
    - "{{ app_home }}/shared/logs"

- name: Download application artifact
  get_url:
    url: "{{ artifact_url }}"
    dest: "{{ app_home }}/releases/{{ app_version }}.tar.gz"
    owner: "{{ app_user }}"
    group: "{{ app_user }}"

- name: Extract application
  unarchive:
    src: "{{ app_home }}/releases/{{ app_version }}.tar.gz"
    dest: "{{ app_home }}/releases/"
    owner: "{{ app_user }}"
    group: "{{ app_user }}"
    remote_src: yes
    creates: "{{ app_home }}/releases/{{ app_version }}"

- name: Install dependencies
  npm:
    path: "{{ app_home }}/releases/{{ app_version }}"
    production: yes
  become_user: "{{ app_user }}"

- name: Create current symlink
  file:
    src: "{{ app_home }}/releases/{{ app_version }}"
    dest: "{{ app_home }}/current"
    state: link
    owner: "{{ app_user }}"
    group: "{{ app_user }}"
  notify: restart application

- name: Template systemd service
  template:
    src: app.service.j2
    dest: "/etc/systemd/system/{{ app_name }}.service"
  notify:
    - reload systemd
    - restart application

- name: Start and enable application service
  systemd:
    name: "{{ app_name }}"
    state: started
    enabled: yes
    daemon_reload: yes

Testing Strategies

Testing Pyramid

┌─────────────────────────────────────────────────────────────────┐
│                         Testing Pyramid                         │
├─────────────────────────────────────────────────────────────────┤
│                          Manual Tests                           │
│                ┌─────────────────────┐                          │
│                │   E2E Tests         │                          │
│            ┌───┴─────────────────────┴───┐                      │
│            │   Integration Tests         │                      │
│        ┌───┴─────────────────────────────┴───┐                  │
│        │        Unit Tests                   │                  │
│        └─────────────────────────────────────┘                  │
└─────────────────────────────────────────────────────────────────┘

Unit Testing Example

// test/user.test.js
const { expect } = require('chai');
const sinon = require('sinon');
const User = require('../src/models/User');
const userService = require('../src/services/userService');

describe('User Service', () => {
  let userStub;
  
  beforeEach(() => {
    userStub = sinon.stub(User, 'findById');
  });
  
  afterEach(() => {
    userStub.restore();
  });

  describe('getUserById', () => {
    it('should return user when found', async () => {
      const mockUser = { id: 1, name: 'John Doe', email: 'john@example.com' };
      userStub.resolves(mockUser);

      const result = await userService.getUserById(1);

      expect(result).to.deep.equal(mockUser);
      expect(userStub.calledOnceWith(1)).to.be.true;
    });

    it('should throw error when user not found', async () => {
      userStub.resolves(null);

      try {
        await userService.getUserById(999);
        expect.fail('Expected error to be thrown');
      } catch (error) {
        expect(error.message).to.equal('User not found');
      }
    });
  });
});

Integration Testing

// test/integration/api.test.js
const { expect } = require('chai');
const request = require('supertest');
const app = require('../../src/app');
const db = require('../../src/database');

describe('User API Integration Tests', () => {
  before(async () => {
    await db.migrate.latest();
  });

  beforeEach(async () => {
    await db.seed.run();
  });

  after(async () => {
    await db.migrate.rollback();
    await db.destroy();
  });

  describe('GET /api/users/:id', () => {
    it('should return user details', async () => {
      const response = await request(app)
        .get('/api/users/1')
        .expect(200);

      expect(response.body).to.have.property('id', 1);
      expect(response.body).to.have.property('name');
      expect(response.body).to.have.property('email');
    });

    it('should return 404 for non-existent user', async () => {
      await request(app)
        .get('/api/users/999')
        .expect(404);
    });
  });
});

End-to-End Testing with Cypress

// cypress/e2e/user-journey.cy.js
describe('User Journey', () => {
  beforeEach(() => {
    cy.visit('/');
  });

  it('should allow user to register, login, and access dashboard', () => {
    // Registration
    cy.get('[data-testid="register-link"]').click();
    cy.get('[data-testid="email-input"]').type('test@example.com');
    cy.get('[data-testid="password-input"]').type('password123');
    cy.get('[data-testid="register-button"]').click();

    cy.url().should('include', '/dashboard');
    cy.get('[data-testid="welcome-message"]').should('contain', 'Welcome');

    // Logout
    cy.get('[data-testid="logout-button"]').click();
    cy.url().should('eq', Cypress.config().baseUrl + '/');

    // Login
    cy.get('[data-testid="login-link"]').click();
    cy.get('[data-testid="email-input"]').type('test@example.com');
    cy.get('[data-testid="password-input"]').type('password123');
    cy.get('[data-testid="login-button"]').click();

    cy.url().should('include', '/dashboard');
  });
});

Monitoring and Observability

Application Performance Monitoring

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'application'
    static_configs:
      - targets: ['app:3000']
    metrics_path: /metrics
    scrape_interval: 5s

Application Metrics Example

// src/middleware/metrics.js
const prometheus = require('prom-client');

// Create metrics
const httpRequestDuration = new prometheus.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.1, 0.5, 1, 2, 5]
});

const httpRequestsTotal = new prometheus.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});

const activeConnections = new prometheus.Gauge({
  name: 'active_connections',
  help: 'Number of active connections'
});

// Middleware to collect metrics
function metricsMiddleware(req, res, next) {
  const start = Date.now();
  activeConnections.inc();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    // Use the matched route pattern (not the raw URL) to keep label cardinality low
    const route = req.route ? req.route.path : req.path;

    httpRequestDuration
      .labels(req.method, route, String(res.statusCode))
      .observe(duration);

    httpRequestsTotal
      .labels(req.method, route, String(res.statusCode))
      .inc();

    activeConnections.dec();
  });

  next();
}

module.exports = {
  metricsMiddleware,
  register: prometheus.register
};
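The Histogram above uses cumulative buckets: each `le` bound counts every observation at or below it. A toy model of that bucketing (not prom-client's implementation; observeAll is a hypothetical helper) makes the semantics concrete:

```javascript
// Toy model of Prometheus histogram semantics: each bucket counts
// ALL observations less than or equal to its bound (cumulative).
function observeAll(buckets, durations) {
  const counts = Object.fromEntries(buckets.map((b) => [b, 0]));
  let sum = 0;
  for (const d of durations) {
    sum += d;
    for (const b of buckets) {
      if (d <= b) counts[b] += 1;
    }
  }
  // Prometheus also exposes _sum and _count alongside the buckets
  return { counts, sum, count: durations.length };
}

// Same buckets as the middleware above
const result = observeAll([0.1, 0.5, 1, 2, 5], [0.05, 0.3, 0.7, 3]);
// result.counts → { '0.1': 1, '0.5': 2, '1': 3, '2': 3, '5': 4 }
```

This is why bucket counts in /metrics output are monotonically non-decreasing as `le` grows, and why quantiles can be estimated from them.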

Logging Best Practices

Structured Logging

// src/utils/logger.js
const winston = require('winston');

const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json()
  ),
  defaultMeta: {
    service: 'myapp',
    version: process.env.APP_VERSION || '1.0.0'
  },
  transports: [
    new winston.transports.Console({
      format: winston.format.combine(
        winston.format.colorize(),
        winston.format.simple()
      )
    }),
    new winston.transports.File({
      filename: 'logs/error.log',
      level: 'error'
    }),
    new winston.transports.File({
      filename: 'logs/combined.log'
    })
  ]
});

// Attach a request-scoped child logger (assumes an upstream middleware
// has already set req.id, e.g. a request-ID generator)
logger.addRequestId = (req, res, next) => {
  req.logger = logger.child({ requestId: req.id });
  next();
};

module.exports = logger;
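For reference, a sketch of the JSON line the configuration above emits. The field set follows from the format pipeline (timestamp, errors, json) and defaultMeta; winston's exact output may differ slightly, and formatLogEntry is an illustrative stand-in, not winston code:

```javascript
// Approximate shape of one JSON log line produced by the logger above
function formatLogEntry(level, message, meta = {}) {
  return JSON.stringify({
    level,
    message,
    timestamp: new Date().toISOString(),
    service: 'myapp',
    version: process.env.APP_VERSION || '1.0.0',
    ...meta,
  });
}

// e.g. req.logger.info('user created', { userId: 42 }) would produce
// a line like this, with requestId injected by the child logger:
const line = formatLogEntry('info', 'user created', {
  requestId: 'abc-123',
  userId: 42,
});
```

One JSON object per line is what makes these logs trivially parseable by Logstash or any other shipper downstream.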

ELK Stack Configuration

# docker-compose.elk.yml
version: '3.8'

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.15.0
    environment:
      - discovery.type=single-node
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ports:
      - "9200:9200"
    volumes:
      - elasticsearch_data:/usr/share/elasticsearch/data

  logstash:
    image: docker.elastic.co/logstash/logstash:7.15.0
    volumes:
      - ./logstash/pipeline:/usr/share/logstash/pipeline
      - ./logstash/config:/usr/share/logstash/config
    ports:
      - "5044:5044"
    depends_on:
      - elasticsearch

  kibana:
    image: docker.elastic.co/kibana/kibana:7.15.0
    ports:
      - "5601:5601"
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    depends_on:
      - elasticsearch

volumes:
  elasticsearch_data:

Security in DevOps (DevSecOps)

Security Scanning in CI/CD

# Security scanning job
security-scan:
  runs-on: ubuntu-latest
  steps:
  - uses: actions/checkout@v3
  
  - name: Run Trivy vulnerability scanner
    uses: aquasecurity/trivy-action@master
    with:
      image-ref: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
      format: 'sarif'
      output: 'trivy-results.sarif'
  
  - name: Upload Trivy scan results
    uses: github/codeql-action/upload-sarif@v2
    with:
      sarif_file: 'trivy-results.sarif'
  
  - name: Run Snyk security scan
    uses: snyk/actions/node@master
    env:
      SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}
    with:
      args: --severity-threshold=high

Secrets Management

#!/bin/bash
# secrets-management.sh
set -euo pipefail

# Using HashiCorp Vault: generate and store strong secrets
vault kv put secret/myapp \
  database_password="$(openssl rand -base64 32)" \
  api_key="$(openssl rand -hex 32)" \
  jwt_secret="$(openssl rand -base64 64)"

# Authenticate to Vault from inside a Kubernetes pod using the
# pod's service account token
vault write auth/kubernetes/login role=myapp-role \
  jwt="$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)"

# Mirror a Vault value into a Kubernetes secret
DB_PASSWORD=$(vault kv get -field=database_password secret/myapp)
kubectl create secret generic app-secrets \
  --from-literal=database_password="$DB_PASSWORD"

Performance and Scalability

Auto-scaling Configuration

# Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 2
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

---
# Vertical Pod Autoscaler
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: myapp-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: app
      maxAllowed:
        cpu: 2
        memory: 4Gi
      minAllowed:
        cpu: 100m
        memory: 128Mi
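The HPA's scaling decision follows the formula documented for Kubernetes autoscaling, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the min/max bounds. A sketch of the core calculation (ignoring stabilization windows and the tolerance band the real controller applies):

```javascript
// HPA scaling decision (formula from the Kubernetes HPA docs):
// desired = ceil(currentReplicas * currentMetric / targetMetric),
// then clamped to [minReplicas, maxReplicas]
function desiredReplicas(current, currentUtil, targetUtil, min, max) {
  const desired = Math.ceil(current * (currentUtil / targetUtil));
  return Math.min(max, Math.max(min, desired));
}

// With the HPA above (70% CPU target, minReplicas 2, maxReplicas 50):
desiredReplicas(3, 140, 70, 2, 50); // → 6: utilization doubled, so replicas double
desiredReplicas(3, 20, 70, 2, 50);  // → 2: clamped to minReplicas
```

With multiple metrics configured, as above, the controller computes a desired count per metric and scales to the largest of them.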