Deployment Strategies
Introduction
A deployment strategy determines how new code reaches production and what happens if problems are detected. The right strategy balances:
- Risk: How much can go wrong?
- Speed: How fast can we deploy?
- Cost: What infrastructure is required?
- Complexity: How hard is it to implement and operate?
No single strategy fits all situations. Mature organizations use different strategies for different services based on risk profile, infrastructure, and business requirements.
Hot Deploy (In-Place Update)
What It Is
Hot deploy replaces running instances with new versions directly, in place, with a brief service interruption.
Process:
- Stop running application instance
- Replace application files with new version
- Start application with new code
- Verify health checks
- Repeat for additional instances (if multiple)
Characteristics:
- Fastest deployment (seconds to minutes)
- Brief downtime during replacement (< 30 seconds per instance)
- Simple to implement
- Low infrastructure cost (no extra resources)
- Simple rollback (redeploy previous version)
When to Use
Best for:
- Single-instance applications (no load balancing)
- Acceptable brief downtime (internal tools, batch processing)
- Fast-starting applications (startup < 10 seconds)
- Small deployable modules (< 100 MB artifacts)
- Immature infrastructure (limited resources)
Not suitable for:
- User-facing applications requiring zero downtime
- Long startup times (> 1 minute)
- Stateful applications (active sessions lost)
- High-availability requirements
Example Implementation
#!/bin/bash
# hot-deploy.sh
echo "Stopping application..."
systemctl stop myapp
echo "Backing up current version..."
cp /opt/myapp/app /opt/myapp/app.backup
echo "Deploying new version..."
cp /artifacts/app-v1.2.0 /opt/myapp/app
chmod +x /opt/myapp/app
echo "Starting application..."
systemctl start myapp
echo "Verifying health..."
sleep 5
curl -f http://localhost:8080/health || {
echo "Health check failed, rolling back..."
systemctl stop myapp
cp /opt/myapp/app.backup /opt/myapp/app
systemctl start myapp
exit 1
}
echo "Deployment successful"
Tradeoffs
Advantages:
- ✅ Simple to understand and implement
- ✅ Fast (1-2 minutes total)
- ✅ Low infrastructure cost
- ✅ Works with simple infrastructure
Disadvantages:
- ❌ Brief downtime (< 30 seconds)
- ❌ All-or-nothing (no gradual rollout)
- ❌ Rollback requires redeployment (1-2 minutes)
- ❌ Stateful sessions lost
Rolling Deployment
What It Is
Rolling deployment updates instances gradually - one or a few at a time - while others continue serving traffic.
Process:
- Remove instance #1 from load balancer
- Update instance #1 to new version
- Verify health checks on instance #1
- Add instance #1 back to load balancer
- Repeat for instance #2, #3, etc.
Characteristics:
- Zero downtime
- Gradual rollout (percentage increases over time)
- Mixed versions running temporarily
- Built-in health checks stop rollout if issues detected
- Medium complexity
When to Use
Best for:
- Multiple instances (≥ 2 instances behind load balancer)
- Zero downtime required (user-facing applications)
- Backward-compatible changes (new version can coexist with old)
- Automated health checks (can detect deployment issues)
Not suitable for:
- Single instance applications
- Breaking changes (protocol changes, incompatible APIs)
- Database migrations requiring all instances on same version
Example Implementation
#!/bin/bash
# rolling-deploy.sh
INSTANCES=("instance-1" "instance-2" "instance-3")
for instance in "${INSTANCES[@]}"; do
echo "Deploying to $instance..."
# Remove from load balancer
aws elb deregister-instances-from-load-balancer \
--load-balancer-name my-lb \
--instances $instance
# Wait for drain
sleep 30
# Deploy new version
ssh $instance "sudo systemctl stop myapp && \
sudo cp /artifacts/app-v1.2.0 /opt/myapp/app && \
sudo systemctl start myapp"
# Wait for startup
sleep 10
# Health check
if ! ssh $instance "curl -f http://localhost:8080/health"; then
echo "Health check failed on $instance, aborting rollout"
# Re-register healthy instances
aws elb register-instances-with-load-balancer \
--load-balancer-name my-lb \
--instances $instance
exit 1
fi
# Add back to load balancer
aws elb register-instances-with-load-balancer \
--load-balancer-name my-lb \
--instances $instance
echo "$instance deployed successfully"
done
echo "Rolling deployment complete"
Tradeoffs
Advantages:
- ✅ Zero downtime
- ✅ Gradual rollout (issues affect fewer instances)
- ✅ Automatic halt on health check failures
- ✅ Lower infrastructure cost (no extra instances)
Disadvantages:
- ❌ Requires multiple instances
- ❌ Mixed versions during rollout (must be compatible)
- ❌ Slower rollback (must roll forward or roll back each instance)
- ❌ Session affinity issues (if stateful)
Blue-Green Deployment
What It Is
Blue-green deployment maintains two identical production environments.
It is often the preferred way to stage a deployment; even rolling or ring-based rollouts can apply blue-green within individual rings.
One color serves traffic (active), while the other is idle (inactive).
Deploy to inactive color, test, then switch traffic to new active.
Environments:
Typically named Blue and Green, though numbered environments (001, 002, 003, and so on) work equally well.
- active: Current production version, serving live traffic
- inactive: New version deployed, idle (or receiving test traffic)
Process:
- Blue environment serves 100% production traffic
- Deploy new version to Green environment
- Run smoke tests against Green (0% production traffic)
- Switch load balancer/DNS from Blue to Green (instant cutover)
- Monitor Green with 100% production traffic
- Blue becomes idle (ready for next deployment or instant rollback)
Characteristics:
- Zero downtime
- Instant traffic switch (seconds)
- Instant rollback (switch back to inactive)
- High infrastructure cost (2x resources)
- Requires infrastructure automation
When to Use
Best for:
- Instant rollback critical (high-risk deployments)
- Large artifacts or slow startup (can't afford multiple rolling updates)
- Database migrations (need to test migration fully before traffic switch)
- High-stakes releases (major version updates)
- Infrastructure as Code (can automate environment creation)
Not suitable for:
- License-constrained environments (2x production infrastructure can double per-instance licensing costs)
- Stateful applications (databases, session state) without special handling for state transfer between environments
Note: with pay-as-you-go cloud pricing, the duplicate environment need not increase costs significantly if the idle color is scaled down or destroyed between deployments.
Example Implementation
#!/bin/bash
# blue-green-deploy.sh
BLUE_ENV="myapp-blue"
GREEN_ENV="myapp-green"
LISTENER_ARN="arn:aws:elasticloadbalancing:..."
BLUE_TARGET_GROUP_ARN="arn:aws:elasticloadbalancing:..."
GREEN_TARGET_GROUP_ARN="arn:aws:elasticloadbalancing:..."
# Current production is Blue
echo "Current production: $BLUE_ENV"
# Deploy to Green
echo "Deploying to $GREEN_ENV..."
terraform apply -var="environment=green" -var="version=v1.2.0"
# Wait for Green to be healthy
echo "Waiting for $GREEN_ENV health checks..."
aws elbv2 wait target-in-service \
    --target-group-arn $GREEN_TARGET_GROUP_ARN
# Run smoke tests against Green
echo "Running smoke tests on $GREEN_ENV..."
./smoke-tests.sh https://green.myapp.internal
if [ $? -ne 0 ]; then
echo "Smoke tests failed, aborting deployment"
exit 1
fi
# Switch traffic to Green
echo "Switching traffic to $GREEN_ENV..."
aws elbv2 modify-listener \
--listener-arn $LISTENER_ARN \
--default-actions Type=forward,TargetGroupArn=$GREEN_TARGET_GROUP_ARN
echo "Traffic switched to $GREEN_ENV"
echo "$BLUE_ENV is now idle (ready for rollback or next deployment)"
Rollback
Instant rollback:
# Switch traffic back to Blue
aws elbv2 modify-listener \
--listener-arn $LISTENER_ARN \
--default-actions Type=forward,TargetGroupArn=$BLUE_TARGET_GROUP_ARN
Rollback completes in seconds - just switch traffic back.
Tradeoffs
Advantages:
- ✅ Instant rollback (seconds)
- ✅ Zero downtime
- ✅ Test in production-like environment before traffic switch
- ✅ Clean separation (no mixed versions)
Disadvantages:
- ❌ Higher cost and complexity (2x infrastructure running during deployment, plus traffic-switching load balancing)
- ❌ Database migrations complex (state transfers between Blue/Green)
- ❌ Requires infrastructure automation
- ❌ Per-instance licensing can double costs (two full production environments)
Canary Deployment
What It Is
Canary deployment routes a small percentage of traffic to the new version, monitors metrics, and gradually increases traffic if healthy.
Process:
- Deploy new version alongside current version
- Route 1-5% traffic to new version (canary)
- Monitor canary metrics (errors, latency, business KPIs)
- If healthy: Gradually increase traffic (10% → 25% → 50% → 100%)
- If unhealthy: Route 0% traffic to canary (instant rollback)
Characteristics:
- Zero downtime
- Gradual, metrics-driven rollout
- Early warning with minimal blast radius (1-5% affected)
- Requires robust monitoring and metrics
- Requires traffic routing capability
- Medium infrastructure cost
When to Use
Best for:
- Need production validation (can't fully test in staging)
- Metrics-driven decisions (error rates, latency, business KPIs)
- Risk-averse organizations (gradual increases confidence)
- A/B testing infrastructure (already have traffic routing)
- Large user base (1% still significant sample size)
Not suitable for:
- Small user base (1% is too few users for meaningful metrics)
- Infrastructure without traffic routing capability
- Applications without good observability
- Breaking changes (can't coexist with old version)
Example Implementation
# canary-deployment.yaml (using Kubernetes)
apiVersion: v1
kind: Service
metadata:
name: myapp
spec:
selector:
app: myapp
ports:
- port: 80
---
# Stable version (95% traffic)
apiVersion: apps/v1
kind: Deployment
metadata:
name: myapp-stable
spec:
replicas: 19 # 95% of traffic
selector:
matchLabels:
app: myapp
version: v1.1.0
template:
metadata:
labels:
app: myapp
version: v1.1.0
spec:
containers:
- name: myapp
image: myapp:v1.1.0
---
# Canary version (5% traffic)
apiVersion: apps/v1
kind: Deployment
metadata:
name: myapp-canary
spec:
replicas: 1 # 5% of traffic
selector:
matchLabels:
app: myapp
version: v1.2.0
template:
metadata:
labels:
app: myapp
version: v1.2.0
spec:
containers:
- name: myapp
image: myapp:v1.2.0
Automated progressive rollout:
# canary-controller.py
import time

# Thresholds are illustrative; tune them to your service's SLOs
THRESHOLD_ERROR_RATE = 0.01   # 1% errors
THRESHOLD_LATENCY = 500       # p95 latency in ms
SUCCESS, FAILED = "success", "failed"

# update_traffic_split() and get_canary_metrics() are environment-specific
# hooks into your traffic router and observability platform

def canary_rollout():
    percentages = [5, 10, 25, 50, 100]
    for pct in percentages:
        print(f"Routing {pct}% traffic to canary...")
        update_traffic_split(canary_percent=pct)
        # Monitor for 15 minutes before increasing traffic
        time.sleep(900)
        metrics = get_canary_metrics()
        if metrics['error_rate'] > THRESHOLD_ERROR_RATE:
            print("Error rate too high, rolling back")
            update_traffic_split(canary_percent=0)
            return FAILED
        if metrics['p95_latency'] > THRESHOLD_LATENCY:
            print("Latency too high, rolling back")
            update_traffic_split(canary_percent=0)
            return FAILED
        print(f"{pct}% canary healthy, proceeding")
    print("Canary fully rolled out")
    return SUCCESS
Metrics to Monitor
Technical metrics:
- Error rate (4xx, 5xx responses)
- P50, P95, P99 latency
- Request throughput
- Resource utilization (CPU, memory)
Business metrics:
- Conversion rate
- User engagement
- Revenue impact
- Customer complaints
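The canary-versus-stable comparison can be automated by querying per-version metrics directly. The sketch below assumes a Prometheus server reachable at prometheus:9090 and an http_requests_total counter labeled with version and status; the address, metric name, and labels are assumptions to adapt to your stack.
# compare-canary.py (illustrative)
import requests

PROM_URL = "http://prometheus:9090/api/v1/query"  # assumed address

def error_rate(version):
    # Fraction of requests returning 5xx over the last 5 minutes
    query = (
        f'sum(rate(http_requests_total{{version="{version}",status=~"5.."}}[5m]))'
        f' / sum(rate(http_requests_total{{version="{version}"}}[5m]))'
    )
    resp = requests.get(PROM_URL, params={"query": query})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

canary, stable = error_rate("v1.2.0"), error_rate("v1.1.0")
# Flag the canary if its error rate is meaningfully worse than stable
if canary > max(2 * stable, 0.01):
    print(f"Canary unhealthy: {canary:.2%} errors vs stable {stable:.2%}")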
Tradeoffs
Advantages:
- ✅ Minimal blast radius (1-5% affected initially)
- ✅ Production validation before full rollout
- ✅ Instant rollback (route 0% to canary)
- ✅ Metrics-driven confidence building
Disadvantages:
- ❌ Requires traffic routing infrastructure
- ❌ Requires robust observability
- ❌ Complex to implement
- ❌ Slower rollout (hours, not minutes)
Feature Flags with Deployment
What It Is
Feature flags decouple deployment (code reaches production) from release (features enabled for users).
Deploy code with features disabled, enable via runtime flags.
Process:
- Deploy new code to production (features OFF via flags)
- Gradually enable features for user segments:
- Internal users (0.1%)
- Early adopters (1%)
- Standard users (10% → 50% → 100%)
- Monitor metrics per enabled segment
- Instant disable if issues detected (toggle flag OFF)
Characteristics:
- Zero downtime deployments
- Independent feature rollout per feature
- Instant feature disable (no redeployment)
- Gradual rollout per feature
- Enables A/B testing
- Requires feature flag infrastructure
When to Use
Best for:
- Continuous Deployment (CDe) pattern (flags are effectively a prerequisite for fully automated production deployment)
- A/B testing (compare feature variants)
- Gradual rollouts (enable features progressively)
- Kill switches (instant disable capability)
- Beta programs (enable for specific users)
Not suitable for:
- Infrastructure changes (can't flag infrastructure)
- Database schema changes (flags don't help)
- Very simple applications (overhead not worth it)
Example Implementation
Feature flag service:
// featureflags/featureflags.go
package featureflags

import "context"

// FlagClient abstracts the flag provider (flagd, LaunchDarkly, Unleash, ...);
// the exact evaluation API depends on the SDK you choose.
type FlagClient interface {
    BoolValue(ctx context.Context, flag string, defaultValue bool, evalCtx map[string]any) bool
}

type Service struct {
    client FlagClient
}

func (s *Service) IsEnabled(ctx context.Context, feature, userID string) bool {
    // Check if the feature is enabled for this user; default to OFF
    // if the flag is missing or the provider is unreachable
    return s.client.BoolValue(ctx, feature, false, map[string]any{
        "userID": userID,
    })
}
Application code:
// handler.go
func (h *Handler) HandleRequest(w http.ResponseWriter, r *http.Request) {
userID := getUserID(r)
if h.flags.IsEnabled(r.Context(), "new-checkout-flow", userID) {
// New feature
h.handleNewCheckout(w, r)
} else {
// Old feature
h.handleOldCheckout(w, r)
}
}
Feature flag configuration:
# flags.yaml
new-checkout-flow:
state: ENABLED
variants:
on: true
off: false
defaultVariant: off
targeting:
- if:
- in:
- userID
- ["user-1", "user-2", "user-3"] # Internal users
variant: on
- if:
- startsWith:
- email
- "beta-" # Beta users
variant: on
- if:
- percentage:
- flagKey: new-checkout-flow
- 10 # 10% rollout
variant: on
Progressive Rollout with Flags
Stage 12 (Release Toggling):
- Deploy code to production (Stage 10) - flags OFF
- Enable for internal users (0.1%)
- Monitor metrics, if healthy: enable for 1%
- Monitor metrics, if healthy: enable for 10%
- Monitor metrics, if healthy: enable for 50%
- Monitor metrics, if healthy: enable for 100%
Each step is independent of deployment - just toggle configuration.
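For example, moving the flags.yaml rollout above from 10% to 50% is a one-line configuration change. The sketch below automates it with PyYAML against the schema shown earlier; it is a toy illustration - most flag services expose an API or UI for this rather than direct file edits.
# bump-rollout.py (illustrative, assumes the flags.yaml schema above)
import yaml

def set_rollout_percentage(path, flag, pct):
    with open(path) as f:
        config = yaml.safe_load(f)
    for rule in config[flag]["targeting"]:
        condition = rule["if"][0]
        # Find the percentage-based rule and update its value
        if "percentage" in condition:
            condition["percentage"][1] = pct
    with open(path, "w") as f:
        yaml.safe_dump(config, f)

set_rollout_percentage("flags.yaml", "new-checkout-flow", 50)  # 10% -> 50%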
Tradeoffs
Advantages:
- ✅ Decouple deployment from release
- ✅ Instant feature disable (no redeployment)
- ✅ Gradual rollout per feature
- ✅ A/B testing capability
- ✅ Beta programs (enable for specific users)
Disadvantages:
- ❌ Code complexity (if/else branches)
- ❌ Feature flag infrastructure required
- ❌ Technical debt (must remove old code paths)
- ❌ Testing complexity (all flag combinations)
Strategy Selection Framework
Decision Tree
Question 1: Can you tolerate any downtime?
- No → Rolling, Blue-Green, Canary, or Feature Flags
- Yes (< 30s) → Hot Deploy
Question 2: Do you need instant rollback?
- Yes → Blue-Green or Canary
- No → Rolling or Hot Deploy
Question 3: Do you have robust monitoring/metrics?
- Yes → Canary (metrics-driven)
- No → Blue-Green (instant switch)
Question 4: Do you need gradual feature rollout?
- Yes → Feature Flags (with any deployment strategy)
- No → Any strategy
Question 5: What's your infrastructure budget?
- Limited → Hot Deploy or Rolling
- Moderate → Canary or Feature Flags
- High → Blue-Green
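Teams that want this codified in a runbook can encode the questions directly. The function below is a toy rendering of the tree above, not a standard API:
# choose-strategy.py (illustrative)
def recommend_strategy(downtime_ok, need_instant_rollback, have_metrics,
                       need_gradual_features, budget):
    """Mirrors the five questions; budget is 'limited', 'moderate', or 'high'."""
    if downtime_ok and budget == "limited":
        strategies = ["Hot Deploy"]
    elif need_instant_rollback:
        # Metrics-driven organizations can canary; otherwise prefer the
        # instant blue-green switch
        strategies = ["Canary"] if have_metrics else ["Blue-Green"]
    elif budget == "limited":
        strategies = ["Rolling"]
    else:
        strategies = ["Rolling", "Blue-Green", "Canary"]
    if need_gradual_features:
        strategies.append("Feature Flags")  # combines with any of the above
    return strategies

print(recommend_strategy(False, True, True, False, "moderate"))  # ['Canary']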
Strategy Comparison
| Strategy | Downtime | Rollback Speed | Complexity | Cost | Use Case |
|---|---|---|---|---|---|
| Hot Deploy | Brief | Fast (1-2 min) | Low | Low | Internal tools, batch jobs |
| Rolling | None | Medium (5 min) | Medium | Low | Standard web services |
| Blue-Green | None | Instant | Medium | High | High-risk, large deployments |
| Canary | None | Instant | High | Medium | Production validation needed |
| Feature Flags | None | Instant | High | Medium | Gradual rollouts, A/B testing |
RA vs CDe Pattern Recommendations
Release Approval (RA) Pattern
Recommended strategies:
- Blue-Green: Manual approval at Stage 9, then instant switch to production
- Rolling: Manual approval at Stage 9, then gradual rollout
Why these work:
- Manual approval provides explicit go/no-go decision
- Clear cut-over point (before or after traffic switch)
- Aligns with manual approval workflow
Typical flow:
- Deploy to Green (or first instances)
- Test in production (or on subset)
- Manual approval (Stage 9)
- Switch traffic (Blue-Green) or continue rolling
Continuous Deployment (CDe) Pattern
Recommended strategies:
- Canary: Automated with gradual validation
- Feature Flags: Deploy continuously, release independently
Why these work:
- Automated progression (no manual approval)
- Metrics-driven decisions
- Gradual increases confidence
- Instant rollback on threshold breaches
Typical flow:
- Deploy to production (automated)
- Route 1% traffic (Canary) or deploy with flags OFF
- Monitor metrics (automated)
- Gradually increase (automated if metrics healthy)
- Full rollout or instant rollback
CDe requires:
- Robust automated testing (Stages 5-6)
- Comprehensive monitoring and metrics
- Automated rollback triggers
- Feature flags (for independent release)
Implementation Considerations
Infrastructure Requirements
Hot Deploy:
- Minimal (single server, basic automation)
Rolling:
- Load balancer
- Multiple instances (≥ 2)
- Health check endpoints
- Orchestration (Kubernetes, AWS ECS, etc.)
Blue-Green:
- Infrastructure as Code (Terraform, CloudFormation)
- Load balancer with traffic switching
- 2x production capacity
- Automated environment provisioning
Canary:
- Traffic routing capability (service mesh, load balancer)
- Observability platform (Prometheus, Datadog)
- Automated metrics collection
- Rollout automation (Flagger, Argo Rollouts)
Feature Flags:
- Feature flag service (LaunchDarkly, Flagsmith, Unleash)
- Application integration
- Configuration management
- Flag lifecycle management
Monitoring Requirements
All strategies require:
- Health check endpoints (/health, /ready)
- Application metrics (errors, latency, throughput)
- Infrastructure metrics (CPU, memory, disk)
- Logging (structured, centralized)
Canary and Feature Flags additionally require:
- Per-version metrics (compare canary to stable)
- Business metrics (conversion, revenue)
- Automated alerting on threshold breaches
- Real-time dashboards
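As a concrete example of the baseline requirement, a liveness endpoint can be as small as the standard-library sketch below; a real service would usually add a separate /ready check that verifies dependencies (database, caches) before accepting traffic.
# health-endpoint.py (illustrative)
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            # Liveness: the process is up and able to answer requests
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"OK")
        else:
            self.send_response(404)
            self.end_headers()

HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()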
Rollback Procedures
Every deployment strategy must have documented rollback:
- Rollback triggers (when to rollback)
- Rollback steps (how to execute)
- Rollback time (how long it takes)
- Database rollback (if applicable)
- Verification (how to confirm rollback success)
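The first three items can often be automated. The sketch below is illustrative; current_error_rate and switch_traffic are hypothetical hooks into your monitoring and traffic layer.
# rollback-watch.py (illustrative)
import time

ERROR_RATE_THRESHOLD = 0.05   # trigger: > 5% errors during the watch window
WATCH_WINDOW_SECONDS = 600    # rollback time target: decide within 10 minutes

def watch_and_rollback(current_error_rate, switch_traffic):
    deadline = time.time() + WATCH_WINDOW_SECONDS
    while time.time() < deadline:
        if current_error_rate() > ERROR_RATE_THRESHOLD:
            switch_traffic(target="previous")  # execute the rollback step
            return "rolled-back"               # verify via health checks next
        time.sleep(30)
    return "healthy"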
Anti-Patterns
Anti-Pattern 1: No Rollback Plan
Problem: Deploy without defined rollback procedure
Impact: When issues arise, teams scramble to figure out how to recover
Solution: Document rollback procedure before deploying
Anti-Pattern 2: All-at-Once Deployments (Big Bang)
Problem: Deploy to all instances simultaneously
Impact: All users affected if deployment fails
Solution: Use gradual rollout (Rolling, Canary, Rings)
Anti-Pattern 3: Ignoring Canary Metrics
Problem: Roll out canary despite elevated error rates
Impact: Problems affect all users when fully rolled out
Solution: Automated rollback on threshold breaches
Anti-Pattern 4: Feature Flag Debt
Problem: Accumulating old feature flags, never removing
Impact: Code complexity, testing burden, technical debt
Solution: Remove flags after full rollout (set lifecycle policy)
Anti-Pattern 5: No Health Checks
Problem: Deploying without health check validation
Impact: Broken deployments serve traffic, users affected
Solution: Implement health checks, verify before adding to load balancer
Best Practices Summary
- Match strategy to risk: Higher risk → More gradual rollout
- Always have rollback: Document and test rollback procedures
- Automate deployments: Reduce human error, increase consistency
- Monitor actively: Watch metrics during and after deployment
- Health checks required: Never serve traffic from unhealthy instances
- Start simple: Hot Deploy → Rolling → Blue-Green → Canary → Feature Flags
- CDe requires maturity: Comprehensive testing, monitoring, feature flags
- RA for high-risk: Manual approval for critical systems
- Gradual rollouts: Minimize blast radius, build confidence
- Document everything: Runbooks, rollback procedures, decision criteria
Next Steps
- Deployment Rings - Progressive user group rollout pattern
- CD Model Stages 8-12 - See deployment in CD Model context
- Implementation Patterns - RA vs CDe pattern selection
- Environments Architecture - Deploy Agents and production environments
Quick Reference
Strategy Comparison
| Strategy | Rollback Speed | Downtime | Complexity | Resource Cost |
|---|---|---|---|---|
| Hot Deploy | Fast (1-2 min) | Brief (< 30s) | Low | Low |
| Rolling | Medium (5 min) | None | Medium | Low |
| Blue-Green | Instant | None | Medium | High (2x) |
| Canary | Instant | None | High | Medium |
| Rings | Gradual | None | High | Medium |