Continuous Integration and Continuous Deployment (CI/CD) transforms how teams ship software, enabling multiple deployments per day with confidence. Companies like Netflix (4,000+ deployments/day), Spotify (1,000+/day), and Etsy (50+/day) rely on sophisticated CI/CD pipelines to deliver features rapidly while maintaining stability.
This guide covers production-ready CI/CD patterns, from automated testing and build optimization to deployment strategies (blue-green, canary, rolling), rollback mechanisms, and zero-downtime deployments. We'll explore real-world implementations and learn how to build pipelines that ship code safely at scale.
CI/CD Fundamentals
Continuous Integration (CI): Automatically build and test code on every commit. Catch bugs early, ensure code quality, maintain deployable main branch.
Continuous Deployment (CD): Automatically deploy passing builds to production. Ship features faster, reduce manual errors, enable rapid iteration.
CI/CD Pipeline Stages
┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Source     │ ──>│    Build     │ ──>│     Test     │ ──>│    Deploy    │
│  (Git Push)  │    │  (Compile/   │    │  (Unit/Int/  │    │  (Staging/   │
│              │    │   Bundle)    │    │     E2E)     │    │  Production) │
└──────────────┘    └──────────────┘    └──────┬───────┘    └──────────────┘
                                               │
                                               v
                                      ┌──────────────────┐
                                      │  Quality Gates   │
                                      │ (Coverage/Lint/  │
                                      │  Security Scan)  │
                                      └──────────────────┘
Building a Robust CI Pipeline
1. Source Control Triggers
Trigger builds on every commit to ensure continuous integration:
# GitHub Actions
name: CI Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'
Best practice: Run CI on feature branches and main branch. Block PR merges if CI fails.
2. Parallel Test Execution
Speed up pipelines by parallelizing tests:
# Run tests in parallel across multiple jobs
jobs:
  test:
    strategy:
      matrix:
        test-suite: [unit, integration, e2e]
        node-version: [18, 20]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Node ${{ matrix.node-version }}
        uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node-version }}
      - name: Install dependencies
        run: npm ci
      - name: Run ${{ matrix.test-suite }} tests
        run: npm run test:${{ matrix.test-suite }}
      - name: Upload coverage
        uses: codecov/codecov-action@v4
        if: matrix.test-suite == 'unit'
Result: 6 parallel jobs (3 test suites × 2 Node versions) finish in 5 minutes vs 30 minutes sequential.
3. Build Caching
Cache dependencies and build artifacts to speed up pipelines:
# Docker layer caching
- name: Build Docker image
  uses: docker/build-push-action@v5
  with:
    context: .
    push: true
    tags: myapp:${{ github.sha }}
    cache-from: type=registry,ref=myapp:cache
    cache-to: type=registry,ref=myapp:cache,mode=max

# NPM dependency caching
- name: Cache dependencies
  uses: actions/cache@v4
  with:
    path: ~/.npm
    key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}
    restore-keys: |
      ${{ runner.os }}-node-
Speedup: Reduces build time from 12 minutes to 3 minutes with warm cache.
Netflix caches Docker layers and achieves 80% cache hit rate across 4,000 daily builds.
4. Quality Gates
Enforce code quality standards before deployment:
quality-check:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4

    # Linting
    - name: Run ESLint
      run: npm run lint

    # Type checking
    - name: TypeScript check
      run: npm run type-check

    # Code coverage
    - name: Check coverage threshold
      run: npm run test:coverage
    - name: Enforce 80% coverage
      run: |
        COVERAGE=$(jq '.total.lines.pct' coverage/coverage-summary.json)
        if (( $(echo "$COVERAGE < 80" | bc -l) )); then
          echo "Coverage $COVERAGE% is below 80% threshold"
          exit 1
        fi

    # Security scanning
    - name: Run Snyk security scan
      uses: snyk/actions/node@master
      env:
        SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}

    # Dependency audit
    - name: Audit dependencies
      run: npm audit --audit-level=high
Enforcement: Block deployments if any check fails. Spotify requires 90% test coverage and zero high-severity vulnerabilities.
5. Build Artifacts and Container Images
Package application for deployment:
# Multi-stage Docker build for optimized images
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
# Full install: dev dependencies are needed for the build step
RUN npm ci
COPY . .
RUN npm run build

FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
# Runtime image gets production dependencies only
RUN npm ci --omit=dev
COPY --from=builder /app/dist ./dist
ENV NODE_ENV=production
EXPOSE 3000
USER node
CMD ["node", "dist/index.js"]
# Build and push to registry
build-image:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - name: Login to Docker Hub
      uses: docker/login-action@v3
      with:
        username: ${{ secrets.DOCKER_USERNAME }}
        password: ${{ secrets.DOCKER_TOKEN }}
    - name: Build and push
      uses: docker/build-push-action@v5
      with:
        context: .
        push: true
        tags: |
          myapp:${{ github.sha }}
          myapp:latest
        platforms: linux/amd64,linux/arm64
Best practice: Tag images with the git SHA for traceability. Etsy can roll back to any commit from the past 6 months.
Deployment Strategies
1. Blue-Green Deployment
Run two identical production environments (blue = current, green = new). Switch traffic instantly.
# Deploy to green environment
deploy-green:
  runs-on: ubuntu-latest
  steps:
    - name: Deploy to green environment
      run: |
        kubectl set image deployment/app-green \
          app=myapp:${{ github.sha }} \
          -n production
        # Wait for rollout
        kubectl rollout status deployment/app-green -n production

    - name: Run smoke tests against green
      run: |
        curl -f https://green.myapp.com/health || exit 1
        npm run test:smoke -- --url=https://green.myapp.com

    - name: Switch traffic to green
      run: |
        # Update load balancer to point to green
        kubectl patch service app \
          -p '{"spec":{"selector":{"version":"green"}}}' \
          -n production

    - name: Monitor for 5 minutes
      run: |
        # Watch error rates
        ./scripts/monitor-deployment.sh --duration=5m

    - name: Mark blue as previous
      run: |
        kubectl label deployment/app-blue previous=true -n production
Rollback: Switch load balancer back to blue environment instantly.
Pros:
- Instant rollback (flip load balancer)
- Zero downtime
- Full environment testing before cutover
Cons:
- 2x infrastructure cost (two full environments)
- Database migrations tricky (must be backward compatible)
Netflix uses blue-green for major releases, enabling instant rollback for 4,000 services.
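The cutover-plus-health-gate logic above can be sketched in a few lines. This is a hypothetical illustration: `checkHealth` and `patchService` stand in for the smoke-test step and the `kubectl patch` call, and are not real APIs from this pipeline.

```javascript
// Sketch of a blue-green cutover gate: traffic only moves to green after
// several consecutive healthy responses; on any failure nothing is switched,
// so blue keeps serving and "rollback" is a no-op.
async function cutOver({ checkHealth, patchService, requiredPasses = 3 }) {
  for (let i = 0; i < requiredPasses; i++) {
    if (!(await checkHealth())) {
      return 'stayed-on-blue'; // green never received traffic
    }
  }
  // Equivalent to: kubectl patch service app -p '{"spec":{"selector":{"version":"green"}}}'
  await patchService({ spec: { selector: { version: 'green' } } });
  return 'switched-to-green';
}
```

Because blue keeps running untouched after the switch, rolling back later is the same patch with `version: "blue"`.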
2. Canary Deployment
Gradually shift traffic to new version, monitor metrics, roll forward or back.
# Canary deployment with gradual traffic shift
deploy-canary:
  runs-on: ubuntu-latest
  steps:
    - name: Deploy canary (10% traffic)
      run: |
        # Deploy new version with canary label
        kubectl apply -f k8s/deployment-canary.yaml

        # Configure traffic split: 90% stable, 10% canary
        kubectl apply -f - <<EOF
        apiVersion: networking.istio.io/v1alpha3
        kind: VirtualService
        metadata:
          name: app
        spec:
          hosts:
            - app.myapp.com
          http:
            - match:
                - headers:
                    canary:
                      exact: "true"
              route:
                - destination:
                    host: app
                    subset: canary
            - route:
                - destination:
                    host: app
                    subset: stable
                  weight: 90
                - destination:
                    host: app
                    subset: canary
                  weight: 10
        EOF

    - name: Monitor canary for 10 minutes
      run: |
        # Watch error rate, latency, CPU
        ./scripts/canary-monitor.sh \
          --duration=10m \
          --error-threshold=0.01 \
          --latency-p99-threshold=500ms

    - name: Promote to 50% traffic
      if: success()
      run: |
        kubectl patch virtualservice app \
          --type merge \
          -p '{"spec":{"http":[{"route":[
            {"destination":{"subset":"stable"},"weight":50},
            {"destination":{"subset":"canary"},"weight":50}
          ]}]}}'
        # Monitor for 10 more minutes
        ./scripts/canary-monitor.sh --duration=10m

    - name: Promote to 100% (full rollout)
      if: success()
      run: |
        kubectl patch virtualservice app \
          --type merge \
          -p '{"spec":{"http":[{"route":[
            {"destination":{"subset":"canary"},"weight":100}
          ]}]}}'
        # Update stable deployment to canary version
        kubectl set image deployment/app-stable \
          app=myapp:${{ github.sha }}

    - name: Rollback canary
      if: failure()
      run: |
        kubectl patch virtualservice app \
          --type merge \
          -p '{"spec":{"http":[{"route":[
            {"destination":{"subset":"stable"},"weight":100}
          ]}]}}'
        kubectl delete deployment app-canary
Canary Monitoring:
// scripts/canary-monitor.js — the logic behind the canary-monitor.sh step above.
// getMetrics queries the monitoring backend for each subset's stats.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function monitorCanary(duration, thresholds) {
  const startTime = Date.now();
  while (Date.now() - startTime < duration) {
    const metrics = await getMetrics(['stable', 'canary']);

    // Compare error rates
    const stableErrorRate = metrics.stable.errors / metrics.stable.requests;
    const canaryErrorRate = metrics.canary.errors / metrics.canary.requests;
    if (canaryErrorRate > stableErrorRate * 1.5 || canaryErrorRate > thresholds.errorRate) {
      console.error(`Canary error rate ${canaryErrorRate} exceeds threshold`);
      process.exit(1);
    }

    // Compare latency
    if (metrics.canary.latencyP99 > thresholds.latencyP99) {
      console.error(`Canary P99 latency ${metrics.canary.latencyP99}ms exceeds threshold`);
      process.exit(1);
    }

    // Compare CPU/memory
    if (metrics.canary.cpuUsage > metrics.stable.cpuUsage * 1.5) {
      console.error(`Canary CPU usage ${metrics.canary.cpuUsage}% significantly higher`);
      process.exit(1);
    }

    await sleep(30000); // Check every 30 seconds
  }
  console.log('Canary deployment healthy, proceeding with rollout');
}
Traffic progression: 10% → 50% → 100% over roughly 30 minutes, with automatic rollback on metric violations.
Pros:
- Limited blast radius (only affects canary traffic)
- Gradual validation with real users
- Automated rollback on metric degradation
Cons:
- Longer deployment time (30-60 minutes)
- Requires sophisticated monitoring
- Complex for stateful services
Spotify uses canary deployments for all backend services; roughly 15% of canaries are automatically rolled back for failing metric thresholds.
3. Rolling Deployment
Update pods/instances one at a time, ensuring minimum availability.
# Kubernetes rolling update
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1  # At most 1 pod down
      maxSurge: 2        # At most 2 extra pods during rollout
  template:
    metadata:
      labels:
        app: myapp
        version: v2
    spec:
      containers:
        - name: app
          image: myapp:${{ github.sha }}
          readinessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 10
Process:
- Start 2 new pods (v2)
- Wait for readiness probes to pass
- Terminate 1 old pod (v1)
- Repeat until all pods updated
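The maxSurge/maxUnavailable budget above can be sketched as a toy simulation. This illustrates the arithmetic only, not the actual Kubernetes controller, and it assumes every surged pod passes its readiness probe within a wave.

```javascript
// Toy simulation of a rolling update: each "wave" starts as many new pods as
// the surge budget allows, then stops as many old pods as the availability
// budget allows, until no old pods remain.
function simulateRollingUpdate({ replicas, maxSurge, maxUnavailable }) {
  let oldPods = replicas; // pods still running the old version
  let newPods = 0;        // new-version pods assumed ready
  let waves = 0;
  while (oldPods > 0) {
    waves++;
    // Scale up: total pods may not exceed replicas + maxSurge
    const canStart = Math.min(
      replicas - newPods,
      replicas + maxSurge - (oldPods + newPods)
    );
    newPods += canStart;
    // Scale down: ready pods may not drop below replicas - maxUnavailable
    const canStop = Math.min(oldPods, oldPods + newPods - (replicas - maxUnavailable));
    oldPods -= canStop;
  }
  return waves;
}

// 10 replicas with maxSurge: 2, maxUnavailable: 1 drains 3 old pods per wave
console.log(simulateRollingUpdate({ replicas: 10, maxSurge: 2, maxUnavailable: 1 })); // → 4
```

With these numbers the rollout completes in four waves, which is why a 50-pod deployment takes several minutes even when every probe passes immediately.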
Pros:
- Zero downtime
- No extra infrastructure
- Built into Kubernetes
Cons:
- Gradual rollout (5-10 minutes for 50 pods)
- Mixed versions during rollout (requires backward compatibility)
- Rollback slower (reverse the process)
GitHub uses rolling deployments for stateless services across 100+ Kubernetes clusters.
4. Feature Flags for Progressive Rollout
Decouple deployment from release using feature flags:
// Feature flag service (LaunchDarkly Node server SDK)
import * as LaunchDarkly from '@launchdarkly/node-server-sdk';

const ldClient = LaunchDarkly.init(process.env.LAUNCHDARKLY_SDK_KEY);

app.get('/api/products', async (req, res) => {
  // Flags are evaluated against a context (here, the requesting user)
  const context = { kind: 'user', key: req.user.id, email: req.user.email };

  // Check feature flag
  const useNewRecommendations = await ldClient.variation(
    'new-recommendation-engine',
    context,
    false // default value if the flag or client is unavailable
  );

  if (useNewRecommendations) {
    return res.json(await getRecommendationsV2(req.user.id));
  } else {
    return res.json(await getRecommendationsV1(req.user.id));
  }
});
Progressive rollout:
- Deploy code with flag disabled (0% users)
- Enable for internal employees (100 users)
- Enable for 1% of users
- Gradual ramp: 5% → 10% → 25% → 50% → 100%
- Remove flag after full rollout
Rollback: Disable flag instantly (no redeployment needed).
Facebook uses feature flags for all releases, enabling instant rollback and A/B testing without deployments.
Zero-Downtime Database Migrations
Backward-Compatible Migrations
// Step 1: Add new column (nullable)
await db.schema.alterTable('users', (table) => {
  table.timestamp('email_verified_at').nullable();
});

// Deploy application code that writes to both old and new columns
app.post('/verify-email', async (req, res) => {
  await db('users').where({ id: req.user.id }).update({
    email_verified: true,          // Old column
    email_verified_at: new Date()  // New column
  });
  res.json({ ok: true });
});

// Step 2: Backfill old data (run as background job)
async function backfillEmailVerifiedAt() {
  const users = await db('users')
    .where({ email_verified: true })
    .whereNull('email_verified_at');
  for (const user of users) {
    await db('users').where({ id: user.id }).update({
      email_verified_at: user.updated_at // Best guess
    });
  }
}

// Step 3: Deploy application code that only uses new column
app.get('/profile', async (req, res) => {
  const user = await db('users').where({ id: req.user.id }).first();
  res.json({
    emailVerified: user.email_verified_at !== null
  });
});

// Step 4: Remove old column (separate migration after full rollout)
await db.schema.alterTable('users', (table) => {
  table.dropColumn('email_verified');
});
Process: Add → Dual-write → Backfill → Switch reads → Remove old.
Stripe requires all database migrations to be backward-compatible for zero-downtime deployments across 100+ database shards.
Automated Rollback Strategies
1. Health Check-Based Rollback
deploy:
  steps:
    - name: Deploy new version
      run: kubectl set image deployment/app app=myapp:${{ github.sha }}

    - name: Wait for rollout
      run: kubectl rollout status deployment/app --timeout=5m

    - name: Health check loop
      run: |
        for i in {1..10}; do
          HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" https://myapp.com/health)
          if [ "$HTTP_CODE" -ne 200 ]; then
            echo "Health check failed with code $HTTP_CODE"
            kubectl rollout undo deployment/app
            exit 1
          fi
          sleep 10
        done

    - name: Monitor error rate
      run: |
        ERROR_RATE=$(curl -s "https://api.datadoghq.com/api/v1/query?query=sum:app.errors{*}.as_rate()" \
          -H "DD-API-KEY: ${{ secrets.DATADOG_API_KEY }}" | jq '.series[0].pointlist[-1][1]')
        if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
          echo "Error rate $ERROR_RATE exceeds 1% threshold"
          kubectl rollout undo deployment/app
          exit 1
        fi
2. Automatic Rollback with Argo Rollouts
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: app
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {duration: 5m}
        - setWeight: 50
        - pause: {duration: 10m}
        - setWeight: 100
      # Automatic rollback on metric violations
      analysis:
        templates:
          - templateName: error-rate
          - templateName: latency-p99
        args:
          - name: service-name
            value: app
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate
spec:
  metrics:
    - name: error-rate
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
      successCondition: result[0] < 0.01  # Error rate < 1%
      interval: 1m
      failureLimit: 3  # Rollback after 3 consecutive failures
Automatic rollback if error rate > 1% for 3 consecutive minutes.
Netflix rolls back 10% of deployments automatically based on metric violations, preventing user impact.
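The failureLimit semantics above reduce to a small counter. A sketch of the idea (an illustration, not Argo Rollouts' implementation):

```javascript
// Consecutive-failure rollback gate: a single noisy sample should not abort
// an otherwise healthy rollout, so roll back only after N violations in a row.
function makeRollbackGate(failureLimit) {
  let consecutiveFailures = 0;
  return function record(errorRate, threshold = 0.01) {
    if (errorRate > threshold) {
      consecutiveFailures++;
    } else {
      consecutiveFailures = 0; // a healthy sample resets the streak
    }
    return consecutiveFailures >= failureLimit ? 'rollback' : 'continue';
  };
}

const gate = makeRollbackGate(3);
// Two bad minutes, one good minute, then three bad minutes: only the third
// consecutive violation triggers the rollback.
console.log([0.02, 0.03, 0.001, 0.02, 0.02, 0.02].map((rate) => gate(rate)));
```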
CI/CD Security Best Practices
1. Secrets Management
# Store secrets in secure vault
- name: Retrieve secrets
  run: |
    # AWS Secrets Manager
    aws secretsmanager get-secret-value \
      --secret-id prod/app/secrets \
      --query SecretString \
      --output text > secrets.json

    # Or HashiCorp Vault
    vault kv get -format=json secret/prod/app > secrets.json

# Inject secrets as environment variables (never commit to code)
- name: Deploy with secrets
  env:
    DATABASE_URL: ${{ secrets.DATABASE_URL }}
    API_KEY: ${{ secrets.API_KEY }}
  run: |
    kubectl create secret generic app-secrets \
      --from-literal=DATABASE_URL="$DATABASE_URL" \
      --from-literal=API_KEY="$API_KEY" \
      --dry-run=client -o yaml | kubectl apply -f -
Never commit secrets to Git. Use secret managers and inject at deployment time.
2. Supply Chain Security
# Sign container images
- name: Sign image with Cosign
  run: |
    cosign sign --key cosign.key myapp:${{ github.sha }}

# Verify signatures before deployment
- name: Verify image signature
  run: |
    cosign verify --key cosign.pub myapp:${{ github.sha }}

# SBOM generation
- name: Generate Software Bill of Materials
  run: |
    syft myapp:${{ github.sha }} -o spdx-json > sbom.json

    # Scan SBOM for vulnerabilities
    grype sbom:./sbom.json --fail-on high
Google requires all images to be signed and verified before deployment to GKE clusters.
3. Least Privilege Access
# Use OIDC tokens instead of long-lived credentials
permissions:
  id-token: write  # For OIDC
  contents: read

- name: Configure AWS credentials
  uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: arn:aws:iam::123456789:role/github-actions-deploy
    aws-region: us-east-1

# Deploy with a time-limited token (expires after the job)
- name: Deploy to ECS
  run: |
    aws ecs update-service \
      --cluster production \
      --service app \
      --force-new-deployment
Monitoring and Observability in CI/CD
Deployment Metrics Dashboard
// Track deployment metrics
async function recordDeployment(version, status, deploymentDuration) {
  await prometheus.recordHistogram('deployment_duration_seconds', deploymentDuration);
  await prometheus.incrementCounter('deployments_total', { status, version });

  // Calculate deployment frequency
  const deploymentsToday = await db('deployments')
    .where('created_at', '>', new Date(Date.now() - 86400000))
    .count();
  console.log(`Deployments today: ${deploymentsToday}`);

  // Calculate Mean Time To Recovery (MTTR)
  if (status === 'rollback') {
    const lastSuccessfulDeploy = await db('deployments')
      .where({ status: 'success' })
      .orderBy('created_at', 'desc')
      .first();
    const mttr = Date.now() - lastSuccessfulDeploy.created_at;
    await prometheus.recordHistogram('mttr_seconds', mttr / 1000);
  }
}
DORA Metrics (DevOps Research and Assessment):
- Deployment Frequency: How often you deploy (Netflix: 4,000/day)
- Lead Time for Changes: Commit to production (Etsy: < 1 hour)
- Mean Time to Recovery (MTTR): Time to recover from failure (Spotify: < 10 minutes)
- Change Failure Rate: % of deployments causing incidents (Google: < 5%)
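Given a table of deployment records, all four DORA metrics reduce to simple aggregations. A sketch over in-memory records (the record shape, with `committedAt`, `deployedAt`, `status`, and `recoveredAt` fields, is an assumption for illustration):

```javascript
// Compute the four DORA metrics from a list of deployment records.
function doraMetrics(deployments, windowDays) {
  const total = deployments.length;
  const failures = deployments.filter((d) => d.status === 'failed');
  const leadTimes = deployments.map((d) => d.deployedAt - d.committedAt);
  const recoveries = failures
    .filter((d) => d.recoveredAt)
    .map((d) => d.recoveredAt - d.deployedAt);
  const avg = (xs) => (xs.length ? xs.reduce((a, b) => a + b, 0) / xs.length : 0);

  return {
    deploymentFrequency: total / windowDays,  // deploys per day
    leadTimeMs: avg(leadTimes),               // commit to production
    mttrMs: avg(recoveries),                  // failure to recovery
    changeFailureRate: total ? failures.length / total : 0,
  };
}
```

Tracking these over time shows whether pipeline investments (caching, parallelism, canary automation) are actually paying off.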
Real-World Examples
Netflix - Spinnaker for Multi-Cloud Deployments
- 4,000 deployments/day across AWS regions
- Automated canary analysis with 9 metrics (error rate, latency, CPU)
- Chaos engineering in production (Chaos Monkey kills instances during deployments)
- Rollback in 2 minutes via blue-green deployment
Etsy - Continuous Deployment Culture
- 50+ deployments/day (developer-driven)
- Feature flags for all releases (instant rollback without deployment)
- Automated rollback if error rate spikes > 0.5%
- Lead time: Code commit to production in < 30 minutes
Spotify - Trunk-Based Development
- 1,000+ deployments/day across 200+ services
- Canary deployments with 15% automatic rollback rate
- 95% automated test coverage prevents regressions
- No staging environment (test in production with feature flags)
CI/CD Pipeline Optimization
1. Parallelize Everything
# Run independent jobs in parallel
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - run: npm run lint
  test-unit:
    runs-on: ubuntu-latest
    steps:
      - run: npm run test:unit
  test-integration:
    runs-on: ubuntu-latest
    steps:
      - run: npm run test:integration
  test-e2e:
    runs-on: ubuntu-latest
    steps:
      - run: npm run test:e2e
  security-scan:
    runs-on: ubuntu-latest
    steps:
      - run: npm audit

  # Wait for all checks before deploying
  deploy:
    needs: [lint, test-unit, test-integration, test-e2e, security-scan]
    runs-on: ubuntu-latest
    steps:
      - run: ./deploy.sh
Result: 5 parallel jobs complete in 8 minutes vs 25 minutes sequential.
2. Fail Fast
# Run fast checks first, expensive checks last
jobs:
  lint:              # 30 seconds
    steps:
      - run: npm run lint
  type-check:        # 1 minute
    needs: [lint]
    steps:
      - run: npm run type-check
  test-unit:         # 3 minutes
    needs: [type-check]
    steps:
      - run: npm run test:unit
  test-integration:  # 8 minutes
    needs: [test-unit]
    steps:
      - run: npm run test:integration
  test-e2e:          # 15 minutes (slowest, run last)
    needs: [test-integration]
    steps:
      - run: npm run test:e2e
Result: Fail in 30 seconds (lint error) instead of waiting 15 minutes for E2E tests.
Conclusion - Shipping Code Safely at Scale
CI/CD pipelines enable rapid, reliable software delivery. Key takeaways:
- Automate everything: Build, test, deploy, rollback should be fully automated
- Use progressive deployment strategies: Canary and blue-green minimize risk
- Monitor deployments: Track error rates, latency, CPU—auto-rollback on violations
- Feature flags decouple deploy from release: Ship code dark, enable progressively
- Backward-compatible database migrations: Zero-downtime schema changes
- Optimize for speed: Parallel jobs, caching, fail-fast reduce cycle time from 30 min to 5 min
- Security at every stage: Secret management, image signing, vulnerability scanning
Netflix, Spotify, and Etsy prove that with robust CI/CD pipelines, teams can deploy thousands of times per day with confidence. Start with automated testing, implement canary deployments, and iterate based on deployment metrics (DORA) to achieve continuous deployment excellence.
Related Articles
Rate Limiting and API Throttling - Production Strategies for Scalable APIs
Master rate limiting and API throttling strategies for production systems. Learn token bucket, leaky bucket, sliding window algorithms, distributed rate limiting with Redis, per-user and per-endpoint throttling, graceful degradation patterns, and real-world implementations from Stripe, GitHub, and Twitter APIs.
Database Sharding and Partitioning Strategies - Production-Ready Scalability Solutions
Master database sharding and partitioning for horizontal scalability. Learn shard key selection, consistent hashing, cross-shard queries, rebalancing strategies, and real-world patterns from Discord (trillions of messages) and Instagram (billions of users) to scale beyond single-server limits.
Written by StaticBlock
StaticBlock is a technical writer and software engineer specializing in web development, performance optimization, and developer tooling.