Rate Limiting and API Throttling - Production Strategies for Scalable APIs

StaticBlock · 24 min read

APIs are the backbone of modern applications, but without proper rate limiting and throttling, they become vulnerable to abuse, overload, and degraded performance. Companies like Stripe (processing millions of payments), GitHub (serving millions of API requests), and Twitter (handling billions of API calls daily) rely on sophisticated rate limiting to ensure fair usage, prevent abuse, and maintain system stability.

This guide covers production-ready rate limiting and throttling strategies, from foundational algorithms (token bucket, leaky bucket, sliding window) to distributed implementations with Redis, per-user and per-endpoint controls, graceful degradation patterns, and monitoring approaches. We'll explore real-world implementations and learn when to apply each strategy.

Why Rate Limiting Matters

Rate limiting controls the rate at which clients can make API requests. It's essential for:

  1. Preventing abuse: Malicious actors, scrapers, and bots can overwhelm your API
  2. Ensuring fair usage: Prevent single users from monopolizing resources
  3. Protecting infrastructure: Avoid cascading failures from traffic spikes
  4. Cost management: Control cloud costs from unexpected usage
  5. SLA compliance: Guarantee performance for paying customers

GitHub rate limits unauthenticated requests to 60/hour and authenticated to 5,000/hour to prevent scraping while serving legitimate developers. Stripe implements sophisticated per-endpoint limits (100 reads/sec, 10 writes/sec) to protect payment processing infrastructure.

Rate Limiting vs Throttling

Rate Limiting: Hard limits on request counts (e.g., 1000 requests per hour). Requests exceeding limits are rejected with HTTP 429.

Throttling: Gradual slowdown of responses as limits approach. Requests aren't rejected but processing slows down (e.g., adding delays).

Most APIs use rate limiting for simplicity and predictability. Throttling is useful for graceful degradation.
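As a sketch of the throttling side, the added delay can grow as usage approaches the limit. The thresholds and delay values below are illustrative, not from any particular API:

```javascript
// Throttling sketch: instead of rejecting requests, add a delay that
// grows as usage approaches the limit. Numbers are illustrative.
function throttleDelayMs(used, limit, maxDelayMs = 2000) {
  const usage = used / limit;
  if (usage < 0.8) return 0;         // under 80% of the limit: full speed
  if (usage >= 1) return maxDelayMs; // at or over the limit: maximum delay
  // Scale linearly from 0 to maxDelayMs across the 80-100% band
  return Math.round(((usage - 0.8) / 0.2) * maxDelayMs);
}

// e.g. 500 of 1000 requests used: no delay; 900 of 1000: ~1000 ms
```

A server would `await` this delay before responding, so heavy clients slow down smoothly instead of seeing hard 429s.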

Rate Limiting Algorithms

Token Bucket Algorithm (Most Common)

Tokens are added to a bucket at a fixed rate. Each request consumes one token; if no tokens are available, the request is denied.

Parameters:

  • Bucket capacity: Maximum burst size
  • Refill rate: Tokens added per second
class TokenBucket {
  constructor(capacity, refillRate) {
    this.capacity = capacity;
    this.tokens = capacity;
    this.refillRate = refillRate; // tokens per second
    this.lastRefill = Date.now();
  }

  refill() {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000; // seconds
    const tokensToAdd = elapsed * this.refillRate;

    this.tokens = Math.min(this.capacity, this.tokens + tokensToAdd);
    this.lastRefill = now;
  }

  consume(tokens = 1) {
    this.refill();

    if (this.tokens >= tokens) {
      this.tokens -= tokens;
      return true;
    }

    return false;
  }

  getWaitTime() {
    if (this.tokens >= 1) return 0;
    return Math.ceil(((1 - this.tokens) / this.refillRate) * 1000); // milliseconds
  }
}

// Usage
const bucket = new TokenBucket(100, 10); // 100 capacity, 10 tokens/sec

function handleRequest(req, res) {
  if (bucket.consume()) {
    res.json({ success: true });
  } else {
    const retryAfter = Math.ceil(bucket.getWaitTime() / 1000);
    res.status(429).json({ error: 'Rate limit exceeded', retryAfter });
  }
}

Pros:

  • Allows bursts up to bucket capacity
  • Simple to implement
  • Memory efficient (only stores token count and timestamp)

Cons:

  • Doesn't handle distributed systems (needs external store)

Stripe uses token bucket for per-user rate limiting, allowing legitimate bursts while preventing sustained abuse.
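To see the burst behavior concretely, here is a test-only variant of the bucket above with time passed in explicitly (the `nowSec` parameter is an addition for determinism, not part of the class in this article):

```javascript
// Deterministic token bucket for demonstration: same refill/consume math
// as above, but the clock is injected so the burst behavior is testable.
class DemoTokenBucket {
  constructor(capacity, refillRate) {
    this.capacity = capacity;
    this.tokens = capacity;
    this.refillRate = refillRate; // tokens per second
    this.lastRefill = 0;
  }

  consume(nowSec) {
    const elapsed = nowSec - this.lastRefill;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillRate);
    this.lastRefill = nowSec;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

const bucket = new DemoTokenBucket(100, 10); // 100 capacity, 10 tokens/sec
let allowedAtOnce = 0;
for (let i = 0; i < 150; i++) {
  if (bucket.consume(0)) allowedAtOnce++; // all at t=0: only the burst passes
}
// allowedAtOnce === 100; one second later, refill makes tokens available again
const allowedLater = bucket.consume(1);
```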

Leaky Bucket Algorithm

Requests are added to a FIFO queue and processed at a fixed rate. When the queue is full, new requests are rejected.

class LeakyBucket {
  constructor(capacity, leakRate) {
    this.capacity = capacity;
    this.queue = [];
    this.leakRate = leakRate; // requests per second
    this.lastLeak = Date.now();
  }

  leak() {
    const now = Date.now();
    const elapsed = (now - this.lastLeak) / 1000;
    const leakCount = Math.floor(elapsed * this.leakRate);

    if (leakCount > 0) {
      this.queue.splice(0, leakCount);
      // Advance lastLeak only by the time actually drained, so frequent
      // calls don't discard sub-interval elapsed time
      this.lastLeak += (leakCount / this.leakRate) * 1000;
    }
  }

  addRequest(request) {
    this.leak();

    if (this.queue.length < this.capacity) {
      this.queue.push(request);
      return true;
    }

    return false;
  }
}

Pros:

  • Smooths traffic spikes
  • Guarantees constant output rate

Cons:

  • Higher memory usage (stores queue)
  • Delayed processing (requests wait in queue)

Leaky buckets are common in network routers and traffic shaping, but less common for APIs.

Fixed Window Counter

Count requests in fixed time windows (e.g., per minute).

class FixedWindowCounter {
  constructor(limit, windowSize) {
    this.limit = limit;
    this.windowSize = windowSize; // milliseconds
    this.counter = 0;
    this.windowStart = Date.now();
  }

  allow() {
    const now = Date.now();

    // Reset window if expired
    if (now - this.windowStart >= this.windowSize) {
      this.counter = 0;
      this.windowStart = now;
    }

    if (this.counter < this.limit) {
      this.counter++;
      return true;
    }

    return false;
  }
}

// Usage
const limiter = new FixedWindowCounter(1000, 60000); // 1000 req/min

Pros:

  • Extremely simple
  • Low memory usage

Cons:

  • Boundary problem: users can send nearly 2x the limit by bursting at window edges (e.g., 1,000 requests at 11:59:59 and another 1,000 at 12:00:01 under a 1,000/minute limit)

Twitter API v1.1 used fixed windows, leading to burst issues.
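The boundary problem is easy to reproduce. This sketch drives a copy of the counter above with explicit timestamps instead of `Date.now()`:

```javascript
// Fixed-window boundary problem: a burst at the end of one window plus a
// burst at the start of the next passes 2x the limit in about 2 seconds.
class DemoFixedWindow {
  constructor(limit, windowSize) {
    this.limit = limit;
    this.windowSize = windowSize;
    this.counter = 0;
    this.windowStart = 0;
  }
  allow(now) {
    if (now - this.windowStart >= this.windowSize) {
      this.counter = 0;
      this.windowStart = now;
    }
    if (this.counter < this.limit) {
      this.counter++;
      return true;
    }
    return false;
  }
}

const limiter = new DemoFixedWindow(1000, 60000); // 1000 req/min
let passed = 0;
for (let i = 0; i < 1000; i++) if (limiter.allow(59000)) passed++; // end of window 1
for (let i = 0; i < 1000; i++) if (limiter.allow(61000)) passed++; // start of window 2
// passed === 2000, though only ~2 seconds elapsed between the two bursts
```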

Sliding Window Log

Store timestamps of each request, count requests in sliding window.

class SlidingWindowLog {
  constructor(limit, windowSize) {
    this.limit = limit;
    this.windowSize = windowSize; // milliseconds
    this.log = []; // Array of timestamps
  }

  allow() {
    const now = Date.now();
    const cutoff = now - this.windowSize;

    // Remove old entries
    this.log = this.log.filter(timestamp => timestamp > cutoff);

    if (this.log.length < this.limit) {
      this.log.push(now);
      return true;
    }

    return false;
  }
}

Pros:

  • Accurate (no boundary problem)
  • Smooth rate limiting

Cons:

  • High memory usage (stores all timestamps)
  • Doesn't scale to high-volume APIs

Cloudflare uses sliding window for precise rate limiting on enterprise plans.

Sliding Window Counter (Hybrid - Best for Production)

Combines fixed window simplicity with sliding window accuracy.

class SlidingWindowCounter {
  constructor(limit, windowSize) {
    this.limit = limit;
    this.windowSize = windowSize;
    this.currentWindow = { start: Date.now(), count: 0 };
    this.previousWindow = { start: 0, count: 0 };
  }

  allow() {
    const now = Date.now();
    const currentWindowStart = Math.floor(now / this.windowSize) * this.windowSize;

    // New window started
    if (currentWindowStart > this.currentWindow.start) {
      this.previousWindow = this.currentWindow;
      this.currentWindow = { start: currentWindowStart, count: 0 };
    }

    // Calculate weighted count
    const elapsedInCurrentWindow = now - this.currentWindow.start;
    const weightOfPreviousWindow = 1 - elapsedInCurrentWindow / this.windowSize;
    const estimatedCount =
      this.previousWindow.count * weightOfPreviousWindow +
      this.currentWindow.count;

    if (estimatedCount < this.limit) {
      this.currentWindow.count++;
      return true;
    }

    return false;
  }
}

Pros:

  • Accurate (solves boundary problem)
  • Memory efficient (only 2 counters)
  • Smooth rate limiting

Cons:

  • Slightly more complex

Best choice for most production APIs. GitHub and Twitter API v2 use sliding window counters.
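A quick numeric check of the weighted estimate in the allow() method above, with hypothetical counts: a 60 s window, 900 requests in the previous window, 300 so far in the current one, and 15 s elapsed:

```javascript
// Sliding-window-counter estimate with concrete (illustrative) numbers.
const windowSize = 60000;  // 60 s window
const previousCount = 900; // requests in the previous window
const currentCount = 300;  // requests so far in the current window
const elapsed = 15000;     // 15 s into the current window

const weightOfPrevious = 1 - elapsed / windowSize;                 // 0.75
const estimated = previousCount * weightOfPrevious + currentCount; // 975
```

With a limit of 1,000 the request is allowed (975 < 1,000), whereas a fixed window would have reset the count to 300 at the boundary and permitted a fresh burst.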

Distributed Rate Limiting with Redis

For multi-server deployments, rate limiting state must be shared. Redis provides atomic operations ideal for distributed rate limiting.

Redis Token Bucket Implementation

const Redis = require('ioredis');
const redis = new Redis();

class RedisTokenBucket {
  constructor(key, capacity, refillRate) {
    this.key = key;
    this.capacity = capacity;
    this.refillRate = refillRate; // tokens per second
  }

  async consume(tokens = 1) {
    const script = `
      local key = KEYS[1]
      local capacity = tonumber(ARGV[1])
      local refillRate = tonumber(ARGV[2])
      local tokens = tonumber(ARGV[3])
      local now = tonumber(ARGV[4])

      local bucket = redis.call('HMGET', key, 'tokens', 'lastRefill')
      local currentTokens = tonumber(bucket[1]) or capacity
      local lastRefill = tonumber(bucket[2]) or now

      -- Refill tokens
      local elapsed = now - lastRefill
      local tokensToAdd = elapsed * refillRate
      currentTokens = math.min(capacity, currentTokens + tokensToAdd)

      -- Try to consume
      if currentTokens >= tokens then
        currentTokens = currentTokens - tokens
        redis.call('HMSET', key, 'tokens', currentTokens, 'lastRefill', now)
        redis.call('EXPIRE', key, 3600) -- Expire after 1 hour of inactivity
        return 1
      else
        return 0
      end
    `;

    const result = await redis.eval(
      script,
      1,
      this.key,
      this.capacity,
      this.refillRate,
      tokens,
      Date.now() / 1000
    );

    return result === 1;
  }
}

// Usage
async function handleRequest(req, res) {
  const userId = req.user.id;
  const limiter = new RedisTokenBucket(
    `rate_limit:${userId}`,
    100, // capacity
    10   // 10 tokens/sec
  );

  if (await limiter.consume()) {
    res.json({ success: true });
  } else {
    res.status(429).json({ error: 'Rate limit exceeded' });
  }
}

Why Lua script? Redis executes Lua scripts atomically, ensuring race-free rate limiting across servers.
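To see why atomicity matters, here is a sketch of the race with a plain read-then-write against a shared counter. The object below is an in-memory stand-in for a Redis key, not real Redis calls:

```javascript
// Race sketch: two servers do a non-atomic read-modify-write against a
// shared token count. Both read the last token before either writes.
const store = { tokens: 1 }; // stands in for a shared Redis key

function nonAtomicConsume(snapshot) {
  // The "read" happened earlier (snapshot); the write lands later.
  if (snapshot >= 1) {
    store.tokens = snapshot - 1;
    return true;
  }
  return false;
}

const seenByA = store.tokens; // server A reads: 1 token
const seenByB = store.tokens; // server B reads before A writes: also 1
const admittedA = nonAtomicConsume(seenByA); // true
const admittedB = nonAtomicConsume(seenByB); // true — lost update
// Two requests admitted on a single token. An atomic Lua script runs the
// read and write as one step, so the second consume would return 0.
```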

Redis Sliding Window Implementation

async function slidingWindowRateLimit(userId, limit, windowSec) {
  const key = `rate_limit:${userId}`;
  const now = Date.now();
  const windowStart = now - (windowSec * 1000);

  const pipeline = redis.pipeline();

  // Remove old entries
  pipeline.zremrangebyscore(key, 0, windowStart);

  // Add current request (random suffix keeps members unique within one ms)
  pipeline.zadd(key, now, `${now}-${Math.random()}`);

  // Count requests in window
  pipeline.zcard(key);

  // Set expiration
  pipeline.expire(key, windowSec * 2);

  const results = await pipeline.exec();
  const count = results[2][1];

  return count <= limit;
}

// Usage
const allowed = await slidingWindowRateLimit('user_123', 1000, 60); // 1000 req/min

Performance: Redis can handle 100K+ rate limit checks per second on a single instance.

Per-User and Per-Endpoint Rate Limiting

Real-world APIs need different limits for different contexts.

Multi-Tier Rate Limiting

class MultiTierRateLimiter {
  constructor(redis) {
    this.redis = redis;
    this.tiers = {
      free: { requestsPerHour: 100, burstSize: 10 },
      pro: { requestsPerHour: 10000, burstSize: 100 },
      enterprise: { requestsPerHour: 100000, burstSize: 1000 }
    };
  }

  async checkLimit(userId, userTier, endpoint) {
    const tier = this.tiers[userTier];

    // Global user limit
    const globalKey = `limit:user:${userId}`;
    const globalLimiter = new RedisTokenBucket(
      globalKey,
      tier.burstSize,
      tier.requestsPerHour / 3600 // per second
    );

    // Per-endpoint limit (stricter for write operations)
    const endpointKey = `limit:user:${userId}:${endpoint}`;
    const endpointLimit = this.getEndpointLimit(endpoint, tier);
    const endpointLimiter = new RedisTokenBucket(
      endpointKey,
      endpointLimit.burstSize,
      endpointLimit.requestsPerHour / 3600
    );

    // Must pass both checks
    const globalAllowed = await globalLimiter.consume();
    const endpointAllowed = await endpointLimiter.consume();

    return globalAllowed && endpointAllowed;
  }

  getEndpointLimit(endpoint, tier) {
    // Write endpoints get 10x stricter limits
    if (endpoint.startsWith('POST') || endpoint.startsWith('PUT') || endpoint.startsWith('DELETE')) {
      return {
        requestsPerHour: tier.requestsPerHour / 10,
        burstSize: tier.burstSize / 10
      };
    }
    return tier;
  }
}

Stripe uses this pattern: 100 reads/sec but only 10 writes/sec to protect critical payment infrastructure.

IP-Based Rate Limiting

async function ipRateLimit(req, res, next) {
  const ip = req.ip;
  const key = `rate_limit:ip:${ip}`;

  // Allow 1000 requests per hour per IP
  const allowed = await slidingWindowRateLimit(key, 1000, 3600);

  if (!allowed) {
    return res.status(429).json({
      error: 'Rate limit exceeded',
      retryAfter: 3600
    });
  }

  next();
}

GitHub rate limits by IP for unauthenticated requests to prevent scraping.

Rate Limit Headers and Client Communication

Communicate rate limit status to clients via headers:

async function addRateLimitHeaders(req, res, limiter) {
  const limit = 1000;
  const remaining = await limiter.getRemainingTokens();
  const resetTime = await limiter.getResetTime();

  res.set({
    'X-RateLimit-Limit': limit,
    'X-RateLimit-Remaining': remaining,
    'X-RateLimit-Reset': resetTime,
    'X-RateLimit-Used': limit - remaining
  });

  if (remaining === 0) {
    res.set('Retry-After', Math.ceil((resetTime - Date.now()) / 1000));
  }
}

Standard headers (used by GitHub, Twitter, Stripe):

  • X-RateLimit-Limit: Total requests allowed
  • X-RateLimit-Remaining: Requests remaining in window
  • X-RateLimit-Reset: Unix timestamp when limit resets
  • Retry-After: Seconds until retry (for 429 responses)
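On the client side, these headers drive retry behavior. A minimal sketch, using the header names listed above (the lowercased-keys object shape is an assumption, matching how Node's HTTP client exposes headers):

```javascript
// Compute how long a client should wait before retrying, given
// rate-limit headers. Prefers Retry-After (seconds), falling back to
// X-RateLimit-Reset (a Unix timestamp in seconds).
function waitMsFromHeaders(headers, nowMs = Date.now()) {
  if (headers['retry-after'] !== undefined) {
    return Number(headers['retry-after']) * 1000;
  }
  if (headers['x-ratelimit-remaining'] === '0' && headers['x-ratelimit-reset']) {
    return Math.max(0, Number(headers['x-ratelimit-reset']) * 1000 - nowMs);
  }
  return 0; // not rate limited; proceed immediately
}
```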

Graceful Degradation and Circuit Breakers

When rate limits are hit, degrade gracefully rather than failing hard.

Adaptive Rate Limiting

class AdaptiveRateLimiter {
  constructor() {
    this.baseLimit = 1000;
    this.currentLimit = 1000;
    this.errorRate = 0;
    this.checkInterval = setInterval(() => this.adjust(), 10000);
  }

  async adjust() {
    const metrics = await getSystemMetrics();

    // Reduce limits if system under stress
    if (metrics.cpuUsage > 80 || metrics.errorRate > 0.05) {
      this.currentLimit = Math.max(100, this.currentLimit * 0.9);
      console.log(`Reducing rate limit to ${this.currentLimit}`);
    }
    // Gradually increase limits if healthy
    else if (metrics.cpuUsage < 50 && metrics.errorRate < 0.01) {
      this.currentLimit = Math.min(this.baseLimit, this.currentLimit * 1.1);
    }
  }

  async allow(userId) {
    const limiter = new RedisTokenBucket(
      `adaptive:${userId}`,
      this.currentLimit,
      this.currentLimit / 3600
    );
    return await limiter.consume();
  }
}

Twitter uses adaptive rate limiting during incidents, automatically reducing limits to protect infrastructure.
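The multiplicative adjustment above backs off quickly under sustained stress. This sketch applies the same 0.9/1.1 factors to a plain number with illustrative metrics, so the trajectory is visible:

```javascript
// Same adjustment rule as AdaptiveRateLimiter.adjust(), extracted into a
// pure function. Metric values are illustrative.
function adjustLimit(current, baseLimit, cpuUsage, errorRate) {
  if (cpuUsage > 80 || errorRate > 0.05) {
    return Math.max(100, current * 0.9);      // back off under stress
  }
  if (cpuUsage < 50 && errorRate < 0.01) {
    return Math.min(baseLimit, current * 1.1); // recover gradually
  }
  return current; // in between: hold steady
}

let limit = 1000;
for (let i = 0; i < 5; i++) {
  limit = adjustLimit(limit, 1000, 95, 0.1); // five stressed intervals
}
// limit ≈ 590 after five 10% reductions (1000 * 0.9^5 = 590.49)
```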

Priority Queues for Critical Requests

class PriorityRateLimiter {
  async allow(userId, priority = 'normal') {
    const limits = {
      critical: 10000,  // Always allow critical requests
      high: 5000,
      normal: 1000,
      low: 100
    };
    const limit = limits[priority];
    const limiter = new RedisTokenBucket(
      `priority:${userId}:${priority}`,
      limit,
      limit / 3600
    );

    return await limiter.consume();
  }
}

// Usage
app.post('/payment', async (req, res) => {
  // Payment requests are critical
  if (!await priorityLimiter.allow(req.user.id, 'critical')) {
    return res.status(503).json({ error: 'Service temporarily unavailable' });
  }

  processPayment(req.body);
});

Stripe prioritizes payment processing requests over read operations.

Rate Limiting Middleware for Express/Node.js

Production-ready middleware:

const rateLimit = require('express-rate-limit');
const RedisStore = require('rate-limit-redis');
const Redis = require('ioredis');

const redis = new Redis({ host: 'localhost', port: 6379 });

const limiter = rateLimit({
  store: new RedisStore({ client: redis, prefix: 'rl:' }),
  windowMs: 60 * 1000, // 1 minute
  max: async (req) => {
    // Dynamic limits based on user tier
    const user = req.user;
    if (!user) return 100; // Anonymous

    switch (user.tier) {
      case 'enterprise': return 10000;
      case 'pro': return 1000;
      default: return 100;
    }
  },
  keyGenerator: (req) => {
    // Rate limit by user ID if authenticated, otherwise by IP
    return req.user?.id || req.ip;
  },
  handler: (req, res) => {
    res.status(429).json({
      error: 'Too many requests',
      message: 'You have exceeded the rate limit. Please try again later.',
      retryAfter: req.rateLimit.resetTime
    });
  },
  skip: (req) => {
    // Skip rate limiting for internal services
    return req.headers['x-internal-service'] === 'true';
  },
  onLimitReached: (req, res, options) => {
    // Log rate limit violations
    logger.warn(`Rate limit exceeded for ${req.user?.id || req.ip}`);
  }
});

// Apply to all routes
app.use('/api/', limiter);

// Stricter limits for specific endpoints
const strictLimiter = rateLimit({
  store: new RedisStore({ client: redis }),
  windowMs: 60 * 1000,
  max: 10 // Only 10 requests per minute
});

app.post('/api/charge', strictLimiter, handlePayment);

Monitoring and Alerting

Track rate limit metrics to detect abuse and optimize limits:

class RateLimitMonitor {
  async recordMetrics(userId, endpoint, allowed) {
    const timestamp = Date.now();

    // Store in time-series database (e.g., InfluxDB, Prometheus)
    await influxDB.writePoints([
      {
        measurement: 'rate_limit',
        tags: {
          user_id: userId,
          endpoint: endpoint,
          allowed: allowed
        },
        fields: {
          count: 1
        },
        timestamp: timestamp
      }
    ]);

    // Alert if rejection rate > 10%
    const rejectionRate = await this.getRejectionRate(userId);
    if (rejectionRate > 0.1) {
      await this.sendAlert(userId, rejectionRate);
    }
  }

  async getRejectionRate(userId) {
    // Redis returns strings, so coerce to numbers before dividing
    const allowed = Number(await redis.get(`metrics:${userId}:allowed`)) || 0;
    const rejected = Number(await redis.get(`metrics:${userId}:rejected`)) || 0;

    if (allowed + rejected === 0) return 0;
    return rejected / (allowed + rejected);
  }
}

Key metrics:

  • Requests per second (by user, endpoint, tier)
  • Rate limit rejections (429 responses)
  • Token bucket fill levels
  • Latency of rate limit checks

Datadog, Prometheus, or Grafana dashboards visualize these metrics.

Real-World Implementations

Stripe - Multi-Tier, Per-Endpoint Limits

  • Global limits: 100 requests/sec (reads), 10 requests/sec (writes)
  • Per-resource limits: 25 card creations/sec, 100 customer lookups/sec
  • Tiered limits: Higher limits for enterprise customers
  • Implementation: Redis token bucket with Lua scripts

Result: Processes millions of payments daily with 99.99% uptime.

GitHub - Sliding Window with Multiple Tiers

  • Unauthenticated: 60 requests/hour (IP-based)
  • Authenticated: 5,000 requests/hour (user-based)
  • GitHub Actions: 1,000 requests/hour (separate limit pool)
  • Implementation: Sliding window counter in Redis

Special handling: GraphQL API uses point system (complex queries cost more points).

Twitter API v2 - App-Level and User-Level Limits

  • App-level limits: 300 requests/15 min (shared across all users of an app)
  • User-level limits: 900 requests/15 min (per authenticated user)
  • Endpoint-specific: Tweet creation limited to 300/3 hours
  • Implementation: Distributed sliding window

Innovation: Separate app and user limits prevent single misbehaving app from consuming all quota.

Best Practices

  1. Start permissive, tighten gradually: Begin with high limits, reduce based on actual usage patterns
  2. Communicate limits clearly: Document limits in API docs, return limits in headers
  3. Different limits for different endpoints: Read-heavy endpoints can have higher limits than writes
  4. Whitelist internal services: Skip rate limiting for authenticated internal microservices
  5. Implement retry logic on client: Exponential backoff with jitter for 429 responses
  6. Monitor and alert: Track rejection rates, alert on anomalies
  7. Use distributed rate limiting: Don't rely on in-memory counters in multi-server deployments
  8. Test rate limits: Load test to ensure limits protect your infrastructure
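Practice #5 above (client-side retry) is commonly implemented as exponential backoff with full jitter. A sketch, with assumed base and cap values:

```javascript
// Exponential backoff with full jitter for retrying 429 responses.
// baseMs and capMs are illustrative defaults, not from any spec.
function backoffWithJitter(attempt, baseMs = 500, capMs = 30000, rng = Math.random) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt); // exponential growth, capped
  return Math.floor(rng() * ceiling); // "full jitter": uniform in [0, ceiling)
}

// Attempt 0 waits up to 500 ms, attempt 3 up to 4 s, attempt 10 is capped at 30 s.
```

The jitter spreads retries out so that many clients rate-limited at the same moment don't all retry in lockstep and re-trigger the limit.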

Choosing the Right Algorithm

Algorithm              | Use Case               | Pros                     | Cons
-----------------------|------------------------|--------------------------|------------------------
Token Bucket           | General-purpose APIs   | Allows bursts, efficient | Needs distributed store
Leaky Bucket           | Traffic shaping        | Smooth output            | High memory, delays
Fixed Window           | Simple rate limiting   | Very simple              | Boundary problem
Sliding Window Log     | Strict accuracy needed | Perfect accuracy         | High memory
Sliding Window Counter | Production APIs        | Accurate + efficient     | Slightly complex

Recommendation: Use sliding window counter for most production APIs, implemented in Redis for distributed systems.

Conclusion - Building Resilient APIs

Rate limiting and throttling are essential for production APIs. Key takeaways:

  1. Choose the right algorithm: Sliding window counter balances accuracy and efficiency
  2. Distribute state with Redis: Share rate limit counters across servers with atomic Lua scripts
  3. Multi-tier limits: Different limits for different user tiers and endpoints
  4. Communicate clearly: Use standard headers to inform clients of limits
  5. Degrade gracefully: Adaptive limits and priority queues during high load
  6. Monitor continuously: Track rejection rates and adjust limits based on actual usage

Stripe, GitHub, and Twitter demonstrate that sophisticated rate limiting enables serving billions of requests while maintaining stability and fair usage. Start with conservative limits, monitor usage, and iterate based on real-world patterns to build APIs that scale reliably.

Written by StaticBlock

StaticBlock is a technical writer and software engineer specializing in web development, performance optimization, and developer tooling.