Rate Limiting and API Throttling - Production Strategies for Scalable APIs

StaticBlock · 24 min read

APIs are the backbone of modern applications, but without proper rate limiting and throttling, they become vulnerable to abuse, overload, and degraded performance. Companies like Stripe (processing millions of payments), GitHub (serving millions of API requests), and Twitter (handling billions of API calls daily) rely on sophisticated rate limiting to ensure fair usage, prevent abuse, and maintain system stability.

This guide covers production-ready rate limiting and throttling strategies, from foundational algorithms (token bucket, leaky bucket, sliding window) to distributed implementations with Redis, per-user and per-endpoint controls, graceful degradation patterns, and monitoring approaches. We'll explore real-world implementations and learn when to apply each strategy.

Why Rate Limiting Matters

Rate limiting controls the rate at which clients can make API requests. It's essential for:

  1. Preventing abuse: Malicious actors, scrapers, and bots can overwhelm your API
  2. Ensuring fair usage: Prevent single users from monopolizing resources
  3. Protecting infrastructure: Avoid cascading failures from traffic spikes
  4. Cost management: Control cloud costs from unexpected usage
  5. SLA compliance: Guarantee performance for paying customers

GitHub rate limits unauthenticated requests to 60/hour and authenticated to 5,000/hour to prevent scraping while serving legitimate developers. Stripe implements sophisticated per-endpoint limits (100 reads/sec, 10 writes/sec) to protect payment processing infrastructure.

Rate Limiting vs Throttling

Rate Limiting: Hard limits on request counts (e.g., 1000 requests per hour). Requests exceeding limits are rejected with HTTP 429.

Throttling: Gradual slowdown of responses as limits approach. Requests aren't rejected but processing slows down (e.g., adding delays).

Most APIs use rate limiting for simplicity and predictability. Throttling is useful for graceful degradation.
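As a sketch of the throttling side, the added delay can grow as usage approaches the limit. The thresholds and delay values below are illustrative, not from any particular API:

```javascript
// Throttling sketch: instead of rejecting requests, add a delay that
// grows as usage approaches the limit. Numbers are illustrative.
function throttleDelayMs(used, limit, maxDelayMs = 2000) {
  const usage = used / limit;
  if (usage < 0.8) return 0;         // under 80% of the limit: full speed
  if (usage >= 1) return maxDelayMs; // at or over the limit: maximum delay
  // Scale linearly from 0 to maxDelayMs across the 80-100% band
  return Math.round(((usage - 0.8) / 0.2) * maxDelayMs);
}

// e.g. 500 of 1000 requests used: no delay; 900 of 1000: ~1000 ms
```

A server would `await` this delay before responding, so heavy clients slow down smoothly instead of seeing hard 429s.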

Rate Limiting Algorithms

Token Bucket Algorithm (Most Common)

Tokens are added to a bucket at a fixed rate. Each request consumes one token; if no tokens are available, the request is denied.

Parameters:

  • Bucket capacity: Maximum burst size
  • Refill rate: Tokens added per second
class TokenBucket {
  constructor(capacity, refillRate) {
    this.capacity = capacity;
    this.tokens = capacity;
    this.refillRate = refillRate; // tokens per second
    this.lastRefill = Date.now();
  }

  refill() {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000; // seconds
    const tokensToAdd = elapsed * this.refillRate;

    this.tokens = Math.min(this.capacity, this.tokens + tokensToAdd);
    this.lastRefill = now;
  }

  consume(tokens = 1) {
    this.refill();

    if (this.tokens >= tokens) {
      this.tokens -= tokens;
      return true;
    }

    return false;
  }

  getWaitTime() {
    if (this.tokens >= 1) return 0;
    return Math.ceil(((1 - this.tokens) / this.refillRate) * 1000); // milliseconds
  }
}

// Usage
const bucket = new TokenBucket(100, 10); // 100 capacity, 10 tokens/sec

function handleRequest(req, res) {
  if (bucket.consume()) {
    res.json({ success: true });
  } else {
    const retryAfter = Math.ceil(bucket.getWaitTime() / 1000);
    res.status(429).json({ error: 'Rate limit exceeded', retryAfter });
  }
}

Pros:

  • Allows bursts up to bucket capacity
  • Simple to implement
  • Memory efficient (only stores token count and timestamp)

Cons:

  • Doesn't handle distributed systems (needs external store)

Stripe uses token bucket for per-user rate limiting, allowing legitimate bursts while preventing sustained abuse.
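To see the burst behavior concretely, here is a test-only variant of the bucket above with time passed in explicitly (the `nowSec` parameter is an addition for determinism, not part of the class in this article):

```javascript
// Deterministic token bucket for demonstration: same refill/consume math
// as above, but the clock is injected so the burst behavior is testable.
class DemoTokenBucket {
  constructor(capacity, refillRate) {
    this.capacity = capacity;
    this.tokens = capacity;
    this.refillRate = refillRate; // tokens per second
    this.lastRefill = 0;
  }

  consume(nowSec) {
    const elapsed = nowSec - this.lastRefill;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillRate);
    this.lastRefill = nowSec;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

const bucket = new DemoTokenBucket(100, 10); // 100 capacity, 10 tokens/sec
let allowedAtOnce = 0;
for (let i = 0; i < 150; i++) {
  if (bucket.consume(0)) allowedAtOnce++; // all at t=0: only the burst passes
}
// allowedAtOnce === 100; one second later, refill makes tokens available again
const allowedLater = bucket.consume(1);
```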

Leaky Bucket Algorithm

Requests are added to a FIFO queue and processed at a fixed rate. When the queue is full, new requests are rejected.

class LeakyBucket {
  constructor(capacity, leakRate) {
    this.capacity = capacity;
    this.queue = [];
    this.leakRate = leakRate; // requests per second
    this.lastLeak = Date.now();
  }

  leak() {
    const now = Date.now();
    const elapsed = (now - this.lastLeak) / 1000;
    const leakCount = Math.floor(elapsed * this.leakRate);

    if (leakCount > 0) {
      this.queue.splice(0, leakCount);
      // Advance lastLeak only by the time actually drained, so frequent
      // calls don't discard sub-interval elapsed time
      this.lastLeak += (leakCount / this.leakRate) * 1000;
    }
  }

  addRequest(request) {
    this.leak();

    if (this.queue.length < this.capacity) {
      this.queue.push(request);
      return true;
    }

    return false;
  }
}

Pros:

  • Smooths traffic spikes
  • Guarantees constant output rate

Cons:

  • Higher memory usage (stores queue)
  • Delayed processing (requests wait in queue)

Leaky buckets are common in network routers and traffic shaping, but less common for APIs.

Fixed Window Counter

Count requests in fixed time windows (e.g., per minute).

class FixedWindowCounter {
  constructor(limit, windowSize) {
    this.limit = limit;
    this.windowSize = windowSize; // milliseconds
    this.counter = 0;
    this.windowStart = Date.now();
  }

  allow() {
    const now = Date.now();

    // Reset window if expired
    if (now - this.windowStart >= this.windowSize) {
      this.counter = 0;
      this.windowStart = now;
    }

    if (this.counter < this.limit) {
      this.counter++;
      return true;
    }

    return false;
  }
}

// Usage
const limiter = new FixedWindowCounter(1000, 60000); // 1000 req/min

Pros:

  • Extremely simple
  • Low memory usage

Cons:

  • Boundary problem: users can send nearly 2x the limit by bursting at window edges (e.g., 1,000 requests at 11:59:59 and another 1,000 at 12:00:01 under a 1,000/minute limit)

Twitter API v1.1 used fixed windows, leading to burst issues.
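The boundary problem is easy to reproduce. This sketch drives a copy of the counter above with explicit timestamps instead of `Date.now()`:

```javascript
// Fixed-window boundary problem: a burst at the end of one window plus a
// burst at the start of the next passes 2x the limit in about 2 seconds.
class DemoFixedWindow {
  constructor(limit, windowSize) {
    this.limit = limit;
    this.windowSize = windowSize;
    this.counter = 0;
    this.windowStart = 0;
  }
  allow(now) {
    if (now - this.windowStart >= this.windowSize) {
      this.counter = 0;
      this.windowStart = now;
    }
    if (this.counter < this.limit) {
      this.counter++;
      return true;
    }
    return false;
  }
}

const limiter = new DemoFixedWindow(1000, 60000); // 1000 req/min
let passed = 0;
for (let i = 0; i < 1000; i++) if (limiter.allow(59000)) passed++; // end of window 1
for (let i = 0; i < 1000; i++) if (limiter.allow(61000)) passed++; // start of window 2
// passed === 2000, though only ~2 seconds elapsed between the two bursts
```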

Sliding Window Log

Store timestamps of each request, count requests in sliding window.

class SlidingWindowLog {
  constructor(limit, windowSize) {
    this.limit = limit;
    this.windowSize = windowSize; // milliseconds
    this.log = []; // Array of timestamps
  }

  allow() {
    const now = Date.now();
    const cutoff = now - this.windowSize;

    // Remove old entries
    this.log = this.log.filter(timestamp => timestamp > cutoff);

    if (this.log.length < this.limit) {
      this.log.push(now);
      return true;
    }

    return false;
  }
}

Pros:

  • Accurate (no boundary problem)
  • Smooth rate limiting

Cons:

  • High memory usage (stores all timestamps)
  • Doesn't scale to high-volume APIs

Cloudflare uses sliding window for precise rate limiting on enterprise plans.

Sliding Window Counter (Hybrid - Best for Production)

Combines fixed window simplicity with sliding window accuracy.

class SlidingWindowCounter {
  constructor(limit, windowSize) {
    this.limit = limit;
    this.windowSize = windowSize;
    this.currentWindow = { start: Date.now(), count: 0 };
    this.previousWindow = { start: 0, count: 0 };
  }

  allow() {
    const now = Date.now();
    const currentWindowStart = Math.floor(now / this.windowSize) * this.windowSize;

    // New window started
    if (currentWindowStart > this.currentWindow.start) {
      this.previousWindow = this.currentWindow;
      this.currentWindow = { start: currentWindowStart, count: 0 };
    }

    // Calculate weighted count
    const elapsedInCurrentWindow = now - this.currentWindow.start;
    const weightOfPreviousWindow = 1 - elapsedInCurrentWindow / this.windowSize;
    const estimatedCount =
      this.previousWindow.count * weightOfPreviousWindow +
      this.currentWindow.count;

    if (estimatedCount < this.limit) {
      this.currentWindow.count++;
      return true;
    }

    return false;
  }
}

Pros:

  • Accurate (solves boundary problem)
  • Memory efficient (only 2 counters)
  • Smooth rate limiting

Cons:

  • Slightly more complex

Best choice for most production APIs. GitHub and Twitter API v2 use sliding window counters.
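A quick numeric check of the weighted estimate in the allow() method above, with hypothetical counts: a 60 s window, 900 requests in the previous window, 300 so far in the current one, and 15 s elapsed:

```javascript
// Sliding-window-counter estimate with concrete (illustrative) numbers.
const windowSize = 60000;  // 60 s window
const previousCount = 900; // requests in the previous window
const currentCount = 300;  // requests so far in the current window
const elapsed = 15000;     // 15 s into the current window

const weightOfPrevious = 1 - elapsed / windowSize;                 // 0.75
const estimated = previousCount * weightOfPrevious + currentCount; // 975
```

With a limit of 1,000 the request is allowed (975 < 1,000), whereas a fixed window would have reset the count to 300 at the boundary and permitted a fresh burst.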

Distributed Rate Limiting with Redis

For multi-server deployments, rate limiting state must be shared. Redis provides atomic operations ideal for distributed rate limiting.

Redis Token Bucket Implementation

const Redis = require('ioredis');
const redis = new Redis();

class RedisTokenBucket {
  constructor(key, capacity, refillRate) {
    this.key = key;
    this.capacity = capacity;
    this.refillRate = refillRate; // tokens per second
  }

  async consume(tokens = 1) {
    const script = `
      local key = KEYS[1]
      local capacity = tonumber(ARGV[1])
      local refillRate = tonumber(ARGV[2])
      local tokens = tonumber(ARGV[3])
      local now = tonumber(ARGV[4])

      local bucket = redis.call('HMGET', key, 'tokens', 'lastRefill')
      local currentTokens = tonumber(bucket[1]) or capacity
      local lastRefill = tonumber(bucket[2]) or now

      -- Refill tokens
      local elapsed = now - lastRefill
      local tokensToAdd = elapsed * refillRate
      currentTokens = math.min(capacity, currentTokens + tokensToAdd)

      -- Try to consume
      if currentTokens >= tokens then
        currentTokens = currentTokens - tokens
        redis.call('HMSET', key, 'tokens', currentTokens, 'lastRefill', now)
        redis.call('EXPIRE', key, 3600) -- Expire after 1 hour of inactivity
        return 1
      else
        return 0
      end
    `;

    const result = await redis.eval(
      script,
      1,
      this.key,
      this.capacity,
      this.refillRate,
      tokens,
      Date.now() / 1000
    );

    return result === 1;
  }
}

// Usage
async function handleRequest(req, res) {
  const userId = req.user.id;
  const limiter = new RedisTokenBucket(
    `rate_limit:${userId}`,
    100, // capacity
    10   // 10 tokens/sec
  );

  if (await limiter.consume()) {
    res.json({ success: true });
  } else {
    res.status(429).json({ error: 'Rate limit exceeded' });
  }
}

Why Lua script? Redis executes Lua scripts atomically, ensuring race-free rate limiting across servers.
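To see why atomicity matters, here is a sketch of the race with a plain read-then-write against a shared counter. The object below is an in-memory stand-in for a Redis key, not real Redis calls:

```javascript
// Race sketch: two servers do a non-atomic read-modify-write against a
// shared token count. Both read the last token before either writes.
const store = { tokens: 1 }; // stands in for a shared Redis key

function nonAtomicConsume(snapshot) {
  // The "read" happened earlier (snapshot); the write lands later.
  if (snapshot >= 1) {
    store.tokens = snapshot - 1;
    return true;
  }
  return false;
}

const seenByA = store.tokens; // server A reads: 1 token
const seenByB = store.tokens; // server B reads before A writes: also 1
const admittedA = nonAtomicConsume(seenByA); // true
const admittedB = nonAtomicConsume(seenByB); // true — lost update
// Two requests admitted on a single token. An atomic Lua script runs the
// read and write as one step, so the second consume would return 0.
```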

Redis Sliding Window Implementation

async function slidingWindowRateLimit(userId, limit, windowSec) {
  const key = `rate_limit:${userId}`;
  const now = Date.now();
  const windowStart = now - (windowSec * 1000);

  const pipeline = redis.pipeline();

  // Remove old entries
  pipeline.zremrangebyscore(key, 0, windowStart);

  // Add current request (random suffix keeps members unique within one ms)
  pipeline.zadd(key, now, `${now}-${Math.random()}`);

  // Count requests in window
  pipeline.zcard(key);

  // Set expiration
  pipeline.expire(key, windowSec * 2);

  const results = await pipeline.exec();
  const count = results[2][1];

  return count <= limit;
}

// Usage
const allowed = await slidingWindowRateLimit('user_123', 1000, 60); // 1000 req/min

Performance: Redis can handle 100K+ rate limit checks per second on a single instance.

Per-User and Per-Endpoint Rate Limiting

Real-world APIs need different limits for different contexts.

Multi-Tier Rate Limiting

class MultiTierRateLimiter {
  constructor(redis) {
    this.redis = redis;
    this.tiers = {
      free: { requestsPerHour: 100, burstSize: 10 },
      pro: { requestsPerHour: 10000, burstSize: 100 },
      enterprise: { requestsPerHour: 100000, burstSize: 1000 }
    };
  }

  async checkLimit(userId, userTier, endpoint) {
    const tier = this.tiers[userTier];

    // Global user limit
    const globalKey = `limit:user:${userId}`;
    const globalLimiter = new RedisTokenBucket(
      globalKey,
      tier.burstSize,
      tier.requestsPerHour / 3600 // per second
    );

    // Per-endpoint limit (stricter for write operations)
    const endpointKey = `limit:user:${userId}:${endpoint}`;
    const endpointLimit = this.getEndpointLimit(endpoint, tier);
    const endpointLimiter = new RedisTokenBucket(
      endpointKey,
      endpointLimit.burstSize,
      endpointLimit.requestsPerHour / 3600
    );

    // Must pass both checks
    const globalAllowed = await globalLimiter.consume();
    const endpointAllowed = await endpointLimiter.consume();

    return globalAllowed && endpointAllowed;
  }

  getEndpointLimit(endpoint, tier) {
    // Write endpoints get 10x stricter limits
    if (endpoint.startsWith('POST') || endpoint.startsWith('PUT') || endpoint.startsWith('DELETE')) {
      return {
        requestsPerHour: tier.requestsPerHour / 10,
        burstSize: tier.burstSize / 10
      };
    }
    return tier;
  }
}

Stripe uses this pattern: 100 reads/sec but only 10 writes/sec to protect critical payment infrastructure.

IP-Based Rate Limiting

async function ipRateLimit(req, res, next) {
  const ip = req.ip;
  const key = `rate_limit:ip:${ip}`;

  // Allow 1000 requests per hour per IP
  const allowed = await slidingWindowRateLimit(key, 1000, 3600);

  if (!allowed) {
    return res.status(429).json({
      error: 'Rate limit exceeded',
      retryAfter: 3600
    });
  }

  next();
}

GitHub rate limits by IP for unauthenticated requests to prevent scraping.

Rate Limit Headers and Client Communication

Communicate rate limit status to clients via headers:

async function addRateLimitHeaders(req, res, limiter) {
  const limit = 1000;
  const remaining = await limiter.getRemainingTokens();
  const resetTime = await limiter.getResetTime();

  res.set({
    'X-RateLimit-Limit': limit,
    'X-RateLimit-Remaining': remaining,
    'X-RateLimit-Reset': resetTime,
    'X-RateLimit-Used': limit - remaining
  });

  if (remaining === 0) {
    res.set('Retry-After', Math.ceil((resetTime - Date.now()) / 1000));
  }
}

Standard headers (used by GitHub, Twitter, Stripe):

  • X-RateLimit-Limit: Total requests allowed
  • X-RateLimit-Remaining: Requests remaining in window
  • X-RateLimit-Reset: Unix timestamp when limit resets
  • Retry-After: Seconds until retry (for 429 responses)
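On the client side, these headers drive retry behavior. A minimal sketch, using the header names listed above (the lowercased-keys object shape is an assumption, matching how Node's HTTP client exposes headers):

```javascript
// Compute how long a client should wait before retrying, given
// rate-limit headers. Prefers Retry-After (seconds), falling back to
// X-RateLimit-Reset (a Unix timestamp in seconds).
function waitMsFromHeaders(headers, nowMs = Date.now()) {
  if (headers['retry-after'] !== undefined) {
    return Number(headers['retry-after']) * 1000;
  }
  if (headers['x-ratelimit-remaining'] === '0' && headers['x-ratelimit-reset']) {
    return Math.max(0, Number(headers['x-ratelimit-reset']) * 1000 - nowMs);
  }
  return 0; // not rate limited; proceed immediately
}
```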

Graceful Degradation and Circuit Breakers

When rate limits are hit, degrade gracefully rather than failing hard.

Adaptive Rate Limiting

class AdaptiveRateLimiter {
  constructor() {
    this.baseLimit = 1000;
    this.currentLimit = 1000;
    this.errorRate = 0;
    this.checkInterval = setInterval(() => this.adjust(), 10000);
  }

  async adjust() {
    const metrics = await getSystemMetrics();

    // Reduce limits if system under stress
    if (metrics.cpuUsage > 80 || metrics.errorRate > 0.05) {
      this.currentLimit = Math.max(100, this.currentLimit * 0.9);
      console.log(`Reducing rate limit to ${this.currentLimit}`);
    }
    // Gradually increase limits if healthy
    else if (metrics.cpuUsage < 50 && metrics.errorRate < 0.01) {
      this.currentLimit = Math.min(this.baseLimit, this.currentLimit * 1.1);
    }
  }

  async allow(userId) {
    const limiter = new RedisTokenBucket(
      `adaptive:${userId}`,
      this.currentLimit,
      this.currentLimit / 3600
    );
    return await limiter.consume();
  }
}

Twitter uses adaptive rate limiting during incidents, automatically reducing limits to protect infrastructure.
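The multiplicative adjustment above backs off quickly under sustained stress. This sketch applies the same 0.9/1.1 factors to a plain number with illustrative metrics, so the trajectory is visible:

```javascript
// Same adjustment rule as AdaptiveRateLimiter.adjust(), extracted into a
// pure function. Metric values are illustrative.
function adjustLimit(current, baseLimit, cpuUsage, errorRate) {
  if (cpuUsage > 80 || errorRate > 0.05) {
    return Math.max(100, current * 0.9);      // back off under stress
  }
  if (cpuUsage < 50 && errorRate < 0.01) {
    return Math.min(baseLimit, current * 1.1); // recover gradually
  }
  return current; // in between: hold steady
}

let limit = 1000;
for (let i = 0; i < 5; i++) {
  limit = adjustLimit(limit, 1000, 95, 0.1); // five stressed intervals
}
// limit ≈ 590 after five 10% reductions (1000 * 0.9^5 = 590.49)
```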

Priority Queues for Critical Requests

class PriorityRateLimiter {
  async allow(userId, priority = 'normal') {
    const limits = {
      critical: 10000,  // Always allow critical requests
      high: 5000,
      normal: 1000,
      low: 100
    };
    const limit = limits[priority];
    const limiter = new RedisTokenBucket(
      `priority:${userId}:${priority}`,
      limit,
      limit / 3600
    );

    return await limiter.consume();
  }
}

// Usage
app.post('/payment', async (req, res) => {
  // Payment requests are critical
  if (!await priorityLimiter.allow(req.user.id, 'critical')) {
    return res.status(503).json({ error: 'Service temporarily unavailable' });
  }

  processPayment(req.body);
});

Stripe prioritizes payment processing requests over read operations.

Rate Limiting Middleware for Express/Node.js

Production-ready middleware:

const rateLimit = require('express-rate-limit');
const RedisStore = require('rate-limit-redis');
const Redis = require('ioredis');

const redis = new Redis({ host: 'localhost', port: 6379 });

const limiter = rateLimit({
  store: new RedisStore({ client: redis, prefix: 'rl:' }),
  windowMs: 60 * 1000, // 1 minute
  max: async (req) => {
    // Dynamic limits based on user tier
    const user = req.user;
    if (!user) return 100; // Anonymous

    switch (user.tier) {
      case 'enterprise': return 10000;
      case 'pro': return 1000;
      default: return 100;
    }
  },
  keyGenerator: (req) => {
    // Rate limit by user ID if authenticated, otherwise by IP
    return req.user?.id || req.ip;
  },
  handler: (req, res) => {
    res.status(429).json({
      error: 'Too many requests',
      message: 'You have exceeded the rate limit. Please try again later.',
      retryAfter: req.rateLimit.resetTime
    });
  },
  skip: (req) => {
    // Skip rate limiting for internal services
    return req.headers['x-internal-service'] === 'true';
  },
  onLimitReached: (req, res, options) => {
    // Log rate limit violations
    logger.warn(`Rate limit exceeded for ${req.user?.id || req.ip}`);
  }
});

// Apply to all routes
app.use('/api/', limiter);

// Stricter limits for specific endpoints
const strictLimiter = rateLimit({
  store: new RedisStore({ client: redis }),
  windowMs: 60 * 1000,
  max: 10 // Only 10 requests per minute
});

app.post('/api/charge', strictLimiter, handlePayment);

Monitoring and Alerting

Track rate limit metrics to detect abuse and optimize limits:

class RateLimitMonitor {
  async recordMetrics(userId, endpoint, allowed) {
    const timestamp = Date.now();

    // Store in time-series database (e.g., InfluxDB, Prometheus)
    await influxDB.writePoints([
      {
        measurement: 'rate_limit',
        tags: {
          user_id: userId,
          endpoint: endpoint,
          allowed: allowed
        },
        fields: {
          count: 1
        },
        timestamp: timestamp
      }
    ]);

    // Alert if rejection rate > 10%
    const rejectionRate = await this.getRejectionRate(userId);
    if (rejectionRate > 0.1) {
      await this.sendAlert(userId, rejectionRate);
    }
  }

  async getRejectionRate(userId) {
    // Redis returns strings, so coerce to numbers before dividing
    const allowed = Number(await redis.get(`metrics:${userId}:allowed`)) || 0;
    const rejected = Number(await redis.get(`metrics:${userId}:rejected`)) || 0;

    if (allowed + rejected === 0) return 0;
    return rejected / (allowed + rejected);
  }
}

Key metrics:

  • Requests per second (by user, endpoint, tier)
  • Rate limit rejections (429 responses)
  • Token bucket fill levels
  • Latency of rate limit checks

Datadog, Prometheus, or Grafana dashboards visualize these metrics.

Real-World Implementations

Stripe - Multi-Tier, Per-Endpoint Limits

  • Global limits: 100 requests/sec (reads), 10 requests/sec (writes)
  • Per-resource limits: 25 card creations/sec, 100 customer lookups/sec
  • Tiered limits: Higher limits for enterprise customers
  • Implementation: Redis token bucket with Lua scripts

Result: Processes millions of payments daily with 99.99% uptime.

GitHub - Sliding Window with Multiple Tiers

  • Unauthenticated: 60 requests/hour (IP-based)
  • Authenticated: 5,000 requests/hour (user-based)
  • GitHub Actions: 1,000 requests/hour (separate limit pool)
  • Implementation: Sliding window counter in Redis

Special handling: GraphQL API uses point system (complex queries cost more points).

Twitter API v2 - App-Level and User-Level Limits

  • App-level limits: 300 requests/15 min (shared across all users of an app)
  • User-level limits: 900 requests/15 min (per authenticated user)
  • Endpoint-specific: Tweet creation limited to 300/3 hours
  • Implementation: Distributed sliding window

Innovation: Separate app and user limits prevent single misbehaving app from consuming all quota.

Best Practices

  1. Start permissive, tighten gradually: Begin with high limits, reduce based on actual usage patterns
  2. Communicate limits clearly: Document limits in API docs, return limits in headers
  3. Different limits for different endpoints: Read-heavy endpoints can have higher limits than writes
  4. Whitelist internal services: Skip rate limiting for authenticated internal microservices
  5. Implement retry logic on client: Exponential backoff with jitter for 429 responses
  6. Monitor and alert: Track rejection rates, alert on anomalies
  7. Use distributed rate limiting: Don't rely on in-memory counters in multi-server deployments
  8. Test rate limits: Load test to ensure limits protect your infrastructure
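Practice #5 above (client-side retry) is commonly implemented as exponential backoff with full jitter. A sketch, with assumed base and cap values:

```javascript
// Exponential backoff with full jitter for retrying 429 responses.
// baseMs and capMs are illustrative defaults, not from any spec.
function backoffWithJitter(attempt, baseMs = 500, capMs = 30000, rng = Math.random) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt); // exponential growth, capped
  return Math.floor(rng() * ceiling); // "full jitter": uniform in [0, ceiling)
}

// Attempt 0 waits up to 500 ms, attempt 3 up to 4 s, attempt 10 is capped at 30 s.
```

The jitter spreads retries out so that many clients rate-limited at the same moment don't all retry in lockstep and re-trigger the limit.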

Choosing the Right Algorithm

Algorithm              | Use Case               | Pros                     | Cons
-----------------------|------------------------|--------------------------|------------------------
Token Bucket           | General-purpose APIs   | Allows bursts, efficient | Needs distributed store
Leaky Bucket           | Traffic shaping        | Smooth output            | High memory, delays
Fixed Window           | Simple rate limiting   | Very simple              | Boundary problem
Sliding Window Log     | Strict accuracy needed | Perfect accuracy         | High memory
Sliding Window Counter | Production APIs        | Accurate + efficient     | Slightly complex

Recommendation: Use sliding window counter for most production APIs, implemented in Redis for distributed systems.

Conclusion - Building Resilient APIs

Rate limiting and throttling are essential for production APIs. Key takeaways:

  1. Choose the right algorithm: Sliding window counter balances accuracy and efficiency
  2. Distribute state with Redis: Share rate limit counters across servers with atomic Lua scripts
  3. Multi-tier limits: Different limits for different user tiers and endpoints
  4. Communicate clearly: Use standard headers to inform clients of limits
  5. Degrade gracefully: Adaptive limits and priority queues during high load
  6. Monitor continuously: Track rejection rates and adjust limits based on actual usage

Stripe, GitHub, and Twitter demonstrate that sophisticated rate limiting enables serving billions of requests while maintaining stability and fair usage. Start with conservative limits, monitor usage, and iterate based on real-world patterns to build APIs that scale reliably.

Written by StaticBlock

StaticBlock is a technical writer and software engineer specializing in web development, performance optimization, and developer tooling.