APIs are the backbone of modern applications, but without proper rate limiting and throttling, they become vulnerable to abuse, overload, and degraded performance. Companies like Stripe (processing millions of payments), GitHub (serving millions of API requests), and Twitter (handling billions of API calls daily) rely on sophisticated rate limiting to ensure fair usage, prevent abuse, and maintain system stability.
This guide covers production-ready rate limiting and throttling strategies, from foundational algorithms (token bucket, leaky bucket, sliding window) to distributed implementations with Redis, per-user and per-endpoint controls, graceful degradation patterns, and monitoring approaches. We'll explore real-world implementations and learn when to apply each strategy.
Why Rate Limiting Matters
Rate limiting controls the rate at which clients can make API requests. It's essential for:
- Preventing abuse: Malicious actors, scrapers, and bots can overwhelm your API
- Ensuring fair usage: Prevent single users from monopolizing resources
- Protecting infrastructure: Avoid cascading failures from traffic spikes
- Cost management: Control cloud costs from unexpected usage
- SLA compliance: Guarantee performance for paying customers
GitHub rate limits unauthenticated requests to 60/hour and authenticated to 5,000/hour to prevent scraping while serving legitimate developers. Stripe implements sophisticated per-endpoint limits (100 reads/sec, 10 writes/sec) to protect payment processing infrastructure.
Rate Limiting vs Throttling
Rate Limiting: Hard limits on request counts (e.g., 1000 requests per hour). Requests exceeding limits are rejected with HTTP 429.
Throttling: Gradual slowdown of responses as limits approach. Requests aren't rejected but processing slows down (e.g., adding delays).
Most APIs use rate limiting for simplicity and predictability. Throttling is useful for graceful degradation.
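To make the distinction concrete, here is a minimal throttling sketch. The function names, the 80% threshold, and the linear ramp are illustrative choices, not from any particular library:

```javascript
// Throttling sketch: instead of rejecting requests, add a delay that grows
// as usage approaches the limit. Thresholds here are illustrative.
function computeThrottleDelay(used, limit, maxDelayMs = 2000) {
  const usageRatio = used / limit;
  if (usageRatio < 0.8) return 0;          // no delay below 80% of the limit
  if (usageRatio >= 1) return maxDelayMs;  // full delay at or over the limit
  // Linear ramp from 0ms at 80% usage up to maxDelayMs at 100%
  return Math.round(((usageRatio - 0.8) / 0.2) * maxDelayMs);
}

// Express-style middleware wrapper (getUsage is an assumed helper that
// returns the caller's current request count)
function throttleMiddleware(getUsage, limit) {
  return async (req, res, next) => {
    const used = await getUsage(req);
    const delay = computeThrottleDelay(used, limit);
    if (delay > 0) await new Promise(resolve => setTimeout(resolve, delay));
    next();
  };
}
```

A client at 90% of its limit would see roughly a 1-second delay rather than a 429, trading latency for availability.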
Rate Limiting Algorithms
Token Bucket Algorithm (Most Common)
Tokens are added to a bucket at a fixed rate, and each request consumes one token. If no tokens are available, the request is denied.
Parameters:
- Bucket capacity: Maximum burst size
- Refill rate: Tokens added per second
class TokenBucket {
constructor(capacity, refillRate) {
this.capacity = capacity;
this.tokens = capacity;
this.refillRate = refillRate; // tokens per second
this.lastRefill = Date.now();
}
refill() {
const now = Date.now();
const elapsed = (now - this.lastRefill) / 1000; // seconds
const tokensToAdd = elapsed * this.refillRate;
this.tokens = Math.min(this.capacity, this.tokens + tokensToAdd);
this.lastRefill = now;
}
consume(tokens = 1) {
this.refill();
if (this.tokens >= tokens) {
this.tokens -= tokens;
return true;
}
return false;
}
getWaitTime() {
if (this.tokens >= 1) return 0;
return Math.ceil((1 - this.tokens) / this.refillRate * 1000); // milliseconds
}
}
// Usage
const bucket = new TokenBucket(100, 10); // 100 capacity, 10 tokens/sec
function handleRequest(req, res) {
if (bucket.consume()) {
res.json({ success: true });
} else {
const retryAfter = Math.ceil(bucket.getWaitTime() / 1000);
res.status(429).json({
error: 'Rate limit exceeded',
retryAfter
});
}
}
Pros:
- Allows bursts up to bucket capacity
- Simple to implement
- Memory efficient (only stores token count and timestamp)
Cons:
- Doesn't handle distributed systems (needs external store)
Stripe uses token bucket for per-user rate limiting, allowing legitimate bursts while preventing sustained abuse.
Leaky Bucket Algorithm
Requests are added to a FIFO queue and processed at a fixed rate. When the queue is full, new requests are rejected.
class LeakyBucket {
constructor(capacity, leakRate) {
this.capacity = capacity;
this.queue = [];
this.leakRate = leakRate; // requests per second
this.lastLeak = Date.now();
}
leak() {
const now = Date.now();
const elapsed = (now - this.lastLeak) / 1000;
const leakCount = Math.floor(elapsed * this.leakRate);
this.queue.splice(0, leakCount); // in a real system, dequeued requests would be processed here
this.lastLeak = now;
}
addRequest(request) {
this.leak();
if (this.queue.length < this.capacity) {
this.queue.push(request);
return true;
}
return false;
}
}
Pros:
- Smooths traffic spikes
- Guarantees constant output rate
Cons:
- Higher memory usage (stores queue)
- Delayed processing (requests wait in queue)
Used by network routers for traffic shaping; less common for APIs.
Fixed Window Counter
Count requests in fixed time windows (e.g., per minute).
class FixedWindowCounter {
constructor(limit, windowSize) {
this.limit = limit;
this.windowSize = windowSize; // milliseconds
this.counter = 0;
this.windowStart = Date.now();
}
allow() {
const now = Date.now();
// Reset window if expired
if (now - this.windowStart >= this.windowSize) {
this.counter = 0;
this.windowStart = now;
}
if (this.counter < this.limit) {
this.counter++;
return true;
}
return false;
}
}
// Usage
const limiter = new FixedWindowCounter(1000, 60000); // 1000 req/min
Pros:
- Extremely simple
- Low memory usage
Cons:
- Boundary problem: Users can send nearly 2x the limit by clustering requests at window edges (e.g., with a 1,000/min limit, 1,000 requests at 11:59:59 and another 1,000 at 12:00:01)
Twitter API v1.1 used fixed windows, leading to burst issues.
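The boundary problem is easy to demonstrate with a variant of the fixed window counter above that takes an injectable clock (a simulation sketch, not production code):

```javascript
// Demonstrates the fixed-window boundary problem using a fake clock.
// With a limit of 10 per 1-second window, a client squeezes 20 requests
// through in ~20ms of simulated time by straddling the window boundary.
class FixedWindow {
  constructor(limit, windowSize, clock) {
    this.limit = limit;
    this.windowSize = windowSize;
    this.clock = clock;
    this.counter = 0;
    this.windowStart = clock.now();
  }
  allow() {
    const now = this.clock.now();
    if (now - this.windowStart >= this.windowSize) {
      this.counter = 0;          // window expired: reset the counter
      this.windowStart = now;
    }
    if (this.counter < this.limit) { this.counter++; return true; }
    return false;
  }
}

const clock = { t: 0, now() { return this.t; } };
const limiter = new FixedWindow(10, 1000, clock);

clock.t = 990; // just before the window boundary
let allowed = 0;
for (let i = 0; i < 10; i++) if (limiter.allow()) allowed++;

clock.t = 1010; // just after the boundary: the counter resets
for (let i = 0; i < 10; i++) if (limiter.allow()) allowed++;

console.log(allowed); // 20 — double the per-window limit in ~20ms
```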
Sliding Window Log
Store the timestamp of each request and count how many fall within the sliding window.
class SlidingWindowLog {
constructor(limit, windowSize) {
this.limit = limit;
this.windowSize = windowSize; // milliseconds
this.log = []; // Array of timestamps
}
allow() {
const now = Date.now();
const cutoff = now - this.windowSize;
// Remove old entries
this.log = this.log.filter(timestamp => timestamp > cutoff);
if (this.log.length < this.limit) {
this.log.push(now);
return true;
}
return false;
}
}
Pros:
- Accurate (no boundary problem)
- Smooth rate limiting
Cons:
- High memory usage (stores all timestamps)
- Doesn't scale to high-volume APIs
Cloudflare uses sliding window for precise rate limiting on enterprise plans.
Sliding Window Counter (Hybrid - Best for Production)
Combines fixed window simplicity with sliding window accuracy.
class SlidingWindowCounter {
constructor(limit, windowSize) {
this.limit = limit;
this.windowSize = windowSize;
this.currentWindow = { start: Date.now(), count: 0 };
this.previousWindow = { start: 0, count: 0 };
}
allow() {
const now = Date.now();
const currentWindowStart = Math.floor(now / this.windowSize) * this.windowSize;
// New window started
if (currentWindowStart > this.currentWindow.start) {
this.previousWindow = this.currentWindow;
this.currentWindow = { start: currentWindowStart, count: 0 };
}
// Calculate weighted count
const elapsedInCurrentWindow = now - this.currentWindow.start;
const weightOfPreviousWindow = 1 - (elapsedInCurrentWindow / this.windowSize);
const estimatedCount =
this.previousWindow.count * weightOfPreviousWindow +
this.currentWindow.count;
if (estimatedCount < this.limit) {
this.currentWindow.count++;
return true;
}
return false;
}
}
Pros:
- Accurate (solves boundary problem)
- Memory efficient (only 2 counters)
- Smooth rate limiting
Cons:
- Slightly more complex
Best choice for most production APIs. GitHub and Twitter API v2 use sliding window counters.
Distributed Rate Limiting with Redis
For multi-server deployments, rate limiting state must be shared. Redis provides atomic operations ideal for distributed rate limiting.
Redis Token Bucket Implementation
const Redis = require('ioredis');
const redis = new Redis();
class RedisTokenBucket {
constructor(key, capacity, refillRate) {
this.key = key;
this.capacity = capacity;
this.refillRate = refillRate;
}
async consume(tokens = 1) {
const script = `
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refillRate = tonumber(ARGV[2])
local tokens = tonumber(ARGV[3])
local now = tonumber(ARGV[4])
local bucket = redis.call('HMGET', key, 'tokens', 'lastRefill')
local currentTokens = tonumber(bucket[1]) or capacity
local lastRefill = tonumber(bucket[2]) or now
-- Refill tokens
local elapsed = now - lastRefill
local tokensToAdd = elapsed * refillRate
currentTokens = math.min(capacity, currentTokens + tokensToAdd)
-- Try to consume
if currentTokens >= tokens then
currentTokens = currentTokens - tokens
redis.call('HMSET', key, 'tokens', currentTokens, 'lastRefill', now)
redis.call('EXPIRE', key, 3600) -- Expire after 1 hour of inactivity
return 1
else
return 0
end
`;
const result = await redis.eval(
script,
1,
this.key,
this.capacity,
this.refillRate,
tokens,
Date.now() / 1000
);
return result === 1;
}
}
// Usage
async function handleRequest(req, res) {
const userId = req.user.id;
const limiter = new RedisTokenBucket(
`rate_limit:${userId}`,
100, // capacity
10 // 10 tokens/sec
);
if (await limiter.consume()) {
res.json({ success: true });
} else {
res.status(429).json({ error: 'Rate limit exceeded' });
}
}
Why Lua script? Redis executes Lua scripts atomically, ensuring race-free rate limiting across servers.
Redis Sliding Window Implementation
async function slidingWindowRateLimit(userId, limit, windowSec) {
const key = `rate_limit:${userId}`;
const now = Date.now();
const windowStart = now - (windowSec * 1000);
const pipeline = redis.pipeline();
// Remove old entries
pipeline.zremrangebyscore(key, 0, windowStart);
// Add current request
pipeline.zadd(key, now, `${now}-${Math.random()}`);
// Count requests in window
pipeline.zcard(key);
// Set expiration
pipeline.expire(key, windowSec * 2);
const results = await pipeline.exec();
const count = results[2][1];
return count <= limit;
}
// Usage
const allowed = await slidingWindowRateLimit('user_123', 1000, 60); // 1000 req/min
Performance: Redis can handle 100K+ rate limit checks per second on a single instance.
Per-User and Per-Endpoint Rate Limiting
Real-world APIs need different limits for different contexts.
Multi-Tier Rate Limiting
class MultiTierRateLimiter {
constructor(redis) {
this.redis = redis;
this.tiers = {
free: { requestsPerHour: 100, burstSize: 10 },
pro: { requestsPerHour: 10000, burstSize: 100 },
enterprise: { requestsPerHour: 100000, burstSize: 1000 }
};
}
async checkLimit(userId, userTier, endpoint) {
const tier = this.tiers[userTier];
// Global user limit
const globalKey = `limit:user:${userId}`;
const globalLimiter = new RedisTokenBucket(
globalKey,
tier.burstSize,
tier.requestsPerHour / 3600 // per second
);
// Per-endpoint limit (stricter for write operations)
const endpointKey = `limit:user:${userId}:${endpoint}`;
const endpointLimit = this.getEndpointLimit(endpoint, tier);
const endpointLimiter = new RedisTokenBucket(
endpointKey,
endpointLimit.burstSize,
endpointLimit.requestsPerHour / 3600
);
// Must pass both checks
const globalAllowed = await globalLimiter.consume();
const endpointAllowed = await endpointLimiter.consume();
return globalAllowed && endpointAllowed;
}
getEndpointLimit(endpoint, tier) {
// Write endpoints get 10x stricter limits
if (endpoint.startsWith('POST') || endpoint.startsWith('PUT') || endpoint.startsWith('DELETE')) {
return {
requestsPerHour: tier.requestsPerHour / 10,
burstSize: tier.burstSize / 10
};
}
return tier;
}
}
Stripe uses this pattern: 100 reads/sec but only 10 writes/sec to protect critical payment infrastructure.
IP-Based Rate Limiting
async function ipRateLimit(req, res, next) {
const ip = req.ip;
// Allow 1000 requests per hour per IP
// (slidingWindowRateLimit prepends 'rate_limit:' to the key itself)
const allowed = await slidingWindowRateLimit(`ip:${ip}`, 1000, 3600);
if (!allowed) {
return res.status(429).json({
error: 'Rate limit exceeded',
retryAfter: 3600
});
}
next();
}
GitHub rate limits by IP for unauthenticated requests to prevent scraping.
Rate Limit Headers and Client Communication
Communicate rate limit status to clients via headers:
// Assumes the limiter exposes getRemainingTokens() and getResetTime() helpers
async function addRateLimitHeaders(req, res, limiter) {
const limit = 1000;
const remaining = await limiter.getRemainingTokens();
const resetTime = await limiter.getResetTime();
res.set({
'X-RateLimit-Limit': limit,
'X-RateLimit-Remaining': remaining,
'X-RateLimit-Reset': resetTime,
'X-RateLimit-Used': limit - remaining
});
if (remaining === 0) {
res.set('Retry-After', Math.ceil((resetTime - Date.now()) / 1000));
}
}
Standard headers (used by GitHub, Twitter, Stripe):
- X-RateLimit-Limit: Total requests allowed
- X-RateLimit-Remaining: Requests remaining in window
- X-RateLimit-Reset: Unix timestamp when limit resets
- Retry-After: Seconds until retry (for 429 responses)
Graceful Degradation and Circuit Breakers
When rate limits are hit, degrade gracefully rather than failing hard.
Adaptive Rate Limiting
class AdaptiveRateLimiter {
constructor() {
this.baseLimit = 1000;
this.currentLimit = 1000;
this.errorRate = 0;
this.checkInterval = setInterval(() => this.adjust(), 10000);
}
async adjust() {
const metrics = await getSystemMetrics(); // assumed to return { cpuUsage, errorRate }
// Reduce limits if system under stress
if (metrics.cpuUsage > 80 || metrics.errorRate > 0.05) {
this.currentLimit = Math.max(100, this.currentLimit * 0.9);
console.log(`Reducing rate limit to ${this.currentLimit}`);
}
// Gradually increase limits if healthy
else if (metrics.cpuUsage < 50 && metrics.errorRate < 0.01) {
this.currentLimit = Math.min(this.baseLimit, this.currentLimit * 1.1);
}
}
async allow(userId) {
const limiter = new RedisTokenBucket(
`adaptive:${userId}`,
this.currentLimit,
this.currentLimit / 3600
);
return await limiter.consume();
}
}
Twitter uses adaptive rate limiting during incidents, automatically reducing limits to protect infrastructure.
Priority Queues for Critical Requests
class PriorityRateLimiter {
async allow(userId, priority = 'normal') {
const limits = {
critical: 10000, // Always allow critical requests
high: 5000,
normal: 1000,
low: 100
};
const limit = limits[priority];
const limiter = new RedisTokenBucket(`priority:${userId}:${priority}`, limit, limit / 3600);
return await limiter.consume();
}
}
// Usage
app.post('/payment', async (req, res) => {
// Payment requests are critical
if (!await priorityLimiter.allow(req.user.id, 'critical')) {
return res.status(503).json({ error: 'Service temporarily unavailable' });
}
processPayment(req.body);
});
Stripe prioritizes payment processing requests over read operations.
Rate Limiting Middleware for Express/Node.js
Production-ready middleware:
const rateLimit = require('express-rate-limit');
const RedisStore = require('rate-limit-redis');
const Redis = require('ioredis');
const redis = new Redis({
host: 'localhost',
port: 6379
});
const limiter = rateLimit({
store: new RedisStore({
client: redis,
prefix: 'rl:'
}),
windowMs: 60 * 1000, // 1 minute
max: async (req) => {
// Dynamic limits based on user tier
const user = req.user;
if (!user) return 100; // Anonymous
switch (user.tier) {
case 'enterprise': return 10000;
case 'pro': return 1000;
default: return 100;
}
},
keyGenerator: (req) => {
// Rate limit by user ID if authenticated, otherwise by IP
return req.user?.id || req.ip;
},
handler: (req, res) => {
res.status(429).json({
error: 'Too many requests',
message: 'You have exceeded the rate limit. Please try again later.',
retryAfter: req.rateLimit.resetTime
});
},
skip: (req) => {
// Skip rate limiting for internal services
return req.headers['x-internal-service'] === 'true';
},
onLimitReached: (req, res, options) => {
// Log rate limit violations
logger.warn(`Rate limit exceeded for ${req.user?.id || req.ip}`);
}
});
// Apply to all routes
app.use('/api/', limiter);
// Stricter limits for specific endpoints
const strictLimiter = rateLimit({
store: new RedisStore({ client: redis }),
windowMs: 60 * 1000,
max: 10, // Only 10 requests per minute
});
app.post('/api/charge', strictLimiter, handlePayment);
Monitoring and Alerting
Track rate limit metrics to detect abuse and optimize limits:
class RateLimitMonitor {
async recordMetrics(userId, endpoint, allowed) {
const timestamp = Date.now();
// Store in time-series database (e.g., InfluxDB, Prometheus)
await influxDB.writePoints([
{
measurement: 'rate_limit',
tags: {
user_id: userId,
endpoint: endpoint,
allowed: allowed
},
fields: {
count: 1
},
timestamp: timestamp
}
]);
// Alert if rejection rate > 10%
const rejectionRate = await this.getRejectionRate(userId);
if (rejectionRate > 0.1) {
await this.sendAlert(userId, rejectionRate);
}
}
async getRejectionRate(userId) {
// Redis returns strings (or null), so coerce to numbers before dividing
const allowed = Number(await redis.get(`metrics:${userId}:allowed`)) || 0;
const rejected = Number(await redis.get(`metrics:${userId}:rejected`)) || 0;
if (allowed + rejected === 0) return 0;
return rejected / (allowed + rejected);
}
}
Key metrics:
- Requests per second (by user, endpoint, tier)
- Rate limit rejections (429 responses)
- Token bucket fill levels
- Latency of rate limit checks
Datadog, Prometheus, or Grafana dashboards visualize these metrics.
Real-World Implementations
Stripe - Multi-Tier, Per-Endpoint Limits
- Global limits: 100 requests/sec (reads), 10 requests/sec (writes)
- Per-resource limits: 25 card creations/sec, 100 customer lookups/sec
- Tiered limits: Higher limits for enterprise customers
- Implementation: Redis token bucket with Lua scripts
Result: Processes millions of payments daily with 99.99% uptime.
GitHub - Sliding Window with Multiple Tiers
- Unauthenticated: 60 requests/hour (IP-based)
- Authenticated: 5,000 requests/hour (user-based)
- GitHub Actions: 1,000 requests/hour (separate limit pool)
- Implementation: Sliding window counter in Redis
Special handling: The GraphQL API uses a point system (complex queries cost more points).
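A point-based scheme can be sketched on top of the token bucket pattern, with each query consuming tokens proportional to its estimated cost. The cost heuristic below is purely illustrative, not GitHub's actual scoring formula:

```javascript
// Cost-based rate limiting sketch: expensive queries consume more tokens.
class PointBucket {
  constructor(capacity, refillRate) {
    this.capacity = capacity;
    this.tokens = capacity;
    this.refillRate = refillRate; // points per second
    this.lastRefill = Date.now();
  }
  refill() {
    const elapsed = (Date.now() - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillRate);
    this.lastRefill = Date.now();
  }
  consume(cost) {
    this.refill();
    if (this.tokens >= cost) { this.tokens -= cost; return true; }
    return false;
  }
}

// Illustrative cost model: nested paginated connections multiply the cost.
// This is NOT GitHub's real formula, just a demonstration of the idea.
function estimateQueryCost(query) {
  const pageSizes = (query.match(/first:\s*(\d+)/g) || [])
    .map(m => parseInt(m.split(':')[1], 10));
  return Math.max(1, pageSizes.reduce((cost, n) => cost * n, 1) / 100);
}

const bucket = new PointBucket(5000, 5000 / 3600); // e.g., 5,000 points/hour
const cheap = '{ viewer { login } }';
const expensive = '{ repos(first: 100) { issues(first: 100) { title } } }';
console.log(estimateQueryCost(cheap));     // 1
console.log(estimateQueryCost(expensive)); // 100
bucket.consume(estimateQueryCost(expensive));
```

The key idea: a single deeply nested query can cost as much as a hundred simple ones, so counting raw requests would badly under-protect the backend.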
Twitter API v2 - App-Level and User-Level Limits
- App-level limits: 300 requests/15 min (shared across all users of an app)
- User-level limits: 900 requests/15 min (per authenticated user)
- Endpoint-specific: Tweet creation limited to 300/3 hours
- Implementation: Distributed sliding window
Innovation: Separate app and user limits prevent single misbehaving app from consuming all quota.
Best Practices
- Start permissive, tighten gradually: Begin with high limits, reduce based on actual usage patterns
- Communicate limits clearly: Document limits in API docs, return limits in headers
- Different limits for different endpoints: Read-heavy endpoints can have higher limits than writes
- Whitelist internal services: Skip rate limiting for authenticated internal microservices
- Implement retry logic on client: Exponential backoff with jitter for 429 responses
- Monitor and alert: Track rejection rates, alert on anomalies
- Use distributed rate limiting: Don't rely on in-memory counters in multi-server deployments
- Test rate limits: Load test to ensure limits protect your infrastructure
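The client-side retry practice above can be sketched as follows (assuming Node 18+ with the global `fetch`; the base delay and retry cap are illustrative defaults):

```javascript
// Exponential backoff with "full jitter" for 429 responses.
// Honors the server's Retry-After header when present.
function backoffDelay(attempt, baseMs = 500, maxMs = 30000) {
  const exp = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.floor(Math.random() * exp); // full jitter: uniform in [0, exp)
}

async function fetchWithRetry(url, options = {}, maxRetries = 5) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const res = await fetch(url, options);
    if (res.status !== 429) return res;
    // Prefer the server's Retry-After hint; fall back to jittered backoff
    const retryAfter = res.headers.get('Retry-After');
    const delayMs = retryAfter
      ? parseInt(retryAfter, 10) * 1000
      : backoffDelay(attempt);
    await new Promise(resolve => setTimeout(resolve, delayMs));
  }
  throw new Error(`Rate limited after ${maxRetries} retries`);
}
```

Jitter matters: without it, many clients that were rejected at the same instant retry at the same instant, producing synchronized retry storms.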
Choosing the Right Algorithm
| Algorithm | Use Case | Pros | Cons |
|---|---|---|---|
| Token Bucket | General-purpose APIs | Allows bursts, efficient | Need distributed store |
| Leaky Bucket | Traffic shaping | Smooth output | High memory, delays |
| Fixed Window | Simple rate limiting | Very simple | Boundary problem |
| Sliding Window Log | Strict accuracy needed | Perfect accuracy | High memory |
| Sliding Window Counter | Production APIs | Accurate + efficient | Slightly complex |
Recommendation: Use sliding window counter for most production APIs, implemented in Redis for distributed systems.
Conclusion - Building Resilient APIs
Rate limiting and throttling are essential for production APIs. Key takeaways:
- Choose the right algorithm: Sliding window counter balances accuracy and efficiency
- Distribute state with Redis: Share rate limit counters across servers with atomic Lua scripts
- Multi-tier limits: Different limits for different user tiers and endpoints
- Communicate clearly: Use standard headers to inform clients of limits
- Degrade gracefully: Adaptive limits and priority queues during high load
- Monitor continuously: Track rejection rates and adjust limits based on actual usage
Stripe, GitHub, and Twitter demonstrate that sophisticated rate limiting enables serving billions of requests while maintaining stability and fair usage. Start with conservative limits, monitor usage, and iterate based on real-world patterns to build APIs that scale reliably.
Written by StaticBlock
StaticBlock is a technical writer and software engineer specializing in web development, performance optimization, and developer tooling.