Building a Sliding Window Rate Limiter in Redis for a Multi-Region Video API

Twice in the same week last quarter, our YouTube Data API quota hit zero before 09:00 UTC, and the trending feeds for half our regions went stale for the rest of the day. I run TrendVidStream, a global video streaming discovery site that ingests trending-video data for 8 regions on a cron schedule, and the culprit was not traffic growth. It was our rate limiter. A fixed-window counter let two full bursts of API calls through within seconds of each other whenever a window boundary lined up with the cron schedule — and with 8 regional fetch jobs firing at staggered offsets, boundaries lined up far more often than intuition suggested they would.

This post walks through how we replaced that fixed-window counter with a sliding window log in Redis: the data model, the Lua script that makes it atomic, weighted costs for APIs where one call can be 100x more expensive than another, and the failure modes we hit running it across a multi-region ingestion pipeline. The stack is PHP 8.4 talking to Redis through phpredis, with a Python harness for verifying boundary behavior, but every part of the design transfers to any language with a Redis client.

Why the Fixed Window Failed Us

The original limiter was the one everybody writes first, because it is three lines of Redis and it mostly works:

<?php

function allowFixedWindow(Redis $redis, string $key, int $limit, int $windowSec): bool
{
    $bucket = sprintf('%s:%d', $key, intdiv(time(), $windowSec));
    $count  = $redis->incr($bucket);

    if ($count === 1) {
        $redis->expire($bucket, $windowSec * 2);
    }

    return $count <= $limit;
}

Every request increments a counter named after the current window bucket. When the counter exceeds the limit, requests are rejected until the clock rolls into the next bucket and a fresh counter starts at zero.

The problem is exactly that rollover. Suppose the limit is 100 requests per minute. A client — or in our case, a burst of regional cron jobs — can send 100 requests at 11:59:59 and another 100 at 12:00:01. Both bursts are accepted, because each lands in a different bucket. The effective rate over those two seconds is 200 requests, double the configured limit, and the limiter considers everything fine.

For a public API this is an annoyance. For an ingestion pipeline metered by a hard daily quota, it is a budget leak. The YouTube Data API gives you 10,000 units per day per key. Our fetch pipeline runs for 8 regions (US, GB, DE, FR, IN, BR, AU, CA), and each regional pass makes a sequence of paginated calls. When two regional jobs straddled a window boundary, the limiter happily authorized roughly twice the intended spend in a few seconds. Do that a handful of times overnight and the quota is gone before the morning fetch cycle even starts.

The fix is not a smaller window or a fudge factor on the limit. The fix is a limiter whose window actually slides.

A Sliding Window Log on Sorted Sets

A sliding window log stores a timestamp for every accepted request and answers one question: how many requests happened in the last N seconds, measured from right now — not from the last round-number boundary. Redis sorted sets are almost suspiciously well shaped for this:

ZADD key timestamp member records a request, with the timestamp as the score.
ZREMRANGEBYSCORE key 0 (now - window) evicts everything older than the window in one call.
ZCARD key counts what remains — the exact number of requests inside the sliding window.
PEXPIRE key window lets idle keys delete themselves, so abandoned limiters cost nothing.

There is no boundary to straddle because there is no boundary. At any instant, the set contains precisely the requests from the trailing window, and the count is exact rather than the interpolated estimate you get from the two-bucket sliding window counter approach that Cloudflare popularized.

The trade-off is memory: one sorted-set member per accepted request. People reflexively reject the log approach over this, but run the numbers for your actual workload first. A sorted-set entry costs on the order of 100 bytes with overhead. Our API key budget allows roughly 10,000 units per day, so a fully saturated daily window holds about 10,000 members — around 1 MB. For a public per-IP limiter at 60 requests per minute, each IP key tops out at 60 members. Unless you are limiting millions of requests per second against a single key, the exactness is cheap. If you ever do reach that scale, the upgrade path is a bucketed counter, and I will come back to that at the end.

The naive implementation, though, has a race. If you run ZREMRANGEBYSCORE, then ZCARD, then decide in application code whether to ZADD, two concurrent clients can both read a count of 99 against a limit of 100 and both insert. Pipelining does not save you, because the decision depends on a value read mid-sequence. This is exactly what Redis server-side Lua scripting is for: the whole check-and-record becomes one atomic unit.

Making It Atomic With Lua

Here is the limiter we run in production, trimmed of logging. PHP 8.4, phpredis, one Lua script:

<?php

declare(strict_types=1);

final class SlidingWindowLimiter
{
    private const string LUA = <<<'LUA'
        local key    = KEYS[1]
        local window = tonumber(ARGV[1]) -- microseconds
        local limit  = tonumber(ARGV[2])
        local cost   = tonumber(ARGV[3])
        local member = ARGV[4]

        local t   = redis.call('TIME')
        local now = t[1] * 1000000 + t[2]

        redis.call('ZREMRANGEBYSCORE', key, 0, now - window)

        local used = redis.call('ZCARD', key)
        if used + cost > limit then
            local oldest = redis.call('ZRANGE', key, 0, 0, 'WITHSCORES')
            local retryAfter = 1
            if oldest[2] then
                retryAfter = math.ceil((tonumber(oldest[2]) + window - now) / 1000000)
            end
            return {0, limit - used, retryAfter}
        end

        for i = 1, cost do
            redis.call('ZADD', key, now, member .. ':' .. i)
        end
        redis.call('PEXPIRE', key, math.ceil(window / 1000) + 1000)

        return {1, limit - used - cost, 0}
        LUA;

    private ?string $sha = null;

    public function __construct(
        private readonly Redis $redis,
        private readonly int $limit,
        private readonly int $windowSeconds,
    ) {
        if ($limit < 1 || $windowSeconds < 1) {
            throw new InvalidArgumentException('limit and window must be positive');
        }
    }

    public function attempt(string $key, int $cost = 1): RateLimitResult
    {
        if ($cost > $this->limit) {
            throw new InvalidArgumentException('cost exceeds limit; request can never succeed');
        }

        $args = [
            $key,
            (string) ($this->windowSeconds * 1_000_000),
            (string) $this->limit,
            (string) $cost,
            bin2hex(random_bytes(8)),
        ];

        $this->sha ??= $this->redis->script('load', self::LUA);

        try {
            $reply = $this->redis->evalSha($this->sha, $args, 1);
        } catch (RedisException) {
            // script cache flushed (failover, restart) — reload transparently
            $reply = $this->redis->eval(self::LUA, $args, 1);
        }

        return new RateLimitResult(
            allowed: (bool) $reply[0],
            remaining: max(0, (int) $reply[1]),
            retryAfterSeconds: (int) $reply[2],
        );
    }
}

final readonly class RateLimitResult
{
    public function __construct(
        public bool $allowed,
        public int $remaining,
        public int $retryAfterSeconds,
    ) {}
}

A few decisions in there earned their place the hard way:

The clock lives inside Redis. The script calls TIME on the server instead of accepting a timestamp from PHP. Our PHP runs on shared LiteSpeed hosting boxes that we deploy to over FTP; we do not control their NTP discipline, and we once watched two web nodes disagree by 1.4 seconds. With eight cron sources writing to one limiter, a single authoritative clock removes a whole category of bugs. One caveat: calling TIME before a write requires Redis 5 or newer, where scripts replicate by effects rather than verbatim. On anything older the script is rejected as non-deterministic.
Microsecond scores. With second-resolution scores, two requests in the same second from the same caller could collide on score and member. Microseconds plus a random member suffix make every entry unique.
Retry-After is computed, not guessed. The oldest surviving entry plus the window length tells you exactly when capacity frees up. For the public API we surface that in a real Retry-After header on the 429; for cron, it tells the job whether to sleep or abandon the run.
EVALSHA with EVAL fallback. Script caching saves bandwidth on every call, and the fallback path means a Redis restart or failover never turns into an application error.
The cost-versus-limit guard. A request whose cost exceeds the limit would be rejected forever while looking like a transient throttle. Failing loudly at the call site beats an infinite retry loop at 3 a.m.

Weighted Costs Because API Calls Are Not Equal

That cost parameter is doing more work than it looks like. Upstream quotas are rarely flat per-request. On the YouTube Data API, a videos.list call costs 1 unit while a search.list call costs 100 — a two-orders-of-magnitude spread within one quota pool. A limiter that counts requests instead of units will confidently approve you into a quota lockout.

The script handles weight by inserting cost members per call, each sharing a timestamp and differing by suffix. ZCARD then equals units consumed in the window, and eviction stays a single ZREMRANGEBYSCORE. Inserting 100 members for one expensive search call sounds wasteful, but our pipeline issues a handful of search calls per fetch cycle — the loop runs in microseconds and the memory ceiling is still bounded by the quota itself. If your costs ran into the tens of thousands per call, you would store the cost as part of the member payload and switch the counting strategy; for realistic API quota weights, brute force is simpler and exact.

Here is the shape of the ingestion loop that consumes it, condensed from our fetch job:

<?php

const QUOTA_COSTS = [
    'search.list'   => 100,
    'videos.list'   => 1,
    'channels.list' => 1,
];

const REGIONS = ['US', 'GB', 'DE', 'FR', 'IN', 'BR', 'AU', 'CA'];

// 9500, not 10000: headroom for backfill jobs and manual debugging
$limiter = new SlidingWindowLimiter($redis, limit: 9500, windowSeconds: 86400);

foreach (REGIONS as $region) {
    $result = $limiter->attempt('yt:quota:key1', QUOTA_COSTS['videos.list']);

    if (!$result->allowed) {
        error_log(sprintf(
            'quota exhausted before %s; %d units left, retry in %ds — ending run',
            $region,
            $result->remaining,
            $result->retryAfterSeconds,
        ));
        break; // the next cron tick picks up where we stopped
    }

    fetchTrendingPage($region);
}

Two details matter here. The limit is set below the real quota, because the limiter should run out before the upstream does — a 403 from Google is a worse failure signal than your own clean rejection, and the reserve absorbs ad-hoc API usage during debugging. And on rejection the job breaks instead of sleeping: a daily-window limiter can return a Retry-After of hours, and a cron job that sleeps for hours is a process leak. Stop cleanly, let the schedule resume.

One Limiter, Eight Regions

The multi-region wrinkle is mostly about choosing the right key, and the right key follows the scope of the resource you are protecting:

Upstream API quota is per API key, not per region, so all 8 regional jobs share one limiter key per credential: yt:quota:key1. Splitting it per region would just hide overspend until the upstream said no.
Public endpoint limits are per client, so the watch and discovery endpoints use api:ip:{addr} keys with a 60-per-minute window.
Expensive internal operations get their own narrow keys — search.list calls have a dedicated limiter on top of the shared quota one, because a bug in search pagination once ate 4,000 units in ten minutes and we never want a single endpoint to be able to do that again.

Our deployment model makes the case for Redis sharper than usual. The PHP application is deployed over FTP to LiteSpeed shared hosting — there is no long-lived daemon, no APCu shared across workers that survives meaningfully, and each request or cron invocation is an isolated process. Limiter state has to live outside the PHP lifecycle entirely, so it lives on a small Redis VPS that all environments reach. The regional cron jobs are staggered by minute offsets so their bursts interleave rather than stack, but staggering is an optimization, not a correctness mechanism — the sliding window is what holds the line when schedules drift or a job overruns.

It is also worth engineering down the number of metered calls you need at all. Our user-facing search never touches the upstream API: ingested metadata lands in SQLite with an FTS5 index, and queries are served locally for free. The entire API budget goes to ingestion, where the limiter governs a small number of well-understood jobs instead of unpredictable user traffic. The cheapest rate-limited call is the one you replaced with a local index.

Proving the Boundary Behavior

Claims about window boundaries deserve a demonstration, not a diagram. This Python script lines both limiters up against a window boundary and fires a full-limit burst on each side of it:

import time
import redis

r = redis.Redis(decode_responses=True)

WINDOW = 10   # seconds
LIMIT = 50

sliding = r.register_script('''
local key    = KEYS[1]
local now    = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit  = tonumber(ARGV[3])
redis.call('ZREMRANGEBYSCORE', key, 0, now - window)
if redis.call('ZCARD', key) >= limit then
    return 0
end
redis.call('ZADD', key, now, tostring(now) .. '-' .. tostring(math.random()))
redis.call('EXPIRE', key, window + 1)
return 1
''')

def fixed_allow(key):
    bucket = f'{key}:{int(time.time() // WINDOW)}'
    n = r.incr(bucket)
    r.expire(bucket, WINDOW * 2)
    return 1 if n <= LIMIT else 0

def sliding_allow(key):
    return int(sliding(keys=[key], args=[time.time(), WINDOW, LIMIT]))

def burst(n, fn):
    return sum(fn() for _ in range(n))

# park ourselves one second before the next window boundary
now = time.time()
boundary = (int(now // WINDOW) + 1) * WINDOW
time.sleep(max(0.0, boundary - 1 - now))

before_f = burst(LIMIT, lambda: fixed_allow('demo:fixed'))
before_s = burst(LIMIT, lambda: sliding_allow('demo:sliding'))

time.sleep(2)  # cross the boundary

after_f = burst(LIMIT, lambda: fixed_allow('demo:fixed'))
after_s = burst(LIMIT, lambda: sliding_allow('demo:sliding'))

print(f'fixed   accepted {before_f + after_f} requests in ~3s (limit {LIMIT}/{WINDOW}s)')
print(f'sliding accepted {before_s + after_s} requests in ~3s (limit {LIMIT}/{WINDOW}s)')

On my machine the output is the whole argument in two lines:

fixed accepted 100 requests in ~3s (limit 50/10s)
sliding accepted 50 requests in ~3s (limit 50/10s)

The fixed window approved exactly double its stated limit across the boundary, which is the production incident from the introduction reproduced in miniature. The sliding window held the line at 50 regardless of where the burst landed relative to the clock. Same Redis, same limit, same traffic — the only difference is whether the window slides.

Failure Modes Worth Planning For

A rate limiter sits on the critical path of every request it guards, so its own failures need decisions made in advance rather than at incident time:

Redis is unreachable. Decide fail-open versus fail-closed per call site, not globally. Our ingestion cron fails open for 1-unit calls — a brief unmetered trickle is cheaper than a stale site — but fails closed for 100-unit search calls, where a limiter outage coinciding with a pagination bug is exactly the scenario that drains a quota. Wrap the attempt in a try/catch and make the policy explicit in code.
Clock skew between writers. Solved here by using Redis TIME inside the script. If you instead pass timestamps from application servers, every server's skew becomes window distortion, and the limiter silently becomes more or less permissive per node.
Cost greater than limit. Validate at the call site. A request that can never be approved looks identical to a throttled one from the outside and will retry forever.
Unbounded key growth. The PEXPIRE on every touch means keys for departed clients evaporate on their own. Without it, a per-IP limiter is a slow memory leak indexed by every address that ever visited.
Hot keys at high scale. All operations on one limiter key hit one shard. At our volume — thousands of decisions per hour — this is irrelevant, but past several thousand decisions per second on a single key you would shard the key by a hash suffix and sum counts, or move to a bucketed sliding counter: fixed sub-buckets (say, 1-second hash fields) summed over the trailing window. You trade exactness at the bucket edge for O(1) memory per limiter. Build the exact version first; you will know from your metrics if you ever need the approximation.
Script cache eviction. SCRIPT FLUSH, failover, or a restart empties the EVALSHA cache. The EVAL fallback in the PHP class turns that from an error spike into one slightly slower call.

Conclusion

The fixed-window limiter is fine right up until the cost of a boundary burst exceeds the cost of doing it properly — and with hard upstream quotas, that point arrives early. The sliding window log on Redis sorted sets gives you exact counting, honest Retry-After values, weighted costs for unequal calls, and a shared decision point for however many regions, cron jobs, and web nodes are spending from the same budget. The whole thing is one Lua script and a thin client class.

Since swapping it in, the quota has not been exhausted once, regional fetch jobs degrade by stopping cleanly instead of slamming into upstream 403s, and the 429s our public API returns now come with a Retry-After header that tells the truth. If your limiter resets on a round number of the clock, you do not have a rate limit — you have a rate suggestion with a loophole twice per window.