Redis Cache Failure: When the DB Connection Pool Died, I Went to Fix Redis

At 2:40 a.m., I was jolted awake by an alert. The monitoring dashboard was a sea of red—user service timeouts, database connection pool exhausted. My first thought: "Redis is down again?" I checked. Redis was perfectly healthy: plenty of memory, low CPU, zero slow queries. The real culprit was hiding deeper than I expected: cache inconsistency with the database. Hot queries punched straight through the cache, hammering MySQL with a flood of requests. After fixing the bug, I spent two weeks building an automated verification suite. Now I can sleep through a 3 a.m. alarm—because Pytest already tested every consistency scenario for me.

Breaking Down the Problem: Where Does Cache Inconsistency Come From?

Our scenario was a user points system. Points lived in MySQL, Redis served as the cache, and we used the cache-aside pattern: update the database first, then delete the cache key so reads rebuild it. Textbook answer, right? Under high concurrency, it’s a trap.
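To make the pattern concrete, here is a minimal sketch of cache-aside with plain dicts standing in for MySQL and Redis. The names `db`, `cache`, `read_points`, and `update_points` are illustrative assumptions, not our production code.

```python
# In-memory stand-ins (assumptions for illustration): db plays MySQL, cache plays Redis.
db = {"user:1:points": 100}
cache = {}

def read_points(key):
    """Cache-aside read: serve from the cache, rebuild it from the DB on a miss."""
    if key in cache:
        return cache[key]
    value = db[key]        # cache miss: fall back to the database
    cache[key] = value     # write-back; this is the step that can race with a delete
    return value

def update_points(key, value):
    """Cache-aside write: update the database first, then delete the cache key."""
    db[key] = value
    cache.pop(key, None)   # the next read rebuilds the entry from the DB
```

Run single-threaded, this is always consistent. The trouble only starts when a read's write-back and a write's delete interleave.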

The live request timeline looked like this:

Just as the cache key expires, Thread B issues a read. Cache miss. It queries the DB and fetches the current value of 100.
Thread A updates the database (points 100 → 200) and commits.
Thread A deletes the cache key.
Thread B writes its now-stale value (100) back into the cache.

From then on, every request sees 100 until the key expires or the next update deletes it. The database has 200, the cache holds 100. The inconsistency silently settled in. Even with a TTL, users saw their points drop and then bounce back during that window, which generated support tickets.
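The interleaving can be replayed deterministically. The sketch below is an illustration with dicts in place of MySQL and Redis and two `threading.Event` gates forcing the order: the reader fetches the old value before the writer's commit lands, and writes it back after the delete. All names here are assumptions made for the example.

```python
import threading

db = {"points": 100}      # stand-in for MySQL
cache = {}                # stand-in for Redis; empty means the key just expired

fetched = threading.Event()   # reader has fetched the old value from the DB
deleted = threading.Event()   # writer has committed and deleted the cache key

def reader():
    stale = db["points"]      # cache miss: reads 100 before the writer commits
    fetched.set()
    deleted.wait()            # stalled (GC pause, slow network) while the writer runs
    cache["points"] = stale   # writes the stale 100 back after the delete

def writer():
    fetched.wait()
    db["points"] = 200         # update and commit
    cache.pop("points", None)  # delete the cache key (it is already absent)
    deleted.set()

threads = [threading.Thread(target=reader), threading.Thread(target=writer)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(db["points"], cache["points"])  # 200 100
```

The delete ran against a key that did not exist yet, so it protected nothing.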

Why didn’t the usual fixes work? Distributed locks crushed throughput. Delayed double-delete couldn’t fully close the window. And eventual consistency with a TTL didn’t satisfy the business requirement of “points must show immediately.” The root cause? We had never written a test for cache consistency. We manually clicked around in staging and shipped.
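As a rough illustration of why delayed double-delete only narrows the window, here is a sketch using in-memory dicts and a `threading.Timer`. The 50ms delay and the function name `update_with_double_delete` are assumptions chosen for the example, not values from our system.

```python
import threading

db = {"points": 100}   # stand-in for MySQL
cache = {}             # stand-in for Redis

def update_with_double_delete(value, delay=0.05):
    """Update the DB, delete the key, then delete it again after `delay` seconds."""
    db["points"] = value
    cache.pop("points", None)
    timer = threading.Timer(delay, cache.pop, args=("points", None))
    timer.start()
    return timer

timer = update_with_double_delete(200)
cache["points"] = 100      # a stalled reader writes stale data within the delay
timer.join()               # the second delete fires and removes it
removed_early_stale = "points" not in cache

cache["points"] = 100      # a reader stalled longer than the delay still wins
```

The second delete cleans up stale writes that land inside the delay, but any reader that stalls longer than the delay leaves the stale value in place until the TTL expires.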

Solution Design: Dragging the Race Condition into CI with Pytest + Docker

I needed an automated test that could reproduce this concurrent race condition and run inside CI. Here’s what drove the tech choices:

Why not just mock Redis/DB in unit tests? Mocks can’t simulate real network latency, serialization overhead, or parallel execution timing—the exact details where races happen.
Why not integration tests against real infrastructure? Nobody wants to pollute dev data or maintain a dedicated Redis/MySQL just for testing.
Why Pytest + Docker? python-on-whales lets Pytest spin up Redis containers directly. testcontainers’ RedisContainer handles lifecycle cleanly. The whole environment launches in seconds, runs, and is destroyed automatically, keeping data perfectly isolated.

Architecturally, I split the scenario into three layers:

Infrastructure layer: Pytest fixtures use Docker to start Redis (with fault injection) and initialize MySQL tables.
Business simulation layer: Python's threading module runs the update and query logic concurrently, with precisely controlled pause points to reproduce the race window.
Assertion layer: Compares the real database value with the value returned by the cache, allowing a maximum 500ms eventual consistency window; anything beyond that is flagged as inconsistent.
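The assertion layer could be sketched as a small polling helper. This is one possible shape, not the exact helper from the suite: `assert_eventually_consistent`, its parameters, and the choice to treat a missing cache entry as consistent are all assumptions for illustration.

```python
import time

def assert_eventually_consistent(read_db, read_cache, window=0.5, interval=0.01):
    """Poll until the cache matches the DB, failing once `window` seconds elapse.

    A missing cache entry (None) counts as consistent here, because the next
    read would rebuild it from the database.
    """
    deadline = time.monotonic() + window
    while True:
        db_value, cached = read_db(), read_cache()
        if cached is None or cached == db_value:
            return
        if time.monotonic() >= deadline:
            raise AssertionError(
                f"cache={cached!r} but db={db_value!r} after {window}s"
            )
        time.sleep(interval)
```

In a test, `read_db` and `read_cache` would be small callables wrapping the MySQL query and the Redis `GET` for the key under test.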

Core Implementation

1. Starting Redis with Testcontainers and Injecting Latency

This code answers: “How do I give my local tests network conditions similar to production?” Redis on localhost has near-zero latency, so many races simply never surface. The fixture adds about 2ms of delay inside the container with tc netem. socat can forward a port but has no delay option, and the image is Alpine-based, so packages come from apk rather than apt-get.

# conftest.py
import pytest
from testcontainers.redis import RedisContainer

@pytest.fixture(scope="session")
def redis_with_latency():
    """Redis container with ~2ms of injected network delay; yields a redis:// URL"""
    # NET_ADMIN lets tc modify the container's network interface
    redis = RedisContainer("redis:7-alpine", port=6379).with_kwargs(cap_add=["NET_ADMIN"])
    redis.start()
    try:
        # Inject 2ms delay so localhost speed does not hide race conditions.
        # Assumptions: the host kernel provides the netem qdisc, and the
        # container can reach the Alpine package mirror to install iproute2.
        # "sh -c" is needed because "&&" is shell syntax, not an argument.
        exit_code, output = redis.exec([
            "sh", "-c",
            "apk add --no-cache iproute2 && "
            "tc qdisc add dev eth0 root netem delay 2ms",
        ])
        assert exit_code == 0, output.decode()
        # The host port is mapped dynamically, so build the URL from the mapping
        host = redis.get_container_host_ip()
        port = redis.get_exposed_port(6379)
        yield f"redis://{host}:{port}/0"
    finally:
        redis.stop()

2. Simulating Concurrent Update and Read to Trigger the Race Window

This code answers: “How can I reliably trigger database-cache inconsistency?” The key is to let the reader thread fetch the old value from the DB, then pause just before writing back to the cache, giving the writer thread exactly enough time to delete the key.

# test_cache_consistency.py
import threading
import time
import redis
import pymysql

def test_update_then_read_consistency(redis_with_latency, mysql_conn):
    """Concurrent update and read to verify eventual cache consistency"""
    r = redis.Redis.from_url(redis_with_latency)
    mysql_conn.execute("UPDATE users SET points = 100 WHERE id=1")
    r.delete("user:1:points")

    # Control gate to pause the reader after its DB fetch, before the cache write
    read_gate = threading.Event()
    old_value_read = None

    def reader_thread():
        nonlocal old_value_read
        # Simulate cache miss, query the database
        old_value_read = mysql_conn.execute("SELECT points FROM users WHERE id=1").fetchone()[0]
        read_gate.set()          # Tell the writer: I have the old value, ready to write cache
        # Deliberately not writing back immediately, giving the writer time to delete first
        time.sleep(0.1)          # Yield to the writer thread
        r.set("user:1:points", old_value_