How We Stopped Tenants From Hogging the Job Queue - Tenant-Level Parallel Job Limiting in Sidekiq

The Context: What is Commerce Connect?

At Casa Retail AI, we have an internal platform called Commerce Connect (CC).

Commerce Connect acts as the central Product Information Management (PIM) system — the single source of truth for product data across our entire retail ecosystem. Built on top of a customized version of the open-source e-commerce platform Spree Commerce, it is extended with multi-vendor and multi-tenant capabilities.

Its job is straightforward: collect product information from multiple retail ecosystems and distribute it to every Casa product that needs it, through PostgreSQL for operational modules, ClickHouse for analytics, and external B2B integrations for partner systems.

All of this synchronization work happens in the background using Sidekiq.

The Problem: A Shared Queue With No Per-Tenant Guardrails

Sidekiq processes jobs concurrently using a fixed pool of worker threads in each process. At any given time, there is a hard cap on how many jobs can run simultaneously across the entire system.

In practice, that shared pool had to absorb:

A tenant uploading multiple CSV files in quick succession
A tenant triggering multiple sync operations to different external systems
Multiple tenants doing all of this at the same time

There was nothing stopping a single active tenant from claiming the majority of worker slots. If one tenant fired several heavy jobs back-to-back, every other tenant's jobs sat in the queue waiting — with no visibility into why.

This wasn't theoretical. As we onboarded more tenants, the pattern became real:

Smaller tenants experienced noticeable delays whenever larger tenants were highly active.
Large catalog syncs from one tenant could hold up all other background work.
There was no way to offer differentiated service — every tenant was treated identically, regardless of their plan or size.

We needed a way to restrict how many jobs a single tenant could run in parallel, at the queue level.

The Solution: Per-Tenant Job Slot Tracking

The idea was simple:

Each tenant gets a configurable maximum number of concurrent jobs per queue. If a job tries to run and the tenant is already at their limit, reschedule it for later and let the next tenant go.

We introduced a database record per tenant per queue, tracking two things:

max_job — how many parallel jobs this tenant is allowed
current_job — how many are currently running

Before any background job executes, it must acquire a slot. When it finishes — or fails — it releases the slot.

For most tenants, the default is max_job: 1. Single-job concurrency. Simple, predictable, fair.
For premium tenants with higher throughput requirements, we raise the limit — no code change required, just a configuration update.

The limit is also queue-aware. A tenant's slot count on the CSV import queue is tracked independently from the sync queue. One slow upload doesn't block a faster downstream sync for the same tenant.
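
A minimal sketch of the record described above, using a plain Ruby Struct and Hash in place of the real database table. Only max_job and current_job come from the design; the names TenantJobLimit, LimitStore, find_or_create, and the queue names are hypothetical.

```ruby
# In-memory stand-in for the per-tenant, per-queue table (hypothetical names).
# Only max_job and current_job come from the design; the rest is illustrative.
TenantJobLimit = Struct.new(:tenant_id, :queue, :max_job, :current_job) do
  def slot_available?
    current_job < max_job
  end
end

class LimitStore
  DEFAULT_MAX_JOB = 1 # default: single-job concurrency

  def initialize
    @rows = {}
  end

  # Records are keyed by tenant AND queue, so each queue has its own counter.
  # A missing record is created with defaults on first use.
  def find_or_create(tenant_id, queue)
    @rows[[tenant_id, queue]] ||= TenantJobLimit.new(tenant_id, queue, DEFAULT_MAX_JOB, 0)
  end
end

store = LimitStore.new
import_limit = store.find_or_create(42, "csv_import")
sync_limit = store.find_or_create(42, "sync")

import_limit.current_job += 1  # an import is running for tenant 42
import_limit.slot_available?   # => false: the import queue is at its limit
sync_limit.slot_available?     # => true: the sync queue is unaffected

sync_limit.max_job = 3         # raising a premium tenant's limit is a data change
```

Because the key includes the queue, a busy import queue never consumes the same tenant's sync slots.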

The Flow

When a job is picked up by a Sidekiq worker, before any real business logic runs:

1. Look up the tenant's config record for this queue. Create it with defaults if it doesn't exist yet, so new tenants need no manual setup.
2. Check if a slot is available (current_job < max_job).
3. If not, reschedule the job for 1 minute later and return. The work still gets done; it just waits its turn.
4. If yes, increment current_job and proceed with the actual job.
5. When the job finishes (or raises an exception), decrement current_job back.

The decrement in step 5 always runs, whether the job succeeds or raises an exception. This guarantee is what prevents a failed job from permanently consuming a slot and slowly starving the tenant of all future work.
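
The flow above can be sketched in plain Ruby as follows. This is an illustration under stated assumptions, not our production module: SlotTable fakes the database table, CatalogSyncJob is a hypothetical job class, and its perform_in only records the reschedule that Sidekiq's real perform_in(interval, *args) would enqueue. The check-and-increment is written naively here for readability; the next section explains why the real one has to be atomic.

```ruby
# Sketch of the acquire / run / release flow. Everything here is a stand-in:
# SlotTable fakes the database table, and perform_in only records the
# reschedule that a real Sidekiq job class would enqueue.
class SlotTable
  Row = Struct.new(:max_job, :current_job)

  def initialize
    # Step 1: a missing record is created with defaults on first access.
    @rows = Hash.new { |hash, key| hash[key] = Row.new(1, 0) }
  end

  def row(tenant_id, queue)
    @rows[[tenant_id, queue]]
  end

  # Steps 2 and 4. Not concurrency-safe as written.
  def acquire(tenant_id, queue)
    r = row(tenant_id, queue)
    return false unless r.current_job < r.max_job
    r.current_job += 1
    true
  end

  # Step 5.
  def release(tenant_id, queue)
    r = row(tenant_id, queue)
    r.current_job -= 1 if r.current_job > 0
  end
end

class CatalogSyncJob
  RETRY_DELAY = 60 # seconds, i.e. "1 minute later"
  QUEUE = "sync"

  class << self
    def rescheduled
      @rescheduled ||= []
    end

    # Stand-in for Sidekiq's perform_in(interval, *args).
    def perform_in(delay, *args)
      rescheduled << [delay, args]
    end
  end

  def initialize(table)
    @table = table
  end

  def perform(tenant_id)
    unless @table.acquire(tenant_id, QUEUE)
      self.class.perform_in(RETRY_DELAY, tenant_id) # step 3: wait for a turn
      return :rescheduled
    end

    begin
      yield if block_given? # the actual business logic
      :done
    ensure
      @table.release(tenant_id, QUEUE) # step 5: runs on success and on exceptions
    end
  end
end
```

The ensure clause is what ties the release to every exit path of the job body, including a raised exception.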

The Race Condition — and How We Handle It

This is where it gets interesting.

At first glance, "check if slots are available, then increment" sounds simple. But in a concurrent system with multiple Sidekiq threads running simultaneously, this is a classic read-modify-write race condition.

Imagine two jobs for the same tenant land in the queue at almost the same time. Both threads check current_job < max_job. Both see the count is within limits. Both decide to proceed. Both increment. Now the tenant is running more concurrent jobs than allowed — the limit is silently violated.

The naive approach (read the value, check it in application code, then write it back) doesn't hold up under concurrency. There's always a window between the read and the write where another thread can slip through.
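
To make the window concrete, here is a deterministic replay of that interleaving in plain Ruby, with no threads involved. CounterRow and replay_naive_race are hypothetical names, and the Struct stands in for the tenant's database record.

```ruby
# Deterministic replay of the race: both workers read before either writes.
# CounterRow is a hypothetical in-memory stand-in for the tenant's record.
CounterRow = Struct.new(:max_job, :current_job)

def replay_naive_race(row)
  seen_by_a = row.current_job                # worker A reads 0
  seen_by_b = row.current_job                # worker B reads 0 before A writes

  a_runs = seen_by_a < row.max_job           # A: a slot looks free
  b_runs = seen_by_b < row.max_job           # B: a slot looks free too

  row.current_job = seen_by_a + 1 if a_runs  # A writes 1
  row.current_job = seen_by_b + 1 if b_runs  # B also writes 1

  [a_runs, b_runs].count(true)               # how many jobs decided to proceed
end

row = CounterRow.new(1, 0)
running = replay_naive_race(row)
puts "limit=#{row.max_job} running=#{running} counter=#{row.current_job}"
# prints "limit=1 running=2 counter=1"
```

In this variant the second write overwrites the first, so the counter even under-reports. If each write were an in-database increment instead, the counter would read 2. Either way, two jobs run against a limit of one.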

The correct fix is to make the check and the increment a single atomic operation at the database level.

Instead of reading the value and then updating it separately, we issue a conditional SQL update that does both in one statement: "increment current_job by 1, but only if current_job is currently less than max_job." The database processes this as one indivisible operation. If two threads race for the tenant's last free slot, only one of them satisfies the condition; the other sees that zero rows were updated and knows it lost the race.

No Redis locks. No application-level mutexes. No external coordination. The database's own row-level locking gives us the guarantee we need, for free.
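
A sketch of what such a statement can look like, followed by an in-memory stand-in so the behaviour can be exercised without a database. The table name tenant_job_limits and the columns tenant_id and queue_name are assumptions; only max_job and current_job come from our design. The Mutex in FakeLimitRow is not part of the technique: it only imitates the row lock the database takes while it runs the UPDATE. In a Rails app, one common way to issue a statement like this is ActiveRecord's update_all, which returns the number of affected rows.

```ruby
# The single conditional statement, written out for PostgreSQL.
# Table and column names other than max_job / current_job are assumptions.
ACQUIRE_SQL = <<~SQL
  UPDATE tenant_job_limits
     SET current_job = current_job + 1
   WHERE tenant_id = $1
     AND queue_name = $2
     AND current_job < max_job
SQL

# In-memory stand-in for one row of that table. The Mutex plays the role of
# the database's row lock; the application itself would hold no lock.
class FakeLimitRow
  attr_reader :current_job

  def initialize(max_job)
    @max_job = max_job
    @current_job = 0
    @row_lock = Mutex.new
  end

  # Mirrors ACQUIRE_SQL: returns the number of rows updated (1 or 0).
  def conditional_increment
    @row_lock.synchronize do
      next 0 unless @current_job < @max_job
      @current_job += 1
      1
    end
  end
end

row = FakeLimitRow.new(3)
results = Array.new(20) { Thread.new { row.conditional_increment } }.map(&:value)

puts "acquired=#{results.sum} current_job=#{row.current_job}"
# prints "acquired=3 current_job=3"
```

A caller treats a result of 1 as "slot acquired" and a result of 0 as "limit reached, reschedule".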

The same pattern applies on release: the decrement is conditioned on current_job > 0, preventing the counter from going negative if release is somehow called more times than acquire in an unexpected error flow.
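
The release side can be sketched the same way, with the same caveats: the table and column names other than current_job are assumptions, and FakeSlotCounter is a hypothetical in-memory stand-in whose Mutex only imitates the database's row lock.

```ruby
# Release counterpart of the conditional update (assumed table/column names).
RELEASE_SQL = <<~SQL
  UPDATE tenant_job_limits
     SET current_job = current_job - 1
   WHERE tenant_id = $1
     AND queue_name = $2
     AND current_job > 0
SQL

# In-memory stand-in for one row; the Mutex imitates the database's row lock.
class FakeSlotCounter
  attr_reader :current_job

  def initialize(current_job)
    @current_job = current_job
    @row_lock = Mutex.new
  end

  # Mirrors RELEASE_SQL: returns rows updated and never drops below zero.
  def conditional_decrement
    @row_lock.synchronize do
      next 0 unless @current_job > 0
      @current_job -= 1
      1
    end
  end
end

counter = FakeSlotCounter.new(1)
first = counter.conditional_decrement   # frees the slot
second = counter.conditional_decrement  # extra release: no row matches
puts "first=#{first} second=#{second} current_job=#{counter.current_job}"
# prints "first=1 second=0 current_job=0"
```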

The Business Impact

This was a small module by any measure. But the effect on fairness across tenants was immediate.

No single tenant can monopolize the job queue, regardless of how aggressively they upload or sync.
Most tenants run comfortably on the default single-slot limit: simple, one-job-at-a-time behavior per queue.
Premium tenants with higher throughput needs get their limit raised with a simple config update.
No new infrastructure. No Redis, no distributed lock service — the existing database was enough.

What I Learned

The most effective solutions are often the ones that fit the shape of the system you already have.

We had PostgreSQL. We had Sidekiq. We didn't need a new component.

One database table. One atomic update pattern. One shared module any worker can include.

That was enough to give every tenant a fair share of the job queue — regardless of how active their neighbors were.

Tags: #rails #sidekiq #ruby #backgroundjobs #softwareengineering