How We Stopped Tenants From Hogging the Job Queue - Tenant-Level Parallel Job Limiting in Sidekiq


The Context: What is Commerce Connect?

At Casa Retail AI, we have an internal platform called Commerce Connect (CC).

Commerce Connect acts as the central Product Information Management (PIM) system — the single source of truth for product data across our entire retail ecosystem. Built on top of a customized version of the open-source e-commerce platform Spree Commerce, it is extended with multi-vendor and multi-tenant capabilities.

Its job is straightforward:
Collect product information from multiple retail ecosystems and distribute it to every Casa product that needs it — PostgreSQL for operational modules, ClickHouse for analytics, and external B2B integrations for partner systems.

All of this synchronization work happens in the background using Sidekiq.


The Problem: A Shared Queue With No Per-Tenant Guardrails

Sidekiq processes jobs concurrently using a fixed pool of worker threads. At any given time, there is a hard cap on how many jobs can run simultaneously across the entire system.

In practice, those worker slots were contested by:

  • A tenant uploading multiple CSV files in quick succession
  • A tenant triggering multiple sync operations to different external systems
  • Multiple tenants doing all of this at the same time

There was nothing stopping a single active tenant from claiming the majority of worker slots. If one tenant fired several heavy jobs back-to-back, every other tenant's jobs sat in the queue waiting — with no visibility into why.

This wasn't theoretical. As we onboarded more tenants, the pattern became real:

  • Smaller tenants experienced noticeable delays whenever larger tenants were busy.
  • Large catalog syncs from one tenant could hold up all other background work.
  • There was no way to offer differentiated service — every tenant was treated identically, regardless of their plan or size.

We needed a way to restrict how many jobs a single tenant could run in parallel, at the queue level.


The Solution: Per-Tenant Job Slot Tracking

The idea was simple:

Each tenant gets a configurable maximum number of concurrent jobs per queue. If a job tries to run and the tenant is already at their limit, reschedule it for later and let the next tenant go.

We introduced a database record per tenant per queue, tracking two things:

  • max_job — how many parallel jobs this tenant is allowed
  • current_job — how many are currently running

Before any background job executes, it must acquire a slot. When it finishes — or fails — it releases the slot.

For most tenants, the default is max_job: 1. Single-job concurrency. Simple, predictable, fair.
For premium tenants with higher throughput requirements, we raise the limit — no code change required, just a configuration update.

The limit is also queue-aware. A tenant's slot count on the CSV import queue is tracked independently from the sync queue. One slow upload doesn't block a faster downstream sync for the same tenant.
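
To make the shape of that record concrete, here is a minimal in-memory sketch. It is not the production code: the real record lives in a database table, and the names TenantQueueLimit, LimitRegistry, the tenant ids, and the queue names are all hypothetical. Only max_job and current_job come from the description above.

```ruby
# Hypothetical in-memory stand-in for the per-tenant, per-queue record.
# Only max_job and current_job come from the article; other names are assumed.
TenantQueueLimit = Struct.new(:tenant_id, :queue, :max_job, :current_job, keyword_init: true)

class LimitRegistry
  DEFAULT_MAX_JOB = 1 # default: single-job concurrency

  def initialize
    @records = {}
  end

  # Find the record for this tenant and queue, creating it with defaults
  # if it does not exist yet.
  def fetch(tenant_id, queue)
    @records[[tenant_id, queue]] ||= TenantQueueLimit.new(
      tenant_id: tenant_id, queue: queue,
      max_job: DEFAULT_MAX_JOB, current_job: 0
    )
  end
end

registry = LimitRegistry.new

# Raising a premium tenant's limit is a data change, not a code change.
registry.fetch("premium-tenant", "sync").max_job = 5

# The same tenant has independent counters on each queue.
registry.fetch("tenant-a", "csv_import").current_job += 1
puts registry.fetch("tenant-a", "csv_import").current_job # prints 1
puts registry.fetch("tenant-a", "sync").current_job       # prints 0
```

Keying the record on the tenant and queue pair is what keeps a slow CSV import from counting against the same tenant's sync slots.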


The Flow

When a job is picked up by a Sidekiq worker, before any real business logic runs:

  1. Look up the tenant's config record for this queue. Create it with defaults if it doesn't exist yet — new tenants need no manual setup.
  2. Check if a slot is available (current_job < max_job).
  3. If not — reschedule the job for 1 minute later and return. The work still gets done; it just waits its turn.
  4. If yes — increment current_job and proceed with the actual job.
  5. When the job finishes (or raises an exception), decrement current_job back.

The decrement in step 5 always happens — even on failures. This is the guarantee that prevents a crashed job from permanently consuming a slot and slowly starving the tenant out of all future work.
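
The five steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual module: SlotStore, with_tenant_slot, and the reschedule callback are hypothetical names, and the store is a Mutex-guarded Hash so the example runs without a database or Sidekiq.

```ruby
# Hypothetical sketch of the acquire / run / release flow. The slot store is
# an in-memory Hash here; the article keeps the counters in a database table.
class SlotStore
  DEFAULT_MAX_JOB = 1

  def initialize
    @lock = Mutex.new
    @rows = Hash.new { |h, key| h[key] = { max_job: DEFAULT_MAX_JOB, current_job: 0 } }
  end

  # Steps 1, 2 and 4: find-or-create the record, then take a slot if one is free.
  def acquire(tenant_id, queue)
    @lock.synchronize do
      row = @rows[[tenant_id, queue]]
      return false unless row[:current_job] < row[:max_job]
      row[:current_job] += 1
      true
    end
  end

  # Step 5: give the slot back.
  def release(tenant_id, queue)
    @lock.synchronize do
      row = @rows[[tenant_id, queue]]
      row[:current_job] -= 1 if row[:current_job] > 0
    end
  end

  def current_job(tenant_id, queue)
    @lock.synchronize { @rows[[tenant_id, queue]][:current_job] }
  end
end

RETRY_DELAY_SECONDS = 60 # "1 minute later"

# `reschedule` is an assumed callback. In a Sidekiq job it would presumably
# wrap something like `self.class.perform_in(RETRY_DELAY_SECONDS, *args)`.
def with_tenant_slot(store, tenant_id, queue, reschedule:)
  unless store.acquire(tenant_id, queue)
    reschedule.call(RETRY_DELAY_SECONDS) # step 3: try again later
    return :rescheduled
  end

  begin
    yield # the actual job
    :done
  ensure
    store.release(tenant_id, queue) # runs even if the job raises
  end
end

store = SlotStore.new
delays = []
reschedule = ->(seconds) { delays << seconds }

# A failing job still releases its slot.
begin
  with_tenant_slot(store, "tenant-a", "csv_import", reschedule: reschedule) { raise "boom" }
rescue RuntimeError
  # swallowed for the demo
end
puts store.current_job("tenant-a", "csv_import") # prints 0

# While one job holds the only slot, a second one is rescheduled.
inner = nil
outer = with_tenant_slot(store, "tenant-a", "csv_import", reschedule: reschedule) do
  inner = with_tenant_slot(store, "tenant-a", "csv_import", reschedule: reschedule) { }
end
```

Putting the release in an `ensure` clause is what ties it to the job's exit path, whether the job returns normally or raises.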


The Race Condition — and How We Handle It

This is where it gets interesting.

At first glance, "check if slots are available, then increment" sounds simple. But in a concurrent system with multiple Sidekiq threads running simultaneously, this is a classic read-modify-write race condition.

Imagine two jobs for the same tenant land in the queue at almost the same time. Both threads check current_job < max_job. Both see the count is within limits. Both decide to proceed. Both increment. Now the tenant is running more concurrent jobs than allowed — the limit is silently violated.
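
The interleaving is easier to see when replayed step by step. The snippet below is a deterministic illustration, not real threaded code: a plain Hash stands in for the database row, and the two "workers" are just variables.

```ruby
# Deterministic replay of the race: both workers read before either writes.
# The hash is a stand-in for the tenant's database row.
row = { max_job: 1, current_job: 0 }

# Each worker reads the counter and decides in application code.
seen_by_a = row[:current_job]
seen_by_b = row[:current_job]

a_proceeds = seen_by_a < row[:max_job]
b_proceeds = seen_by_b < row[:max_job]

# Each worker then increments, unaware of the other.
row[:current_job] += 1 if a_proceeds
row[:current_job] += 1 if b_proceeds

puts "max_job=#{row[:max_job]} current_job=#{row[:current_job]}"
# prints: max_job=1 current_job=2
```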

The naive approach (read the value, check it in application code, then write it back) doesn't hold up under concurrency. There's always a window between the read and the write where another thread can slip through.

The correct fix is to make the check and the increment a single atomic operation at the database level.

Instead of reading the value and then updating it separately, we issue a conditional SQL update that does both in one statement: "increment current_job by 1, but only if current_job is currently less than max_job." The database processes this as one indivisible operation. If two threads race to the same record simultaneously, only one of them will satisfy the condition — the other will see that zero rows were updated, and knows it lost the race.
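
A sketch of what such a statement could look like, with an in-memory stand-in so it runs without a database. The table name tenant_job_limits, the id column, and the FakeRow class are assumptions made for illustration; only max_job and current_job come from the text.

```ruby
# The conditional update written out as SQL. Table and column names other
# than max_job and current_job are assumptions.
ACQUIRE_SQL = <<~SQL
  UPDATE tenant_job_limits
  SET current_job = current_job + 1
  WHERE id = $1 AND current_job < max_job
SQL

# In-memory stand-in so the sketch runs on its own. The Mutex plays the role
# of the row lock the database holds while it executes the UPDATE.
class FakeRow
  attr_reader :current_job

  def initialize(max_job:)
    @max_job = max_job
    @current_job = 0
    @lock = Mutex.new
  end

  # Returns the number of rows affected: 1 if a slot was taken, 0 if not.
  def conditional_increment
    @lock.synchronize do
      next 0 unless @current_job < @max_job
      @current_job += 1
      1
    end
  end
end

row = FakeRow.new(max_job: 3)
results = 10.times.map { Thread.new { row.conditional_increment } }.map(&:value)

puts results.sum     # prints 3: only three of the ten threads got a slot
puts row.current_job # prints 3
```

In a Rails codebase, ActiveRecord's `update_all` returns the number of rows affected, so a `where("current_job < max_job").update_all("current_job = current_job + 1")` call scoped to the tenant's row would presumably be one way to issue this statement and read the result.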

No Redis locks. No application-level mutexes. No external coordination. The database's own row-level locking gives us the guarantee we need, for free.

The same pattern applies on release: the decrement is conditioned on current_job > 0, preventing the counter from going negative if release is somehow called more times than acquire in an unexpected error flow.
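
The release side can be sketched the same way. As before, the table name tenant_job_limits, the id column, and the guarded_release helper are hypothetical, and a Hash stands in for the row.

```ruby
# Guarded decrement as SQL. Names other than current_job are assumptions.
RELEASE_SQL = <<~SQL
  UPDATE tenant_job_limits
  SET current_job = current_job - 1
  WHERE id = $1 AND current_job > 0
SQL

# In-memory equivalent: returns 1 if a slot was released, 0 if the counter
# was already at zero and nothing changed.
def guarded_release(row)
  return 0 unless row[:current_job] > 0
  row[:current_job] -= 1
  1
end

row = { max_job: 1, current_job: 1 }
first  = guarded_release(row) # the matching release
second = guarded_release(row) # an unexpected extra release
puts row[:current_job]        # prints 0, not -1
```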


The Business Impact

This was a small module by any measure. But the effect on fairness across tenants was immediate.

  • No single tenant can monopolize the job queue, regardless of how aggressively they upload or sync.
  • Most tenants run comfortably on the default single-slot limit: simple, one-job-at-a-time behavior.
  • Premium tenants with higher throughput needs get their limit raised with a simple config update.
  • No new infrastructure. No Redis, no distributed lock service — the existing database was enough.

What I Learned

The most effective solutions are often the ones that fit the shape of the system you already have.

We had PostgreSQL. We had Sidekiq. We didn't need a new component.

One database table. One atomic update pattern. One shared module any worker can include.

That was enough to give every tenant a fair share of the job queue — regardless of how active their neighbors were.


Tags: #rails #sidekiq #ruby #backgroundjobs #softwareengineering

Source: dev.to
