The Context: What is Commerce Connect?
At Casa Retail AI, we have an internal platform called Commerce Connect (CC).
Commerce Connect acts as the central Product Information Management (PIM) system — the single source of truth for product data across our entire retail ecosystem. Built on top of a customized version of the open-source e-commerce platform Spree Commerce, it is extended with multi-vendor and multi-tenant capabilities.
Its job is straightforward:
Collect product information from multiple retail ecosystems and distribute it to every Casa product that needs it — PostgreSQL for operational modules, ClickHouse for analytics, and external B2B integrations for partner systems.
All of this synchronization work happens in the background using Sidekiq.
The Problem: A Shared Queue With No Per-Tenant Guardrails
Sidekiq processes jobs concurrently using a fixed pool of worker threads. At any given time, there is a hard cap on how many jobs can run simultaneously across the entire system.
In practice, all of the following competed for that same pool:
- A tenant uploading multiple CSV files in quick succession
- A tenant triggering multiple sync operations to different external systems
- Multiple tenants doing all of this at the same time
There was nothing stopping a single active tenant from claiming the majority of worker slots. If one tenant fired several heavy jobs back-to-back, every other tenant's jobs sat in the queue waiting — with no visibility into why.
This wasn't theoretical. As we onboarded more tenants, the pattern became real:
- Smaller tenants experienced noticeable delays whenever larger tenants were highly active.
- Large catalog syncs from one tenant could hold up all other background work.
- There was no way to offer differentiated service — every tenant was treated identically, regardless of their plan or size.
We needed a way to restrict how many jobs a single tenant could run in parallel, at the queue level.
The Solution: Per-Tenant Job Slot Tracking
The idea was simple:
Each tenant gets a configurable maximum number of concurrent jobs per queue. If a job tries to run and the tenant is already at their limit, reschedule it for later and let the next tenant go.
We introduced a database record per tenant per queue, tracking two things:
- max_job: how many parallel jobs this tenant is allowed
- current_job: how many are currently running
Before any background job executes, it must acquire a slot. When it finishes — or fails — it releases the slot.
For most tenants, the default is max_job: 1. Single-job concurrency. Simple, predictable, fair.
For premium tenants with higher throughput requirements, we raise the limit — no code change required, just a configuration update.
The limit is also queue-aware. A tenant's slot count on the CSV import queue is tracked independently from the sync queue. One slow upload doesn't block a faster downstream sync for the same tenant.
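A minimal sketch of that per-tenant, per-queue record, using plain Ruby in place of the real database table. Only max_job and current_job come from the design described here; SlotConfig, SlotRegistry, DEFAULT_MAX_JOB, the tenant ID, and the queue names are hypothetical and chosen for illustration.

```ruby
# Hypothetical in-memory stand-in for the per-tenant, per-queue table.
SlotConfig = Struct.new(:tenant_id, :queue, :max_job, :current_job, keyword_init: true)

class SlotRegistry
  DEFAULT_MAX_JOB = 1 # default single-job concurrency

  def initialize
    @rows = {}
  end

  # Find the row for (tenant, queue), creating it with defaults if missing.
  def fetch(tenant_id, queue)
    @rows[[tenant_id, queue]] ||= SlotConfig.new(
      tenant_id: tenant_id, queue: queue,
      max_job: DEFAULT_MAX_JOB, current_job: 0
    )
  end

  # Raising a premium tenant's limit is a data change, not a code change.
  def set_limit(tenant_id, queue, max_job)
    fetch(tenant_id, queue).max_job = max_job
  end
end

registry = SlotRegistry.new
registry.set_limit("tenant_a", "sync", 3)            # premium limit on one queue
csv_row  = registry.fetch("tenant_a", "csv_import")  # other queue keeps the default
sync_row = registry.fetch("tenant_a", "sync")
```

Keying the record on the tenant and queue pair is what keeps the two counters independent: changing the sync limit leaves the CSV import row at its default.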
The Flow
When a job is picked up by a Sidekiq worker, before any real business logic runs:
1. Look up the tenant's config record for this queue. Create it with defaults if it doesn't exist yet; new tenants need no manual setup.
2. Check if a slot is available (current_job < max_job).
3. If not, reschedule the job for 1 minute later and return. The work still gets done; it just waits its turn.
4. If yes, increment current_job and proceed with the actual job.
5. When the job finishes (or raises an exception), decrement current_job back.
The decrement in step 5 always happens — even on failures. This is the guarantee that prevents a crashed job from permanently consuming a slot and slowly starving the tenant out of all future work.
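The flow above can be sketched in plain Ruby, without Sidekiq or a database, to show where the release guarantee comes from. SlotStore, with_tenant_slot, RETRY_DELAY_SECONDS, and the reschedule callable are hypothetical names, not the actual module. In a real Sidekiq worker the callable would presumably wrap a call such as perform_in(60, *args).

```ruby
# Hypothetical in-memory slot store; the real system keeps this in a database row.
class SlotStore
  Row = Struct.new(:max_job, :current_job)

  def initialize
    # Step 1: a missing row is created with defaults on first access.
    @rows = Hash.new { |hash, key| hash[key] = Row.new(1, 0) }
    @lock = Mutex.new # stands in for the database's atomicity
  end

  def acquire(tenant_id, queue)
    @lock.synchronize do
      row = @rows[[tenant_id, queue]]
      return false unless row.current_job < row.max_job # steps 2 and 3
      row.current_job += 1                               # step 4
      true
    end
  end

  def release(tenant_id, queue)
    @lock.synchronize do
      row = @rows[[tenant_id, queue]]
      row.current_job -= 1 if row.current_job > 0        # step 5
    end
  end

  def current(tenant_id, queue)
    @lock.synchronize { @rows[[tenant_id, queue]].current_job }
  end
end

RETRY_DELAY_SECONDS = 60

# Runs the block only if the tenant has a free slot; otherwise asks the
# caller-supplied `reschedule` callable to retry later.
def with_tenant_slot(store, tenant_id, queue, reschedule:)
  unless store.acquire(tenant_id, queue)
    reschedule.call(RETRY_DELAY_SECONDS)
    return :rescheduled
  end

  begin
    yield
    :done
  ensure
    store.release(tenant_id, queue) # runs on success and on exceptions
  end
end

store = SlotStore.new
delays = []
reschedule = ->(seconds) { delays << seconds }

inner_outcome = nil
outer_outcome = with_tenant_slot(store, "tenant_a", "sync", reschedule: reschedule) do
  # A second job for the same tenant arrives while the first is still running.
  inner_outcome = with_tenant_slot(store, "tenant_a", "sync", reschedule: reschedule) { }
end

begin
  with_tenant_slot(store, "tenant_a", "sync", reschedule: reschedule) { raise "boom" }
rescue RuntimeError
  # The job failed, but the ensure block has already released its slot.
end
slots_in_use = store.current("tenant_a", "sync")
```

Putting the release in an ensure block, rather than at the end of the happy path, is what makes the decrement unconditional.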
The Race Condition — and How We Handle It
This is where it gets interesting.
At first glance, "check if slots are available, then increment" sounds simple. But in a concurrent system with multiple Sidekiq threads running simultaneously, this is a classic read-modify-write race condition.
Imagine two jobs for the same tenant land in the queue at almost the same time. Both threads check current_job < max_job. Both see the count is within limits. Both decide to proceed. Both increment. Now the tenant is running more concurrent jobs than allowed — the limit is silently violated.
The naive approach (read the value, check it in application code, then write it back) doesn't hold up under concurrency. There's always a window between the read and the write where another thread can slip through.
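To make that window concrete, here is a deterministic replay of the interleaving in plain Ruby. It uses no real threads, because a real race would not reproduce reliably; the two "workers" are just variables, and Row is a hypothetical stand-in for the database record.

```ruby
# Deterministic replay of the race: both workers read before either writes.
Row = Struct.new(:max_job, :current_job)
row = Row.new(1, 0)

# Worker A and worker B each read the counter (the "check" half).
seen_by_a = row.current_job
seen_by_b = row.current_job

a_proceeds = seen_by_a < row.max_job # A saw 0, so it proceeds
b_proceeds = seen_by_b < row.max_job # B also saw 0, a value that is about to be stale

# Each worker then writes back its own incremented copy (the "act" half).
row.current_job = seen_by_a + 1 if a_proceeds
row.current_job = seen_by_b + 1 if b_proceeds

running_jobs = [a_proceeds, b_proceeds].count(true)
```

With max_job set to 1, both workers end up running. In this write-back variant the stored counter also ends at 1, so the record understates how many jobs are actually in flight.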
The correct fix is to make the check and the increment a single atomic operation at the database level.
Instead of reading the value and then updating it separately, we issue a conditional SQL update that does both in one statement: "increment current_job by 1, but only if current_job is currently less than max_job." The database processes this as one indivisible operation. If two threads race for the same record with one slot left, only one of them will satisfy the condition; the other will see that zero rows were updated and know it lost the race.
No Redis locks. No application-level mutexes. No external coordination. The database's own row-level locking gives us the guarantee we need, for free.
The same pattern applies on release: the decrement is conditioned on current_job > 0, preventing the counter from going negative if release is somehow called more times than acquire in an unexpected error flow.
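The acquire and release statements might look like the SQL below, shown here as Ruby constants next to a small in-memory model of their semantics. This is a sketch under assumptions: the table name tenant_job_limits, the tenant_id and queue columns, and the FakeLimitRow class are all hypothetical, and the Mutex only imitates the row lock the database would take. In a Rails codebase the same statement could presumably be issued through update_all, which returns the number of affected rows.

```ruby
# SQL that the sketch models. Names other than max_job and current_job
# are assumptions for illustration.
ACQUIRE_SQL = <<~SQL
  UPDATE tenant_job_limits
     SET current_job = current_job + 1
   WHERE tenant_id = $1 AND queue = $2
     AND current_job < max_job
SQL

RELEASE_SQL = <<~SQL
  UPDATE tenant_job_limits
     SET current_job = current_job - 1
   WHERE tenant_id = $1 AND queue = $2
     AND current_job > 0
SQL

# In-memory stand-in for one row. The Mutex plays the role of the row lock
# held while the WHERE clause is evaluated and the SET is applied.
class FakeLimitRow
  def initialize(max_job)
    @max_job = max_job
    @current_job = 0
    @row_lock = Mutex.new
  end

  def current_job
    @row_lock.synchronize { @current_job }
  end

  # Mirrors ACQUIRE_SQL: returns the number of rows updated (1 or 0).
  def acquire
    @row_lock.synchronize do
      next 0 unless @current_job < @max_job
      @current_job += 1
      1
    end
  end

  # Mirrors RELEASE_SQL: never lets the counter drop below zero.
  def release
    @row_lock.synchronize do
      next 0 unless @current_job > 0
      @current_job -= 1
      1
    end
  end
end

row = FakeLimitRow.new(2)

# Ten threads race for two slots; each gets back a "rows updated" count.
results = Array.new(10) { Thread.new { row.acquire } }.map(&:value)
winners = results.count(1)
count_after_acquire = row.current_job

# Releasing more times than we acquired must not push the counter negative.
5.times { row.release }
count_after_release = row.current_job
```

The caller never inspects the counter itself. It only checks whether the update touched a row, which is the signal for "slot acquired" or "lost the race".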
The Business Impact
This was a small module by any measure. But the effect on fairness across tenants was immediate.
- No single tenant can monopolize the job queue, regardless of how aggressively they upload or sync.
- Most tenants run comfortably on the default single-slot limit — simple, first-in-first-out behavior.
- Premium tenants with higher throughput needs get their limit raised with a simple config update.
- No new infrastructure. No Redis, no distributed lock service — the existing database was enough.
What I Learned
The most effective solutions are often the ones that fit the shape of the system you already have.
We had PostgreSQL. We had Sidekiq. We didn't need a new component.
One database table. One atomic update pattern. One shared module any worker can include.
That was enough to give every tenant a fair share of the job queue — regardless of how active their neighbors were.
Tags: #rails #sidekiq #ruby #backgroundjobs #softwareengineering