Retry, timeout and bulkhead: composing Resilience4j the right way

Day 15 of building OrderHub in the open. Yesterday I added a circuit breaker around the inventory call, so a dead dependency can't drag the whole API down with it. But a breaker is a blunt, late instrument: it waits until something is clearly broken and then stops calling it. Most real trouble is smaller and earlier — a single dropped packet, one 500ms hiccup, a brief burst of concurrency. Those don't trip a breaker, and shouldn't, yet handled naively they still hurt users. Today's three patterns catch exactly those moments: retry, timeout, and bulkhead. Then the interesting part — composing all four together in the right order.

Retry: a few more goes, politely

A retry gives a transient fault a second and third chance before you give up. max-attempts: 3 means the original call plus two retries. The trap is that a retry is a load multiplier: retry blindly and a struggling dependency can get hit up to three times as hard, precisely when it can least afford it. Two things make it safe.

First, how you wait. Retrying immediately, in a tight loop, is the worst thing you can do to a service that's already gasping. Exponential backoff widens the gap each time (300ms, then 600ms; a fourth attempt would wait 1200ms, but with max-attempts: 3 there are only two waits), and jitter randomises each wait so that a thousand clients which all failed at the same instant don't all retry at the same instant. That synchronised thundering herd just recreates the spike. Resilience4j folds both into one exponential-random interval.
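To make the arithmetic concrete, here is a minimal plain-Java sketch of the backoff schedule. It is not Resilience4j's implementation: BackoffSketch, its method names and the 0.5 jitter factor in the usage note are my own assumptions for illustration.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Toy backoff calculator (hypothetical names); not Resilience4j's implementation. */
final class BackoffSketch {

    /** Un-jittered wait before retry number `retry` (1-based): base * multiplier^(retry - 1). */
    static long exponentialMillis(long baseMillis, double multiplier, int retry) {
        return Math.round(baseMillis * Math.pow(multiplier, retry - 1));
    }

    /** The same wait, spread uniformly over [wait * (1 - factor), wait * (1 + factor)); factor must be > 0. */
    static long jitteredMillis(long baseMillis, double multiplier, int retry, double factor) {
        long wait = exponentialMillis(baseMillis, multiplier, retry);
        double offset = ThreadLocalRandom.current().nextDouble(-factor, factor);
        return Math.round(wait * (1.0 + offset));
    }
}
```

With a 300ms base and a multiplier of 2, exponentialMillis gives 300 and 600 for the two retries that max-attempts: 3 allows. With an assumed factor of 0.5, jitteredMillis puts the second wait somewhere between 300 and 900, so clients that failed together drift apart.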

Second, which exceptions you retry. Only retry a fault a retry could actually fix. A transient 5xx or timeout? Yes. A 4xx-style bad request — here, a blank item name that throws InvalidItemException — never, because it's deterministic and a retry just fails identically.

resilience4j.retry.instances.inventory:
  max-attempts: 3                   # original call + 2 retries
  wait-duration: 300ms              # base wait before the first retry
  enable-exponential-backoff: true
  exponential-backoff-multiplier: 2 # each wait doubles the previous one
  enable-randomized-wait: true      # jitter
  retry-exceptions:                 # transient faults worth another attempt
    - dev.dev48v.orderhub.inventory.InventoryUnavailableException
    - java.util.concurrent.TimeoutException
  ignore-exceptions:                # deterministic faults, never retried
    - dev.dev48v.orderhub.inventory.InvalidItemException
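
The retry/ignore split boils down to a predicate over the exception. Below is a rough plain-Java sketch of that idea, with no backoff and no Resilience4j. RetrySketch and callWithRetry are hypothetical names, and the library's real logic is richer than this.

```java
import java.util.concurrent.Callable;
import java.util.function.Predicate;

/** Toy retry loop (hypothetical names); not Resilience4j's implementation. */
final class RetrySketch {

    /** Calls `work` up to maxAttempts times, retrying only when `retryable` accepts the exception. */
    static <T> T callWithRetry(Callable<T> work, int maxAttempts,
                               Predicate<Exception> retryable) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return work.call();
            } catch (Exception e) {
                // Give up on the last attempt, or at once for a deterministic fault.
                if (attempt >= maxAttempts || !retryable.test(e)) {
                    throw e;
                }
                // A real retry would sleep for the backoff interval here.
            }
        }
    }
}
```

A predicate such as e -> e instanceof java.io.IOException stands in for the retry-exceptions list: a call that keeps throwing IOException is attempted three times, while one that throws IllegalArgumentException is attempted once and rethrown.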

Time limiter: a hard deadline

A call that hangs is worse than one that errors: it ties up a thread and a connection indefinitely while the user stares at a spinner. A time limiter puts a hard deadline on the call, but with the annotation it can only do that if the call is asynchronous, so the guarded method has to return a CompletableFuture. The work runs on another thread; if the future hasn't completed within timeout-duration, the caller gets a TimeoutException instead of a result. One caveat: the deadline frees the caller, not necessarily the worker. Cancelling a CompletableFuture does not interrupt the thread running its supplier, so the underlying call also needs its own client-side timeout if the work itself has to stop. The TimeoutException then becomes a first-class outcome the retry and the breaker both react to.
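
The deadline mechanics can be approximated with the JDK alone. This sketch uses CompletableFuture.orTimeout (Java 9+) as a stand-in, not Resilience4j's TimeLimiter. DeadlineSketch, withDeadline and slowCall are hypothetical names, and "IN_STOCK" is a placeholder result.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Toy deadline wrapper built on the JDK (hypothetical names); not Resilience4j's TimeLimiter. */
final class DeadlineSketch {

    /** Runs `work` on another thread and fails the returned future if the deadline passes first. */
    static <T> CompletableFuture<T> withDeadline(Supplier<T> work, long timeoutMillis) {
        return CompletableFuture.supplyAsync(work)
                // Completes the future with a TimeoutException if it is still pending at the deadline.
                .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    /** Hypothetical slow dependency, used only for the demonstration. */
    static String slowCall(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "IN_STOCK";
    }
}
```

Calling withDeadline(() -> slowCall(1000), 100).join() throws a CompletionException whose cause is a TimeoutException, while a 10ms call under a 2000ms deadline returns normally. In the timed-out case the supplier keeps sleeping on its worker thread, which is the caveat about runaway work in miniature.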

Bulkhead: cap the concurrency

If the inventory service goes slow, every incoming request piles up waiting on it, and eventually the whole app's thread pool is exhausted — a problem in one dependency becomes total unavailability. A bulkhead, named after a ship's watertight compartments, caps how many calls to that dependency can be in flight at once. The semaphore variant is a simple permit count; the call over the limit is rejected immediately with BulkheadFullException and diverted to the fallback, so threads stay free for everything else. max-wait-duration: 0 means fail fast rather than queue.
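
The semaphore variant is small enough to sketch with java.util.concurrent.Semaphore. This is a toy under my own names (BulkheadSketch, call), not the library's code. It returns the fallback directly where Resilience4j would throw BulkheadFullException and let the fallback method handle it.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Toy semaphore bulkhead (hypothetical names); not Resilience4j's implementation. */
final class BulkheadSketch {
    private final Semaphore permits;

    BulkheadSketch(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /** Runs `work` if a permit is free right now; otherwise returns the fallback without waiting. */
    <T> T call(Supplier<T> work, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {      // zero wait: reject instead of queueing
            return fallback.get();
        }
        try {
            return work.get();
        } finally {
            permits.release();            // always hand the permit back
        }
    }
}
```

tryAcquire() with no timeout is the fail-fast behaviour of a zero max wait: with two permits, a third call that arrives while two are still in flight gets the fallback straight away, and calls are admitted again once a permit is released.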

Composition: order is the whole point

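Stack all four annotations and Resilience4j applies them in its default aspect order, outer to inner: Retry ( CircuitBreaker ( TimeLimiter ( Bulkhead ( call ) ) ) ). The order the annotations appear in the source file doesn't decide this; the aspect order does, and it can be changed in configuration. That ordering isn't cosmetic.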

Retry is outermost, so each attempt is a fresh pass through the breaker. That means once the breaker is OPEN a retry fails fast with CallNotPermittedException instead of hammering, and every attempt is recorded in the breaker's window. If retry were inside the breaker, all three attempts would look like one call and retries could mask a failing dependency.
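
A toy version makes the difference countable. In this sketch recording stands in for the breaker's failure window and retrying for the retry; both are hypothetical plain-Java decorators of my own, not Resilience4j's classes.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Toy decorators showing why nesting order matters; not Resilience4j's implementation. */
final class OrderSketch {

    /** Stand-in for a breaker's window: counts every failure that passes through it. */
    static <T> Supplier<T> recording(AtomicInteger failures, Supplier<T> inner) {
        return () -> {
            try {
                return inner.get();
            } catch (RuntimeException e) {
                failures.incrementAndGet();
                throw e;
            }
        };
    }

    /** Calls `inner` up to maxAttempts times (at least 1), rethrowing the last failure. */
    static <T> Supplier<T> retrying(int maxAttempts, Supplier<T> inner) {
        return () -> {
            RuntimeException last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    return inner.get();
                } catch (RuntimeException e) {
                    last = e;
                }
            }
            throw last;
        };
    }

    /** Failures the recorder sees for one composed call to an always-failing dependency. */
    static int failuresSeen(boolean retryOutside) {
        AtomicInteger failures = new AtomicInteger();
        Supplier<String> broken = () -> { throw new IllegalStateException("inventory down"); };
        Supplier<String> composed = retryOutside
                ? retrying(3, recording(failures, broken))   // Retry ( Breaker ( call ) )
                : recording(failures, retrying(3, broken));  // Breaker ( Retry ( call ) )
        try {
            composed.get();
        } catch (RuntimeException expected) {
            // The call fails either way; only the recorded count differs.
        }
        return failures.get();
    }
}
```

failuresSeen(true) returns 3 and failuresSeen(false) returns 1: with retry on the outside the window sees every attempt, with retry on the inside three failed attempts collapse into a single recorded failure.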

CircuitBreaker sits outside TimeLimiter, so a TimeoutException counts as a breaker failure — a dependency that's merely slow, never erroring, still trips it. The alternative order (time limiter outside the breaker) wouldn't feed timeouts into the failure window, which is worse for a "slow, not dead" dependency. That's the tradeoff, and it's why I kept the default.

Bulkhead is innermost, so its permit maps one-to-one to a real in-flight call.

One more subtlety: the fallbackMethod goes on the outermost annotation, @Retry. That way retries run first, and the graceful degraded answer is produced only once they're exhausted. Put the fallback on the inner breaker and it would recover the exception into a "success" before retry ever saw a failure — so nothing would retry.
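
The same kind of toy shows the fallback trap. withFallback and retrying are hypothetical plain-Java decorators again, and "UNKNOWN" is a made-up degraded value; none of it is Resilience4j's code.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Toy decorators showing where a fallback belongs; not Resilience4j's implementation. */
final class FallbackPlacementSketch {

    /** Converts any failure of `inner` into the given degraded value. */
    static <T> Supplier<T> withFallback(T degraded, Supplier<T> inner) {
        return () -> {
            try {
                return inner.get();
            } catch (RuntimeException e) {
                return degraded;
            }
        };
    }

    /** Calls `inner` up to maxAttempts times (at least 1), rethrowing the last failure. */
    static <T> Supplier<T> retrying(int maxAttempts, Supplier<T> inner) {
        return () -> {
            RuntimeException last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    return inner.get();
                } catch (RuntimeException e) {
                    last = e;
                }
            }
            throw last;
        };
    }

    /** Counts how often an always-failing dependency is invoked for one composed call. */
    static int attemptsMade(boolean fallbackOutermost) {
        AtomicInteger attempts = new AtomicInteger();
        Supplier<String> broken = () -> {
            attempts.incrementAndGet();
            throw new IllegalStateException("inventory down");
        };
        Supplier<String> composed = fallbackOutermost
                ? withFallback("UNKNOWN", retrying(3, broken))   // fallback after retries
                : retrying(3, withFallback("UNKNOWN", broken));  // fallback hides the failure
        composed.get(); // yields the degraded value in both arrangements
        return attempts.get();
    }
}
```

attemptsMade(true) returns 3 and attemptsMade(false) returns 1. Both arrangements hand back the degraded value, but only the outermost fallback lets the retries happen first.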

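// Default aspect order, outer to inner: Retry > CircuitBreaker > TimeLimiter > Bulkhead.
// The fallback sits on the outermost aspect, so it runs only after retries are exhausted.
@Retry(name = "inventory", fallbackMethod = "checkStockResilientFallback")
@CircuitBreaker(name = "inventory")
@TimeLimiter(name = "inventory")   // needs an async return type such as CompletableFuture
@Bulkhead(name = "inventory", type = Bulkhead.Type.SEMAPHORE)
public CompletableFuture<InventoryStatus> checkStockResilient(String item) {
    // Runs on the common ForkJoinPool, since no executor is supplied.
    return CompletableFuture.supplyAsync(() -> doCheckStock(item));
}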

Each pattern got a plain unit test on the Resilience4j primitive — no Spring, no Docker. Retry: a broken downstream is called exactly maxAttempts times, then falls back; a non-transient fault is tried once. Timeout: a call slower than the deadline is abandoned. Bulkhead: six callers, two permits, four rejected. Composition: after one composed call, the breaker's window holds every attempt — proving retry sits outside it.

That wraps Phase 2. Next up, Day 16: idempotency keys, so those retries can safely hit a POST without creating a duplicate order.

The interactive version — a retry timeline with live backoff, a call racing a timeout deadline, and a semaphore you can saturate — plus the full walkthrough is here: https://dev48v.infy.uk/orderhub.php

Code: https://github.com/dev48v/order-hub-from-zero