Day 14 of building OrderHub in the open. So far we have a monolith with real persistence, validation, exception handling, OpenAPI docs, a Redis cache, and a rate limiter. The rate limiter protects us from misbehaving clients. Today we add the piece that keeps the whole thing standing when it's a dependency that falls over instead: a circuit breaker.
Here's the scenario. Placing an order needs a stock check, so the order flow calls an inventory service. That service gets slow, or starts throwing 5xxs. The naive reaction is the dangerous one: keep calling it, keep retrying. Threads pile up blocked on a dead dependency, timeouts stack, connection pools drain, and within seconds a problem in inventory has become an outage in OrderHub. That cascade is how small failures become big ones.
A circuit breaker is a fuse for service calls. It watches how recent calls have gone, and when too many fail it stops calling the dependency for a while — returning a fast fallback instead. The struggling service gets room to recover, and your service stays responsive. Isolation instead of propagation.
The three states
The breaker is a small state machine with three states.
CLOSED is normal operation. Calls pass straight through to the dependency, and the breaker quietly records each outcome — success or failure — in a sliding window.
OPEN is the protective state. Once the failure rate in that window crosses a threshold, the breaker trips. Now every call is short-circuited: it returns the fallback immediately without touching the dependency at all. This lasts for a configured wait duration. The point is that a service that's drowning isn't being hammered with more traffic while it tries to recover.
HALF-OPEN is the probe. After the wait elapses, the breaker lets a small number of trial calls through. If they succeed, it goes back to CLOSED. If they fail, it snaps back to OPEN for another cooldown. That loop — OPEN, probe, close or re-open — is the entire recovery mechanism, and it runs itself.
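To make the loop concrete, here is a minimal hand-rolled sketch of the three states in plain Java. It is a teaching toy, not how Resilience4j is implemented: the class name TinyBreaker and all of its parameters are made up for illustration, it trips on consecutive failures rather than a windowed failure rate, it re-opens on the first failed probe, and it doesn't cap how many probes run at once in HALF-OPEN.

```java
import java.util.function.LongSupplier;

/** Illustrative three-state breaker. A teaching sketch, not Resilience4j's implementation. */
final class TinyBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failuresToTrip;  // simplification: consecutive failures, not a windowed rate
    private final long waitMillis;     // how long to stay OPEN before probing
    private final int trialCalls;      // successful probes needed to close again
    private final LongSupplier clock;  // injected so time can be controlled in tests

    private State state = State.CLOSED;
    private int failures;
    private int trialSuccesses;
    private long openedAt;

    TinyBreaker(int failuresToTrip, long waitMillis, int trialCalls, LongSupplier clock) {
        this.failuresToTrip = failuresToTrip;
        this.waitMillis = waitMillis;
        this.trialCalls = trialCalls;
        this.clock = clock;
    }

    /** True if the caller may invoke the dependency; false means use the fallback. */
    synchronized boolean tryAcquire() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= waitMillis) {
            state = State.HALF_OPEN; // cooldown elapsed: start probing
            trialSuccesses = 0;
        }
        return state != State.OPEN;
    }

    synchronized void onSuccess() {
        if (state == State.HALF_OPEN) {
            if (++trialSuccesses >= trialCalls) {
                state = State.CLOSED; // enough probes succeeded
                failures = 0;
            }
        } else if (state == State.CLOSED) {
            failures = 0; // a success while CLOSED resets the streak
        }
    }

    synchronized void onFailure() {
        boolean probeFailed = state == State.HALF_OPEN;
        boolean streakTooLong = state == State.CLOSED && ++failures >= failuresToTrip;
        if (probeFailed || streakTooLong) {
            state = State.OPEN; // trip, or snap back after a failed probe
            openedAt = clock.getAsLong();
            failures = 0;
        }
    }

    synchronized State state() { return state; }
}
```

A caller would ask tryAcquire() before each downstream call, then report the outcome with onSuccess() or onFailure(). Passing System::currentTimeMillis as the clock gives real time; a test can pass a counter it controls.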
The sliding window
The interesting decision is how the breaker decides to trip, and that's the sliding window. A count-based window keeps the last N calls; a time-based one keeps the calls from the last N seconds. The failure rate is simply failures divided by total over that window.
Two knobs make this behave well. failureRateThreshold is the percentage that trips the breaker: at 50% over a window of 10, the breaker opens once five of the last ten calls have failed. minimumNumberOfCalls stops it evaluating (and tripping) before there's enough data. Without it, two unlucky failures on a cold start would count as a 100% failure rate and open the breaker on a system that has barely served any traffic. You want the breaker to react to a trend, not to noise.
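The arithmetic is easy to sketch. Below is a hypothetical count-based window in plain Java: a ring buffer of the last N outcomes, a failure rate, and a minimum-calls guard. The class and method names (CountWindow, shouldTrip) are invented for this example and the code is only an approximation of the idea, not Resilience4j's internals.

```java
/** Illustrative count-based sliding window. A sketch, not Resilience4j's internals. */
final class CountWindow {
    private final boolean[] failed;  // ring buffer: true where the call in that slot failed
    private final int minimumCalls;  // do not judge before this many calls are recorded
    private long recorded;           // total calls seen so far
    private int failures;            // failures currently inside the window

    CountWindow(int size, int minimumCalls) {
        this.failed = new boolean[size];
        this.minimumCalls = minimumCalls;
    }

    void record(boolean callFailed) {
        int slot = (int) (recorded % failed.length);
        // Once the buffer is full, the slot we overwrite holds the oldest outcome.
        if (recorded >= failed.length && failed[slot]) {
            failures--;
        }
        failed[slot] = callFailed;
        if (callFailed) {
            failures++;
        }
        recorded++;
    }

    /** Failure rate in percent, or -1 while there is too little data to judge. */
    double failureRate() {
        long calls = Math.min(recorded, failed.length);
        if (calls < minimumCalls) {
            return -1;
        }
        return 100.0 * failures / calls;
    }

    boolean shouldTrip(double thresholdPercent) {
        double rate = failureRate();
        return rate >= 0 && rate >= thresholdPercent;
    }
}
```

With a window of 10 and a minimum of 5, two failures on a cold start report a rate of -1 and trip nothing. Only once five calls are in does the percentage start to count.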
There's a second dimension: slow calls. A dependency doesn't have to error to be a problem: one that takes eight seconds to answer is arguably worse than one that fails fast. Resilience4j tracks this alongside the failure rate. Calls that take longer than slowCallDurationThreshold count as slow, and the breaker also trips when the share of slow calls in the window reaches slowCallRateThreshold, so it reacts to latency, not just errors.
Wiring it in Spring Boot
Resilience4j is the modern, lightweight fault-tolerance library for the JVM. The resilience4j-spring-boot3 starter registers everything and binds each breaker's config from application.yml. One thing that trips people up: the annotations are implemented with AOP, so you also need spring-boot-starter-aop on the classpath — without it, @CircuitBreaker compiles fine and does absolutely nothing.
The guarded call is just an annotation:
@CircuitBreaker(name = "inventory", fallbackMethod = "checkStockFallback")
public InventoryStatus checkStock(String item) {
    // the real (or here, stubbed flaky) downstream call
    if (downstreamFails()) throw new InventoryUnavailableException(item);
    return InventoryStatus.live(item, true);
}

// same args + a trailing Throwable, same return type
InventoryStatus checkStockFallback(String item, Throwable cause) {
    return InventoryStatus.degraded(item); // available, but flagged degraded
}
The fallback is the whole reason this is graceful. Its signature must match the guarded method plus a trailing Throwable, and return the same type. Resilience4j calls it whenever the real call throws or the breaker is OPEN and short-circuits. Ours returns a degraded stock status — optimistically available, but clearly flagged — so an order can still be placed and reconciled later. A 200 with degraded: true, never a 500. The fallback must be cheap and must never throw.
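If the annotation feels like magic, the contract behind it can be approximated in a few lines of plain Java. This is only a sketch of the behaviour described above, not the library's aspect: Guarded, CallRejected and the breakerOpen flag are hypothetical names. In Resilience4j itself, the exception handed to the fallback while the breaker is OPEN is CallNotPermittedException.

```java
import java.util.function.BiFunction;
import java.util.function.BooleanSupplier;
import java.util.function.Function;

/** Illustrative wrapper showing when a fallback runs. Not the Resilience4j aspect. */
final class Guarded {

    /** Stand-in for the rejection raised while the breaker is open (hypothetical name). */
    static final class CallRejected extends RuntimeException {
        CallRejected() {
            super("breaker is open");
        }
    }

    static <A, R> R call(A arg,
                         BooleanSupplier breakerOpen,
                         Function<A, R> real,
                         BiFunction<A, Throwable, R> fallback) {
        try {
            if (breakerOpen.getAsBoolean()) {
                throw new CallRejected(); // short-circuit: the dependency is never touched
            }
            return real.apply(arg);
        } catch (RuntimeException cause) {
            // Same argument plus the cause, same return type as the real call.
            return fallback.apply(arg, cause);
        }
    }
}
```

Both paths land in the same place: whether the downstream call threw or was never attempted, the caller gets the fallback's value rather than an exception. That is also why the fallback itself must not throw, since nothing sits behind it to catch anything.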
The tuning lives in config, and it's genuinely different per environment:
resilience4j.circuitbreaker.instances.inventory:
  sliding-window-size: 10
  minimum-number-of-calls: 5
  failure-rate-threshold: 50
  wait-duration-in-open-state: 10s
  permitted-number-of-calls-in-half-open-state: 3
  automatic-transition-from-open-to-half-open-enabled: true
Dev uses a small window and a short cooldown so you can trip it and watch it recover in seconds while clicking around. Prod uses a larger window — statistically stable, so one unlucky error can't trip it — and a longer cooldown to give a genuinely struggling dependency real time to heal.
Seeing it work
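With Actuator on the classpath, the breaker isn't a black box. It can contribute a health indicator (switched on with register-health-indicator: true on the instance plus management.health.circuitbreakers.enabled: true), and once the endpoint is exposed, /actuator/circuitbreakerevents returns a buffer of recent events: state transitions, successful calls, errors, and failure-rate-exceeded events. That lets you watch CLOSED to OPEN to HALF-OPEN to CLOSED happen live.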
And it's tested. A plain unit test builds a breaker with a tiny deterministic window, drives the flaky client to 100% failure and asserts it opens; confirms that while OPEN the downstream is never called and the caller still gets the degraded fallback; then sets the dependency healthy, waits out the cooldown, and asserts the half-open trials close it again. No Spring context, no Docker — the thing under test is the state machine.
A breaker handles errors. Next up (Day 15) is what handles hangs and blips: pairing it with a timeout, a retry, and a bulkhead.
The interactive version — a live breaker you can trip with a failure-rate slider and watch recover — plus the full code walkthrough is here: https://dev48v.infy.uk/orderhub.php