Saga Orchestration in Go: Distributed Workflows That Actually Roll Back

Every non-trivial business operation touches more than one system.

An e-commerce order reserves inventory, charges a payment method, and schedules a shipment — three services, three databases. A bank transfer debits one account and credits another across two ledgers that may not even be in the same data center. A cloud VM provisioning workflow reserves a network port, allocates storage, starts the hypervisor, registers billing, and sends a notification — five services, five independent state stores.

The question is: what happens when step four fails after steps one through three have already succeeded?

In a monolith backed by a single database, the answer is simple: roll back the transaction. The database engine guarantees atomicity; either everything commits or nothing does. But when your workflow spans multiple services, each owning its own storage, there is no transaction boundary that wraps them all. There is no rollback button. Step one through three have already made durable changes to systems that do not know about each other, and step four's failure has left the system in an inconsistent state.

This is not a pathological edge case. It is the default condition in any distributed architecture. And it gets worse: the failure might not be a hard error. The network might time out. The billing service might return a 503. You do not know whether step four applied its effect or not — you only know you did not receive a success response. Now what?

This is the problem sagas were designed for.

  Client         Inventory Svc      Payment Svc      Shipping Svc
    │                  │                 │                 │
 1  │──reserve(item)──►│                 │                 │
    │◄──── 200 OK ─────│                 │                 │
    │              [reserved ✓]          │                 │
    │                  │                 │                 │
 2  │──────────── charge(card, $99) ────►│                 │
    │◄───────────────── 200 OK ──────────│                 │
    │                  │            [charged ✓]            │
    │                  │                 │                 │
 3  │─────────────────────── schedule(order) ─────────────►│
    │◄─────────────────────────── 503 ──────────────────── │
    │                  │                 │           [no record ✗]
    │                  │                 │                 │
    ╔══════════════════════════════════════════════════════╗
    ║  ⚠  Inconsistent state                               ║
    ║     Inventory: item reserved   ✓  (durable)          ║
    ║     Payment:   $99 charged     ✓  (durable)          ║
    ║     Shipping:  no record       ✗  (never happened)   ║
    ║                                                      ║
    ║     Customer paid — nothing ships                    ║
    ╚══════════════════════════════════════════════════════╝

Step three failed, but the first two services already committed. Their changes are durable and cannot be undone by simply retrying or ignoring the error. Without an explicit rollback strategy, the system is stuck: inventory is locked, the customer's card was charged, and no shipment was ever created.

Why you cannot use a distributed transaction here

Two-phase commit (2PC) is the classical answer to multi-node consistency. It works: coordinators send PREPARE, wait for all participants to acknowledge, then send COMMIT. But in a microservices architecture it creates a cluster of problems:

Lock contention. Every participant holds locks from PREPARE until the coordinator sends COMMIT or ABORT. Under partial failure or a slow coordinator, those locks hang indefinitely.
Single point of failure. If the coordinator crashes between PREPARE and COMMIT, participants are stuck in an uncertain state with no way to decide unilaterally.
Service coupling. 2PC requires all participants to speak the same protocol. That forces every service — including third-party ones you do not control — to implement a two-phase interface.

For workflows that touch multiple independently deployed services, 2PC is a liability. The saga pattern is the alternative.

The saga pattern in one paragraph

A saga breaks a distributed workflow into a sequence of local transactions, each scoped to a single service. Every step has a compensating action that undoes its effect. If any step fails, the already-completed steps are compensated in reverse order, returning the system to a consistent state. There are no distributed locks, no coordinator protocol, and no coupling beyond the interfaces each service already exposes.

The pattern was described by Garcia-Molina and Salem in 1987 for long-lived database transactions. It maps directly onto modern microservice workflows.

Applied to the order example above:

┌──────────────────────── SAGA ORCHESTRATOR ──────────────────────────┐
│                                                                     │
│  ▶  FORWARD  (steps execute in order)                               │
│                                                                     │
│     step 1 │ reserve-inventory ──► Inventory Svc   ✓  committed     │
│     step 2 │ charge-payment    ──► Payment Svc     ✓  committed     │
│     step 3 │ schedule-shipment ──► Shipping Svc    ✗  503 — FAILED  │
│                                                    │                │
│            └────────────────────────────────────── ▼                │
│                                                                     │
│  ◀  COMPENSATION  (triggered automatically, reverse order)          │
│                                                                     │
│     undo 2 │ refund-payment    ──► Payment Svc     ✓  undone        │
│     undo 1 │ release-inventory ──► Inventory Svc   ✓  undone        │
│            │ (step 3 never committed — no compensation needed)      │
│                                                                     │
│  ✓  Every completed step rolled back. Consistent state restored.    │
└─────────────────────────────────────────────────────────────────────┘

The key observation: only the steps that actually succeeded need compensation. Step 3 failed before it wrote anything, so there is nothing to undo. The orchestrator tracks this automatically.

Orchestration versus choreography

Sagas are typically implemented in one of two styles:

Choreography — each service publishes events and reacts to events from other services. There is no central coordinator. This scales well but makes the overall workflow hard to trace; understanding what "provisioning" means requires reading every participant.

Orchestration — a dedicated orchestrator drives the saga, calling each service in turn and triggering compensation on failure. The whole workflow is visible in one place, which makes debugging, observability, and retry logic straightforward to reason about.

For provisioning workflows — where the sequence is deterministic, failures are meaningful, and you need an audit trail — orchestration is the right choice.

A Go implementation

Here is a small, dependency-free saga orchestrator in Go. The full source is at github.com/mrchcoin/go-saga-orchestrator.

The core type is generic over the payload T, so one saga definition can be executed concurrently from many goroutines without sharing mutable state:

type Saga[T any] struct {
    name  string
    steps []Step[T]
    retry RetryPolicy
    obs   func(Event)
}

Each step pairs a forward action with a compensating action. The compensating action may be nil when the forward action has no side effect to reverse — a terminal notification, for example:

type Step[T any] struct {
    Name       string
    Action     func(ctx context.Context, payload T) error
    Compensate func(ctx context.Context, payload T) error
}

A saga is built with a fluent API, then executed:

flow := saga.New[*ProvisionRequest]("provision-vm").
    Step("reserve-network-port",   reservePort,   releasePort).
    Step("allocate-storage",       allocateBlock, releaseBlock).
    Step("start-hypervisor",       startVM,       stopVM).
    Step("register-billing",       openBillingRecord, closeBillingRecord).
    Step("notify-user",            sendEmail,     nil) // notification has no compensation

result, err := flow.Execute(ctx, req)

What happens when billing fails

Suppose register-billing returns an error. The orchestrator does not panic or leave the system in a half-provisioned state. It compensates the already-completed steps in reverse order:

step_started:       reserve-network-port
step_completed:     reserve-network-port
step_started:       allocate-storage
step_completed:     allocate-storage
step_started:       start-hypervisor
step_completed:     start-hypervisor
step_started:       register-billing
step_failed:        register-billing  (billing service unavailable)
compensation_started:   start-hypervisor
compensation_completed: start-hypervisor
compensation_started:   allocate-storage
compensation_completed: allocate-storage
compensation_started:   reserve-network-port
compensation_completed: reserve-network-port
saga_aborted

The caller gets an *AbortError that unwraps to the original step error. No cleanup code in the calling layer, no manual rollback logic scattered across services.

Three design decisions worth explaining

1. The observer is transport-agnostic

The orchestrator emits an Event for every state transition, but it does not know how to publish that event. The caller registers an observer function:

flow.Observe(func(e saga.Event) {
    // publish to Kafka, append to a journal, stream to a dashboard — your choice
    broker.Publish("saga.events", marshalEvent(e))
})

This keeps the package free of any messaging, HTTP, or database dependency. The same core runs in a unit test (with no observer) and in production (with a Kafka publisher) without branching in the orchestrator itself.

2. Compensation is best-effort across all steps

When rollback begins, the orchestrator runs every compensation, even if some fail. A single broken undo should not strand the others. If compensations fail, the error is wrapped alongside the original abort error:

result, err := flow.Execute(ctx, req)
if err != nil {
    var ce *saga.CompensationError
    if errors.As(err, &ce) {
        // partial rollback — operator intervention required
        for _, f := range ce.Failures {
            log.Errorf("compensation failed: step=%s err=%v", f.Step, f.Err)
        }
    }
    // original step error is still accessible
    log.Errorf("saga aborted: %v", err)
}

This distinction matters operationally. An *AbortError with no CompensationError means the system rolled back cleanly. An error wrapping both means at least one step left residual state — alert your on-call.

3. Retry policy is injected, not hardcoded

Transient network errors should be retried before compensation triggers. The retry policy is set on the saga, not buried in step implementations:

flow.Retry(saga.RetryPolicy{
    MaxAttempts: 3,
    Backoff: func(attempt int) time.Duration {
        return time.Duration(attempt*attempt) * 100 * time.Millisecond // exponential
    },
})

Context cancellation is honoured between attempts, so a timeout at the caller layer cleanly aborts a retry loop rather than forcing it to sleep through the full backoff.

Production considerations

Idempotency is the caller's responsibility. If the process crashes mid-saga and restarts from a checkpoint, compensations may be called on steps that were already compensated. Each compensation must tolerate being called more than once. A billing-record deletion that checks whether the record still exists before deleting is idempotent. One that blindly sends a DELETE and treats 404 as an error is not.

Persistence is outside the package. The observer pattern is the hook for it. On each step_completed event, append the step name to a durable journal (a Postgres row, a Kafka topic). On restart, replay the journal to reconstruct how far the saga progressed, then resume from the first incomplete step or re-trigger compensation for each completed one. The orchestrator itself does not need to know about persistence; it runs the steps you ask it to run.

Timeouts belong on the context. Pass a context.WithTimeout to Execute. The orchestrator checks ctx.Done() between retry attempts and surfaces context cancellation as a step failure, triggering normal compensation. The provisioning workflow that gets stuck waiting for a sluggish billing service will roll back cleanly within the deadline rather than hanging indefinitely.

The key invariant

A saga does not prevent partial failures — it recovers from them predictably. The invariant the orchestrator enforces is: either all forward actions succeed, or the system is returned to the state it was in before the saga started (modulo compensation failures, which are reported explicitly and require operator attention).

That is a weaker guarantee than ACID atomicity, but it is the guarantee you can actually achieve across independently deployed services. And in practice, for provisioning workflows, billing pipelines, and order fulfilment flows, it is the guarantee you need.

Source: github.com/mrchcoin/go-saga-orchestrator — MIT licensed, dependency-free, race-tested.