Understanding the SAGA Pattern: Coordinating Long-Running Transactions

Traditional database transactions guarantee ACID semantics because a single engine controls all the state. Locking rows, writing to a write-ahead log, and atomically committing or rolling back are all possible because one system owns all the data.

Long-running business operations break this model in two ways.

First, they span service boundaries. A hotel booking touches a room inventory service, a payment service, a loyalty points service, and a calendar service — each owning its own database. No single engine can wrap all four writes in a transaction.

Second, they take time. A loan approval may need a credit check that takes 30 seconds, a fraud scoring call that takes 5 seconds, and a human underwriting decision that takes 4 hours. Holding database locks open for 4 hours is not a transaction; it is a disaster waiting to happen. Every operation that needs the locked rows would queue behind them, and the contention would spread to the rest of the system.

The SAGA pattern addresses both problems. A SAGA decomposes a long-running operation into a sequence of local transactions, each independently committed to its service's own database. Consistency is maintained not through locks, but through compensating transactions — operations that semantically reverse the effect of a prior step when a downstream failure requires it.

SAGAs are not an academic pattern. They are the operational backbone of many production microservices systems that handle multi-step business processes.

Without them, engineering teams fall into one of two failure modes:

The silent inconsistency failure: services commit independently with no rollback strategy. When step 4 of 6 fails, the system leaves partial state: a payment captured, a room reserved, loyalty points awarded, but no booking record created. Reconciling these inconsistencies requires manual intervention, and until that happens every report built on the data is wrong.

The over-synchronization failure: teams reach for synchronous HTTP chains. Service A calls B, B calls C, C calls D, all in a request-response chain. A timeout at step C leaves A waiting, holding a connection, and with no reliable signal about what state C ended in. The system becomes fragile under any latency variance.

SAGAs provide a third path: asynchronous coordination with explicit failure handling, durable state tracking, and a clear rollback strategy defined before the operation begins.

Relationship to Distributed Transactions

The SAGA pattern was first described by Hector Garcia-Molina and Kenneth Salem in their 1987 paper "Sagas" as a mechanism for long-lived transactions in database systems. Microservices architectures have adopted and extended it for cross-service coordination.

Consider a travel booking platform where confirming a trip requires:

Step	Service	Local Transaction
T1	Flight Service	Reserve seat on flight
T2	Hotel Service	Reserve room for dates
T3	Car Rental Service	Reserve vehicle
T4	Payment Service	Capture full trip cost
T5	Loyalty Service	Award points to account
T6	Notification Service	Send booking confirmation

Each service manages its own database. None can participate in a cross-service transaction.

A realistic failure path:

T1 succeeds — seat reserved
T2 succeeds — room reserved
T3 succeeds — vehicle reserved
T4 fails — payment gateway timeout, charge not captured
The SAGA must now compensate: release the vehicle (C3), release the room (C2), release the flight seat (C1)
The user sees a clean "booking failed" response with no partial charges
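
The failure path above can be sketched in a few lines of plain PHP. This is a minimal illustration rather than a production runner: runSaga and the shape of the step array are hypothetical, nothing is persisted, and every compensation is assumed to succeed.

```php
<?php

/**
 * Runs steps in order. On the first failure, runs the compensations of the
 * steps that already succeeded, in reverse order.
 *
 * @param array<string, array{action: callable, compensate: callable}> $steps
 * @return array{status: string, log: list<string>}
 */
function runSaga(array $steps): array
{
    $log = [];
    $completed = [];

    foreach ($steps as $name => $step) {
        try {
            ($step['action'])();
            $completed[] = $name;
            $log[] = "T:{$name}";
        } catch (RuntimeException $e) {
            $log[] = "FAILED:{$name}";
            // The failed step itself is not compensated, only its predecessors.
            foreach (array_reverse($completed) as $done) {
                ($steps[$done]['compensate'])();
                $log[] = "C:{$done}";
            }
            return ['status' => 'COMPENSATED', 'log' => $log];
        }
    }

    return ['status' => 'COMPLETED', 'log' => $log];
}

$noop = static function (): void {};
$steps = [
    'reserve_flight'  => ['action' => $noop, 'compensate' => $noop],
    'reserve_hotel'   => ['action' => $noop, 'compensate' => $noop],
    'reserve_vehicle' => ['action' => $noop, 'compensate' => $noop],
    'capture_payment' => [
        'action' => static function (): void {
            throw new RuntimeException('payment gateway timeout');
        },
        'compensate' => $noop,
    ],
];

$result = runSaga($steps);
echo implode(' -> ', $result['log']), PHP_EOL;
```

The log shows the three reservations, the failed payment, and then the compensations for vehicle, hotel, and flight in that order.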

Another failure path, harder to handle:

T1 through T4 succeed — all resources reserved, payment captured
T5 fails — loyalty service database is down
Do you roll back the entire booking, including a successful payment?
Or do you accept the booking and compensate later when loyalty is available?

This second scenario illustrates that SAGA design is never just about rollback — it's about defining acceptable intermediate states and forward recovery paths alongside backward recovery.

Every SAGA has three possible terminal outcomes:

Completed — all local transactions committed successfully
Compensated — a step failed, all prior steps were compensated, system returned to its original state
Stuck — a step failed and compensation also failed, requiring manual intervention

The state machine governing a SAGA's lifecycle looks like this:

STARTED → [T1] → [T2] → [T3] → ... → COMPLETED
                   ↓ failure
              COMPENSATING → [C(n-1)] → [C(n-2)] → ... → COMPENSATED
                                ↓ compensation failure
                           STUCK (requires human intervention)

Every SAGA implementation must handle the STUCK state. It is not theoretical — compensation failures happen in production. The correct response is to persist the stuck SAGA, alert an operator, and surface the state in an admin dashboard for resolution.
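
One way to keep these transitions honest is to encode them in the status type itself. The sketch below assumes SagaStatus is a PHP 8.1 backed enum; the string values and the allowedNext and canTransitionTo helpers are illustrative, not taken from any framework.

```php
<?php

// Hypothetical status enum; the case names follow the lifecycle diagram.
enum SagaStatus: string
{
    case STARTED      = 'started';
    case COMPENSATING = 'compensating';
    case COMPLETED    = 'completed';
    case COMPENSATED  = 'compensated';
    case STUCK        = 'stuck';

    /** @return list<self> statuses reachable from this one */
    public function allowedNext(): array
    {
        return match ($this) {
            self::STARTED      => [self::COMPLETED, self::COMPENSATING],
            self::COMPENSATING => [self::COMPENSATED, self::STUCK],
            // Terminal outcomes: STUCK is resolved by an operator, out of band.
            self::COMPLETED, self::COMPENSATED, self::STUCK => [],
        };
    }

    public function canTransitionTo(self $next): bool
    {
        return in_array($next, $this->allowedNext(), true);
    }
}

var_dump(SagaStatus::STARTED->canTransitionTo(SagaStatus::COMPENSATING));
var_dump(SagaStatus::COMPLETED->canTransitionTo(SagaStatus::COMPENSATING));
```

Rejecting an illegal transition before writing it (for example COMPLETED to COMPENSATING) turns a silent data bug into an immediate error.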

In a choreography model, there is no central coordinator. Each service reacts to domain events published by the previous step and publishes its own event upon completion or failure.

FlightService publishes → SeatReserved
HotelService listens → SeatReserved → reserves room → publishes RoomReserved
CarRentalService listens → RoomReserved → reserves vehicle → publishes VehicleReserved
PaymentService listens → VehicleReserved → captures payment → publishes PaymentCaptured
LoyaltyService listens → PaymentCaptured → awards points → publishes PointsAwarded
NotificationService listens → PointsAwarded → sends email

On failure, the failing service publishes a failure event:

PaymentService publishes → PaymentFailed
CarRentalService listens → PaymentFailed → releases vehicle → publishes VehicleReleased
HotelService listens → VehicleReleased → releases room → publishes RoomReleased
FlightService listens → RoomReleased → releases seat → publishes SeatReleased
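
The event chain above can be simulated with a tiny in-memory bus. This is only a sketch: EventBus is a hypothetical stand-in for a real message broker, delivery is synchronous, and each handler represents a whole service.

```php
<?php

// Minimal in-memory stand-in for a message broker (illustration only).
final class EventBus
{
    /** @var array<string, list<callable>> */
    private array $handlers = [];

    /** @var list<string> events in the order they were published */
    public array $published = [];

    public function subscribe(string $event, callable $handler): void
    {
        $this->handlers[$event][] = $handler;
    }

    public function publish(string $event, string $sagaId): void
    {
        $this->published[] = $event;
        foreach ($this->handlers[$event] ?? [] as $handler) {
            $handler($sagaId);
        }
    }
}

$bus = new EventBus();
$paymentSucceeds = false; // simulate the payment gateway timeout

// Forward path: each service reacts to the previous service's event.
$bus->subscribe('SeatReserved', fn (string $id) => $bus->publish('RoomReserved', $id));
$bus->subscribe('RoomReserved', fn (string $id) => $bus->publish('VehicleReserved', $id));
$bus->subscribe('VehicleReserved', fn (string $id) =>
    $bus->publish($paymentSucceeds ? 'PaymentCaptured' : 'PaymentFailed', $id));

// Compensation path: each service reacts to the release event downstream of it.
$bus->subscribe('PaymentFailed', fn (string $id) => $bus->publish('VehicleReleased', $id));
$bus->subscribe('VehicleReleased', fn (string $id) => $bus->publish('RoomReleased', $id));
$bus->subscribe('RoomReleased', fn (string $id) => $bus->publish('SeatReleased', $id));

$bus->publish('SeatReserved', 'saga-123');
echo implode(' -> ', $bus->published), PHP_EOL;
```

Note that no component in this sketch knows the whole flow; the sequence only exists as the sum of the subscriptions.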

Strengths:

No coordinator bottleneck — each service is autonomous
Natural fit for event-driven architectures already using a message broker
Services are loosely coupled — they only know about events, not about each other

Weaknesses:

The SAGA's overall state is implicit — distributed across event logs in multiple services. There is no single place to query "what is the current status of booking #123?"
Adding a new step requires changing multiple subscriber configurations
Debugging failures means correlating events across service logs using a correlation ID
Cyclic dependencies can emerge if the event graph isn't designed carefully

Choreography is appropriate for simple, linear flows with stable participants. It breaks down when the business logic has many conditional branches or when debugging turnaround time needs to be short.

In an orchestration model, a dedicated SAGA orchestrator service owns the entire transaction lifecycle. It calls each participant, receives results, and decides what to do next. Participants don't know about each other — they only know how to execute a command and respond.

Orchestrator → ReserveSeat → FlightService → SeatReserved
Orchestrator → ReserveRoom → HotelService → RoomReserved
Orchestrator → ReserveVehicle → CarRentalService → VehicleReserved
Orchestrator → CapturePayment → PaymentService → PaymentFailed
  ↳ Orchestrator → ReleaseVehicle → CarRentalService → VehicleReleased
  ↳ Orchestrator → ReleaseRoom → HotelService → RoomReleased
  ↳ Orchestrator → ReleaseSeat → FlightService → SeatReleased

The orchestrator persists its state after every step. If it crashes mid-execution, it can resume from the last committed step on restart.
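
Resuming after a crash comes down to comparing the step list with what was persisted. A minimal sketch, assuming the persisted context is keyed by step name and written only after a step commits (remainingSteps is a hypothetical helper):

```php
<?php

/**
 * Returns the steps that still have to run, given the persisted context.
 *
 * @param list<string> $steps
 * @param array<string, mixed> $context results keyed by committed step name
 * @return list<string>
 */
function remainingSteps(array $steps, array $context): array
{
    return array_values(array_filter(
        $steps,
        static fn (string $step): bool => !array_key_exists($step, $context)
    ));
}

$steps = ['reserve_flight', 'reserve_hotel', 'reserve_vehicle', 'capture_payment'];
$persistedContext = [
    'reserve_flight' => ['reservation_id' => 'F-1'],
    'reserve_hotel'  => ['reservation_id' => 'H-1'],
];

print_r(remainingSteps($steps, $persistedContext));
```

A step that was in flight when the orchestrator crashed has no context entry, so it is executed again on resume. That is only safe if the participant treats the repeated call as idempotent.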

Strengths:

The SAGA's full state is in one place — easy to query, debug, and monitor
Business logic lives in the orchestrator, not scattered across event subscribers
Easy to add new steps without touching existing services
Supports complex conditional branching naturally

Weaknesses:

The orchestrator is a coordination point — if it's unavailable, no new SAGAs start
Risk of the orchestrator becoming a "god service" with too much business logic
Requires durable state persistence for the orchestrator itself

Choosing a Model

Start with orchestration for any flow that has more than 3 steps, conditional branches, or rollback complexity. Choreography works well for simple append-only flows in systems already heavily event-driven. Mixing both in the same system is common — use the model that fits each flow.

A compensating transaction is not the same as an undo. It is a new forward operation that semantically reverses the effect of a prior step in the business domain.

Releasing a reserved seat is a compensating transaction for reserving it. Issuing a refund is a compensating transaction for capturing a payment. These are new database writes — not rollbacks of the original writes.
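
The difference between compensating and undoing is easiest to see in an append-only ledger. The Ledger class below is a hypothetical stand-in for the payment service's table; amounts are in cents.

```php
<?php

// Append-only ledger: compensation adds rows, it never removes them.
final class Ledger
{
    /** @var list<array{type: string, charge_id: string, amount_cents: int}> */
    private array $entries = [];

    public function capture(string $chargeId, int $amountCents): void
    {
        $this->entries[] = ['type' => 'capture', 'charge_id' => $chargeId, 'amount_cents' => $amountCents];
    }

    /** Compensating transaction: a new row that offsets the original capture. */
    public function refund(string $chargeId): void
    {
        foreach ($this->entries as $entry) {
            if ($entry['type'] === 'capture' && $entry['charge_id'] === $chargeId) {
                $this->entries[] = ['type' => 'refund', 'charge_id' => $chargeId, 'amount_cents' => -$entry['amount_cents']];
                return;
            }
        }
        throw new InvalidArgumentException("Unknown charge: {$chargeId}");
    }

    public function balance(): int
    {
        return array_sum(array_column($this->entries, 'amount_cents'));
    }

    public function entryCount(): int
    {
        return count($this->entries);
    }
}

$ledger = new Ledger();
$ledger->capture('ch_1', 45000);
$ledger->refund('ch_1');

echo $ledger->balance(), ' cents across ', $ledger->entryCount(), ' entries', PHP_EOL;
```

The balance returns to zero, but both rows remain: the history records that the charge happened and was later reversed.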

Not every step has a meaningful compensating transaction. Some operations are pivot transactions — once committed, they cannot be undone in the business sense. A confirmation email cannot be unsent. An SMS cannot be recalled. An audit log entry should not be deleted.

For pivot transactions, the compensation strategy is forward recovery rather than backward recovery: accept that the step happened and compensate forward by sending a correction or cancellation notification instead of attempting to reverse it.

SAGAs provide Atomicity (the whole saga eventually completes or fully compensates), Consistency (the system moves between valid business states), and Durability (each local transaction is durable). What they sacrifice is Isolation.

In a traditional transaction, concurrent operations don't see each other's intermediate state. In a SAGA, each local transaction commits independently. Other processes can read the system's state between steps. A customer support agent querying a booking during a payment step may see a partially-booked trip. A concurrent booking may attempt to reserve the same seat window.

This is the most significant production consequence of SAGA. Systems built on SAGAs must be designed to tolerate and reason about these intermediate states.

Four specific anomalies become possible with SAGA's lack of isolation:

Lost updates — Two concurrent SAGAs both read the same stock level and both decide to reserve the last unit. Both commit their reservations, creating an oversell.

Dirty reads — A concurrent process reads a booking record created by a SAGA's order step before the payment step has committed. The booking may be cancelled if payment fails, but the concurrent reader saw it as "confirmed."

Non-repeatable reads — A SAGA reads a value early in its execution and makes a decision based on it. By the time a later step executes, another process has changed that value. The earlier decision is now based on stale data.

Phantom reads — A SAGA queries for a set of records meeting criteria, makes a decision, and by the time it acts, new records matching those criteria have been inserted by another process.

Chris Richardson (in Microservices Patterns) describes six countermeasures for SAGA isolation issues:

Countermeasure	Description
Semantic lock	Mark records with a `PENDING` flag during SAGA execution; reject concurrent modifications
Commutative updates	Design updates so order doesn't matter (e.g., increment/decrement instead of set)
Pessimistic view	Reorder SAGA steps to minimize the window of dirty reads
Reread value	Re-read a value before committing a step to detect concurrent modifications
Version file	Record updates as a versioned append log; apply them in a defined order
By value	Use business-level risk assessment to decide per-transaction which countermeasure to apply

The most broadly applicable countermeasure is the semantic lock: mark the primary aggregate with a processing state at the start of the SAGA, and only clear it on completion or full compensation. Concurrent reads see the processing state and treat the record as unavailable or pending.
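
A semantic lock can be sketched with an in-memory seat inventory. The class and state names here are hypothetical; a real service would do the check and the update in one local database transaction (for example a conditional UPDATE on the state column).

```php
<?php

final class SeatLockedException extends RuntimeException {}

final class SeatInventory
{
    /** @var array<string, array{state: string, saga_id: ?string}> */
    private array $seats = [];

    /** @param list<string> $seatIds */
    public function __construct(array $seatIds)
    {
        foreach ($seatIds as $seatId) {
            $this->seats[$seatId] = ['state' => 'AVAILABLE', 'saga_id' => null];
        }
    }

    /** First SAGA step: take the semantic lock by marking the seat PENDING. */
    public function reserve(string $seatId, string $sagaId): void
    {
        $seat = $this->seats[$seatId];
        if ($seat['saga_id'] === $sagaId) {
            return; // the same SAGA retrying: nothing to do
        }
        if ($seat['state'] !== 'AVAILABLE') {
            throw new SeatLockedException("Seat {$seatId} is {$seat['state']}");
        }
        $this->seats[$seatId] = ['state' => 'PENDING', 'saga_id' => $sagaId];
    }

    /** SAGA completed: the lock becomes a confirmed booking. */
    public function confirm(string $seatId, string $sagaId): void
    {
        $this->requireOwner($seatId, $sagaId);
        $this->seats[$seatId]['state'] = 'BOOKED';
    }

    /** SAGA compensated: clear the lock and free the seat. */
    public function release(string $seatId, string $sagaId): void
    {
        $this->requireOwner($seatId, $sagaId);
        $this->seats[$seatId] = ['state' => 'AVAILABLE', 'saga_id' => null];
    }

    public function stateOf(string $seatId): string
    {
        return $this->seats[$seatId]['state'];
    }

    private function requireOwner(string $seatId, string $sagaId): void
    {
        if ($this->seats[$seatId]['saga_id'] !== $sagaId) {
            throw new SeatLockedException("Seat {$seatId} is not held by {$sagaId}");
        }
    }
}

$inventory = new SeatInventory(['12A']);
$inventory->reserve('12A', 'saga-A');

$rejected = false;
try {
    $inventory->reserve('12A', 'saga-B');
} catch (SeatLockedException $e) {
    $rejected = true; // saga-B sees the PENDING lock and backs off
}

$inventory->release('12A', 'saga-A'); // saga-A compensates
$inventory->reserve('12A', 'saga-B'); // now saga-B can take the seat
```

Whether a rejected caller fails fast or waits and retries is a business decision; the lock only guarantees that two SAGAs never hold the same seat at once.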

Here is a production-grade orchestration SAGA for the travel booking flow, using a durable state machine in Laravel:

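use Illuminate\Database\Eloquent\Model;
use Illuminate\Support\Str;

// The SAGA state record, persisted to the database
class TravelBookingSaga extends Model
{
    // The primary key is a UUID string, not an auto-increment integer
    public $incrementing = false;
    protected $keyType = 'string';

    // Allow the orchestrator to mass-assign attributes via create() and update()
    protected $guarded = [];

    protected $casts = [
        'payload'           => 'array',
        'context'           => 'array',
        'compensated_steps' => 'array',
    ];

    // Assumes SagaStatus values are stored as plain strings;
    // add a 'status' cast here if SagaStatus is a backed enum.
    public function isCompensating(): bool
    {
        return $this->status === SagaStatus::COMPENSATING;
    }
}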

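class TravelBookingOrchestrator
{
    private array $steps = [
        'reserve_flight',
        'reserve_hotel',
        'reserve_vehicle',
        'capture_payment',
        'award_points',
        'send_confirmation',
    ];

    // Participant clients are injected; the class names are illustrative
    public function __construct(
        private FlightService $flightService,
        private HotelService $hotelService,
        private CarRentalService $carService,
        private PaymentService $paymentService,
        private LoyaltyService $loyaltyService,
        private NotificationService $notificationService,
    ) {}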

    public function execute(TravelBookingCommand $command): BookingResult
    {
        $saga = TravelBookingSaga::create([
            'id'      => (string) Str::uuid(), // store the UUID as a plain string
            'status'  => SagaStatus::STARTED,
            'payload' => $command->toArray(),
            'context' => [],
            'compensated_steps' => [],
        ]);

        foreach ($this->steps as $step) {
            try {
                $result = $this->executeStep($step, $saga);
                $saga->update([
                    'current_step' => $step,
                    'context'      => array_merge($saga->context, [$step => $result]),
                ]);
            } catch (StepFailedException $e) {
                $this->compensate($saga, $step, $e);
                return BookingResult::failure($saga->id, $e->getMessage());
            }
        }

        $saga->update(['status' => SagaStatus::COMPLETED]);
        return BookingResult::success($saga->id, $saga->context['reserve_flight']['booking_ref']);
    }

    private function executeStep(string $step, TravelBookingSaga $saga): array
    {
        return match ($step) {
            'reserve_flight'   => $this->flightService->reserve($saga->payload, $saga->id),
            'reserve_hotel'    => $this->hotelService->reserve($saga->payload, $saga->id),
            'reserve_vehicle'  => $this->carService->reserve($saga->payload, $saga->id),
            'capture_payment'  => $this->paymentService->capture($saga->payload, $saga->id),
            'award_points'     => $this->loyaltyService->award($saga->payload, $saga->id),
            'send_confirmation'=> $this->notificationService->send($saga->context, $saga->id),
            default => throw new \InvalidArgumentException("Unknown step: {$step}"),
        };
    }

    private function compensate(TravelBookingSaga $saga, string $failedStep, \Throwable $reason): void
    {
        $saga->update([
            'status'         => SagaStatus::COMPENSATING,
            'failed_step'    => $failedStep,
            'failure_reason' => $reason->getMessage(),
        ]);

        // Compensate all successfully completed steps in reverse order
        $completedSteps = array_keys($saga->context);
        $compensations  = array_reverse($completedSteps);

        foreach ($compensations as $step) {
            try {
                $this->compensateStep($step, $saga);
                $saga->update([
                    'compensated_steps' => array_merge($saga->compensated_steps, [$step]),
                ]);
            } catch (\Throwable $e) {
                // Compensation failed — SAGA is stuck
                $saga->update(['status' => SagaStatus::STUCK, 'stuck_reason' => $e->getMessage()]);
                $this->alertOperations($saga, $step, $e);
                return;
            }
        }

        $saga->update(['status' => SagaStatus::COMPENSATED]);
    }

    private function compensateStep(string $step, TravelBookingSaga $saga): void
    {
        $ctx = $saga->context[$step];

        match ($step) {
            'reserve_flight'  => $this->flightService->release($ctx['reservation_id'], $saga->id),
            'reserve_hotel'   => $this->hotelService->release($ctx['reservation_id'], $saga->id),
            'reserve_vehicle' => $this->carService->release($ctx['reservation_id'], $saga->id),
            'capture_payment' => $this->paymentService->refund($ctx['charge_id'], $saga->id),
            'award_points'    => $this->loyaltyService->revoke($ctx['transaction_id'], $saga->id),
            // Notifications cannot be compensated — forward recovery only
            'send_confirmation' => $this->notificationService->sendCancellation($ctx, $saga->id),
            default => null,
        };
    }
}

Each participant service receives $saga->id as an idempotency key. This only works if every participant honors it: calling flightService->reserve() twice with the same SAGA ID must return the same reservation rather than create a duplicate.
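
On the participant side, honoring the idempotency key might look like the sketch below. The FlightService shown here is hypothetical, and the array stands in for a reservations table with a unique constraint on the SAGA ID.

```php
<?php

// Hypothetical participant: one reservation per SAGA ID, no matter how
// many times reserve() is called.
final class FlightService
{
    /** @var array<string, array{reservation_id: string, booking_ref: string}> */
    private array $reservationsBySaga = [];

    private int $sequence = 0;

    public function reserve(array $payload, string $sagaId): array
    {
        // Repeated call with the same key: return the stored result.
        if (isset($this->reservationsBySaga[$sagaId])) {
            return $this->reservationsBySaga[$sagaId];
        }

        $this->sequence++;
        $reservation = [
            'reservation_id' => "F-{$this->sequence}",
            'booking_ref'    => strtoupper(substr(md5($sagaId), 0, 6)),
        ];

        return $this->reservationsBySaga[$sagaId] = $reservation;
    }

    public function reservationCount(): int
    {
        return count($this->reservationsBySaga);
    }
}

$flights = new FlightService();
$first   = $flights->reserve(['flight' => 'XY123'], 'saga-123');
$retry   = $flights->reserve(['flight' => 'XY123'], 'saga-123');
```

The retry returns the reservation created by the first call, so an orchestrator that times out and calls again cannot double-book.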

Backward recovery (compensating all prior steps) is the default mental model for SAGAs. But for some failures, it's wrong.

If a customer's payment is captured and loyalty points are awarded, but the notification email fails due to a transient SMTP error, rolling back the entire booking — including the successful payment — is a worse outcome than retrying the email. The correct strategy is forward recovery: persist the SAGA in a NOTIFICATION_PENDING state and retry the notification step with exponential backoff.

Design each SAGA step with explicit recovery semantics: is failure of this step a reason to compensate everything, or a reason to retry until it succeeds?
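
Those recovery semantics can be written down as data instead of being buried in exception handlers. The sketch below is one possible shape: Recovery, recoveryFor, and backoffSeconds are hypothetical names, and treating award_points as a retry step is an assumption about this business, not a rule.

```php
<?php

enum Recovery
{
    case COMPENSATE; // backward recovery: undo the prior steps
    case RETRY;      // forward recovery: keep trying until the step succeeds
}

/** Per-step policy for the travel booking flow (illustrative choices). */
function recoveryFor(string $step): Recovery
{
    return match ($step) {
        'award_points', 'send_confirmation' => Recovery::RETRY,
        default                             => Recovery::COMPENSATE,
    };
}

/** Delay before retry attempt N (1-based): doubles each time, up to a cap. */
function backoffSeconds(int $attempt, int $baseSeconds = 2, int $capSeconds = 300): int
{
    $attempt = max(1, $attempt);

    return min($capSeconds, $baseSeconds * 2 ** ($attempt - 1));
}

foreach ([1, 2, 3, 10] as $attempt) {
    echo "attempt {$attempt}: wait ", backoffSeconds($attempt), 's', PHP_EOL;
}
```

An orchestrator would consult recoveryFor() when a step fails, then either start compensation or schedule the next attempt after backoffSeconds().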

Compensations must be idempotent. Under retry conditions, the orchestrator may call a compensation multiple times. A payment refund called twice must result in one refund, not two. Enforce idempotency at the database level using the SAGA ID as a unique constraint on the compensation record:

CREATE UNIQUE INDEX idx_refunds_saga_idempotency
  ON refunds (charge_id, saga_id);

An attempt to insert a duplicate refund record is rejected with a unique constraint violation, which the compensation handler catches and treats as success.
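
The handler side of that index can be sketched with PDO. This example assumes the pdo_sqlite extension and uses an in-memory database; recordRefund and the amount_cents column are hypothetical. In a real payment service, the gateway refund would be issued only when the insert succeeds.

```php
<?php

$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE refunds (charge_id TEXT NOT NULL, saga_id TEXT NOT NULL, amount_cents INTEGER NOT NULL)');
$pdo->exec('CREATE UNIQUE INDEX idx_refunds_saga_idempotency ON refunds (charge_id, saga_id)');

/**
 * Records a refund once. Returns true when the row was inserted and false
 * when the same charge and SAGA were already refunded.
 */
function recordRefund(PDO $pdo, string $chargeId, string $sagaId, int $amountCents): bool
{
    $stmt = $pdo->prepare('INSERT INTO refunds (charge_id, saga_id, amount_cents) VALUES (?, ?, ?)');

    try {
        $stmt->execute([$chargeId, $sagaId, $amountCents]);
        return true;
    } catch (PDOException $e) {
        // SQLSTATE class 23 means an integrity constraint was violated.
        if (str_starts_with((string) $e->getCode(), '23')) {
            return false; // duplicate: the refund already exists, treat as success
        }
        throw $e; // anything else is a real failure
    }
}

$first  = recordRefund($pdo, 'ch_1', 'saga-123', 45000);
$second = recordRefund($pdo, 'ch_1', 'saga-123', 45000);
$rows   = (int) $pdo->query('SELECT COUNT(*) FROM refunds')->fetchColumn();
```

The second call hits the unique index and is reported as "already done", so a retried compensation leaves exactly one refund row.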

An orchestrator crash between steps leaves a SAGA with no one driving it forward. You need a recovery process — a scheduled job that queries for SAGAs in non-terminal states older than a threshold and resumes or compensates them:

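use Illuminate\Contracts\Queue\ShouldQueue;

// Scheduled every 5 minutes. Picks up SAGAs that stalled in a non-terminal state.
class StuckSagaRecoveryJob implements ShouldQueue
{
    // resume() is assumed to continue a STARTED SAGA from its last committed
    // step and to finish compensation for a COMPENSATING one.
    public function handle(TravelBookingOrchestrator $orchestrator): void
    {
        TravelBookingSaga::whereIn('status', [SagaStatus::STARTED, SagaStatus::COMPENSATING])
            ->where('updated_at', '<', now()->subMinutes(10))
            ->eachById(fn ($saga) => $orchestrator->resume($saga));
    }
}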

In a high-volume system, SAGA state records accumulate quickly. Keep SAGAs that have reached a terminal state (COMPLETED, COMPENSATED, or STUCK once an operator has resolved it) in the hot table for a rolling window (e.g., 30 days), then move them to cold storage such as S3 or a data warehouse on a schedule. Archive rather than delete: SAGA records are an audit trail.

Not all SAGA steps depend on each other. For the travel booking, reserving a flight, hotel, and car are independent operations. Running them sequentially adds latency with no benefit. A parallel fan-out reduces total booking time significantly:

use Illuminate\Process\Pool;
use Illuminate\Support\Facades\Process;

// Run the three independent reservations as concurrent processes via Laravel's process pool.
// saga:step is assumed to be an application-defined Artisan command.
[$flightResult, $hotelResult, $carResult] = Process::pool(function (Pool $pool) use ($saga) {
    $pool->path(base_path())->command("php artisan saga:step {$saga->id} reserve_flight");
    $pool->path(base_path())->command("php artisan saga:step {$saga->id} reserve_hotel");
    $pool->path(base_path())->command("php artisan saga:step {$saga->id} reserve_vehicle");
})->start()->wait();

Parallel execution requires that compensation handles partial completion: if flight and hotel succeed but car fails, only two compensations are needed, not three.
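
Working out which compensations to run after a partial fan-out is a small pure function. A sketch, assuming the orchestrator has collected a success flag per step (stepsToCompensate is a hypothetical helper):

```php
<?php

/**
 * @param array<string, bool> $results step name => whether it succeeded
 * @return list<string> steps whose compensation must run
 */
function stepsToCompensate(array $results): array
{
    if (!in_array(false, $results, true)) {
        return []; // everything succeeded: nothing to undo
    }

    // Only the steps that actually committed need a compensation.
    return array_keys(array_filter($results));
}

$results = [
    'reserve_flight'  => true,
    'reserve_hotel'   => true,
    'reserve_vehicle' => false,
];

print_r(stepsToCompensate($results));
```

For parallel steps the compensation order usually does not matter, because the steps were independent in the first place.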

In systems with millions of concurrent SAGAs, a single saga_state table becomes a write bottleneck. Partition the table by a hash of the SAGA ID or by creation date. Route orchestrator instances to their partition using consistent hashing. This eliminates cross-partition coordination while maintaining queryability.
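
The simplest form of hash partitioning is a stable hash modulo the partition count, sketched below with a hypothetical partitionFor helper. This is plain modulo hashing, not consistent hashing: changing the partition count remaps most SAGA IDs, which is the problem a consistent hash ring is meant to reduce.

```php
<?php

/** Maps a SAGA ID to one of $partitions buckets using a stable hash. */
function partitionFor(string $sagaId, int $partitions): int
{
    if ($partitions < 1) {
        throw new InvalidArgumentException('partitions must be >= 1');
    }

    // crc32() is deterministic; on 64-bit PHP it is always non-negative.
    return crc32($sagaId) % $partitions;
}

$partition = partitionFor('3f2b8c1e-saga', 8);
echo "saga routed to partition {$partition}", PHP_EOL;
```

Because the mapping depends only on the SAGA ID, any orchestrator instance can compute where a SAGA lives without a lookup table.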

Choreography SAGAs are vulnerable to event ordering issues. If PaymentFailed reaches the car service before the car service has recorded its own reservation (for example, because VehicleReserved was published before the local transaction committed), the car service may try to compensate a reservation it cannot find yet. Use a semantic lock on the aggregate record with a version counter, and reject compensation commands for steps that haven't been committed so they can be retried once the reservation exists.

A SAGA that is STUCK or taking longer than expected is invisible without deliberate instrumentation. Every SAGA system needs:

Real-time dashboard — current SAGAs by status, step distribution, age
Step-level metrics — p50/p95/p99 latency per step, failure rate per step
SAGA duration SLOs — alert when a SAGA has been running longer than the 99th percentile baseline
Stuck SAGA alerting — page oncall when any SAGA enters STUCK state
Correlation IDs — every service call emits traces with the SAGA ID, enabling end-to-end trace reconstruction

The SAGA pattern is the correct answer to long-running distributed transactions. It trades ACID isolation for operational resilience — and that trade is usually worth it.

The architecture decision isn't whether to use SAGAs, but which coordination model to use and how carefully to design compensations. A system where compensations are idempotent, SAGAs are durably persisted, STUCK states trigger alerting, and recovery jobs run on schedule is a system that handles distributed consistency without depending on unreliable global coordination.

When to choose orchestration: Complex flows, conditional branching, rollback logic, audit requirements, or more than 3 steps. This covers most real business processes.

When to choose choreography: Simple linear flows in existing event-driven architectures where services are stable and the observability infrastructure can correlate events across services.

What never changes regardless of model:

Every SAGA step must be idempotent
Every compensation must be idempotent
The STUCK state must be handled explicitly
SAGAs are an audit log — never delete them
Test compensation paths as rigorously as the happy path

The hard part of SAGAs isn't the implementation. It's defining the business logic: what is an acceptable intermediate state? What warrants backward recovery versus forward recovery? What are the countermeasures for isolation anomalies in this specific domain? Answer those questions in design, not in production.
