Building a Job Queue in Rust: Persistent Tasks With Retry Logic


"Transient failures are inevitable; durable execution requires state to survive the crash."

What We're Building

We are constructing a resilient worker service in Rust that processes background tasks from a persistent queue. This example prioritizes data durability over peak throughput, so that failed jobs are never lost: they either eventually succeed or move to a dead letter queue. We will use async Rust with a SQL database for storage, demonstrating how to structure state transitions that survive application restarts. The focus is on architectural correctness over raw performance, building a foundation for long-running background processing systems.

Step 1 — Define the Job State Machine

The worker must track a job's lifecycle without relying on volatile memory alone. We start by defining an enum that explicitly tracks every state transition, ensuring the logic is exhaustive.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq, sqlx::Type)]
#[sqlx(type_name = "job_status", rename_all = "snake_case")]
pub enum JobStatus {
    Pending,
    Running,
    Succeeded,
    Failed,
    DeadLetter,
}
```

The sqlx::Type derive lets sqlx map this enum to and from a database column, which the Job struct in Step 2 relies on.

This choice matters because explicit states prevent the silent state drift that often plagues long-running daemon processes. By forcing the developer to handle every case exhaustively, we reduce the chance of forgetting to update a database column after a failure.
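The legal edges of the state machine can be encoded explicitly, so an illegal transition becomes a bug you can catch rather than a silent overwrite. A minimal sketch (the enum is redeclared here so the snippet stands alone, and the set of allowed edges is one reasonable choice, not the only one):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JobStatus {
    Pending,
    Running,
    Succeeded,
    Failed,
    DeadLetter,
}

/// Returns true only for transitions the worker is allowed to make.
pub fn can_transition(from: JobStatus, to: JobStatus) -> bool {
    use JobStatus::*;
    matches!(
        (from, to),
        (Pending, Running)        // a worker claims the job
            | (Running, Succeeded) // the handler returned Ok
            | (Running, Failed)    // the handler returned Err
            | (Failed, Running)    // a retry attempt begins
            | (Failed, DeadLetter) // retries are exhausted
    )
}
```

Calling this guard before every database update makes the state machine self-documenting: terminal states like Succeeded and DeadLetter simply have no outgoing edges.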

Step 2 — Persist Job State in Storage

A transient failure of the application worker must not result in data loss. We model the job table to include columns for status, retry count, and last attempt timestamp, creating a source of truth that survives restarts.

```rust
use chrono::{DateTime, Utc};

#[derive(Debug, sqlx::FromRow)]
pub struct Job {
    pub id: uuid::Uuid,
    pub status: JobStatus,
    pub retry_count: i32,
    pub created_at: DateTime<Utc>,
    pub last_attempted: Option<DateTime<Utc>>,
}
```

Storing metadata here allows us to query for pending work and ensures we can resume processing from exactly where the application died. We use UUIDs for the ID to maintain uniqueness and avoid accidental collisions.
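On the database side, the struct above might map to a table like the following. The column names and the claim query are assumptions for Postgres, not a prescribed schema; SKIP LOCKED is what lets multiple workers poll concurrently without blocking on each other's claimed rows:

```sql
-- Hypothetical Postgres schema matching the Job struct above
CREATE TABLE jobs (
    id             UUID PRIMARY KEY,
    status         TEXT NOT NULL DEFAULT 'pending',
    retry_count    INTEGER NOT NULL DEFAULT 0,
    created_at     TIMESTAMPTZ NOT NULL DEFAULT now(),
    last_attempted TIMESTAMPTZ
);

-- A worker resumes by claiming the oldest pending job
SELECT * FROM jobs
WHERE status = 'pending'
ORDER BY created_at
FOR UPDATE SKIP LOCKED
LIMIT 1;
```

Running the claim inside a transaction means a worker that crashes mid-job releases its lock, and the row's persisted status tells the next worker exactly where processing stopped.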

Step 3 — Implement Exponential Backoff Logic

When a job fails, we must wait before retrying to prevent database overload. We generate a delay based on the current retry count, using a tokio::time::sleep to enforce a pause before the next attempt.

```rust
use std::time::Duration;

pub fn calculate_delay(retry_count: i32) -> Duration {
    // Start with a 1-second delay and double it with each retry
    let base_duration = Duration::from_secs(1);
    let max_duration = Duration::from_secs(30);

    // Clamp the shift amount so the multiplier cannot overflow
    let exponent = retry_count.clamp(0, 31) as u32;
    let raw_delay = base_duration.saturating_mul(1u32 << exponent);
    let capped_delay = raw_delay.min(max_duration);

    // Add jitter to prevent thundering herd issues
    let jitter = Duration::from_millis(rand::random::<u64>() % 100);

    // Add the Durations directly: truncating to whole seconds
    // would silently drop the sub-second jitter
    capped_delay + jitter
}
```

Using exponential backoff instead of a fixed delay ensures that transient network issues resolve without overwhelming the system resources. The jitter component is critical for preventing multiple workers from retrying at the exact same second, which can cause spikes in database load.
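Ignoring the jitter term, the schedule this policy produces is deterministic and easy to verify by hand. A jitter-free sketch that makes the doubling and the cap visible:

```rust
use std::time::Duration;

// Jitter-free backoff, kept separate here so the schedule is
// deterministic: 1s, 2s, 4s, 8s, 16s, then capped at 30s
pub fn backoff(retry_count: u32) -> Duration {
    let base = Duration::from_secs(1);
    let max = Duration::from_secs(30);
    // checked_shl returns None once the shift would overflow,
    // so very high retry counts saturate instead of wrapping
    let multiplier = 1u32.checked_shl(retry_count).unwrap_or(u32::MAX);
    base.saturating_mul(multiplier).min(max)
}
```

Walking the schedule: retries 0 through 4 wait 1, 2, 4, 8, and 16 seconds; from retry 5 onward the uncapped value (32s and up) is clamped to the 30-second ceiling.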

Step 4 — Handle Permanent Failures in a DLQ

A job should not be retried infinitely if the error is irrecoverable. If the retry count exceeds a threshold, we transition the state to DeadLetter to prevent an infinite loop and allow operators to manually inspect or discard the job.

```rust
pub fn should_retry(job: &Job, _error: &Error) -> bool {
    // Once retries are exhausted, the caller transitions the job
    // to DeadLetter instead of scheduling another attempt. The
    // error parameter is a hook for classifying permanent failures.
    job.retry_count < MAX_RETRIES
}
```

This separation isolates error handling from success paths, adhering to the principle of separation of concerns. The DeadLetter state acts as a final repository for problematic jobs, ensuring the system doesn't block on them.
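The retry-versus-DLQ decision can be collapsed into a single pure function that returns the job's next status, keeping the policy testable in isolation. A sketch under assumed names (MAX_RETRIES and the permanent-error flag are illustrative; the enum mirrors Step 1):

```rust
const MAX_RETRIES: i32 = 5; // assumed threshold

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JobStatus {
    Pending,
    Running,
    Succeeded,
    Failed,
    DeadLetter,
}

/// Decide where a failed job goes next. `permanent_error` would be
/// derived from classifying the handler's error type.
pub fn next_status(retry_count: i32, permanent_error: bool) -> JobStatus {
    if permanent_error || retry_count >= MAX_RETRIES {
        JobStatus::DeadLetter
    } else {
        JobStatus::Failed // stays eligible for another retry
    }
}
```

Because the function takes plain values rather than touching the database, the failure policy can be unit-tested without spinning up storage.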

Takeaways

Building a durable job queue requires treating state as an external truth source rather than application memory. By defining a strict state machine and persisting it in a relational database, we ensure that no work is ever lost even if the worker process crashes. The retry logic with exponential backoff protects system health, while the dead letter queue allows for manual intervention on permanent failures. This pattern scales well for any background processing system that values correctness over speed. The separation of concerns—logic for success, logic for retry, logic for failure—ensures that the code remains maintainable and the architecture remains robust against transient failures.

Next

To expand on this pattern, consider adding concurrency controls to process jobs in parallel without contending on database write locks. Investigate how Postgres connection pooling interacts with long-running transactions when processing large payloads. Finally, review logging strategies for tracking job lifecycle events in a distributed system so that observability aligns with operational expectations. You might also implement a metrics pipeline to track average processing times per job type.


Part of the Architecture Patterns series.

Source: dev.to
