Building Distributed Data Processing with Spring Batch 6 + Spring Boot 4

When people first use Spring Batch, they usually start with a simple single-threaded job. That works for small datasets, but once data volume grows, throughput becomes the bottleneck.

In this sample project, I implemented a partitioned, multi-threaded Spring Batch pipeline to process sales records in parallel using a master/worker step model.

👉 Code repo: github.com/ykpraveen/spring-batch-sample

Spring Batch

At its core, Spring Batch is built around a few key abstractions:

  • Job: a complete batch workflow
  • Step: one phase of a job
  • ItemReader / ItemProcessor / ItemWriter: read-transform-write pipeline
  • Chunk processing: process N items in one transaction (chunkSize)

Why chunking matters

In chunk-oriented steps, Spring Batch reads and processes items until the chunk size is reached, then writes and commits in one transaction.

So with chunk(500):

  • 500 items are read/processed/written
  • one commit happens per chunk
  • a failure rolls back only the current chunk, and a restarted job can resume from the last committed chunk

This gives a good balance between:

  • too-small chunks (high transaction overhead)
  • too-large chunks (long transactions, higher rollback cost)
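To make the commit cadence concrete, here is a minimal plain-Java sketch of the chunk loop. It only imitates the read/buffer/write-and-commit rhythm; it is not Spring Batch's implementation, and the class and method names (ChunkLoopSketch, commitSizes) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: imitates the chunk-oriented loop, not Spring Batch internals.
final class ChunkLoopSketch {

    /** Returns the size of each "committed" chunk for the given item count. */
    static List<Integer> commitSizes(int totalItems, int chunkSize) {
        List<Integer> commits = new ArrayList<>();
        List<Integer> buffer = new ArrayList<>(chunkSize);
        for (int item = 1; item <= totalItems; item++) {
            buffer.add(item);                 // stands in for "read + process one item"
            if (buffer.size() == chunkSize) { // chunk is full
                commits.add(buffer.size());   // stands in for "write the chunk, then commit"
                buffer.clear();
            }
        }
        if (!buffer.isEmpty()) {              // final, partially filled chunk
            commits.add(buffer.size());
        }
        return commits;
    }

    public static void main(String[] args) {
        System.out.println(commitSizes(1250, 500)); // prints [500, 500, 250]
    }
}
```

With the sample's 5000 records and chunk(500), the same loop would commit 10 times if a single step processed everything; under partitioning, each worker runs this loop over its own slice.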

How Spring Batch scales

Spring Batch offers multiple scaling patterns:

  1. Multi-threaded Step: one step, concurrent chunk processing
  2. Partitioning: split input domain into partitions, each handled by a worker step
  3. Remote Chunking / Remote Partitioning: distribute work across processes/nodes

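This project uses partitioning with thread-pool execution: partitions run in parallel inside a single JVM, using the same master/worker structure that remote partitioning later spreads across nodes.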

How this project applies those concepts

Repository: spring-batch-sample

The architecture is:

  • A master step creates partitions (data ranges)
  • A worker step executes each partition
  • A ThreadPoolTaskExecutor runs workers concurrently

Key classes (see src/main/java in repo):

  • BatchConfiguration → job/step orchestration
  • SalesDataPartitioner → partition boundary logic
  • SalesDataProcessor → business transformation logic
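As a rough illustration of what "partition boundary logic" can look like, the sketch below splits an ID range into contiguous slices. It is a hypothetical stand-in, not the repository's SalesDataPartitioner: it assumes records have contiguous numeric IDs, and it returns plain arrays where a real Spring Batch partitioner would put the boundaries into one execution context per partition.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for partition boundary logic (assumes contiguous numeric IDs).
final class RangePartitionSketch {

    /** Splits the inclusive ID range [minId, maxId] into up to gridSize contiguous ranges. */
    static Map<String, long[]> partition(long minId, long maxId, int gridSize) {
        long total = maxId - minId + 1;
        long rangeSize = (total + gridSize - 1) / gridSize; // ceiling division
        Map<String, long[]> ranges = new LinkedHashMap<>();
        long start = minId;
        for (int i = 0; i < gridSize && start <= maxId; i++) {
            long end = Math.min(start + rangeSize - 1, maxId); // clamp the last range
            ranges.put("partition" + i, new long[] {start, end});
            start = end + 1;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // 5000 records over 8 partitions: 625 IDs each.
        partition(1, 5000, 8).forEach((name, r) ->
                System.out.println(name + ": " + r[0] + ".." + r[1]));
        // first line: partition0: 1..625, last line: partition7: 4376..5000
    }
}
```

Each worker would then read only the rows whose ID falls inside its own range, so no two workers touch the same record.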

Performance tuning used here

The sample uses:

  • gridSize: 8 (number of partitions)
  • Thread pool: corePoolSize=4, maxPoolSize=8
  • chunk size: 500
  • Sample input: 5000 records

Interpretation

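  • gridSize sets the number of partitions, i.e. the units of work that can run in parallel.
  • Thread pool size controls actual concurrent execution. ThreadPoolTaskExecutor only grows past corePoolSize once its queue is full, so with the default unbounded queue at most corePoolSize (here 4) workers run at the same time.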
  • Effective throughput depends on DB I/O, CPU, and item processing complexity.
  • Increasing partitions beyond available threads can still help load balancing, but with diminishing returns.
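The interaction between corePoolSize, maxPoolSize and the task queue is easy to misread, so here is a small sketch using the JDK's ThreadPoolExecutor, which Spring's ThreadPoolTaskExecutor wraps. The unbounded queue is an assumption that matches ThreadPoolTaskExecutor's default queueCapacity; the article does not say whether the sample overrides it. The class and method names are made up for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Illustrative only: shows how many threads a core=4/max=8 pool creates for 8 queued tasks.
final class PoolSizingSketch {

    /** Submits `tasks` blocking tasks and returns the largest number of threads the pool created. */
    static int largestPoolSize(int core, int max, int tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                core, max, 60L, TimeUnit.SECONDS,
                new LinkedBlockingQueue<>()); // unbounded queue (assumed default)
        CountDownLatch release = new CountDownLatch(1);
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                try {
                    release.await(); // hold the thread so all tasks overlap
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        int largest = pool.getLargestPoolSize(); // threads actually created so far
        release.countDown();                     // let every task finish
        pool.shutdown();
        return largest;
    }

    public static void main(String[] args) {
        System.out.println(largestPoolSize(4, 8, 8)); // prints 4
    }
}
```

Under that assumption, the 8 partitions run as two waves of 4. To actually reach 8 concurrent workers, one option is to raise corePoolSize to 8 or set a small bounded queueCapacity so the pool is allowed to grow.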

Database + metadata angle

Spring Batch is not just a processing framework; it is also a stateful execution framework.

It tracks job/step execution state in metadata, enabling:

  • restartability
  • execution history
  • failure diagnostics

In this sample, PostgreSQL stores both:

  • domain tables (sales_data, processed_data, processing_statistics)
  • Spring Batch's own metadata tables (the BATCH_* tables that record job/step executions and their execution contexts)

That combination is what makes batch jobs operationally reliable in real systems.


Run locally

1) Start PostgreSQL

docker compose up -d

2) Build and run the app

mvn clean install
mvn spring-boot:run

3) Trigger the batch job

curl -X POST http://localhost:8080/api/batch/start

4) Stop PostgreSQL

docker compose down

Why this pattern is useful in real projects

This design is a strong baseline for:

  • ETL and data migration
  • order/payment reconciliation
  • large-volume reporting prep
  • scheduled backend data shaping

You get:

  • clear separation of orchestration vs business logic
  • predictable transactional boundaries
  • scalable parallel execution
  • operational observability through batch metadata

Next extensions

If you want to evolve this sample toward production-grade scale:

  1. Add retry/skip policies for fault tolerance.
  2. Export job metrics (Micrometer + Prometheus/Grafana).
  3. Make partition strategy adaptive to dataset size.
  4. Move to remote partitioning for multi-node execution.
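For the third item, one simple way to make the partition count follow the dataset size is to derive gridSize from a target number of rows per partition and cap it. This is only a sketch of the idea: the names and the example target of 625 rows are assumptions, not something the sample implements.

```java
// Hypothetical helper: derives a partition count from the dataset size.
final class AdaptiveGridSizeSketch {

    /** Returns ceil(totalRows / targetRowsPerPartition), clamped to [1, maxPartitions]. */
    static int gridSize(long totalRows, long targetRowsPerPartition, int maxPartitions) {
        if (totalRows <= 0) {
            return 1; // nothing to split; keep a single partition
        }
        long needed = (totalRows + targetRowsPerPartition - 1) / targetRowsPerPartition;
        return (int) Math.max(1, Math.min(needed, maxPartitions));
    }

    public static void main(String[] args) {
        System.out.println(gridSize(5_000, 625, 32));     // prints 8
        System.out.println(gridSize(200, 625, 32));       // prints 1
        System.out.println(gridSize(1_000_000, 625, 32)); // prints 32
    }
}
```

The cap keeps very large inputs from producing more partitions than the thread pool and database connections can reasonably serve.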

If you're learning Spring Batch or designing high-throughput processing pipelines, this pattern is a solid starting point: simple enough to understand, realistic enough to extend.

Source: dev.to
