Building Distributed Data Processing with Spring Batch 6 + Spring Boot 4

When people first use Spring Batch, they usually start with a simple single-threaded job. That works for small datasets, but once data volume grows, throughput becomes the bottleneck.

In this sample project, I implemented a partitioned, multi-threaded Spring Batch pipeline to process sales records in parallel using a master/worker step model.

👉 Code repo: github.com/ykpraveen/spring-batch-sample

Spring Batch

At its core, Spring Batch is built around a few key abstractions:

Job: a complete batch workflow
Step: one phase of a job
ItemReader / ItemProcessor / ItemWriter: read-transform-write pipeline
Chunk processing: process N items in one transaction (chunkSize)

Why chunking matters

In chunk-oriented steps, Spring Batch reads and processes items until the chunk size is reached, then writes and commits in one transaction.

So with chunk(500):

500 items are read/processed/written
one commit happens per chunk
a failure rolls back only the current chunk, and a restartable reader can resume after the last committed chunk

This gives a good balance between:

too-small chunks (high transaction overhead)
too-large chunks (long transactions, higher rollback cost)
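
The chunk loop described above can be sketched in plain Java. This is not Spring Batch code: the class name, the `run` method, and the in-memory `sink` are hypothetical stand-ins for the reader, processor, writer, and transaction commit, kept only to show where the chunk boundary falls.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Plain-Java illustration of chunk-oriented processing; not the framework's implementation.
class ChunkLoopSketch {

    /**
     * Reads items one at a time, buffers up to chunkSize processed items,
     * then "writes" the buffer. Returns the size of every committed chunk.
     */
    static <I, O> List<Integer> run(Iterator<I> reader, Function<I, O> processor,
                                    int chunkSize, List<O> sink) {
        List<Integer> committedChunkSizes = new ArrayList<>();
        List<O> buffer = new ArrayList<>();
        while (reader.hasNext()) {
            buffer.add(processor.apply(reader.next()));
            // A chunk ends when it is full or when the input is exhausted.
            if (buffer.size() == chunkSize || !reader.hasNext()) {
                sink.addAll(buffer);                    // stands in for the writer
                committedChunkSizes.add(buffer.size()); // stands in for the commit
                buffer.clear();
            }
        }
        return committedChunkSizes;
    }

    public static void main(String[] args) {
        List<Integer> input = new ArrayList<>();
        for (int i = 1; i <= 1200; i++) {
            input.add(i);
        }
        List<String> written = new ArrayList<>();
        List<Integer> commits = run(input.iterator(), i -> "sale-" + i, 500, written);
        System.out.println(commits); // [500, 500, 200]
    }
}
```

With 1200 items and a chunk size of 500, the last chunk is a partial one, which is why the final commit covers only 200 items.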

How Spring Batch scales

Spring Batch offers multiple scaling patterns:

Multi-threaded Step: one step, concurrent chunk processing
Partitioning: split input domain into partitions, each handled by a worker step
Remote Chunking / Remote Partitioning: distribute work across processes/nodes

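This project uses partitioning with thread pool execution: all workers run inside a single JVM, so the parallelism is local, but the partition model is the same one that remote partitioning spreads across nodes.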

How this project applies those concepts

Repository: spring-batch-sample

The architecture is:

A master step creates partitions (data ranges)
A worker step executes each partition
A ThreadPoolTaskExecutor runs workers concurrently
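
The shape of that architecture can be sketched with JDK classes only. Everything here is an illustrative assumption rather than the repository's code: the `Partition` record, `runWorker`, and `runMaster` are hypothetical names, and a plain `ExecutorService` stands in for the ThreadPoolTaskExecutor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Plain-Java sketch of a master handing partitions to concurrent workers.
class MasterWorkerSketch {

    /** A partition is modeled as an inclusive id range (hypothetical shape). */
    record Partition(String name, long minId, long maxId) {}

    /** Worker: handles one partition and reports how many records it covered. */
    static long runWorker(Partition p) {
        return p.maxId() - p.minId() + 1; // placeholder for read/process/write
    }

    /** Master: submits every partition to the pool, then waits for all workers. */
    static long runMaster(List<Partition> partitions, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<CompletableFuture<Long>> results = new ArrayList<>();
            for (Partition p : partitions) {
                results.add(CompletableFuture.supplyAsync(() -> runWorker(p), pool));
            }
            long total = 0;
            for (CompletableFuture<Long> result : results) {
                total += result.join(); // blocks until that worker has finished
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Partition> partitions = List.of(
                new Partition("partition0", 1, 2500),
                new Partition("partition1", 2501, 5000));
        System.out.println(runMaster(partitions, 2)); // 5000
    }
}
```

The master does no record processing itself; it only fans work out and aggregates the results, which is the same division of labor the master and worker steps have.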

Key classes (see src/main/java in repo):

BatchConfiguration → job/step orchestration
SalesDataPartitioner → partition boundary logic
SalesDataProcessor → business transformation logic

Code area: src/main/java
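
To make "partition boundary logic" concrete, here is a minimal sketch of one common approach: splitting a contiguous id range into equal slices. It is an assumption about how such a partitioner could work, not a copy of SalesDataPartitioner; the `Range` record, the `partition0..N` names, and the premise that ids are dense and contiguous are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java sketch of range-based partition boundaries.
class RangePartitionSketch {

    /** Inclusive id range assigned to one worker (hypothetical shape). */
    record Range(long minId, long maxId) {}

    /** Splits the inclusive range [minId, maxId] into up to gridSize contiguous ranges. */
    static Map<String, Range> partition(long minId, long maxId, int gridSize) {
        long total = maxId - minId + 1;
        long base = total / gridSize;
        long remainder = total % gridSize; // the first `remainder` partitions get one extra id
        Map<String, Range> result = new LinkedHashMap<>();
        long start = minId;
        for (int i = 0; i < gridSize; i++) {
            long size = base + (i < remainder ? 1 : 0);
            if (size == 0) {
                break; // more partitions requested than ids available
            }
            result.put("partition" + i, new Range(start, start + size - 1));
            start += size;
        }
        return result;
    }

    public static void main(String[] args) {
        // 5000 ids over 8 partitions: every range covers 625 ids.
        partition(1, 5000, 8).forEach((name, range) -> System.out.println(name + " -> " + range));
    }
}
```

Each worker would then restrict its reader to its own range, so no two workers touch the same record.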

Performance tuning used here

The sample uses:

gridSize: 8 (number of partitions)
Thread pool: corePoolSize=4, maxPoolSize=8
chunk size: 500
Sample input: 5000 records
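
It helps to work out what these numbers imply per partition. The small calculation below assumes the 5000 records are split evenly across the 8 partitions, which the sample may or may not do exactly; `PartitionMath` and `chunkCount` are hypothetical names.

```java
// Worked arithmetic for the sample's settings, assuming an even split.
class PartitionMath {

    /** Ceiling division: number of data-carrying chunks needed for `records` items. */
    static long chunkCount(long records, int chunkSize) {
        return (records + chunkSize - 1) / chunkSize;
    }

    public static void main(String[] args) {
        long records = 5000;
        int gridSize = 8;
        int chunkSize = 500;

        long perPartition = records / gridSize;                        // 625
        long chunksPerPartition = chunkCount(perPartition, chunkSize); // 2 (500 + 125)
        long totalChunks = chunksPerPartition * gridSize;              // 16

        System.out.println(perPartition + " records and " + chunksPerPartition
                + " chunks per partition, " + totalChunks + " chunks overall");
    }
}
```

Under that assumption each partition writes one full chunk of 500 and one partial chunk of 125, so the job commits 16 data-carrying chunks where a single unpartitioned step would commit 10.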

Interpretation

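gridSize sets how many partitions (parallel work units) are created; on its own it does not decide how many run at once.
Thread pool size controls actual concurrent execution. With corePoolSize=4 and maxPoolSize=8, the pool grows beyond 4 threads only once its task queue is full, so the queue capacity matters as well.
Effective throughput depends on DB I/O, CPU, and item processing complexity.
Creating more partitions than threads can still help load balancing, because a thread that finishes early picks up the next queued partition, but the returns diminish.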
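
One detail worth checking when reading core and max pool sizes: a Java thread pool adds threads beyond the core size only when its queue rejects a task. The sketch below uses the JDK's `ThreadPoolExecutor` directly, which is what ThreadPoolTaskExecutor wraps. The article does not state the sample's queue capacity, so treat this as an illustration of the mechanism rather than a description of the sample's runtime behavior; `PoolGrowthSketch` and `threadsStarted` are hypothetical names.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Shows how the queue type decides whether a pool grows past its core size.
class PoolGrowthSketch {

    /** Submits `tasks` blocking tasks and reports how many threads the pool started. */
    static int threadsStarted(int core, int max, BlockingQueue<Runnable> queue, int tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(core, max, 60, TimeUnit.SECONDS, queue);
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 0; i < tasks; i++) {
                pool.execute(() -> {
                    try {
                        release.await(); // hold the thread so the pool stays busy
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            return pool.getPoolSize();
        } finally {
            release.countDown(); // let all tasks finish
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Unbounded queue: extra tasks wait in the queue, the pool stays at core size.
        System.out.println(threadsStarted(4, 8, new LinkedBlockingQueue<>(), 8)); // 4
        // No queueing: extra tasks force new threads up to the maximum.
        System.out.println(threadsStarted(4, 8, new SynchronousQueue<>(), 8));    // 8
    }
}
```

ThreadPoolTaskExecutor defaults to an effectively unbounded queue, so if the queue capacity is left at its default, 8 partitions would run 4 at a time. Setting corePoolSize to the desired concurrency, or bounding the queue, changes that.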

Database + metadata angle

Spring Batch is not just a processing framework; it is also a stateful execution framework.

It tracks job/step execution state in metadata, enabling:

restartability
execution history
failure diagnostics

In this sample, PostgreSQL stores both:

domain tables (sales_data, processed_data, processing_statistics)
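batch execution metadata and execution contexts managed by Spring Batch (by default the BATCH_JOB_EXECUTION, BATCH_STEP_EXECUTION, and related BATCH_* tables)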

That combination is what makes batch jobs operationally reliable in real systems.
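
Why persisted state enables restartability can be shown with a toy model. In the sketch below an in-memory map stands in for the metadata tables, and a restart consults it to skip partitions that already completed. The real metadata is far richer (execution contexts, counts, timestamps), and `RestartSketch`, `Status`, and `runJob` are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Toy model of restartability: recorded status decides what a rerun executes.
class RestartSketch {

    enum Status { COMPLETED, FAILED }

    /**
     * Runs every partition not yet recorded as COMPLETED and stores the outcome.
     * The worker returns true on success. Returns the partitions executed in this run.
     */
    static List<String> runJob(List<String> partitions, Map<String, Status> metadata,
                               Predicate<String> worker) {
        List<String> executed = new ArrayList<>();
        for (String name : partitions) {
            if (metadata.get(name) == Status.COMPLETED) {
                continue; // finished in an earlier run, nothing to redo
            }
            executed.add(name);
            metadata.put(name, worker.test(name) ? Status.COMPLETED : Status.FAILED);
        }
        return executed;
    }

    public static void main(String[] args) {
        List<String> partitions = List.of("partition0", "partition1", "partition2");
        Map<String, Status> metadata = new LinkedHashMap<>();

        // First run: partition1 fails, the other two complete.
        runJob(partitions, metadata, name -> !name.equals("partition1"));

        // Restart: only the failed partition is executed again.
        System.out.println(runJob(partitions, metadata, name -> true)); // [partition1]
    }
}
```

Without the recorded status, a rerun would have no way to tell finished work from unfinished work and would have to process everything again.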

Run locally

1) Start PostgreSQL

docker compose up -d

2) Build and run the app

mvn clean install
mvn spring-boot:run

3) Trigger the batch job

curl -X POST http://localhost:8080/api/batch/start

4) Stop PostgreSQL

docker compose down

Why this pattern is useful in real projects

This design is a strong baseline for:

ETL and data migration
order/payment reconciliation
large-volume reporting prep
scheduled backend data shaping

You get:

clear separation of orchestration vs business logic
predictable transactional boundaries
scalable parallel execution
operational observability through batch metadata

Next extensions

If you want to evolve this sample toward production-grade scale:

Add retry/skip policies for fault tolerance.
Export job metrics (Micrometer + Prometheus/Grafana).
Make partition strategy adaptive to dataset size.
Move to remote partitioning for multi-node execution.
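
As a rough picture of what a skip policy buys you, the sketch below tolerates a limited number of bad items before failing. It is plain Java written for illustration, not the Spring Batch fault-tolerance API; `SkipPolicySketch`, `processWithSkips`, and the `skipLimit` parameter are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Plain-Java sketch of skip semantics: tolerate a bounded number of bad items.
class SkipPolicySketch {

    /**
     * Applies the processor to each item. A failing item is recorded in `skipped`
     * until skipLimit is reached; the next failure after that is rethrown.
     */
    static <I, O> List<O> processWithSkips(List<I> items, Function<I, O> processor,
                                           int skipLimit, List<I> skipped) {
        List<O> processed = new ArrayList<>();
        for (I item : items) {
            try {
                processed.add(processor.apply(item));
            } catch (RuntimeException e) {
                if (skipped.size() >= skipLimit) {
                    throw e; // limit exhausted: let the step fail
                }
                skipped.add(item);
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> skipped = new ArrayList<>();
        List<Integer> parsed = processWithSkips(
                List.of("10", "oops", "30"), s -> Integer.parseInt(s), 1, skipped);
        System.out.println(parsed + " skipped=" + skipped); // [10, 30] skipped=[oops]
    }
}
```

The limit is the important part: one malformed record should not abort a run, but a flood of them usually signals a problem that deserves a failed job.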

If you’re learning Spring Batch or designing high-throughput processing pipelines, this pattern is a solid starting point: simple enough to understand, realistic enough to extend.