When people first use Spring Batch, they usually start with a simple single-threaded job. That works for small datasets, but once data volume grows, throughput becomes the bottleneck.
In this sample project, I implemented a partitioned, multi-threaded Spring Batch pipeline to process sales records in parallel using a master/worker step model.
Code repo: github.com/ykpraveen/spring-batch-sample
Spring Batch
At its core, Spring Batch is built around a few key abstractions:
- Job: a complete batch workflow
- Step: one phase of a job
- ItemReader / ItemProcessor / ItemWriter: read-transform-write pipeline
- Chunk processing: process N items in one transaction (chunkSize)
### Why chunking matters
In chunk-oriented steps, Spring Batch reads and processes items until the chunk size is reached, then writes and commits in one transaction.
So with chunk(500):
- 500 items are read/processed/written
- one commit happens per chunk
- a failure rolls back only the current chunk, so a restart can resume from the last committed chunk
This gives a good balance between:
- too-small chunks (high transaction overhead)
- too-large chunks (long transactions, higher rollback cost)
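To make the commit arithmetic concrete, here is a minimal plain-Java sketch of a chunk loop. It is not Spring Batch's implementation (the framework also manages real transactions, listeners, and restart state); `ChunkLoopSketch` and its `run` method are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

final class ChunkLoopSketch {

    /** Runs a read-process-write loop and returns the number of commits (one per chunk). */
    static <I, O> int run(Iterator<I> reader, Function<I, O> processor,
                          Consumer<List<O>> writer, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        int commits = 0;
        List<O> chunk = new ArrayList<>(chunkSize);
        while (reader.hasNext()) {
            chunk.add(processor.apply(reader.next()));
            // Flush when the chunk is full or the input is exhausted.
            if (chunk.size() == chunkSize || !reader.hasNext()) {
                writer.accept(new ArrayList<>(chunk)); // stands in for write + commit
                commits++;
                chunk.clear();
            }
        }
        return commits;
    }

    public static void main(String[] args) {
        List<Integer> records = new ArrayList<>();
        for (int i = 0; i < 5000; i++) {
            records.add(i);
        }
        int commits = ChunkLoopSketch.<Integer, Integer>run(
                records.iterator(), amount -> amount * 2, chunk -> { }, 500);
        System.out.println("commits = " + commits); // 5000 / 500 = 10
    }
}
```

With 5000 records, a chunk size of 500 gives 10 commits, while a chunk size of 1 would give 5000. That difference is the transaction overhead the first bullet refers to.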
How Spring Batch scales
Spring Batch offers multiple scaling patterns:
- Multi-threaded Step: one step, concurrent chunk processing
- Partitioning: split input domain into partitions, each handled by a worker step
- Remote Chunking / Remote Partitioning: distribute work across processes/nodes
This project uses partitioning with thread pool execution: partitions run in parallel on local threads rather than across nodes.
How this project applies those concepts
Repository: spring-batch-sample
The architecture is:
- A master step creates partitions (data ranges)
- A worker step executes each partition
- A ThreadPoolTaskExecutor runs workers concurrently
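The master/worker split can be sketched without any Spring classes. The following is an illustration under assumptions, not the repository's code. It assumes records have contiguous numeric IDs, and `PartitionSketch`, `Range`, `split`, and `runWorkers` are hypothetical names. A plain `ExecutorService` stands in for the `ThreadPoolTaskExecutor`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class PartitionSketch {

    /** Inclusive ID range handled by one worker. */
    static final class Range {
        final long minId;
        final long maxId;

        Range(long minId, long maxId) {
            this.minId = minId;
            this.maxId = maxId;
        }
    }

    /** "Master" side: split [minId, maxId] into at most gridSize contiguous ranges. */
    static List<Range> split(long minId, long maxId, int gridSize) {
        if (gridSize <= 0 || maxId < minId) {
            throw new IllegalArgumentException("invalid range or gridSize");
        }
        long total = maxId - minId + 1;
        long size = (total + gridSize - 1) / gridSize; // ceiling division
        List<Range> ranges = new ArrayList<>();
        for (long start = minId; start <= maxId; start += size) {
            ranges.add(new Range(start, Math.min(start + size - 1, maxId)));
        }
        return ranges;
    }

    /** Runs one worker task per range on a fixed pool and sums the per-partition counts. */
    static long runWorkers(List<Range> ranges, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (Range r : ranges) {
                // Stand-in for a worker step: it only counts the IDs in its range.
                Callable<Long> worker = () -> r.maxId - r.minId + 1;
                results.add(pool.submit(worker));
            }
            long processed = 0;
            for (Future<Long> result : results) {
                processed += result.get();
            }
            return processed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a worker failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Range> ranges = split(1, 5000, 8);
        System.out.println(ranges.size() + " partitions, " + runWorkers(ranges, 4) + " records");
    }
}
```

Splitting IDs 1 to 5000 into 8 partitions yields ranges of 625 IDs each. In a real partitioned step each range would typically be passed to the worker through its step execution context, and the worker would run a full chunk-oriented read-process-write cycle over it.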
Key classes (see src/main/java in repo):
- BatchConfiguration: job/step orchestration
- SalesDataPartitioner: partition boundary logic
- SalesDataProcessor: business transformation logic
Code area: src/main/java
Performance tuning used here
The sample uses:
- gridSize: 8 (number of partitions)
- Thread pool: corePoolSize=4, maxPoolSize=8
- chunk size: 500
- Sample input: 5000 records
Interpretation
- gridSize controls parallel work units.
- Thread pool size controls actual concurrent execution.
- Effective throughput depends on DB I/O, CPU, and item processing complexity.
- Increasing partitions beyond available threads can still help load balancing, but with diminishing returns.
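One detail worth checking when reading these numbers: ThreadPoolTaskExecutor delegates to java.util.concurrent.ThreadPoolExecutor, which starts threads beyond corePoolSize only once its queue is full, and ThreadPoolTaskExecutor's default queue capacity is Integer.MAX_VALUE. Whether the sample sets a queue capacity is not stated here, so treat the following as a sketch of the general behaviour rather than a claim about this project. `PoolSizingSketch` and `threadsStarted` are hypothetical names, and the queue selection mirrors how ThreadPoolTaskExecutor picks its queue (a LinkedBlockingQueue for a positive capacity, otherwise a SynchronousQueue).

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class PoolSizingSketch {

    /** Submits `tasks` blocking tasks and returns how many threads the pool started. */
    static int threadsStarted(int core, int max, int queueCapacity, int tasks) {
        BlockingQueue<Runnable> queue = queueCapacity > 0
                ? new LinkedBlockingQueue<>(queueCapacity)
                : new SynchronousQueue<>();
        ThreadPoolExecutor pool = new ThreadPoolExecutor(core, max, 60, TimeUnit.SECONDS, queue);
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 0; i < tasks; i++) {
                // Each task blocks until released, so no thread becomes free early.
                pool.execute(() -> {
                    try {
                        release.await();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            return pool.getPoolSize();
        } finally {
            release.countDown(); // let all tasks finish
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // 8 partitions, core=4, max=8, unbounded queue: 4 run, 4 wait in the queue.
        System.out.println(threadsStarted(4, 8, Integer.MAX_VALUE, 8)); // 4
        // Same pool with a queue of 2: 4 core threads, 2 queued, 2 extra threads.
        System.out.println(threadsStarted(4, 8, 2, 8)); // 6
        // No queue (capacity 0): the pool grows to maxPoolSize.
        System.out.println(threadsStarted(4, 8, 0, 8)); // 8
    }
}
```

If the queue is unbounded, 8 partitions on this pool run 4 at a time, and maxPoolSize has no effect. Raising corePoolSize, or setting a small queue capacity, is what changes the real concurrency.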
Database + metadata angle
Spring Batch is not just a processing framework; it is also a stateful execution framework.
It tracks job/step execution state in metadata, enabling:
- restartability
- execution history
- failure diagnostics
In this sample, PostgreSQL stores both:
- domain tables (sales_data, processed_data, processing_statistics)
- batch execution context/metadata managed by Spring Batch
That combination is what makes batch jobs operationally reliable in real systems.
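The restart behaviour can be illustrated with a small plain-Java simulation. This is a sketch of the idea only: a `Map` stands in for the persisted execution context, and `RestartSketch`, `run`, the `next.index` key, and `failAtIndex` are hypothetical names, not Spring Batch APIs.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RestartSketch {

    /**
     * Processes items in chunks and checkpoints the index of the next unprocessed
     * item after every successful chunk. Throws when failAtIndex is reached.
     * Returns the number of items processed in this run.
     */
    static int run(List<String> items, int chunkSize,
                   Map<String, Integer> context, int failAtIndex) {
        int start = context.getOrDefault("next.index", 0); // resume point from an earlier run
        int processed = 0;
        for (int from = start; from < items.size(); from += chunkSize) {
            int to = Math.min(from + chunkSize, items.size());
            for (int i = from; i < to; i++) {
                if (i == failAtIndex) {
                    // The current chunk is abandoned; the checkpoint is not advanced.
                    throw new IllegalStateException("simulated failure at item " + i);
                }
            }
            context.put("next.index", to); // commit and checkpoint together
            processed += to - from;
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> items = Collections.nCopies(10, "row");
        Map<String, Integer> context = new HashMap<>();
        try {
            run(items, 3, context, 7); // fails inside the third chunk (items 6..8)
        } catch (IllegalStateException e) {
            System.out.println("failed, checkpoint = " + context.get("next.index")); // 6
        }
        System.out.println("restart processed " + run(items, 3, context, -1)); // 4
    }
}
```

The second run skips the six items that were already committed and processes only the remaining four. That is the practical benefit of keeping execution state in a database rather than in memory.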
Run locally
1) Start PostgreSQL
docker compose up -d
2) Build and run the app
mvn clean install
mvn spring-boot:run
3) Trigger the batch job
curl -X POST http://localhost:8080/api/batch/start
4) Stop PostgreSQL
docker compose down
Why this pattern is useful in real projects
This design is a strong baseline for:
- ETL and data migration
- order/payment reconciliation
- large-volume reporting prep
- scheduled backend data shaping
You get:
- clear separation of orchestration vs business logic
- predictable transactional boundaries
- scalable parallel execution
- operational observability through batch metadata
Next extensions
If you want to evolve this sample toward production-grade scale:
- Add retry/skip policies for fault tolerance.
- Export job metrics (Micrometer + Prometheus/Grafana).
- Make partition strategy adaptive to dataset size.
- Move to remote partitioning for multi-node execution.
If you're learning Spring Batch or designing high-throughput processing pipelines, this pattern is a solid starting point: simple enough to understand, realistic enough to extend.