Your CEO wants to see last night's revenue on a dashboard when she arrives at work. Your fraud team needs to block a suspicious transaction within 2 seconds of it occurring. Both are data problems — but they require completely different solutions. The choice between batch and streaming is fundamentally about one question: how stale can your data be before it stops being useful?
The analogy: doing laundry
Imagine you work at a laundromat. Batch processing is like waiting until all the machines are full before running them — you collect dirty clothes all day and do one big wash at 2am. Streaming is like washing each item the moment a customer drops it off.
Both approaches get clothes clean. But they require completely different equipment, workflows, staffing, and costs. A 2am overnight wash cycle is far cheaper and simpler to run than a 24/7 instant-wash operation. The question is which one your customers actually need.
What is batch processing?
In a batch pipeline, data accumulates over a period of time and is processed in one scheduled run — typically hourly, nightly, or weekly. The pipeline starts, does its work, and stops.
Common tools: Apache Airflow (orchestration), dbt (SQL transformations), Apache Spark in batch mode, AWS Glue, Google Cloud Dataflow in batch mode.
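The core shape of every batch job is the same regardless of tooling: read everything that accumulated, transform it in one pass, write the result, stop. A minimal sketch in plain Python — the record fields here are made up for illustration, and amounts are in integer cents to avoid float rounding:

```python
from datetime import date

def run_nightly_batch(orders):
    """One scheduled run: read the day's accumulated orders,
    aggregate them, and return summary rows. Then the job stops."""
    # "Transform": sum revenue per day in a single pass over the full dataset.
    totals = {}
    for order in orders:
        totals[order["day"]] = totals.get(order["day"], 0) + order["amount_cents"]
    # "Write": here we just return the result; a real job would
    # load it into a warehouse table.
    return totals

# Yesterday's accumulated events, processed in one go at 2am.
orders = [
    {"day": date(2024, 5, 1), "amount_cents": 1999},
    {"day": date(2024, 5, 1), "amount_cents": 500},
]
print(run_nightly_batch(orders))  # {datetime.date(2024, 5, 1): 2499}
```

An orchestrator like Airflow adds scheduling, retries, and dependencies around this core, but the read-transform-write shape stays the same.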
Batch is the right choice when you need:
- Nightly revenue reports and financial summaries
- Monthly invoicing or billing runs
- Historical data backfills or migrations
- Training machine learning models on past data
- Any reporting where data from a few hours ago is “fresh enough”
The vast majority of analytics workloads at growing companies fall into this category. A dashboard showing yesterday’s sales is usually enough — nobody needs to know the exact revenue figure from 4 minutes ago to make a business decision.
What is stream processing?
In a streaming pipeline, each event is processed the moment it arrives — or within seconds. Instead of a scheduled job that starts and stops, a streaming system runs continuously, consuming events one by one (or in tiny micro-batches of a few seconds).
Common tools: Apache Kafka (message broker/stream backbone), Apache Flink, Kafka Streams, Spark Structured Streaming, AWS Kinesis.
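The structural difference from batch is visible even in a toy sketch: a streaming consumer is a loop that never finishes, reacting to each event as it arrives. Here a plain Python iterator stands in for a Kafka topic, and the fraud threshold is a made-up example:

```python
def consume(stream, handler):
    """A streaming pipeline never 'finishes': it loops over events
    as they arrive and handles each one immediately."""
    for event in stream:  # in production this loop blocks on a broker, e.g. Kafka
        handler(event)

alerts = []

def flag_suspicious(event):
    # React within seconds of the event occurring.
    if event["amount_cents"] > 1_000_000:
        alerts.append(event["id"])  # e.g. notify the fraud team

consume(
    iter([
        {"id": "txn-1", "amount_cents": 5_000},
        {"id": "txn-2", "amount_cents": 9_999_999},
    ]),
    flag_suspicious,
)
print(alerts)  # ['txn-2']
```

The simplicity here is deceptive — it's everything around this loop (ordering, late data, crashes, state) that makes streaming expensive, as the next section covers.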
Streaming is genuinely necessary when you need:
- Fraud detection — block a bad transaction within 2 seconds
- Real-time dashboards — live user counts, live order tracking
- IoT sensor monitoring — alert when a temperature exceeds a threshold
- Operational alerting — notify on-call when error rate spikes
- Live personalisation — recommend based on what a user did 30 seconds ago
The tradeoff: why streaming costs more
Here’s the part most engineers don’t explain clearly enough: streaming is significantly harder and more expensive to build and operate than batch. Not a little — quite a lot.
With batch, your pipeline runs once, does its work, and turns off. With streaming, you’re running a system 24/7. You must now handle:
- Message ordering — events can arrive out of sequence. Does it matter?
- Late-arriving data — what if an event from 2 hours ago shows up now?
- Exactly-once processing — what happens when your consumer crashes mid-way? Reprocess and you risk duplicates; skip ahead and you risk losing events
- Consumer group management — multiple consumers reading the same Kafka topic
- Stateful computations — “count unique users in the last 5 minutes” requires maintaining state across events
- Schema evolution — what if the event format changes?
None of these problems exist in batch. Batch is simple: read data, transform it, write it. Done. Every one of these streaming concerns requires additional engineering and ongoing operational overhead.
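The "stateful computations" point is worth making concrete. "Count unique users in the last 5 minutes" forces a streaming job to carry state across events — something a batch job never needs, since it sees all the data at once. A minimal sliding-window sketch (timestamps in epoch seconds, user IDs invented):

```python
from collections import deque

class UniqueUsersWindow:
    """State a streaming job must maintain: which users were seen
    in the last `window_seconds`."""

    def __init__(self, window_seconds=300):
        self.window = window_seconds
        self.events = deque()  # (timestamp, user_id), oldest first

    def observe(self, ts, user_id):
        self.events.append((ts, user_id))

    def unique_users(self, now):
        # Evict events that have aged out of the window before counting.
        while self.events and self.events[0][0] < now - self.window:
            self.events.popleft()
        return len({user for _, user in self.events})

w = UniqueUsersWindow(window_seconds=300)
w.observe(0, "alice")
w.observe(200, "bob")
w.observe(400, "alice")
# At t=450, the event at t=0 has aged out; bob (t=200) and alice (t=400) remain.
print(w.unique_users(now=450))  # 2
```

Now multiply this by crash recovery (the deque must survive a restart), out-of-order arrival, and millions of users, and the operational gap between batch and streaming becomes clear.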
The middle ground: micro-batching
Most teams don’t actually need true streaming. They need near real-time — data that’s 5 minutes old instead of 12 hours old. That’s micro-batching: running your batch job every 5–15 minutes rather than nightly.
Spark Structured Streaming and dbt incremental models can run in micro-batch mode. You get most of the freshness benefit of streaming at a fraction of the operational complexity. No Kafka, no consumer groups, no watermarks. Just a very fast batch job.
The micro-batch sweet spot: If your stakeholders ask for “near real-time” data, try 15-minute batch first. In most cases, nobody can tell the difference — and you’ve saved yourself months of streaming complexity.
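Mechanically, a micro-batch job is just a batch job that remembers where it left off. Instead of Kafka offsets, an ordinary "last processed timestamp" cursor provides the incrementality — the same pattern dbt incremental models use. A sketch under those assumptions (row fields invented):

```python
def run_micro_batch(all_rows, cursor):
    """Invoked every 5-15 minutes by an ordinary scheduler.
    `cursor` is the highest timestamp processed by the previous run."""
    new_rows = [r for r in all_rows if r["ts"] > cursor]      # read only the delta
    total = sum(r["amount_cents"] for r in new_rows)          # transform just the delta
    new_cursor = max((r["ts"] for r in new_rows), default=cursor)
    return total, new_cursor  # persist new_cursor for the next run

rows = [{"ts": 1, "amount_cents": 100}, {"ts": 2, "amount_cents": 250}]
total, cursor = run_micro_batch(rows, cursor=0)    # first run sees both rows
print(total, cursor)   # 350 2
total, cursor = run_micro_batch(rows, cursor=cursor)  # next run: nothing new
print(total, cursor)   # 0 2
```

No broker, no consumer groups, no always-on process — just a fast batch job and one persisted integer.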
Going deeper: watermarks and exactly-once
If you do go streaming, two concepts you’ll encounter constantly:
Watermarks tell your system how late is too late. If a mobile app event was generated at 10:00am but arrives at your consumer at 12:00pm (because the user was offline), do you still process it? A watermark is the cutoff — events past it are dropped or handled as exceptions. Get this wrong and you’ll see mysterious gaps in your data where late events were silently discarded.
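The cutoff logic itself is simple; real systems like Flink derive the watermark from observed event times rather than the wall clock, but a wall-clock sketch shows the idea. Note that late events are routed aside for inspection, not silently dropped — exactly the failure mode described above:

```python
def apply_watermark(events, now, allowed_lateness):
    """Split events into on-time and too-late relative to a watermark.
    Anything behind the watermark is set aside, not silently discarded."""
    watermark = now - allowed_lateness
    on_time = [e for e in events if e["event_time"] >= watermark]
    late = [e for e in events if e["event_time"] < watermark]
    return on_time, late

# Times in epoch seconds: one event generated at 10:00, one at ~11:57,
# both arriving at 12:00 with an allowed lateness of one hour.
events = [{"event_time": 36_000}, {"event_time": 43_000}]
on_time, late = apply_watermark(events, now=43_200, allowed_lateness=3_600)
print(len(on_time), len(late))  # 1 1
```

Routing late events to a side output (a "dead letter" table you can audit) is what turns mysterious gaps into visible, debuggable exceptions.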
Exactly-once semantics means each event is processed precisely once, even if your system crashes mid-run. Both Flink and Kafka Streams support this, but it adds overhead. If your use case can tolerate duplicates (“at-least-once”) — like counting page views, where a few extra counts don’t matter — you can skip this complexity. If you’re processing financial transactions, you cannot.
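One common middle path, sketched below with invented transaction records: keep at-least-once delivery from the broker, but deduplicate on a stable event ID so that reprocessing the same event is harmless. This gives exactly-once *effects* through idempotency rather than through broker-level transactions:

```python
def process_with_dedup(events, seen_ids, ledger):
    """At-least-once delivery means the same event can arrive twice
    (e.g. replayed after a consumer crash). Deduplicating on a stable
    event id makes processing idempotent, so replays can't double-charge."""
    for e in events:
        if e["id"] in seen_ids:
            continue  # duplicate redelivery: skip it
        seen_ids.add(e["id"])
        ledger[e["account"]] = ledger.get(e["account"], 0) + e["cents"]

ledger, seen = {}, set()
# The broker redelivers txn-1 after a crash; the ledger stays correct.
process_with_dedup(
    [{"id": "txn-1", "account": "a", "cents": 500},
     {"id": "txn-1", "account": "a", "cents": 500}],
    seen, ledger)
print(ledger)  # {'a': 500}
```

In production the `seen_ids` set must itself be durable and updated atomically with the ledger, which is where the real engineering effort lives.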
How to decide: the three-question framework
Ask yourself:
- How stale can this data be before it stops being useful? If the answer is “a day is fine,” nightly batch wins.
- What’s the real cost of being 15 minutes late? For most reporting use cases, nothing bad happens. For fraud detection or live trading, everything breaks.
- Does delayed or incorrect data cause actual harm? Financial, legal, or safety-critical harm justifies streaming. A slightly stale KPI dashboard does not.
Most growing companies will answer “batch” or “micro-batch” for 90% of their use cases. Reserve streaming for the 10% where freshness is genuinely business-critical. Start simple. Upgrade when the evidence is clear.