Real-Time SQL on Streams: Materialize, RisingWave & Flink SQL Compared

real-time sql used to mean "spin up a Java team and learn the Kafka Streams DSL." In 2026 it means typing CREATE MATERIALIZED VIEW top_products AS SELECT product_id, SUM(amount) FROM orders GROUP BY product_id and watching the answer update in milliseconds as orders flow in. Four engines made that shift real: Materialize, RisingWave, Apache Flink SQL, and ksqlDB — same dialect, very different tradeoffs.

This guide is the cheat sheet for picking among them. It walks through what each engine is good at, the incremental view maintenance trick that lets a 1B-row aggregation update in microseconds, the watermark contract that keeps event-time semantics honest, the exactly-once delivery story (source + engine + sink all cooperating), and three end-to-end reference architectures with cost and ops profiles. Each section pairs a teaching block with a senior-interview-grade Solution Tail — code, step-by-step trace, output table, then a concept-by-concept breakdown of why it works.

When you want hands-on reps immediately after reading, drill the streaming practice library →, rehearse on real-time analytics problems →, and stack the time-series muscles with sliding window drills →.

On this page

Why streaming SQL finally matters in 2026
The streaming-SQL engines matrix — what each is good at
Incremental view maintenance — the magic behind streaming SQL
Watermarks, event time vs ingestion time, late data
Reference architectures — Kafka → Flink, Kafka → Materialize, RisingWave end-to-end
Cheat sheet — streaming SQL recipes
Frequently asked questions
Practice on PipeCode

1. Why streaming SQL finally matters in 2026

The decade-long shift from batch jobs to long-running materialised views you query like a database

The one-sentence invariant: streaming SQL is the model where every query is a long-running subscription that keeps its answer fresh as the underlying data changes — not a one-shot scan over a frozen snapshot. Once you internalise that the query is the maintained view, the entire family of "should this be a Spark job or a streaming view?" decisions resolves to a single question — "how stale can the answer be?"

The three eras in one paragraph.

Batch era (2010-2016). Hadoop MapReduce, Hive, then early Spark. The answer was hours stale by design.
Micro-batch era (2017-2022). Spark Structured Streaming, Kafka Connect, dbt on warehouses with hourly schedules. The answer was minutes stale and pretended to be live.
True streaming era (2023+). Materialize, RisingWave, Flink SQL — CREATE MATERIALIZED VIEW as a streaming primitive. The answer is millisecond-fresh by construction.

Where streaming SQL actually fits.

Operational dashboards. Sub-second freshness on top-K leaderboards, queue depth, latency p99, error rates.
Alerting. "Page me when error rate over a 5-minute sliding window crosses 1%" is a SQL query, not a service.
Feature serving. Online ML features (rolling means, last-N events, session counts) that must match the offline training distribution.
Fraud detection. Sessionised joins between transaction and device streams, evaluated on every event.
CDC fan-out. A Postgres source change propagates to Iceberg, Snowflake, Kafka, and a Slack alert in one materialized view per sink.

What interviewers listen for.

Do you say "streaming SQL is a subscription, not a query" when asked the difference from batch? — senior signal.
Do you reach for incremental view maintenance as the cost-model phrase? — required answer.
Do you mention watermarks as the contract between event-time and the operator? — required answer.
Do you call out Spark Structured Streaming as micro-batch, not true streaming? — senior signal.
Do you separate exactly-once at the source, engine, and sink rather than as a single property? — staff signal.

The 2026 reality.

Materialize is the iteration leader — Postgres wire-compatible, in-memory differential dataflow, one-cluster start.
RisingWave is the OSS economics leader — Rust, distributed, S3-tiered state, Apache-2.0 license, one binary.
Flink SQL is still the scale leader — JVM, distributed, RocksDB state, the de-facto choice for petabyte streams.
ksqlDB is the deprecation cautionary tale — Confluent shipped end-of-life notices through 2025 in favour of Flink.

Worked example — the same "top-10 products" query across all four engines

Detailed explanation. A common interview opener: "show me the same top-10-products-by-revenue-in-the-last-hour query in Materialize, RisingWave, Flink SQL, and ksqlDB and call out the dialect differences." The query itself is nearly identical — the differences are in source declarations, watermark syntax, and refresh semantics.

Question. Write the streaming-SQL definition of top_products_last_hour (top 10 products by revenue in the trailing 60-minute window) in each of the four engines. Highlight where the syntax diverges.

Input. A Kafka topic orders with schema (order_id BIGINT, product_id BIGINT, amount DECIMAL, event_time TIMESTAMP).

Code.

-- Materialize
CREATE SOURCE orders FROM KAFKA BROKER 'kafka:9092' TOPIC 'orders'
    FORMAT JSON
    INCLUDE TIMESTAMP AS event_time;

CREATE MATERIALIZED VIEW top_products_last_hour AS
SELECT product_id,
       SUM(amount) AS revenue
FROM orders
WHERE event_time >= mz_now() - INTERVAL '1 hour'
GROUP BY product_id
ORDER BY revenue DESC
LIMIT 10;

-- RisingWave
CREATE SOURCE orders (
    order_id BIGINT, product_id BIGINT, amount DECIMAL, event_time TIMESTAMP,
    WATERMARK FOR event_time AS event_time - INTERVAL '30 SECONDS'
) WITH (connector = 'kafka', topic = 'orders', properties.bootstrap.server = 'kafka:9092')
ROW FORMAT JSON;

CREATE MATERIALIZED VIEW top_products_last_hour AS
SELECT product_id, SUM(amount) AS revenue
FROM TUMBLE(orders, event_time, INTERVAL '1 HOUR')
GROUP BY product_id, window_start
ORDER BY revenue DESC
LIMIT 10;

-- Flink SQL
CREATE TABLE orders (
    order_id BIGINT, product_id BIGINT, amount DECIMAL, event_time TIMESTAMP(3),
    WATERMARK FOR event_time AS event_time - INTERVAL '30' SECOND
) WITH ('connector' = 'kafka', 'topic' = 'orders', 'format' = 'json',
        'properties.bootstrap.servers' = 'kafka:9092');

CREATE VIEW top_products_last_hour AS
SELECT product_id, SUM(amount) AS revenue
FROM TABLE(TUMBLE(TABLE orders, DESCRIPTOR(event_time), INTERVAL '1' HOUR))
GROUP BY product_id, window_start, window_end
ORDER BY revenue DESC
LIMIT 10;

-- ksqlDB (legacy — being deprecated)
CREATE STREAM orders (order_id BIGINT, product_id BIGINT, amount DECIMAL, event_time BIGINT)
    WITH (kafka_topic='orders', value_format='JSON', timestamp='event_time');

CREATE TABLE top_products_last_hour AS
SELECT product_id, SUM(amount) AS revenue
FROM orders
WINDOW TUMBLING (SIZE 1 HOUR)
GROUP BY product_id
EMIT CHANGES;

Step-by-step explanation.

Materialize treats the source as a long-running view; mz_now() is the engine's notion of "now" and the WHERE-clause filter implicitly maintains a sliding window. No watermark required for the simple case.
RisingWave and Flink SQL both use a WATERMARK FOR ... AS event_time - INTERVAL '30 SECONDS' declaration — the operator promises "I will not emit events with timestamp more than 30 seconds older than the max seen."
RisingWave's TUMBLE is a tabular function call; Flink SQL's is a polymorphic table function (PTF). Both produce the same tumbling-window semantics, but the polymorphic-table-function call site is more verbose.
ksqlDB uses the older WINDOW TUMBLING (SIZE 1 HOUR) clause and a special EMIT CHANGES marker to indicate the view should refresh continuously. The dialect is the most Kafka-coupled and the least portable.
All four queries produce conceptually the same "top 10 by revenue in the trailing hour" answer. The choice of engine drives the operational story (state size, ops burden, scalability), not the SQL.

Output (every engine).

product_id	revenue
7	18420.50
2	17910.10
19	14200.00
...	...

Rule of thumb. The streaming-SQL learning curve is dominated by source / watermark declarations and time-window syntax, not by the aggregate logic. Pick the engine on operational tradeoffs (Materialize for iteration, RisingWave for OSS economics, Flink for scale) — the SQL itself is largely portable.

Worked example — when the streaming view becomes a database table

Detailed explanation. The most underrated property of Materialize and RisingWave is Postgres wire compatibility — your dashboard, BI tool, or Slack bot connects to the streaming engine as if it were Postgres and SELECTs from a materialized view. The view is the database table; the streaming job is invisible.

Question. Given the top_products_last_hour materialized view defined in Materialize, show how a Grafana dashboard, a Python service, and a CLI query all consume the same fresh data without any custom wire protocol.

Input. A running Materialize cluster with the view from the previous example.

Code.

# CLI
psql "postgres://materialize@host:6875/materialize" \
     -c "SELECT product_id, revenue FROM top_products_last_hour LIMIT 5"

# Python service
python - <<'PY'
import psycopg2
conn = psycopg2.connect("postgres://materialize@host:6875/materialize")
cur  = conn.cursor()
cur.execute("SELECT product_id, revenue FROM top_products_last_hour")
print(cur.fetchall())
PY

# Grafana — configure a Postgres data source pointing at the cluster,
# then write the SELECT as the panel query. Auto-refresh every 5 s.

Step-by-step explanation.

The Materialize cluster speaks the Postgres wire protocol on port 6875. Every tool that knows how to talk to Postgres talks to Materialize unchanged.
The materialized view is not a frozen snapshot — every read returns the freshest committed answer, computed incrementally from the input stream.
Latency from "order produced in Kafka" to "row visible to SELECT" is single-digit milliseconds on small to medium clusters.
The same property holds for RisingWave: Postgres wire on port 4566 by default, queryable from any Postgres client.
Flink SQL does not speak Postgres wire — you push results to a sink (Kafka, Iceberg, JDBC) and query the sink. This is the single biggest workflow gap for Flink in 2026, addressed partially by Apache Paimon and Flink Table Store.

Output. A SELECT returns the same shape as if the view were a Postgres table; freshness is implicit.

Rule of thumb. When the team wants "BI tool plugged into streams without custom connectors," pick an engine that speaks Postgres wire (Materialize or RisingWave). When the team wants "petabyte stream landing in Iceberg," pick Flink SQL with the Iceberg sink. Match the engine to the consumer surface.

Worked example — counting the cost of a long-running streaming view

Detailed explanation. Streaming SQL feels free at small scale but bills compute and state continuously. The right cost mental model is state size × retention window × throughput — not "queries per second."

Question. Estimate the rough monthly cost of running the top_products_last_hour view on a 50k events/second stream for a 30-day retention window on each of Materialize, RisingWave, and Flink SQL. Show how the cost drivers differ.

Input. 50,000 events/second × 86,400 seconds/day × 30 days = ~129B events; average row size 200 bytes; state is keyed by product_id (assume 50k distinct products) and one row per window per product.

Code (state-size estimate).

events_per_day = 50_000 * 86_400        # 4.32B
state_rows     = 50_000 * 24 * 30       # 36M (product × hour × day)
state_size_gb  = state_rows * 200 / 1e9 # ~7.2 GB

Step-by-step explanation.

Materialize charges roughly on cluster size; a large cluster at ~$1k/month handles ~50k events/sec for a 50k-key, 7-GB-state aggregation comfortably. Cost is dominated by compute, not storage.
RisingWave tiers state to S3, so the 7-GB state lives at S3 prices (~$0.16/month) and compute is the binary's machine cost — typically half of a Flink cluster at the same throughput.
Flink SQL stores state in RocksDB on local disk with periodic checkpoints to S3; a 4-node TaskManager cluster (~$1.2k/month on managed Flink) plus S3 checkpointing (~$5/month at 7 GB) is the typical bill.
All three can blow up the bill if you write GROUP BY user_id (potentially 100M keys) without a retention window — the state grows unbounded and either OOMs (Materialize) or pages to S3 forever (RisingWave) or spills RocksDB to multi-TB disks (Flink).

Output (rough monthly costs, illustrative).

Engine	Compute	State	Ops overhead	Approx. total
Materialize	$1,000	in-cluster	low (managed)	$1,000
RisingWave	$500	$0.20 (S3)	medium (OSS deploy)	$500
Flink SQL	$1,200	$5 (S3 checkpoints)	high (JVM tuning)	$1,205

Rule of thumb. Before scaling a streaming SQL workload, run a back-of-envelope state size × retention × throughput estimate. If state explodes, change the SQL (add WHERE event_time > now() - INTERVAL '7 days', switch from GROUP BY user_id to GROUP BY user_segment) before changing the engine.

SQL interview question on streaming-SQL fundamentals

A senior interviewer often opens with: "Explain the difference between a Spark Structured Streaming query and a Materialize materialized view in two minutes. Then describe one workload where each is the right call." It probes whether the candidate knows micro-batch vs true streaming and the cost model.

Solution Using a side-by-side mental model

-- Spark Structured Streaming (micro-batch)
val query = orders.readStream
    .format("kafka").option("subscribe", "orders").load()
    .groupBy("product_id")
    .agg(sum("amount").alias("revenue"))
    .writeStream
    .trigger(Trigger.ProcessingTime("10 seconds"))  -- micro-batch
    .outputMode("update")
    .format("delta").start()

-- Materialize (true streaming, IVM)
CREATE MATERIALIZED VIEW top_products AS
SELECT product_id, SUM(amount) AS revenue
FROM orders
GROUP BY product_id;

Step-by-step trace.

Property	Spark Structured Streaming	Materialize
Latency floor	the trigger interval (e.g. 10 s)	sub-second per row
State model	re-read all input each micro-batch (with state store)	maintain delta-driven state per row
Query language	Scala / PySpark DataFrame DSL or SQL	pure SQL (Postgres dialect)
Read shape	downstream consumer reads the sink (Delta, Iceberg)	downstream consumer SELECTs the view via Postgres wire
Best for	append-only ETL into a lakehouse	dashboards, alerting, BI on live data

Output:

Workload	Best engine	Why
Hourly Iceberg landing of clickstream	Spark Structured Streaming	append-only, lakehouse-native, hourly is fine
Sub-second leaderboard for a live ops console	Materialize	Postgres wire to Grafana, single-digit-ms freshness

Why this works — concept by concept:

Micro-batch vs true streaming — Spark waits a trigger interval, then runs a batch over the new input plus the state store. Materialize updates the answer per row. The two are different products even though both call themselves "streaming."
Subscription, not query — the Materialize view is a long-lived computation. Once created, the consumer SELECTs the freshest answer. The query is the maintenance.
Latency floor — Spark's floor is the trigger interval; Materialize's floor is the dataflow latency (~10 ms). For sub-second SLAs, Spark is mathematically the wrong choice.
Operational surface — Spark needs a sink and a separate dashboard reader; Materialize is the read endpoint. One fewer hop, one fewer system to operate.
Cost — Spark Structured Streaming on a 50k events/sec workload typically costs 2-3× a Materialize cluster of equivalent freshness, because every micro-batch re-scans state. IVM is the algorithmic win.

SQL
Topic — streaming
Streaming SQL problems (SQL)

Practice →

2. The streaming-SQL engines matrix — what each is good at

Same SQL dialect family, four very different operational profiles — pick by team, scale, and ops appetite

The mental model in one line: Materialize is for iteration and Postgres-wire dashboards; RisingWave is for OSS economics and S3-tiered state; Flink SQL is for petabyte-scale stateful pipelines; ksqlDB is legacy and being sunset by Confluent. Once you can name the right engine in one sentence per use case, the interview question "which streaming engine would you pick for X?" collapses to a memorisation exercise.

The feature matrix in one table.

Feature	Materialize	RisingWave	Flink SQL	ksqlDB
Language	Rust	Rust	Java / Scala	Java
IVM	true (differential dataflow)	true	true	limited
Streaming joins	✓	✓	✓	basic
Exactly-once	sink-dependent	sink-dependent	two-phase commit	sink-dependent
CDC source	Postgres + Debezium	native + Debezium	Debezium connector	Debezium connector
Distributed	single-node + read replicas	distributed	distributed	distributed
Wire protocol	Postgres	Postgres	JDBC / custom	REST + Kafka
State backend	in-memory + S3 spill	S3-tiered	RocksDB on local + S3 checkpoint	RocksDB
License	source-available BSL	Apache 2.0	Apache 2.0	Confluent CL (deprecated)
Best for	iteration + dashboards	OSS lakehouse	large state at scale	legacy Kafka-only shops

When to pick each — in one sentence.

Materialize. "Small team, dashboards over a Postgres BI tool, need sub-second freshness." The fastest path from CREATE MATERIALIZED VIEW to "Grafana shows live data."
RisingWave. "OSS-first lakehouse, need cheap S3-tiered state, want the latest engine architecture." The 2026 default for new OSS deployments.
Apache Flink SQL. "Petabyte stream, JVM-fluent SRE team, must guarantee exactly-once end-to-end." The de-facto enterprise standard.
ksqlDB. Do not start new projects on ksqlDB in 2026 — Confluent's 2025 deprecation notice pushed it to maintenance mode in favour of Flink-on-Confluent.

The non-engines (the things that look like streaming SQL but are not).

dbt + Kafka. dbt is a batch DAG runner; pairing it with Kafka does not make it streaming. The freshness floor is dbt's run interval.
Spark Structured Streaming. Micro-batch. The freshness floor is the trigger interval (typically 5-60 s). Excellent for append-only ETL, wrong for sub-second SLAs.
Redshift Streaming Ingestion. A continuous COPY into a Redshift table. Querying is still batch on the read side.
Snowflake Dynamic Tables. A managed materialized view that refreshes on a schedule (down to 1 minute). Closer to "incremental dbt" than to true streaming SQL.

Common interview probes on the matrix.

"Materialize vs RisingWave — what's the deciding factor?" — managed vs OSS, in-memory vs S3-tiered, single-region vs distributed-first.
"When is Flink the right call?" — petabyte streams, exactly-once required end-to-end, team already fluent in JVM tuning.
"Why is ksqlDB on the way out?" — Confluent positioning Flink-on-Confluent as the successor; ksqlDB's join semantics and IVM coverage are weaker.
"Is Spark Structured Streaming a competitor?" — only for append-only ETL with relaxed latency. Not for sub-second views.

Worked example — picking the right engine for three workloads

Detailed explanation. A common senior-design probe: given three workloads with different freshness, throughput, and team-shape constraints, name the engine and justify in two sentences. The probe tests whether you have a mental model, not whether you've memorised one engine's docs.

Question. For each workload, pick one engine and explain the choice. (1) A 4-person fintech team needs a fraud-dashboard that updates within 1 second on 5k events/sec, queried from Grafana. (2) A 30-person ad-tech team needs to land 200k events/sec into Iceberg with strict exactly-once guarantees. (3) A 6-person OSS-only startup wants a single binary they can run on EC2 with no managed vendors and S3 state.

Input. Three constraint sets above.

Code (decision matrix).

Workload 1 — fintech fraud dashboard
  freshness: 1 s | throughput: 5k/s | team: small | consumer: Grafana
  → Materialize
    • Postgres-wire connects directly to Grafana
    • Single cluster, no JVM ops
    • IVM keeps the dashboard answer fresh at delta speed

Workload 2 — ad-tech Iceberg landing
  freshness: minutes | throughput: 200k/s | team: large JVM-fluent
  exactly-once required end-to-end
  → Flink SQL + Iceberg sink
    • Two-phase commit between Flink checkpoint and Iceberg snapshot
    • Petabyte state handled by RocksDB + S3 checkpoint
    • Mature operational tooling (Flink UI, metrics, savepoints)

Workload 3 — OSS-only startup
  freshness: 1 s | throughput: 5k/s | team: 6 | OSS-only | S3 state
  → RisingWave
    • Apache-2.0 license, single binary
    • S3-tiered state keeps the bill near-zero
    • Postgres wire = any BI tool plugs in

Step-by-step explanation.

Workload 1 screams "managed Postgres-wire engine" — sub-second SLA + small team + Grafana eliminates Flink (JVM ops too heavy) and RisingWave (more ops than Materialize Cloud).
Workload 2 screams "Flink" — exactly-once + Iceberg + 200k/s + large team makes the JVM cost worth it. Materialize cannot handle the state size; RisingWave can but Flink has the mature Iceberg sink story.
Workload 3 screams "RisingWave" — OSS + single binary + S3 state is exactly its pitch. Materialize is source-available BSL (not OSS by some definitions); Flink is OSS but the JVM ops are heavy for a 6-person team.
The matrix is not "which engine is best" — it is "which engine fits this team and SLA."

Output.

Workload	Pick	Single-line reason
Fraud dashboard	Materialize	Postgres-wire + sub-second + tiny ops
Iceberg landing	Flink SQL	Exactly-once + petabyte scale + JVM team
OSS startup	RisingWave	Apache-2.0 + S3 state + single binary

Rule of thumb. Match the engine to the consumer surface and the team shape, not to the workload's throughput in isolation. Throughput is necessary but rarely sufficient — operational fit is what decides the deployment.

Worked example — same query, three engines, three watermark syntaxes

Detailed explanation. The watermark declaration is where every dialect diverges most. Same semantic ("I will not see events more than 30 seconds older than the max seen"), three different keywords.

Question. Translate the watermark declaration WATERMARK FOR event_time AS event_time - INTERVAL '30 SECONDS' into the canonical syntax of Materialize, RisingWave, and Flink SQL, and highlight which engine omits the explicit watermark for simple cases.

Input. A Kafka topic events with event_time TIMESTAMP.

Code.

-- Materialize — implicit; no WATERMARK keyword
CREATE SOURCE events FROM KAFKA BROKER 'kafka:9092' TOPIC 'events'
    FORMAT JSON
    INCLUDE TIMESTAMP AS event_time;
-- Materialize derives a watermark from the source automatically;
-- tighten with mz_now() vs event_time predicates.

-- RisingWave — WATERMARK inside the source definition
CREATE SOURCE events (
    user_id BIGINT, action TEXT, event_time TIMESTAMP,
    WATERMARK FOR event_time AS event_time - INTERVAL '30 SECONDS'
) WITH (connector = 'kafka', topic = 'events',
        properties.bootstrap.server = 'kafka:9092')
ROW FORMAT JSON;

-- Flink SQL — WATERMARK inside the CREATE TABLE
CREATE TABLE events (
    user_id BIGINT, action STRING, event_time TIMESTAMP(3),
    WATERMARK FOR event_time AS event_time - INTERVAL '30' SECOND
) WITH ('connector' = 'kafka', 'topic' = 'events', 'format' = 'json',
        'properties.bootstrap.servers' = 'kafka:9092');

Step-by-step explanation.

Materialize auto-derives a watermark for Kafka sources from the offset clock — no explicit syntax for the common case. Override only when correctness depends on event-time lag.
RisingWave puts the watermark in the source definition (the CREATE SOURCE block).
Flink SQL puts it in the CREATE TABLE block. The interval keyword is SECOND (singular) in Flink and SECONDS (plural) in RisingWave — a small but real foot-gun when porting queries.
All three semantics are identical: the operator promises to never emit a row with event-time older than (max_event_time_seen - 30s).

Output. Each engine compiles the source, attaches the same semantic watermark, and downstream windows now operate in event-time correctly.

Rule of thumb. When you port a streaming SQL job between engines, the watermark declaration is where the dialect tax lands. Memorise the syntax for the engine you use most, and keep a one-line cheat for the other two.

Worked example — exactly-once across the three engines

Detailed explanation. Exactly-once is not a single engine property — it is a contract among the source, the engine's checkpoint mechanism, and the sink's transaction support. Every modern engine can deliver it, but only when all three pieces cooperate.

Question. Walk through what each layer must guarantee for an exactly-once pipeline in (a) Flink SQL → Iceberg, (b) Materialize → Postgres, (c) RisingWave → Kafka. Identify the weakest link.

Input. Three end-to-end pipelines above.

Code (configuration excerpt).

-- Flink → Iceberg, exactly-once
SET 'execution.checkpointing.interval' = '60s';
SET 'execution.checkpointing.mode' = 'EXACTLY_ONCE';
-- Iceberg sink commits per checkpoint via two-phase commit;
-- the same checkpoint also commits Kafka source offsets.

-- Materialize → Postgres (idempotent sink)
-- Materialize emits row diffs (INSERT/UPDATE/DELETE); the sink
-- applies them inside a single Postgres transaction per microbatch.
-- Exactly-once = source replayability + idempotent diff apply.

-- RisingWave → Kafka transactional producer
CREATE SINK leaderboard_sink FROM top_products
    WITH ( connector = 'kafka', topic = 'leaderboard',
           type = 'append-only',
           properties.bootstrap.server = 'kafka:9092',
           properties.transactional.id = 'leaderboard-tx');

Step-by-step explanation.

Source. Must support replay from a deterministic offset (Kafka, Postgres logical replication, Pulsar). If the source cannot replay, exactly-once is impossible on a crash — you can only get at-least-once.
Engine. Must checkpoint state atomically. Flink uses 2PC across operators; Materialize uses transactional state in its dataflow; RisingWave checkpoints to S3 with a coordinator.
Sink. Must commit atomically with the engine's checkpoint. Iceberg supports two-phase commit; Postgres supports it via transactional inserts; Kafka supports it via transactional-producer + read-committed consumers.
The weakest link. A non-idempotent sink (e.g. a fire-and-forget HTTP webhook) collapses exactly-once to at-least-once regardless of how strong the engine is. The pipeline's guarantee is the minimum across the three layers.

Output.

Pipeline	Source	Engine	Sink	End-to-end
Flink → Iceberg	Kafka replayable	2PC checkpoint	Iceberg 2PC	exactly-once
Materialize → Postgres	Kafka replayable	diff-dataflow	idempotent diff apply	exactly-once
RisingWave → Kafka	Kafka replayable	S3 checkpoint	Kafka transactional	exactly-once
Any → webhook	Kafka replayable	any	non-idempotent	at-least-once

Rule of thumb. Before you advertise "exactly-once" to a stakeholder, audit each of the three layers separately. The pipeline is exactly-once only when source replay + engine checkpoint + sink commit all cooperate; replace any non-idempotent sink with an idempotent or transactional one or accept at-least-once.

SQL interview question on engine selection

A senior interviewer might frame this as: "Your team is moving off ksqlDB before Confluent's end-of-life. You have three workloads — a sub-second leaderboard, a Iceberg landing job, and a CDC fan-out to four sinks. Pick a single engine for all three, or pick three different engines, and justify."

Solution Using a single-engine-when-possible heuristic

Constraint                | Materialize | RisingWave | Flink SQL
--------------------------+-------------+------------+----------
Sub-second leaderboard    |  best       |  good      |  good
Iceberg landing (200k/s)  |  weak       |  good      |  best
CDC fan-out, 4 sinks      |  good       |  best      |  good
Single binary OSS         |  no         |  yes       |  no
Postgres wire             |  yes        |  yes       |  no
JVM ops avoided           |  yes        |  yes       |  no

The decision is RisingWave for all three — it covers every workload at "good or best," ships as a single OSS binary, speaks Postgres wire, and avoids the JVM ops burden. The only "best" Materialize wins is the sub-second leaderboard, and RisingWave is close enough that the operational simplicity of a single engine wins.

Step-by-step trace.

Workload	Pick	Why not the alternative
Leaderboard	RisingWave	Materialize is "best" but adds a second engine
Iceberg landing	RisingWave	Flink is "best" but adds JVM ops
CDC fan-out	RisingWave	both alternatives need extra tooling

Output:

Engine count	Operational surface	Iteration speed	Cost
1 (RisingWave)	smallest	fastest	lowest
2 (Materialize + Flink)	medium	medium	medium
3 (one per workload)	largest	slowest	highest

Why this works — concept by concept:

Single-engine bias — every engine you add is more code paths, more on-call runbooks, more upgrade lockstep. Default to one engine unless a workload genuinely cannot fit.
"Good" beats "best" plus an extra system — RisingWave's "good" sub-second performance with a 50k-key state is operationally cheaper than Materialize's "best" with a separate Iceberg-landing system.
License matters — Apache-2.0 vs source-available BSL is a procurement question. RisingWave's pure OSS license is the safe answer for security-sensitive shops.
Postgres wire is a force multiplier — every BI tool, every dashboarding tool, every analyst's notebook just works. This is the largest hidden lever in streaming-SQL adoption.
Cost — three engines at 30% utilisation each costs more than one engine at 90% utilisation. Single-engine consolidation is the classic "make the bill smaller" move.

SQL
Topic — real-time analytics
Real-time analytics problems (SQL)

Practice →

3. Incremental view maintenance — the magic behind streaming SQL

IVM turns expensive joins into O(delta) updates — the algorithmic foundation that makes real-time SQL viable

The mental model in one line: incremental view maintenance recomputes only what changes when an input row is inserted, updated, or deleted — not the whole view from scratch. Once you internalise "delta in, delta out," the entire family of "how can a billion-row aggregate update in microseconds?" questions resolves to a single algorithmic invariant.

IVM in five bullets.

Definition. Given a view V = f(R) and a delta ΔR on the input, IVM computes ΔV = f'(R, ΔR) such that V' = V + ΔV — without recomputing f(R + ΔR) from scratch.
Cost model. Each delta does O(log N) work for indexed joins, O(1) work for aggregations on a fixed group key, O(K) work for top-K updates. Independent of N.
Operator coverage. SELECT, WHERE, INNER / LEFT / OUTER JOIN, GROUP BY, COUNT/SUM/AVG, window aggregations with watermarks, DISTINCT, UNION.
Operator gaps. Arbitrary UDFs (the engine cannot prove they are deterministic), recursive CTEs, non-deterministic functions (RANDOM(), NOW() in some contexts), order-dependent aggregates.
Retractions. Updates and deletes require negative deltas — the operator emits "remove old row, add new row." Naive IVM that only handles inserts breaks on the first DELETE.

Differential dataflow — the academic foundation.

Materialize is built on differential dataflow, a 2013 Microsoft Research framework that formalises "compute the same answer incrementally as inputs change," including across nested loops and recursive queries. RisingWave and Flink use simpler but related approaches (stream-table duality, retract streams). The practical upshot: every modern streaming engine treats IVM as the core primitive, not an optimisation.

Common interview probes on IVM.

"How can a 1B-row aggregate update in microseconds?" — because the aggregate state is a hash map keyed by group_id; one delta touches one key and emits one updated value.
"Why doesn't Spark Structured Streaming have true IVM?" — because micro-batch re-reads the input each interval; the state store helps but the algorithmic model is still "read window of input, compute, write." Not delta-only.
"What happens to IVM on a DELETE?" — the engine emits a negative delta. Downstream aggregates apply it (e.g. SUM decrements by the deleted row's value).
"Can IVM handle a UDF?" — only if the UDF is deterministic, pure, and side-effect-free. Most engines require an explicit annotation; without it, IVM treats the UDF as opaque and falls back to recomputation.

The five operator coverage tiers.

Tier 1 — trivially IVM. SELECT, WHERE on constant predicates, projection, UNION ALL. O(1) per delta.
Tier 2 — IVM via index. INNER / LEFT JOIN on equality keys. O(log N) per delta with a hash index.
Tier 3 — IVM via fixed group state. GROUP BY + SUM/COUNT/AVG. O(1) per delta plus the hash-map lookup.
Tier 4 — IVM with retraction. Top-K, DISTINCT COUNT, MIN/MAX (because the previous-max row might be deleted). O(K) per delta.
Tier 5 — not IVM-friendly. Arbitrary UDFs, recursive CTEs, ORDER BY with LIMIT on unbounded streams, window functions without watermarks.

Worked example — IVM on a streaming top-K leaderboard

Detailed explanation. A click-event stream feeds a top-10 leaderboard. Each click event triggers a delta that updates exactly one product's score; the top-K state re-sorts the small head of the list, not the whole product population.

Question. Given the click event stream below and a top-3 leaderboard, walk through the IVM updates step by step. Show what changes in the materialized state on each event.

Input — click stream (oldest first).

seq	product_id	score
1	A	+5
2	B	+3
3	A	+4
4	C	+12
5	A	-5 (correction)

Code.

CREATE MATERIALIZED VIEW leaderboard AS
SELECT product_id, SUM(score) AS total_score
FROM clicks
GROUP BY product_id
ORDER BY total_score DESC
LIMIT 3;

Step-by-step explanation.

Event 1 (A, +5). Insert into clicks. Aggregate state: {A: 5}. Top-3: [A=5]. Δ emitted to leaderboard: +(A, 5).
Event 2 (B, +3). Aggregate state: {A: 5, B: 3}. Top-3: [A=5, B=3]. Δ emitted: +(B, 3).
Event 3 (A, +4). A's row updates: old (A, 5), new (A, 9). Δ emitted: -(A, 5) then +(A, 9). Top-3 re-sorts: [A=9, B=3].
Event 4 (C, +12). Aggregate state adds {C: 12}. Top-3 now includes C and re-orders: [C=12, A=9, B=3]. Δ emitted: +(C, 12) and a reorder of the existing rows.
Event 5 (A, -5 correction). A's row updates: old (A, 9), new (A, 4). Δ: -(A, 9) then +(A, 4). Top-3 re-sorts: [C=12, A=4, B=3].
Total work. Each event did O(1) hash-map update + O(K) leaderboard re-sort, where K=3. Never re-scanned the full input or full state.

Output (final).

product_id	total_score
C	12
A	4
B	3

Rule of thumb. Streaming top-K with IVM is the workhorse "leaderboard / trending / hot-items" pattern. Cap K deliberately (top-10 / top-100, not top-10000) — beyond a few hundred entries the re-sort cost dominates and you should switch to a windowed top-K.

Worked example — IVM on a streaming JOIN

Detailed explanation. A streaming join between two long-running tables (orders and customers) maintains a delta-driven hash index on both sides. When an order arrives, the engine probes the customers index in O(log N); when a customer is updated, the engine probes the orders index. No full re-scan of either side.

Question. Walk through the IVM behaviour when (a) a new order arrives, (b) a customer record is updated, (c) a customer is deleted. Use the small example tables below.

Input — customers.

customer_id	name	tier
1	Alice	gold
2	Bob	silver

Input — orders (streaming).

order_id	customer_id	amount
100	1	50

Code.

CREATE MATERIALIZED VIEW order_with_customer AS
SELECT o.order_id, o.amount, c.name, c.tier
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;

Step-by-step explanation.

(a) New order arrives — (order_id=101, customer_id=2, amount=80). Engine probes the customers hash index for customer_id=2 → finds (Bob, silver). Emits +(101, 80, Bob, silver).
(b) Customer 1 updated — name changes from Alice to Alicia. Engine emits two deltas: -(100, 50, Alice, gold) and +(100, 50, Alicia, gold). Downstream materialized state replaces the matching row.
(c) Customer 2 deleted — engine emits -(101, 80, Bob, silver). The materialized view loses that row; the underlying orders row stays in the input but no longer joins.
None of these steps re-scanned the full orders or customers tables. The total work per event is O(log N) for the hash-index probe plus O(1) for the materialized-view update.

Output (after all three steps).

order_id	amount	name	tier
100	50	Alicia	gold

Rule of thumb. Streaming joins on equality keys are IVM-friendly out of the box. The hidden cost is the join state size — both sides must keep their indices in memory or on tiered state for the full retention window. Bound the retention (e.g. WHERE order_date > now() - INTERVAL '7 days') to cap the state.

Worked example — when IVM falls back to recomputation

Detailed explanation. Some operators cannot be incrementally maintained — the engine has to recompute the view (or part of it) on each delta. Knowing which patterns trigger fallback is what separates the staff engineer from the senior.

Question. Identify which of the following queries IVM cannot maintain incrementally. For each, explain why and propose a rewrite.

Input. Three candidate queries.

Code.

-- 1) Arbitrary UDF
SELECT call_external_api(user_id) AS enriched FROM events;

-- 2) Recursive CTE
WITH RECURSIVE org AS (
    SELECT employee_id, manager_id FROM employees WHERE manager_id IS NULL
    UNION ALL
    SELECT e.employee_id, e.manager_id
    FROM employees e JOIN org ON e.manager_id = org.employee_id
)
SELECT * FROM org;

-- 3) Median (order-dependent aggregate)
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) AS median_amount
FROM orders;

Step-by-step explanation.

(1) UDF. The engine cannot prove call_external_api is deterministic or side-effect-free. On a delta, it cannot know whether to re-invoke. Most engines refuse to maintain this incrementally; some fall back to per-row recomputation each delta. Rewrite: pre-enrich the event in the source pipeline, then stream the enriched topic.
(2) Recursive CTE. A delta on employees might invalidate the entire recursion tree. Materialize supports differential recursion (a research-grade feature); Flink and RisingWave generally do not. Rewrite: materialize the closure as a separate non-recursive view, or compute the recursion in a periodic batch job.
(3) Median. Order-dependent aggregates require knowing the full sorted set. A single new event can shift the median, which means recomputing on every delta. Rewrite: use approximate quantiles (T-Digest, KLL) which are IVM-friendly (the sketch state updates in O(log K) per delta).

Output (rewrites).

Original	Issue	Rewrite
UDF `call_external_api`	nondeterministic / side-effects	pre-enrich at source
Recursive CTE	full-tree invalidation per delta	periodic batch closure
Exact median	order-dependent	T-Digest sketch / approximate quantile

Rule of thumb. Before declaring "we'll do this in streaming SQL," audit the query for IVM-unfriendly operators. The fix is usually a rewrite (push computation upstream, swap exact for sketch) rather than abandoning streaming entirely.

SQL interview question on IVM cost model

A senior interviewer often opens with: "Walk me through the per-delta cost of maintaining a streaming view that does SELECT region, SUM(amount) FROM orders GROUP BY region on a 100M-row table receiving 50k inserts/sec. Be explicit about what is O(1), what is O(log N), and what the state size is."

Solution Using the IVM cost-model decomposition

CREATE MATERIALIZED VIEW revenue_by_region AS
SELECT region, SUM(amount) AS revenue
FROM orders
GROUP BY region;

Step-by-step trace.

Step	Per-delta cost	Why
Hash `region` to a group key	O(1)	constant-time hash
Look up current `revenue[region]`	O(1) avg (hash map)	open-addressing hash
Add `amount` to `revenue[region]`	O(1)	arithmetic
Emit delta `(region, +amount)`	O(1)	append to output stream
Downstream view applies delta	O(1)	per-key update
Total per delta	O(1)	independent of 100M row count

State size = number of distinct regions (say ~250 country codes) × ~80 bytes per row ≈ 20 KB. Independent of the 100M input rows — the state is keyed by the GROUP BY column, not the input row count.

Output:

Metric	Value
Per-delta cost	O(1)
State size	~20 KB
Throughput ceiling	~10M events/sec on a single core (memory-bound)
Memory growth per new event	0 (existing region) or ~80 bytes (new region)

Why this works — concept by concept:

GROUP BY state is keyed by the group column, not by input row count — that's why SUM(amount) GROUP BY region has a state size proportional to the number of regions (small, bounded), not to the number of orders (huge, unbounded).
Hash-map updates are amortised O(1) — the engine maintains a hash map keyed by region; each delta is one lookup + one update.
Delta out, not snapshot out — the operator emits a (region, +amount) delta downstream. Materialized consumers apply the delta in O(1). No re-broadcast of the full view.
Bounded state — the operator never grows beyond the distinct group key count. Compare with GROUP BY user_id over 100M users → 100M-row state. Always check the cardinality of the GROUP BY column before declaring "this is fine."
Cost — per delta: O(1) compute, O(1) state update, O(1) output. End-to-end: 100k events/sec on a small cluster, with state size capped by group cardinality.

SQL
Topic — aggregation
Aggregation problems (SQL)

Practice →

4. Watermarks, event time vs ingestion time, late data

Watermarks are the producer's promise; event-time is the only honest clock; late data is the consumer's tolerance

The mental model in one line: a watermark W at time t is a promise that no future event will have an event_time < W — the engine uses it to decide when a window can be closed and emitted. Once you can quote that contract verbatim, the entire family of "how does Flink handle late data?" questions resolves to "what watermark interval and lateness allowance did you configure?"

The three clocks.

Event time. When the event actually happened in the source system. The only clock that gives analytically-correct answers. Example: "the user clicked at 09:42:15."
Ingestion time. When the engine received the event from the broker. Slightly delayed from event time (network + broker buffer). Example: "Kafka consumer received the click at 09:42:18."
Processing time. When the operator processed the event in the dataflow. Most delayed; subject to backpressure. Example: "Flink window operator handled the click at 09:42:21."

Watermarks in five bullets.

Watermark = max-event-time-seen minus lateness budget. A watermark of 09:50 means "I will not see any event with timestamp < 09:50 ever again."
Watermarks advance the engine's notion of "now." When the watermark crosses 10:00, every 09:00-10:00 tumbling window can close and emit.
Tighter watermark = lower latency, more late events dropped. Looser watermark = higher latency, fewer dropped events. There is no free lunch.
Watermarks are per-source. Multi-source joins take the minimum watermark across all inputs (the slowest source paces the join).
Watermarks do not enforce determinism in the face of failure. That is the checkpoint's job. Watermarks govern correctness under normal operation.

Late-data strategies in three bullets.

Drop. Default for cheap real-time pipelines. The late event is logged and discarded. Simple; loses signal.
Side-output. The recommended default. The late event is routed to a separate topic for batch repair downstream.
Re-emit. The window is reopened, the corrected aggregate is emitted with a retraction. Downstream consumers must support retractions.

Allowed lateness. A window can stay open for L seconds past its watermark to admit straggler events. L=0 is "drop immediately"; L=10m is "tolerate 10-minute stragglers." Trade-off: higher L = more state retained per window, higher memory.

Common interview probes on watermarks.

"Explain watermarks to a non-streaming engineer in three sentences." — "Streaming queries need a way to know when a time window can close. A watermark is a producer's promise: 'I will never send events older than this timestamp.' The engine uses watermarks to decide when to emit window results."
"What's the difference between event-time and ingestion-time semantics?" — event-time is correct under reordering and replay; ingestion-time is cheaper but breaks under late arrivals.
"Why does my Flink window never fire?" — likely the watermark is not advancing (no events, no idle-source detection, or a per-key watermark is stuck on a slow key).
"How does Materialize handle late data?" — it auto-retracts the affected window result and re-emits, because materialized views naturally retract.

Worked example — a 1-hour tumbling window with 5-minute watermark and 10-minute lateness

Detailed explanation. A classic combo: tumbling 1-hour windows for hourly aggregates, with a 5-minute watermark lag (most events arrive within 5 minutes of their event-time) and a 10-minute allowed lateness (window stays open 10 minutes past watermark to admit stragglers).

Question. Given the events below, walk through which window each event belongs to and when each window emits. Identify which event is dropped as late and which triggers a re-emit.

Input — events arriving (processing-time order, event-times in column 3).

arrived_at	event_time	description
09:55	09:50	normal event
10:03	09:58	normal event
10:07	09:42	straggler (within 10m lateness)
10:16	09:20	very late (past lateness)
10:20	10:15	normal next-window event

Code.

SELECT
    window_start, window_end, COUNT(*) AS event_count
FROM TABLE(TUMBLE(TABLE events, DESCRIPTOR(event_time), INTERVAL '1' HOUR))
GROUP BY window_start, window_end;
-- assume WATERMARK FOR event_time AS event_time - INTERVAL '5' MINUTE
-- and ALLOWED LATENESS '10' MINUTE

Step-by-step explanation.

Event 1 (09:50, arrives 09:55). Belongs to the [09:00, 10:00) window. Within watermark. Counted normally.
Event 2 (09:58, arrives 10:03). Belongs to [09:00, 10:00). Watermark just advanced past 09:58. Counted normally.
Event 3 (09:42, arrives 10:07). Belongs to [09:00, 10:00). Watermark = max_ts - 5min. Now watermark is at ~09:55 (max ts seen was 10:00). Event is past watermark — but within the 10-minute lateness budget. Window re-opens, re-emits corrected count.
Event 4 (09:20, arrives 10:16). Belongs to [09:00, 10:00). Watermark is at ~09:55, plus lateness = 10:05 closing deadline. Event arrived at 10:16 — dropped as too late (or side-outputted, depending on config).
Event 5 (10:15, arrives 10:20). Belongs to [10:00, 11:00). Watermark advances to ~10:10. Event is well within window. Counted normally.

Output (final emitted windows).

window_start	window_end	event_count
09:00	10:00	3 (events 1, 2, 3)
10:00	11:00	1 (so far)

Rule of thumb. Set the watermark lag to the 99th-percentile event-time-to-ingest delay. Set the allowed lateness to roughly 2× the lag. Anything past that is genuinely late and should go to a side-output for batch repair, not eaten by the streaming window.

Worked example — multi-source watermark alignment

Detailed explanation. When a query joins two streams (e.g. orders and payments), the engine takes the minimum watermark across both inputs. If one source is slow or idle, it can stall the entire join — the classic "my streaming join stopped emitting" bug.

Question. Given two source streams with different watermark behaviour, identify why the join output stalls and how to fix it.

Input — orders watermark trace.

time	watermark
09:55	09:50
10:00	09:55
10:05	10:00

Input — payments watermark trace.

time	watermark
09:55	09:50
10:00	09:50 (no new payments)
10:05	09:50 (still nothing)

Code.

SELECT o.order_id, p.payment_id, o.amount
FROM orders o
JOIN payments p ON o.order_id = p.order_id
WHERE o.event_time BETWEEN p.event_time - INTERVAL '5' MINUTE
                       AND p.event_time + INTERVAL '5' MINUTE;

Step-by-step explanation.

The join takes the minimum of the two input watermarks. Joined watermark = min(orders, payments) = 09:50 for the entire trace.
No new payments → payments watermark does not advance → join watermark stuck at 09:50.
Downstream window operators that depend on the join (e.g. an hourly aggregate over join output) cannot close their 09:00-10:00 window because the watermark never crosses 10:00.
The fix: idle-source detection. Configure the payments source with 'table.exec.source.idle-timeout' = '5 minute' (Flink) — after 5 minutes of no events, the source declares itself "idle" and the engine ignores its watermark for the min calculation.
Or: synthetic heartbeats. The producer emits a heartbeat event every minute that advances event-time without contributing to the join. Old-school but bulletproof.

Output (before vs after fix).

Config	Join watermark at 10:05
No idle detection	09:50 (stuck)
`idle-timeout = 5 min`	advances with orders

Rule of thumb. Multi-source joins are the #1 source of "my streaming pipeline stopped emitting" tickets. Always configure idle-source detection on every input source (or emit heartbeats), or you will eventually wake up to a 6-hour-stale dashboard with no errors in the logs.

Worked example — sizing the allowed-lateness budget

Detailed explanation. Allowed lateness sounds like a free safety net but costs window-state-retention × peak windows open. Setting it too high blows memory; setting it too low drops events.

Question. A 1-hour tumbling window with a per-window state of 200 MB receives events whose 99th-percentile lateness is 8 minutes. What allowed-lateness should you configure, and what is the resulting memory cost?

Input. 99th-percentile lateness = 8m; per-window state = 200 MB; query has 24 windows/day.

Code (sizing).

# Allowed lateness = 2x the p99 lateness
allowed_lateness_min = 2 * 8                       # 16 minutes
# At any moment, the windows open are:
#   - the current window
#   - the previous window if still inside allowed lateness
windows_open = 1 + (allowed_lateness_min / 60)     # ~1.27 → round up to 2
mem_per_window_mb = 200
total_mem_mb = windows_open * mem_per_window_mb    # 400 MB

Step-by-step explanation.

The p99 lateness is 8 minutes — that's the real answer to "how long do most events take?" Setting lateness at exactly 8 minutes drops 1% of events.
Rule of thumb: 2× p99. 16-minute lateness covers ~99.9% of events. Most pipelines accept the residual 0.1% to the side-output channel.
With 1-hour windows and 16-minute lateness, at most 2 windows are open simultaneously (the current one, and the previous one in lateness window).
Memory cost = windows_open × state_per_window = 2 × 200 MB = 400 MB. That's the per-key memory footprint; multiply by the number of GROUP BY keys.
If lateness were 60 minutes (one full window), 3 windows would be open at once, tripling the memory. Pick lateness deliberately; don't default to "an hour."

Output (sizing decision).

Lateness	Windows open	Memory per key	Dropped event rate
0 min	1	200 MB	~50%
8 min (p99)	1.13 → 2	400 MB	~1%
16 min (2× p99)	1.27 → 2	400 MB	~0.1%
60 min	2	400 MB	~0.01%
120 min	3	600 MB	~0.001%

Rule of thumb. Pick allowed-lateness based on measured p99 event delay, not on a round number of minutes. Wider lateness has diminishing returns on dropped events but linear cost on memory. The 2×p99 rule is the operational sweet spot.

SQL interview question on watermarks and late data

A senior interviewer might frame this as: "You're running a 1-hour tumbling window aggregate on a stream where ~1% of events arrive more than 10 minutes late. Design the watermark and lateness configuration, and describe what happens to a 15-minute-late event."

Solution Using a p99-aligned watermark with side-output routing

CREATE TABLE events (
    user_id BIGINT,
    event_type STRING,
    event_time TIMESTAMP(3),
    WATERMARK FOR event_time AS event_time - INTERVAL '5' MINUTE
) WITH ('connector' = 'kafka', /* ... */);

CREATE VIEW hourly_event_counts AS
SELECT window_start, window_end, COUNT(*) AS cnt
FROM TABLE(TUMBLE(TABLE events, DESCRIPTOR(event_time), INTERVAL '1' HOUR))
GROUP BY window_start, window_end;
-- ALLOWED LATENESS '10' MINUTE in the application code
-- Side-output: late events go to topic 'events-late' for batch repair

Step-by-step trace.

Step	Behaviour
Watermark	5-minute lag from max event-time
Allowed lateness	10 minutes past watermark
1% of events arriving >10m late	routed to `events-late` side-output
Batch repair job	reads `events-late` daily, merges into the hourly table

Output:

Window	Streaming count	Side-output count	Reconciled count
09:00-10:00	9,910	90	10,000
10:00-11:00	12,400	100	12,500

Why this works — concept by concept:

Watermark = max event time − lag — the lag is the producer's promise. 5 minutes is the typical "covers 95% of events" choice; anything past that is side-output material.
Allowed lateness is the stretch budget — 10 minutes past watermark catches ~99% of stragglers without breaking memory. The remaining 1% goes to a side-output topic.
Side-output instead of drop — the late events are not lost; they are re-routed to a separate topic for batch repair. The hourly table is corrected by a daily reconciliation job.
Window emits twice — once at watermark + lateness (the "streaming" emit), once during batch repair (the "corrected" emit). Downstream consumers either accept the first or wait for the second.
Cost — watermark + lateness costs a small multiple of single-window state; side-output is one extra Kafka topic + a daily batch repair job. The total operational overhead is small compared to silently dropping 1% of events.

SQL
Topic — sliding window
Sliding window problems (SQL)

Practice →

5. Reference architectures — Kafka → Flink, Kafka → Materialize, RisingWave end-to-end

Three canonical streaming-SQL stacks — same workload, very different operational profiles

The mental model in one line: Flink SQL is the highest-scale stack with the heaviest JVM ops burden, Materialize is the fastest-iteration Postgres-native stack, and RisingWave is the cheapest single-binary OSS stack — pick by team shape, scale, and procurement constraints. Once you can sketch all three from memory, the architecture-interview question "design a real-time fraud-detection pipeline" reduces to picking which of the three you'd defend.

Architecture 1 — Kafka → Flink SQL → sink (the classic).

Source. Kafka topics; ingest via Debezium for CDC sources (Postgres, MySQL).
Engine. Flink SQL with RocksDB state on TaskManager local disk; checkpoints to S3 every 60 seconds.
Sink. Iceberg, Postgres, Kafka, or any JDBC. Two-phase commit between Flink checkpoint and sink for exactly-once.
Strength. Petabyte-scale state, mature Iceberg sink, exactly-once end-to-end with 2PC.
Weakness. JVM tuning burden (GC, heap, RocksDB tuning); Flink SQL job lifecycle (deploy, savepoint, restart); operational headcount.

Architecture 2 — Kafka → Materialize → Postgres-protocol consumers.

Source. Kafka topics, Postgres logical replication, or both. Materialize ingests sources directly via SOURCE definitions.
Engine. Materialize cluster (managed cloud or self-hosted); in-memory differential dataflow with S3 spill for cold data.
Sink. Postgres-wire SELECTs from any BI tool, dashboard, or service. Optional Kafka / Webhook sinks for downstream fan-out.
Strength. Fastest iteration; Postgres wire is a force multiplier; no JVM ops; sub-second freshness.
Weakness. Single-region default; in-memory state caps the scale (~hundreds of GB practical limit); source-available BSL license is not pure OSS.

Architecture 3 — RisingWave end-to-end.

Source. Native CDC (Postgres, MySQL), Kafka, Pulsar, Kinesis, all defined in one SQL block.
Engine. RisingWave cluster with stateless compute nodes; state lives on S3 in tiered storage.
Sink. Iceberg, Postgres, Kafka, plus the Postgres-wire query endpoint for direct reads.
Strength. Apache-2.0 license; single binary deploys on EC2 with no Kafka requirement; cheap S3-tiered state; Postgres wire.
Weakness. Youngest of the three (2024 maturity); ecosystem still catching up to Flink for some niche sinks.

The exactly-once contract across all three.

Source. Must be replayable from a deterministic offset. Kafka is the canonical choice; Postgres logical replication is the runner-up.
Engine. Must checkpoint state atomically. All three engines do.
Sink. Must commit transactionally or be idempotent. Iceberg, transactional Kafka, idempotent JDBC inserts.

State backend choices.

RocksDB on local disk (Flink). Fast random access, periodic snapshot to S3. Heavy disk requirements.
In-memory with S3 spill (Materialize). Hot data in RAM, cold data on S3. Sub-millisecond hot reads.
S3-tiered, all the time (RisingWave). Stateless compute, all state in S3. Slowest hot reads but cheapest.

Cost driver mental model. Streaming SQL cost = (state size × retention) + (compute hours × throughput) + ops headcount. Cap state with retention windows; choose the engine that minimises the dominant term for your workload.

Common interview probes on architecture.

"Where does exactly-once break down?" — usually the sink. Audit the sink first.
"How do you handle a Flink TaskManager OOM mid-window?" — checkpoint, savepoint, restart from the last checkpoint; the window state is restored.
"Why would you choose Materialize over Flink for a 5k-event/sec workload?" — Postgres wire + no JVM ops + fastest iteration. The throughput is well within Materialize's comfort zone.
"What's the operational cost of RisingWave's S3-tiered state?" — slightly higher hot-read latency in exchange for ~10× cheaper state storage. Acceptable for most workloads.

Worked example — fraud-detection pipeline with a 30-day session window on Flink SQL

Detailed explanation. A fraud-detection pipeline maintains a 30-day rolling session per user, computes a velocity score over the session, and emits a binary fraud flag for each new transaction. The state size (30 days × millions of users × per-user session record) drives the architecture decision.

Question. Sketch the Flink SQL pipeline. Identify the state size, the checkpoint interval, and how exactly-once is achieved end-to-end.

Input. Kafka topic transactions at 50k events/sec; ~10M distinct users; 30-day session retention.

Code.

-- Source declaration
CREATE TABLE transactions (
    user_id BIGINT,
    tx_id BIGINT,
    amount DECIMAL,
    event_time TIMESTAMP(3),
    WATERMARK FOR event_time AS event_time - INTERVAL '30' SECOND
) WITH ('connector' = 'kafka', /* ... */);

-- 30-day session window per user
CREATE VIEW user_session_velocity AS
SELECT
    user_id,
    SESSION_START(event_time, INTERVAL '30' DAY) AS session_start,
    COUNT(*) AS tx_count,
    SUM(amount) AS total_amount,
    AVG(amount) AS avg_amount
FROM transactions
GROUP BY user_id, SESSION(event_time, INTERVAL '30' DAY);

-- Fraud flag — joined on the fly with new transactions
CREATE TABLE fraud_flags (
    tx_id BIGINT, user_id BIGINT, flag BOOLEAN, reason STRING
) WITH ('connector' = 'iceberg', /* ... */);

INSERT INTO fraud_flags
SELECT
    t.tx_id, t.user_id,
    CASE WHEN s.tx_count > 1000 OR t.amount > 10 * s.avg_amount
         THEN TRUE ELSE FALSE END AS flag,
    CASE WHEN s.tx_count > 1000 THEN 'velocity'
         WHEN t.amount > 10 * s.avg_amount THEN 'amount-anomaly'
         ELSE 'ok' END AS reason
FROM transactions t
JOIN user_session_velocity s
  ON t.user_id = s.user_id;

Step-by-step explanation.

State size. 10M users × ~500 bytes per session row = ~5 GB. Plus the streaming join state on transactions × user_session_velocity keyed by user_id (~10 GB). Total ~15 GB.
Checkpoint interval. Set to 60 seconds. Flink writes the RocksDB state delta to S3 every minute; on recovery, it restores from the last checkpoint and replays Kafka from the saved offset.
Exactly-once. Source (Kafka, replayable) + engine (60-second 2PC checkpoint) + sink (Iceberg, transactional commit per checkpoint). All three layers cooperate.
Why Flink and not Materialize. 15 GB state and 50k events/sec is at the upper edge of Materialize's comfort zone; Flink handles this comfortably with a 4-node cluster.
Why Flink and not RisingWave. RisingWave could handle it too; choice often comes down to existing JVM ops vs OSS-only preference. Both are defensible.

Output (per second of throughput).

Metric	Value
Throughput	50k events/sec
State size	~15 GB on RocksDB + S3
Checkpoint interval	60 seconds
Per-event end-to-end latency	~100 ms
Cluster cost	~$1,200/month (managed Flink)

Rule of thumb. For petabyte-scale, multi-TB-state streaming SQL, Flink is the safe default. Materialize and RisingWave are catching up but Flink has the operational maturity story and the exactly-once-to-Iceberg sink that fraud-detection auditors actually trust.

Worked example — same fraud pipeline on Materialize

Detailed explanation. Materialize handles the same workload at lower throughput with much faster iteration. The architectural difference: no JVM ops, Postgres wire endpoint, in-memory state with S3 spill for cold data.

Question. Convert the same fraud-detection pipeline to Materialize. Identify what changes in the architecture and where the limits lie.

Input. Same Kafka topic at a lower throughput (5k events/sec — 10× smaller); ~1M distinct users; 30-day session retention.

Code.

-- Source — Materialize's Kafka source
CREATE SOURCE transactions FROM KAFKA BROKER 'kafka:9092' TOPIC 'transactions'
    FORMAT JSON
    INCLUDE TIMESTAMP AS event_time;

-- Materialized view of 30-day per-user session
CREATE MATERIALIZED VIEW user_session AS
SELECT
    user_id,
    COUNT(*) AS tx_count,
    SUM(amount) AS total_amount,
    AVG(amount) AS avg_amount
FROM transactions
WHERE event_time > mz_now() - INTERVAL '30 days'
GROUP BY user_id;

-- Fraud view — joined on the fly
CREATE MATERIALIZED VIEW fraud_flags AS
SELECT
    t.tx_id, t.user_id,
    (s.tx_count > 1000 OR t.amount > 10 * s.avg_amount) AS flag
FROM transactions t
JOIN user_session s ON t.user_id = s.user_id
WHERE t.event_time > mz_now() - INTERVAL '1 hour';

-- Consumer: Grafana, Slack bot, internal API
-- all read via SELECT * FROM fraud_flags

Step-by-step explanation.

No external sink required for dashboard consumers — they SELECT directly from fraud_flags. The streaming view is the read endpoint.
State size. 1M users × ~500 bytes = ~500 MB. Comfortably fits in a small cluster's RAM.
Throughput. 5k events/sec is well within Materialize's comfort zone; even a small cluster can sustain it.
Iteration speed. The view definition is a single SQL statement that can be re-created in seconds. No Flink job lifecycle, no savepoint, no JVM restart.
The throughput ceiling. Around 100k events/sec on a large cluster; past that, you need Flink or RisingWave.

Output (operational profile).

Metric	Value
Throughput	5k events/sec
State size	~500 MB in-memory
End-to-end latency	~10 ms
Cluster cost	~$400/month (managed Materialize)
Iteration speed	seconds to re-create a view

Rule of thumb. When the workload fits Materialize's throughput / state ceiling, choose Materialize for the iteration speed alone. The team velocity gain (no JVM ops, Postgres wire, view-as-endpoint) compounds over months.

Worked example — same fraud pipeline on RisingWave

Detailed explanation. RisingWave splits the difference: Apache-2.0 license, S3-tiered state, single binary, Postgres wire. The same query shape as Materialize, but the state lives in S3, so the throughput ceiling is higher and the storage cost is near-zero.

Question. Convert the pipeline to RisingWave. Identify the architectural difference vs Materialize and where RisingWave is the better choice.

Input. Same Kafka topic at 20k events/sec; ~5M distinct users; 30-day session retention.

Code.

-- Source with watermark
CREATE SOURCE transactions (
    user_id BIGINT, tx_id BIGINT, amount DECIMAL, event_time TIMESTAMP,
    WATERMARK FOR event_time AS event_time - INTERVAL '30 SECONDS'
) WITH (connector = 'kafka', topic = 'transactions',
        properties.bootstrap.server = 'kafka:9092')
ROW FORMAT JSON;

-- Materialized view: 30-day session per user
CREATE MATERIALIZED VIEW user_session AS
SELECT
    user_id,
    COUNT(*) AS tx_count,
    SUM(amount) AS total_amount,
    AVG(amount) AS avg_amount
FROM transactions
WHERE event_time > NOW() - INTERVAL '30 DAYS'
GROUP BY user_id;

-- Iceberg sink for the fraud flags
CREATE SINK fraud_sink FROM (
    SELECT t.tx_id, t.user_id,
           (s.tx_count > 1000 OR t.amount > 10 * s.avg_amount) AS flag
    FROM transactions t
    JOIN user_session s ON t.user_id = s.user_id
    WHERE t.event_time > NOW() - INTERVAL '1 HOUR'
) WITH (connector = 'iceberg', /* ... */);

Step-by-step explanation.

State backend. All state on S3, accessed via tiered storage. 5M users × 500 bytes = ~2.5 GB on S3 at ~$0.06/month storage. Compare with Materialize's RAM-bound cost.
Compute. Stateless compute nodes; horizontal scale by adding more compute pods. Linear scaling up to ~hundreds of thousands of events/sec.
Exactly-once. Source replayable + S3 checkpoint + Iceberg transactional sink = exactly-once end-to-end. Identical contract to Flink.
The hot-read latency cost. S3-tiered state has slightly higher hot-read latency than in-memory (Materialize) or local-RocksDB (Flink). For most use cases, the cost is invisible.
The license advantage. Apache-2.0 is the procurement-friendliest license; the binary can be vendored, forked, or self-hosted indefinitely.

Output (operational profile).

Metric	Value
Throughput	20k events/sec
State size	~2.5 GB on S3 (~$0.06/month)
End-to-end latency	~50 ms
Cluster cost	~$600/month (self-hosted on EC2)
Iteration speed	seconds to re-create a view

Rule of thumb. RisingWave is the 2026 OSS default for new deployments — cheap state, Apache-2.0, single binary, Postgres wire. Pick Flink only if you need exactly-once-to-Iceberg at petabyte scale, and Materialize only if you need absolute lowest iteration cycle on a single team.

SQL interview question on streaming architecture

A senior interviewer often frames this as: "Design a real-time anomaly-detection pipeline for ride-sharing surge pricing. Input is 80k events/sec; freshness target is 1 second; team has 8 engineers, no JVM background; consumer is a Grafana dashboard plus a downstream Kafka topic. Sketch the architecture and justify the engine."

Solution Using RisingWave end-to-end with a Postgres-wire read and a Kafka sink

        ┌────────┐      ┌────────────────────────────────┐      ┌──────────┐
        │ Kafka  │ ───▶│ RisingWave cluster (Rust)       │ ───▶│ Iceberg  │
        │ rides  │      │  • source: rides + driver_state│      │ flags    │
        └────────┘      │  • MV: rolling_5min_velocity   │      └──────────┘
                        │  • MV: surge_flag              │      ┌──────────┐
                        │  • S3-tiered state backend     │ ───▶│ Grafana  │
                        │  • sinks: Iceberg + Kafka      │      │ (psql)   │
                        └────────────────────────────────┘      └──────────┘
                              │
                              └───▶ Slack alert via Kafka consumer

Step-by-step trace.

Constraint	Decision
80k events/sec	RisingWave scales horizontally; well within capacity
1 s freshness	RisingWave MV freshness = single-digit ms
No JVM background	RisingWave is Rust; no JVM ops
Grafana consumer	RisingWave speaks Postgres wire → direct connection
Downstream Kafka topic	RisingWave's Kafka sink with transactional producer
Single binary	RisingWave deploys as one binary on EC2

Output:

Layer	Choice	Why
Source	Kafka `rides` topic	replayable, standard
Engine	RisingWave cluster (3-node)	scale + OSS + Postgres wire
State	S3-tiered	cheap, decoupled compute
Sink 1	Iceberg (historical flags)	mature; exactly-once via 2PC
Sink 2	Kafka (real-time alerts)	transactional producer
Read	Grafana via Postgres wire	direct, no extra system

Why this works — concept by concept:

RisingWave for non-JVM teams — the 8-engineer team has no JVM ops experience. Flink would consume ~25% of their bandwidth on tuning; RisingWave is a single Rust binary.
Postgres wire to Grafana — the consumer plugs straight into the engine. No intermediate database, no extra sink to operate.
Kafka transactional sink for alerts — the downstream Slack bot consumes from Kafka with isolation.level=read_committed, getting exactly-once semantics for the alert side.
S3-tiered state — 80k events/sec × 1M unique riders × 30-day retention is well past in-memory; S3 storage at ~$0.5/month is irresistible.
Cost — 3-node RisingWave cluster on EC2 (~$900/month) + S3 state (~$0.5/month) + Kafka cluster (~$300/month) + Iceberg on S3 (~$5/month). Total ~$1,200/month for the full pipeline. Flink would be roughly the same compute cost plus the headcount overhead of JVM ops.

SQL
Topic — joins
Streaming join problems (SQL)

Practice →

Cheat sheet — streaming SQL recipes

Always declare event-time + watermark on every source. WATERMARK FOR event_time AS event_time - INTERVAL '30 SECONDS'. Without it, you implicitly use processing-time and lose correctness under reordering.
Use exactly-once sinks. Kafka transactional producer + read-committed consumers; idempotent JDBC inserts (ON CONFLICT DO NOTHING); Iceberg with two-phase commit.
Prefer IVM-friendly operators. SELECT, WHERE, INNER / LEFT JOIN on equality keys, GROUP BY on bounded-cardinality columns. Avoid UDFs in hot paths.
Bound state with retention. Every GROUP BY without WHERE event_time > now() - INTERVAL '...' is a candidate for unbounded state growth. Cap deliberately.
For dashboards, expose views via Postgres wire. Materialize (port 6875) and RisingWave (port 4566) both speak it natively; any BI tool plugs in unchanged.
Snapshot state regularly. Flink: execution.checkpointing.interval = 60s. Materialize: managed by the engine. RisingWave: configurable per cluster.
Treat the streaming engine as a database, not a job runner. Materialized views are tables; clients SELECT them. Don't push the engine into a "Kafka producer that runs SQL" mental model.
Configure idle-source detection on every input source. Multi-source joins stall when any input goes silent without idle detection.
Set allowed lateness at ~2× p99 event delay. Route the residual to a side-output topic; reconcile in a daily batch job.
Audit the cardinality of every GROUP BY column. Group-by-user-id can produce 100M-row state; group-by-region produces a 250-row state. Check before deploying.
Use IS DISTINCT FROM for NULL-safe join keys. NULL = NULL is NULL in three-valued logic; the NULL-safe operator returns TRUE / FALSE.
For Flink → Iceberg, use the 2PC commit. Configure EXACTLY_ONCE checkpoint mode and Iceberg's commit-protocol = 2pc. Both sides must cooperate.
For Materialize / RisingWave, use a small cluster first. Iterate on the view definitions in a small cluster, then scale up only when you've validated correctness.
Treat ksqlDB as legacy. Confluent's 2025 deprecation pushed it to maintenance mode. New projects: Materialize, RisingWave, or Flink.

Frequently asked questions

Materialize vs RisingWave — which should I pick in 2026?

Pick Materialize when you want the fastest iteration cycle on a Postgres-wire dashboard, your team is small (1-5 engineers), and your throughput is under 50k events/sec — its in-memory differential dataflow is the gold standard for sub-second freshness. Pick RisingWave when you need an OSS Apache-2.0 license, S3-tiered cheap state, single-binary deployment, or your throughput pushes past 50k events/sec. RisingWave is the safer procurement choice; Materialize is the faster iteration choice.

Is Flink SQL still relevant if Materialize and RisingWave exist?

Yes — Flink SQL is still the right call for petabyte-scale stateful pipelines, end-to-end exactly-once to Iceberg, and JVM-fluent platform teams. The exactly-once contract via two-phase commit between Flink checkpoints and the Iceberg sink is the most mature in the ecosystem, which matters for audit-sensitive workloads (fraud, finance, ad-fraud). The JVM ops burden is real and accelerates the headcount cost, but for the right workload Flink is still the lowest-risk choice.

Can I do streaming SQL without Kafka?

Yes. Materialize ingests directly from Postgres logical replication, S3, or webhook sources — no Kafka required. RisingWave has native CDC connectors for Postgres and MySQL. Flink has connectors for files, Pulsar, Kinesis, and CDC sources. Kafka is the most common source because it is replayable and durable, but every modern streaming engine can ingest from non-Kafka sources for smaller deployments or specific use cases.

What's the difference between watermarks and lateness?

A watermark is the producer's promise: "I will not emit any event with timestamp < W." It paces the engine's notion of "now." Lateness is the consumer's tolerance: "I will keep the window open for L seconds past the watermark to admit straggler events." Watermark is set on the source; lateness is set on the window operator. Together they let you balance freshness (tighter watermark) against completeness (wider lateness) — there is no free lunch, and the optimal pair is workload-specific.

How do streaming engines deliver exactly-once?

Exactly-once is a three-way contract among source, engine, and sink. The source must be replayable from a deterministic offset (Kafka, CDC). The engine must checkpoint state atomically (Flink uses 2PC across operators; Materialize uses transactional dataflow; RisingWave checkpoints to S3 with a coordinator). The sink must commit transactionally or be idempotent (Iceberg 2PC, transactional Kafka producer, idempotent JDBC). If any of the three links is missing, the pipeline collapses to at-least-once. Audit each layer separately before claiming exactly-once to a stakeholder.

Is streaming SQL cheaper than batch for my workload?

It depends on state size, retention window, and throughput. For sub-second freshness on small-to-medium state (under ~100 GB), streaming SQL is dramatically cheaper than batch — the IVM cost model is O(delta) per event vs O(N) per batch. For huge state (TB+) and relaxed freshness (hourly is fine), batch on Spark or dbt is usually cheaper because the per-batch cost amortises across many events. The rule of thumb: freshness target under 1 minute → streaming SQL; over 1 hour → batch; in between → measure both.

Practice on PipeCode

Drill the streaming practice library → for the watermark / event-time / late-data family of probes.
Rehearse on real-time analytics problems → when the interviewer wants sub-second SLAs and Postgres-wire reads.
Sharpen sliding window drills → for the tumbling vs hopping vs session window patterns.
Layer the aggregation library → for the GROUP BY + IVM cost model.
Stack the joins practice library → for streaming join state size and multi-source watermark alignment.
For the broader surface, read top data engineering interview questions →.
Stack the prerequisites with the only 5 skills you need to become a data engineer →.
Sharpen the dialect axis with the SQL for data engineering interviews course →.
For long-form schema craft, work through data modelling for DE interviews →.

Pipecode.ai is Leetcode for Data Engineering — every streaming-SQL recipe above ships with hands-on practice rooms where you write the watermark declaration, the IVM-friendly aggregation, and the exactly-once sink contract against real graded inputs. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so you never have to wonder whether your fix to a watermark stall actually behaves the same on Materialize as on Flink.

Practice streaming SQL now →
Real-time analytics drills →