real-time sql used to mean "spin up a Java team and learn the Kafka Streams DSL." In 2026 it means typing CREATE MATERIALIZED VIEW top_products AS SELECT product_id, SUM(amount) FROM orders GROUP BY product_id and watching the answer update in milliseconds as orders flow in. Four engines made that shift real: Materialize, RisingWave, Apache Flink SQL, and ksqlDB — same dialect, very different tradeoffs.
This guide is the cheat sheet for picking among them. It walks through what each engine is good at, the incremental view maintenance trick that lets a 1B-row aggregation update in microseconds, the watermark contract that keeps event-time semantics honest, the exactly-once delivery story (source + engine + sink all cooperating), and three end-to-end reference architectures with cost and ops profiles. Each section pairs a teaching block with a senior-interview-grade Solution Tail — code, step-by-step trace, output table, then a concept-by-concept breakdown of why it works.
When you want hands-on reps immediately after reading, drill the streaming practice library →, rehearse on real-time analytics problems →, and stack the time-series muscles with sliding window drills →.
On this page
- Why streaming SQL finally matters in 2026
- The streaming-SQL engines matrix — what each is good at
- Incremental view maintenance — the magic behind streaming SQL
- Watermarks, event time vs ingestion time, late data
- Reference architectures — Kafka → Flink, Kafka → Materialize, RisingWave end-to-end
- Cheat sheet — streaming SQL recipes
- Frequently asked questions
- Practice on PipeCode
1. Why streaming SQL finally matters in 2026
The decade-long shift from batch jobs to long-running materialised views you query like a database
The one-sentence invariant: streaming SQL is the model where every query is a long-running subscription that keeps its answer fresh as the underlying data changes — not a one-shot scan over a frozen snapshot. Once you internalise that the query is the maintained view, the entire family of "should this be a Spark job or a streaming view?" decisions resolves to a single question — "how stale can the answer be?"
The three eras in one paragraph.
- Batch era (2010-2016). Hadoop MapReduce, Hive, then early Spark. The answer was hours stale by design.
- Micro-batch era (2017-2022). Spark Structured Streaming, Kafka Connect, dbt on warehouses with hourly schedules. The answer was minutes stale and pretended to be live.
-
True streaming era (2023+). Materialize, RisingWave, Flink SQL —
CREATE MATERIALIZED VIEWas a streaming primitive. The answer is millisecond-fresh by construction.
Where streaming SQL actually fits.
- Operational dashboards. Sub-second freshness on top-K leaderboards, queue depth, latency p99, error rates.
- Alerting. "Page me when error rate over a 5-minute sliding window crosses 1%" is a SQL query, not a service.
- Feature serving. Online ML features (rolling means, last-N events, session counts) that must match the offline training distribution.
- Fraud detection. Sessionised joins between transaction and device streams, evaluated on every event.
- CDC fan-out. A Postgres source change propagates to Iceberg, Snowflake, Kafka, and a Slack alert in one materialized view per sink.
What interviewers listen for.
- Do you say "streaming SQL is a subscription, not a query" when asked the difference from batch? — senior signal.
- Do you reach for incremental view maintenance as the cost-model phrase? — required answer.
- Do you mention watermarks as the contract between event-time and the operator? — required answer.
- Do you call out Spark Structured Streaming as micro-batch, not true streaming? — senior signal.
- Do you separate exactly-once at the source, engine, and sink rather than as a single property? — staff signal.
The 2026 reality.
- Materialize is the iteration leader — Postgres wire-compatible, in-memory differential dataflow, one-cluster start.
- RisingWave is the OSS economics leader — Rust, distributed, S3-tiered state, Apache-2.0 license, one binary.
- Flink SQL is still the scale leader — JVM, distributed, RocksDB state, the de-facto choice for petabyte streams.
- ksqlDB is the deprecation cautionary tale — Confluent shipped end-of-life notices through 2025 in favour of Flink.
Worked example — the same "top-10 products" query across all four engines
Detailed explanation. A common interview opener: "show me the same top-10-products-by-revenue-in-the-last-hour query in Materialize, RisingWave, Flink SQL, and ksqlDB and call out the dialect differences." The query itself is nearly identical — the differences are in source declarations, watermark syntax, and refresh semantics.
Question. Write the streaming-SQL definition of top_products_last_hour (top 10 products by revenue in the trailing 60-minute window) in each of the four engines. Highlight where the syntax diverges.
Input. A Kafka topic orders with schema (order_id BIGINT, product_id BIGINT, amount DECIMAL, event_time TIMESTAMP).
Code.
-- Materialize
CREATE SOURCE orders FROM KAFKA BROKER 'kafka:9092' TOPIC 'orders'
FORMAT JSON
INCLUDE TIMESTAMP AS event_time;
CREATE MATERIALIZED VIEW top_products_last_hour AS
SELECT product_id,
SUM(amount) AS revenue
FROM orders
WHERE event_time >= mz_now() - INTERVAL '1 hour'
GROUP BY product_id
ORDER BY revenue DESC
LIMIT 10;
-- RisingWave
CREATE SOURCE orders (
order_id BIGINT, product_id BIGINT, amount DECIMAL, event_time TIMESTAMP,
WATERMARK FOR event_time AS event_time - INTERVAL '30 SECONDS'
) WITH (connector = 'kafka', topic = 'orders', properties.bootstrap.server = 'kafka:9092')
ROW FORMAT JSON;
CREATE MATERIALIZED VIEW top_products_last_hour AS
SELECT product_id, SUM(amount) AS revenue
FROM TUMBLE(orders, event_time, INTERVAL '1 HOUR')
GROUP BY product_id, window_start
ORDER BY revenue DESC
LIMIT 10;
-- Flink SQL
CREATE TABLE orders (
order_id BIGINT, product_id BIGINT, amount DECIMAL, event_time TIMESTAMP(3),
WATERMARK FOR event_time AS event_time - INTERVAL '30' SECOND
) WITH ('connector' = 'kafka', 'topic' = 'orders', 'format' = 'json',
'properties.bootstrap.servers' = 'kafka:9092');
CREATE VIEW top_products_last_hour AS
SELECT product_id, SUM(amount) AS revenue
FROM TABLE(TUMBLE(TABLE orders, DESCRIPTOR(event_time), INTERVAL '1' HOUR))
GROUP BY product_id, window_start, window_end
ORDER BY revenue DESC
LIMIT 10;
-- ksqlDB (legacy — being deprecated)
CREATE STREAM orders (order_id BIGINT, product_id BIGINT, amount DECIMAL, event_time BIGINT)
WITH (kafka_topic='orders', value_format='JSON', timestamp='event_time');
CREATE TABLE top_products_last_hour AS
SELECT product_id, SUM(amount) AS revenue
FROM orders
WINDOW TUMBLING (SIZE 1 HOUR)
GROUP BY product_id
EMIT CHANGES;
Step-by-step explanation.
-
Materialize treats the source as a long-running view;
mz_now()is the engine's notion of "now" and the WHERE-clause filter implicitly maintains a sliding window. No watermark required for the simple case. -
RisingWave and Flink SQL both use a
WATERMARK FOR ... AS event_time - INTERVAL '30 SECONDS'declaration — the operator promises "I will not emit events with timestamp more than 30 seconds older than the max seen." - RisingWave's
TUMBLEis a tabular function call; Flink SQL's is a polymorphic table function (PTF). Both produce the same tumbling-window semantics, but the polymorphic-table-function call site is more verbose. -
ksqlDB uses the older
WINDOW TUMBLING (SIZE 1 HOUR)clause and a specialEMIT CHANGESmarker to indicate the view should refresh continuously. The dialect is the most Kafka-coupled and the least portable. - All four queries produce conceptually the same "top 10 by revenue in the trailing hour" answer. The choice of engine drives the operational story (state size, ops burden, scalability), not the SQL.
Output (every engine).
| product_id | revenue |
|---|---|
| 7 | 18420.50 |
| 2 | 17910.10 |
| 19 | 14200.00 |
| ... | ... |
Rule of thumb. The streaming-SQL learning curve is dominated by source / watermark declarations and time-window syntax, not by the aggregate logic. Pick the engine on operational tradeoffs (Materialize for iteration, RisingWave for OSS economics, Flink for scale) — the SQL itself is largely portable.
Worked example — when the streaming view becomes a database table
Detailed explanation. The most underrated property of Materialize and RisingWave is Postgres wire compatibility — your dashboard, BI tool, or Slack bot connects to the streaming engine as if it were Postgres and SELECTs from a materialized view. The view is the database table; the streaming job is invisible.
Question. Given the top_products_last_hour materialized view defined in Materialize, show how a Grafana dashboard, a Python service, and a CLI query all consume the same fresh data without any custom wire protocol.
Input. A running Materialize cluster with the view from the previous example.
Code.
# CLI
psql "postgres://materialize@host:6875/materialize" \
-c "SELECT product_id, revenue FROM top_products_last_hour LIMIT 5"
# Python service
python - <<'PY'
import psycopg2
conn = psycopg2.connect("postgres://materialize@host:6875/materialize")
cur = conn.cursor()
cur.execute("SELECT product_id, revenue FROM top_products_last_hour")
print(cur.fetchall())
PY
# Grafana — configure a Postgres data source pointing at the cluster,
# then write the SELECT as the panel query. Auto-refresh every 5 s.
Step-by-step explanation.
- The Materialize cluster speaks the Postgres wire protocol on port 6875. Every tool that knows how to talk to Postgres talks to Materialize unchanged.
- The materialized view is not a frozen snapshot — every read returns the freshest committed answer, computed incrementally from the input stream.
- Latency from "order produced in Kafka" to "row visible to
SELECT" is single-digit milliseconds on small to medium clusters. - The same property holds for RisingWave: Postgres wire on port 4566 by default, queryable from any Postgres client.
- Flink SQL does not speak Postgres wire — you push results to a sink (Kafka, Iceberg, JDBC) and query the sink. This is the single biggest workflow gap for Flink in 2026, addressed partially by Apache Paimon and Flink Table Store.
Output. A SELECT returns the same shape as if the view were a Postgres table; freshness is implicit.
Rule of thumb. When the team wants "BI tool plugged into streams without custom connectors," pick an engine that speaks Postgres wire (Materialize or RisingWave). When the team wants "petabyte stream landing in Iceberg," pick Flink SQL with the Iceberg sink. Match the engine to the consumer surface.
Worked example — counting the cost of a long-running streaming view
Detailed explanation. Streaming SQL feels free at small scale but bills compute and state continuously. The right cost mental model is state size × retention window × throughput — not "queries per second."
Question. Estimate the rough monthly cost of running the top_products_last_hour view on a 50k events/second stream for a 30-day retention window on each of Materialize, RisingWave, and Flink SQL. Show how the cost drivers differ.
Input. 50,000 events/second × 86,400 seconds/day × 30 days = ~129B events; average row size 200 bytes; state is keyed by product_id (assume 50k distinct products) and one row per window per product.
Code (state-size estimate).
events_per_day = 50_000 * 86_400 # 4.32B
state_rows = 50_000 * 24 * 30 # 36M (product × hour × day)
state_size_gb = state_rows * 200 / 1e9 # ~7.2 GB
Step-by-step explanation.
-
Materialize charges roughly on cluster size; a
largecluster at ~$1k/month handles ~50k events/sec for a 50k-key, 7-GB-state aggregation comfortably. Cost is dominated by compute, not storage. - RisingWave tiers state to S3, so the 7-GB state lives at S3 prices (~$0.16/month) and compute is the binary's machine cost — typically half of a Flink cluster at the same throughput.
- Flink SQL stores state in RocksDB on local disk with periodic checkpoints to S3; a 4-node TaskManager cluster (~$1.2k/month on managed Flink) plus S3 checkpointing (~$5/month at 7 GB) is the typical bill.
-
All three can blow up the bill if you write
GROUP BY user_id(potentially 100M keys) without a retention window — the state grows unbounded and either OOMs (Materialize) or pages to S3 forever (RisingWave) or spills RocksDB to multi-TB disks (Flink).
Output (rough monthly costs, illustrative).
| Engine | Compute | State | Ops overhead | Approx. total |
|---|---|---|---|---|
| Materialize | $1,000 | in-cluster | low (managed) | $1,000 |
| RisingWave | $500 | $0.20 (S3) | medium (OSS deploy) | $500 |
| Flink SQL | $1,200 | $5 (S3 checkpoints) | high (JVM tuning) | $1,205 |
Rule of thumb. Before scaling a streaming SQL workload, run a back-of-envelope state size × retention × throughput estimate. If state explodes, change the SQL (add WHERE event_time > now() - INTERVAL '7 days', switch from GROUP BY user_id to GROUP BY user_segment) before changing the engine.
SQL interview question on streaming-SQL fundamentals
A senior interviewer often opens with: "Explain the difference between a Spark Structured Streaming query and a Materialize materialized view in two minutes. Then describe one workload where each is the right call." It probes whether the candidate knows micro-batch vs true streaming and the cost model.
Solution Using a side-by-side mental model
-- Spark Structured Streaming (micro-batch)
val query = orders.readStream
.format("kafka").option("subscribe", "orders").load()
.groupBy("product_id")
.agg(sum("amount").alias("revenue"))
.writeStream
.trigger(Trigger.ProcessingTime("10 seconds")) -- micro-batch
.outputMode("update")
.format("delta").start()
-- Materialize (true streaming, IVM)
CREATE MATERIALIZED VIEW top_products AS
SELECT product_id, SUM(amount) AS revenue
FROM orders
GROUP BY product_id;
Step-by-step trace.
| Property | Spark Structured Streaming | Materialize |
|---|---|---|
| Latency floor | the trigger interval (e.g. 10 s) | sub-second per row |
| State model | re-read all input each micro-batch (with state store) | maintain delta-driven state per row |
| Query language | Scala / PySpark DataFrame DSL or SQL | pure SQL (Postgres dialect) |
| Read shape | downstream consumer reads the sink (Delta, Iceberg) | downstream consumer SELECTs the view via Postgres wire |
| Best for | append-only ETL into a lakehouse | dashboards, alerting, BI on live data |
Output:
| Workload | Best engine | Why |
|---|---|---|
| Hourly Iceberg landing of clickstream | Spark Structured Streaming | append-only, lakehouse-native, hourly is fine |
| Sub-second leaderboard for a live ops console | Materialize | Postgres wire to Grafana, single-digit-ms freshness |
Why this works — concept by concept:
- Micro-batch vs true streaming — Spark waits a trigger interval, then runs a batch over the new input plus the state store. Materialize updates the answer per row. The two are different products even though both call themselves "streaming."
- Subscription, not query — the Materialize view is a long-lived computation. Once created, the consumer SELECTs the freshest answer. The query is the maintenance.
- Latency floor — Spark's floor is the trigger interval; Materialize's floor is the dataflow latency (~10 ms). For sub-second SLAs, Spark is mathematically the wrong choice.
- Operational surface — Spark needs a sink and a separate dashboard reader; Materialize is the read endpoint. One fewer hop, one fewer system to operate.
- Cost — Spark Structured Streaming on a 50k events/sec workload typically costs 2-3× a Materialize cluster of equivalent freshness, because every micro-batch re-scans state. IVM is the algorithmic win.
SQL
Topic — streaming
Streaming SQL problems (SQL)
2. The streaming-SQL engines matrix — what each is good at
Same SQL dialect family, four very different operational profiles — pick by team, scale, and ops appetite
The mental model in one line: Materialize is for iteration and Postgres-wire dashboards; RisingWave is for OSS economics and S3-tiered state; Flink SQL is for petabyte-scale stateful pipelines; ksqlDB is legacy and being sunset by Confluent. Once you can name the right engine in one sentence per use case, the interview question "which streaming engine would you pick for X?" collapses to a memorisation exercise.
The feature matrix in one table.
| Feature | Materialize | RisingWave | Flink SQL | ksqlDB |
|---|---|---|---|---|
| Language | Rust | Rust | Java / Scala | Java |
| IVM | true (differential dataflow) | true | true | limited |
| Streaming joins | ✓ | ✓ | ✓ | basic |
| Exactly-once | sink-dependent | sink-dependent | two-phase commit | sink-dependent |
| CDC source | Postgres + Debezium | native + Debezium | Debezium connector | Debezium connector |
| Distributed | single-node + read replicas | distributed | distributed | distributed |
| Wire protocol | Postgres | Postgres | JDBC / custom | REST + Kafka |
| State backend | in-memory + S3 spill | S3-tiered | RocksDB on local + S3 checkpoint | RocksDB |
| License | source-available BSL | Apache 2.0 | Apache 2.0 | Confluent CL (deprecated) |
| Best for | iteration + dashboards | OSS lakehouse | large state at scale | legacy Kafka-only shops |
When to pick each — in one sentence.
-
Materialize. "Small team, dashboards over a Postgres BI tool, need sub-second freshness." The fastest path from
CREATE MATERIALIZED VIEWto "Grafana shows live data." - RisingWave. "OSS-first lakehouse, need cheap S3-tiered state, want the latest engine architecture." The 2026 default for new OSS deployments.
- Apache Flink SQL. "Petabyte stream, JVM-fluent SRE team, must guarantee exactly-once end-to-end." The de-facto enterprise standard.
- ksqlDB. Do not start new projects on ksqlDB in 2026 — Confluent's 2025 deprecation notice pushed it to maintenance mode in favour of Flink-on-Confluent.
The non-engines (the things that look like streaming SQL but are not).
- dbt + Kafka. dbt is a batch DAG runner; pairing it with Kafka does not make it streaming. The freshness floor is dbt's run interval.
- Spark Structured Streaming. Micro-batch. The freshness floor is the trigger interval (typically 5-60 s). Excellent for append-only ETL, wrong for sub-second SLAs.
- Redshift Streaming Ingestion. A continuous COPY into a Redshift table. Querying is still batch on the read side.
- Snowflake Dynamic Tables. A managed materialized view that refreshes on a schedule (down to 1 minute). Closer to "incremental dbt" than to true streaming SQL.
Common interview probes on the matrix.
- "Materialize vs RisingWave — what's the deciding factor?" — managed vs OSS, in-memory vs S3-tiered, single-region vs distributed-first.
- "When is Flink the right call?" — petabyte streams, exactly-once required end-to-end, team already fluent in JVM tuning.
- "Why is ksqlDB on the way out?" — Confluent positioning Flink-on-Confluent as the successor; ksqlDB's join semantics and IVM coverage are weaker.
- "Is Spark Structured Streaming a competitor?" — only for append-only ETL with relaxed latency. Not for sub-second views.
Worked example — picking the right engine for three workloads
Detailed explanation. A common senior-design probe: given three workloads with different freshness, throughput, and team-shape constraints, name the engine and justify in two sentences. The probe tests whether you have a mental model, not whether you've memorised one engine's docs.
Question. For each workload, pick one engine and explain the choice. (1) A 4-person fintech team needs a fraud-dashboard that updates within 1 second on 5k events/sec, queried from Grafana. (2) A 30-person ad-tech team needs to land 200k events/sec into Iceberg with strict exactly-once guarantees. (3) A 6-person OSS-only startup wants a single binary they can run on EC2 with no managed vendors and S3 state.
Input. Three constraint sets above.
Code (decision matrix).
Workload 1 — fintech fraud dashboard
freshness: 1 s | throughput: 5k/s | team: small | consumer: Grafana
→ Materialize
• Postgres-wire connects directly to Grafana
• Single cluster, no JVM ops
• IVM keeps the dashboard answer fresh at delta speed
Workload 2 — ad-tech Iceberg landing
freshness: minutes | throughput: 200k/s | team: large JVM-fluent
exactly-once required end-to-end
→ Flink SQL + Iceberg sink
• Two-phase commit between Flink checkpoint and Iceberg snapshot
• Petabyte state handled by RocksDB + S3 checkpoint
• Mature operational tooling (Flink UI, metrics, savepoints)
Workload 3 — OSS-only startup
freshness: 1 s | throughput: 5k/s | team: 6 | OSS-only | S3 state
→ RisingWave
• Apache-2.0 license, single binary
• S3-tiered state keeps the bill near-zero
• Postgres wire = any BI tool plugs in
Step-by-step explanation.
- Workload 1 screams "managed Postgres-wire engine" — sub-second SLA + small team + Grafana eliminates Flink (JVM ops too heavy) and RisingWave (more ops than Materialize Cloud).
- Workload 2 screams "Flink" — exactly-once + Iceberg + 200k/s + large team makes the JVM cost worth it. Materialize cannot handle the state size; RisingWave can but Flink has the mature Iceberg sink story.
- Workload 3 screams "RisingWave" — OSS + single binary + S3 state is exactly its pitch. Materialize is source-available BSL (not OSS by some definitions); Flink is OSS but the JVM ops are heavy for a 6-person team.
- The matrix is not "which engine is best" — it is "which engine fits this team and SLA."
Output.
| Workload | Pick | Single-line reason |
|---|---|---|
| Fraud dashboard | Materialize | Postgres-wire + sub-second + tiny ops |
| Iceberg landing | Flink SQL | Exactly-once + petabyte scale + JVM team |
| OSS startup | RisingWave | Apache-2.0 + S3 state + single binary |
Rule of thumb. Match the engine to the consumer surface and the team shape, not to the workload's throughput in isolation. Throughput is necessary but rarely sufficient — operational fit is what decides the deployment.
Worked example — same query, three engines, three watermark syntaxes
Detailed explanation. The watermark declaration is where every dialect diverges most. Same semantic ("I will not see events more than 30 seconds older than the max seen"), three different keywords.
Question. Translate the watermark declaration WATERMARK FOR event_time AS event_time - INTERVAL '30 SECONDS' into the canonical syntax of Materialize, RisingWave, and Flink SQL, and highlight which engine omits the explicit watermark for simple cases.
Input. A Kafka topic events with event_time TIMESTAMP.
Code.
-- Materialize — implicit; no WATERMARK keyword
CREATE SOURCE events FROM KAFKA BROKER 'kafka:9092' TOPIC 'events'
FORMAT JSON
INCLUDE TIMESTAMP AS event_time;
-- Materialize derives a watermark from the source automatically;
-- tighten with mz_now() vs event_time predicates.
-- RisingWave — WATERMARK inside the source definition
CREATE SOURCE events (
user_id BIGINT, action TEXT, event_time TIMESTAMP,
WATERMARK FOR event_time AS event_time - INTERVAL '30 SECONDS'
) WITH (connector = 'kafka', topic = 'events',
properties.bootstrap.server = 'kafka:9092')
ROW FORMAT JSON;
-- Flink SQL — WATERMARK inside the CREATE TABLE
CREATE TABLE events (
user_id BIGINT, action STRING, event_time TIMESTAMP(3),
WATERMARK FOR event_time AS event_time - INTERVAL '30' SECOND
) WITH ('connector' = 'kafka', 'topic' = 'events', 'format' = 'json',
'properties.bootstrap.servers' = 'kafka:9092');
Step-by-step explanation.
- Materialize auto-derives a watermark for Kafka sources from the offset clock — no explicit syntax for the common case. Override only when correctness depends on event-time lag.
-
RisingWave puts the watermark in the source definition (the
CREATE SOURCEblock). -
Flink SQL puts it in the
CREATE TABLEblock. The interval keyword isSECOND(singular) in Flink andSECONDS(plural) in RisingWave — a small but real foot-gun when porting queries. - All three semantics are identical: the operator promises to never emit a row with event-time older than
(max_event_time_seen - 30s).
Output. Each engine compiles the source, attaches the same semantic watermark, and downstream windows now operate in event-time correctly.
Rule of thumb. When you port a streaming SQL job between engines, the watermark declaration is where the dialect tax lands. Memorise the syntax for the engine you use most, and keep a one-line cheat for the other two.
Worked example — exactly-once across the three engines
Detailed explanation. Exactly-once is not a single engine property — it is a contract among the source, the engine's checkpoint mechanism, and the sink's transaction support. Every modern engine can deliver it, but only when all three pieces cooperate.
Question. Walk through what each layer must guarantee for an exactly-once pipeline in (a) Flink SQL → Iceberg, (b) Materialize → Postgres, (c) RisingWave → Kafka. Identify the weakest link.
Input. Three end-to-end pipelines above.
Code (configuration excerpt).
-- Flink → Iceberg, exactly-once
SET 'execution.checkpointing.interval' = '60s';
SET 'execution.checkpointing.mode' = 'EXACTLY_ONCE';
-- Iceberg sink commits per checkpoint via two-phase commit;
-- the same checkpoint also commits Kafka source offsets.
-- Materialize → Postgres (idempotent sink)
-- Materialize emits row diffs (INSERT/UPDATE/DELETE); the sink
-- applies them inside a single Postgres transaction per microbatch.
-- Exactly-once = source replayability + idempotent diff apply.
-- RisingWave → Kafka transactional producer
CREATE SINK leaderboard_sink FROM top_products
WITH ( connector = 'kafka', topic = 'leaderboard',
type = 'append-only',
properties.bootstrap.server = 'kafka:9092',
properties.transactional.id = 'leaderboard-tx');
Step-by-step explanation.
- Source. Must support replay from a deterministic offset (Kafka, Postgres logical replication, Pulsar). If the source cannot replay, exactly-once is impossible on a crash — you can only get at-least-once.
- Engine. Must checkpoint state atomically. Flink uses 2PC across operators; Materialize uses transactional state in its dataflow; RisingWave checkpoints to S3 with a coordinator.
- Sink. Must commit atomically with the engine's checkpoint. Iceberg supports two-phase commit; Postgres supports it via transactional inserts; Kafka supports it via transactional-producer + read-committed consumers.
- The weakest link. A non-idempotent sink (e.g. a fire-and-forget HTTP webhook) collapses exactly-once to at-least-once regardless of how strong the engine is. The pipeline's guarantee is the minimum across the three layers.
Output.
| Pipeline | Source | Engine | Sink | End-to-end |
|---|---|---|---|---|
| Flink → Iceberg | Kafka replayable | 2PC checkpoint | Iceberg 2PC | exactly-once |
| Materialize → Postgres | Kafka replayable | diff-dataflow | idempotent diff apply | exactly-once |
| RisingWave → Kafka | Kafka replayable | S3 checkpoint | Kafka transactional | exactly-once |
| Any → webhook | Kafka replayable | any | non-idempotent | at-least-once |
Rule of thumb. Before you advertise "exactly-once" to a stakeholder, audit each of the three layers separately. The pipeline is exactly-once only when source replay + engine checkpoint + sink commit all cooperate; replace any non-idempotent sink with an idempotent or transactional one or accept at-least-once.
SQL interview question on engine selection
A senior interviewer might frame this as: "Your team is moving off ksqlDB before Confluent's end-of-life. You have three workloads — a sub-second leaderboard, a Iceberg landing job, and a CDC fan-out to four sinks. Pick a single engine for all three, or pick three different engines, and justify."
Solution Using a single-engine-when-possible heuristic
Constraint | Materialize | RisingWave | Flink SQL
--------------------------+-------------+------------+----------
Sub-second leaderboard | best | good | good
Iceberg landing (200k/s) | weak | good | best
CDC fan-out, 4 sinks | good | best | good
Single binary OSS | no | yes | no
Postgres wire | yes | yes | no
JVM ops avoided | yes | yes | no
The decision is RisingWave for all three — it covers every workload at "good or best," ships as a single OSS binary, speaks Postgres wire, and avoids the JVM ops burden. The only "best" Materialize wins is the sub-second leaderboard, and RisingWave is close enough that the operational simplicity of a single engine wins.
Step-by-step trace.
| Workload | Pick | Why not the alternative |
|---|---|---|
| Leaderboard | RisingWave | Materialize is "best" but adds a second engine |
| Iceberg landing | RisingWave | Flink is "best" but adds JVM ops |
| CDC fan-out | RisingWave | both alternatives need extra tooling |
Output:
| Engine count | Operational surface | Iteration speed | Cost |
|---|---|---|---|
| 1 (RisingWave) | smallest | fastest | lowest |
| 2 (Materialize + Flink) | medium | medium | medium |
| 3 (one per workload) | largest | slowest | highest |
Why this works — concept by concept:
- Single-engine bias — every engine you add is more code paths, more on-call runbooks, more upgrade lockstep. Default to one engine unless a workload genuinely cannot fit.
- "Good" beats "best" plus an extra system — RisingWave's "good" sub-second performance with a 50k-key state is operationally cheaper than Materialize's "best" with a separate Iceberg-landing system.
- License matters — Apache-2.0 vs source-available BSL is a procurement question. RisingWave's pure OSS license is the safe answer for security-sensitive shops.
- Postgres wire is a force multiplier — every BI tool, every dashboarding tool, every analyst's notebook just works. This is the largest hidden lever in streaming-SQL adoption.
- Cost — three engines at 30% utilisation each costs more than one engine at 90% utilisation. Single-engine consolidation is the classic "make the bill smaller" move.
SQL
Topic — real-time analytics
Real-time analytics problems (SQL)
3. Incremental view maintenance — the magic behind streaming SQL
IVM turns expensive joins into O(delta) updates — the algorithmic foundation that makes real-time SQL viable
The mental model in one line: incremental view maintenance recomputes only what changes when an input row is inserted, updated, or deleted — not the whole view from scratch. Once you internalise "delta in, delta out," the entire family of "how can a billion-row aggregate update in microseconds?" questions resolves to a single algorithmic invariant.
IVM in five bullets.
-
Definition. Given a view
V = f(R)and a deltaΔRon the input, IVM computesΔV = f'(R, ΔR)such thatV' = V + ΔV— without recomputingf(R + ΔR)from scratch. -
Cost model. Each delta does O(log N) work for indexed joins, O(1) work for aggregations on a fixed group key, O(K) work for top-K updates. Independent of
N. - Operator coverage. SELECT, WHERE, INNER / LEFT / OUTER JOIN, GROUP BY, COUNT/SUM/AVG, window aggregations with watermarks, DISTINCT, UNION.
-
Operator gaps. Arbitrary UDFs (the engine cannot prove they are deterministic), recursive CTEs, non-deterministic functions (
RANDOM(),NOW()in some contexts), order-dependent aggregates. - Retractions. Updates and deletes require negative deltas — the operator emits "remove old row, add new row." Naive IVM that only handles inserts breaks on the first DELETE.
Differential dataflow — the academic foundation.
Materialize is built on differential dataflow, a 2013 Microsoft Research framework that formalises "compute the same answer incrementally as inputs change," including across nested loops and recursive queries. RisingWave and Flink use simpler but related approaches (stream-table duality, retract streams). The practical upshot: every modern streaming engine treats IVM as the core primitive, not an optimisation.
Common interview probes on IVM.
- "How can a 1B-row aggregate update in microseconds?" — because the aggregate state is a hash map keyed by
group_id; one delta touches one key and emits one updated value. - "Why doesn't Spark Structured Streaming have true IVM?" — because micro-batch re-reads the input each interval; the state store helps but the algorithmic model is still "read window of input, compute, write." Not delta-only.
- "What happens to IVM on a DELETE?" — the engine emits a negative delta. Downstream aggregates apply it (e.g. SUM decrements by the deleted row's value).
- "Can IVM handle a UDF?" — only if the UDF is deterministic, pure, and side-effect-free. Most engines require an explicit annotation; without it, IVM treats the UDF as opaque and falls back to recomputation.
The five operator coverage tiers.
- Tier 1 — trivially IVM. SELECT, WHERE on constant predicates, projection, UNION ALL. O(1) per delta.
- Tier 2 — IVM via index. INNER / LEFT JOIN on equality keys. O(log N) per delta with a hash index.
- Tier 3 — IVM via fixed group state. GROUP BY + SUM/COUNT/AVG. O(1) per delta plus the hash-map lookup.
- Tier 4 — IVM with retraction. Top-K, DISTINCT COUNT, MIN/MAX (because the previous-max row might be deleted). O(K) per delta.
- Tier 5 — not IVM-friendly. Arbitrary UDFs, recursive CTEs, ORDER BY with LIMIT on unbounded streams, window functions without watermarks.
Worked example — IVM on a streaming top-K leaderboard
Detailed explanation. A click-event stream feeds a top-10 leaderboard. Each click event triggers a delta that updates exactly one product's score; the top-K state re-sorts the small head of the list, not the whole product population.
Question. Given the click event stream below and a top-3 leaderboard, walk through the IVM updates step by step. Show what changes in the materialized state on each event.
Input — click stream (oldest first).
| seq | product_id | score |
|---|---|---|
| 1 | A | +5 |
| 2 | B | +3 |
| 3 | A | +4 |
| 4 | C | +12 |
| 5 | A | -5 (correction) |
Code.
CREATE MATERIALIZED VIEW leaderboard AS
SELECT product_id, SUM(score) AS total_score
FROM clicks
GROUP BY product_id
ORDER BY total_score DESC
LIMIT 3;
Step-by-step explanation.
-
Event 1 (
A, +5). Insert intoclicks. Aggregate state:{A: 5}. Top-3:[A=5]. Δ emitted to leaderboard:+(A, 5). -
Event 2 (
B, +3). Aggregate state:{A: 5, B: 3}. Top-3:[A=5, B=3]. Δ emitted:+(B, 3). -
Event 3 (
A, +4). A's row updates: old(A, 5), new(A, 9). Δ emitted:-(A, 5)then+(A, 9). Top-3 re-sorts:[A=9, B=3]. -
Event 4 (
C, +12). Aggregate state adds{C: 12}. Top-3 now includes C and re-orders:[C=12, A=9, B=3]. Δ emitted:+(C, 12)and a reorder of the existing rows. -
Event 5 (
A, -5 correction). A's row updates: old(A, 9), new(A, 4). Δ:-(A, 9)then+(A, 4). Top-3 re-sorts:[C=12, A=4, B=3]. - Total work. Each event did O(1) hash-map update + O(K) leaderboard re-sort, where K=3. Never re-scanned the full input or full state.
Output (final).
| product_id | total_score |
|---|---|
| C | 12 |
| A | 4 |
| B | 3 |
Rule of thumb. Streaming top-K with IVM is the workhorse "leaderboard / trending / hot-items" pattern. Cap K deliberately (top-10 / top-100, not top-10000) — beyond a few hundred entries the re-sort cost dominates and you should switch to a windowed top-K.
Worked example — IVM on a streaming JOIN
Detailed explanation. A streaming join between two long-running tables (orders and customers) maintains a delta-driven hash index on both sides. When an order arrives, the engine probes the customers index in O(log N); when a customer is updated, the engine probes the orders index. No full re-scan of either side.
Question. Walk through the IVM behaviour when (a) a new order arrives, (b) a customer record is updated, (c) a customer is deleted. Use the small example tables below.
Input — customers.
| customer_id | name | tier |
|---|---|---|
| 1 | Alice | gold |
| 2 | Bob | silver |
Input — orders (streaming).
| order_id | customer_id | amount |
|---|---|---|
| 100 | 1 | 50 |
Code.
CREATE MATERIALIZED VIEW order_with_customer AS
SELECT o.order_id, o.amount, c.name, c.tier
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;
Step-by-step explanation.
-
(a) New order arrives —
(order_id=101, customer_id=2, amount=80). Engine probes the customers hash index forcustomer_id=2→ finds(Bob, silver). Emits+(101, 80, Bob, silver). -
(b) Customer 1 updated — name changes from
AlicetoAlicia. Engine emits two deltas:-(100, 50, Alice, gold)and+(100, 50, Alicia, gold). Downstream materialized state replaces the matching row. -
(c) Customer 2 deleted — engine emits
-(101, 80, Bob, silver). The materialized view loses that row; the underlyingordersrow stays in the input but no longer joins. - None of these steps re-scanned the full orders or customers tables. The total work per event is O(log N) for the hash-index probe plus O(1) for the materialized-view update.
Output (after all three steps).
| order_id | amount | name | tier |
|---|---|---|---|
| 100 | 50 | Alicia | gold |
Rule of thumb. Streaming joins on equality keys are IVM-friendly out of the box. The hidden cost is the join state size — both sides must keep their indices in memory or on tiered state for the full retention window. Bound the retention (e.g. WHERE order_date > now() - INTERVAL '7 days') to cap the state.
Worked example — when IVM falls back to recomputation
Detailed explanation. Some operators cannot be incrementally maintained — the engine has to recompute the view (or part of it) on each delta. Knowing which patterns trigger fallback is what separates the staff engineer from the senior.
Question. Identify which of the following queries IVM cannot maintain incrementally. For each, explain why and propose a rewrite.
Input. Three candidate queries.
Code.
-- 1) Arbitrary UDF
SELECT call_external_api(user_id) AS enriched FROM events;
-- 2) Recursive CTE
WITH RECURSIVE org AS (
SELECT employee_id, manager_id FROM employees WHERE manager_id IS NULL
UNION ALL
SELECT e.employee_id, e.manager_id
FROM employees e JOIN org ON e.manager_id = org.employee_id
)
SELECT * FROM org;
-- 3) Median (order-dependent aggregate)
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) AS median_amount
FROM orders;
Step-by-step explanation.
-
(1) UDF. The engine cannot prove
call_external_apiis deterministic or side-effect-free. On a delta, it cannot know whether to re-invoke. Most engines refuse to maintain this incrementally; some fall back to per-row recomputation each delta. Rewrite: pre-enrich the event in the source pipeline, then stream the enriched topic. -
(2) Recursive CTE. A delta on
employeesmight invalidate the entire recursion tree. Materialize supports differential recursion (a research-grade feature); Flink and RisingWave generally do not. Rewrite: materialize the closure as a separate non-recursive view, or compute the recursion in a periodic batch job. - (3) Median. Order-dependent aggregates require knowing the full sorted set. A single new event can shift the median, which means recomputing on every delta. Rewrite: use approximate quantiles (T-Digest, KLL) which are IVM-friendly (the sketch state updates in O(log K) per delta).
Output (rewrites).
| Original | Issue | Rewrite |
|---|---|---|
UDF call_external_api
|
nondeterministic / side-effects | pre-enrich at source |
| Recursive CTE | full-tree invalidation per delta | periodic batch closure |
| Exact median | order-dependent | T-Digest sketch / approximate quantile |
Rule of thumb. Before declaring "we'll do this in streaming SQL," audit the query for IVM-unfriendly operators. The fix is usually a rewrite (push computation upstream, swap exact for sketch) rather than abandoning streaming entirely.
SQL interview question on IVM cost model
A senior interviewer often opens with: "Walk me through the per-delta cost of maintaining a streaming view that does SELECT region, SUM(amount) FROM orders GROUP BY region on a 100M-row table receiving 50k inserts/sec. Be explicit about what is O(1), what is O(log N), and what the state size is."
Solution Using the IVM cost-model decomposition
CREATE MATERIALIZED VIEW revenue_by_region AS
SELECT region, SUM(amount) AS revenue
FROM orders
GROUP BY region;
Step-by-step trace.
| Step | Per-delta cost | Why |
|---|---|---|
Hash region to a group key |
O(1) | constant-time hash |
Look up current revenue[region]
|
O(1) avg (hash map) | open-addressing hash |
Add amount to revenue[region]
|
O(1) | arithmetic |
Emit delta (region, +amount)
|
O(1) | append to output stream |
| Downstream view applies delta | O(1) | per-key update |
| Total per delta | O(1) | independent of 100M row count |
State size = number of distinct regions (say ~250 country codes) × ~80 bytes per row ≈ 20 KB. Independent of the 100M input rows — the state is keyed by the GROUP BY column, not the input row count.
Output:
| Metric | Value |
|---|---|
| Per-delta cost | O(1) |
| State size | ~20 KB |
| Throughput ceiling | ~10M events/sec on a single core (memory-bound) |
| Memory growth per new event | 0 (existing region) or ~80 bytes (new region) |
Why this works — concept by concept:
-
GROUP BY state is keyed by the group column, not by input row count — that's why
SUM(amount) GROUP BY regionhas a state size proportional to the number of regions (small, bounded), not to the number of orders (huge, unbounded). -
Hash-map updates are amortised O(1) — the engine maintains a hash map keyed by
region; each delta is one lookup + one update. -
Delta out, not snapshot out — the operator emits a
(region, +amount)delta downstream. Materialized consumers apply the delta in O(1). No re-broadcast of the full view. -
Bounded state — the operator never grows beyond the distinct group key count. Compare with
GROUP BY user_idover 100M users → 100M-row state. Always check the cardinality of the GROUP BY column before declaring "this is fine." - Cost — per delta: O(1) compute, O(1) state update, O(1) output. End-to-end: 100k events/sec on a small cluster, with state size capped by group cardinality.
SQL
Topic — aggregation
Aggregation problems (SQL)
4. Watermarks, event time vs ingestion time, late data
Watermarks are the producer's promise; event-time is the only honest clock; late data is the consumer's tolerance
The mental model in one line: a watermark W at time t is a promise that no future event will have an event_time < W — the engine uses it to decide when a window can be closed and emitted. Once you can quote that contract verbatim, the entire family of "how does Flink handle late data?" questions resolves to "what watermark interval and lateness allowance did you configure?"
The three clocks.
- Event time. When the event actually happened in the source system. The only clock that gives analytically-correct answers. Example: "the user clicked at 09:42:15."
- Ingestion time. When the engine received the event from the broker. Slightly delayed from event time (network + broker buffer). Example: "Kafka consumer received the click at 09:42:18."
- Processing time. When the operator processed the event in the dataflow. Most delayed; subject to backpressure. Example: "Flink window operator handled the click at 09:42:21."
Watermarks in five bullets.
-
Watermark = max-event-time-seen minus lateness budget. A watermark of
09:50means "I will not see any event with timestamp < 09:50 ever again." - Watermarks advance the engine's notion of "now." When the watermark crosses 10:00, every 09:00-10:00 tumbling window can close and emit.
- Tighter watermark = lower latency, more late events dropped. Looser watermark = higher latency, fewer dropped events. There is no free lunch.
- Watermarks are per-source. Multi-source joins take the minimum watermark across all inputs (the slowest source paces the join).
- Watermarks do not enforce determinism in the face of failure. That is the checkpoint's job. Watermarks govern correctness under normal operation.
Late-data strategies in three bullets.
- Drop. Default for cheap real-time pipelines. The late event is logged and discarded. Simple; loses signal.
- Side-output. The recommended default. The late event is routed to a separate topic for batch repair downstream.
- Re-emit. The window is reopened, the corrected aggregate is emitted with a retraction. Downstream consumers must support retractions.
Allowed lateness. A window can stay open for L seconds past its watermark to admit straggler events. L=0 is "drop immediately"; L=10m is "tolerate 10-minute stragglers." Trade-off: higher L = more state retained per window, higher memory.
Common interview probes on watermarks.
- "Explain watermarks to a non-streaming engineer in three sentences." — "Streaming queries need a way to know when a time window can close. A watermark is a producer's promise: 'I will never send events older than this timestamp.' The engine uses watermarks to decide when to emit window results."
- "What's the difference between event-time and ingestion-time semantics?" — event-time is correct under reordering and replay; ingestion-time is cheaper but breaks under late arrivals.
- "Why does my Flink window never fire?" — likely the watermark is not advancing (no events, no idle-source detection, or a per-key watermark is stuck on a slow key).
- "How does Materialize handle late data?" — it auto-retracts the affected window result and re-emits, because materialized views naturally retract.
Worked example — a 1-hour tumbling window with 5-minute watermark and 10-minute lateness
Detailed explanation. A classic combo: tumbling 1-hour windows for hourly aggregates, with a 5-minute watermark lag (most events arrive within 5 minutes of their event-time) and a 10-minute allowed lateness (window stays open 10 minutes past watermark to admit stragglers).
Question. Given the events below, walk through which window each event belongs to and when each window emits. Identify which event is dropped as late and which triggers a re-emit.
Input — events arriving (processing-time order, event-times in column 3).
| arrived_at | event_time | description |
|---|---|---|
| 09:55 | 09:50 | normal event |
| 10:03 | 09:58 | normal event |
| 10:07 | 09:42 | straggler (within 10m lateness) |
| 10:16 | 09:20 | very late (past lateness) |
| 10:20 | 10:15 | normal next-window event |
Code.
SELECT
window_start, window_end, COUNT(*) AS event_count
FROM TABLE(TUMBLE(TABLE events, DESCRIPTOR(event_time), INTERVAL '1' HOUR))
GROUP BY window_start, window_end;
-- assume WATERMARK FOR event_time AS event_time - INTERVAL '5' MINUTE
-- and ALLOWED LATENESS '10' MINUTE
Step-by-step explanation.
- Event 1 (09:50, arrives 09:55). Belongs to the [09:00, 10:00) window. Within watermark. Counted normally.
- Event 2 (09:58, arrives 10:03). Belongs to [09:00, 10:00). Watermark just advanced past 09:58. Counted normally.
-
Event 3 (09:42, arrives 10:07). Belongs to [09:00, 10:00). Watermark =
max_ts - 5min. Now watermark is at ~09:55 (max ts seen was 10:00). Event is past watermark — but within the 10-minute lateness budget. Window re-opens, re-emits corrected count. - Event 4 (09:20, arrives 10:16). Belongs to [09:00, 10:00). Watermark is at ~09:55, plus lateness = 10:05 closing deadline. Event arrived at 10:16 — dropped as too late (or side-outputted, depending on config).
- Event 5 (10:15, arrives 10:20). Belongs to [10:00, 11:00). Watermark advances to ~10:10. Event is well within window. Counted normally.
Output (final emitted windows).
| window_start | window_end | event_count |
|---|---|---|
| 09:00 | 10:00 | 3 (events 1, 2, 3) |
| 10:00 | 11:00 | 1 (so far) |
Rule of thumb. Set the watermark lag to the 99th-percentile event-time-to-ingest delay. Set the allowed lateness to roughly 2× the lag. Anything past that is genuinely late and should go to a side-output for batch repair, not eaten by the streaming window.
Worked example — multi-source watermark alignment
Detailed explanation. When a query joins two streams (e.g. orders and payments), the engine takes the minimum watermark across both inputs. If one source is slow or idle, it can stall the entire join — the classic "my streaming join stopped emitting" bug.
Question. Given two source streams with different watermark behaviour, identify why the join output stalls and how to fix it.
Input — orders watermark trace.
| time | watermark |
|---|---|
| 09:55 | 09:50 |
| 10:00 | 09:55 |
| 10:05 | 10:00 |
Input — payments watermark trace.
| time | watermark |
|---|---|
| 09:55 | 09:50 |
| 10:00 | 09:50 (no new payments) |
| 10:05 | 09:50 (still nothing) |
Code.
SELECT o.order_id, p.payment_id, o.amount
FROM orders o
JOIN payments p ON o.order_id = p.order_id
WHERE o.event_time BETWEEN p.event_time - INTERVAL '5' MINUTE
AND p.event_time + INTERVAL '5' MINUTE;
Step-by-step explanation.
- The join takes the minimum of the two input watermarks. Joined watermark =
min(orders, payments) = 09:50for the entire trace. - No new payments → payments watermark does not advance → join watermark stuck at 09:50.
- Downstream window operators that depend on the join (e.g. an hourly aggregate over join output) cannot close their 09:00-10:00 window because the watermark never crosses 10:00.
-
The fix: idle-source detection. Configure the payments source with
'table.exec.source.idle-timeout' = '5 minute'(Flink) — after 5 minutes of no events, the source declares itself "idle" and the engine ignores its watermark for the min calculation. - Or: synthetic heartbeats. The producer emits a heartbeat event every minute that advances event-time without contributing to the join. Old-school but bulletproof.
Output (before vs after fix).
| Config | Join watermark at 10:05 |
|---|---|
| No idle detection | 09:50 (stuck) |
idle-timeout = 5 min |
advances with orders |
Rule of thumb. Multi-source joins are the #1 source of "my streaming pipeline stopped emitting" tickets. Always configure idle-source detection on every input source (or emit heartbeats), or you will eventually wake up to a 6-hour-stale dashboard with no errors in the logs.
Worked example — sizing the allowed-lateness budget
Detailed explanation. Allowed lateness sounds like a free safety net but costs window-state-retention × peak windows open. Setting it too high blows memory; setting it too low drops events.
Question. A 1-hour tumbling window with a per-window state of 200 MB receives events whose 99th-percentile lateness is 8 minutes. What allowed-lateness should you configure, and what is the resulting memory cost?
Input. 99th-percentile lateness = 8m; per-window state = 200 MB; query has 24 windows/day.
Code (sizing).
# Allowed lateness = 2x the p99 lateness
allowed_lateness_min = 2 * 8 # 16 minutes
# At any moment, the windows open are:
# - the current window
# - the previous window if still inside allowed lateness
windows_open = 1 + (allowed_lateness_min / 60) # ~1.27 → round up to 2
mem_per_window_mb = 200
total_mem_mb = windows_open * mem_per_window_mb # 400 MB
Step-by-step explanation.
- The p99 lateness is 8 minutes — that's the real answer to "how long do most events take?" Setting lateness at exactly 8 minutes drops 1% of events.
- Rule of thumb: 2× p99. 16-minute lateness covers ~99.9% of events. Most pipelines accept the residual 0.1% to the side-output channel.
- With 1-hour windows and 16-minute lateness, at most 2 windows are open simultaneously (the current one, and the previous one in lateness window).
- Memory cost =
windows_open × state_per_window = 2 × 200 MB = 400 MB. That's the per-key memory footprint; multiply by the number of GROUP BY keys. - If lateness were 60 minutes (one full window), 3 windows would be open at once, tripling the memory. Pick lateness deliberately; don't default to "an hour."
Output (sizing decision).
| Lateness | Windows open | Memory per key | Dropped event rate |
|---|---|---|---|
| 0 min | 1 | 200 MB | ~50% |
| 8 min (p99) | 1.13 → 2 | 400 MB | ~1% |
| 16 min (2× p99) | 1.27 → 2 | 400 MB | ~0.1% |
| 60 min | 2 | 400 MB | ~0.01% |
| 120 min | 3 | 600 MB | ~0.001% |
Rule of thumb. Pick allowed-lateness based on measured p99 event delay, not on a round number of minutes. Wider lateness has diminishing returns on dropped events but linear cost on memory. The 2×p99 rule is the operational sweet spot.
SQL interview question on watermarks and late data
A senior interviewer might frame this as: "You're running a 1-hour tumbling window aggregate on a stream where ~1% of events arrive more than 10 minutes late. Design the watermark and lateness configuration, and describe what happens to a 15-minute-late event."
Solution Using a p99-aligned watermark with side-output routing
CREATE TABLE events (
user_id BIGINT,
event_type STRING,
event_time TIMESTAMP(3),
WATERMARK FOR event_time AS event_time - INTERVAL '5' MINUTE
) WITH ('connector' = 'kafka', /* ... */);
CREATE VIEW hourly_event_counts AS
SELECT window_start, window_end, COUNT(*) AS cnt
FROM TABLE(TUMBLE(TABLE events, DESCRIPTOR(event_time), INTERVAL '1' HOUR))
GROUP BY window_start, window_end;
-- ALLOWED LATENESS '10' MINUTE in the application code
-- Side-output: late events go to topic 'events-late' for batch repair
Step-by-step trace.
| Step | Behaviour |
|---|---|
| Watermark | 5-minute lag from max event-time |
| Allowed lateness | 10 minutes past watermark |
| 1% of events arriving >10m late | routed to events-late side-output |
| Batch repair job | reads events-late daily, merges into the hourly table |
Output:
| Window | Streaming count | Side-output count | Reconciled count |
|---|---|---|---|
| 09:00-10:00 | 9,910 | 90 | 10,000 |
| 10:00-11:00 | 12,400 | 100 | 12,500 |
Why this works — concept by concept:
- Watermark = max event time − lag — the lag is the producer's promise. 5 minutes is the typical "covers 95% of events" choice; anything past that is side-output material.
- Allowed lateness is the stretch budget — 10 minutes past watermark catches ~99% of stragglers without breaking memory. The remaining 1% goes to a side-output topic.
- Side-output instead of drop — the late events are not lost; they are re-routed to a separate topic for batch repair. The hourly table is corrected by a daily reconciliation job.
- Window emits twice — once at watermark + lateness (the "streaming" emit), once during batch repair (the "corrected" emit). Downstream consumers either accept the first or wait for the second.
- Cost — watermark + lateness costs a small multiple of single-window state; side-output is one extra Kafka topic + a daily batch repair job. The total operational overhead is small compared to silently dropping 1% of events.
SQL
Topic — sliding window
Sliding window problems (SQL)
5. Reference architectures — Kafka → Flink, Kafka → Materialize, RisingWave end-to-end
Three canonical streaming-SQL stacks — same workload, very different operational profiles
The mental model in one line: Flink SQL is the highest-scale stack with the heaviest JVM ops burden, Materialize is the fastest-iteration Postgres-native stack, and RisingWave is the cheapest single-binary OSS stack — pick by team shape, scale, and procurement constraints. Once you can sketch all three from memory, the architecture-interview question "design a real-time fraud-detection pipeline" reduces to picking which of the three you'd defend.
Architecture 1 — Kafka → Flink SQL → sink (the classic).
- Source. Kafka topics; ingest via Debezium for CDC sources (Postgres, MySQL).
- Engine. Flink SQL with RocksDB state on TaskManager local disk; checkpoints to S3 every 60 seconds.
- Sink. Iceberg, Postgres, Kafka, or any JDBC. Two-phase commit between Flink checkpoint and sink for exactly-once.
- Strength. Petabyte-scale state, mature Iceberg sink, exactly-once end-to-end with 2PC.
- Weakness. JVM tuning burden (GC, heap, RocksDB tuning); Flink SQL job lifecycle (deploy, savepoint, restart); operational headcount.
Architecture 2 — Kafka → Materialize → Postgres-protocol consumers.
- Source. Kafka topics, Postgres logical replication, or both. Materialize ingests sources directly via SOURCE definitions.
- Engine. Materialize cluster (managed cloud or self-hosted); in-memory differential dataflow with S3 spill for cold data.
- Sink. Postgres-wire SELECTs from any BI tool, dashboard, or service. Optional Kafka / Webhook sinks for downstream fan-out.
- Strength. Fastest iteration; Postgres wire is a force multiplier; no JVM ops; sub-second freshness.
- Weakness. Single-region default; in-memory state caps the scale (~hundreds of GB practical limit); source-available BSL license is not pure OSS.
Architecture 3 — RisingWave end-to-end.
- Source. Native CDC (Postgres, MySQL), Kafka, Pulsar, Kinesis, all defined in one SQL block.
- Engine. RisingWave cluster with stateless compute nodes; state lives on S3 in tiered storage.
- Sink. Iceberg, Postgres, Kafka, plus the Postgres-wire query endpoint for direct reads.
- Strength. Apache-2.0 license; single binary deploys on EC2 with no Kafka requirement; cheap S3-tiered state; Postgres wire.
- Weakness. Youngest of the three (2024 maturity); ecosystem still catching up to Flink for some niche sinks.
The exactly-once contract across all three.
- Source. Must be replayable from a deterministic offset. Kafka is the canonical choice; Postgres logical replication is the runner-up.
- Engine. Must checkpoint state atomically. All three engines do.
- Sink. Must commit transactionally or be idempotent. Iceberg, transactional Kafka, idempotent JDBC inserts.
State backend choices.
- RocksDB on local disk (Flink). Fast random access, periodic snapshot to S3. Heavy disk requirements.
- In-memory with S3 spill (Materialize). Hot data in RAM, cold data on S3. Sub-millisecond hot reads.
- S3-tiered, all the time (RisingWave). Stateless compute, all state in S3. Slowest hot reads but cheapest.
Cost driver mental model. Streaming SQL cost = (state size × retention) + (compute hours × throughput) + ops headcount. Cap state with retention windows; choose the engine that minimises the dominant term for your workload.
Common interview probes on architecture.
- "Where does exactly-once break down?" — usually the sink. Audit the sink first.
- "How do you handle a Flink TaskManager OOM mid-window?" — checkpoint, savepoint, restart from the last checkpoint; the window state is restored.
- "Why would you choose Materialize over Flink for a 5k-event/sec workload?" — Postgres wire + no JVM ops + fastest iteration. The throughput is well within Materialize's comfort zone.
- "What's the operational cost of RisingWave's S3-tiered state?" — slightly higher hot-read latency in exchange for ~10× cheaper state storage. Acceptable for most workloads.
Worked example — fraud-detection pipeline with a 30-day session window on Flink SQL
Detailed explanation. A fraud-detection pipeline maintains a 30-day rolling session per user, computes a velocity score over the session, and emits a binary fraud flag for each new transaction. The state size (30 days × millions of users × per-user session record) drives the architecture decision.
Question. Sketch the Flink SQL pipeline. Identify the state size, the checkpoint interval, and how exactly-once is achieved end-to-end.
Input. Kafka topic transactions at 50k events/sec; ~10M distinct users; 30-day session retention.
Code.
-- Source declaration
CREATE TABLE transactions (
user_id BIGINT,
tx_id BIGINT,
amount DECIMAL,
event_time TIMESTAMP(3),
WATERMARK FOR event_time AS event_time - INTERVAL '30' SECOND
) WITH ('connector' = 'kafka', /* ... */);
-- 30-day session window per user
CREATE VIEW user_session_velocity AS
SELECT
user_id,
SESSION_START(event_time, INTERVAL '30' DAY) AS session_start,
COUNT(*) AS tx_count,
SUM(amount) AS total_amount,
AVG(amount) AS avg_amount
FROM transactions
GROUP BY user_id, SESSION(event_time, INTERVAL '30' DAY);
-- Fraud flag — joined on the fly with new transactions
CREATE TABLE fraud_flags (
tx_id BIGINT, user_id BIGINT, flag BOOLEAN, reason STRING
) WITH ('connector' = 'iceberg', /* ... */);
INSERT INTO fraud_flags
SELECT
t.tx_id, t.user_id,
CASE WHEN s.tx_count > 1000 OR t.amount > 10 * s.avg_amount
THEN TRUE ELSE FALSE END AS flag,
CASE WHEN s.tx_count > 1000 THEN 'velocity'
WHEN t.amount > 10 * s.avg_amount THEN 'amount-anomaly'
ELSE 'ok' END AS reason
FROM transactions t
JOIN user_session_velocity s
ON t.user_id = s.user_id;
Step-by-step explanation.
-
State size. 10M users × ~500 bytes per session row = ~5 GB. Plus the streaming join state on
transactions × user_session_velocitykeyed by user_id (~10 GB). Total ~15 GB. - Checkpoint interval. Set to 60 seconds. Flink writes the RocksDB state delta to S3 every minute; on recovery, it restores from the last checkpoint and replays Kafka from the saved offset.
- Exactly-once. Source (Kafka, replayable) + engine (60-second 2PC checkpoint) + sink (Iceberg, transactional commit per checkpoint). All three layers cooperate.
- Why Flink and not Materialize. 15 GB state and 50k events/sec is at the upper edge of Materialize's comfort zone; Flink handles this comfortably with a 4-node cluster.
- Why Flink and not RisingWave. RisingWave could handle it too; choice often comes down to existing JVM ops vs OSS-only preference. Both are defensible.
Output (per second of throughput).
| Metric | Value |
|---|---|
| Throughput | 50k events/sec |
| State size | ~15 GB on RocksDB + S3 |
| Checkpoint interval | 60 seconds |
| Per-event end-to-end latency | ~100 ms |
| Cluster cost | ~$1,200/month (managed Flink) |
Rule of thumb. For petabyte-scale, multi-TB-state streaming SQL, Flink is the safe default. Materialize and RisingWave are catching up but Flink has the operational maturity story and the exactly-once-to-Iceberg sink that fraud-detection auditors actually trust.
Worked example — same fraud pipeline on Materialize
Detailed explanation. Materialize handles the same workload at lower throughput with much faster iteration. The architectural difference: no JVM ops, Postgres wire endpoint, in-memory state with S3 spill for cold data.
Question. Convert the same fraud-detection pipeline to Materialize. Identify what changes in the architecture and where the limits lie.
Input. Same Kafka topic at a lower throughput (5k events/sec — 10× smaller); ~1M distinct users; 30-day session retention.
Code.
-- Source — Materialize's Kafka source
CREATE SOURCE transactions FROM KAFKA BROKER 'kafka:9092' TOPIC 'transactions'
FORMAT JSON
INCLUDE TIMESTAMP AS event_time;
-- Materialized view of 30-day per-user session
CREATE MATERIALIZED VIEW user_session AS
SELECT
user_id,
COUNT(*) AS tx_count,
SUM(amount) AS total_amount,
AVG(amount) AS avg_amount
FROM transactions
WHERE event_time > mz_now() - INTERVAL '30 days'
GROUP BY user_id;
-- Fraud view — joined on the fly
CREATE MATERIALIZED VIEW fraud_flags AS
SELECT
t.tx_id, t.user_id,
(s.tx_count > 1000 OR t.amount > 10 * s.avg_amount) AS flag
FROM transactions t
JOIN user_session s ON t.user_id = s.user_id
WHERE t.event_time > mz_now() - INTERVAL '1 hour';
-- Consumer: Grafana, Slack bot, internal API
-- all read via SELECT * FROM fraud_flags
Step-by-step explanation.
-
No external sink required for dashboard consumers — they SELECT directly from
fraud_flags. The streaming view is the read endpoint. -
State size. 1M users × ~500 bytes = ~500 MB. Comfortably fits in a
smallcluster's RAM. - Throughput. 5k events/sec is well within Materialize's comfort zone; even a small cluster can sustain it.
- Iteration speed. The view definition is a single SQL statement that can be re-created in seconds. No Flink job lifecycle, no savepoint, no JVM restart.
- The throughput ceiling. Around 100k events/sec on a large cluster; past that, you need Flink or RisingWave.
Output (operational profile).
| Metric | Value |
|---|---|
| Throughput | 5k events/sec |
| State size | ~500 MB in-memory |
| End-to-end latency | ~10 ms |
| Cluster cost | ~$400/month (managed Materialize) |
| Iteration speed | seconds to re-create a view |
Rule of thumb. When the workload fits Materialize's throughput / state ceiling, choose Materialize for the iteration speed alone. The team velocity gain (no JVM ops, Postgres wire, view-as-endpoint) compounds over months.
Worked example — same fraud pipeline on RisingWave
Detailed explanation. RisingWave splits the difference: Apache-2.0 license, S3-tiered state, single binary, Postgres wire. The same query shape as Materialize, but the state lives in S3, so the throughput ceiling is higher and the storage cost is near-zero.
Question. Convert the pipeline to RisingWave. Identify the architectural difference vs Materialize and where RisingWave is the better choice.
Input. Same Kafka topic at 20k events/sec; ~5M distinct users; 30-day session retention.
Code.
-- Source with watermark
CREATE SOURCE transactions (
user_id BIGINT, tx_id BIGINT, amount DECIMAL, event_time TIMESTAMP,
WATERMARK FOR event_time AS event_time - INTERVAL '30 SECONDS'
) WITH (connector = 'kafka', topic = 'transactions',
properties.bootstrap.server = 'kafka:9092')
ROW FORMAT JSON;
-- Materialized view: 30-day session per user
CREATE MATERIALIZED VIEW user_session AS
SELECT
user_id,
COUNT(*) AS tx_count,
SUM(amount) AS total_amount,
AVG(amount) AS avg_amount
FROM transactions
WHERE event_time > NOW() - INTERVAL '30 DAYS'
GROUP BY user_id;
-- Iceberg sink for the fraud flags
CREATE SINK fraud_sink FROM (
SELECT t.tx_id, t.user_id,
(s.tx_count > 1000 OR t.amount > 10 * s.avg_amount) AS flag
FROM transactions t
JOIN user_session s ON t.user_id = s.user_id
WHERE t.event_time > NOW() - INTERVAL '1 HOUR'
) WITH (connector = 'iceberg', /* ... */);
Step-by-step explanation.
- State backend. All state on S3, accessed via tiered storage. 5M users × 500 bytes = ~2.5 GB on S3 at ~$0.06/month storage. Compare with Materialize's RAM-bound cost.
- Compute. Stateless compute nodes; horizontal scale by adding more compute pods. Linear scaling up to ~hundreds of thousands of events/sec.
- Exactly-once. Source replayable + S3 checkpoint + Iceberg transactional sink = exactly-once end-to-end. Identical contract to Flink.
- The hot-read latency cost. S3-tiered state has slightly higher hot-read latency than in-memory (Materialize) or local-RocksDB (Flink). For most use cases, the cost is invisible.
- The license advantage. Apache-2.0 is the procurement-friendliest license; the binary can be vendored, forked, or self-hosted indefinitely.
Output (operational profile).
| Metric | Value |
|---|---|
| Throughput | 20k events/sec |
| State size | ~2.5 GB on S3 (~$0.06/month) |
| End-to-end latency | ~50 ms |
| Cluster cost | ~$600/month (self-hosted on EC2) |
| Iteration speed | seconds to re-create a view |
Rule of thumb. RisingWave is the 2026 OSS default for new deployments — cheap state, Apache-2.0, single binary, Postgres wire. Pick Flink only if you need exactly-once-to-Iceberg at petabyte scale, and Materialize only if you need absolute lowest iteration cycle on a single team.
SQL interview question on streaming architecture
A senior interviewer often frames this as: "Design a real-time anomaly-detection pipeline for ride-sharing surge pricing. Input is 80k events/sec; freshness target is 1 second; team has 8 engineers, no JVM background; consumer is a Grafana dashboard plus a downstream Kafka topic. Sketch the architecture and justify the engine."
Solution Using RisingWave end-to-end with a Postgres-wire read and a Kafka sink
┌────────┐ ┌────────────────────────────────┐ ┌──────────┐
│ Kafka │ ───▶│ RisingWave cluster (Rust) │ ───▶│ Iceberg │
│ rides │ │ • source: rides + driver_state│ │ flags │
└────────┘ │ • MV: rolling_5min_velocity │ └──────────┘
│ • MV: surge_flag │ ┌──────────┐
│ • S3-tiered state backend │ ───▶│ Grafana │
│ • sinks: Iceberg + Kafka │ │ (psql) │
└────────────────────────────────┘ └──────────┘
│
└───▶ Slack alert via Kafka consumer
Step-by-step trace.
| Constraint | Decision |
|---|---|
| 80k events/sec | RisingWave scales horizontally; well within capacity |
| 1 s freshness | RisingWave MV freshness = single-digit ms |
| No JVM background | RisingWave is Rust; no JVM ops |
| Grafana consumer | RisingWave speaks Postgres wire → direct connection |
| Downstream Kafka topic | RisingWave's Kafka sink with transactional producer |
| Single binary | RisingWave deploys as one binary on EC2 |
Output:
| Layer | Choice | Why |
|---|---|---|
| Source | Kafka rides topic |
replayable, standard |
| Engine | RisingWave cluster (3-node) | scale + OSS + Postgres wire |
| State | S3-tiered | cheap, decoupled compute |
| Sink 1 | Iceberg (historical flags) | mature; exactly-once via 2PC |
| Sink 2 | Kafka (real-time alerts) | transactional producer |
| Read | Grafana via Postgres wire | direct, no extra system |
Why this works — concept by concept:
- RisingWave for non-JVM teams — the 8-engineer team has no JVM ops experience. Flink would consume ~25% of their bandwidth on tuning; RisingWave is a single Rust binary.
- Postgres wire to Grafana — the consumer plugs straight into the engine. No intermediate database, no extra sink to operate.
-
Kafka transactional sink for alerts — the downstream Slack bot consumes from Kafka with
isolation.level=read_committed, getting exactly-once semantics for the alert side. - S3-tiered state — 80k events/sec × 1M unique riders × 30-day retention is well past in-memory; S3 storage at ~$0.5/month is irresistible.
- Cost — 3-node RisingWave cluster on EC2 (~$900/month) + S3 state (~$0.5/month) + Kafka cluster (~$300/month) + Iceberg on S3 (~$5/month). Total ~$1,200/month for the full pipeline. Flink would be roughly the same compute cost plus the headcount overhead of JVM ops.
SQL
Topic — joins
Streaming join problems (SQL)
Cheat sheet — streaming SQL recipes
-
Always declare event-time + watermark on every source.
WATERMARK FOR event_time AS event_time - INTERVAL '30 SECONDS'. Without it, you implicitly use processing-time and lose correctness under reordering. -
Use exactly-once sinks. Kafka transactional producer + read-committed consumers; idempotent JDBC inserts (
ON CONFLICT DO NOTHING); Iceberg with two-phase commit. - Prefer IVM-friendly operators. SELECT, WHERE, INNER / LEFT JOIN on equality keys, GROUP BY on bounded-cardinality columns. Avoid UDFs in hot paths.
-
Bound state with retention. Every GROUP BY without
WHERE event_time > now() - INTERVAL '...'is a candidate for unbounded state growth. Cap deliberately. - For dashboards, expose views via Postgres wire. Materialize (port 6875) and RisingWave (port 4566) both speak it natively; any BI tool plugs in unchanged.
-
Snapshot state regularly. Flink:
execution.checkpointing.interval = 60s. Materialize: managed by the engine. RisingWave: configurable per cluster. - Treat the streaming engine as a database, not a job runner. Materialized views are tables; clients SELECT them. Don't push the engine into a "Kafka producer that runs SQL" mental model.
- Configure idle-source detection on every input source. Multi-source joins stall when any input goes silent without idle detection.
- Set allowed lateness at ~2× p99 event delay. Route the residual to a side-output topic; reconcile in a daily batch job.
- Audit the cardinality of every GROUP BY column. Group-by-user-id can produce 100M-row state; group-by-region produces a 250-row state. Check before deploying.
-
Use
IS DISTINCT FROMfor NULL-safe join keys. NULL = NULL is NULL in three-valued logic; the NULL-safe operator returns TRUE / FALSE. -
For Flink → Iceberg, use the 2PC commit. Configure
EXACTLY_ONCEcheckpoint mode and Iceberg'scommit-protocol = 2pc. Both sides must cooperate. -
For Materialize / RisingWave, use a small cluster first. Iterate on the view definitions in a
smallcluster, then scale up only when you've validated correctness. - Treat ksqlDB as legacy. Confluent's 2025 deprecation pushed it to maintenance mode. New projects: Materialize, RisingWave, or Flink.
Frequently asked questions
Materialize vs RisingWave — which should I pick in 2026?
Pick Materialize when you want the fastest iteration cycle on a Postgres-wire dashboard, your team is small (1-5 engineers), and your throughput is under 50k events/sec — its in-memory differential dataflow is the gold standard for sub-second freshness. Pick RisingWave when you need an OSS Apache-2.0 license, S3-tiered cheap state, single-binary deployment, or your throughput pushes past 50k events/sec. RisingWave is the safer procurement choice; Materialize is the faster iteration choice.
Is Flink SQL still relevant if Materialize and RisingWave exist?
Yes — Flink SQL is still the right call for petabyte-scale stateful pipelines, end-to-end exactly-once to Iceberg, and JVM-fluent platform teams. The exactly-once contract via two-phase commit between Flink checkpoints and the Iceberg sink is the most mature in the ecosystem, which matters for audit-sensitive workloads (fraud, finance, ad-fraud). The JVM ops burden is real and accelerates the headcount cost, but for the right workload Flink is still the lowest-risk choice.
Can I do streaming SQL without Kafka?
Yes. Materialize ingests directly from Postgres logical replication, S3, or webhook sources — no Kafka required. RisingWave has native CDC connectors for Postgres and MySQL. Flink has connectors for files, Pulsar, Kinesis, and CDC sources. Kafka is the most common source because it is replayable and durable, but every modern streaming engine can ingest from non-Kafka sources for smaller deployments or specific use cases.
What's the difference between watermarks and lateness?
A watermark is the producer's promise: "I will not emit any event with timestamp < W." It paces the engine's notion of "now." Lateness is the consumer's tolerance: "I will keep the window open for L seconds past the watermark to admit straggler events." Watermark is set on the source; lateness is set on the window operator. Together they let you balance freshness (tighter watermark) against completeness (wider lateness) — there is no free lunch, and the optimal pair is workload-specific.
How do streaming engines deliver exactly-once?
Exactly-once is a three-way contract among source, engine, and sink. The source must be replayable from a deterministic offset (Kafka, CDC). The engine must checkpoint state atomically (Flink uses 2PC across operators; Materialize uses transactional dataflow; RisingWave checkpoints to S3 with a coordinator). The sink must commit transactionally or be idempotent (Iceberg 2PC, transactional Kafka producer, idempotent JDBC). If any of the three links is missing, the pipeline collapses to at-least-once. Audit each layer separately before claiming exactly-once to a stakeholder.
Is streaming SQL cheaper than batch for my workload?
It depends on state size, retention window, and throughput. For sub-second freshness on small-to-medium state (under ~100 GB), streaming SQL is dramatically cheaper than batch — the IVM cost model is O(delta) per event vs O(N) per batch. For huge state (TB+) and relaxed freshness (hourly is fine), batch on Spark or dbt is usually cheaper because the per-batch cost amortises across many events. The rule of thumb: freshness target under 1 minute → streaming SQL; over 1 hour → batch; in between → measure both.
Practice on PipeCode
- Drill the streaming practice library → for the watermark / event-time / late-data family of probes.
- Rehearse on real-time analytics problems → when the interviewer wants sub-second SLAs and Postgres-wire reads.
- Sharpen sliding window drills → for the tumbling vs hopping vs session window patterns.
- Layer the aggregation library → for the GROUP BY + IVM cost model.
- Stack the joins practice library → for streaming join state size and multi-source watermark alignment.
- For the broader surface, read top data engineering interview questions →.
- Stack the prerequisites with the only 5 skills you need to become a data engineer →.
- Sharpen the dialect axis with the SQL for data engineering interviews course →.
- For long-form schema craft, work through data modelling for DE interviews →.
Pipecode.ai is Leetcode for Data Engineering — every streaming-SQL recipe above ships with hands-on practice rooms where you write the watermark declaration, the IVM-friendly aggregation, and the exactly-once sink contract against real graded inputs. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so you never have to wonder whether your fix to a watermark stall actually behaves the same on Materialize as on Flink.