SQL Subqueries: Correlated, Scalar, Derived Tables & EXISTS
The sql subquery is the single SQL primitive that splits junior candidates from senior ones in a deep-dive interview. Anyone can write a one-table SELECT; only candidates who genuinely understand subqueries in sql can decide between a correlated subquery, a scalar subquery embedded in a SELECT clause, a derived table in the FROM clause, or an EXISTS semi-join — and then explain why their choice is the right one for the row count and the optimizer plan they expect.
This guide walks through every shape of nested query sql that shows up in FAANG and analytics-heavy interview loops. Each section ships a teaching block plus a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works. By the end you'll be able to defend a derived-table aggregation, debug a NOT IN that silently returns zero rows, rewrite a correlated subquery as a JOIN for performance, and pick the right operator from the exists sql / IN / NOT EXISTS / NOT IN four-quadrant matrix without hesitation.
When you want hands-on reps immediately after reading, drill the subqueries practice library →, then sharpen SQL subquery problems →, and stack the prerequisites with joins practice →.
On this page
- Why subqueries are the senior-round signal
- Scalar subqueries — one value, NULL gotchas
- Derived tables — FROM clause subqueries
- Correlated subqueries — referencing the outer query
- EXISTS / NOT EXISTS / IN / NOT IN — four ways to ask "is there?"
- Cheat sheet — subquery patterns
- Frequently asked questions
- Practice on PipeCode
1. Why subqueries are the senior-round signal
A sql subquery is just a SELECT inside another SELECT — but the shape decides everything
The one-sentence invariant: a subquery is a SELECT statement nested inside another statement; its position in the parent — SELECT list, FROM clause, WHERE clause, or HAVING clause — decides how the optimizer evaluates it and what shape it must return. Once you internalise that, the rest of the subqueries in sql family becomes a routing decision: scalar (one value), derived (a table), correlated (per-row), or existential (a boolean test).
The four shapes every interview probes.
-
Scalar subquery. Returns exactly one row × one column. Used wherever a single value is expected — a
SELECTcolumn, aWHEREcomparison operand, or aHAVINGpredicate. Returning zero rows quietly producesNULL; returning more than one row is a runtime error. -
Derived table (FROM-clause subquery). Returns a multi-row, multi-column result that the outer query treats as a virtual table. Often the cleanest way to compute aggregates before joining. Semantically equivalent to a CTE (
WITH), just inlined. -
Correlated subquery. References columns from the outer query inside the inner query. Logically runs once per outer row — though most modern optimizers rewrite it as a
JOINor hash semi-join under the hood. -
Existential subquery. Tests whether a related row exists —
EXISTS,NOT EXISTS,IN,NOT IN. The boolean check is the only output; the inner query'sSELECTlist is ignored forEXISTS.
Why interviewers love the topic.
- It's a plan-reading test in disguise. Knowing when an
EXISTSshort-circuits vs anINmaterialises tells the interviewer you've read execution plans, not just SQL textbooks. - It's a NULL test.
NOT INwith aNULLin the list silently returns zero rows. Candidates who don't anticipate it ship production-breaking SQL. - It's a trade-off test. The same query can usually be written three ways — subquery,
JOIN, CTE — and only candidates who can compare the three confidently advance.
Reading a subquery from the inside out.
- Start with the inner query — what does it return? One value? A list? A table? A boolean?
- Look at where the inner sits inside the outer —
SELECT,FROM,WHERE, orHAVING. - Ask yourself: does the inner reference any outer columns? If yes, it's correlated and you should expect per-row semantics in the logical model.
- Finally, ask: can this be rewritten as a
JOINor a CTE? Usually the answer is yes, and a senior interviewer will probe both rewrites.
Subquery → JOIN → CTE: same logic, three syntaxes.
- Subquery shape. Compact, inline, but easy to mis-read when nested deeply.
- JOIN shape. Forces you to think in sets; usually the fastest plan for large inputs.
- CTE shape. Most readable for multi-step pipelines; modern planners typically inline it.
The point: the logic of "customers with above-average orders" is one logical question. Whether you express it as a correlated subquery, a derived table joined to the customers table, or a CTE followed by a join is a style decision — and an optimizer hint, on some platforms.
Worked example — translate "customers who placed any order in 2026" three ways
Detailed explanation. A junior candidate writes this as a JOIN and de-duplicates with DISTINCT. A mid-level candidate writes it as IN. A senior candidate writes EXISTS and explains why it's the most defensible choice — short-circuits on the first match, NULL-safe, and rewrites to a semi-join under the hood.
Question. Given customers(customer_id, name) and orders(order_id, customer_id, order_date), list every customer who placed at least one order in 2026. Show three equivalent queries — JOIN, IN, EXISTS — and explain when each shines.
Input.
| customer_id | name |
|---|---|
| 1 | Alice |
| 2 | Bob |
| 3 | Charlie |
| 4 | Dana |
| order_id | customer_id | order_date |
|---|---|---|
| 1001 | 1 | 2026-01-15 |
| 1002 | 1 | 2026-02-10 |
| 1003 | 2 | 2025-12-30 |
| 1004 | 3 | 2026-04-01 |
Code (three equivalent queries).
-- (a) JOIN + DISTINCT
SELECT DISTINCT c.customer_id, c.name
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
WHERE o.order_date >= '2026-01-01'
AND o.order_date < '2027-01-01';
-- (b) IN with subquery
SELECT customer_id, name
FROM customers
WHERE customer_id IN (
SELECT customer_id
FROM orders
WHERE order_date >= '2026-01-01'
AND order_date < '2027-01-01'
);
-- (c) EXISTS (the senior signal)
SELECT customer_id, name
FROM customers c
WHERE EXISTS (
SELECT 1
FROM orders o
WHERE o.customer_id = c.customer_id
AND o.order_date >= '2026-01-01'
AND o.order_date < '2027-01-01'
);
Step-by-step explanation.
-
JOIN + DISTINCT — explodes the customer row once per matching order, then collapses with
DISTINCT. Wasted work when a customer has many orders. Also accidentally double-counts in aggregations if you forget theDISTINCT. -
IN with subquery — the planner materialises the inner result as a list of
customer_ids, then probes each outer row against the list. Fine for small lists; awkward when the inner side has millions of rows. -
EXISTS — the planner short-circuits on the first matching order per customer. Modern planners typically rewrite
INandEXISTSas a SEMI JOIN, butEXISTSis the more declarative version of "is there a match?" — and it's safer withNULLs (more on that in §5).
Output.
| customer_id | name |
|---|---|
| 1 | Alice |
| 3 | Charlie |
Rule of thumb. When the question is "does a matching row exist?", reach for EXISTS. When the question is "what are the values?", reach for a JOIN and shape the projection.
Worked example — reading a nested query from the inside out
Detailed explanation. A senior interviewer might hand you a five-line nested query and ask "what does this return, and in what order does the engine evaluate the pieces?" The right move is to read inside-out, narrating each level. This shape recurs in every senior-round whiteboard exercise on subqueries.
Question. Walk through how the engine evaluates the following query. Identify the inner-most subquery, the middle subquery, and the outer query, and state the cardinality each level returns.
Input — payments (compact sample).
| payment_id | customer_id | amount | status | paid_at |
|---|---|---|---|---|
| 1 | 42 | 120 | paid | 2026-04-12 |
| 2 | 42 | 80 | paid | 2026-04-18 |
| 3 | 43 | 200 | failed | 2026-04-20 |
| 4 | 44 | 60 | paid | 2026-05-01 |
Code.
SELECT customer_id, total_paid
FROM (
SELECT customer_id,
SUM(amount) AS total_paid
FROM payments
WHERE paid_at >= (
SELECT MIN(paid_at)
FROM payments
WHERE status = 'paid'
)
AND status = 'paid'
GROUP BY customer_id
) AS by_customer
WHERE total_paid > 100;
Step-by-step explanation.
-
Inner-most subquery.
SELECT MIN(paid_at) FROM payments WHERE status = 'paid'— uncorrelated scalar; returns one row × one column ='2026-04-12'. The planner evaluates it once. -
Middle subquery (derived table
by_customer). Scanspayments, filters to rows withpaid_at >= '2026-04-12'ANDstatus = 'paid', groups bycustomer_id, projects(customer_id, total_paid). Cardinality = one row per distinct paying customer. -
Outer query. Reads
by_customerand keeps rows wheretotal_paid > 100. Final cardinality = customers whose total paid amount exceeds 100. - Why inside-out? SQL is declarative, but the execution still flows from innermost dependency to outermost projection. Reading inside-out gives you the engine's plan-shape mental model.
Output.
| customer_id | total_paid |
|---|---|
| 42 | 200 |
Rule of thumb. When you stare at a nested query, find the innermost subquery first, write down what it returns (value, list, table), then move outward. The shape at each level constrains the syntax of the next.
SQL interview question on translating subquery to JOIN
A common senior probe: "You've written this correlated subquery and the planner shows nested-loop semantics; the table has 50 million rows. How would you rewrite it for the hash-join path?" — testing whether the candidate spots the rewrite from correlated subquery into a derived-table JOIN on a grouped aggregate.
Solution Using a correlated subquery rewritten as a JOIN
-- Original (correlated subquery — logically per-row)
SELECT c.customer_id, c.name
FROM customers c
WHERE EXISTS (
SELECT 1
FROM orders o
WHERE o.customer_id = c.customer_id
AND o.order_date >= '2026-01-01'
);
-- Rewritten (semi-join via INNER JOIN on a grouped derived table)
SELECT c.customer_id, c.name
FROM customers c
JOIN (SELECT DISTINCT customer_id
FROM orders
WHERE order_date >= '2026-01-01') o
ON o.customer_id = c.customer_id;
Step-by-step trace.
| Step | Stage | What happens |
|---|---|---|
| 1 | Inner derived table o
|
scans orders, filters to 2026+, keeps distinct customer_id
|
| 2 | Hash build | planner builds a hash table on o.customer_id (size = distinct active customers) |
| 3 | Hash probe | scans customers, probes each c.customer_id against the hash |
| 4 | Project | emits c.customer_id, c.name for every probe hit |
Output:
| customer_id | name |
|---|---|
| 1 | Alice |
| 3 | Charlie |
Why this works — concept by concept:
- Semi-join semantics — both shapes ask the same question ("does a matching row exist?"). The rewrite makes the hash-join plan explicit so the planner doesn't have to prove it can rewrite.
-
Distinct in the inner — collapses
ordersto one row per active customer, which keeps the join 1:1 and removes the need forDISTINCTon the outer projection. -
Hash join over nested loop — for a large
orderstable, hash-join is O(N + M) versus the nested-loop O(N × matching_M). The rewrite makes the plan deterministic across platforms. -
Modern planner caveat — Postgres, BigQuery, and Snowflake will often rewrite the original
EXISTSto the same hash semi-join automatically. Older MySQL versions (and some constrained planners) will not — which is when the manual rewrite is worth the extra lines. - Cost — derived table = O(orders) scan + O(distinct customers) hash build; outer probe = O(customers); total = O(orders + customers).
SQL
Topic — subqueries
Subquery problems (SQL)
The interview-grade subquery taxonomy
Before diving into the four shapes, it pays to internalise the taxonomy interviewers use when they grade you. Senior data-engineering rounds at FAANG, fintech, and analytics-heavy shops categorise subquery answers along three axes:
-
Result cardinality. Does the subquery return one value (scalar), one column with many rows (column subquery /
IN), or a multi-column table (derived)? Each cardinality constrains where the subquery is allowed to appear. - Correlation. Does the inner reference an outer column? If yes, the logical model is "evaluate inner per outer row"; if no, the inner is independent and the engine evaluates it once.
- Plan shape. What physical operator does the optimizer emit? A hash semi-join? A nested-loop lookup? A materialised CTE? A constant-folded literal? Interviewers reading your answer at the whiteboard expect you to name the physical operator, not just the syntax.
Worked translation table — the same logical question, four syntactic shapes.
| Question | Subquery shape | JOIN equivalent | Window-function equivalent |
|---|---|---|---|
| Customers with above-average orders | Correlated WHERE x > (SELECT AVG ...)
|
JOIN (GROUP BY customer) then filter |
AVG OVER (PARTITION BY customer) |
| Top-N per group | Derived table + ROW_NUMBER
|
self-join with rank predicate (slow) | ROW_NUMBER OVER (PARTITION BY group) |
| Customer's most recent order | Correlated scalar (SELECT MAX...)
|
JOIN (MAX GROUP BY customer) |
ROW_NUMBER OVER (... ORDER BY DESC) |
| Customers with at least one order | EXISTS |
INNER JOIN ... DISTINCT |
— |
| Customers with no orders | NOT EXISTS |
LEFT JOIN ... WHERE NULL |
— |
The exam-ready takeaway: all five rows are the same logical query expressed four different ways. A senior candidate can pick whichever is most readable for the audience and the engine, and explain why the plans converge (or don't).
Where each shape wins.
- Correlated subquery — most English-like expression; safe choice when the dataset is small.
- Derived table — wins for "aggregate then filter / join" pipelines and for window-then-filter.
- CTE — wins for multi-stage pipelines (3+ stages) where named stages improve readability; same plan as derived table on modern engines.
- JOIN — wins for large tables where the planner needs an obvious hash-join target.
- Window function — wins for per-group ranks, running totals, and shifted comparisons.
The wrong choice is rarely "wrong" syntactically — it's "wrong" because it produces a plan the engine can't optimise. That distinction is what senior rounds test.
2. Scalar subqueries — one value, NULL gotchas
A scalar subquery returns exactly one row × one column — the simplest shape, the most NULL traps
The mental model in one line: a scalar subquery is a SELECT that the SQL engine treats as a single value; if it returns zero rows you silently get NULL, if it returns more than one row you get a runtime error. Senior interviewers love this shape because it is deceptively simple — the gotchas are at the boundaries (zero rows, more-than-one row, and the column-expression vs predicate distinction).
Where a scalar subquery is allowed.
-
SELECT clause. As a column expression — e.g.
SELECT name, (SELECT MAX(price) FROM products) AS top_price FROM customers. -
WHERE clause. As an operand of a comparison — e.g.
WHERE salary > (SELECT AVG(salary) FROM emp). -
HAVING clause. Similarly —
HAVING SUM(amount) > (SELECT AVG(total) FROM totals). - VALUES / UPDATE / INSERT. Anywhere SQL expects a single value.
The three failure modes.
- Exactly one row × one column. The happy path. SQL substitutes the value as if you had written a literal.
-
Zero rows. Silently substitutes
NULL. This is the most common source of bugs — a filter that should match never does, or an aggregate that should subtract zero subtractsNULLand corrupts the whole expression. -
More than one row. Most engines raise
subquery returned more than 1 row(Postgres) orSubquery returned more than 1 value(SQL Server). Some MySQL versions return the first row silently, which is even worse — the query "works" until the data shape changes.
Cardinality enforcement keywords.
- Postgres / Oracle / SQL Server enforce the one-row rule at runtime — if the subquery returns >1 row the engine raises.
- MySQL: ANSI-mode raises; legacy mode returns the first row. Never rely on legacy mode.
- BigQuery: enforced; the optimizer often errors at plan time if it can prove >1 row.
Tuning intuition — when scalar subqueries hurt.
- A scalar subquery in the
SELECTclause can run once per outer row if the planner cannot prove it's constant. Wrap it in a CTE or a derived table to force single evaluation. - A scalar subquery in a
WHEREclause is usually evaluated once and treated as a constant. - If the subquery references the outer query, it's no longer scalar in the strict sense — it's a correlated subquery whose result per row happens to be scalar (see §4).
Worked example — embed a scalar subquery in the SELECT clause
Detailed explanation. A reporting query wants every order alongside the maximum price across all orders. The scalar subquery gives a single value that the planner can hoist out of the outer scan and reuse for every row.
Question. Given an orders table, return every order with the global maximum price displayed alongside the order's own price. Compare a scalar-subquery approach to a window-function approach.
Input.
| order_id | price |
|---|---|
| 1001 | 19.99 |
| 1002 | 49.50 |
| 1003 | 12.00 |
| 1004 | 88.40 |
Code (scalar subquery).
SELECT order_id,
price,
(SELECT MAX(price) FROM orders) AS max_price
FROM orders;
Step-by-step explanation.
- The planner sees the inner
SELECT MAX(price) FROM ordersis uncorrelated — it doesn't reference the outer row. - It runs once, returns the single value
88.40, and caches it as a constant. - The outer scan returns every row from
ordersand projects(order_id, price, 88.40). - Total work = one extra scan for the inner (a single aggregate pass) + the outer scan. No per-row penalty.
Output.
| order_id | price | max_price |
|---|---|---|
| 1001 | 19.99 | 88.40 |
| 1002 | 49.50 | 88.40 |
| 1003 | 12.00 | 88.40 |
| 1004 | 88.40 | 88.40 |
Rule of thumb. When the scalar is constant across the whole result set, the scalar subquery and a window function (MAX(price) OVER ()) generate the same plan. When the scalar varies per row, the subquery is correlated and the right tool is usually a window function partitioned by the grouping column.
Worked example — the zero-row NULL trap
Detailed explanation. A scalar subquery that filters by a non-matching key silently produces NULL. If you then use that NULL in an arithmetic expression, the whole expression collapses to NULL and downstream consumers may treat it as "no value" or as zero — both wrong.
Question. A team queries the latest discount for customer 999 — a customer who has no discount record. What does the result look like and why is it dangerous?
Input.
| customer_id | discount_pct |
|---|---|
| 1 | 10 |
| 2 | 5 |
| 3 | 15 |
Code.
SELECT 100 - (SELECT discount_pct FROM discounts WHERE customer_id = 999) AS net_pct;
Step-by-step explanation.
- The inner subquery searches
discountsforcustomer_id = 999. - No row matches → the inner returns zero rows.
- Per ANSI rules, the scalar value is
NULL. - The outer expression
100 - NULLisNULL(any arithmetic withNULLisNULL). - The application reads
NULLas "no discount applied" — silently selling at full price when the business logic expected an error.
Output.
| net_pct |
|---|
| NULL |
The fix — COALESCE or an explicit EXISTS guard.
SELECT 100 - COALESCE(
(SELECT discount_pct FROM discounts WHERE customer_id = 999),
0
) AS net_pct;
Output after fix.
| net_pct |
|---|
| 100 |
Rule of thumb. Every scalar subquery in a SELECT or arithmetic context should either be guaranteed to return one row (by data integrity constraint) or wrapped in COALESCE with a sensible default. Don't trust silent NULL to mean what your business logic expects.
Worked example — the more-than-one-row runtime error
Detailed explanation. The other failure mode of a scalar subquery is "returned more than 1 row." Junior engineers often write subqueries that happen to return one row in dev because the dataset is small, then watch them blow up in prod when the data shape changes. Senior interviewers test whether you can spot the latent risk before it ships.
Question. A team writes the query below to label each order with the customer's email. In dev with 10 rows it works; in prod it raises subquery returned more than 1 row. Find the bug and propose a fix.
Input — customers.
| customer_id | |
|---|---|
| 1 | alice@x.com |
| 1 | alice.duplicate@x.com |
| 2 | bob@x.com |
Input — orders.
| order_id | customer_id |
|---|---|
| 1001 | 1 |
| 1002 | 2 |
Code (the trap).
SELECT o.order_id,
(SELECT email FROM customers c WHERE c.customer_id = o.customer_id) AS email
FROM orders o;
Step-by-step explanation.
- For
order_id = 1001, the inner subquery scanscustomersforcustomer_id = 1and finds two rows (alice@x.comandalice.duplicate@x.com). - Postgres raises
ERROR: more than one row returned by a subquery used as an expression. - The root cause is a missing uniqueness constraint on
customers.customer_id. The data model assumed one row per customer; the data violated the assumption. - The right fix is twofold: (a) add the constraint at the schema level to prevent recurrence; (b) until the schema is fixed, defensively wrap the subquery with
LIMIT 1and an explicitORDER BYso the chosen row is deterministic.
The defensive fix.
SELECT o.order_id,
(SELECT email
FROM customers c
WHERE c.customer_id = o.customer_id
ORDER BY c.created_at DESC
LIMIT 1) AS email
FROM orders o;
Output after fix.
| order_id | |
|---|---|
| 1001 | alice.duplicate@x.com |
| 1002 | bob@x.com |
Rule of thumb. A scalar subquery in SELECT is a cardinality contract. If you can't prove the inner returns exactly one row, either add the schema constraint or add LIMIT 1 + ORDER BY to make the choice deterministic.
SQL interview question on scalar subquery vs window function
A common probe: "Return every employee with their salary and the company-wide average salary, in one query. Two approaches?" — testing whether the candidate recognises that the scalar-subquery form is fine when the aggregate is constant, but reaches for a window function when partition-aware aggregates are needed.
Solution Using a scalar subquery and a window function side by side
-- (a) Scalar subquery — global average
SELECT e.emp_id,
e.name,
e.salary,
(SELECT AVG(salary) FROM employees) AS company_avg
FROM employees e;
-- (b) Window function — per-department average
SELECT e.emp_id,
e.name,
e.dept_id,
e.salary,
AVG(e.salary) OVER (PARTITION BY e.dept_id) AS dept_avg
FROM employees e;
Step-by-step trace.
| Step | Form (a) — scalar subquery | Form (b) — window function |
|---|---|---|
| 1 | Inner SELECT AVG(salary) FROM employees runs once |
One scan of employees; planner partitions by dept_id in-flight |
| 2 | Returns one value, treated as a literal | Per row, planner attaches the partition's AVG(salary)
|
| 3 | Outer scan projects (emp_id, name, salary, const_avg)
|
Outer scan projects (emp_id, name, dept_id, salary, dept_avg)
|
| 4 | Total work = 1 aggregate + 1 scan | Total work = 1 scan with windowed aggregate (sort or hash partitioning under the hood) |
Output (form a, sample).
| emp_id | name | salary | company_avg |
|---|---|---|---|
| 101 | Ada | 120000 | 88500 |
| 102 | Brian | 80000 | 88500 |
| 103 | Cara | 60000 | 88500 |
Output (form b, sample).
| emp_id | name | dept_id | salary | dept_avg |
|---|---|---|---|---|
| 101 | Ada | ENG | 120000 | 100000 |
| 102 | Brian | ENG | 80000 | 100000 |
| 103 | Cara | SALES | 60000 | 60000 |
Why this works — concept by concept:
-
Scalar subquery as constant — when the aggregate is global (no
PARTITION), the planner evaluates the subquery once and substitutes a literal everywhere. Fast and predictable. - Window function for per-group — when the aggregate varies by group, the window function is the right tool because it computes the aggregate per partition in a single pass instead of forcing a correlated subquery to run per row.
- Plan readability — a window function makes the intent of "per-department average" obvious; a correlated subquery hides it inside a nested expression.
-
NULL semantics —
AVGignoresNULLs natively; if some salaries areNULL, both forms exclude them automatically. - Cost — scalar form = O(N) + O(1); window form = O(N) with a sort or hash; cheaper than the correlated equivalent which would be O(N × group_size).
SQL
Topic — subqueries (SQL)
Scalar and embedded subquery problems
3. Derived tables — FROM clause subqueries
A derived table is a subquery in FROM — same semantics as a CTE, different syntax
The mental model in one line: a derived table is a SELECT placed in the FROM clause of an outer query, treated by the outer query as if it were a real table. Most modern engines treat a derived table and a WITH (CTE) as semantically equivalent — the difference is mostly stylistic, with a few platform-specific optimizer hints.
Why derived tables exist.
-
Aggregate-then-join. You want to aggregate one table before joining the result to another. The derived table computes the aggregate; the outer
JOINconsumes the rolled-up rows. -
Window-then-filter. Window functions can't appear in
WHERE. So you compute the window in a derived table, then filter on the result in the outerWHERE. - Self-reference reshape. Two passes over the same table — e.g. one to compute a per-customer max, one to find the row that matches that max.
The platform alias rules.
-
MySQL. Derived tables must have an alias —
(SELECT ...) AS sub. Without it, MySQL raisesEvery derived table must have its own alias. - Postgres. Alias optional but strongly recommended; tools and humans both need it.
- SQL Server. Alias required.
- Oracle. Alias optional, but referencing columns requires one.
- BigQuery / Snowflake. Alias optional but recommended.
Derived table vs CTE — when to prefer which.
- Derived table wins when the subquery is used once and is small enough that inlining keeps the query compact.
- CTE wins when the subquery is used more than once, or when the query is long enough that named, top-of-query pipeline stages improve readability.
-
Optimizer caveat (Postgres ≤ 11). CTEs were materialised — a fence the planner couldn't push predicates through. Postgres 12+ treats them as inline unless
MATERIALIZEDis explicit. Other engines (BigQuery, Snowflake) always treat CTEs as inline. - Optimizer caveat (SQL Server). Both shapes inline by default; the planner picks the plan, not the syntax.
Common shape — aggregate-then-join.
SELECT r.region, r.target, sub.total
FROM region_targets r
JOIN (SELECT region, SUM(amount) AS total
FROM orders
GROUP BY region) AS sub
ON sub.region = r.region;
The derived table sub is a virtual table with two columns: region and total. The outer query joins it to region_targets like any other table.
Worked example — top 5 customers per region using a derived table
Detailed explanation. This is one of the most-asked SQL interview shapes: "top-N per group". The cleanest pattern is a derived table with ROW_NUMBER() followed by an outer filter. The derived table does the windowed assignment; the outer WHERE keeps only the top-N.
Question. Find the top 5 customers by total spend within each region. Use a derived table with ROW_NUMBER() partitioned by region.
Input.
| customer_id | region | total_spend |
|---|---|---|
| 1 | EU | 12000 |
| 2 | EU | 9800 |
| 3 | EU | 7600 |
| 4 | EU | 5400 |
| 5 | EU | 4100 |
| 6 | EU | 2900 |
| 7 | US | 22000 |
| 8 | US | 15000 |
Code.
SELECT customer_id, region, total_spend, rnk
FROM (
SELECT customer_id,
region,
total_spend,
ROW_NUMBER() OVER (PARTITION BY region
ORDER BY total_spend DESC) AS rnk
FROM customer_totals
) AS ranked
WHERE rnk <= 5;
Step-by-step explanation.
- The derived table
rankedcomputes a windowROW_NUMBER()over each region, ordered by spend descending. - Every row in
customer_totalsis assignedrnk = 1, 2, 3, ...within its region. - The outer
SELECTfilters tornk <= 5, keeping only the top 5 customers per region. - The filter cannot be inside the derived table because window functions are not allowed in
WHERE— that's the whole point of this pattern.
Output.
| customer_id | region | total_spend | rnk |
|---|---|---|---|
| 1 | EU | 12000 | 1 |
| 2 | EU | 9800 | 2 |
| 3 | EU | 7600 | 3 |
| 4 | EU | 5400 | 4 |
| 5 | EU | 4100 | 5 |
| 7 | US | 22000 | 1 |
| 8 | US | 15000 | 2 |
Rule of thumb. Window-then-filter is the canonical derived-table pattern. The window goes inside the derived table; the rank filter goes in the outer WHERE.
Worked example — aggregate-then-join for region health
Detailed explanation. Reporting often needs "current values" joined to "rollups". The cleanest pattern: derive the rollup, then JOIN it back to the dimension table.
Question. Compute each region's total order amount and join it to a region_targets table to compute attainment_pct.
Input — orders.
| order_id | region | amount |
|---|---|---|
| 1001 | EU | 200 |
| 1002 | EU | 350 |
| 1003 | US | 800 |
| 1004 | US | 100 |
| 1005 | APAC | 90 |
Input — region_targets.
| region | target |
|---|---|
| EU | 1000 |
| US | 1500 |
| APAC | 500 |
Code.
SELECT r.region,
r.target,
sub.total,
ROUND(100.0 * sub.total / r.target, 1) AS attainment_pct
FROM region_targets r
LEFT JOIN (
SELECT region, SUM(amount) AS total
FROM orders
GROUP BY region
) AS sub
ON sub.region = r.region;
Step-by-step explanation.
- The derived table
subaggregatesordersbyregion, returning one row per region. - The outer
LEFT JOINkeeps every region fromregion_targets(so a region with zero orders still appears). - The outer projection computes
attainment_pctfrom the joined columns. -
LEFT JOINoverINNER JOINis deliberate — we want to see regions that had zero orders too, with aNULLtotal that downstreamCOALESCEcan convert to zero.
Output.
| region | target | total | attainment_pct |
|---|---|---|---|
| EU | 1000 | 550 | 55.0 |
| US | 1500 | 900 | 60.0 |
| APAC | 500 | 90 | 18.0 |
Rule of thumb. When the rollup is keyed by a dimension that the outer table already has, a derived table joined back is almost always cleaner than a correlated subquery.
Worked example — derived table for self-join "compare to previous month"
Detailed explanation. "Compare this month's revenue to last month's revenue per region" is a classic reporting shape. A derived table per side and a join on the shifted month is the cleanest pattern when window functions aren't permitted (legacy MySQL versions) or when the reader prefers an explicit self-join.
Question. Given monthly regional revenue, return each region's revenue this month, revenue last month, and the percentage change.
Input — monthly_revenue.
| region | month | revenue |
|---|---|---|
| EU | 2026-03 | 8000 |
| EU | 2026-04 | 10000 |
| US | 2026-03 | 12000 |
| US | 2026-04 | 11000 |
| APAC | 2026-04 | 3000 |
Code (two derived tables joined on shifted month).
SELECT t.region,
t.month,
t.revenue AS this_month,
p.revenue AS last_month,
ROUND(100.0 * (t.revenue - p.revenue) / NULLIF(p.revenue, 0), 1) AS pct_change
FROM (
SELECT region, month, revenue
FROM monthly_revenue
WHERE month = '2026-04'
) AS t
LEFT JOIN (
SELECT region, revenue
FROM monthly_revenue
WHERE month = '2026-03'
) AS p
ON p.region = t.region;
Step-by-step explanation.
- Derived table
tmaterialises the current month's row per region — one row per active region this month. - Derived table
pmaterialises the previous month's row per region — one row per active region last month. - The outer
LEFT JOINmatches byregion; a region that exists this month but not last month gets aNULLlast_month(handled byNULLIFto avoid divide-by-zero). - The projection computes the percentage change, rounded for readability.
Output.
| region | month | this_month | last_month | pct_change |
|---|---|---|---|---|
| EU | 2026-04 | 10000 | 8000 | 25.0 |
| US | 2026-04 | 11000 | 12000 | -8.3 |
| APAC | 2026-04 | 3000 | NULL | NULL |
Rule of thumb. When the comparison is over a shifted dimension and the engine lacks window functions, two derived tables joined on the shifted key is the cleanest expression. With window functions available, LAG() is even cleaner — LAG(revenue) OVER (PARTITION BY region ORDER BY month).
SQL interview question on derived table vs CTE for a multi-step pipeline
A common probe: "We need to identify each customer's second most recent order. Walk through the query and explain why you chose derived table vs CTE." — testing whether the candidate picks the most readable shape for a multi-stage pipeline.
Solution Using a derived table with ROW_NUMBER
SELECT customer_id, order_id, order_date
FROM (
SELECT customer_id,
order_id,
order_date,
ROW_NUMBER() OVER (PARTITION BY customer_id
ORDER BY order_date DESC) AS rn
FROM orders
) AS ranked
WHERE rn = 2;
Step-by-step trace.
| Step | Stage | What happens |
|---|---|---|
| 1 | Derived scan | reads orders once |
| 2 | Window assignment | assigns rn = 1, 2, 3, ... per customer in date-desc order |
| 3 | Outer filter | keeps only rows where rn = 2
|
| 4 | Project | emits (customer_id, order_id, order_date) for the second-most-recent order |
Sample input.
| customer_id | order_id | order_date |
|---|---|---|
| 1 | 1001 | 2026-01-15 |
| 1 | 1002 | 2026-02-10 |
| 1 | 1003 | 2026-03-05 |
| 2 | 1004 | 2026-04-01 |
| 2 | 1005 | 2026-04-02 |
Output:
| customer_id | order_id | order_date |
|---|---|---|
| 1 | 1002 | 2026-02-10 |
| 2 | 1004 | 2026-04-01 |
Why this works — concept by concept:
-
Window-then-filter — windowed
ROW_NUMBERcannot appear inWHERE, so the derived table is the right shape: assign in the inner, filter in the outer. -
PARTITION BY customer_id — restarts the numbering for each customer, so
rn=2means "second per customer", not "second overall". -
ORDER BY order_date DESC — defines what "second-most-recent" means; ties should be broken with a secondary
ORDER BYlikeorder_id DESCif the data permits same-date orders. - Derived table vs CTE — for a single-use pipeline like this, a derived table is the most compact shape. For three or more pipeline stages, switch to CTEs purely for readability.
-
Cost — one scan of
orders+ one sort or hash for the window = O(N log N) in the worst case; the outer filter is O(N).
SQL
Topic — subqueries (SQL)
Derived table and top-N per group problems
4. Correlated subqueries — referencing the outer query
A correlated subquery references the outer row — logically runs once per outer row
The mental model in one line: a correlated subquery is an inner SELECT that references one or more columns from the outer query; logically it executes once per outer row, although most modern planners rewrite it as a hash join or semi-join under the hood. This is the most senior-signal shape — using it correctly tells the interviewer you can think in terms of "per-row predicate logic" rather than just set operations.
What makes a subquery correlated.
- The inner
SELECTreferences one or more columns from the outer query's table aliases. - The inner cannot be evaluated independently — it depends on the outer row currently being processed.
- Logically, the engine evaluates the inner once per outer row; physically, the optimizer is allowed to rewrite this as a single join.
Where correlated subqueries show up.
-
WHEREpredicate. "Customers with above-average orders" — the inner computes the average for that customer. -
SELECTcolumn. "Each customer's most recent order date" — the inner finds the max date for that customer. -
EXISTStest. "Customers who placed any order" — the inner tests existence per outer row.
Two questions to ask before writing a correlated subquery.
- Can I write this with a
GROUP BYand aJOINinstead? If yes, that's usually faster and more obvious. - Can I write this with a window function? If yes (per-row position or per-row rank), the window is almost always faster.
When correlated subqueries are the right answer.
- The predicate logic depends on a per-row scalar (e.g. "more than this row's department's average").
- The query needs to be self-documenting — the correlated form often reads like the English question.
- The dataset is small enough that the per-row cost doesn't matter (lookups inside a report, dashboards, etc.).
Performance — when correlated subqueries hurt.
- The naïve plan is O(N × M) — N outer rows, M inner scan per row.
- Modern planners (Postgres ≥ 12, BigQuery, Snowflake) rewrite to O(N + M) via hash semi-join.
- Older MySQL versions (≤ 5.7) cannot always rewrite — the naïve nested-loop plan kicks in and the query becomes slow.
- The fix is usually a manual
JOINon aGROUP BYderived table or a CTE.
EXISTS vs IN — both can be correlated, with different NULL semantics.
-
WHERE EXISTS (SELECT 1 FROM ...)— boolean test, short-circuits, NULL-safe. -
WHERE x IN (SELECT y FROM ...)— membership test; if any inneryisNULL, the negation (NOT IN) trips into the NULL trap (see §5).
Worked example — customers with above-average orders
Detailed explanation. A classic interview shape: for each customer, find the orders that exceeded that customer's average. The inner subquery is correlated because it references the outer customer_id to compute the per-customer average.
Question. Given orders(order_id, customer_id, amount), list each order whose amount exceeds the average amount for that order's customer.
Input.
| order_id | customer_id | amount |
|---|---|---|
| 1001 | 1 | 200 |
| 1002 | 1 | 50 |
| 1003 | 1 | 800 |
| 1004 | 2 | 100 |
| 1005 | 2 | 100 |
| 1006 | 2 | 100 |
| 1007 | 3 | 50 |
| 1008 | 3 | 250 |
Code (correlated subquery).
SELECT o.order_id, o.customer_id, o.amount
FROM orders o
WHERE o.amount > (
SELECT AVG(amount)
FROM orders i
WHERE i.customer_id = o.customer_id
);
Step-by-step explanation.
- Logically, the planner walks every row in the outer
orders o. - For each outer row, it computes
AVG(amount)over the innerorders ifiltered toi.customer_id = o.customer_id. Customer 1's average is(200 + 50 + 800) / 3 = 350. Customer 2's average is100. Customer 3's average is150. - It keeps the outer row if
o.amount > the_per_customer_avg. - Customer 1: orders 1001 (200), 1002 (50), 1003 (800). Average = 350. Only 1003 exceeds. Customer 2: all three orders equal the average → none exceed (strictly greater than). Customer 3: orders 1007 (50), 1008 (250). Average = 150. Only 1008 exceeds.
Output.
| order_id | customer_id | amount |
|---|---|---|
| 1003 | 1 | 800 |
| 1008 | 3 | 250 |
Rule of thumb. When the predicate references a per-row aggregate, the correlated subquery is the clearest form. If performance matters and the dataset is large, rewrite as a derived-table JOIN with a grouped average.
Worked example — running total via correlated subquery (and the window-function rewrite)
Detailed explanation. "Running total" is the canonical "correlated subquery vs window function" head-to-head. The correlated form reads like the textbook definition; the window form is faster on every modern engine. Interviewers expect you to write the correlated version first and then immediately propose the window-function rewrite.
Question. Given a sequence of orders ordered by date, return each order with the running total of amount up to and including that order.
Input — orders.
| order_id | order_date | amount |
|---|---|---|
| 1001 | 2026-04-01 | 50 |
| 1002 | 2026-04-02 | 80 |
| 1003 | 2026-04-03 | 30 |
| 1004 | 2026-04-04 | 120 |
Code (correlated form — clear but slow).
SELECT o.order_id,
o.order_date,
o.amount,
(SELECT SUM(amount)
FROM orders i
WHERE i.order_date <= o.order_date) AS running_total
FROM orders o
ORDER BY o.order_date;
Code (window-function rewrite — fast).
SELECT order_id,
order_date,
amount,
SUM(amount) OVER (ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM orders
ORDER BY order_date;
Step-by-step explanation.
- The correlated form computes a per-row aggregate over all prior rows. Naively O(N²) — each of N rows triggers an O(N) scan.
- The window form computes the running sum in a single pass; the engine maintains an accumulator as it walks the ordered stream. O(N) total.
- Both return identical results on the sample dataset; the plans diverge dramatically on large tables.
- The
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROWframe makes the running-sum semantics explicit; the defaultRANGEframe is the same here but differs when there are ties onorder_date.
Output (both forms).
| order_id | order_date | amount | running_total |
|---|---|---|---|
| 1001 | 2026-04-01 | 50 | 50 |
| 1002 | 2026-04-02 | 80 | 130 |
| 1003 | 2026-04-03 | 30 | 160 |
| 1004 | 2026-04-04 | 120 | 280 |
Rule of thumb. Anytime you see a correlated subquery whose inner is SUM, COUNT, MIN, MAX, or AVG over rows "up to / since / above / below" the outer row, that's a window-function candidate. Rewrite for an O(N) plan instead of O(N²).
Worked example — each customer's most recent order
Detailed explanation. "Each customer's latest X" is another shape that maps naturally to a correlated subquery in the SELECT clause. The cleaner modern form is a window function (ROW_NUMBER), but the correlated shape is still tested in interviews.
Question. Return each customer's most recent order date.
Input.
| order_id | customer_id | order_date |
|---|---|---|
| 1001 | 1 | 2026-01-15 |
| 1002 | 1 | 2026-03-10 |
| 1003 | 2 | 2026-02-22 |
| 1004 | 2 | 2026-04-05 |
| 1005 | 3 | 2026-05-01 |
Code (correlated scalar subquery in SELECT).
SELECT c.customer_id,
c.name,
(SELECT MAX(o.order_date)
FROM orders o
WHERE o.customer_id = c.customer_id) AS last_order_date
FROM customers c;
Step-by-step explanation.
- For each row in
customers c, the planner computes the innerMAX(order_date)filtered to that customer. - If the customer has no orders, the inner returns zero rows and the scalar evaluates to
NULL— the outer projection showsNULLforlast_order_date, which is the correct semantics. - Modern planners usually rewrite this as a left join on a grouped derived table:
(customers c LEFT JOIN (SELECT customer_id, MAX(order_date) FROM orders GROUP BY customer_id) g ON g.customer_id = c.customer_id).
Output.
| customer_id | name | last_order_date |
|---|---|---|
| 1 | Alice | 2026-03-10 |
| 2 | Bob | 2026-04-05 |
| 3 | Charlie | 2026-05-01 |
| 4 | Dana | NULL |
Rule of thumb. Correlated subquery in SELECT is fine for small lookups and reports. For warehouse-scale data, prefer the explicit grouped derived-table LEFT JOIN.
Worked example — employees who out-earn their department average
Detailed explanation. "Find rows where the value exceeds the group's average" is the textbook correlated-subquery interview shape. The inner aggregate is parameterised by the outer row's group column — that's exactly what "correlated" means.
Question. Return every employee whose salary is strictly greater than the average salary in their department.
Input — employees.
| emp_id | dept_id | salary |
|---|---|---|
| 101 | ENG | 120000 |
| 102 | ENG | 90000 |
| 103 | ENG | 70000 |
| 104 | SALES | 80000 |
| 105 | SALES | 60000 |
| 106 | SALES | 60000 |
Code (correlated form).
SELECT e.emp_id, e.dept_id, e.salary
FROM employees e
WHERE e.salary > (
SELECT AVG(salary)
FROM employees i
WHERE i.dept_id = e.dept_id
);
Step-by-step explanation.
- For each outer row, the inner computes
AVG(salary)for the samedept_id. - ENG average =
(120000 + 90000 + 70000) / 3 ≈ 93333. SALES average =(80000 + 60000 + 60000) / 3 ≈ 66667. - Keep emp 101 (
120000 > 93333✓), drop emp 102 (90000 > 93333✗), drop 103. Keep 104 (80000 > 66667✓), drop 105, drop 106. - The correlated reference is
i.dept_id = e.dept_id— that's what wires the inner aggregate to the outer row.
Output.
| emp_id | dept_id | salary |
|---|---|---|
| 101 | ENG | 120000 |
| 104 | SALES | 80000 |
The faster rewrite (window function).
SELECT emp_id, dept_id, salary
FROM (
SELECT emp_id,
dept_id,
salary,
AVG(salary) OVER (PARTITION BY dept_id) AS dept_avg
FROM employees
) AS x
WHERE salary > dept_avg;
Rule of thumb. "Above-group-average" is the canonical correlated-subquery interview shape — and the canonical "rewrite as a window function" follow-up. Expect both.
SQL interview question on rewriting a correlated subquery as a JOIN
A common probe: "Your correlated subquery returns the right answer but EXPLAIN shows nested-loop semantics and the table is 200M rows. Rewrite for the hash-join path." — testing whether the candidate spots the manual rewrite path.
Solution Using a grouped derived table joined back to the outer table
-- Original — correlated subquery
SELECT o.order_id, o.customer_id, o.amount
FROM orders o
WHERE o.amount > (
SELECT AVG(amount)
FROM orders i
WHERE i.customer_id = o.customer_id
);
-- Rewritten — JOIN against a grouped derived table
SELECT o.order_id, o.customer_id, o.amount
FROM orders o
JOIN (SELECT customer_id, AVG(amount) AS avg_amt
FROM orders
GROUP BY customer_id) g
ON g.customer_id = o.customer_id
WHERE o.amount > g.avg_amt;
Step-by-step trace.
| Step | Stage | What happens |
|---|---|---|
| 1 | Inner aggregate | one scan of orders, groups by customer_id, computes per-customer average |
| 2 | Hash build | build hash on g.customer_id (size = distinct customers, small) |
| 3 | Outer scan | scan orders once |
| 4 | Hash probe + filter | for each outer row, probe the hash and keep rows where amount > avg_amt
|
| 5 | Project | emit (order_id, customer_id, amount)
|
Output:
| order_id | customer_id | amount |
|---|---|---|
| 1003 | 1 | 800 |
| 1008 | 3 | 250 |
Why this works — concept by concept:
-
Per-row predicate becomes per-group constant — the grouped derived table computes
avg_amtonce per customer and the outer join probes it in O(1). The naïve correlated form would recompute the average for every row. -
Hash join over nested loop — the planner builds a hash table on
customer_idonce (M entries) and probes it N times. Total cost = O(M + N), versus O(M × N) for the naïve correlated nested loop. - Same answer, different plan — every row that satisfied the original correlated predicate also satisfies the join + filter, and vice versa. The rewrite is an optimizer hint, not a logic change.
-
Modern planner caveat — Postgres 13+, BigQuery, and Snowflake will often rewrite the correlated form to the same hash semi-join. Older MySQL and SQL Server before 2017 may not. Run
EXPLAINto verify. - Cost — O(N + M) instead of O(N × M); N = order count, M = distinct customer count; typical speedup is 10× to 1000× on large tables.
SQL
Topic — subqueries
Correlated subquery rewrite problems
5. EXISTS / NOT EXISTS / IN / NOT IN — four ways to ask "is there?"
exists sql short-circuits; IN materialises; NOT IN is the NULL footgun
The mental model in one line: EXISTS, NOT EXISTS, IN, and NOT IN are four operators that test row existence, but they differ in short-circuit behaviour, selectivity preference, and NULL semantics. Senior interviews almost always test whether you can pick the right one for a given dataset shape, not whether you know they exist.
The four operators in one sentence each.
-
EXISTS— returnsTRUEas soon as the inner subquery produces one row; short-circuits on the first match. The innerSELECTlist is irrelevant —SELECT 1,SELECT *, anything works. -
IN— tests whether the outer expression appears in the inner result set (or literal list). The planner materialises the inner side into a list or hash. -
NOT EXISTS—TRUEwhen the inner subquery returns zero rows. NULL-safe by definition. -
NOT IN—TRUEwhen the outer value is not present in the inner list. Returns no rows if any inner value isNULL— the classic NULL footgun.
The NULL footgun explained.
NOT IN (...NULL...) evaluates to UNKNOWN for every row, because SQL three-valued logic says "is x not in a set that contains NULL?" cannot be answered. UNKNOWN filtered through WHERE is treated as FALSE, so the query returns zero rows. Always replace NOT IN with NOT EXISTS when the inner side might contain NULL.
-- ⚠️ Returns ZERO rows if any inner customer_id is NULL
SELECT * FROM customers
WHERE customer_id NOT IN (SELECT customer_id FROM blacklist);
-- ✓ Always-correct NULL-safe alternative
SELECT * FROM customers c
WHERE NOT EXISTS (
SELECT 1 FROM blacklist b WHERE b.customer_id = c.customer_id
);
The four-quadrant decision matrix.
| Selectivity vs NULL-safety | Inner side might be NULL → need NULL-safe | Inner side guaranteed non-NULL |
|---|---|---|
| Inner is highly selective (few rows) |
NOT EXISTS for negation; EXISTS for affirmation |
EXISTS (short-circuits fast) |
| Inner is small literal list |
NOT EXISTS (still safer); or use COALESCE guard |
IN (literal_list) is fine |
JOIN equivalents.
-
EXISTS≅ SEMI JOIN — return outer rows that have a match; no row duplication regardless of how many inner rows match. -
NOT EXISTS≅ ANTI JOIN — return outer rows that have no match. -
IN≅ semi-join after deduplication of the inner. -
NOT IN≅ anti-join only if the inner is guaranteed non-NULL; otherwise three-valued-logic semantics diverge.
Why modern planners often pick the same plan for IN and EXISTS.
Both shapes ask "is there a matching row?". A semi-join is the canonical physical plan for that question. Postgres, BigQuery, Snowflake, SQL Server (2017+), and modern MySQL all converge IN (SELECT ...) and EXISTS (SELECT 1 FROM ... WHERE ...) to the same hash semi-join. The reason to write one over the other is mostly stylistic, with one exception: NULL semantics for the negated forms (NOT IN vs NOT EXISTS).
Worked example — find customers with no orders
Detailed explanation. "Find the rows in A that have no match in B" is the classic anti-join shape. NOT EXISTS is the canonical, NULL-safe form. NOT IN is dangerous; LEFT JOIN ... WHERE inner IS NULL is equivalent but slightly more verbose.
Question. List every customer who has never placed an order.
Input — customers.
| customer_id | name |
|---|---|
| 1 | Alice |
| 2 | Bob |
| 3 | Charlie |
| 4 | Dana |
Input — orders.
| order_id | customer_id |
|---|---|
| 1001 | 1 |
| 1002 | 3 |
Code (the right way: NOT EXISTS).
SELECT c.customer_id, c.name
FROM customers c
WHERE NOT EXISTS (
SELECT 1
FROM orders o
WHERE o.customer_id = c.customer_id
);
Step-by-step explanation.
- For each customer row, the planner scans
orderslooking for any row witho.customer_id = c.customer_id. - As soon as one match is found, the inner returns
TRUEandNOT EXISTSreturnsFALSE— that customer is dropped. - If the inner scan finishes with zero matches,
NOT EXISTSreturnsTRUE— the customer is kept. - Under the hood, the planner builds a hash table on
orders.customer_idand probes per outer row, dropping any outer row whose customer_id is in the hash. This is the ANTI JOIN plan.
Output.
| customer_id | name |
|---|---|
| 2 | Bob |
| 4 | Dana |
Rule of thumb. Use NOT EXISTS for every "no match" question. Use NOT IN only if the inner side is a small literal list with no NULLs.
Worked example — the NOT IN NULL trap
Detailed explanation. A team writes a "customers not in the blacklist" query with NOT IN. One blacklist row has a NULL customer_id due to a data-load bug. The query silently returns zero rows in production. The interviewer wants to know whether you can spot it.
Question. Given a blacklist table where one row has customer_id = NULL, what does SELECT * FROM customers WHERE customer_id NOT IN (SELECT customer_id FROM blacklist) return?
Input — customers.
| customer_id | name |
|---|---|
| 1 | Alice |
| 2 | Bob |
| 3 | Charlie |
Input — blacklist.
| customer_id |
|---|
| 2 |
| NULL |
Code (the trap).
SELECT customer_id, name
FROM customers
WHERE customer_id NOT IN (SELECT customer_id FROM blacklist);
Step-by-step explanation.
- The inner subquery materialises
(2, NULL). - The outer evaluates
1 NOT IN (2, NULL)— three-valued logic says:1 != 2isTRUE, but1 != NULLisUNKNOWN. The whole expression collapses toUNKNOWN. -
WHERE UNKNOWNis filtered asFALSE, so row 1 is dropped. - Same logic for rows 2 and 3 — every row produces
UNKNOWNand is dropped. The query returns zero rows.
Output (the trap).
| (empty result set) |
The fix — NOT EXISTS.
SELECT c.customer_id, c.name
FROM customers c
WHERE NOT EXISTS (
SELECT 1 FROM blacklist b
WHERE b.customer_id = c.customer_id
);
Output after fix.
| customer_id | name |
|---|---|
| 1 | Alice |
| 3 | Charlie |
Why NOT EXISTS is safe. It tests row existence, not value equality. The inner WHERE b.customer_id = c.customer_id cannot match a NULL value (because NULL = NULL is UNKNOWN, not TRUE), so the NULL row in the blacklist simply doesn't participate.
Rule of thumb. Treat NOT IN with a subquery as legacy code unless you can prove the inner side is non-NULL by constraint. NOT EXISTS is always safe.
Worked example — two-column EXISTS on a composite key
Detailed explanation. Some existence checks are composite — "does a row exist with the same (tenant_id, sku) pair?" IN cannot express this on every engine without a row constructor; EXISTS always can. This shape is common in multi-tenant data platforms.
Question. Given an inventory(tenant_id, sku, qty) table and an orders(tenant_id, sku, qty) table, find orders for which there is no matching inventory row for the same (tenant_id, sku) pair.
Input — inventory.
| tenant_id | sku | qty |
|---|---|---|
| T1 | A | 50 |
| T1 | B | 10 |
| T2 | A | 0 |
Input — orders.
| order_id | tenant_id | sku | qty |
|---|---|---|---|
| 1 | T1 | A | 5 |
| 2 | T1 | C | 2 |
| 3 | T2 | A | 1 |
| 4 | T3 | A | 1 |
Code (NOT EXISTS with a two-column correlation).
SELECT o.order_id, o.tenant_id, o.sku, o.qty
FROM orders o
WHERE NOT EXISTS (
SELECT 1
FROM inventory i
WHERE i.tenant_id = o.tenant_id
AND i.sku = o.sku
);
Step-by-step explanation.
- For each order row, the planner scans
inventoryfor any row matching bothtenant_idandsku. - Order 1 (T1, A) — matches
inventory(T1, A)→ drop. - Order 2 (T1, C) — no match → keep.
- Order 3 (T2, A) — matches
inventory(T2, A)→ drop (even thoughqty = 0, the row exists). - Order 4 (T3, A) — no
inventoryrow for tenantT3at all → keep. - Under the hood, the planner builds a hash on
(inventory.tenant_id, inventory.sku)and probes per outer row.
Output.
| order_id | tenant_id | sku | qty |
|---|---|---|---|
| 2 | T1 | C | 2 |
| 4 | T3 | A | 1 |
Rule of thumb. Composite-key existence tests almost always belong in EXISTS / NOT EXISTS. The row-constructor (a, b) IN (...) syntax is supported by Postgres but not every engine; the EXISTS form is universal.
SQL interview question on EXISTS vs IN for an existence check
A common probe: "Find every product that has at least one review. EXISTS or IN — which would you use and why?" — testing whether the candidate can articulate the difference at the level of plans, not just syntax.
Solution Using EXISTS with a correlated reference
SELECT p.product_id, p.name
FROM products p
WHERE EXISTS (
SELECT 1
FROM reviews r
WHERE r.product_id = p.product_id
);
Step-by-step trace.
| Step | Stage | What happens |
|---|---|---|
| 1 | Outer scan | reads each product row from products
|
| 2 | Inner correlated lookup | scans reviews for any row with r.product_id = p.product_id
|
| 3 | Short-circuit | as soon as one matching review is found, the inner returns and EXISTS is TRUE
|
| 4 | Keep / drop | keeps the outer row if EXISTS = TRUE; drops it otherwise |
| 5 | Project | emits (product_id, name)
|
Sample input — products.
| product_id | name |
|---|---|
| 100 | Widget |
| 101 | Gadget |
| 102 | Gizmo |
Sample input — reviews.
| review_id | product_id | rating |
|---|---|---|
| 1 | 100 | 5 |
| 2 | 100 | 4 |
| 3 | 102 | 3 |
Output:
| product_id | name |
|---|---|
| 100 | Widget |
| 102 | Gizmo |
Why this works — concept by concept:
-
EXISTS short-circuits — the inner subquery stops scanning
reviewsas soon as the first match is found per outer row. For a product with 1,000 reviews, the inner reads exactly one row, not 1,000. -
Semi-join semantics — regardless of how many reviews a product has, the outer row appears exactly once. No
DISTINCTneeded, no row explosion. -
NULL-safe — even if
reviews.product_idhasNULLs, the innerWHERE r.product_id = p.product_idsimply doesn't match them;EXISTSis unaffected. -
Hash semi-join under the hood — modern planners build a hash on
reviews.product_id(with duplicates collapsed) and probe per outer row; total cost = O(reviews + products). -
IN equivalence —
WHERE p.product_id IN (SELECT product_id FROM reviews)produces the same plan on Postgres, BigQuery, Snowflake.EXISTSis the more declarative choice for "is there a match?". - Cost — O(R + P) where R = reviews, P = products; the hash-build is O(distinct reviews.product_id).
SQL
Topic — subqueries (SQL)
EXISTS, NOT EXISTS, and NULL-safe anti-join problems
Three more interview-grade patterns for EXISTS / IN
Beyond the canonical four operators, three further patterns recur in senior-round subquery interviews — pattern recognition matters as much as syntax recall.
Pattern 1 — EXISTS as a guard inside a CASE expression.
A common reporting requirement is "flag each customer as active or dormant based on whether they have at least one order in the last 30 days." The cleanest expression embeds EXISTS inside a CASE.
SELECT c.customer_id,
c.name,
CASE
WHEN EXISTS (
SELECT 1
FROM orders o
WHERE o.customer_id = c.customer_id
AND o.order_date >= CURRENT_DATE - INTERVAL '30 days'
) THEN 'active'
ELSE 'dormant'
END AS status
FROM customers c;
EXISTS is the right primitive because it short-circuits on the first match — perfect for a per-row boolean classifier.
Pattern 2 — EXISTS correlated to two outer tables (multi-table correlation).
Sometimes the inner subquery needs to reference more than one outer alias. This is rare but legal — most engines support correlated references to any outer alias in scope.
SELECT c.customer_id, p.product_id
FROM customers c
JOIN products p
WHERE EXISTS (
SELECT 1
FROM orders o
WHERE o.customer_id = c.customer_id
AND o.product_id = p.product_id
);
Reads naturally as "every customer-product pair where there's at least one matching order." Most planners rewrite this as a hash semi-join across the three tables.
Pattern 3 — IN with a row constructor for multi-column matching (Postgres / DB2).
When the engine supports row constructors, you can express composite-key IN without going through EXISTS.
-- Postgres
SELECT *
FROM orders
WHERE (tenant_id, sku) IN (
SELECT tenant_id, sku
FROM active_skus
);
This is exactly the two-column EXISTS shape from the worked example above, just with row-constructor syntax. The choice is stylistic — pick whichever your team's engine supports and your reader prefers. Note that MySQL historically supports row-constructor IN; older SQL Server does not.
Three NULL gotchas to memorise.
-
WHERE x = NULLis alwaysFALSE(never use=against NULL; useIS NULL). -
x NOT IN (subquery with any NULL)always returns zero rows. -
EXISTS (SELECT NULL ...)is notFALSE— the existence test only cares whether any row appears, not what the inner projection contains.SELECT 1,SELECT NULL,SELECT *all behave identically.
The third bullet is the most common interview trap. Candidates believe EXISTS (SELECT NULL FROM ...) returns FALSE because the projected value is NULL. It doesn't — EXISTS only counts rows.
Cheat sheet — subquery patterns
-
Above-average filter.
WHERE x > (SELECT AVG(x) FROM t)— uncorrelated scalar subquery; the inner runs once. Add a correlatedWHERE t2.group_id = t.group_idto filter per group. -
Top-N per group. Derived table with
ROW_NUMBER() OVER (PARTITION BY group_col ORDER BY metric DESC); outerWHERE rn <= N. Window cannot live inWHERE, so the derived-table wrap is mandatory. -
Find duplicates.
GROUP BY business_key HAVING COUNT(*) > 1— not technically a subquery, but interviewers often expect it inside aWHERE id IN (...)to surface the offending rows. -
Missing rows / anti-join.
WHERE NOT EXISTS (SELECT 1 FROM right_table WHERE ...)— the NULL-safe form of "no match". Equivalent toLEFT JOIN ... WHERE right.pk IS NULL. -
Conditional aggregation as scalar.
SELECT (SELECT COUNT(*) FROM t WHERE cond) AS sub— embeds a count inline. Use sparingly; usually aFILTERclause or aCASEinsideSUMis clearer. -
Two-column EXISTS.
WHERE EXISTS (SELECT 1 FROM t2 WHERE t1.a = t2.a AND t1.b = t2.b)— composite-key existence test. No way to express this withINdirectly (you'd need a row constructor(a, b) IN (...)which not every engine supports). -
First-of-group via correlated subquery.
WHERE order_id = (SELECT MIN(order_id) FROM orders WHERE customer_id = outer.customer_id)— picks the earliest order per customer. Faster modern form: window function withROW_NUMBERpartitioned bycustomer_id. -
Latest-of-group via derived table.
JOIN (SELECT customer_id, MAX(order_date) AS last_dt FROM orders GROUP BY customer_id)— picks the latest date per customer, then joins back for the full row. -
Percentile / Nth value.
SELECT (SELECT amount FROM orders ORDER BY amount DESC LIMIT 1 OFFSET N-1)— Nth largest amount; or usePERCENTILE_CONT/NTILEfor cleaner semantics. -
Update from subquery.
UPDATE customers SET tier = (SELECT tier FROM tiers WHERE tiers.customer_id = customers.customer_id)— correlated scalar in theSETclause. NULL-safe if the subquery is allowed to returnNULL.
Frequently asked questions
Subquery vs CTE — which is faster?
In modern engines (Postgres 12+, BigQuery, Snowflake, SQL Server, modern MySQL), CTEs and equivalent derived tables produce identical plans for single-use pipelines — the planner inlines the CTE just like a subquery. The choice is mostly stylistic: CTEs win when you reuse a stage more than once, or when the query is long enough that named, top-of-query stages improve readability. Older Postgres (≤ 11) treated CTEs as an optimization fence — the planner couldn't push predicates through; that quirk is now gated behind WITH MATERIALIZED. If you want a single-use CTE on Postgres 11 to be inlined, you currently need to switch to a derived table or upgrade. Verifying with EXPLAIN ANALYZE is the only way to know for sure on a given engine.
Does the optimizer always rewrite correlated subqueries as JOINs?
Not always. Postgres 13+, BigQuery, and Snowflake are aggressive about it — they convert correlated EXISTS, IN, and aggregate-in-WHERE patterns into hash semi-joins automatically. MySQL ≤ 5.7 and older SQL Server versions are less aggressive; for them, the naïve nested-loop plan can dominate runtime on large tables. Always check EXPLAIN (or EXPLAIN ANALYZE) — if you see "nested loop" with no "semi join" or "hash join" hint, manually rewrite the query as a JOIN against a grouped derived table. As a quick rule of thumb: if your query runs in seconds on dev and minutes on prod, the optimizer probably failed to rewrite a correlated subquery, and the manual rewrite is the single biggest speedup available.
IN vs EXISTS — which should I prefer?
For the affirmative case (return rows that match), IN and EXISTS produce the same plan on every modern engine — pick whichever reads more clearly. EXISTS is more declarative when the inner has multiple conditions; IN is more compact for a single-column lookup. For the negated case (return rows that don't match), prefer NOT EXISTS — it is NULL-safe, whereas NOT IN silently returns zero rows if any inner value is NULL. A handy interview tie-breaker: when the question reads "is there at least one matching row?" reach for EXISTS; when the question reads "does the value appear in this list?" reach for IN. Both compile to the same semi-join when the planner does its job.
Why does NOT IN return zero rows when there's a NULL in the list?
Because SQL uses three-valued logic. x NOT IN (a, b, NULL) evaluates as x != a AND x != b AND x != NULL. x != NULL evaluates to UNKNOWN (not TRUE), and TRUE AND UNKNOWN collapses to UNKNOWN. WHERE UNKNOWN is treated as FALSE, so every outer row is filtered out. The fix is either NOT EXISTS (which uses = inside, which is UNKNOWN for NULLs and therefore doesn't match) or an explicit WHERE col IS NOT NULL inside the inner subquery.
Can I update a table using a subquery?
Yes, in two shapes. Correlated UPDATE: UPDATE customers SET tier = (SELECT tier FROM tiers t WHERE t.customer_id = customers.customer_id) — runs the subquery once per outer row. UPDATE with FROM (Postgres) or JOIN (MySQL / SQL Server): UPDATE customers c SET tier = t.tier FROM tiers t WHERE t.customer_id = c.customer_id — typically much faster on large tables because the planner can hash-join the two sides. Always guard against the subquery returning NULL (which silently nulls the column) by adding an EXISTS filter or a COALESCE. The same shapes work for DELETE: DELETE FROM customers WHERE NOT EXISTS (SELECT 1 FROM orders WHERE orders.customer_id = customers.customer_id) is the canonical "delete customers with no orders" query, and it is NULL-safe by construction.
Are nested subqueries OK?
Yes, but every layer of nesting is a layer of cognitive load. SQL's grammar allows arbitrarily deep nesting (SELECT (SELECT (SELECT ...))), and modern planners handle three or four levels without breaking a sweat. Once you reach two or three nested layers, refactor into CTEs — the named stages make the query self-documenting and let the planner see the whole pipeline at once. Avoid deep nesting in interviews; even when it's technically correct, it tells the interviewer you don't yet know when to switch shapes. A practical rule: if your query has more than two levels of parentheses around SELECT keywords, rewrite the inner two layers as named CTEs and watch the readability double.
What's the difference between a scalar subquery and a correlated subquery?
A scalar subquery is defined by its result cardinality: exactly one row × one column. A correlated subquery is defined by whether the inner references the outer — orthogonal axis. A subquery can be scalar and correlated (e.g. (SELECT MAX(date) FROM orders WHERE orders.customer_id = c.customer_id) returns one value per outer row), scalar and uncorrelated (e.g. (SELECT MAX(price) FROM products) — one value, evaluated once), or non-scalar correlated (e.g. inside EXISTS, where the inner can return many rows because only existence matters). Senior interviewers will sometimes hand you a subquery and ask "is it scalar, correlated, both, or neither?" — be ready to answer along both axes independently.
Practice on PipeCode
- Drill the subqueries practice library → for end-to-end correlated, scalar, and derived-table problems.
- Sharpen SQL subquery problems → where every problem ships with a sample dataset, an interview prompt, and a marked solution.
- Stack the prerequisites with joins practice → — semi-joins and anti-joins are how subqueries are physically executed.
- Pair subqueries with window-function problems → for the top-N-per-group pattern.
- Build aggregate-then-join muscle with aggregation drills →.
- Take the full SQL course: SQL for Data Engineering Interviews — from zero to FAANG →.
- For the broader interview surface, read top data engineering interview questions →.
- Stack the fundamentals with the only 5 skills you need to become a data engineer →.
Pipecode.ai is Leetcode for Data Engineering — every subquery pattern above ships with hands-on practice rooms where you write the query, run it against a real dataset, and see whether it survives the NULL gotchas and the optimizer rewrite. Start with the subqueries library and work outward; PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine.