SQL Subqueries: Correlated, Scalar, Derived Tables & EXISTS

python dev.to

SQL Subqueries: Correlated, Scalar, Derived Tables & EXISTS

The sql subquery is the single SQL primitive that splits junior candidates from senior ones in a deep-dive interview. Anyone can write a one-table SELECT; only candidates who genuinely understand subqueries in sql can decide between a correlated subquery, a scalar subquery embedded in a SELECT clause, a derived table in the FROM clause, or an EXISTS semi-join — and then explain why their choice is the right one for the row count and the optimizer plan they expect.

This guide walks through every shape of nested query sql that shows up in FAANG and analytics-heavy interview loops. Each section ships a teaching block plus a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works. By the end you'll be able to defend a derived-table aggregation, debug a NOT IN that silently returns zero rows, rewrite a correlated subquery as a JOIN for performance, and pick the right operator from the exists sql / IN / NOT EXISTS / NOT IN four-quadrant matrix without hesitation.

When you want hands-on reps immediately after reading, drill the subqueries practice library →, then sharpen SQL subquery problems →, and stack the prerequisites with joins practice →.


On this page


1. Why subqueries are the senior-round signal

A sql subquery is just a SELECT inside another SELECT — but the shape decides everything

The one-sentence invariant: a subquery is a SELECT statement nested inside another statement; its position in the parent — SELECT list, FROM clause, WHERE clause, or HAVING clause — decides how the optimizer evaluates it and what shape it must return. Once you internalise that, the rest of the subqueries in sql family becomes a routing decision: scalar (one value), derived (a table), correlated (per-row), or existential (a boolean test).

The four shapes every interview probes.

  • Scalar subquery. Returns exactly one row × one column. Used wherever a single value is expected — a SELECT column, a WHERE comparison operand, or a HAVING predicate. Returning zero rows quietly produces NULL; returning more than one row is a runtime error.
  • Derived table (FROM-clause subquery). Returns a multi-row, multi-column result that the outer query treats as a virtual table. Often the cleanest way to compute aggregates before joining. Semantically equivalent to a CTE (WITH), just inlined.
  • Correlated subquery. References columns from the outer query inside the inner query. Logically runs once per outer row — though most modern optimizers rewrite it as a JOIN or hash semi-join under the hood.
  • Existential subquery. Tests whether a related row exists — EXISTS, NOT EXISTS, IN, NOT IN. The boolean check is the only output; the inner query's SELECT list is ignored for EXISTS.

Why interviewers love the topic.

  • It's a plan-reading test in disguise. Knowing when an EXISTS short-circuits vs an IN materialises tells the interviewer you've read execution plans, not just SQL textbooks.
  • It's a NULL test. NOT IN with a NULL in the list silently returns zero rows. Candidates who don't anticipate it ship production-breaking SQL.
  • It's a trade-off test. The same query can usually be written three ways — subquery, JOIN, CTE — and only candidates who can compare the three confidently advance.

Reading a subquery from the inside out.

  • Start with the inner query — what does it return? One value? A list? A table? A boolean?
  • Look at where the inner sits inside the outer — SELECT, FROM, WHERE, or HAVING.
  • Ask yourself: does the inner reference any outer columns? If yes, it's correlated and you should expect per-row semantics in the logical model.
  • Finally, ask: can this be rewritten as a JOIN or a CTE? Usually the answer is yes, and a senior interviewer will probe both rewrites.

Subquery → JOIN → CTE: same logic, three syntaxes.

  • Subquery shape. Compact, inline, but easy to mis-read when nested deeply.
  • JOIN shape. Forces you to think in sets; usually the fastest plan for large inputs.
  • CTE shape. Most readable for multi-step pipelines; modern planners typically inline it.

The point: the logic of "customers with above-average orders" is one logical question. Whether you express it as a correlated subquery, a derived table joined to the customers table, or a CTE followed by a join is a style decision — and an optimizer hint, on some platforms.

Worked example — translate "customers who placed any order in 2026" three ways

Detailed explanation. A junior candidate writes this as a JOIN and de-duplicates with DISTINCT. A mid-level candidate writes it as IN. A senior candidate writes EXISTS and explains why it's the most defensible choice — short-circuits on the first match, NULL-safe, and rewrites to a semi-join under the hood.

Question. Given customers(customer_id, name) and orders(order_id, customer_id, order_date), list every customer who placed at least one order in 2026. Show three equivalent queries — JOIN, IN, EXISTS — and explain when each shines.

Input.

customer_id name
1 Alice
2 Bob
3 Charlie
4 Dana
order_id customer_id order_date
1001 1 2026-01-15
1002 1 2026-02-10
1003 2 2025-12-30
1004 3 2026-04-01

Code (three equivalent queries).

-- (a) JOIN + DISTINCT
SELECT DISTINCT c.customer_id, c.name
FROM   customers c
JOIN   orders    o ON o.customer_id = c.customer_id
WHERE  o.order_date >= '2026-01-01'
  AND  o.order_date <  '2027-01-01';

-- (b) IN with subquery
SELECT customer_id, name
FROM   customers
WHERE  customer_id IN (
         SELECT customer_id
         FROM   orders
         WHERE  order_date >= '2026-01-01'
           AND  order_date <  '2027-01-01'
       );

-- (c) EXISTS (the senior signal)
SELECT customer_id, name
FROM   customers c
WHERE  EXISTS (
         SELECT 1
         FROM   orders o
         WHERE  o.customer_id = c.customer_id
           AND  o.order_date >= '2026-01-01'
           AND  o.order_date <  '2027-01-01'
       );
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. JOIN + DISTINCT — explodes the customer row once per matching order, then collapses with DISTINCT. Wasted work when a customer has many orders. Also accidentally double-counts in aggregations if you forget the DISTINCT.
  2. IN with subquery — the planner materialises the inner result as a list of customer_ids, then probes each outer row against the list. Fine for small lists; awkward when the inner side has millions of rows.
  3. EXISTS — the planner short-circuits on the first matching order per customer. Modern planners typically rewrite IN and EXISTS as a SEMI JOIN, but EXISTS is the more declarative version of "is there a match?" — and it's safer with NULLs (more on that in §5).

Output.

customer_id name
1 Alice
3 Charlie

Rule of thumb. When the question is "does a matching row exist?", reach for EXISTS. When the question is "what are the values?", reach for a JOIN and shape the projection.

Worked example — reading a nested query from the inside out

Detailed explanation. A senior interviewer might hand you a five-line nested query and ask "what does this return, and in what order does the engine evaluate the pieces?" The right move is to read inside-out, narrating each level. This shape recurs in every senior-round whiteboard exercise on subqueries.

Question. Walk through how the engine evaluates the following query. Identify the inner-most subquery, the middle subquery, and the outer query, and state the cardinality each level returns.

Input — payments (compact sample).

payment_id customer_id amount status paid_at
1 42 120 paid 2026-04-12
2 42 80 paid 2026-04-18
3 43 200 failed 2026-04-20
4 44 60 paid 2026-05-01

Code.

SELECT customer_id, total_paid
FROM (
    SELECT customer_id,
           SUM(amount) AS total_paid
    FROM   payments
    WHERE  paid_at >= (
              SELECT MIN(paid_at)
              FROM   payments
              WHERE  status = 'paid'
           )
      AND  status = 'paid'
    GROUP  BY customer_id
) AS by_customer
WHERE total_paid > 100;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Inner-most subquery. SELECT MIN(paid_at) FROM payments WHERE status = 'paid' — uncorrelated scalar; returns one row × one column = '2026-04-12'. The planner evaluates it once.
  2. Middle subquery (derived table by_customer). Scans payments, filters to rows with paid_at >= '2026-04-12' AND status = 'paid', groups by customer_id, projects (customer_id, total_paid). Cardinality = one row per distinct paying customer.
  3. Outer query. Reads by_customer and keeps rows where total_paid > 100. Final cardinality = customers whose total paid amount exceeds 100.
  4. Why inside-out? SQL is declarative, but the execution still flows from innermost dependency to outermost projection. Reading inside-out gives you the engine's plan-shape mental model.

Output.

customer_id total_paid
42 200

Rule of thumb. When you stare at a nested query, find the innermost subquery first, write down what it returns (value, list, table), then move outward. The shape at each level constrains the syntax of the next.

SQL interview question on translating subquery to JOIN

A common senior probe: "You've written this correlated subquery and the planner shows nested-loop semantics; the table has 50 million rows. How would you rewrite it for the hash-join path?" — testing whether the candidate spots the rewrite from correlated subquery into a derived-table JOIN on a grouped aggregate.

Solution Using a correlated subquery rewritten as a JOIN

-- Original (correlated subquery — logically per-row)
SELECT c.customer_id, c.name
FROM   customers c
WHERE  EXISTS (
         SELECT 1
         FROM   orders o
         WHERE  o.customer_id = c.customer_id
           AND  o.order_date >= '2026-01-01'
       );

-- Rewritten (semi-join via INNER JOIN on a grouped derived table)
SELECT c.customer_id, c.name
FROM   customers c
JOIN  (SELECT DISTINCT customer_id
       FROM   orders
       WHERE  order_date >= '2026-01-01') o
  ON  o.customer_id = c.customer_id;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step Stage What happens
1 Inner derived table o scans orders, filters to 2026+, keeps distinct customer_id
2 Hash build planner builds a hash table on o.customer_id (size = distinct active customers)
3 Hash probe scans customers, probes each c.customer_id against the hash
4 Project emits c.customer_id, c.name for every probe hit

Output:

customer_id name
1 Alice
3 Charlie

Why this works — concept by concept:

  • Semi-join semantics — both shapes ask the same question ("does a matching row exist?"). The rewrite makes the hash-join plan explicit so the planner doesn't have to prove it can rewrite.
  • Distinct in the inner — collapses orders to one row per active customer, which keeps the join 1:1 and removes the need for DISTINCT on the outer projection.
  • Hash join over nested loop — for a large orders table, hash-join is O(N + M) versus the nested-loop O(N × matching_M). The rewrite makes the plan deterministic across platforms.
  • Modern planner caveat — Postgres, BigQuery, and Snowflake will often rewrite the original EXISTS to the same hash semi-join automatically. Older MySQL versions (and some constrained planners) will not — which is when the manual rewrite is worth the extra lines.
  • Cost — derived table = O(orders) scan + O(distinct customers) hash build; outer probe = O(customers); total = O(orders + customers).

SQL
Topic — subqueries
Subquery problems (SQL)

Practice →


The interview-grade subquery taxonomy

Before diving into the four shapes, it pays to internalise the taxonomy interviewers use when they grade you. Senior data-engineering rounds at FAANG, fintech, and analytics-heavy shops categorise subquery answers along three axes:

  • Result cardinality. Does the subquery return one value (scalar), one column with many rows (column subquery / IN), or a multi-column table (derived)? Each cardinality constrains where the subquery is allowed to appear.
  • Correlation. Does the inner reference an outer column? If yes, the logical model is "evaluate inner per outer row"; if no, the inner is independent and the engine evaluates it once.
  • Plan shape. What physical operator does the optimizer emit? A hash semi-join? A nested-loop lookup? A materialised CTE? A constant-folded literal? Interviewers reading your answer at the whiteboard expect you to name the physical operator, not just the syntax.

Worked translation table — the same logical question, four syntactic shapes.

Question Subquery shape JOIN equivalent Window-function equivalent
Customers with above-average orders Correlated WHERE x > (SELECT AVG ...) JOIN (GROUP BY customer) then filter AVG OVER (PARTITION BY customer)
Top-N per group Derived table + ROW_NUMBER self-join with rank predicate (slow) ROW_NUMBER OVER (PARTITION BY group)
Customer's most recent order Correlated scalar (SELECT MAX...) JOIN (MAX GROUP BY customer) ROW_NUMBER OVER (... ORDER BY DESC)
Customers with at least one order EXISTS INNER JOIN ... DISTINCT
Customers with no orders NOT EXISTS LEFT JOIN ... WHERE NULL

The exam-ready takeaway: all five rows are the same logical query expressed four different ways. A senior candidate can pick whichever is most readable for the audience and the engine, and explain why the plans converge (or don't).

Where each shape wins.

  • Correlated subquery — most English-like expression; safe choice when the dataset is small.
  • Derived table — wins for "aggregate then filter / join" pipelines and for window-then-filter.
  • CTE — wins for multi-stage pipelines (3+ stages) where named stages improve readability; same plan as derived table on modern engines.
  • JOIN — wins for large tables where the planner needs an obvious hash-join target.
  • Window function — wins for per-group ranks, running totals, and shifted comparisons.

The wrong choice is rarely "wrong" syntactically — it's "wrong" because it produces a plan the engine can't optimise. That distinction is what senior rounds test.


2. Scalar subqueries — one value, NULL gotchas

A scalar subquery returns exactly one row × one column — the simplest shape, the most NULL traps

The mental model in one line: a scalar subquery is a SELECT that the SQL engine treats as a single value; if it returns zero rows you silently get NULL, if it returns more than one row you get a runtime error. Senior interviewers love this shape because it is deceptively simple — the gotchas are at the boundaries (zero rows, more-than-one row, and the column-expression vs predicate distinction).

Where a scalar subquery is allowed.

  • SELECT clause. As a column expression — e.g. SELECT name, (SELECT MAX(price) FROM products) AS top_price FROM customers.
  • WHERE clause. As an operand of a comparison — e.g. WHERE salary > (SELECT AVG(salary) FROM emp).
  • HAVING clause. Similarly — HAVING SUM(amount) > (SELECT AVG(total) FROM totals).
  • VALUES / UPDATE / INSERT. Anywhere SQL expects a single value.

The three failure modes.

  • Exactly one row × one column. The happy path. SQL substitutes the value as if you had written a literal.
  • Zero rows. Silently substitutes NULL. This is the most common source of bugs — a filter that should match never does, or an aggregate that should subtract zero subtracts NULL and corrupts the whole expression.
  • More than one row. Most engines raise subquery returned more than 1 row (Postgres) or Subquery returned more than 1 value (SQL Server). Some MySQL versions return the first row silently, which is even worse — the query "works" until the data shape changes.

Cardinality enforcement keywords.

  • Postgres / Oracle / SQL Server enforce the one-row rule at runtime — if the subquery returns >1 row the engine raises.
  • MySQL: ANSI-mode raises; legacy mode returns the first row. Never rely on legacy mode.
  • BigQuery: enforced; the optimizer often errors at plan time if it can prove >1 row.

Tuning intuition — when scalar subqueries hurt.

  • A scalar subquery in the SELECT clause can run once per outer row if the planner cannot prove it's constant. Wrap it in a CTE or a derived table to force single evaluation.
  • A scalar subquery in a WHERE clause is usually evaluated once and treated as a constant.
  • If the subquery references the outer query, it's no longer scalar in the strict sense — it's a correlated subquery whose result per row happens to be scalar (see §4).

Worked example — embed a scalar subquery in the SELECT clause

Detailed explanation. A reporting query wants every order alongside the maximum price across all orders. The scalar subquery gives a single value that the planner can hoist out of the outer scan and reuse for every row.

Question. Given an orders table, return every order with the global maximum price displayed alongside the order's own price. Compare a scalar-subquery approach to a window-function approach.

Input.

order_id price
1001 19.99
1002 49.50
1003 12.00
1004 88.40

Code (scalar subquery).

SELECT order_id,
       price,
       (SELECT MAX(price) FROM orders) AS max_price
FROM   orders;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The planner sees the inner SELECT MAX(price) FROM orders is uncorrelated — it doesn't reference the outer row.
  2. It runs once, returns the single value 88.40, and caches it as a constant.
  3. The outer scan returns every row from orders and projects (order_id, price, 88.40).
  4. Total work = one extra scan for the inner (a single aggregate pass) + the outer scan. No per-row penalty.

Output.

order_id price max_price
1001 19.99 88.40
1002 49.50 88.40
1003 12.00 88.40
1004 88.40 88.40

Rule of thumb. When the scalar is constant across the whole result set, the scalar subquery and a window function (MAX(price) OVER ()) generate the same plan. When the scalar varies per row, the subquery is correlated and the right tool is usually a window function partitioned by the grouping column.

Worked example — the zero-row NULL trap

Detailed explanation. A scalar subquery that filters by a non-matching key silently produces NULL. If you then use that NULL in an arithmetic expression, the whole expression collapses to NULL and downstream consumers may treat it as "no value" or as zero — both wrong.

Question. A team queries the latest discount for customer 999 — a customer who has no discount record. What does the result look like and why is it dangerous?

Input.

customer_id discount_pct
1 10
2 5
3 15

Code.

SELECT 100 - (SELECT discount_pct FROM discounts WHERE customer_id = 999) AS net_pct;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The inner subquery searches discounts for customer_id = 999.
  2. No row matches → the inner returns zero rows.
  3. Per ANSI rules, the scalar value is NULL.
  4. The outer expression 100 - NULL is NULL (any arithmetic with NULL is NULL).
  5. The application reads NULL as "no discount applied" — silently selling at full price when the business logic expected an error.

Output.

net_pct
NULL

The fix — COALESCE or an explicit EXISTS guard.

SELECT 100 - COALESCE(
         (SELECT discount_pct FROM discounts WHERE customer_id = 999),
         0
       ) AS net_pct;
Enter fullscreen mode Exit fullscreen mode

Output after fix.

net_pct
100

Rule of thumb. Every scalar subquery in a SELECT or arithmetic context should either be guaranteed to return one row (by data integrity constraint) or wrapped in COALESCE with a sensible default. Don't trust silent NULL to mean what your business logic expects.

Worked example — the more-than-one-row runtime error

Detailed explanation. The other failure mode of a scalar subquery is "returned more than 1 row." Junior engineers often write subqueries that happen to return one row in dev because the dataset is small, then watch them blow up in prod when the data shape changes. Senior interviewers test whether you can spot the latent risk before it ships.

Question. A team writes the query below to label each order with the customer's email. In dev with 10 rows it works; in prod it raises subquery returned more than 1 row. Find the bug and propose a fix.

Input — customers.

customer_id email
1 alice@x.com
1 alice.duplicate@x.com
2 bob@x.com

Input — orders.

order_id customer_id
1001 1
1002 2

Code (the trap).

SELECT o.order_id,
       (SELECT email FROM customers c WHERE c.customer_id = o.customer_id) AS email
FROM   orders o;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. For order_id = 1001, the inner subquery scans customers for customer_id = 1 and finds two rows (alice@x.com and alice.duplicate@x.com).
  2. Postgres raises ERROR: more than one row returned by a subquery used as an expression.
  3. The root cause is a missing uniqueness constraint on customers.customer_id. The data model assumed one row per customer; the data violated the assumption.
  4. The right fix is twofold: (a) add the constraint at the schema level to prevent recurrence; (b) until the schema is fixed, defensively wrap the subquery with LIMIT 1 and an explicit ORDER BY so the chosen row is deterministic.

The defensive fix.

SELECT o.order_id,
       (SELECT email
        FROM   customers c
        WHERE  c.customer_id = o.customer_id
        ORDER BY c.created_at DESC
        LIMIT 1) AS email
FROM   orders o;
Enter fullscreen mode Exit fullscreen mode

Output after fix.

order_id email
1001 alice.duplicate@x.com
1002 bob@x.com

Rule of thumb. A scalar subquery in SELECT is a cardinality contract. If you can't prove the inner returns exactly one row, either add the schema constraint or add LIMIT 1 + ORDER BY to make the choice deterministic.

SQL interview question on scalar subquery vs window function

A common probe: "Return every employee with their salary and the company-wide average salary, in one query. Two approaches?" — testing whether the candidate recognises that the scalar-subquery form is fine when the aggregate is constant, but reaches for a window function when partition-aware aggregates are needed.

Solution Using a scalar subquery and a window function side by side

-- (a) Scalar subquery — global average
SELECT e.emp_id,
       e.name,
       e.salary,
       (SELECT AVG(salary) FROM employees) AS company_avg
FROM   employees e;

-- (b) Window function — per-department average
SELECT e.emp_id,
       e.name,
       e.dept_id,
       e.salary,
       AVG(e.salary) OVER (PARTITION BY e.dept_id) AS dept_avg
FROM   employees e;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step Form (a) — scalar subquery Form (b) — window function
1 Inner SELECT AVG(salary) FROM employees runs once One scan of employees; planner partitions by dept_id in-flight
2 Returns one value, treated as a literal Per row, planner attaches the partition's AVG(salary)
3 Outer scan projects (emp_id, name, salary, const_avg) Outer scan projects (emp_id, name, dept_id, salary, dept_avg)
4 Total work = 1 aggregate + 1 scan Total work = 1 scan with windowed aggregate (sort or hash partitioning under the hood)

Output (form a, sample).

emp_id name salary company_avg
101 Ada 120000 88500
102 Brian 80000 88500
103 Cara 60000 88500

Output (form b, sample).

emp_id name dept_id salary dept_avg
101 Ada ENG 120000 100000
102 Brian ENG 80000 100000
103 Cara SALES 60000 60000

Why this works — concept by concept:

  • Scalar subquery as constant — when the aggregate is global (no PARTITION), the planner evaluates the subquery once and substitutes a literal everywhere. Fast and predictable.
  • Window function for per-group — when the aggregate varies by group, the window function is the right tool because it computes the aggregate per partition in a single pass instead of forcing a correlated subquery to run per row.
  • Plan readability — a window function makes the intent of "per-department average" obvious; a correlated subquery hides it inside a nested expression.
  • NULL semanticsAVG ignores NULLs natively; if some salaries are NULL, both forms exclude them automatically.
  • Cost — scalar form = O(N) + O(1); window form = O(N) with a sort or hash; cheaper than the correlated equivalent which would be O(N × group_size).

SQL
Topic — subqueries (SQL)
Scalar and embedded subquery problems

Practice →


3. Derived tables — FROM clause subqueries

A derived table is a subquery in FROM — same semantics as a CTE, different syntax

The mental model in one line: a derived table is a SELECT placed in the FROM clause of an outer query, treated by the outer query as if it were a real table. Most modern engines treat a derived table and a WITH (CTE) as semantically equivalent — the difference is mostly stylistic, with a few platform-specific optimizer hints.

Why derived tables exist.

  • Aggregate-then-join. You want to aggregate one table before joining the result to another. The derived table computes the aggregate; the outer JOIN consumes the rolled-up rows.
  • Window-then-filter. Window functions can't appear in WHERE. So you compute the window in a derived table, then filter on the result in the outer WHERE.
  • Self-reference reshape. Two passes over the same table — e.g. one to compute a per-customer max, one to find the row that matches that max.

The platform alias rules.

  • MySQL. Derived tables must have an alias — (SELECT ...) AS sub. Without it, MySQL raises Every derived table must have its own alias.
  • Postgres. Alias optional but strongly recommended; tools and humans both need it.
  • SQL Server. Alias required.
  • Oracle. Alias optional, but referencing columns requires one.
  • BigQuery / Snowflake. Alias optional but recommended.

Derived table vs CTE — when to prefer which.

  • Derived table wins when the subquery is used once and is small enough that inlining keeps the query compact.
  • CTE wins when the subquery is used more than once, or when the query is long enough that named, top-of-query pipeline stages improve readability.
  • Optimizer caveat (Postgres ≤ 11). CTEs were materialised — a fence the planner couldn't push predicates through. Postgres 12+ treats them as inline unless MATERIALIZED is explicit. Other engines (BigQuery, Snowflake) always treat CTEs as inline.
  • Optimizer caveat (SQL Server). Both shapes inline by default; the planner picks the plan, not the syntax.

Common shape — aggregate-then-join.

SELECT r.region, r.target, sub.total
FROM   region_targets r
JOIN  (SELECT region, SUM(amount) AS total
       FROM   orders
       GROUP  BY region) AS sub
  ON   sub.region = r.region;
Enter fullscreen mode Exit fullscreen mode

The derived table sub is a virtual table with two columns: region and total. The outer query joins it to region_targets like any other table.

Worked example — top 5 customers per region using a derived table

Detailed explanation. This is one of the most-asked SQL interview shapes: "top-N per group". The cleanest pattern is a derived table with ROW_NUMBER() followed by an outer filter. The derived table does the windowed assignment; the outer WHERE keeps only the top-N.

Question. Find the top 5 customers by total spend within each region. Use a derived table with ROW_NUMBER() partitioned by region.

Input.

customer_id region total_spend
1 EU 12000
2 EU 9800
3 EU 7600
4 EU 5400
5 EU 4100
6 EU 2900
7 US 22000
8 US 15000

Code.

SELECT customer_id, region, total_spend, rnk
FROM (
    SELECT customer_id,
           region,
           total_spend,
           ROW_NUMBER() OVER (PARTITION BY region
                              ORDER BY total_spend DESC) AS rnk
    FROM   customer_totals
) AS ranked
WHERE rnk <= 5;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The derived table ranked computes a window ROW_NUMBER() over each region, ordered by spend descending.
  2. Every row in customer_totals is assigned rnk = 1, 2, 3, ... within its region.
  3. The outer SELECT filters to rnk <= 5, keeping only the top 5 customers per region.
  4. The filter cannot be inside the derived table because window functions are not allowed in WHERE — that's the whole point of this pattern.

Output.

customer_id region total_spend rnk
1 EU 12000 1
2 EU 9800 2
3 EU 7600 3
4 EU 5400 4
5 EU 4100 5
7 US 22000 1
8 US 15000 2

Rule of thumb. Window-then-filter is the canonical derived-table pattern. The window goes inside the derived table; the rank filter goes in the outer WHERE.

Worked example — aggregate-then-join for region health

Detailed explanation. Reporting often needs "current values" joined to "rollups". The cleanest pattern: derive the rollup, then JOIN it back to the dimension table.

Question. Compute each region's total order amount and join it to a region_targets table to compute attainment_pct.

Input — orders.

order_id region amount
1001 EU 200
1002 EU 350
1003 US 800
1004 US 100
1005 APAC 90

Input — region_targets.

region target
EU 1000
US 1500
APAC 500

Code.

SELECT r.region,
       r.target,
       sub.total,
       ROUND(100.0 * sub.total / r.target, 1) AS attainment_pct
FROM   region_targets r
LEFT JOIN (
    SELECT region, SUM(amount) AS total
    FROM   orders
    GROUP  BY region
) AS sub
  ON sub.region = r.region;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The derived table sub aggregates orders by region, returning one row per region.
  2. The outer LEFT JOIN keeps every region from region_targets (so a region with zero orders still appears).
  3. The outer projection computes attainment_pct from the joined columns.
  4. LEFT JOIN over INNER JOIN is deliberate — we want to see regions that had zero orders too, with a NULL total that downstream COALESCE can convert to zero.

Output.

region target total attainment_pct
EU 1000 550 55.0
US 1500 900 60.0
APAC 500 90 18.0

Rule of thumb. When the rollup is keyed by a dimension that the outer table already has, a derived table joined back is almost always cleaner than a correlated subquery.

Worked example — derived table for self-join "compare to previous month"

Detailed explanation. "Compare this month's revenue to last month's revenue per region" is a classic reporting shape. A derived table per side and a join on the shifted month is the cleanest pattern when window functions aren't permitted (legacy MySQL versions) or when the reader prefers an explicit self-join.

Question. Given monthly regional revenue, return each region's revenue this month, revenue last month, and the percentage change.

Input — monthly_revenue.

region month revenue
EU 2026-03 8000
EU 2026-04 10000
US 2026-03 12000
US 2026-04 11000
APAC 2026-04 3000

Code (two derived tables joined on shifted month).

SELECT t.region,
       t.month,
       t.revenue       AS this_month,
       p.revenue       AS last_month,
       ROUND(100.0 * (t.revenue - p.revenue) / NULLIF(p.revenue, 0), 1) AS pct_change
FROM (
    SELECT region, month, revenue
    FROM   monthly_revenue
    WHERE  month = '2026-04'
) AS t
LEFT JOIN (
    SELECT region, revenue
    FROM   monthly_revenue
    WHERE  month = '2026-03'
) AS p
  ON p.region = t.region;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Derived table t materialises the current month's row per region — one row per active region this month.
  2. Derived table p materialises the previous month's row per region — one row per active region last month.
  3. The outer LEFT JOIN matches by region; a region that exists this month but not last month gets a NULL last_month (handled by NULLIF to avoid divide-by-zero).
  4. The projection computes the percentage change, rounded for readability.

Output.

region month this_month last_month pct_change
EU 2026-04 10000 8000 25.0
US 2026-04 11000 12000 -8.3
APAC 2026-04 3000 NULL NULL

Rule of thumb. When the comparison is over a shifted dimension and the engine lacks window functions, two derived tables joined on the shifted key is the cleanest expression. With window functions available, LAG() is even cleaner — LAG(revenue) OVER (PARTITION BY region ORDER BY month).

SQL interview question on derived table vs CTE for a multi-step pipeline

A common probe: "We need to identify each customer's second most recent order. Walk through the query and explain why you chose derived table vs CTE." — testing whether the candidate picks the most readable shape for a multi-stage pipeline.

Solution Using a derived table with ROW_NUMBER

SELECT customer_id, order_id, order_date
FROM (
    SELECT customer_id,
           order_id,
           order_date,
           ROW_NUMBER() OVER (PARTITION BY customer_id
                              ORDER BY order_date DESC) AS rn
    FROM   orders
) AS ranked
WHERE rn = 2;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step Stage What happens
1 Derived scan reads orders once
2 Window assignment assigns rn = 1, 2, 3, ... per customer in date-desc order
3 Outer filter keeps only rows where rn = 2
4 Project emits (customer_id, order_id, order_date) for the second-most-recent order

Sample input.

customer_id order_id order_date
1 1001 2026-01-15
1 1002 2026-02-10
1 1003 2026-03-05
2 1004 2026-04-01
2 1005 2026-04-02

Output:

customer_id order_id order_date
1 1002 2026-02-10
2 1004 2026-04-01

Why this works — concept by concept:

  • Window-then-filter — windowed ROW_NUMBER cannot appear in WHERE, so the derived table is the right shape: assign in the inner, filter in the outer.
  • PARTITION BY customer_id — restarts the numbering for each customer, so rn=2 means "second per customer", not "second overall".
  • ORDER BY order_date DESC — defines what "second-most-recent" means; ties should be broken with a secondary ORDER BY like order_id DESC if the data permits same-date orders.
  • Derived table vs CTE — for a single-use pipeline like this, a derived table is the most compact shape. For three or more pipeline stages, switch to CTEs purely for readability.
  • Cost — one scan of orders + one sort or hash for the window = O(N log N) in the worst case; the outer filter is O(N).

SQL
Topic — subqueries (SQL)
Derived table and top-N per group problems

Practice →


4. Correlated subqueries — referencing the outer query

A correlated subquery references the outer row — logically runs once per outer row

The mental model in one line: a correlated subquery is an inner SELECT that references one or more columns from the outer query; logically it executes once per outer row, although most modern planners rewrite it as a hash join or semi-join under the hood. This is the most senior-signal shape — using it correctly tells the interviewer you can think in terms of "per-row predicate logic" rather than just set operations.

What makes a subquery correlated.

  • The inner SELECT references one or more columns from the outer query's table aliases.
  • The inner cannot be evaluated independently — it depends on the outer row currently being processed.
  • Logically, the engine evaluates the inner once per outer row; physically, the optimizer is allowed to rewrite this as a single join.

Where correlated subqueries show up.

  • WHERE predicate. "Customers with above-average orders" — the inner computes the average for that customer.
  • SELECT column. "Each customer's most recent order date" — the inner finds the max date for that customer.
  • EXISTS test. "Customers who placed any order" — the inner tests existence per outer row.

Two questions to ask before writing a correlated subquery.

  • Can I write this with a GROUP BY and a JOIN instead? If yes, that's usually faster and more obvious.
  • Can I write this with a window function? If yes (per-row position or per-row rank), the window is almost always faster.

When correlated subqueries are the right answer.

  • The predicate logic depends on a per-row scalar (e.g. "more than this row's department's average").
  • The query needs to be self-documenting — the correlated form often reads like the English question.
  • The dataset is small enough that the per-row cost doesn't matter (lookups inside a report, dashboards, etc.).

Performance — when correlated subqueries hurt.

  • The naïve plan is O(N × M) — N outer rows, M inner scan per row.
  • Modern planners (Postgres ≥ 12, BigQuery, Snowflake) rewrite to O(N + M) via hash semi-join.
  • Older MySQL versions (≤ 5.7) cannot always rewrite — the naïve nested-loop plan kicks in and the query becomes slow.
  • The fix is usually a manual JOIN on a GROUP BY derived table or a CTE.

EXISTS vs IN — both can be correlated, with different NULL semantics.

  • WHERE EXISTS (SELECT 1 FROM ...) — boolean test, short-circuits, NULL-safe.
  • WHERE x IN (SELECT y FROM ...) — membership test; if any inner y is NULL, the negation (NOT IN) trips into the NULL trap (see §5).

Worked example — customers with above-average orders

Detailed explanation. A classic interview shape: for each customer, find the orders that exceeded that customer's average. The inner subquery is correlated because it references the outer customer_id to compute the per-customer average.

Question. Given orders(order_id, customer_id, amount), list each order whose amount exceeds the average amount for that order's customer.

Input.

order_id customer_id amount
1001 1 200
1002 1 50
1003 1 800
1004 2 100
1005 2 100
1006 2 100
1007 3 50
1008 3 250

Code (correlated subquery).

SELECT o.order_id, o.customer_id, o.amount
FROM   orders o
WHERE  o.amount > (
         SELECT AVG(amount)
         FROM   orders i
         WHERE  i.customer_id = o.customer_id
       );
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Logically, the planner walks every row in the outer orders o.
  2. For each outer row, it computes AVG(amount) over the inner orders i filtered to i.customer_id = o.customer_id. Customer 1's average is (200 + 50 + 800) / 3 = 350. Customer 2's average is 100. Customer 3's average is 150.
  3. It keeps the outer row if o.amount > the_per_customer_avg.
  4. Customer 1: orders 1001 (200), 1002 (50), 1003 (800). Average = 350. Only 1003 exceeds. Customer 2: all three orders equal the average → none exceed (strictly greater than). Customer 3: orders 1007 (50), 1008 (250). Average = 150. Only 1008 exceeds.

Output.

order_id customer_id amount
1003 1 800
1008 3 250

Rule of thumb. When the predicate references a per-row aggregate, the correlated subquery is the clearest form. If performance matters and the dataset is large, rewrite as a derived-table JOIN with a grouped average.

Worked example — running total via correlated subquery (and the window-function rewrite)

Detailed explanation. "Running total" is the canonical "correlated subquery vs window function" head-to-head. The correlated form reads like the textbook definition; the window form is faster on every modern engine. Interviewers expect you to write the correlated version first and then immediately propose the window-function rewrite.

Question. Given a sequence of orders ordered by date, return each order with the running total of amount up to and including that order.

Input — orders.

order_id order_date amount
1001 2026-04-01 50
1002 2026-04-02 80
1003 2026-04-03 30
1004 2026-04-04 120

Code (correlated form — clear but slow).

SELECT o.order_id,
       o.order_date,
       o.amount,
       (SELECT SUM(amount)
        FROM   orders i
        WHERE  i.order_date <= o.order_date) AS running_total
FROM   orders o
ORDER  BY o.order_date;
Enter fullscreen mode Exit fullscreen mode

Code (window-function rewrite — fast).

SELECT order_id,
       order_date,
       amount,
       SUM(amount) OVER (ORDER BY order_date
                         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM   orders
ORDER  BY order_date;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The correlated form computes a per-row aggregate over all prior rows. Naively O(N²) — each of N rows triggers an O(N) scan.
  2. The window form computes the running sum in a single pass; the engine maintains an accumulator as it walks the ordered stream. O(N) total.
  3. Both return identical results on the sample dataset; the plans diverge dramatically on large tables.
  4. The ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW frame makes the running-sum semantics explicit; the default RANGE frame is the same here but differs when there are ties on order_date.

Output (both forms).

order_id order_date amount running_total
1001 2026-04-01 50 50
1002 2026-04-02 80 130
1003 2026-04-03 30 160
1004 2026-04-04 120 280

Rule of thumb. Anytime you see a correlated subquery whose inner is SUM, COUNT, MIN, MAX, or AVG over rows "up to / since / above / below" the outer row, that's a window-function candidate. Rewrite for an O(N) plan instead of O(N²).

Worked example — each customer's most recent order

Detailed explanation. "Each customer's latest X" is another shape that maps naturally to a correlated subquery in the SELECT clause. The cleaner modern form is a window function (ROW_NUMBER), but the correlated shape is still tested in interviews.

Question. Return each customer's most recent order date.

Input.

order_id customer_id order_date
1001 1 2026-01-15
1002 1 2026-03-10
1003 2 2026-02-22
1004 2 2026-04-05
1005 3 2026-05-01

Code (correlated scalar subquery in SELECT).

SELECT c.customer_id,
       c.name,
       (SELECT MAX(o.order_date)
        FROM   orders o
        WHERE  o.customer_id = c.customer_id) AS last_order_date
FROM   customers c;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. For each row in customers c, the planner computes the inner MAX(order_date) filtered to that customer.
  2. If the customer has no orders, the inner returns zero rows and the scalar evaluates to NULL — the outer projection shows NULL for last_order_date, which is the correct semantics.
  3. Modern planners usually rewrite this as a left join on a grouped derived table: (customers c LEFT JOIN (SELECT customer_id, MAX(order_date) FROM orders GROUP BY customer_id) g ON g.customer_id = c.customer_id).

Output.

customer_id name last_order_date
1 Alice 2026-03-10
2 Bob 2026-04-05
3 Charlie 2026-05-01
4 Dana NULL

Rule of thumb. Correlated subquery in SELECT is fine for small lookups and reports. For warehouse-scale data, prefer the explicit grouped derived-table LEFT JOIN.

Worked example — employees who out-earn their department average

Detailed explanation. "Find rows where the value exceeds the group's average" is the textbook correlated-subquery interview shape. The inner aggregate is parameterised by the outer row's group column — that's exactly what "correlated" means.

Question. Return every employee whose salary is strictly greater than the average salary in their department.

Input — employees.

emp_id dept_id salary
101 ENG 120000
102 ENG 90000
103 ENG 70000
104 SALES 80000
105 SALES 60000
106 SALES 60000

Code (correlated form).

SELECT e.emp_id, e.dept_id, e.salary
FROM   employees e
WHERE  e.salary > (
         SELECT AVG(salary)
         FROM   employees i
         WHERE  i.dept_id = e.dept_id
       );
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. For each outer row, the inner computes AVG(salary) for the same dept_id.
  2. ENG average = (120000 + 90000 + 70000) / 3 ≈ 93333. SALES average = (80000 + 60000 + 60000) / 3 ≈ 66667.
  3. Keep emp 101 (120000 > 93333 ✓), drop emp 102 (90000 > 93333 ✗), drop 103. Keep 104 (80000 > 66667 ✓), drop 105, drop 106.
  4. The correlated reference is i.dept_id = e.dept_id — that's what wires the inner aggregate to the outer row.

Output.

emp_id dept_id salary
101 ENG 120000
104 SALES 80000

The faster rewrite (window function).

SELECT emp_id, dept_id, salary
FROM (
    SELECT emp_id,
           dept_id,
           salary,
           AVG(salary) OVER (PARTITION BY dept_id) AS dept_avg
    FROM   employees
) AS x
WHERE salary > dept_avg;
Enter fullscreen mode Exit fullscreen mode

Rule of thumb. "Above-group-average" is the canonical correlated-subquery interview shape — and the canonical "rewrite as a window function" follow-up. Expect both.

SQL interview question on rewriting a correlated subquery as a JOIN

A common probe: "Your correlated subquery returns the right answer but EXPLAIN shows nested-loop semantics and the table is 200M rows. Rewrite for the hash-join path." — testing whether the candidate spots the manual rewrite path.

Solution Using a grouped derived table joined back to the outer table

-- Original — correlated subquery
SELECT o.order_id, o.customer_id, o.amount
FROM   orders o
WHERE  o.amount > (
         SELECT AVG(amount)
         FROM   orders i
         WHERE  i.customer_id = o.customer_id
       );

-- Rewritten — JOIN against a grouped derived table
SELECT o.order_id, o.customer_id, o.amount
FROM   orders o
JOIN  (SELECT customer_id, AVG(amount) AS avg_amt
       FROM   orders
       GROUP  BY customer_id) g
  ON  g.customer_id = o.customer_id
WHERE  o.amount > g.avg_amt;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step Stage What happens
1 Inner aggregate one scan of orders, groups by customer_id, computes per-customer average
2 Hash build build hash on g.customer_id (size = distinct customers, small)
3 Outer scan scan orders once
4 Hash probe + filter for each outer row, probe the hash and keep rows where amount > avg_amt
5 Project emit (order_id, customer_id, amount)

Output:

order_id customer_id amount
1003 1 800
1008 3 250

Why this works — concept by concept:

  • Per-row predicate becomes per-group constant — the grouped derived table computes avg_amt once per customer and the outer join probes it in O(1). The naïve correlated form would recompute the average for every row.
  • Hash join over nested loop — the planner builds a hash table on customer_id once (M entries) and probes it N times. Total cost = O(M + N), versus O(M × N) for the naïve correlated nested loop.
  • Same answer, different plan — every row that satisfied the original correlated predicate also satisfies the join + filter, and vice versa. The rewrite is an optimizer hint, not a logic change.
  • Modern planner caveat — Postgres 13+, BigQuery, and Snowflake will often rewrite the correlated form to the same hash semi-join. Older MySQL and SQL Server before 2017 may not. Run EXPLAIN to verify.
  • Cost — O(N + M) instead of O(N × M); N = order count, M = distinct customer count; typical speedup is 10× to 1000× on large tables.

SQL
Topic — subqueries
Correlated subquery rewrite problems

Practice →


5. EXISTS / NOT EXISTS / IN / NOT IN — four ways to ask "is there?"

exists sql short-circuits; IN materialises; NOT IN is the NULL footgun

The mental model in one line: EXISTS, NOT EXISTS, IN, and NOT IN are four operators that test row existence, but they differ in short-circuit behaviour, selectivity preference, and NULL semantics. Senior interviews almost always test whether you can pick the right one for a given dataset shape, not whether you know they exist.

The four operators in one sentence each.

  • EXISTS — returns TRUE as soon as the inner subquery produces one row; short-circuits on the first match. The inner SELECT list is irrelevant — SELECT 1, SELECT *, anything works.
  • IN — tests whether the outer expression appears in the inner result set (or literal list). The planner materialises the inner side into a list or hash.
  • NOT EXISTSTRUE when the inner subquery returns zero rows. NULL-safe by definition.
  • NOT INTRUE when the outer value is not present in the inner list. Returns no rows if any inner value is NULL — the classic NULL footgun.

The NULL footgun explained.

NOT IN (...NULL...) evaluates to UNKNOWN for every row, because SQL three-valued logic says "is x not in a set that contains NULL?" cannot be answered. UNKNOWN filtered through WHERE is treated as FALSE, so the query returns zero rows. Always replace NOT IN with NOT EXISTS when the inner side might contain NULL.

-- ⚠️ Returns ZERO rows if any inner customer_id is NULL
SELECT * FROM customers
WHERE customer_id NOT IN (SELECT customer_id FROM blacklist);

-- ✓ Always-correct NULL-safe alternative
SELECT * FROM customers c
WHERE NOT EXISTS (
  SELECT 1 FROM blacklist b WHERE b.customer_id = c.customer_id
);
Enter fullscreen mode Exit fullscreen mode

The four-quadrant decision matrix.

Selectivity vs NULL-safety Inner side might be NULL → need NULL-safe Inner side guaranteed non-NULL
Inner is highly selective (few rows) NOT EXISTS for negation; EXISTS for affirmation EXISTS (short-circuits fast)
Inner is small literal list NOT EXISTS (still safer); or use COALESCE guard IN (literal_list) is fine

JOIN equivalents.

  • EXISTSSEMI JOIN — return outer rows that have a match; no row duplication regardless of how many inner rows match.
  • NOT EXISTSANTI JOIN — return outer rows that have no match.
  • IN ≅ semi-join after deduplication of the inner.
  • NOT IN ≅ anti-join only if the inner is guaranteed non-NULL; otherwise three-valued-logic semantics diverge.

Why modern planners often pick the same plan for IN and EXISTS.

Both shapes ask "is there a matching row?". A semi-join is the canonical physical plan for that question. Postgres, BigQuery, Snowflake, SQL Server (2017+), and modern MySQL all converge IN (SELECT ...) and EXISTS (SELECT 1 FROM ... WHERE ...) to the same hash semi-join. The reason to write one over the other is mostly stylistic, with one exception: NULL semantics for the negated forms (NOT IN vs NOT EXISTS).

Worked example — find customers with no orders

Detailed explanation. "Find the rows in A that have no match in B" is the classic anti-join shape. NOT EXISTS is the canonical, NULL-safe form. NOT IN is dangerous; LEFT JOIN ... WHERE inner IS NULL is equivalent but slightly more verbose.

Question. List every customer who has never placed an order.

Input — customers.

customer_id name
1 Alice
2 Bob
3 Charlie
4 Dana

Input — orders.

order_id customer_id
1001 1
1002 3

Code (the right way: NOT EXISTS).

SELECT c.customer_id, c.name
FROM   customers c
WHERE  NOT EXISTS (
         SELECT 1
         FROM   orders o
         WHERE  o.customer_id = c.customer_id
       );
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. For each customer row, the planner scans orders looking for any row with o.customer_id = c.customer_id.
  2. As soon as one match is found, the inner returns TRUE and NOT EXISTS returns FALSE — that customer is dropped.
  3. If the inner scan finishes with zero matches, NOT EXISTS returns TRUE — the customer is kept.
  4. Under the hood, the planner builds a hash table on orders.customer_id and probes per outer row, dropping any outer row whose customer_id is in the hash. This is the ANTI JOIN plan.

Output.

customer_id name
2 Bob
4 Dana

Rule of thumb. Use NOT EXISTS for every "no match" question. Use NOT IN only if the inner side is a small literal list with no NULLs.

Worked example — the NOT IN NULL trap

Detailed explanation. A team writes a "customers not in the blacklist" query with NOT IN. One blacklist row has a NULL customer_id due to a data-load bug. The query silently returns zero rows in production. The interviewer wants to know whether you can spot it.

Question. Given a blacklist table where one row has customer_id = NULL, what does SELECT * FROM customers WHERE customer_id NOT IN (SELECT customer_id FROM blacklist) return?

Input — customers.

customer_id name
1 Alice
2 Bob
3 Charlie

Input — blacklist.

customer_id
2
NULL

Code (the trap).

SELECT customer_id, name
FROM   customers
WHERE  customer_id NOT IN (SELECT customer_id FROM blacklist);
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The inner subquery materialises (2, NULL).
  2. The outer evaluates 1 NOT IN (2, NULL) — three-valued logic says: 1 != 2 is TRUE, but 1 != NULL is UNKNOWN. The whole expression collapses to UNKNOWN.
  3. WHERE UNKNOWN is filtered as FALSE, so row 1 is dropped.
  4. Same logic for rows 2 and 3 — every row produces UNKNOWN and is dropped. The query returns zero rows.

Output (the trap).

| (empty result set) |

The fix — NOT EXISTS.

SELECT c.customer_id, c.name
FROM   customers c
WHERE  NOT EXISTS (
         SELECT 1 FROM blacklist b
         WHERE  b.customer_id = c.customer_id
       );
Enter fullscreen mode Exit fullscreen mode

Output after fix.

customer_id name
1 Alice
3 Charlie

Why NOT EXISTS is safe. It tests row existence, not value equality. The inner WHERE b.customer_id = c.customer_id cannot match a NULL value (because NULL = NULL is UNKNOWN, not TRUE), so the NULL row in the blacklist simply doesn't participate.

Rule of thumb. Treat NOT IN with a subquery as legacy code unless you can prove the inner side is non-NULL by constraint. NOT EXISTS is always safe.

Worked example — two-column EXISTS on a composite key

Detailed explanation. Some existence checks are composite — "does a row exist with the same (tenant_id, sku) pair?" IN cannot express this on every engine without a row constructor; EXISTS always can. This shape is common in multi-tenant data platforms.

Question. Given an inventory(tenant_id, sku, qty) table and an orders(tenant_id, sku, qty) table, find orders for which there is no matching inventory row for the same (tenant_id, sku) pair.

Input — inventory.

tenant_id sku qty
T1 A 50
T1 B 10
T2 A 0

Input — orders.

order_id tenant_id sku qty
1 T1 A 5
2 T1 C 2
3 T2 A 1
4 T3 A 1

Code (NOT EXISTS with a two-column correlation).

SELECT o.order_id, o.tenant_id, o.sku, o.qty
FROM   orders o
WHERE  NOT EXISTS (
         SELECT 1
         FROM   inventory i
         WHERE  i.tenant_id = o.tenant_id
           AND  i.sku       = o.sku
       );
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. For each order row, the planner scans inventory for any row matching both tenant_id and sku.
  2. Order 1 (T1, A) — matches inventory(T1, A) → drop.
  3. Order 2 (T1, C) — no match → keep.
  4. Order 3 (T2, A) — matches inventory(T2, A) → drop (even though qty = 0, the row exists).
  5. Order 4 (T3, A) — no inventory row for tenant T3 at all → keep.
  6. Under the hood, the planner builds a hash on (inventory.tenant_id, inventory.sku) and probes per outer row.

Output.

order_id tenant_id sku qty
2 T1 C 2
4 T3 A 1

Rule of thumb. Composite-key existence tests almost always belong in EXISTS / NOT EXISTS. The row-constructor (a, b) IN (...) syntax is supported by Postgres but not every engine; the EXISTS form is universal.

SQL interview question on EXISTS vs IN for an existence check

A common probe: "Find every product that has at least one review. EXISTS or IN — which would you use and why?" — testing whether the candidate can articulate the difference at the level of plans, not just syntax.

Solution Using EXISTS with a correlated reference

SELECT p.product_id, p.name
FROM   products p
WHERE  EXISTS (
         SELECT 1
         FROM   reviews r
         WHERE  r.product_id = p.product_id
       );
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step Stage What happens
1 Outer scan reads each product row from products
2 Inner correlated lookup scans reviews for any row with r.product_id = p.product_id
3 Short-circuit as soon as one matching review is found, the inner returns and EXISTS is TRUE
4 Keep / drop keeps the outer row if EXISTS = TRUE; drops it otherwise
5 Project emits (product_id, name)

Sample input — products.

product_id name
100 Widget
101 Gadget
102 Gizmo

Sample input — reviews.

review_id product_id rating
1 100 5
2 100 4
3 102 3

Output:

product_id name
100 Widget
102 Gizmo

Why this works — concept by concept:

  • EXISTS short-circuits — the inner subquery stops scanning reviews as soon as the first match is found per outer row. For a product with 1,000 reviews, the inner reads exactly one row, not 1,000.
  • Semi-join semantics — regardless of how many reviews a product has, the outer row appears exactly once. No DISTINCT needed, no row explosion.
  • NULL-safe — even if reviews.product_id has NULLs, the inner WHERE r.product_id = p.product_id simply doesn't match them; EXISTS is unaffected.
  • Hash semi-join under the hood — modern planners build a hash on reviews.product_id (with duplicates collapsed) and probe per outer row; total cost = O(reviews + products).
  • IN equivalenceWHERE p.product_id IN (SELECT product_id FROM reviews) produces the same plan on Postgres, BigQuery, Snowflake. EXISTS is the more declarative choice for "is there a match?".
  • Cost — O(R + P) where R = reviews, P = products; the hash-build is O(distinct reviews.product_id).

SQL
Topic — subqueries (SQL)
EXISTS, NOT EXISTS, and NULL-safe anti-join problems

Practice →


Three more interview-grade patterns for EXISTS / IN

Beyond the canonical four operators, three further patterns recur in senior-round subquery interviews — pattern recognition matters as much as syntax recall.

Pattern 1 — EXISTS as a guard inside a CASE expression.

A common reporting requirement is "flag each customer as active or dormant based on whether they have at least one order in the last 30 days." The cleanest expression embeds EXISTS inside a CASE.

SELECT c.customer_id,
       c.name,
       CASE
         WHEN EXISTS (
           SELECT 1
           FROM   orders o
           WHERE  o.customer_id = c.customer_id
             AND  o.order_date >= CURRENT_DATE - INTERVAL '30 days'
         ) THEN 'active'
         ELSE 'dormant'
       END AS status
FROM   customers c;
Enter fullscreen mode Exit fullscreen mode

EXISTS is the right primitive because it short-circuits on the first match — perfect for a per-row boolean classifier.

Pattern 2 — EXISTS correlated to two outer tables (multi-table correlation).

Sometimes the inner subquery needs to reference more than one outer alias. This is rare but legal — most engines support correlated references to any outer alias in scope.

SELECT c.customer_id, p.product_id
FROM   customers c
JOIN   products  p
WHERE  EXISTS (
         SELECT 1
         FROM   orders o
         WHERE  o.customer_id = c.customer_id
           AND  o.product_id  = p.product_id
       );
Enter fullscreen mode Exit fullscreen mode

Reads naturally as "every customer-product pair where there's at least one matching order." Most planners rewrite this as a hash semi-join across the three tables.

Pattern 3 — IN with a row constructor for multi-column matching (Postgres / DB2).

When the engine supports row constructors, you can express composite-key IN without going through EXISTS.

-- Postgres
SELECT *
FROM   orders
WHERE  (tenant_id, sku) IN (
         SELECT tenant_id, sku
         FROM   active_skus
       );
Enter fullscreen mode Exit fullscreen mode

This is exactly the two-column EXISTS shape from the worked example above, just with row-constructor syntax. The choice is stylistic — pick whichever your team's engine supports and your reader prefers. Note that MySQL historically supports row-constructor IN; older SQL Server does not.

Three NULL gotchas to memorise.

  • WHERE x = NULL is always FALSE (never use = against NULL; use IS NULL).
  • x NOT IN (subquery with any NULL) always returns zero rows.
  • EXISTS (SELECT NULL ...) is not FALSE — the existence test only cares whether any row appears, not what the inner projection contains. SELECT 1, SELECT NULL, SELECT * all behave identically.

The third bullet is the most common interview trap. Candidates believe EXISTS (SELECT NULL FROM ...) returns FALSE because the projected value is NULL. It doesn't — EXISTS only counts rows.


Cheat sheet — subquery patterns

  • Above-average filter. WHERE x > (SELECT AVG(x) FROM t) — uncorrelated scalar subquery; the inner runs once. Add a correlated WHERE t2.group_id = t.group_id to filter per group.
  • Top-N per group. Derived table with ROW_NUMBER() OVER (PARTITION BY group_col ORDER BY metric DESC); outer WHERE rn <= N. Window cannot live in WHERE, so the derived-table wrap is mandatory.
  • Find duplicates. GROUP BY business_key HAVING COUNT(*) > 1 — not technically a subquery, but interviewers often expect it inside a WHERE id IN (...) to surface the offending rows.
  • Missing rows / anti-join. WHERE NOT EXISTS (SELECT 1 FROM right_table WHERE ...) — the NULL-safe form of "no match". Equivalent to LEFT JOIN ... WHERE right.pk IS NULL.
  • Conditional aggregation as scalar. SELECT (SELECT COUNT(*) FROM t WHERE cond) AS sub — embeds a count inline. Use sparingly; usually a FILTER clause or a CASE inside SUM is clearer.
  • Two-column EXISTS. WHERE EXISTS (SELECT 1 FROM t2 WHERE t1.a = t2.a AND t1.b = t2.b) — composite-key existence test. No way to express this with IN directly (you'd need a row constructor (a, b) IN (...) which not every engine supports).
  • First-of-group via correlated subquery. WHERE order_id = (SELECT MIN(order_id) FROM orders WHERE customer_id = outer.customer_id) — picks the earliest order per customer. Faster modern form: window function with ROW_NUMBER partitioned by customer_id.
  • Latest-of-group via derived table. JOIN (SELECT customer_id, MAX(order_date) AS last_dt FROM orders GROUP BY customer_id) — picks the latest date per customer, then joins back for the full row.
  • Percentile / Nth value. SELECT (SELECT amount FROM orders ORDER BY amount DESC LIMIT 1 OFFSET N-1) — Nth largest amount; or use PERCENTILE_CONT / NTILE for cleaner semantics.
  • Update from subquery. UPDATE customers SET tier = (SELECT tier FROM tiers WHERE tiers.customer_id = customers.customer_id) — correlated scalar in the SET clause. NULL-safe if the subquery is allowed to return NULL.

Frequently asked questions

Subquery vs CTE — which is faster?

In modern engines (Postgres 12+, BigQuery, Snowflake, SQL Server, modern MySQL), CTEs and equivalent derived tables produce identical plans for single-use pipelines — the planner inlines the CTE just like a subquery. The choice is mostly stylistic: CTEs win when you reuse a stage more than once, or when the query is long enough that named, top-of-query stages improve readability. Older Postgres (≤ 11) treated CTEs as an optimization fence — the planner couldn't push predicates through; that quirk is now gated behind WITH MATERIALIZED. If you want a single-use CTE on Postgres 11 to be inlined, you currently need to switch to a derived table or upgrade. Verifying with EXPLAIN ANALYZE is the only way to know for sure on a given engine.

Does the optimizer always rewrite correlated subqueries as JOINs?

Not always. Postgres 13+, BigQuery, and Snowflake are aggressive about it — they convert correlated EXISTS, IN, and aggregate-in-WHERE patterns into hash semi-joins automatically. MySQL ≤ 5.7 and older SQL Server versions are less aggressive; for them, the naïve nested-loop plan can dominate runtime on large tables. Always check EXPLAIN (or EXPLAIN ANALYZE) — if you see "nested loop" with no "semi join" or "hash join" hint, manually rewrite the query as a JOIN against a grouped derived table. As a quick rule of thumb: if your query runs in seconds on dev and minutes on prod, the optimizer probably failed to rewrite a correlated subquery, and the manual rewrite is the single biggest speedup available.

IN vs EXISTS — which should I prefer?

For the affirmative case (return rows that match), IN and EXISTS produce the same plan on every modern engine — pick whichever reads more clearly. EXISTS is more declarative when the inner has multiple conditions; IN is more compact for a single-column lookup. For the negated case (return rows that don't match), prefer NOT EXISTS — it is NULL-safe, whereas NOT IN silently returns zero rows if any inner value is NULL. A handy interview tie-breaker: when the question reads "is there at least one matching row?" reach for EXISTS; when the question reads "does the value appear in this list?" reach for IN. Both compile to the same semi-join when the planner does its job.

Why does NOT IN return zero rows when there's a NULL in the list?

Because SQL uses three-valued logic. x NOT IN (a, b, NULL) evaluates as x != a AND x != b AND x != NULL. x != NULL evaluates to UNKNOWN (not TRUE), and TRUE AND UNKNOWN collapses to UNKNOWN. WHERE UNKNOWN is treated as FALSE, so every outer row is filtered out. The fix is either NOT EXISTS (which uses = inside, which is UNKNOWN for NULLs and therefore doesn't match) or an explicit WHERE col IS NOT NULL inside the inner subquery.

Can I update a table using a subquery?

Yes, in two shapes. Correlated UPDATE: UPDATE customers SET tier = (SELECT tier FROM tiers t WHERE t.customer_id = customers.customer_id) — runs the subquery once per outer row. UPDATE with FROM (Postgres) or JOIN (MySQL / SQL Server): UPDATE customers c SET tier = t.tier FROM tiers t WHERE t.customer_id = c.customer_id — typically much faster on large tables because the planner can hash-join the two sides. Always guard against the subquery returning NULL (which silently nulls the column) by adding an EXISTS filter or a COALESCE. The same shapes work for DELETE: DELETE FROM customers WHERE NOT EXISTS (SELECT 1 FROM orders WHERE orders.customer_id = customers.customer_id) is the canonical "delete customers with no orders" query, and it is NULL-safe by construction.

Are nested subqueries OK?

Yes, but every layer of nesting is a layer of cognitive load. SQL's grammar allows arbitrarily deep nesting (SELECT (SELECT (SELECT ...))), and modern planners handle three or four levels without breaking a sweat. Once you reach two or three nested layers, refactor into CTEs — the named stages make the query self-documenting and let the planner see the whole pipeline at once. Avoid deep nesting in interviews; even when it's technically correct, it tells the interviewer you don't yet know when to switch shapes. A practical rule: if your query has more than two levels of parentheses around SELECT keywords, rewrite the inner two layers as named CTEs and watch the readability double.

What's the difference between a scalar subquery and a correlated subquery?

A scalar subquery is defined by its result cardinality: exactly one row × one column. A correlated subquery is defined by whether the inner references the outer — orthogonal axis. A subquery can be scalar and correlated (e.g. (SELECT MAX(date) FROM orders WHERE orders.customer_id = c.customer_id) returns one value per outer row), scalar and uncorrelated (e.g. (SELECT MAX(price) FROM products) — one value, evaluated once), or non-scalar correlated (e.g. inside EXISTS, where the inner can return many rows because only existence matters). Senior interviewers will sometimes hand you a subquery and ask "is it scalar, correlated, both, or neither?" — be ready to answer along both axes independently.

Practice on PipeCode

Pipecode.ai is Leetcode for Data Engineering — every subquery pattern above ships with hands-on practice rooms where you write the query, run it against a real dataset, and see whether it survives the NULL gotchas and the optimizer rewrite. Start with the subqueries library and work outward; PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine.

Practice subqueries now →
SQL subquery drills →

Source: dev.to

arrow_back Back to Tutorials