Slowly Changing Dimensions (SCD) for Data Engineering Interviews: Type 1, 2, 3, 6 with SQL & dbt

python dev.to

slowly changing dimensions interview questions are the dimensional-modelling round every senior data engineering loop opens with after warehouse design. Interviewers don't stop at "what's a dim table?" — they probe whether you understand scd type 2 as the production default, the scd type 1 vs type 2 vs type 3 trade-offs, the surrogate key as the dim's join key, and the late-arriving / retroactive-delete gotchas that fail most candidates.

This guide walks through the seven SCD primitives that show up most often in data engineering interview questions at FAANG and warehouse-heavy shops (Snowflake, Databricks, Netflix, Airbnb, Stripe). Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works. By the end you'll be able to design a Type 2 dim with MERGE INTO, defend dbt snapshot timestamp vs check strategy, and walk through the surrogate-key collision that breaks every junior's first SCD implementation — the exact shape dimensional modelling interview rounds reward when fact dim join and type 6 hybrid scd come up.

When you want hands-on reps immediately after reading, drill the dimensional modelling library →, browse slowly-changing-data practice →, and rehearse join patterns → plus window-function drills → on the effective-date side.


On this page


1. Why SCD shows up in every senior data engineering interview

SCD is how dimensional warehouses preserve "what was true when" — the core of every analytics question

The one-sentence invariant: a slowly changing dimension is a dim table where the attributes of a business entity (customer, product, employee) change over time, and the warehouse must decide whether to preserve, partially preserve, or overwrite the history. Once you internalise that — history preservation is a modelling choice, not a feature — every slowly changing dimensions interview questions prompt resolves to "Type 1, 2, 3, or 6, and why."

Why interviewers love SCD questions.

  • It surfaces dimensional-modelling instincts. Does the candidate know the difference between a dimension and a fact?
  • It tests historical reasoning. Can the answer reconstruct "what was the customer's segment when this order was placed in 2024?"
  • It probes SQL fluency. MERGE INTO with multiple WHEN branches, effective-date predicates, surrogate keys.
  • It hits operational realism. Late-arriving updates, retroactive deletes, sk collisions — every team has been bitten at least once.

The four key concepts every SCD interview opens with.

  • Business key vs surrogate key. Business key = the natural id (e.g. customer_id). Surrogate key = a synthetic id (sk) that uniquely identifies a version of the row.
  • History preservation. SCD type decides whether old values are kept (Type 2, 3, 6) or overwritten (Type 1).
  • Effective dates. valid_from + valid_to columns; valid_to=NULL (or far-future date) marks the current row.
  • is_current flag. A redundant-but-useful boolean for fast WHERE is_current = true lookups.

What interviewers listen for.

  • Do you reach for Type 2 when "preserve history" is mentioned? — required answer.
  • Do you separate business key from surrogate key? — senior signal.
  • Do you mention effective-date predicates when joining fact to dim? — senior-architectural signal.
  • Do you bring up MERGE INTO as the standard SCD-2 update pattern? — required answer.

Worked example — pick the SCD type for a customer dimension

Detailed explanation. A common opener: "We have a customers table with customer_id, name, email, city, segment, lifetime_value. The warehouse needs to answer 'what was the customer's segment when they placed an order?' What SCD type and why?"

Question. Justify the SCD type per column.

Input (column-by-column).

Column Changes? Need history? SCD type
customer_id never n/a (business key) n/a
name rarely no Type 1
email sometimes no Type 1
city sometimes yes (location at order time) Type 2
segment yes yes (segment at order time) Type 2
lifetime_value continuously no — just current Type 1

Code (resulting dim_customer schema).

CREATE TABLE dim_customer (
    sk              BIGSERIAL PRIMARY KEY,        -- surrogate key
    customer_id     BIGINT NOT NULL,              -- business key
    name            VARCHAR(200) NOT NULL,        -- Type 1 (overwrite)
    email           VARCHAR(200) NOT NULL,        -- Type 1
    city            VARCHAR(100) NOT NULL,        -- Type 2 (history)
    segment         VARCHAR(50)  NOT NULL,        -- Type 2
    lifetime_value  NUMERIC(12,2),                -- Type 1
    valid_from      DATE NOT NULL,
    valid_to        DATE,                         -- NULL = current
    is_current      BOOLEAN NOT NULL DEFAULT TRUE
);

CREATE UNIQUE INDEX uq_dim_customer_current
    ON dim_customer (customer_id) WHERE is_current = true;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Business key = customer_id. It identifies the entity.
  2. Surrogate key = sk. Auto-generated; unique per row version.
  3. Type 1 columns (name, email, lifetime_value) — overwritten in place on every change.
  4. Type 2 columns (city, segment) — when these change, the current row's valid_to is set to today and a new row is inserted with valid_from = today, is_current = true.
  5. Partial-index uniqueness — at most one current row per customer_id.

Output (joining fact_orders to dim_customer with effective-date predicate).

SELECT f.order_id, f.amount, c.segment AS segment_at_order_time
FROM fact_orders f
JOIN dim_customer c
  ON c.customer_id = f.customer_id
 AND f.order_date >= c.valid_from
 AND (f.order_date < c.valid_to OR c.valid_to IS NULL);
Enter fullscreen mode Exit fullscreen mode

Rule of thumb. Type 2 for any attribute that drives reporting "as of"; Type 1 for everything else. When in doubt, ask the interviewer "do we need to reconstruct the past value?"

SCD interview question on column-by-column SCD typing

A senior probe: "Walk through each column and pick its SCD type." This tests whether the candidate can reason at the attribute level, not just the table level.

Solution Using hybrid per-column typing within one dim

-- dim_customer combining Type 1 (Volatile attributes overwritten)
-- and Type 2 (historical attributes versioned)
-- Single row per (customer_id, valid_from); is_current=true is the latest.
CREATE TABLE dim_customer (
    sk              BIGSERIAL PRIMARY KEY,
    customer_id     BIGINT NOT NULL,
    name            VARCHAR(200) NOT NULL,    -- T1
    email           VARCHAR(200) NOT NULL,    -- T1
    city            VARCHAR(100) NOT NULL,    -- T2
    segment         VARCHAR(50)  NOT NULL,    -- T2
    lifetime_value  NUMERIC(12,2),            -- T1
    valid_from      DATE NOT NULL,
    valid_to        DATE,
    is_current      BOOLEAN NOT NULL DEFAULT TRUE
);
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Change Action
email changed UPDATE all rows for customer_id SET email = new_email
name changed UPDATE all rows SET name = new_name
city changed close current row (set valid_to = today, is_current = false), INSERT new row
segment changed close current row, INSERT new row
lifetime_value changed UPDATE all rows SET lifetime_value = new_value

Output:

Behaviour Per-column SCD Pure SCD2 (every col) Pure SCD1 (no history)
Storage bounded by T2-column change rate unbounded by every-column changes minimal
Historical accuracy T2 columns preserved every column preserved none
Read complexity medium (effective-date join) medium trivial

Why this works — concept by concept:

  • Business key vs surrogate keycustomer_id is what humans recognise; sk is what facts join on. Decoupling them is the foundation of SCD2.
  • Type 1 columns update all rows — when email changes, every historical row's email becomes the new value (so historical reports still show current email but historical city / segment).
  • Type 2 columns version the row — when city changes, a new row is born; the old row is closed.
  • Hybrid per-column typing — the most common production pattern. Type 6 (next sections) formalises this.
  • Cost — storage = O(rows × Type-2-column-change-rate); read cost = one join with effective-date predicate.

Dimensional modelling
Topic — dimensional-modeling
Dimensional modelling problems

Practice →


2. SCD Type 1 — overwrite in place, no history

scd type 1 is "UPDATE the row in place" — the cheapest, most-common, history-free strategy

The mental model: Type 1 is for attributes where you only ever care about the current value — name, email, address — and where reconstructing the past would mislead the reader of any report. There's exactly one row per business key. UPDATE in place.

When to use Type 1.

  • The attribute is descriptive, not analytical (display name, contact info).
  • Historical reports would be misleading or unnecessary.
  • Storage / cost matters more than audit trail.

When NOT to use Type 1.

  • The attribute drives slicing in reports ("revenue by region" → region must be Type 2).
  • Compliance requires an audit trail.
  • The attribute changes in step with billing or contractual terms.

The classic Type 1 trap.

A team Type-1's the customer's country column. Six months later, finance asks "revenue by country last year" — every historical order now shows the customer's current country. The dim was destroying its own history.

Worked example — Type 1 update via MERGE

Detailed explanation. Even Type 1 benefits from MERGE INTO over plain UPDATE: it handles INSERT (new customer) and UPDATE (existing customer) in one statement.

Question. Write the MERGE that Type-1-updates dim_customer from a daily stg_customer extract.

Input (stg_customer).

customer_id name email city
42 Ada Lovelace ada@example.com Seattle
43 Alan Turing alan@example.com London

Existing dim_customer.

sk customer_id name email city
101 42 Ada Lovelace ada@old.com Austin
(no row) 43

Code.

MERGE INTO dim_customer t
USING stg_customer s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN
  UPDATE SET name = s.name, email = s.email, city = s.city
WHEN NOT MATCHED THEN
  INSERT (customer_id, name, email, city)
  VALUES (s.customer_id, s.name, s.email, s.city);
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The MERGE walks every row in stg_customer.
  2. For customer_id=42 — matched → UPDATE — email goes from ada@old.com to ada@example.com, city from Austin to Seattle.
  3. For customer_id=43 — not matched → INSERT new row.
  4. Surrogate key sk=101 is preserved on the update.
  5. End state has one row per customer_id; the prior email and city values are gone.

Output.

sk customer_id name email city
101 42 Ada Lovelace ada@example.com Seattle
102 43 Alan Turing alan@example.com London

Rule of thumb. Always use MERGE for SCD operations, even Type 1 — it handles both UPDATE and INSERT in one statement and is replay-safe.

SCD interview question on Type 1 idempotency

The probe: "If we re-run the same MERGE twice, what happens?"

Solution Using deterministic MERGE that's idempotent

-- Same MERGE as above
MERGE INTO dim_customer t
USING stg_customer s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET name = s.name, email = s.email, city = s.city
WHEN NOT MATCHED THEN INSERT VALUES (s.customer_id, s.name, s.email, s.city);
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Run Source row Action End state
1 (42, Ada, ada@example.com, Seattle) MATCHED → UPDATE row 101 reflects source
2 (same source) (42, Ada, ada@example.com, Seattle) MATCHED → UPDATE (no-op effect) row 101 unchanged

Output:

Property Value
Idempotent? yes — re-running produces the same end state
Side effects? minor — UPDATE timestamp on metadata layer changes
Recommended? yes — keeps the load script idempotent

Why this works — concept by concept:

  • MERGE is deterministic on the same input — second run finds the same matched/unmatched partition and applies the same updates.
  • Type 1 has no row-creation on UPDATE — re-running doesn't multiply rows.
  • sk preserved across re-runs — the fact tables that reference this sk are unaffected.
  • Idempotency = replay safety — Airflow / dbt / cron can re-run without polluting state.
  • Cost — O(stg rows) compute, O(stg rows) UPDATE writes. Trivial at most scales.

Dimensional modelling
Topic — joins
SCD Type 1 + MERGE pattern problems

Practice →


3. SCD Type 2 — add row with valid_from, valid_to, is_current

scd type 2 is the production default — close the old row, insert the new row, preserve all history

The mental model: Type 2 preserves the full history of every change as separate rows; each row has a valid_from and valid_to (or is_current) marking its lifetime; facts join on the surrogate key based on event date.

The three columns every SCD-2 dim has.

  • valid_from (DATE / TIMESTAMP) — when this version became active.
  • valid_to (DATE / TIMESTAMP, NULL or far-future for current) — when this version stopped being active.
  • is_current (BOOLEAN) — TRUE for the current row, FALSE for all others. Redundant but useful for fast lookups.

The MERGE that drives Type 2.

  • WHEN MATCHED AND is_current AND any-tracked-column changed → UPDATE to close the row (valid_to = today, is_current = false).
  • WHEN NOT MATCHED (by customer_id) → INSERT a new current row.
  • For an UPDATE-then-INSERT pattern that needs both branches, use two passes or MERGE INTO with multiple statements.

Fact-to-dim joins.

  • Always join on the surrogate key with an effective-date predicate.
  • fact.event_date BETWEEN dim.valid_from AND COALESCE(dim.valid_to, '9999-12-31').

Worked example — Type 2 MERGE for a customer city change

Detailed explanation. Customer 42's city changes from Austin to Seattle on 2026-05-01. Walk through how the dim table transforms.

Question. Show the staging table, the MERGE, and the final dim state.

Input (stg_customer).

customer_id city snapshot_date
42 Seattle 2026-05-01

Existing dim_customer.

sk customer_id city valid_from valid_to is_current
101 42 Austin 2024-01-01 NULL true

Code (PostgreSQL).

BEGIN;

-- Step 1: close the current row when city changed
UPDATE dim_customer
SET valid_to = s.snapshot_date - INTERVAL '1 day',
    is_current = false
FROM stg_customer s
WHERE dim_customer.customer_id = s.customer_id
  AND dim_customer.is_current = true
  AND dim_customer.city <> s.city;

-- Step 2: insert the new current row
INSERT INTO dim_customer (customer_id, city, valid_from, valid_to, is_current)
SELECT s.customer_id, s.city, s.snapshot_date, NULL, true
FROM stg_customer s
WHERE NOT EXISTS (
    SELECT 1 FROM dim_customer d
    WHERE d.customer_id = s.customer_id
      AND d.is_current = true
);

COMMIT;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Step 1 UPDATE — for every customer whose current row has a different city than the staging row, close the current row by setting valid_to = snapshot_date - 1 day and is_current = false.
  2. Step 2 INSERT — for every customer who no longer has a current row (because Step 1 just closed it) OR who didn't have one at all, insert a new row with valid_from = snapshot_date, valid_to = NULL, is_current = true.
  3. The two-step pattern ensures that the dim never has two current rows for the same customer.
  4. Wrap both in a transaction so partial failure can't leave the dim in an inconsistent state.

Output (after MERGE).

sk customer_id city valid_from valid_to is_current
101 42 Austin 2024-01-01 2026-04-30 false
102 42 Seattle 2026-05-01 NULL true

Rule of thumb. is_current is a redundant convenience — always derivable from valid_to IS NULL, but worth keeping because the partial-index uniqueness constraint depends on it.

SCD interview question on the fact-to-dim join

The probe: "How do I get the customer's city at the time of every order?"

Solution Using surrogate-key + effective-date predicate

SELECT
    f.order_id,
    f.order_date,
    c.customer_id,
    c.city AS city_at_order_time
FROM fact_orders f
JOIN dim_customer c
  ON c.customer_id = f.customer_id
 AND f.order_date >= c.valid_from
 AND (f.order_date < c.valid_to OR c.valid_to IS NULL);
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

order_id order_date dim_customer match city_at_order_time
7001 2024-06-15 sk=101 (valid 2024-01-01..2026-04-30) Austin
7002 2026-04-20 sk=101 (still in range) Austin
7003 2026-05-15 sk=102 (valid 2026-05-01..NULL) Seattle

Output:

order_id order_date city_at_order_time
7001 2024-06-15 Austin
7002 2026-04-20 Austin
7003 2026-05-15 Seattle

Why this works — concept by concept:

  • Effective-date predicate — the BETWEEN-style predicate ensures each fact row matches exactly one dim row.
  • Surrogate key in fact (alternative)__ — many warehouses store the dim's sk directly in the fact, so the join becomes fact.customer_sk = dim.sk with no date predicate. Faster join but the fact must look up the right sk at load time.
  • COALESCE on valid_to — required because NULL doesn't behave normally in range predicates; using 9999-12-31 as a sentinel is the alternative.
  • Partial-index uniqueness on (customer_id) WHERE is_current = true — guarantees exactly one current row per business key.
  • Cost — one join with effective-date predicate; indexed on (customer_id, valid_from) for performance.

Dimensional modelling
Topic — slowly-changing-data
SCD Type 2 + effective-date join problems

Practice →


4. SCD Type 3 — add column for previous value

scd type 3 is "I want to know the previous value, but not every value" — limited history, no row explosion

The mental model: Type 3 stores N previous values as extra columns on a single row per business key. Most common form is "current" + "previous" — two columns per tracked attribute. The dim never grows beyond one row per business key.

When to use Type 3.

  • You only need "before" and "after" comparisons.
  • The attribute changes rarely (a few times per row's lifetime).
  • Storage is a serious concern.
  • Reports always show "current vs previous" patterns.

When NOT to use Type 3.

  • You need more than the last N values.
  • The attribute changes more than once per reporting window.
  • You need point-in-time reconstruction.

The typical schema.

CREATE TABLE dim_customer_t3 (
    customer_id  BIGINT PRIMARY KEY,
    name         VARCHAR(200) NOT NULL,
    city         VARCHAR(100) NOT NULL,         -- current
    prev_city    VARCHAR(100),                  -- previous
    city_changed_at DATE,                       -- when did it change?
    segment      VARCHAR(50) NOT NULL,
    prev_segment VARCHAR(50),
    segment_changed_at DATE
);
Enter fullscreen mode Exit fullscreen mode

Worked example — Type 3 update for a customer city change

Detailed explanation. When city changes, shift the current to prev_city, set the new value, and record the change date.

Question. Write the MERGE for Type 3.

Code.

MERGE INTO dim_customer_t3 t
USING stg_customer s
ON t.customer_id = s.customer_id
WHEN MATCHED AND t.city <> s.city THEN
  UPDATE SET prev_city = t.city,
             city = s.city,
             city_changed_at = s.snapshot_date
WHEN NOT MATCHED THEN
  INSERT (customer_id, name, city, prev_city, city_changed_at, segment)
  VALUES (s.customer_id, s.name, s.city, NULL, NULL, s.segment);
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. For matched rows where city changed → shift current to prev_city, set new value, record changed-date.
  2. For matched rows where city is unchanged → no UPDATE (MATCHED AND clause filters out unchanged rows).
  3. For unmatched → INSERT new row with prev_city = NULL (no history yet).
  4. Re-running with the same input is idempotent — the MATCHED-AND condition is false on the second pass.

Output.

Before:

customer_id name city prev_city city_changed_at
42 Ada Austin NULL NULL

Source: (42, Ada, Seattle, 2026-05-01).

After:

customer_id name city prev_city city_changed_at
42 Ada Seattle Austin 2026-05-01

After another change to "Boston" on 2026-06-01:

customer_id name city prev_city city_changed_at
42 Ada Boston Seattle 2026-06-01

— note: "Austin" is gone forever after the second change.

Rule of thumb. Type 3 trades unbounded history for a single-row schema. Use only when you genuinely only need "current vs previous."

SCD interview question on Type 3 vs Type 2 trade-off

The probe: "When would you pick Type 3 over Type 2?"

Solution Using a trade-off decision table

Concern Type 2 Type 3
Number of historical values kept unbounded N (usually 1)
Row count per business key grows with changes always 1
Storage larger minimal
Fact-to-dim join complexity effective-date predicate trivial — no date join
Point-in-time reporting yes no
"Current vs previous" reporting yes (filter is_current) yes (single row)

Step-by-step trace.

Question Type 2 Type 3
"What was the city when order 7001 was placed?" yes (effective-date join) no — only current + previous available
"What's the previous city?" yes (filter by valid_to ordering) yes (read prev_city column)
"How many times has this customer moved?" count of rows no (only the most recent move)
"Latest city" filter is_current = true read city column

Output:

Question Answer
Need full history? Type 2
Need only "before/after" comparison? Type 3
Need a single-row dim for cost reasons? Type 3
Need reporting "as of date"? Type 2

Why this works — concept by concept:

  • Bounded vs unbounded history — Type 3 commits to a constant-size schema; Type 2 lets the table grow with changes.
  • Join complexity — Type 3 joins are trivial (no date predicate); Type 2 requires effective-date predicates.
  • Storage / cost — Type 3 is cheap; Type 2 grows with change-rate.
  • Reporting power — Type 2 wins on any "as of" question; Type 3 wins on "current vs previous" cards.
  • Cost — choice cost is the application's specific reporting needs; the technical cost is the row-count difference at scale.

Dimensional modelling
Topic — dimensional-modeling
SCD type-selection problems

Practice →


5. SCD Type 6 — hybrid 1+2+3 for the messy real world

scd type 6 combines Type 1 (current), Type 2 (history rows), and Type 3 (previous column) — the production reality

The mental model: Type 6 is "all of the above" — multiple rows per business key (Type 2), each with current_* columns updated on every change (Type 1), and optional prev_* columns for the last-known-prior value (Type 3). Sounds messy because it is — but it's how real dims look.

Type 6 columns.

  • Type 2 history columns: city, valid_from, valid_to, is_current.
  • Type 1 "current" columns: current_city — set to the same value on EVERY row of this business key. Lets you do "show me the current city for every historical row" without a sub-query.
  • Type 3 "previous" columns (optional): prev_city — the value just before the row's valid_from.

Why Type 6 exists.

  • Reporting often asks two questions at once: "what was the city at order time?" AND "what is the current city?" — Type 6 answers both without joins.
  • "Show me orders by current segment" without a sub-query — current_segment is duplicated on every historical row.

Type 6 trade-offs.

  • Storage: bigger than Type 2 (more columns); bigger than Type 1 (more rows).
  • Update cost: every row of the business key must be updated when the Type 1 column changes.
  • Read flexibility: highest of any SCD type.

Worked example — Type 6 dim with current_city kept in sync

Detailed explanation. Walk through inserting two history rows for customer 42 and updating current_city on both when the city changes.

Question. Show the dim with two history rows and a city change, both before and after.

Input. Customer 42 was in Austin (2024-01-01 → 2026-04-30), then moved to Seattle (2026-05-01).

Schema.

CREATE TABLE dim_customer_t6 (
    sk              BIGSERIAL PRIMARY KEY,
    customer_id     BIGINT NOT NULL,
    city            VARCHAR(100) NOT NULL,      -- Type 2 historical
    current_city    VARCHAR(100) NOT NULL,      -- Type 1 (same on all rows)
    valid_from      DATE NOT NULL,
    valid_to        DATE,
    is_current      BOOLEAN NOT NULL
);
Enter fullscreen mode Exit fullscreen mode

Code (the Type 6 update on a city change).

BEGIN;

-- Step 1: close the current row (Type 2 part)
UPDATE dim_customer_t6
SET valid_to = '2026-04-30', is_current = false
WHERE customer_id = 42 AND is_current = true;

-- Step 2: insert the new current row (Type 2 part)
INSERT INTO dim_customer_t6 (customer_id, city, current_city, valid_from, valid_to, is_current)
VALUES (42, 'Seattle', 'Seattle', '2026-05-01', NULL, true);

-- Step 3: Type 1 update — set current_city on ALL rows for this customer
UPDATE dim_customer_t6
SET current_city = 'Seattle'
WHERE customer_id = 42;

COMMIT;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Step 1 — Type 2 — close the existing current row.
  2. Step 2 — Type 2 — insert the new current row.
  3. Step 3 — Type 1 — update current_city on every row of this business key (including the historical one) so a join showing "current city" doesn't need a sub-query.
  4. End state has two history rows and identical current_city values on both.

Output (after the change).

sk customer_id city current_city valid_from valid_to is_current
101 42 Austin Seattle 2024-01-01 2026-04-30 false
102 42 Seattle Seattle 2026-05-01 NULL true

Rule of thumb. current_city (Type 1) is denormalised redundancy that buys read simplicity. Use Type 6 when both "as of" and "current" reports are common.

SCD interview question on Type 6 fact-to-dim joins

The probe: "With Type 6, can I report both 'orders by historical city' and 'orders by current city' in one query?"

Solution Using both city and current_city from the same row

SELECT
    c.customer_id,
    c.city          AS city_at_order_time,    -- Type 2
    c.current_city  AS city_now,              -- Type 1
    COUNT(*)        AS orders
FROM fact_orders f
JOIN dim_customer_t6 c
  ON c.customer_id = f.customer_id
 AND f.order_date >= c.valid_from
 AND (f.order_date < c.valid_to OR c.valid_to IS NULL)
GROUP BY 1, 2, 3
ORDER BY 1;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

order order_date matched dim row city (T2) current_city (T1)
7001 2024-06-15 sk=101 (Austin) Austin Seattle
7002 2026-05-15 sk=102 (Seattle) Seattle Seattle

Output:

customer_id city_at_order_time city_now orders
42 Austin Seattle 1
42 Seattle Seattle 1

Why this works — concept by concept:

  • Two attributes from one row — Type 6 lets one effective-date join return both the historical and the current values.
  • No sub-query for "current city"current_city is denormalised on every history row.
  • Update cost — every Type 1 update touches every history row of the business key (potentially expensive on hot keys).
  • Storage — extra column on every row; usually a small price.
  • Cost — read query = O(matching dim rows); write update = O(history rows per business key).

Dimensional modelling
Topic — joins
Type 6 hybrid dim modelling problems

Practice →


6. SCD with dbt snapshots — timestamp vs check strategy

dbt snapshot automates SCD Type 2 — pick timestamp or check strategy

dbt snapshots are the canonical way to materialise SCD Type 2 in the modern data stack. dbt handles the surrogate key, dbt_valid_from, dbt_valid_to, and detects changes automatically.

dbt snapshot file shape.

-- snapshots/customers_snapshot.sql
{% snapshot customers_snapshot %}

{{
  config(
    target_schema='snapshots',
    unique_key='customer_id',
    strategy='timestamp',       -- or 'check'
    updated_at='updated_at',    -- timestamp strategy
    -- check_cols=['city','segment'],  -- check strategy
  )
}}

SELECT * FROM {{ source('app', 'customers') }}

{% endsnapshot %}
Enter fullscreen mode Exit fullscreen mode

The two strategies.

  • timestamp — relies on a monotonically-updated updated_at column in the source. Fast: dbt compares updated_at against the snapshot's dbt_valid_from. Best when the source guarantees an accurate updated_at.
  • check — compares a list of named columns row-by-row; if any differ, dbt closes the old row and inserts a new one. More flexible but slightly slower (column-by-column hash comparison).

Generated columns.

  • dbt_valid_from — when this version became current.
  • dbt_valid_to — when this version stopped being current (NULL for current).
  • dbt_scd_id — surrogate key, hash of (unique_key, dbt_updated_at).
  • dbt_updated_at — copy of the updated_at column (timestamp strategy only).

Running snapshots.

  • dbt snapshot — runs every snapshot in the project; idempotent.
  • Schedule daily/hourly via Airflow / dbt Cloud / cron.

Worked example — timestamp strategy snapshot

Detailed explanation. A customer's city changes; dbt detects the updated_at increase and rolls the snapshot.

Question. Trace one snapshot run for a city change.

Input (source customers).

customer_id city updated_at
42 Austin 2024-01-01

After the change:

customer_id city updated_at
42 Seattle 2026-05-01

Code.

{% snapshot customers_snapshot %}

{{
  config(
    target_schema='snapshots',
    unique_key='customer_id',
    strategy='timestamp',
    updated_at='updated_at',
  )
}}

SELECT * FROM {{ source('app', 'customers') }}

{% endsnapshot %}
Enter fullscreen mode Exit fullscreen mode

Run:

dbt snapshot
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. First run (2024-01-01) — source has (42, Austin, 2024-01-01). Snapshot table is empty → INSERT row with dbt_valid_from=2024-01-01, dbt_valid_to=NULL, dbt_scd_id=hash(42, 2024-01-01).
  2. Second run (2026-05-01) — source now has (42, Seattle, 2026-05-01). dbt compares updated_at=2026-05-01 against the snapshot's current dbt_updated_at=2024-01-01 — they differ.
  3. dbt closes the old row: UPDATE … SET dbt_valid_to='2026-05-01' WHERE dbt_scd_id=hash(42, 2024-01-01).
  4. dbt inserts the new row: INSERT … (42, 'Seattle', '2026-05-01', dbt_valid_from='2026-05-01', dbt_valid_to=NULL, dbt_scd_id=hash(42, '2026-05-01')).

Output (snapshot table after the change).

customer_id city updated_at dbt_valid_from dbt_valid_to dbt_scd_id
42 Austin 2024-01-01 2024-01-01 2026-05-01 hash(42, 2024-01-01)
42 Seattle 2026-05-01 2026-05-01 NULL hash(42, 2026-05-01)

Rule of thumb. timestamp strategy if the source has a reliable updated_at; check strategy otherwise. Never use check when timestamp would work — it's noticeably slower on big tables.

dbt snapshot interview question on check strategy

The probe: "Our source DB doesn't have an updated_at column. SCD2 anyway?"

Solution Using check strategy with named columns

{% snapshot customers_snapshot_check %}

{{
  config(
    target_schema='snapshots',
    unique_key='customer_id',
    strategy='check',
    check_cols=['city', 'segment', 'lifetime_value'],
  )
}}

SELECT * FROM {{ source('app', 'customers') }}

{% endsnapshot %}
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Run Source Snapshot state Action
1 (42, Austin, premium, 1000) empty INSERT row, dbt_valid_from = current_timestamp
2 (no change) (42, Austin, premium, 1000) one row NO-OP — hash(city, segment, lifetime_value) unchanged
3 (city changes) (42, Seattle, premium, 1000) one row close old row, INSERT new — hash changed
4 (lifetime_value changes) (42, Seattle, premium, 1500) two rows close current row, INSERT new

Output (snapshot table after 4 runs):

customer_id city segment lifetime_value dbt_valid_from dbt_valid_to
42 Austin premium 1000 run-1-time run-3-time
42 Seattle premium 1000 run-3-time run-4-time
42 Seattle premium 1500 run-4-time NULL

Why this works — concept by concept:

  • check_cols — dbt hashes the listed columns each run; any difference triggers a snapshot row update.
  • No source updated_at needed — dbt uses the run timestamp (dbt_valid_from = current_timestamp).
  • Trade-offcheck strategy reads more data and does more comparison than timestamp; performance overhead grows with table size and column count.
  • Idempotent runs — running dbt snapshot twice with no source change produces no new rows.
  • Cost — read = O(rows), hash = O(rows × columns), write = O(changed rows). Acceptable up to ~tens of millions of rows.

Dimensional modelling
Topic — slowly-changing-data
dbt snapshot / SCD design problems

Practice →


7. SCD interview gotchas — late updates, retroactive deletes, surrogate-key collisions

The three failure modes every senior slowly changing dimensions interview questions round probes

Gotcha 1 — Late-arriving updates.

  • Scenario: a customer update from 2026-04-15 arrives on 2026-05-15, but the dim already has a current row from 2026-04-20.
  • Naive code closes the wrong row or duplicates rows.
  • Fix: insert the late row by splitting the valid_from / valid_to window of the existing row that "owned" the affected date range. Slot the new version in the middle.

Gotcha 2 — Retroactive deletes.

  • Scenario: a row is deleted in the source DB.
  • Naive code does nothing (the staging extract doesn't include the deleted row, so no update fires).
  • Fix: anti-join staging against the dim — for every business key in the dim that's NOT in the source extract, flip is_current = false, valid_to = today on the current row.

Gotcha 3 — Surrogate-key collisions on re-run.

  • Scenario: a snapshot re-runs and assigns a new sk to the same (customer_id, valid_from) combination. Fact tables that referenced the old sk are now orphaned.
  • Naive code uses BIGSERIAL for sk, so re-runs always create new sks.
  • Fix: use a deterministic hash of (customer_id, valid_from) as the sk — re-runs produce the same value, fact joins stay intact.

Worked example — handle a late-arriving update

Detailed explanation. A 2026-04-15 update arrives on 2026-05-15; the dim has a current row from 2026-04-20.

Question. Show the dim before, the late row, and the dim after a correct late-arriving fix.

Input.

Existing dim:

sk customer_id city valid_from valid_to is_current
101 42 Austin 2024-01-01 NULL true

Late row (arriving 2026-05-15): (customer_id=42, city='Boston', effective_from='2026-04-15').

Code (corrected late-arriving update).

BEGIN;

-- Step 1: split the overlapping current row at the late row's effective date
INSERT INTO dim_customer (customer_id, city, valid_from, valid_to, is_current)
SELECT customer_id, city, valid_from, '2026-04-14'::DATE, false
FROM dim_customer
WHERE sk = 101;

-- Step 2: insert the late row (Boston, effective 2026-04-15, valid_to=NULL)
INSERT INTO dim_customer (customer_id, city, valid_from, valid_to, is_current)
VALUES (42, 'Boston', '2026-04-15', NULL, true);

-- Step 3: delete (or mark) the original overlapping row
DELETE FROM dim_customer WHERE sk = 101;

COMMIT;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The original row sk=101 covered 2024-01-01 → NULL. The late update says Boston was true from 2026-04-15.
  2. Split the old row into: 2024-01-01 → 2026-04-14 (the period when Austin was actually true).
  3. Insert the late Boston row: 2026-04-15 → NULL.
  4. Delete the original sk=101 because it has been superseded by the split rows.
  5. End state has TWO rows: Austin (2024-01-01 → 2026-04-14) and Boston (2026-04-15 → NULL).

Output.

customer_id city valid_from valid_to is_current
42 Austin 2024-01-01 2026-04-14 false
42 Boston 2026-04-15 NULL true

Rule of thumb. Late-arriving updates need window splitting, not append-only inserts. Always validate that effective dates don't overlap within the same business key.

SCD interview question on deterministic surrogate keys

The probe: "Our backfill re-runs the snapshot and now half the fact joins are broken. Why?"

Solution Using a hash-based deterministic surrogate key

-- Use a deterministic hash instead of BIGSERIAL
CREATE TABLE dim_customer (
    sk              BIGINT PRIMARY KEY,          -- BIGSERIAL replaced by hash
    customer_id     BIGINT NOT NULL,
    city            VARCHAR(100) NOT NULL,
    valid_from      DATE NOT NULL,
    valid_to        DATE,
    is_current      BOOLEAN NOT NULL DEFAULT TRUE
);

-- compute sk deterministically on insert (PostgreSQL, hashtextextended)
INSERT INTO dim_customer (sk, customer_id, city, valid_from, valid_to, is_current)
SELECT
    hashtextextended(customer_id::text || '|' || valid_from::text, 0) AS sk,
    customer_id, city, valid_from, valid_to, is_current
FROM stg;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Run Source row Computed sk Result
1 (42, Austin, 2024-01-01) hash('42 2024-01-01') = -7842341
2 (backfill) (42, Austin, 2024-01-01) hash('42 2024-01-01') = -7842341
1 (42, Seattle, 2026-05-01) hash('42 2026-05-01') = 1893221
2 (backfill) (42, Seattle, 2026-05-01) hash('42 2026-05-01') = 1893221

Output:

sk customer_id city valid_from Stable across re-runs?
-7842341 42 Austin 2024-01-01 yes
1893221 42 Seattle 2026-05-01 yes

Why this works — concept by concept:

  • Deterministic hashhash(business_key, valid_from) produces the same sk on every re-run.
  • No fact-table orphans — fact rows referencing sk = -7842341 always resolve to the same dim row.
  • BIGSERIAL collision avoided — re-running with BIGSERIAL would assign new sequential ids and break all existing joins.
  • dbt does this for youdbt_scd_id is exactly this hash (over unique_key + dbt_updated_at).
  • Cost — hash compute is trivial; the bigger win is that the sks are stable across data refreshes.

Dimensional modelling
Topic — dimensional-modeling
SCD gotchas + deterministic-sk problems

Practice →


Choosing the right SCD type (cheat sheet)

  • Need only current value? Type 1 — UPDATE in place, MERGE on business key.
  • Need full history for "as of" reports? Type 2 — close old row, insert new row, surrogate key + effective dates.
  • Need only "before/after" comparison? Type 3 — extra prev_* columns on a single row.
  • Need both current and historical without sub-queries? Type 6 — Type 2 history rows with denormalised current_* columns kept in sync.
  • Need automated SCD2 in dbt? dbt snapshot with strategy='timestamp' if source has reliable updated_at, otherwise strategy='check' with column list.
  • Need stable surrogate keys across re-runs? Deterministic hash of (business_key, valid_from), not BIGSERIAL.
  • Need to fact-join on a Type 2 dim? Surrogate key + effective-date predicate (fact_date BETWEEN valid_from AND COALESCE(valid_to, '9999-12-31')).
  • Handle late-arriving updates? Split the existing valid_from/valid_to window at the late effective date.
  • Handle retroactive deletes? Anti-join staging against dim and flip is_current=false, valid_to=today on missing keys.
  • Optimize Type 2 read performance? Partial index (business_key) WHERE is_current = true + index on (business_key, valid_from).

Frequently asked questions

What's the difference between SCD Type 1, Type 2, Type 3, and Type 6?

Type 1 overwrites in place — one row per business key, no history. Type 2 preserves full history by inserting a new row on every change, with valid_from, valid_to, and is_current columns. Type 3 keeps only "current" and "previous" values as separate columns on a single row — bounded history, no row growth. Type 6 is a hybrid: Type 2 history rows plus Type 1 current_* columns kept in sync across every row of the business key, plus optional Type 3 prev_* columns. Type 2 is the production default; Type 6 is for when both "as of" and "current" reports are common.

How do I implement SCD Type 2 with SQL MERGE INTO?

Use a two-step pattern wrapped in a transaction: (1) UPDATE the current row to set valid_to = snapshot_date - 1 day and is_current = false where the tracked column changed; (2) INSERT a new current row with valid_from = snapshot_date, valid_to = NULL, is_current = true. Some warehouses (Snowflake, BigQuery, Databricks) let you express both branches in a single MERGE INTO ... WHEN MATCHED ... WHEN NOT MATCHED ... statement. Always wrap the operations in a transaction so partial failure can't leave the dim in an inconsistent state.

What's the difference between a business key and a surrogate key in SCD?

The business key is the natural identifier from the source system — customer_id, product_sku, employee_number. It identifies the entity across time. The surrogate key (sk) is a synthetic id that uniquely identifies a version of the row — one sk per Type 2 row. Fact tables join to dims on the surrogate key, not the business key, because the same customer_id can have multiple historical versions. Use a deterministic hash of (business_key, valid_from) as the surrogate key so re-runs of the snapshot produce stable sks and don't orphan fact joins.

How does dbt snapshot implement SCD Type 2?

dbt snapshot is dbt's built-in SCD Type 2 materialisation. You write a snapshot file with a unique_key and a strategy of either timestamp (uses the source's updated_at column) or check (compares a named list of columns). dbt generates dbt_valid_from, dbt_valid_to, dbt_scd_id, and dbt_updated_at columns automatically. Each dbt snapshot run closes old versions and inserts new ones for any source row whose tracked attributes have changed. Schedule snapshots daily/hourly via Airflow, dbt Cloud, or cron.

How do I join a fact table to an SCD Type 2 dim?

Join on the business key with an effective-date predicate: JOIN dim_customer c ON c.customer_id = f.customer_id AND f.order_date >= c.valid_from AND (f.order_date < c.valid_to OR c.valid_to IS NULL). This ensures each fact row matches exactly one dim version — the one that was current at the time of the fact event. Alternatively, store the dim's surrogate key directly in the fact table at load time; then the join is simply fact.customer_sk = dim.sk with no date predicate, but the fact load must look up the right sk for the event's date.

What are the most common SCD interview gotchas?

Three failure modes every senior round probes: (1) late-arriving updates — a change with an effective date earlier than the current dim row's valid_from; the fix is to split the existing valid_from/valid_to window at the late effective date. (2) Retroactive deletes — a row deleted in the source DB; the fix is to anti-join the staging extract against the dim and flip is_current=false, valid_to=today on any business key missing from the source. (3) Surrogate-key collisions on re-runBIGSERIAL assigns new sks on backfill, breaking fact joins; the fix is a deterministic hash of (business_key, valid_from) as the sk.

Practice on PipeCode

Pipecode.ai is Leetcode for Data Engineering — every SCD concept above ships with hands-on practice rooms where you design dim tables, write SCD2 MERGEs, and trace fact-dim joins. Start with the dimensional-modelling library and work outward; PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine.

Practice dimensional modelling →
Slowly-changing-data drills →

Source: dev.to

arrow_back Back to Tutorials