Lakehouse Data Mesh: Domain Ownership, Contracts & Federated Governance

Data mesh is the architecture pattern that answers a question senior data engineers have faced since roughly 2018: what do you do when one central data team becomes a ticket queue and every domain in the business is blocked behind it? The answer Zhamak Dehghani published in 2019, refined through six years of production lessons, is to decentralise the data the way the business is already decentralised, put each domain in charge of its own data, and let a small platform team write the policies and primitives that keep the organisation from devolving into ungoverned anarchy. The substrate that makes that decentralisation feasible at scale is the lakehouse architecture: Delta, Iceberg, and Hudi on object storage, with a single cross-domain catalog on top.

This guide is the senior-engineer field manual for designing a real mesh on top of a real lakehouse. It walks the four principles in plain English, draws the bounded-context map (raw / derived / product tiers), names the six fields every data contract YAML must have, sketches the federated computational governance loop with Open Policy Agent and Unity Catalog, and is brutally honest about when mesh is the wrong answer. Each section ships an architecture-interview answer — diagrams, code, a step-by-step trace, an output card, and a concept-by-concept walkthrough of why the pattern wins.

On this page

Why centralized data teams stop scaling at 100+ engineers
The four principles of data mesh, made concrete
Domain bounded contexts — drawing the lines
Data contracts — the API of a data product
Federated computational governance — policy as code
Cheat sheet — data mesh implementation recipes
Frequently asked questions
Practice on PipeCode

1. Why centralized data teams stop scaling at 100+ engineers

The central data team is a ticket queue — and Conway's law says monolithic data orgs produce monolithic data warehouses

The one-sentence invariant: as an organisation grows past roughly one hundred product engineers, the central data team becomes a ticket queue whose length grows linearly with org size — every new domain (marketing, payments, inventory, support, ML) adds requests that one team has to triage, prioritise, and ship. The result is six-week SLAs on simple metric requests, a backlog that never goes down, and a "shadow IT" pattern where every domain spins up its own pipeline outside the central platform just to ship anything at all.

Three failure patterns of the centralised model.

Ticket queue grows linearly. Each new business domain adds 2-5 standing analytical requests per week. A central team of 20 engineers can ship maybe 30 requests per week; once you have 7 domains, the queue is permanently saturated and lead time blows past one quarter.
The central team has zero domain context. The marketing team knows what "qualified lead" means today (and that the definition changed two weeks ago). The data engineer in the central team learns this on the seventh Slack ping during PR review.
Conway's law. "Any organisation that designs a system will produce a design whose structure is a copy of the organisation's communication structure." A single central team produces a single monolithic warehouse — one schema, one repo, one CI pipeline, one on-call rotation — and that monolith carries every domain's quirks at once.

Conway's law in one sentence.

Conway's law (Melvin Conway, 1967) says software architecture mirrors org structure. The inverse for data is just as true: if you want a federated data architecture, you need a federated data organisation. Trying to bolt mesh onto a centralised team is a contradiction in terms.

What interviewers listen for.

Do you say "the queue grows linearly with org size, but team headcount only grows by a hire per quarter" when asked about scaling pain? — senior signal.
Do you mention Conway's law and connect it back to "a single team produces a single monolith"? — senior signal.
Do you correctly identify mesh as a socio-technical pattern (not a technology purchase)? — required answer.
Do you push back on "we need mesh" when the org has 30 engineers? — senior signal (knows when not to apply it).

The 2026 reality.

Lakehouse formats (Delta, Iceberg, Hudi) turn object storage into a multi-engine substrate. An Iceberg table stored as Parquet files can be queried from Spark, Trino, Snowflake, BigQuery, and DuckDB without copying data. That is the technical precondition that makes "one domain, one substrate, many consumers" feasible.
Cross-engine catalogs — Unity Catalog OSS (Databricks), Polaris (Snowflake), Apache Gravitino — standardise how domain ownership, access policies, and tags travel across compute engines. Without a unifying catalog, mesh devolves into a per-engine permission matrix nobody can audit.
Open Policy Agent (OPA) is a widely used policy-as-code engine. A CI pipeline can run an OPA evaluation step that blocks a PR if it violates a central policy.
The "anti-mesh" pattern is real. A 2024-2025 wave of "we tried data mesh and it became anarchy" post-mortems traces back to teams that adopted the domain ownership principle without the federated governance principle. Mesh without policy-as-code is exactly what those post-mortems said it would be.

Worked example — the ticket-queue bottleneck in one back-of-envelope calculation

Detailed explanation. Most data leaders intuit that "the central team is overwhelmed," but cannot quantify the bottleneck for the CFO. A simple Little's-Law-style calculation turns the intuition into a number that justifies the org change.

Question. A central data team has 20 engineers. The business has 7 domains, each filing an average of 3 standing analytical requests per week. Each request takes the central team an average of 1.5 engineer-weeks to ship. What is the steady-state queue depth and lead time?

Input.

Variable	Value
Central team size	20 engineers
Throughput per engineer-week	0.67 requests (1 / 1.5)
Weekly arrival rate	7 × 3 = 21 requests
Weekly service rate	20 × 0.67 = 13.3 requests

Code.

# Little's Law: L = λ × W
# When arrival rate > service rate, queue grows without bound.
arrival_per_week = 7 * 3                  # 21
service_per_week = 20 * (1 / 1.5)         # 13.3
utilization      = arrival_per_week / service_per_week   # 1.58

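# Queue growth per week (deterministic approximation)
backlog_growth = arrival_per_week - service_per_week     # +7.7 per week
# Lead time ≈ backlog / service rate, so a 13-week lead time needs a backlog
# of 13 × service_per_week; dividing by weekly growth gives the weeks until then.
weeks_to_one_quarter_lead_time = 13 * service_per_week / backlog_growth  # ~22.6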
print(f"utilization: {utilization:.2f}")
print(f"backlog grows by {backlog_growth:.1f} requests/week")
print(f"lead time hits 1 quarter in ~{weeks_to_one_quarter_lead_time:.0f} weeks")

Step-by-step explanation.

Arrival rate (21 / week) exceeds service rate (13.3 / week). Utilisation is 1.58 — already above 1.0, which is the queueing-theory red line for "unbounded growth."
Every week the backlog grows by about 7.7 requests. After 8 weeks the backlog is roughly 60 open requests on top of in-flight work.
Lead time — the time from "request filed" to "request shipped" — grows linearly with backlog. The team hits a one-quarter (13-week) average lead time after about 23 weeks.
Adding 4 engineers to the central team raises service rate to 16.0 / week. That is still below 21 / week — the queue still grows. The arrival rate is dominated by organisational structure (number of domains), so the only way to keep up is to change the structure: let each domain serve its own queue.
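
The headcount comparison in the last step can be checked with a few lines. This is a minimal sketch that reuses only the figures from the question (7 domains, 3 requests per domain per week, 1.5 engineer-weeks per request); the function name `service_rate` is illustrative, not part of any tool.

```python
import math

# Figures taken from the worked example's question.
domains = 7
requests_per_domain_per_week = 3
engineer_weeks_per_request = 1.5

arrival_per_week = domains * requests_per_domain_per_week  # 21


def service_rate(engineers: int) -> float:
    """Requests shipped per week by a team of the given size."""
    return engineers / engineer_weeks_per_request


# Smallest team whose service rate is at least the arrival rate.
break_even_engineers = math.ceil(arrival_per_week * engineer_weeks_per_request)

for team in (20, 24, break_even_engineers):
    change = arrival_per_week - service_rate(team)
    print(f"{team} engineers: {service_rate(team):.1f} req/week, "
          f"backlog change {change:+.1f}/week")
```

Even at the break-even size, utilisation sits just under 1.0, so lead times would still be long; the calculation only shows where the backlog stops growing.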

Output.

Metric	Value
Utilisation	1.58 (unsustainable)
Backlog growth	+7.7 requests / week
Time to one-quarter lead time	~23 weeks
Engineers needed to break even	32 (60% headcount increase)
Engineers needed under mesh	20 (same headcount, queue per domain)

Rule of thumb. When utilisation crosses 0.85, lead time spikes; when it crosses 1.0, the queue is mathematically unbounded. The fix is not "more engineers on the central team" — it is distributing the arrival rate across owners who already know the domain context.
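
To see why lead time spikes well before utilisation reaches 1.0, one option is the textbook M/M/1 queue, where mean time in system is the mean service time divided by (1 - utilisation). This is a simplifying assumption (single server, random arrivals and service times), not a model of any particular team; the function name is illustrative.

```python
def relative_lead_time(utilisation: float) -> float:
    """Mean time in system divided by mean service time, for an M/M/1 queue."""
    if not 0 <= utilisation < 1:
        # At or above 1.0 the queue has no steady state: it grows without bound.
        raise ValueError("no steady state at utilisation >= 1")
    return 1.0 / (1.0 - utilisation)


for rho in (0.50, 0.70, 0.85, 0.95, 0.99):
    print(f"utilisation {rho:.2f}: "
          f"lead time is {relative_lead_time(rho):.1f}x the service time")
```

Under this model, going from 0.50 to 0.95 utilisation multiplies lead time tenfold, and the worked example's 1.58 falls outside the range where a steady state exists at all.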

Worked example — Conway's law applied to the warehouse schema

Detailed explanation. Walk into a centralised warehouse and look at the schema. You will see one giant fact_events table fed by every domain, conflicting column conventions, mixed grain rows, and a metadata blob column nobody understands. That is not a schema-design failure — it is the org chart leaking into the database.

Question. A central team of 8 maintains a single fact_events table fed by marketing, payments, inventory, and support events. List four structural pathologies you predict in that schema purely from Conway's law, and the data-mesh restructure that fixes each.

Input — central warehouse symptoms.

Symptom	Conway's law prediction
`payload_json` blob with 47 keys	one schema cannot encode four domains' semantics
`event_type` column with 380 values	each domain reuses the column for its own taxonomy
`is_test` column NULL for 80% of rows	one domain's `is_test` semantic is not the others'
Single PR queue with 18 reviewers	one repo for four domain teams' work

Code (sketch).

-- Centralised schema — every domain crammed into one table
CREATE TABLE fact_events (
    event_id     STRING,
    domain       STRING,         -- 'marketing' | 'payments' | 'inventory' | 'support'
    event_type   STRING,         -- 380 distinct values across all domains
    user_id      STRING,
    ts           TIMESTAMP_TZ,
    payload_json STRING,         -- 47 union'd keys
    is_test      BOOLEAN
);

-- Mesh schema — one product table per domain, each with its own contract
CREATE TABLE marketing.product_lead_event (
    lead_id     STRING,
    campaign_id STRING,
    user_id     STRING,
    ts          TIMESTAMP_TZ,
    score       DECIMAL(5, 2)
);

CREATE TABLE payments.product_payment_event (
    payment_id  STRING,
    order_id    STRING,
    amount      DECIMAL(12, 2),
    currency    STRING,
    ts          TIMESTAMP_TZ,
    status      STRING
);

Step-by-step explanation.

The centralised payload_json blob is the literal manifestation of Conway's law: four domains forced into one schema produce one column that must encode all four. Splitting into four domain-owned tables collapses 47 union keys into four cohesive schemas.
The 380-value event_type becomes a per-domain enum of 30-80 values, each documented in the domain's own data contract.
The is_test NULL contract is now per-domain — the payments product can require non-NULL, the marketing product can default to FALSE — and there is no longer a "what does NULL mean in this row" mystery.
The single 18-reviewer PR queue is split into four domain queues of 4-6 reviewers each, all of whom already understand the domain. PR review time drops from days to hours.
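
The split from one blob-carrying row into typed, domain-owned records can be sketched in Python. The two dataclasses mirror the product tables in the SQL sketch; the payload key names and the sample rows are assumptions made up for illustration.

```python
import json
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class LeadEvent:            # mirrors marketing.product_lead_event
    lead_id: str
    campaign_id: str
    user_id: str
    ts: str
    score: Decimal


@dataclass(frozen=True)
class PaymentEvent:         # mirrors payments.product_payment_event
    payment_id: str
    order_id: str
    amount: Decimal
    currency: str
    ts: str
    status: str


def split_row(row: dict):
    """Turn one fact_events row into a typed record owned by its domain."""
    payload = json.loads(row["payload_json"])   # key names below are hypothetical
    if row["domain"] == "marketing":
        return LeadEvent(payload["lead_id"], payload["campaign_id"],
                         row["user_id"], row["ts"], Decimal(payload["score"]))
    if row["domain"] == "payments":
        return PaymentEvent(payload["payment_id"], payload["order_id"],
                            Decimal(payload["amount"]), payload["currency"],
                            row["ts"], payload["status"])
    raise ValueError(f"no product table defined for domain {row['domain']!r}")


rows = [
    {"domain": "marketing", "user_id": "u1", "ts": "2026-01-05T10:00:00Z",
     "payload_json": json.dumps(
         {"lead_id": "L1", "campaign_id": "C9", "score": "87.50"})},
    {"domain": "payments", "user_id": "u2", "ts": "2026-01-05T10:01:00Z",
     "payload_json": json.dumps(
         {"payment_id": "P1", "order_id": "O7", "amount": "19.99",
          "currency": "EUR", "status": "captured"})},
]
records = [split_row(r) for r in rows]
for rec in records:
    print(rec)
```

Each record type now fails loudly on a missing key instead of silently carrying a 47-key union, which is the point of moving the schema into the domain.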

Output.

Layer	Central warehouse	Mesh restructure
Tables for events	1 (`fact_events`)	4 (one per domain)
Schema columns	7 + 47-key JSON	5-8 per table
Distinct event_types	380 across one column	30-80 per domain, separately typed
PR queue	1 × 18 reviewers	4 × 4-6 reviewers
Domain-context owner	central team (none)	each domain team

Rule of thumb. If the central warehouse already has a fact_everything table with a JSON blob and 300+ enum values in one column, the org has been silently telling you it needs a mesh for a year. The schema is the symptom; the org is the cause.

Worked example — when mesh is wrong (the honest 200-engineer line)

Detailed explanation. The most senior signal in a mesh interview is the willingness to say "no, do not do mesh here." Mesh costs are real — platform team headcount, contract tooling, federated governance setup, multi-quarter migration — and below a certain org size they outweigh the benefits.

Question. A 60-engineer SaaS company with three product domains has a 5-engineer central data team and a six-week metric SLA. Should they adopt data mesh? Justify with four concrete checks.

Input — org snapshot.

Variable	Value
Total engineers	60
Product domains	3
Central data team	5
Current metric SLA	6 weeks
Platform team budget	0 engineers
Domain teams' data fluency	low (no embedded analysts)

Code (back-of-envelope cost model).

# Cost / benefit estimate for a 60-engineer org
mesh_setup_cost_eng_quarters = (
    4   # platform team build-out (2 eng × 2 quarters)
  + 3   # per-domain onboarding (1 eng-quarter × 3 domains)
  + 2   # governance tooling (OPA + contract CLI)
)  # = 9 eng-quarters

current_central_pain_eng_quarters_saved_per_year = (
    # 6-week SLA is bad, but only 30 metric requests/quarter at this size
    # central team can ship them in 2.5 eng-quarters/year of extra capacity
    2.5
)

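# Payback period in years: the saving is already expressed per year,
# so divide the one-off setup cost by it directly.
payback_years = (
    mesh_setup_cost_eng_quarters
    / current_central_pain_eng_quarters_saved_per_year
)
print(f"payback: {payback_years:.1f} years")   # 9 / 2.5 = 3.6 years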

Step-by-step explanation.

Mesh setup cost for a small org is roughly 9 engineer-quarters: a 2-person platform team for 2 quarters, plus domain onboarding, plus governance tooling.
The pain the central team is currently absorbing is ~2.5 engineer-quarters per year of overflow. At that rate the 9 engineer-quarter setup takes about 3.6 years (9 / 2.5) to pay back.
None of the three domains has embedded analysts yet. Pushing data ownership onto teams that have never owned a pipeline is adding a queue, not removing one.
The right answer for this org is: hire one more central engineer, invest in self-serve metric tooling, and revisit mesh when the org passes 150-200 engineers and at least four domains have embedded data engineers.

Output.

Check	Threshold	This org	Verdict
Domain count	≥ 4 domains with embedded DEs	3 domains, 0 embedded DEs	fail
Engineer count	≥ 150-200 product engineers	60	fail
Platform budget	≥ 2 platform engineers funded	0	fail
Current SLA pain	central team utilisation ≥ 0.9	6-week SLA but underloaded	fail

Rule of thumb. The rough industry threshold is 200 product engineers and 4+ domains with embedded data engineers. Below that line, mesh setup costs dwarf central-team pain. Above that line, the central team is mathematically incapable of keeping up and mesh is the only path.

Architecture interview question on scaling the central data team

A senior interviewer often opens with: "Your central data team of 20 is drowning. The CFO asks whether mesh is the answer. Walk me through the diagnostic — utilisation, Conway's law symptoms, org size — and give a clear yes / no with the three follow-up investments." It blends queueing math, org-design intuition, and the honesty to say "not yet."

Solution Using a four-axis diagnostic before recommending mesh

def mesh_readiness(org):
    """Score an org against four data-mesh prerequisites.

    Returns a verdict string and the list of failed checks.
    """
    checks = {
        "scale":           org.product_engineers >= 200,
        "domains":         org.domains_with_embedded_de >= 4,
        "platform_budget": org.platform_engineers_funded >= 2,
        "central_pain":    org.central_team_utilization >= 0.9,
    }
    failed = [k for k, ok in checks.items() if not ok]
    if not failed:
        return "ready", []
    if len(failed) <= 1:
        return "almost — fix the gap first", failed
    return "not yet — hire central, invest in self-serve", failed

Step-by-step trace.

Org	engineers	embedded DEs	platform budget	utilisation	failed checks	verdict
60-eng SaaS	60	0	0	0.6	scale, domains, platform_budget, central_pain	not yet
150-eng fintech	150	2	1	0.95	scale, domains, platform_budget	not yet
400-eng retail	400	5	3	0.92	(none)	ready
800-eng marketplace	800	7	4	0.96	(none)	ready

The diagnostic gives the CFO a clear verdict plus a specific gap list. "Not yet" comes with the next investment ("hire central, build self-serve"); "ready" comes with the platform-team build-out timeline.
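
The trace can be reproduced with a small harness. This is a self-contained sketch: the `Org` dataclass is an assumed container for the four inputs, the thresholds are the ones from the solution, and the verdict strings are shortened for readability.

```python
from dataclasses import dataclass


@dataclass
class Org:
    name: str
    product_engineers: int
    domains_with_embedded_de: int
    platform_engineers_funded: int
    central_team_utilization: float


def mesh_readiness(org: Org):
    """Return a short verdict and the names of the failed checks."""
    checks = {
        "scale":           org.product_engineers >= 200,
        "domains":         org.domains_with_embedded_de >= 4,
        "platform_budget": org.platform_engineers_funded >= 2,
        "central_pain":    org.central_team_utilization >= 0.9,
    }
    failed = [name for name, ok in checks.items() if not ok]
    if not failed:
        return "ready", failed
    if len(failed) == 1:
        return "almost", failed
    return "not yet", failed


ORGS = [
    Org("60-eng SaaS", 60, 0, 0, 0.60),
    Org("150-eng fintech", 150, 2, 1, 0.95),
    Org("400-eng retail", 400, 5, 3, 0.92),
    Org("800-eng marketplace", 800, 7, 4, 0.96),
]

for org in ORGS:
    verdict, gaps = mesh_readiness(org)
    print(f"{org.name}: {verdict} (gaps: {', '.join(gaps) or 'none'})")
```

Note that the 60-engineer org fails all four checks, including `central_pain`, because its utilisation of 0.6 is below the 0.9 threshold.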

Output:

Org	Verdict	Next investment
60-eng SaaS	not yet	+1 central engineer; build self-serve metric tooling
150-eng fintech	not yet	embed DEs in 2 more domains; fund platform team
400-eng retail	ready	spin up 3-engineer platform team; pilot 1 domain
800-eng marketplace	ready	full mesh rollout over 4 quarters

Why this works — concept by concept:

Scale gate — mesh setup cost is roughly fixed (platform team + tooling + governance). The cost / benefit only makes sense at orgs large enough to keep that team busy. Industry rule of thumb is 200+ product engineers.
Domain readiness gate — a domain that has never owned a pipeline cannot suddenly own a product tier with a contract and an on-call rotation. Embedded data engineers are the precondition.
Platform funding gate — without a paid platform team, "self-serve" turns into "everyone for themselves." Mesh assumes the central team converts from model-writers to platform-builders, not disappears.
Central-pain gate — if the central team is not saturated, mesh is solving a non-problem. The pain itself is the signal that decentralisation will pay back.
Cost — mesh setup is 8-15 engineer-quarters depending on org size. Below the readiness line, simpler interventions (more central headcount, self-serve metric layers) pay back faster.

2. The four principles of data mesh, made concrete

Mesh is four principles, four artifacts, and four failure modes — name each pair or you are buzzword-engineering

The mental model in one line: data mesh is four principles (domain ownership, data as product, self-serve platform, federated computational governance) that each map to a concrete artifact (domain repo, contract YAML, platform CLI, OPA policy) and a specific failure mode if you skip the artifact. Once you can quote the principle, the artifact, and the failure mode for all four, you can defend the architecture in any review meeting.

The four principles in one table.

#	Principle	One-line definition	Concrete artifact	Typical failure mode
1	Domain ownership	the team that owns the business logic owns the data	one repo per domain, one on-call rotation	shadow IT in marketing — domains write pipelines outside platform
2	Data as a product	datasets have SLAs, versioning, consumers, a discoverable interface	`product.contract.yaml` in domain repo + catalog page	data dumped in S3 with no contract — "is this even fresh?"
3	Self-serve platform	central team provides substrate, not models	platform CLI, standard CI, golden paths for new domains	"self-serve" platform that needs platform-team tickets to use
4	Federated governance	central policy-as-code, domains comply automatically	OPA policies in git, Unity Catalog tags	Confluence pages nobody reads — governance via vibes

Lakehouse as the substrate.

One storage layer, many engines. Delta / Iceberg / Hudi tables live in S3 / GCS / ABFS. Each domain owns its bucket prefix. Compute engines (Spark, Trino, Snowflake, BigQuery) read the same files — domains are not forced to use one engine.
One catalog, many domains. Unity Catalog, Polaris, or Gravitino provides cross-domain discovery, access control, and tag propagation. Each domain owns a schema (catalog → schema → table); the catalog is the cross-domain marketplace.
One identity layer, many policies. Workload identity (OIDC, IAM Roles for Service Accounts) ties pipeline runs to a domain. Policies then reason about "which domain is asking" without per-engine permission matrices.

The "anti-mesh" pattern — what real mesh is not.

"We abandoned governance." Domain ownership without federated governance is anarchy. The whole point of the federated qualifier is that central guardrails coexist with domain autonomy.
"Every team rolls their own stack." That is the absence of a self-serve platform. Domain ownership of data does not mean domain ownership of infrastructure. The platform team's job is to make sure every domain can ship a new product tier in one day, not one quarter.
"We renamed the central data team." Renaming the central warehouse "the mesh" without changing who writes the SQL is theatre. Mesh requires org change as much as architecture change.

Common architecture-interview probes.

"Name the four principles in order." — required answer. Listing them out of order is a yellow flag.
"Map each principle to an artifact." — senior signal. Knowing the artifact means you have actually built one.
"Name the typical failure mode for each principle." — staff signal. Knowing the failure means you have seen one in production.
"Which principle does Unity Catalog implement?" — federated governance plus self-serve platform (it is the substrate for both). Knowing it spans two principles is the architect's answer.

Worked example — the four-principle audit on an e-commerce org

Detailed explanation. Apply the four-principle / four-artifact / four-failure rubric to a concrete e-commerce organisation with four domains (orders, inventory, customer, payments). The output is an honest audit that surfaces which principles the org has half-implemented.

Question. Given an e-commerce org with four domains, walk through each of the four principles and identify the current state (implemented / half-implemented / missing) plus the next investment.

Input — domain inventory.

Domain	Owns business logic?	Has own repo?	Publishes contract?	On-call?
`orders`	yes	yes	half (schema only)	yes
`inventory`	yes	yes	no	no
`customer`	yes	shared with `auth`	no	no
`payments`	yes	yes	yes	yes

Code — the audit harness.

PRINCIPLES = [
    ("domain_ownership",   ["owns_logic", "own_repo", "on_call"]),
    ("data_as_product",    ["publishes_contract"]),
    ("self_serve_platform",["uses_platform_cli", "uses_standard_ci"]),
    ("federated_gov",      ["opa_policies_pass", "tagged_in_catalog"]),
]

def audit_domain(domain):
    return {
        principle: all(getattr(domain, f) for f in fields)
        for principle, fields in PRINCIPLES
    }

Step-by-step explanation.

orders owns its logic and has a repo + on-call but only publishes a schema, not a full contract. Half-implemented data_as_product. Next investment: ship the SLA and semantics sections of the contract.
inventory owns the logic and has a repo, but no contract and no on-call. Two of four principles broken. Next investment: name a domain lead and fund an on-call rotation before publishing cross-domain.
customer shares its repo with auth. That is a broken domain_ownership boundary — the bounded context is fuzzy. Next investment: split the repo or formalise the joint ownership.
payments is fully implemented across all four principles. Use it as the lighthouse domain when onboarding the other three.
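
The boolean harness cannot express "half", so a three-state variant is sketched below. It is an illustration only: the contract check is split into two hypothetical flags (`publishes_schema`, `publishes_sla`), and the platform and governance flags for each domain are assumed values chosen to match the audit output.

```python
PRINCIPLES = {
    "domain_ownership":    ["owns_logic", "own_repo", "on_call"],
    "data_as_product":     ["publishes_schema", "publishes_sla"],   # hypothetical split
    "self_serve_platform": ["uses_platform_cli", "uses_standard_ci"],
    "federated_gov":       ["opa_policies_pass", "tagged_in_catalog"],
}
ALL_FLAGS = {flag for fields in PRINCIPLES.values() for flag in fields}

# Each domain is described by the set of flags that are true (assumed values).
DOMAINS = {
    "orders":    ALL_FLAGS - {"publishes_sla"},
    "inventory": {"owns_logic", "own_repo", "uses_platform_cli", "uses_standard_ci"},
    "customer":  {"owns_logic", "uses_platform_cli"},
    "payments":  set(ALL_FLAGS),
}


def grade(flags: list) -> str:
    """'yes' if every sub-check passes, 'half' if some do, 'no' if none do."""
    if all(flags):
        return "yes"
    return "half" if any(flags) else "no"


def audit_domain(true_flags: set) -> dict:
    return {
        principle: grade([flag in true_flags for flag in fields])
        for principle, fields in PRINCIPLES.items()
    }


for name, flags in DOMAINS.items():
    print(name, audit_domain(flags))
```

Grading on sub-checks rather than one boolean per principle keeps a schema-only contract from being counted as either fully present or fully absent.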

Output.

Domain	Ownership	Product	Self-serve	Governance	Verdict
orders	yes	half	yes	yes	promote to full once contract complete
inventory	half	no	yes	no	needs on-call + contract
customer	half	no	half	no	split repo or formalise joint ownership
payments	yes	yes	yes	yes	lighthouse

Rule of thumb. Score each domain quarterly against the four-principle rubric. Promote one "lighthouse" domain to full implementation first, then use it as the migration template — fastest way to convert the rest of the org.

Worked example — mapping a principle to its artifact for a new domain

Detailed explanation. A new loyalty domain is being stood up. The platform team's job is to make sure every principle has a concrete artifact on day one, not "we will add the contract later." Skipping any artifact at onboarding is how mesh devolves into pre-mesh chaos with a new name.

Question. Walk through the platform-team checklist for onboarding a new loyalty domain — name the artifact for each of the four principles and the day-one acceptance test.

Input — onboarding checklist (skeleton).

Principle	Artifact	Day-one acceptance test
domain ownership	repo + CODEOWNERS + on-call schedule	PR auto-assigns to domain lead
data as product	`loyalty/product.contract.yaml`	contract validates in CI
self-serve platform	`platform-cli init loyalty`	CI passes on `main` within 30 minutes
federated governance	OPA policies attached, catalog tags set	CI fails if PII column unmasked

Code — onboarding script.

# One-command domain onboarding via the platform CLI
$ platform-cli init loyalty --owner @loyalty-team --on-call loyalty-pager

# Creates:
#   • git repo loyalty/ with CODEOWNERS pre-filled
#   • product.contract.yaml stub (name, version, owner, schema, sla, semantics)
#   • Standard CI workflow (dbt build → contract validate → OPA check → publish)
#   • Catalog schema `loyalty` registered in Unity Catalog
#   • OPA policy bundle attached (pii_masking.rego, retention.rego, cross_region.rego)
#   • On-call rotation wired to PagerDuty

Step-by-step explanation.

The platform CLI is the artifact for self-serve platform. One command stands up every artifact the new domain needs. If the CLI does not exist, every onboarding becomes a multi-day ticket — and self-serve is theatre.
The product.contract.yaml stub is the artifact for data as product. It is intentionally incomplete (just the schema sketch); the domain team fills in the SLA and semantics in the first sprint.
The CODEOWNERS file and on-call rotation are the artifacts for domain ownership. From day one, PRs route to the domain team, and incidents page the domain on-call — not the central team.
The OPA policy bundle is the artifact for federated governance. The same set of policies that runs against orders, payments, etc. is attached to the new domain. The central team never has to manually review each PR — the policies do it.
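
The "contract validates in CI" acceptance test could be as small as a required-field check. The sketch below assumes the contract has already been parsed from YAML into a dict (parsing itself needs a third-party YAML library) and uses the six fields named in the stub comment; `validate_contract` and its `allow_stub` flag are hypothetical, not the real CLI's interface.

```python
REQUIRED_FIELDS = ("name", "version", "owner", "schema", "sla", "semantics")


def validate_contract(contract: dict, allow_stub: bool = False) -> list:
    """Return a list of problems; an empty list means the contract passes.

    With allow_stub=True, fields may be present but empty, which matches a
    day-one stub that the domain team completes in its first sprint.
    """
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in contract]
    if not allow_stub:
        problems += [
            f"empty field: {f}"
            for f in REQUIRED_FIELDS
            if f in contract and not contract[f]
        ]
    return problems


# A stub as the onboarding command might leave it (illustrative values).
stub = {
    "name": "loyalty.points_balance",
    "version": "0.1.0",
    "owner": "@loyalty-team",
    "schema": [{"column": "member_id", "type": "STRING"}],
    "sla": {},
    "semantics": {},
}

print(validate_contract(stub, allow_stub=True))   # passes as a stub
print(validate_contract(stub))                    # flags the empty sections
```

Running the strict mode on `main` and the stub mode only on the first commit is one way to keep "we will add the contract later" from becoming permanent.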

Output.

Hour	Artifact created	Day-one acceptance test
0:00	git repo + CODEOWNERS	PR auto-assigns
0:05	contract YAML stub	`validate-contract` passes
0:10	CI workflow installed	`main` build green
0:15	catalog schema registered	`SHOW SCHEMAS` includes `loyalty`
0:20	OPA policies attached	mock PR with PII fails CI
0:30	on-call rotation live	PagerDuty test ping reaches lead

Rule of thumb. If onboarding a new domain takes more than half a day, the platform is not self-serve. Treat the onboarding time as the single most important platform-team KPI — anything over 4 hours means the CLI is missing a step.

Worked example — the "self-serve that needs platform tickets" anti-pattern

Detailed explanation. Mesh fails most often not because a principle is missing but because it is named without being implemented. The classic failure: the platform team builds a "self-serve" platform that requires a platform-team ticket to create a new dataset. Domain teams treat it like a ticket queue and the central pain returns under a new name.

Question. A platform team claims their platform is self-serve, but the onboarding doc has 9 manual steps and a "file a ticket with platform" item. Identify the three concrete fixes and the metric that proves the fix worked.

Input — current onboarding doc.

Step	Owner	Manual?
1. Create AWS sub-account	platform	yes (ticket)
2. Register Unity Catalog schema	platform	yes (ticket)
3. Set up dbt project	domain	yes
4. Add to CI pipeline	platform	yes (ticket)
5. Attach OPA policy bundle	platform	yes (ticket)
6. Register in catalog	domain	yes
7. PagerDuty rotation	platform	yes (ticket)
8. Backstage entry	platform	yes (ticket)
9. SLO dashboard	platform	yes (ticket)

Code — the fix.

# Replace the 7 platform-owned ticket steps with one CLI call
$ platform-cli init loyalty --owner @loyalty-team --on-call loyalty-pager

# Internally automates: sub-account, catalog schema, CI workflow,
# policy bundle, PagerDuty, Backstage, SLO dashboard.

Step-by-step explanation.

Every platform-owned step marked "(ticket)" is a queue. Seven tickets across one onboarding means lead time is at least the sum of the seven SLAs, because their dependencies prevent them from running in parallel.
The fix is automation, not delegation. Each ticket-step becomes a Terraform module or platform-CLI subcommand. The domain team triggers the whole sequence with one command.
The metric that proves the fix worked is median onboarding lead time. Pre-fix: 9-14 days (7 tickets at up to 2 days each, run serially). Post-fix: 30 minutes (one CLI invocation).
The platform team's new ticket queue is not onboarding; it is platform feature requests ("can the CLI also set up a Trino catalog?"). That queue is bounded and predictable — the onboarding queue was not.
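
The ticket count and the serial lead time can be derived directly from the onboarding table. This sketch copies the nine steps from that table and assumes a flat 2 working days per ticket, which is the figure the explanation uses; real SLAs would vary per step.

```python
# (step, owner, needs_platform_ticket) copied from the onboarding doc table.
STEPS = [
    ("Create AWS sub-account",        "platform", True),
    ("Register Unity Catalog schema", "platform", True),
    ("Set up dbt project",            "domain",   False),
    ("Add to CI pipeline",            "platform", True),
    ("Attach OPA policy bundle",      "platform", True),
    ("Register in catalog",           "domain",   False),
    ("PagerDuty rotation",            "platform", True),
    ("Backstage entry",               "platform", True),
    ("SLO dashboard",                 "platform", True),
]
DAYS_PER_TICKET = 2  # assumption: flat SLA per ticket

ticket_steps = [name for name, _owner, needs_ticket in STEPS if needs_ticket]

# Tickets with dependencies run one after another, so their SLAs add up.
serial_lead_time_days = len(ticket_steps) * DAYS_PER_TICKET

print(f"{len(ticket_steps)} ticket steps, "
      f"about {serial_lead_time_days} days if fully serial")
```

Keeping the step list in code also gives the platform team a checklist of exactly which subcommands the CLI has to automate.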

Output.

Metric	Before	After
Manual platform-team tickets per onboarding	7	0
Median onboarding lead time	9-14 days	30 minutes
Platform-team onboarding workload / quarter	7 × N domains	1 (CLI maintenance)
Domain-team frustration	high	low

Rule of thumb. The platform team's success metric is "self-serve onboarding lead time" — measured at the p50. If a new domain cannot stand up its first product table without a ticket, the platform is mis-named.

Architecture interview question on the four principles

A senior interviewer often frames this as: "Map a hypothetical e-commerce platform's four domains against the four mesh principles. Score each cell. Name the next investment that buys the biggest reduction in central-team workload." It probes whether you can use the rubric as a real diagnostic instead of as architecture-deck filler.

Solution Using the four-by-four mesh-readiness matrix

def mesh_audit(domain):
    """Score a domain on the four mesh principles."""
    return {
        "domain_ownership":     score(domain.has_repo, domain.has_oncall, domain.codeowners),
        "data_as_product":      score(domain.has_contract, domain.has_versioning, domain.has_consumers),
        "self_serve_platform":  score(domain.uses_cli, domain.uses_standard_ci, domain.onboarding_minutes < 60),
        "federated_governance": score(domain.opa_passing, domain.tags_set, domain.pii_lineage_clean),
    }

def score(*flags):
    return sum(1 for f in flags if f)

Step-by-step trace.

Domain	ownership / 3	product / 3	self-serve / 3	governance / 3	total / 12
orders	3	2	3	3	11
inventory	2	0	3	1	6
customer	1	0	2	1	4
payments	3	3	3	3	12

Column totals out of 12 are ownership 9, product 5, self-serve 11, and governance 8. Product is the lowest, but closing it means contract work inside each domain. Governance (8 / 12) is the lowest column that one centralised investment can close: investing one quarter to ship the OPA pii_masking.rego policy bundle across all four domains lifts governance to 12 / 12 and removes the most expensive central-team workload (manual PII review).
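
Column totals are easy to get wrong by eye, so they are worth recomputing from the trace table. A minimal sketch, with the scores copied from the table and the column names shortened:

```python
PRINCIPLES = ("ownership", "product", "self_serve", "governance")

# Rows copied from the step-by-step trace (each score is out of 3).
SCORES = {
    "orders":    (3, 2, 3, 3),
    "inventory": (2, 0, 3, 1),
    "customer":  (1, 0, 2, 1),
    "payments":  (3, 3, 3, 3),
}

row_totals = {domain: sum(row) for domain, row in SCORES.items()}
column_totals = {
    principle: sum(row[i] for row in SCORES.values())
    for i, principle in enumerate(PRINCIPLES)
}
max_per_column = 3 * len(SCORES)

# Rank columns from weakest to strongest to guide the next investment.
for principle in sorted(column_totals, key=column_totals.get):
    print(f"{principle}: {column_totals[principle]} / {max_per_column}")
```

On the trace-table figures the totals are ownership 9, product 5, self-serve 11, and governance 8, each out of 12.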

Output:

Investment	Total score lift	Central-team load saved
Ship OPA PII policy bundle	+4 (governance column)	~3 reviews / week
Add SLA section to `inventory` contract	+1 (product column)	~1 ticket / week
Split `customer` repo from `auth`	+1 (ownership column)	~1 review / week
Complete the audit quarterly	(process)	catches regressions early

Why this works — concept by concept:

Four-axis matrix — turns "are we doing mesh right?" from a vibe into a number. Each axis is a column in the matrix; each domain is a row.
Score per principle — three sub-checks per principle prevents binary "yes / no" gaming. A domain with a repo but no on-call is not "yes" on ownership; it is 1 / 3.
Lighthouse pattern — the highest-scoring domain (here payments) becomes the template. The platform team converts other domains into copies of the lighthouse, not greenfield each time.
Investment = lowest centrally fixable column: among the columns with the worst aggregate scores, the one that a single centralised investment (a policy bundle, a CLI feature) can close buys the biggest lift across the whole org.
Cost — running the audit takes one engineer-day per quarter; the data it produces drives the platform team's roadmap for the entire next quarter. Cheap insurance against silent mesh erosion.

3. Domain bounded contexts — drawing the lines

A bounded context is the unit of ownership — only the `product` tier crosses the line

The mental model in one line: a domain owns three internal tiers (raw, derived, product), but only the product tier is consumable by other domains — every cross-domain read goes through the catalog, never reaches into raw or derived storage. Once you can say "raw is private, derived is private, product is the API," the entire bounded-context interview surface collapses to enforcing one rule in CI.

The three-tier model per domain.

Raw tier (private). Whatever the source systems send: Kafka topics, CDC log files, vendor CSV drops. Schema may change daily. Only the domain pipeline reads it.
Derived tier (private). Cleaned, conformed, deduplicated, partitioned — the domain's internal working sets. May be joined heavily, may contain PII, may be expensive to recompute. Only the domain pipeline reads it.
Product tier (public, contract-bound). What other domains consume. Schema is locked by a contract. Versioned with semver. PII is masked or hashed. Documented in the catalog. Has an SLA and an on-call rotation.

The cross-domain consumption rule.

Cross-domain reads only touch the product tier. Marketing reads orders.order_facts (product); marketing never reads orders.orders_kafka_landing (raw) or orders.orders_clean_partitioned (derived).
Cross-domain reads go through the catalog. Unity Catalog / Polaris / Gravitino resolves the table name, applies access control, applies row / column policies, propagates tags. No direct S3 reads across domains.
Cross-domain joins on raw are banned. A platform-level OPA policy blocks any pipeline whose dependency graph reads another domain's raw or derived tier. The CI rejects the PR with a clear message.

The subscription model.

Consumers pin to a major.minor version. Marketing pins to order_facts >= 2.1, < 3.0. They get patch upgrades automatically; minor upgrades are additive; major upgrades require an explicit pin bump.
Producers signal breaking changes via PR. A schema change that breaks downstream consumers fails CI unless the major version is bumped AND a deprecation notice was posted ≥ 1 quarter ago.
The catalog tracks subscribers. Every product table page lists its current downstream consumers — making "who is reading this?" a one-page lookup instead of a Slack archaeology dig.

Common architecture-interview probes on bounded contexts.

"Can the marketing domain JOIN orders.orders_kafka_landing for performance?" — no, that breaks the bounded context. The right answer is publishing the needed fields as a product table or extending an existing product. Saying "yes for performance" is an automatic fail signal.
"Who owns dim_date?" — the platform team, almost always. Conformed dimensions (dim_date, dim_geo, dim_currency) are platform-tier products, not domain-tier products. Treating them as just-another-domain creates a circular ownership problem.
"What happens when two domains need to join their product tables?" — they JOIN through the catalog, both reading product. If the join is hot, model it once at the platform tier as a curated cross-domain mart.
"How do you stop a consumer from reaching into raw?" — OPA policy on lineage scan: if a downstream pipeline's lineage includes another domain's raw or derived tier, CI fails.

Worked example — drawing the bounded-context map for an e-commerce platform

Detailed explanation. Walk through a concrete e-commerce org's bounded contexts. Each domain owns a verb (placing orders, tracking inventory, knowing customers, processing payments). The product tier is the cross-domain language; the raw and derived tiers are private vocabulary.

Question. Define the four domains for an e-commerce platform, list their three tiers each, and identify two cross-domain consumption flows that must use the product tier.

Input — domain inventory.

Domain	Verb	Source systems	Product tables
orders	place + fulfill orders	order-service Kafka, OMS CDC	`order_facts` v2.1
inventory	track stock	inventory-service Kafka, warehouse RFID	`sku_inventory` v1.3
customer	know who the customer is	auth-service CDC, support tickets	`customer_dim` v3.0
payments	process money movement	stripe webhooks, ledger CDC	`payment_facts` v2.0

Code — the tier layout (Iceberg / Delta naming).

-- orders domain (Unity Catalog)
CREATE TABLE orders.raw.orders_kafka_landing (...);       -- private
CREATE TABLE orders.derived.orders_clean (...);           -- private
CREATE TABLE orders.product.order_facts (                 -- public, contract-bound
    order_id     STRING,
    customer_id  STRING,
    order_ts     TIMESTAMP,      -- stored as a UTC instant
    amount       DECIMAL(12, 2),
    currency     STRING,
    status       STRING
);

-- inventory domain
CREATE TABLE inventory.raw.inventory_kafka_landing (...); -- private
CREATE TABLE inventory.derived.inventory_clean (...);     -- private
CREATE TABLE inventory.product.sku_inventory (...);       -- public

-- Cross-domain flow: marketing reads orders.product.order_facts
-- joined with customer.product.customer_dim
SELECT
    c.segment,
    SUM(o.amount) AS total_revenue
FROM orders.product.order_facts o
JOIN customer.product.customer_dim c
  ON o.customer_id = c.customer_id
WHERE o.order_ts >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY c.segment;

Step-by-step explanation.

Each domain owns its raw + derived + product schemas under its catalog namespace. The schema namespace (orders.raw.*) makes ownership unambiguous and the access-control story trivial — grant the orders group write on orders.*, grant everyone else read on orders.product.* only.
The marketing query joins two product tables — that is the legal cross-domain pattern. The query never touches orders.raw.* or orders.derived.*, so the bounded context holds.
If marketing needs a new field (e.g. discount_amount) and it lives in orders.derived.orders_clean but not in orders.product.order_facts, the answer is not "let marketing read derived." It is "open a PR against the orders domain to extend the order_facts product schema as a v2.2 minor version."
The platform team owns platform.product.dim_date, platform.product.dim_geo, and other conformed dimensions. Every domain joins to those — domain teams do not each maintain their own date dimension.

Output.

Layer	Schemas	Cross-domain access
Raw	`<domain>.raw.*`	domain-internal only
Derived	`<domain>.derived.*`	domain-internal only
Product	`<domain>.product.*`	discoverable, contract-bound
Platform conformed dims	`platform.product.dim_*`	shared by every domain
Cross-domain marts	`platform.curated.*`	platform-managed joins of hot product tables

Rule of thumb. Build the catalog with a <domain>.<tier>.* namespace from day one. Reading from another domain's raw or derived is a CI failure — and the namespace makes the violation obvious in the SQL itself.
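
Because the namespace encodes both domain and tier, a violation can be spotted from the SQL text alone. The following is a minimal sketch, assuming table references always use the three-part `<domain>.<tier>.<table>` form; it relies on a regular expression rather than a real SQL parser, and the function name is hypothetical.

```python
import re

# Matches three-part names of the form <domain>.<tier>.<table>.
TABLE_REF = re.compile(r"\b([a-z_]+)\.(raw|derived|product|curated)\.([a-z_0-9]+)\b")


def cross_domain_violations(sql, consumer_domain):
    """Return references to another domain's raw or derived tier."""
    violations = []
    for domain, tier, table in TABLE_REF.findall(sql):
        if domain != consumer_domain and tier in ("raw", "derived"):
            violations.append(f"{domain}.{tier}.{table}")
    return violations


LEGAL_SQL = """
SELECT c.segment, SUM(o.amount) AS total_revenue
FROM orders.product.order_facts o
JOIN customer.product.customer_dim c ON o.customer_id = c.customer_id
GROUP BY c.segment
"""

ILLEGAL_SQL = """
SELECT m.campaign_id, SUM(o.amount) AS revenue
FROM marketing.derived.lead_attribution m
JOIN orders.derived.orders_clean o ON o.order_id = m.order_id
GROUP BY m.campaign_id
"""
```

A text scan like this is only a fast pre-check; the lineage-based policy remains the authoritative gate because it also sees views and indirect dependencies.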

Worked example — the cross-domain consumer subscription with version pinning

Detailed explanation. The marketing analytics domain wants to consume orders.product.order_facts. Without a contract and version-pinning, every schema change in orders silently breaks marketing. With the subscription model, the relationship is explicit, the pinning is mechanical, and breaking changes require a one-quarter deprecation window.

Question. Walk through the subscription handshake when marketing consumes orders.product.order_facts. Show how a minor version bump is silent for the consumer, and how a major version bump goes through a deprecation flow.

Input — initial state.

Producer	Consumer	Pinned version	Current version
`orders.product.order_facts`	`marketing.derived.orders_enriched`	`^2.1` (any 2.x ≥ 2.1)	`2.1.4`

Code — the marketing consumer config.

# marketing/dbt_project.yml (snippet)
consumers:
  - source: orders.product.order_facts
    pin:    "^2.1"          # any 2.x at or above 2.1
    on_breaking: "fail_ci"  # bumping major requires explicit consumer-side PR

Step-by-step explanation.

orders ships 2.1.5 — patch only (bug fix). Marketing's pin (^2.1) accepts it. No PR, no review, no notification. Patches are silent.
orders ships 2.2.0 — minor (additive: new column discount_amount appended). Marketing's pin (^2.1) accepts it. The new column is ignored by marketing until they choose to use it. Additive minor bumps are also silent.
orders opens a PR for 3.0.0 — major (breaking: amount renamed to gross_amount). The PR includes a 90-day deprecation notice. Marketing's CI now displays a "your producer is sunsetting ^2.1 in 90 days" warning on every run.
Marketing opens its own PR to update the pin to ^3.0 and rename the column reference. The two PRs merge in coordinated order. The deprecation window prevents the producer from breaking the consumer mid-quarter.

Output.

Producer ship	Marketing's pin	Marketing PR needed?	Lead time
2.1.4 → 2.1.5	`^2.1`	no	0 (silent)
2.1.5 → 2.2.0	`^2.1`	no	0 (silent)
2.2.0 → 3.0.0	`^2.1`	yes	up to 1 quarter

Rule of thumb. Always pin with caret (^major.minor), never just latest. The caret lets patches and additive minors flow through, but turns major bumps into deliberate, reviewable events with a quarter of lead time.
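
The caret rule can be expressed as a small matcher. This is a sketch under the assumption that versions are plain `major.minor[.patch]` strings with major ≥ 1 and no pre-release suffixes; the helper names are illustrative, not part of any tool named above.

```python
def parse_version(text):
    """Parse 'major.minor[.patch]' into a 3-tuple of ints (patch defaults to 0)."""
    parts = [int(p) for p in text.split(".")]
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts[:3])


def satisfies_caret(pin, version):
    """True if `version` satisfies a caret pin such as '^2.1'.

    Same major version, and at or above the pinned minor / patch.
    """
    if not pin.startswith("^"):
        raise ValueError(f"expected a caret pin, got {pin!r}")
    floor = parse_version(pin[1:])
    candidate = parse_version(version)
    return candidate[0] == floor[0] and candidate >= floor
```

With the pin `^2.1`, versions 2.1.5 and 2.2.0 are accepted while 3.0.0 is rejected, which matches the three rows of the output table.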

Worked example — the "cross-domain JOIN on raw" anti-pattern, caught in CI

Detailed explanation. A junior engineer in marketing finds a join on orders.derived.orders_clean is 4x faster than the same join on orders.product.order_facts (the product tier has extra masking and view overhead). They submit a PR that reads from derived. The OPA policy catches the violation in CI and fails the PR with a clear message.

Question. Show the OPA policy that detects cross-domain reads outside the product tier, and the CI failure message a violating PR would receive.

Input — the violating SQL.

-- BAD — marketing reaching into orders' derived tier for performance
SELECT m.campaign_id, SUM(o.amount) AS revenue
FROM   marketing.derived.lead_attribution m
JOIN   orders.derived.orders_clean       o   -- ← bounded-context violation
  ON   o.order_id = m.order_id
GROUP BY m.campaign_id;

Code — the OPA policy (Rego).

package mesh.bounded_context

import rego.v1  # OPA 1.0 syntax: partial-set rules use `contains` and `if`

# Deny any pipeline whose lineage includes another domain's raw or derived tier.
deny contains msg if {
    pipeline_domain := input.pipeline.domain
    upstream        := input.lineage[_]
    upstream.domain != pipeline_domain
    upstream.tier   != "product"
    msg := sprintf(
        "pipeline %v (domain=%v) reads %v.%v.%v — only the product tier of another domain is consumable",
        [input.pipeline.name, pipeline_domain, upstream.domain, upstream.tier, upstream.table],
    )
}

Step-by-step explanation.

The OPA policy inspects the pipeline's lineage manifest (produced by the dbt / lineage scanner during CI).
For every upstream dependency, it checks: is the upstream in a different domain and in a tier other than product? If yes, the policy fires.
The CI step runs opa eval with the --fail-defined flag against the deny rule, so any violation returns a non-zero exit code; the PR comment shows the formatted message, and the merge button is greyed out.
The fix is for marketing to open a PR against orders requesting a new field on the product tier (a minor version bump), then update its query to read from product.

Output — the CI failure message.

CI check	Result	Message
dbt build	pass	12 models built
contract validate	pass	schema matches contract
bounded-context (OPA)	fail	pipeline `marketing.derived.lead_attribution` (domain=marketing) reads `orders.derived.orders_clean` — only the product tier of another domain is consumable
pii_masking (OPA)	pass	no PII columns detected

Rule of thumb. Codify the bounded-context rule as a policy that runs in CI, not as a Confluence page. The message needs to be specific enough that the engineer who tripped it can fix it in one PR — vague "this is not allowed" failures produce platform-team tickets, not fixes.
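
To unit-test the rule without an OPA binary, the same logic can be mirrored in Python. The manifest shape below is inferred from the fields the Rego policy reads (`input.pipeline`, `input.lineage`); the sample values and the function name are assumptions for illustration.

```python
import json


def deny(manifest):
    """Mirror of the Rego rule: one message per cross-domain, non-product read."""
    pipeline = manifest["pipeline"]
    messages = []
    for upstream in manifest["lineage"]:
        other_domain = upstream["domain"] != pipeline["domain"]
        if other_domain and upstream["tier"] != "product":
            messages.append(
                "pipeline {} (domain={}) reads {}.{}.{}: only the product tier "
                "of another domain is consumable".format(
                    pipeline["name"], pipeline["domain"],
                    upstream["domain"], upstream["tier"], upstream["table"],
                )
            )
    return messages


# Hypothetical lineage manifest for the violating marketing pipeline.
MANIFEST = json.loads("""
{
  "pipeline": {"name": "marketing.derived.lead_attribution", "domain": "marketing"},
  "lineage": [
    {"domain": "marketing", "tier": "derived", "table": "leads_clean"},
    {"domain": "orders",    "tier": "derived", "table": "orders_clean"},
    {"domain": "customer",  "tier": "product", "table": "customer_dim"}
  ]
}
""")
```

Only the `orders.derived.orders_clean` entry produces a message: same-domain reads and product-tier reads both pass.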

Architecture interview question on cross-domain consumption

A senior interviewer often frames this as: "Your marketing domain needs a field that exists in orders derived tier but not in the order_facts product. Walk me through the three options, recommend one, and explain the trade-offs." It probes whether you understand that the bounded-context rule has a cost and that mesh requires a clear escalation path.

Solution Using a product-extension PR with a minor version bump

# orders/product/order_facts.contract.yaml — proposed minor bump 2.1 → 2.2
name:        orders.product.order_facts   # fully qualified, as the name field requires
version:     "2.2.0"    # minor: additive only
owner:       "@orders-team"
schema:
  - { name: order_id,        type: string,          required: true }
  - { name: customer_id,     type: string,          required: true }
  - { name: order_ts,        type: timestamp_tz,    required: true }
  - { name: amount,          type: "decimal(12,2)", required: true }
  - { name: currency,        type: string,          required: true }
  - { name: status,          type: string,          required: true }
  - { name: discount_amount, type: "decimal(12,2)", required: false } # NEW
sla:
  freshness:    "15m"
  completeness: "99.9%"
  accuracy:     "99%"
semantics:
  discount_amount: "USD discount applied; 0 if no discount; NULL only for legacy < 2.2"

Step-by-step trace.

Option	Cost	Time	Verdict
1. Marketing reads derived	0 (today)	1 day	breaks bounded context — fail CI
2. Marketing forks the cleaning pipeline	high (duplicate logic)	2 weeks	duplicates ownership; smell
3. Extend `order_facts` to v2.2 with `discount_amount`	low (one schema change)	3 days	recommended

The recommended path is a minor version bump on orders.product.order_facts. Adding an optional field is additive — no consumer breaks because the new field is opt-in. The orders team ships the change as v2.2.0; marketing updates its consumer in a follow-up PR; the bounded context holds.

Output:

Step	Owner	Artifact
1. PR to extend contract	orders + marketing co-authored	`order_facts.contract.yaml`
2. CI: schema diff, semver check	platform CI	green / red
3. Producer ships v2.2	orders	new column in `order_facts`
4. Consumer updates query	marketing	reads new column
5. Catalog page auto-updates	platform	docs reflect new schema

Why this works — concept by concept:

Additive minor is the cheap escape valve — adding an optional column never breaks consumers. The contract semver rules turn "I need a new field" into a one-PR change instead of a cross-domain negotiation.
Bounded context preserved — marketing never reaches into orders' derived tier. The interview-grade signal is recognising that the performance argument for reaching into derived is short-term thinking that breaks the org.
Co-authored PR — the consumer (marketing) and producer (orders) co-author the PR. That is the right collaboration shape; it documents the consumer's need in the producer's repo and aligns review.
Catalog as marketplace — the catalog page is the source of truth for what order_facts v2.2 looks like. Search, schema, owner, freshness — all in one place. Marketing finds the new column there, not in Slack.
Cost — one schema change, one minor version bump, one PR each side. The alternative ("just read derived") costs a CI failure plus future architecture debt every time the schema drifts.

4. Data contracts — the API of a data product

A data contract is the YAML that turns a table into an API — six required fields, semver, git-reviewed

The mental model in one line: a data contract is a versioned YAML file (product.contract.yaml) that declares six fields — name, version, owner, schema, sla, semantics — lives in the domain's git repo, is reviewed via PR, and is enforced in CI on every write. Once the contract exists, the product table behaves like a REST API: callers know what they get, breaking changes are visible, and "is this field nullable?" is a one-line lookup, not a Slack archaeology dig.

The six required fields.

name — fully-qualified table name (orders.product.order_facts). Unique across the catalog.
version — semver string (2.1.0). Patch = bug fix, minor = additive, major = breaking.
owner — the domain team's git handle + on-call rotation (@orders-team + orders-pager). PRs auto-assign; incidents auto-page.
schema — column list with name, type, nullable, required, default. The structural promise.
sla — freshness window, completeness threshold, accuracy bound. The runtime promise.
semantics — what each column means in business terms: units, NULL semantics, business rules. The interpretive promise.

Schema enforcement at write-time.

dbt contracts (since 1.5) — contract: enforced: true in the model YAML makes dbt fail the run if the materialised table does not match the declared schema.
Great Expectations / Soda / Schemata — assertion frameworks that run in CI; check column types, ranges, uniqueness, freshness.
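Lakehouse-level constraints — Delta enforces NOT NULL and CHECK constraints at write time; Iceberg enforces required (NOT NULL) fields but has no CHECK constraints, and primary keys in Unity Catalog are informational rather than enforced. The contract maps to whichever constraints the table format supports, and the remainder runs as CI tests.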
Catalog-level pinning — Unity Catalog / Polaris records the contract version per table; readers see the schema as of the version they pinned to.

Semver rules — read aloud.

Patch (2.1.0 → 2.1.1). Bug fix, no schema change. Consumers do nothing.
Minor (2.1.0 → 2.2.0). Additive only — new optional columns, broader nullable values. Existing consumers' queries still compile and produce the same answer.
Major (2.1.0 → 3.0.0). Breaking — column rename, type narrow, drop, NULL contract change. Requires a deprecation window (typically ≥ 1 quarter) and explicit consumer-side opt-in.

SLAs as enforceable thresholds.

Freshness — freshness: "15m" means the latest row's event_ts is within 15 minutes of now(). Monitored by the platform; consumers see "stale" status in the catalog.
Completeness — completeness: "99.9%" means at most 0.1% of expected rows are missing in a window. Computed against a reference (CDC source, upstream count, etc.).
Accuracy — accuracy: "99%" means at most 1% of rows fail a domain-defined sample test (e.g. amount > 0, currency in standard_codes).
All three are observable — the platform publishes them as Prometheus metrics; the catalog page shows green / amber / red.
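
The three thresholds translate into three small checks. This is a sketch only: the function names, the way the reference count is obtained, and the sample-test predicates are assumptions, and a real platform would compute these from catalog metadata rather than in-memory rows.

```python
from datetime import datetime, timedelta, timezone


def is_fresh(latest_event_ts, now, max_lag=timedelta(minutes=15)):
    """Freshness: the newest row's event time is within `max_lag` of now."""
    return now - latest_event_ts <= max_lag


def completeness(source_count, product_count):
    """Completeness: fraction of expected rows present in the product table."""
    if source_count == 0:
        return 1.0
    return min(product_count / source_count, 1.0)


def accuracy(rows, checks):
    """Accuracy: fraction of sampled rows that pass every domain-defined check."""
    if not rows:
        return 1.0
    passed = sum(1 for row in rows if all(check(row) for check in checks))
    return passed / len(rows)
```

Each function returns a value the platform can publish as a metric and compare against the contract's `sla` block to colour the catalog badge.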

Common architecture-interview probes on contracts.

"What are the minimum fields in a data contract?" — name, version, owner, schema, sla, semantics. Six. Missing any is half a contract.
"Where does the contract live?" — in the domain's git repo, reviewed via PR. Never in a UI nobody opens. A contract in a UI is a Confluence page.
"What is the difference between patch, minor, and major?" — patch = bug, minor = additive, major = breaking. Treat the rule mechanically; do not "feel out" what's breaking.
"How do you stop a producer from shipping a breaking change in patch / minor?" — CI runs a schema-diff against the previous version; if the diff is non-additive and the semver bump is not major, CI fails.

Worked example — the full `order_facts.contract.yaml`

Detailed explanation. A complete, ship-it-today contract for order_facts v2.1 in the orders domain. Each field is filled with concrete values that pass a CI dry-run.

Question. Write the complete YAML for orders.product.order_facts v2.1 including all six required fields. Show how the producer-side dbt model references the contract.

Input — the schema sketch.

Column	Type	Nullable	Notes
order_id	string	no	UUID v4
customer_id	string	no	FK → customer_dim
order_ts	timestamp_tz	no	UTC, source-system event time
amount	decimal(12, 2)	no	USD, gross (before discount)
currency	string	no	ISO 4217
status	string	no	placed / shipped / cancelled / paid

Code — the contract YAML.

# orders/product/order_facts.contract.yaml
name:    orders.product.order_facts
version: "2.1.0"
owner:
  team:     "@orders-team"
  on_call:  "orders-pager"
  reviewers: ["@alice", "@bob"]

schema:
  - { name: order_id,    type: string,          required: true,  description: "UUID v4 of the order" }
  - { name: customer_id, type: string,          required: true,  description: "FK to customer_dim.customer_id" }
  - { name: order_ts,    type: timestamp_tz,    required: true,  description: "UTC source-system event time" }
  - { name: amount,      type: "decimal(12,2)", required: true,  description: "Gross amount in `currency`, before discount" }
  - { name: currency,    type: string,          required: true,  description: "ISO 4217 code; whitelisted to USD/EUR/GBP/JPY/INR" }
  - { name: status,      type: string,          required: true,  description: "Lifecycle: placed | shipped | cancelled | paid" }

sla:
  freshness:    "15m"     # latest order_ts ≥ now − 15 minutes
  completeness: "99.9%"   # (CDC source count − product count) / CDC source count ≤ 0.1%
  accuracy:     "99%"     # sample-test pass rate ≥ 99% over 24h window

semantics:
  amount:    "Gross amount in USD-equivalent (before discount). NULL is invalid."
  currency:  "Restricted to {USD, EUR, GBP, JPY, INR}. Other codes route to investigation."
  status:    "Lifecycle state at write time. Late-arriving status transitions emit a new event."
  null_rule: "Every required field MUST be non-NULL at write time. CI fails on violation."

The dbt model wires the contract.

# orders/models/product/order_facts.yml
version: 2
models:
  - name: order_facts
    config:
      contract:
        enforced: true
    columns:
      - name: order_id
        data_type: string
        constraints:
          - type: not_null
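      - name: amount
        data_type: "decimal(12,2)"
        constraints:
          - type: not_null
          - type: check
            expression: "amount >= 0"
      # customer_id, order_ts, currency, and status are omitted here for brevity;
      # an enforced contract requires every column to be declared.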

Step-by-step explanation.

The contract YAML is the source of truth. The dbt model references the same column names, types, and NOT NULL constraints — CI verifies the two are in sync.
contract: enforced: true makes dbt fail the run if the materialised table's column names or data types do not match the YAML (the check does not depend on column order).
The check constraint encodes a piece of the semantics section (amount >= 0) into a lakehouse-level rule. Delta enforces it at write time; on a format without CHECK support, such as Iceberg, the same rule runs as a CI test instead.
The contract version is stored as the lakehouse table property contract_version=2.1.0. Readers' pinning logic queries the catalog for this property and refuses to read a newer major version without an explicit pin bump.

Output.

Field	Filled?	Enforced where?
name	yes	catalog (must match table name)
version	yes	catalog property + git tag
owner	yes	CODEOWNERS + PagerDuty rotation
schema	yes	dbt contract + Iceberg / Delta constraints
sla	yes	platform monitoring (Prometheus + catalog status badge)
semantics	yes	written docs + sample-test rules

Rule of thumb. A contract that has all six fields and zero TODOs is the bar. A "draft" contract with a missing SLA section is not a contract — consumers cannot pin against it because the runtime promise is undefined.
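
The "six fields, zero TODOs" bar is easy to check mechanically once the YAML has been loaded into a dictionary. A minimal sketch, assuming the contract is already parsed (YAML loading is left out to keep the example dependency-free) and using a hypothetical function name:

```python
REQUIRED_FIELDS = ("name", "version", "owner", "schema", "sla", "semantics")


def contract_gaps(contract):
    """Return human-readable problems; an empty list means the bar is met."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = contract.get(field)
        if value in (None, "", [], {}):
            problems.append(f"missing required field: {field}")
    # Crude but effective: any TODO anywhere in the contract blocks publication.
    if "TODO" in repr(contract):
        problems.append("contract still contains a TODO")
    return problems


# Hypothetical draft contract with no SLA section.
DRAFT = {
    "name": "inventory.product.sku_inventory",
    "version": "1.3.0",
    "owner": {"team": "@inventory-team", "on_call": "inventory-pager"},
    "schema": [{"name": "sku", "type": "string", "required": True}],
    "semantics": {"sku": "TODO: document SKU format"},
}
```

For the draft above, the function reports both the missing `sla` field and the leftover TODO, which is exactly the state in which consumers should not be allowed to pin.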

Worked example — semver in action across one quarter

Detailed explanation. Walk through a representative quarter of order_facts evolution. Bug fix patches, additive minor bumps, and one major bump that follows the deprecation process. Each step shows the contract diff, the consumer impact, and the CI gate.

Question. Show how order_facts evolves from 2.1.0 to 3.0.0 over one quarter, naming the version bump type for each change and the consumer impact.

Input — change log.

Week	Change	Bump type
1	Fix bug: `amount` was occasionally negative for refunds. Add `CHECK >= 0`	patch
4	Add `discount_amount` column (optional, decimal(12,2))	minor
6	Add `payment_method` column (optional, string)	minor
9	Rename `amount` → `gross_amount`; introduce required `net_amount`	major

Code — the contract diff for week 9 (major bump).

# Before (v2.3)
- { name: amount, type: "decimal(12,2)", required: true }

# After (v3.0) — RENAME is a breaking change
- { name: gross_amount, type: "decimal(12,2)", required: true }
- { name: net_amount,   type: "decimal(12,2)", required: true } # NEW required column

deprecation:
  previous_version: "2.x"
  sunset_date:      "2026-09-30"           # ≥ 1 quarter out
  migration_notes:  "Rename `amount` → `gross_amount`. Add `net_amount = gross_amount - discount_amount`."

Step-by-step explanation.

Week 1: patch. Schema unchanged. Adding a CHECK constraint that the current data satisfies is non-breaking. Consumers do nothing.
Week 4: minor. New optional column. Consumers' queries still compile (they don't reference the new column). Catalog page auto-updates.
Week 6: minor again. Same rule — new optional column.
Week 9: major. The rename of amount would break every consumer; CI on the producer side fails the PR unless the version is bumped to 3.0 and a deprecation block is present with a sunset date ≥ 1 quarter out.
The producer ships 2.4 (final 2.x) and 3.0 concurrently for 1 quarter, so consumers can migrate at their own pace.

Output — the catalog timeline.

Date	Version	Type	Consumers affected	Sunset date
Week 1	2.1.1	patch	none	—
Week 4	2.2.0	minor	none	—
Week 6	2.3.0	minor	none	—
Week 9	2.4.0 (final 2.x) + 3.0.0	major	all 2.x consumers	2026-09-30

Rule of thumb. Treat semver as a mechanical rule, not a judgment call. If the schema diff is "add an optional column," it is automatically minor. If the diff is "rename or drop or narrow a type," it is automatically major. Removing the judgment call removes the most common contract violation.
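
The mechanical rule can be written down as a schema-diff classifier. This sketch is deliberately conservative: any type change or required-flag change counts as major, even though the rules above allow some widening as minor. The data shapes and function names are assumptions for illustration.

```python
RANK = {"patch": 0, "minor": 1, "major": 2}


def required_bump(old_columns, new_columns):
    """Classify a schema diff. Columns: name -> {"type": str, "required": bool}."""
    for name, old in old_columns.items():
        new = new_columns.get(name)
        if new is None:                         # dropped or renamed column
            return "major"
        if new["type"] != old["type"]:          # conservative: any type change
            return "major"
        if new["required"] != old["required"]:  # NULL contract change
            return "major"
    added = [name for name in new_columns if name not in old_columns]
    if any(new_columns[name]["required"] for name in added):
        return "major"                          # new required column
    return "minor" if added else "patch"


def declared_bump(old_version, new_version):
    """Name the semver component that changed between two version strings."""
    old = [int(p) for p in old_version.split(".")]
    new = [int(p) for p in new_version.split(".")]
    if new[0] != old[0]:
        return "major"
    if new[1] != old[1]:
        return "minor"
    return "patch"


def bump_is_consistent(old_version, new_version, old_columns, new_columns):
    """True if the declared bump is at least as large as the diff requires."""
    needed = required_bump(old_columns, new_columns)
    return RANK[declared_bump(old_version, new_version)] >= RANK[needed]
```

Applied to the change log above, week 4 (optional `discount_amount`) classifies as minor and week 9 (rename plus a new required column) classifies as major.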

Worked example — the contract-validation CI step

Detailed explanation. A producer ships a PR that bumps order_facts from 2.1.0 to 2.2.0 and adds a column. The CI step validate-contract runs four checks: the YAML is well-formed, the diff against main's version is semver-consistent (including a deprecation block when the bump is major), the dbt model matches the YAML, and the platform's OPA policies pass.

Question. Show the four CI checks the platform runs on every contract change PR and the failure messages that surface in the GitHub UI.

Input — the PR diff (extract).

- version: "2.1.0"
+ version: "2.2.0"
  schema:
    - { name: order_id,    type: string, required: true }
+   - { name: discount_amount, type: "decimal(12,2)", required: false }

Code — the validation harness.

# Platform CI invoked on every PR touching a *.contract.yaml
$ contract-cli validate \
    --yaml      orders/product/order_facts.contract.yaml \
    --baseline  main \
    --model     orders/models/product/order_facts.yml \
    --policies  policies/

# Internally runs:
# 1. JSON Schema validation of the YAML
# 2. Semver diff against main (additive → minor required)
# 3. dbt model ↔ YAML schema sync check
# 4. OPA policies (PII tags, retention, cross-region rules)

Step-by-step explanation.

JSON Schema validation rejects malformed YAML, missing required fields, and bad types. Catches typos before the diff stage.
Semver diff compares main's contract to the PR's contract. If the diff is non-additive but the bump is patch / minor, CI fails with a specific message: "non-additive diff requires major version bump."
The dbt model is parsed and its column list compared to the YAML's. Mismatches fail CI with "schema drift between dbt model and contract YAML."
OPA policies run against the new schema. A new column tagged PII: true in the schema but missing a mask: true directive fails CI immediately.

Output — the CI report card.

Check	Result	Detail
YAML well-formed	pass	6 fields present
Semver consistency	pass	diff is additive; minor bump matches
dbt model sync	pass	column list matches
OPA policies	pass	no PII / retention / cross-region violations

Rule of thumb. Make the contract-validation CI step required for merge. A contract that exists but is not enforced in CI is the worst of both worlds — consumers think they have a contract, producers think they have flexibility. Enforce or do not bother.
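
The third check, dbt model versus contract sync, comes down to comparing two column maps. A minimal sketch, assuming both files have already been parsed into `name -> type` dictionaries; the function name and message wording are hypothetical, not the output of the CLI shown above.

```python
def schema_drift(contract_columns, model_columns):
    """Compare column name -> type maps and return drift messages."""
    contract_names = set(contract_columns)
    model_names = set(model_columns)
    problems = []
    for name in sorted(contract_names - model_names):
        problems.append(f"{name}: in contract, missing from dbt model")
    for name in sorted(model_names - contract_names):
        problems.append(f"{name}: in dbt model, missing from contract")
    for name in sorted(contract_names & model_names):
        if contract_columns[name] != model_columns[name]:
            problems.append(
                f"{name}: contract type {contract_columns[name]} "
                f"!= model type {model_columns[name]}"
            )
    return problems
```

An empty result corresponds to the "column list matches" row in the report card; any message fails the check with the drift spelled out per column.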

Architecture interview question on contract design

A senior interviewer often frames this as: "Design the contract for a payments.product.payment_facts table that pays out across five currencies. Walk me through each of the six fields, then add one SLA and one semantic rule that you would not have included six months ago." It probes whether you write contracts as documents or as enforceable promises.

Solution Using a complete six-field contract with hard-won semantic rules

name:    payments.product.payment_facts
version: "2.0.0"
owner:
  team:     "@payments-team"
  on_call:  "payments-pager"

schema:
  - { name: payment_id,    type: string,          required: true,  description: "UUID v4 of the payment intent" }
  - { name: order_id,      type: string,          required: true,  description: "FK to orders.product.order_facts" }
  - { name: amount_native, type: "decimal(18,2)", required: true,  description: "Amount in the original currency" }
  - { name: amount_usd,    type: "decimal(18,2)", required: true,  description: "USD-converted amount at rate captured at write time" }
  - { name: currency,      type: string,          required: true,  description: "ISO 4217" }
  - { name: fx_rate,       type: "decimal(12,6)", required: true,  description: "FX rate used for amount_usd; non-NULL even for USD-native" }
  - { name: status,        type: string,          required: true,  description: "captured | failed | refunded | partially_refunded" }
  - { name: captured_ts,   type: timestamp_tz,    required: true,  description: "UTC capture time" }

sla:
  freshness:    "5m"
  completeness: "99.95%"
  accuracy:     "99.5%"
  # NEW (hard-won): late-arriving rows must arrive within 24h or be flagged
  late_arrival_window: "24h"

semantics:
  amount_native: "Always positive. Refunds use status='refunded', NOT a negative amount."
  amount_usd:    "Computed at write time using fx_rate. Never re-derived downstream."
  fx_rate:       "Even for USD-native rows, set to 1.000000, never NULL. NULL fx_rate is a CI failure."
  status:        "State at capture time; later state transitions emit a new row with the same payment_id."

Step-by-step trace.

Field	Why this exists
`name`	uniqueness across catalog
`version`	semver pin for consumers
`owner`	who pages on incident
`schema`	structural promise
`sla`	runtime promise
`semantics`	interpretive promise

The two hard-won additions: late_arrival_window: 24h and the fx_rate semantic rule "never NULL even for USD-native." The first came from a Q2 incident where reconciliation ran on partial data; the second from a Q3 incident where a downstream pipeline divided by a NULL fx_rate, which in SQL yields NULL rather than an error, and the resulting gaps were silently reported as zeroes.

Output:

Promise	Mechanism
Structural	dbt contract + Iceberg constraints
Runtime (SLA)	platform monitoring + catalog status badge
Interpretive (semantics)	CI sample tests + reviewer checklist

Why this works — concept by concept:

Six required fields — anything fewer is an incomplete promise. Schema without SLA leaves consumers guessing about freshness; SLA without semantics leaves them guessing about meaning.
Semantics column-by-column — the most under-documented field in 90% of "contracts" in the wild. Spelling out "refunds use status, not negative amount" prevents a class of downstream bugs.
Late-arrival window — explicit late-arrival policy turns the "is the data complete?" question into a deterministic check. Without it, reconciliations are timing-dependent and flaky.
Non-NULL fx_rate — encoding hard-won bugs as semantic rules turns institutional knowledge into machine-enforceable promises. Every NULL-fx_rate bug ever debugged should add a contract clause.
Cost — one engineer-week per domain to ship the first contract; ~half a day per minor version after that. Cheap insurance against the entire family of "I thought this column was non-NULL" outages.
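
The semantic rules for payment_facts can be turned into sample tests along these lines. This is a sketch under stated assumptions: rows are plain dictionaries, `arrived_ts` is a hypothetical load-time field that the contract itself does not define, and the function names are illustrative.

```python
from datetime import datetime, timedelta, timezone
from decimal import Decimal


def semantic_violations(row):
    """Check one payment_facts row against the contract's semantic rules."""
    problems = []
    if row["amount_native"] <= 0:
        problems.append("amount_native must be positive (refunds use status)")
    if row["fx_rate"] is None:
        problems.append("fx_rate must never be NULL")
    elif row["currency"] == "USD" and row["fx_rate"] != Decimal("1.000000"):
        problems.append("USD-native rows must carry fx_rate = 1.000000")
    return problems


def within_late_arrival_window(captured_ts, arrived_ts, window=timedelta(hours=24)):
    """True if the row arrived no later than `window` after capture."""
    return arrived_ts - captured_ts <= window
```

Rows that fail either function would be counted against the accuracy SLA or flagged as late, rather than being silently dropped.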

5. Federated computational governance — policy as code

The central team stops writing models and starts writing policies — domains comply automatically through CI

The mental model in one line: federated computational governance means the central platform team writes policies-as-code (OPA, Unity Catalog rules, lakehouse constraints), and the CI on every domain repo evaluates those policies on every PR. Domain autonomy and central guardrails coexist because the guardrail is a check, not a ticket. Once the policies are codified, the platform team's KPI flips from "how many tickets we shipped" to "% of compliance enforced automatically vs ticket-based."

The compliance loop in five steps.

Step 1. Central platform team writes a policy (e.g. "any column tagged PII: true must be masked in the product tier").
Step 2. The policy lives in a platform git repo, versioned and PR-reviewed by central + delegated reviewers from each domain.
Step 3. A domain team opens a PR in their own repo (adding a column, changing a schema).
Step 4. CI on the domain repo invokes opa eval against the platform policies. Violations fail the PR with a specific message and a link to the policy.
Step 5. Pass → merge. Fail → fix or open a "new policy needed?" issue against the platform repo. The feedback loop is < 60 seconds.

Policies that scale (the canonical set).

PII masking. Any column whose lineage tag includes PII: true must be masked, hashed, or tokenised in the product tier. Catches accidental exposure of email, ssn, phone.
Retention. Any table tagged customer_data must have a retention_days property ≤ 730. Drives automatic vacuum / time-travel pruning.
Cross-region. Reads of EU-tagged tables from non-EU compute require an approved exception. Catches GDPR / data-residency violations.
Query-pattern. Pipelines whose CPU-per-row exceeds a threshold get flagged in CI for review. Cheap defence against runaway costs.
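As one concrete instance from the list above, the retention policy can be sketched in a few lines. The table names and the `retention_violations` helper are assumptions for illustration; only the `customer_data` tag and the 730-day ceiling come from the policy description.

```python
MAX_RETENTION_DAYS = 730  # ceiling stated in the retention policy


def retention_violations(tables: list) -> list:
    """Return one message per customer_data table that breaks the retention rule."""
    violations = []
    for table in tables:
        if "customer_data" not in table.get("tags", []):
            continue  # the policy only applies to customer_data tables
        days = table.get("retention_days")
        if days is None:
            violations.append(f"{table['name']}: missing retention_days")
        elif days > MAX_RETENTION_DAYS:
            violations.append(
                f"{table['name']}: retention_days {days} exceeds {MAX_RETENTION_DAYS}"
            )
    return violations


# Hypothetical catalog entries.
tables = [
    {"name": "customer.product.customer_dim", "tags": ["customer_data"], "retention_days": 365},
    {"name": "customer.product.support_log", "tags": ["customer_data"], "retention_days": 1095},
    {"name": "customer.product.signup_events", "tags": ["customer_data"]},
    {"name": "platform.product.dim_date", "tags": []},
]
for line in retention_violations(tables):
    print(line)
```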

Tag inheritance through lineage.

Producer tags the source. The raw column customer.raw.users.email gets the PII: true tag once, by the customer domain.
Lineage scanner propagates. OpenLineage / Marquez / Datafold scan dbt manifest and CI artifacts, build the column-level lineage graph, propagate tags downstream automatically.
Derived and product inherit. Every downstream column derived from email inherits the PII: true tag — including hashed forms (SHA-256(email)), which are still PII under GDPR.
Policies key off the tag, not the column name. That decoupling means the policy survives column rename and propagation through unioned / joined / aggregated derivations.

The platform team's new KPI.

Before mesh. "Tickets shipped per quarter." Linear with team size. Caps out.
After mesh. "Percent of compliance enforced automatically." Bounded by 100%. Each new policy moves it up, as does each manual review replaced by a CI check.
The conversation with leadership changes. "We are at 92% automated compliance. The remaining 8% is the cross-region approval workflow which is intentionally manual."

Common architecture-interview probes on governance.

"How does a PII column propagate through derivations?" — through column-level lineage with tag inheritance. Hashed or tokenised PII is still PII for policy purposes.
"What stops a domain from opting out of governance?" — the platform CI workflow is a required check on every domain repo. The domain cannot merge to main without it passing. Platform writes the workflow template; domains import it.
"When does a policy get exception-approved instead of enforced?" — policies have an exception_allowed: true flag for cases like one-off analytics that need a 90-day exemption. The exemption is auditable, time-bound, and shows in the catalog.
"Is mesh compatible with strict regulatory regimes (SOX, GDPR, HIPAA)?" — more compatible than centralised, because the audit trail is built into the policy-as-code git history. Every compliance decision has a PR, a reviewer, and a timestamp.

Worked example — the PII-masking OPA policy

Detailed explanation. Write the canonical PII-masking policy. Any column whose tags include PII: true must have a mask: <method> directive in the contract. The policy runs on every PR that touches a *.contract.yaml.

Question. Write the Rego OPA policy that fails CI when a PII column in the product tier lacks a masking directive, and show the contract YAML that satisfies it.

Input — the contract excerpt.

schema:
  - { name: customer_id, type: string, required: true,  tags: ["PII"], mask: "sha256" }
  - { name: email,       type: string, required: true,  tags: ["PII"], mask: "tokenize" }
  - { name: order_total, type: "decimal(12,2)", required: true } # no PII tag

Code — the Rego policy.

package mesh.pii_masking

# Deny any product-tier column tagged PII that lacks a mask directive.
deny[msg] {
    input.tier == "product"
    col := input.schema[_]
    "PII" == col.tags[_]
    not col.mask
    msg := sprintf(
        "column %v is tagged PII but has no `mask:` directive (allowed: sha256 | tokenize | redact)",
        [col.name],
    )
}

# Allow only an approved list of masking methods.
allowed_masks := {"sha256", "tokenize", "redact"}

deny[msg] {
    col := input.schema[_]
    "PII" == col.tags[_]
    col.mask
    not allowed_masks[col.mask]
    msg := sprintf(
        "column %v uses unsupported masking method %v",
        [col.name, col.mask],
    )
}

Step-by-step explanation.

The first rule fires when a PII-tagged column has no mask: field. It composes a precise message — naming the offending column and listing the allowed methods.
The second rule fires when a mask: field exists but uses an unapproved value. The allowed set is a single Rego value, easy to extend.
Both rules are evaluated by opa eval in CI on every PR touching the contract. Failures block the merge.
The policy is layered with the column lineage scanner: if a downstream product column is derived from an upstream PII column but the downstream column lacks the PII tag itself, a second policy fires on the lineage manifest. Together they catch both direct and propagated PII exposure.

Output — CI on a violating PR (the contract above with the mask: directive removed from email).

Check	Result	Message
YAML well-formed	pass	6 fields present
Semver consistency	pass	minor bump
dbt model sync	pass	columns match
PII masking (OPA)	fail	column `email` is tagged PII but has no `mask:` directive (allowed: sha256 | tokenize | redact)

Rule of thumb. Write the policy once. Apply it to every domain. The platform team's role is "policy author," not "PR reviewer for PII." The reviewer role is delegated to the CI.
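For unit-testing the expected CI messages without an OPA binary, the two Rego rules can be approximately mirrored in Python. This is a sketch, not a replacement for the policy: `pii_masking_denials` is a hypothetical helper, and the `tier` key is assumed to sit at the top level of the parsed contract, as the Rego's `input.tier` implies.

```python
ALLOWED_MASKS = {"sha256", "tokenize", "redact"}


def pii_masking_denials(contract: dict) -> list:
    """Approximate Python mirror of the two deny rules in mesh.pii_masking."""
    denials = []
    for col in contract.get("schema", []):
        if "PII" not in col.get("tags", []):
            continue
        mask = col.get("mask")
        if mask is None:
            # First rule: only enforced in the product tier.
            if contract.get("tier") == "product":
                denials.append(
                    f"column {col['name']} is tagged PII but has no `mask:` "
                    "directive (allowed: sha256 | tokenize | redact)"
                )
        elif mask not in ALLOWED_MASKS:
            # Second rule: the mask exists but is not on the approved list.
            denials.append(
                f"column {col['name']} uses unsupported masking method {mask}"
            )
    return denials


compliant = {
    "tier": "product",
    "schema": [
        {"name": "customer_id", "type": "string", "tags": ["PII"], "mask": "sha256"},
        {"name": "email", "type": "string", "tags": ["PII"], "mask": "tokenize"},
        {"name": "order_total", "type": "decimal(12,2)"},
    ],
}
# The violating PR: same contract, with the mask removed from email.
violating = {
    "tier": "product",
    "schema": [
        {"name": "customer_id", "type": "string", "tags": ["PII"], "mask": "sha256"},
        {"name": "email", "type": "string", "tags": ["PII"]},
        {"name": "order_total", "type": "decimal(12,2)"},
    ],
}
print(pii_masking_denials(compliant))
print(pii_masking_denials(violating))
```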

Worked example — tag propagation through column lineage

Detailed explanation. A new derivation in the marketing domain joins customer.product.customer_dim.email_hash with campaign data. Even though the column is named email_hash and is already a SHA-256, the tag inheritance system propagates the PII: true tag automatically — and the platform's downstream policies enforce masking in marketing's product tier too.

Question. Show the column-level lineage graph and demonstrate how the PII: true tag flows from customer.raw.users.email through three layers of derivation.

Input — lineage manifest.

columns:
  - name: customer.raw.users.email
    tags: [PII]
  - name: customer.derived.users_clean.email_lower
    derived_from: [customer.raw.users.email]
  - name: customer.product.customer_dim.email_hash
    derived_from: [customer.derived.users_clean.email_lower]
    transform:    "sha256"
  - name: marketing.derived.lead_attribution.hashed_lead_email
    derived_from: [customer.product.customer_dim.email_hash]
  - name: marketing.product.campaign_lead_stats.unique_hashed_emails
    derived_from: [marketing.derived.lead_attribution.hashed_lead_email]
    transform:    "count_distinct"

Code — the inheritance rule (Rego).

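package mesh.tag_inheritance

# The `in` operator below needs this import on pre-1.0 OPA releases.
import future.keywords.in

# A column inherits any tag that any of its lineage ancestors has.
# Assumed input shape: input.ancestors maps a column name to the list of its
# ancestor columns, precomputed by the lineage scanner with PII-stripping
# transforms (e.g. count_distinct) already cut out of the ancestry.
inherited_tags(col) = tags {
    tags := {tag |
        ancestor := input.ancestors[col.name][_]
        tag := ancestor.tags[_]
    }
}

# Deny: derived column missing inherited PII tag.
deny[msg] {
    col := input.columns[_]
    "PII" in inherited_tags(col)
    not "PII" in col.tags
    msg := sprintf(
        "column %v inherits PII tag from upstream but does not declare it",
        [col.name],
    )
}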

Step-by-step explanation.

email is tagged PII once, at the raw source. Every derivation inherits the tag automatically through the lineage graph.
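The SHA-256 transform on email_hash does not strip the PII tag. Hashed PII is still PII: GDPR Article 4(5) classes hashing of this kind as pseudonymisation, and pseudonymised data remains personal data (Recital 26). The system encodes that legal fact in the policy.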
marketing.derived.lead_attribution.hashed_lead_email inherits PII transitively. If marketing's contract for lead_attribution.hashed_lead_email does not declare tags: [PII], CI fails on inheritance check.
count_distinct is an aggregating transform that produces a non-PII output (a count). The platform's transform-classification table marks count_distinct as PII-stripping; the output column does not inherit the tag. The policy author maintains this table.

Output.

Column	Inherited tags	Declared tags	CI
`email_lower`	{PII}	{PII}	pass
`email_hash`	{PII}	{PII}	pass
`hashed_lead_email`	{PII}	{PII}	pass
`unique_hashed_emails`	{} (transform strips)	{}	pass

Rule of thumb. Tag once at the source. Let lineage do the inheritance. Aggregating transforms (count, sum, count_distinct over hashes) strip PII tags; passing transforms (lower, trim, sha256, tokenize) preserve them.
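The propagation itself can be sketched outside Rego. The following Python walks the lineage manifest from this example and applies a transform-classification table; `PII_STRIPPING` and `inherited_tags` are hypothetical names, the table's contents follow the rule of thumb above, and the sketch assumes the lineage graph is acyclic.

```python
# Transform classification, maintained by the policy author (illustrative).
PII_STRIPPING = {"count", "sum", "count_distinct"}

manifest = [
    {"name": "customer.raw.users.email", "tags": ["PII"]},
    {"name": "customer.derived.users_clean.email_lower",
     "derived_from": ["customer.raw.users.email"]},
    {"name": "customer.product.customer_dim.email_hash",
     "derived_from": ["customer.derived.users_clean.email_lower"],
     "transform": "sha256"},
    {"name": "marketing.derived.lead_attribution.hashed_lead_email",
     "derived_from": ["customer.product.customer_dim.email_hash"]},
    {"name": "marketing.product.campaign_lead_stats.unique_hashed_emails",
     "derived_from": ["marketing.derived.lead_attribution.hashed_lead_email"],
     "transform": "count_distinct"},
]


def inherited_tags(columns: list) -> dict:
    """Map each column name to its own tags plus those inherited via lineage."""
    by_name = {col["name"]: col for col in columns}
    resolved = {}

    def resolve(name: str) -> set:
        if name in resolved:
            return resolved[name]
        col = by_name[name]
        inherited = set()
        for parent in col.get("derived_from", []):
            inherited |= resolve(parent)
        if col.get("transform") in PII_STRIPPING:
            inherited.discard("PII")  # aggregating transforms strip the PII tag
        resolved[name] = set(col.get("tags", [])) | inherited
        return resolved[name]

    return {name: resolve(name) for name in by_name}


tags = inherited_tags(manifest)
for name, column_tags in tags.items():
    print(name.rsplit(".", 1)[-1], sorted(column_tags))
```

The result matches the output table: every column up to hashed_lead_email carries PII, and unique_hashed_emails ends with an empty tag set because count_distinct is classified as stripping.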

Worked example — the cross-region read policy

Detailed explanation. A payments domain analyst in the US opens a PR that reads customer.product.customer_dim, which is tagged region: EU to meet GDPR data-residency requirements. The cross-region policy fires in CI: the read is blocked until an exception is granted (or until the analyst rewrites the query to use a US-resident aggregate).

Question. Write the cross-region policy in Rego and show how an exception is granted via a time-bound annotation.

Input — the violating PR.

-- Pipeline runs in compute_region=us-east-1 reading EU-tagged data
SELECT customer_id, signup_date
FROM customer.product.customer_dim
WHERE country = 'DE';

Code — the cross-region policy.

package mesh.cross_region

# Deny: pipeline reads EU-tagged data from non-EU compute without a valid exception.
deny[msg] {
    input.compute_region != "eu-west-1"
    upstream := input.lineage[_]
    "region:EU" == upstream.tags[_]
    not valid_exception
    msg := sprintf(
        "pipeline %v in %v reads EU-resident %v — request exception via /platform-exceptions",
        [input.pipeline.name, input.compute_region, upstream.name],
    )
}

# An exception counts only if it names a grantor and has not yet expired,
# so an expired exception fails CI exactly like a missing one.
valid_exception {
    ex := input.exceptions["cross_region_eu"]
    ex.granted_by != ""
    time.parse_rfc3339_ns(ex.expires) > time.now_ns()
}

# Warn while a time-bound exception is in effect (audited in git).
warn[msg] {
    valid_exception
    ex := input.exceptions["cross_region_eu"]
    msg := sprintf("cross-region exception in effect until %v", [ex.expires])
}

Step-by-step explanation.

The pipeline manifest declares compute_region: us-east-1. The lineage scan finds customer.product.customer_dim tagged region: EU. The first rule fires.
The exception block is a YAML file in the domain repo, e.g. exceptions/cross_region_eu.yaml, granted by an authorised reviewer and expiring on a stated date. Without the file, CI fails.
Granting an exception is a PR against the exception file, not against the policy. That PR is reviewed by the platform compliance reviewer, time-bound, and audited.
The policy is data-residency enforcement as code: the same rule that satisfies GDPR Article 44 (cross-border transfers) lives in git, runs in CI, and stays auditable for as long as the repo history is kept.

Output.

State	CI result	Note
No exception	fail	PR blocked, message links to /platform-exceptions
Exception granted, valid	pass	exception_expires emitted as warning
Exception granted, expired	fail	CI recomputes on every run; expiry is mechanical

Rule of thumb. Encode every compliance rule (GDPR, HIPAA, SOX) as a policy with a time-bound exception mechanism. The auditor's job becomes "review the policy repo," not "interview the team." Of all the compliance costs the mesh removes, that shift is the largest.
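The three states in the output table can be reproduced with a small Python sketch. It assumes an exception is valid only when it names a grantor and its expiry lies in the future; `cross_region_check`, the grantor name, and the dates are hypothetical, and the clock is fixed so the example is deterministic.

```python
from datetime import datetime, timezone
from typing import Optional

EU_COMPUTE_REGIONS = {"eu-west-1"}  # the single EU region named in the Rego policy


def cross_region_check(
    compute_region: str,
    upstream_tags: set,
    exception: Optional[dict],
    now: datetime,
) -> str:
    """Return 'pass' or 'fail' for one pipeline read, mirroring the policy intent."""
    if compute_region in EU_COMPUTE_REGIONS or "region:EU" not in upstream_tags:
        return "pass"  # the policy does not apply to this read
    if exception is None or not exception.get("granted_by"):
        return "fail"  # no exception on file
    expires = datetime.fromisoformat(exception["expires"])
    return "pass" if expires > now else "fail"  # expiry is purely mechanical


now = datetime(2025, 3, 1, tzinfo=timezone.utc)  # fixed clock for the example
valid = {"granted_by": "platform-compliance", "expires": "2025-06-30T00:00:00+00:00"}
expired = {"granted_by": "platform-compliance", "expires": "2025-01-31T00:00:00+00:00"}

for label, exception in [("none", None), ("valid", valid), ("expired", expired)]:
    print(label, cross_region_check("us-east-1", {"region:EU"}, exception, now))
```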

Worked example — measuring the federated-governance KPI

Detailed explanation. Define and compute the platform team's federated-governance KPI: percent of compliance enforced automatically vs ticket-based. Walk through a quarter where the team starts at 60% and finishes at 92% — naming each policy that moved the number.

Question. Compute the KPI from a quarter's data — total compliance actions, automated CI catches, manual reviews. Identify the two policies that moved the number the most.

Input.

Quarter	Compliance actions	CI catches	Manual reviews
Q1	1000	600	400
Q2	1100	760	340
Q3	1180	920	260
Q4	1260	1160	100

Code — the KPI calculation.

def federated_gov_kpi(ci_catches, manual_reviews):
    total = ci_catches + manual_reviews
    return ci_catches / total if total else 0.0

q = [(600,400), (760,340), (920,260), (1160,100)]
for i,(c,m) in enumerate(q, 1):
    print(f"Q{i}: {federated_gov_kpi(c, m):.0%} automated ({c}/{c+m})")

Step-by-step explanation.

Q1: 60% automated. The platform repo had PII masking and retention policies; cross-region was manual via Slack.
Q2: 69% — adding the cross_region policy cut manual reviews by 60 (400 → 340) while CI catches rose by 160 (600 → 760).
Q3: 78% — adding query_cost_pattern policy caught another 160 cases.
Q4: 92% — adding tag_inheritance automated the 200+ "did this derivation propagate PII?" reviews.
The remaining 8% is intentional: cross-region exceptions, novel-policy requests, and the quarterly auditor review. Those are appropriately manual.

Output.

Quarter	KPI	Top mover
Q1	60%	(baseline)
Q2	69%	`cross_region.rego`
Q3	78%	`query_cost_pattern.rego`
Q4	92%	`tag_inheritance.rego`

Rule of thumb. Report the KPI every quarter. Each new policy is a line in the change log; each policy that moves the number proves the platform's investment is paying back. The KPI is the platform team's most defensible budget argument.
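To answer the "which two policies moved the number most" part of the question, the quarter-over-quarter change can be computed in percentage points. This sketch follows the table's simplification of attributing each quarter's gain to the single policy listed as its top mover; `kpi` and `movers` are illustrative names.

```python
# (quarter, CI catches, manual reviews, policy introduced that quarter)
quarters = [
    ("Q1", 600, 400, None),
    ("Q2", 760, 340, "cross_region.rego"),
    ("Q3", 920, 260, "query_cost_pattern.rego"),
    ("Q4", 1160, 100, "tag_inheritance.rego"),
]


def kpi(ci_catches: int, manual_reviews: int) -> float:
    """Share of compliance actions that were caught automatically."""
    return ci_catches / (ci_catches + manual_reviews)


movers = []
for previous, current in zip(quarters, quarters[1:]):
    delta_points = (kpi(current[1], current[2]) - kpi(previous[1], previous[2])) * 100
    movers.append((current[3], round(delta_points, 1)))

# Largest gain first.
movers.sort(key=lambda item: item[1], reverse=True)
for policy, points in movers[:2]:
    print(f"{policy}: +{points} points")
# tag_inheritance.rego: +14.1 points
# cross_region.rego: +9.1 points
```

The Q2 and Q3 gains are close (9.1 vs 8.9 points), so the second place is sensitive to small changes in the inputs.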

Architecture interview question on federated governance

A senior interviewer often frames this as: "Walk me through how a new PII column added in the customer domain gets enforced across marketing, payments, and orders without anyone filing a ticket." It tests whether you understand that federated governance is a loop, not a one-off policy.

Solution Using policy-as-code + lineage tag inheritance + CI enforcement

# 1. customer domain adds PII column, tags it in the contract
# customer/product/customer_dim.contract.yaml
schema:
  - { name: phone_number, type: string, required: true, tags: ["PII"], mask: "tokenize" }

# 2. Platform OPA policy (already in place) enforces PII masking
# policies/pii_masking.rego applies to every domain's CI

# 3. Lineage scanner propagates PII tag to every downstream column
# OpenLineage manifest emitted by every CI run

# 4. Marketing / payments / orders CI fails on any unmasked downstream
# without anyone filing a ticket — the policy is the ticket

Step-by-step trace.

Step	Actor	Action	Latency
1	`customer` domain	adds `phone_number` PII column with `mask: tokenize`	1 day
2	platform CI	validates contract, OPA passes	60s
3	lineage scanner	propagates `PII: true` tag to downstream derivations across domains	next CI run
4	`marketing` CI	fails any downstream pipeline that exposes unmasked `phone_number`	60s per PR
5	`marketing` domain	adds masking + re-runs	1 day
6	platform team	observes KPI tick: +1 automated catch	passive

The platform team did nothing in the loop. The policy did the work. That is "federated computational governance" working as designed.

Output:

Outcome	Mechanism
Producer added new PII column	self-serve via contract YAML
Downstream domains caught violations	CI + lineage tag inheritance
Compliance audit trail	git history of policies + contracts
Platform team workload	zero PRs reviewed manually

Why this works — concept by concept:

Policy-as-code — turns "compliance is a process" into "compliance is a CI step." Same idea as terraform plan / apply for infra, applied to data.
Tag inheritance — solves the propagation problem mechanically. No engineer has to remember "this is downstream of PII." The lineage scanner does it.
Required CI workflow — every domain repo imports the platform's CI workflow. The platform team writes the policy; the domain team's CI runs it.
Exception as PR — exceptions are not "ask in Slack"; they are PRs against a versioned exception file. Auditors love this; engineers tolerate this.
Cost — the platform team writes ~20-40 policies over the first year. After that, the marginal compliance cost of a new domain is zero, because the policies already apply to it. The cost curve flattens, which is the opposite of the central-team queue, where cost keeps growing with every domain added.

Cheat sheet — data mesh implementation recipes

One repo per domain, one CI pipeline per repo. CODEOWNERS routes PRs; the standard CI workflow imports platform OPA policies. Onboarding a new domain = one CLI command, < 30 minutes.
Publish only the product tier cross-domain. Raw and derived are private to the domain. Cross-domain reads of raw or derived are CI failures, not negotiation.
Every product table has a <table>.contract.yaml. Six required fields: name, version, owner, schema, sla, semantics. Reviewed via PR. No "draft" contracts in production.
Semver as a mechanical rule. Patch = bug fix (no schema change). Minor = additive (new optional columns). Major = breaking (rename, drop, narrow). Deprecation window ≥ 1 quarter on major.
Pin consumers with caret (^major.minor). Patches and additive minors flow through silently; majors require explicit consumer PR.
Use Unity Catalog / Polaris / Gravitino for cross-domain discovery. Search, schema, owner, freshness, downstream consumers — all on the catalog page. Slack is not a catalog.
Policies-as-code in OPA, in git, PR-reviewed. Confluence pages are not policies. Policies that do not run in CI do not enforce anything.
Tag PII once at the source; let lineage inheritance propagate. Aggregating transforms (count, sum) strip the tag; passing transforms (lower, sha256, tokenize) preserve it.
Domain teams own their on-call rotation. Producer pages on freshness or accuracy SLA breach. Central platform pages only on substrate (catalog, CI, OPA) outages.
The platform team's KPI is "% compliance automated." Each new policy moves the number up. Report it every quarter; it is the platform team's budget argument.
Conformed dimensions (dim_date, dim_geo, dim_currency) live in platform.product.*. Never duplicate them per domain. Owning them centrally is the platform team's product-tier contribution.
Cross-domain hot joins are platform-managed marts. When two domains' product tables join often, model the join once in platform.curated.* instead of replicating the join in every consumer.
Migration from central warehouse is one-domain-at-a-time, not big-bang. Pick the most painful domain first (highest ticket count to central). Stand it up as a mesh domain. Use it as the lighthouse. Repeat.
"Self-serve" means < 30-minute onboarding. If onboarding requires a platform-team ticket, the platform is mis-named. Onboarding lead time is the single most important platform KPI.
Below 200 product engineers, don't do mesh. Hire one more central engineer and invest in self-serve metric layers. Mesh setup costs dwarf central-team pain below that line.
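The semver and caret-pinning recipes above are mechanical enough to sketch in code. Treat this as one possible reading: `classify_bump` and `caret_allows` are hypothetical helpers, any type change is conservatively treated as breaking, a newly added required column is assumed to be a major bump, and the caret rule assumes major versions of 1 or higher.

```python
def classify_bump(old: dict, new: dict) -> str:
    """Classify a schema change; each schema maps column name -> {'type', 'required'}."""
    for name, spec in old.items():
        if name not in new:
            return "major"  # dropped or renamed column
        if new[name]["type"] != spec["type"]:
            return "major"  # assumption: any type change counts as breaking
    added = [name for name in new if name not in old]
    if any(new[name].get("required", False) for name in added):
        return "major"  # assumption: a new required column breaks consumers
    return "minor" if added else "patch"


def caret_allows(pin: str, version: str) -> bool:
    """^major.minor accepts the same major with an equal or higher minor."""
    pin_major, pin_minor = (int(part) for part in pin.lstrip("^").split("."))
    major, minor, _patch = (int(part) for part in version.split("."))
    return major == pin_major and minor >= pin_minor


old_schema = {
    "order_id": {"type": "string", "required": True},
    "order_total": {"type": "decimal(12,2)", "required": True},
}
additive = {**old_schema, "coupon_code": {"type": "string", "required": False}}
dropped = {"order_id": {"type": "string", "required": True}}

print(classify_bump(old_schema, old_schema))  # patch
print(classify_bump(old_schema, additive))    # minor
print(classify_bump(old_schema, dropped))     # major
print(caret_allows("^2.1", "2.3.0"))          # True
print(caret_allows("^2.1", "3.0.0"))          # False
```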

Frequently asked questions

When is my org big enough to need data mesh?

The rough industry threshold is around 200 product engineers and at least 4 domains with embedded data engineers. Below that line, the central data team usually still scales — adding 1-2 engineers and investing in self-serve metric tooling pays back faster than the 8-15 engineer-quarter mesh setup cost. Above that line, the central team's utilisation passes 0.9, lead times blow past one quarter, and Conway's-law symptoms (one giant fact_everything table, 380 enum values in one column) appear in the warehouse schema. The honest answer in an interview is to refuse to recommend mesh without first running the four-axis diagnostic on scale, domain readiness, platform budget, and central-team utilisation.

What's the difference between data mesh and data fabric?

Data mesh is a socio-technical pattern (org + architecture) emphasising domain ownership of data products with federated governance. Data fabric is a technology pattern (mostly architecture) emphasising a unified metadata / orchestration layer that automates data integration, lineage, and governance across heterogeneous sources. In practice they are complementary, not competing: a real mesh implementation typically uses fabric-style metadata tooling (catalog, lineage, automated governance) as part of its self-serve platform substrate. The shorthand is "mesh is who owns the data; fabric is how the metadata flows." Most modern lakehouse platforms (Databricks Unity Catalog, Snowflake Polaris, Apache Gravitino) ship both: domain-namespaced ownership for mesh plus fabric-style automated lineage and policy propagation.

How do I migrate from a central warehouse to a mesh without big-bang rewrites?

Migrate one domain at a time, in pain-priority order. Pick the domain that files the most tickets against the central team — that is where the org will feel the win first. Stand up its repo, its product tier with a contract, its on-call rotation, and its OPA-enforced CI inside one quarter. Publish the lighthouse — every other domain converts by copying that domain's pattern (the platform CLI bakes the template). Keep the central warehouse running in parallel; consumers cut over to the new product tables on their own timeline using the version-pinning subscription model. Plan on 4-8 quarters for full migration of 5-10 domains, with the first quarter spent almost entirely on the platform-team substrate (CLI, CI templates, OPA bundle, catalog onboarding script) — that investment is what makes the remaining quarters fast.

Do I need a lakehouse to do data mesh?

You do not strictly need a lakehouse, but it makes mesh dramatically cheaper. The lakehouse architecture (Delta / Iceberg / Hudi on object storage with a cross-engine catalog like Unity Catalog or Polaris) gives you one storage layer that every domain's compute engine can read — Spark, Trino, Snowflake, BigQuery, DuckDB — without copying data. That is the technical precondition that makes "one domain, one substrate, many consumers" feasible. Without it, you end up with per-engine permission matrices, data duplication, and a fabric-style integration layer that becomes its own bottleneck. Modern mesh implementations almost universally use lakehouse formats as the substrate; older warehouse-only stacks (pure Snowflake or pure BigQuery) can still implement mesh but require more careful per-engine access policy plumbing.

Who owns shared dimensions like `dim_date` in a mesh?

The platform team owns conformed shared dimensions — dim_date, dim_geo, dim_currency, dim_organization. They live in platform.product.* namespace and are consumed by every business domain. Treating shared dimensions as "just another domain" creates a circular ownership problem (which domain owns "geography"?) and a duplication problem (every domain rolls its own dim_date with subtle inconsistencies). The platform team's product-tier contribution is precisely these conformed dimensions plus any hot cross-domain marts in platform.curated.*. That keeps the principle "domain owns business logic" intact for business domains while assigning the genuinely cross-cutting reference data to the team whose mandate is "make every other team 10x faster."

How do I prevent "mesh" from becoming "anarchy"?

The two non-negotiable guardrails are data contracts and federated computational governance — both enforced in CI, both versioned in git, both producing audit trails. The anti-mesh failure mode is "we adopted the domain ownership principle without the federated governance principle" — domains start publishing data without contracts, without SLAs, without PII tagging, and the org ends up with a hundred private warehouses and no auditor-friendly trail. The discipline is: domain autonomy lives inside the policy guardrails the platform team writes once. Every PR runs OPA. Every product table has a contract. Every cross-domain read goes through the catalog with masking applied. Every PII column is tagged at the source and inherited downstream. If any of those four invariants is missing, what you have is not data mesh — it is the central team's old pain rebranded across N teams.

Practice on PipeCode

Drill the data modeling practice library → for domain-modeling and dimensional schema problems that map onto mesh product tiers.
Rehearse on dimensional modeling problems → when the interviewer wants fact / dimension trade-offs for a conformed-dimension layer.
Sharpen ETL design drills → for the producer-side pipelines that publish a product tier from raw and derived.
Layer the event modeling library → for the source-system event schemas that feed each domain's raw tier.
Stack the design library → for the system-design surface around catalogs, governance, and policy enforcement.
For the broader surface, read top data engineering interview questions →.
Stack the prerequisites with the only 5 skills you need to become a data engineer →.
Sharpen the modeling axis with the data modelling for DE interviews course →.
For platform-engineering depth, work through ETL system design for data engineering interviews →.

Pipecode.ai is Leetcode for Data Engineering — every mesh principle above ships with hands-on practice rooms where you design the domain bounded contexts, draft the `product.contract.yaml`, and reason about federated governance loops against real graded prompts. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so you never have to wonder whether your lakehouse data mesh blueprint will survive contact with the staff-level interviewer who actually built one in production.

