Lakehouse Data Mesh: Domain Ownership, Contracts & Federated Governance
Data mesh is the architecture pattern that finally answers a question every senior data engineer has been forced to answer since 2018: what do you do when one central data team becomes a ticket queue and every domain in the business is blocked behind it? The honest answer, the one Zhamak Dehghani published in 2019 and that six years of production lessons have refined, is to decentralise the data the way the business is already decentralised, put each domain in charge of its own data, and let a small platform team write the policies and primitives that keep the whole organisation from devolving into ungoverned anarchy. The new substrate that makes that decentralisation feasible at scale is the lakehouse architecture: Delta, Iceberg, and Hudi on object storage, with a single cross-domain catalog on top.
This guide is the senior-engineer field manual for designing a real mesh on top of a real lakehouse. It walks the four principles in plain English, draws the bounded-context map (raw / derived / product tiers), names the six fields every data contract YAML must have, sketches the federated computational governance loop with Open Policy Agent and Unity Catalog, and is brutally honest about when mesh is the wrong answer. Each section ships an architecture-interview answer — diagrams, code, a step-by-step trace, an output card, and a concept-by-concept walkthrough of why the pattern wins.
When you want hands-on reps immediately after reading, drill the data modeling practice library, rehearse on dimensional modeling problems, and stack the platform muscles with ETL design drills.
On this page
- Why centralized data teams stop scaling at 100+ engineers
- The four principles of data mesh, made concrete
- Domain bounded contexts — drawing the lines
- Data contracts — the API of a data product
- Federated computational governance — policy as code
- Cheat sheet — data mesh implementation recipes
- Frequently asked questions
- Practice on PipeCode
1. Why centralized data teams stop scaling at 100+ engineers
The central data team is a ticket queue — and Conway's law says monolithic data orgs produce monolithic data warehouses
The one-sentence invariant: as an organisation grows past roughly one hundred product engineers, the central data team becomes a ticket queue whose length grows linearly with org size — every new domain (marketing, payments, inventory, support, ML) adds requests that one team has to triage, prioritise, and ship. The result is six-week SLAs on simple metric requests, a backlog that never goes down, and a "shadow IT" pattern where every domain spins up its own pipeline outside the central platform just to ship anything at all.
Three failure patterns of the centralised model.
- Ticket queue grows linearly. Each new business domain adds 2-5 standing analytical requests per week. A central team of 20 engineers can ship maybe 30 requests per week; once you have 7 domains, the queue is permanently saturated and lead time blows past one quarter.
- The central team has zero domain context. The marketing team knows what "qualified lead" means today (and that the definition changed two weeks ago). The data engineer in the central team learns this on the seventh Slack ping during PR review.
- Conway's law. "Any organisation that designs a system will produce a design whose structure is a copy of the organisation's communication structure." A single central team produces a single monolithic warehouse — one schema, one repo, one CI pipeline, one on-call rotation — and that monolith carries every domain's quirks at once.
Conway's law in one sentence.
Conway's law (Melvin Conway, 1967) says software architecture mirrors org structure. The inverse for data is just as true: if you want a federated data architecture, you need a federated data organisation. Trying to bolt mesh onto a centralised team is a contradiction in terms.
What interviewers listen for.
- Do you say "the queue grows linearly with org size, but team headcount only grows by a hire per quarter" when asked about scaling pain? — senior signal.
- Do you mention Conway's law and connect it back to "a single team produces a single monolith"? — senior signal.
- Do you correctly identify mesh as a socio-technical pattern (not a technology purchase)? — required answer.
- Do you push back on "we need mesh" when the org has 30 engineers? — senior signal (knows when not to apply it).
The 2026 reality.
- Lakehouse formats (Delta, Iceberg, Hudi) turn object storage into a multi-engine substrate. A parquet/iceberg table can be queried from Spark, Trino, Snowflake, BigQuery, and DuckDB without copying data. That is the technical precondition that makes "one domain, one substrate, many consumers" feasible.
- Cross-engine catalogs (Unity Catalog OSS from Databricks, Polaris from Snowflake, Apache Gravitino) standardise how domain ownership, access policies, and tags travel across compute engines. Without a unifying catalog, mesh devolves into a per-engine permission matrix nobody can audit.
- Open Policy Agent (OPA) is a widely used policy-as-code engine. A CI pipeline can run an OPA evaluation step that blocks a PR if it violates a central policy.
- The "anti-mesh" pattern is real. A 2024-2025 wave of "we tried data mesh and it became anarchy" post-mortems traces back to teams that adopted the domain ownership principle without the federated governance principle. Mesh without policy-as-code is exactly what those post-mortems said it would be.
Worked example — the ticket-queue bottleneck in one back-of-envelope calculation
Detailed explanation. Most data leaders intuit that "the central team is overwhelmed," but cannot quantify the bottleneck for the CFO. A simple Little's-Law-style calculation turns the intuition into a number that justifies the org change.
Question. A central data team has 20 engineers. The business has 7 domains, each filing an average of 3 standing analytical requests per week. Each request takes the central team an average of 1.5 engineer-weeks to ship. What is the steady-state queue depth and lead time?
Input.
| Variable | Value |
|---|---|
| Central team size | 20 engineers |
| Throughput per engineer-week | 0.67 requests (1 / 1.5) |
| Weekly arrival rate | 7 × 3 = 21 requests |
| Weekly service rate | 20 × 0.67 = 13.3 requests |
Code.
# Little's Law: L = λ × W
# When arrival rate > service rate, queue grows without bound.
arrival_per_week = 7 * 3 # 21
service_per_week = 20 * (1 / 1.5) # 13.3
utilization = arrival_per_week / service_per_week # 1.58
# Queue growth per week (deterministic approximation)
backlog_growth = arrival_per_week - service_per_week # +7.7 per week
weeks_to_one_quarter_lead_time = 13 * service_per_week / backlog_growth
print(f"utilization: {utilization:.2f}")
print(f"backlog grows by {backlog_growth:.1f} requests/week")
print(f"lead time hits 1 quarter in ~{weeks_to_one_quarter_lead_time:.0f} weeks")
Step-by-step explanation.
- Arrival rate (21 / week) exceeds service rate (13.3 / week). Utilisation is 1.58 — already above 1.0, which is the queueing-theory red line for "unbounded growth."
- Every week the backlog grows by about 7.7 requests. After 8 weeks the backlog is roughly 60 open requests on top of in-flight work.
- Lead time — the time from "request filed" to "request shipped" — grows linearly with backlog. The team hits a one-quarter (13-week) average lead time after about 23 weeks.
- Adding 4 engineers to the central team raises service rate to 16.0 / week. That is still below 21 / week — the queue still grows. The arrival rate is dominated by organisational structure (number of domains), so the only way to keep up is to change the structure: let each domain serve its own queue.
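The break-even point in the last step can be checked with a few lines of arithmetic. This is a minimal sketch using the worked example's inputs; `break_even_engineers` and `service_rate` are hypothetical helper names introduced here for illustration.

```python
import math


def break_even_engineers(domains: int, requests_per_domain: float,
                         eng_weeks_per_request: float) -> int:
    """Smallest headcount whose weekly service rate covers the arrival rate."""
    arrival_per_week = domains * requests_per_domain
    # One engineer ships 1 / eng_weeks_per_request requests per week, so the
    # required headcount is arrival rate times engineer-weeks per request.
    return math.ceil(arrival_per_week * eng_weeks_per_request)


def service_rate(engineers: int, eng_weeks_per_request: float) -> float:
    """Requests the team can ship per week."""
    return engineers / eng_weeks_per_request


print(break_even_engineers(7, 3, 1.5))  # 21 * 1.5 = 31.5, rounded up to 32
print(service_rate(24, 1.5))            # 20 + 4 engineers: 16.0, still below 21
```

The headcount answer scales with the number of domains, which is why adding engineers to one team only moves the break-even line rather than removing it.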
Output.
| Metric | Value |
|---|---|
| Utilisation | 1.58 (unsustainable) |
| Backlog growth | +7.7 requests / week |
| Time to one-quarter lead time | ~23 weeks |
| Engineers needed to break even | 32 (60% headcount increase) |
| Engineers needed under mesh | 20 (same headcount, queue per domain) |
Rule of thumb. When utilisation crosses 0.85, lead time spikes; when it crosses 1.0, the queue is mathematically unbounded. The fix is not "more engineers on the central team" — it is distributing the arrival rate across owners who already know the domain context.
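One way to see why lead time spikes well before utilisation reaches 1.0 is the textbook M/M/1 queue, where mean time in system equals mean service time divided by (1 - utilisation). The sketch below assumes that single-server model with random arrivals, which is a simplification of a real multi-engineer team; `slowdown_factor` is a hypothetical helper name.

```python
def slowdown_factor(utilisation: float) -> float:
    """Mean time in system divided by mean service time for an M/M/1 queue."""
    if utilisation < 0:
        raise ValueError("utilisation must be non-negative")
    if utilisation >= 1.0:
        # Arrivals outpace service: there is no steady state.
        return float("inf")
    return 1.0 / (1.0 - utilisation)


for rho in (0.50, 0.75, 0.85, 0.95, 1.58):
    print(f"utilisation {rho:.2f} -> lead time x{slowdown_factor(rho):.1f}")
```

Under this model the multiplier doubles between 0.50 and 0.75 and more than triples again by 0.95, which is the shape the rule of thumb describes.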
Worked example — Conway's law applied to the warehouse schema
Detailed explanation. Walk into a centralised warehouse and look at the schema. You will see one giant fact_events table fed by every domain, conflicting column conventions, mixed grain rows, and a metadata blob column nobody understands. That is not a schema-design failure — it is the org chart leaking into the database.
Question. A central team of 8 maintains a single fact_events table fed by marketing, payments, inventory, and support events. List three structural pathologies you predict in that schema purely from Conway's law, and the data-mesh restructure that fixes each.
Input — central warehouse symptoms.
| Symptom | Conway's law prediction |
|---|---|
| payload_json blob with 47 keys | one schema cannot encode four domains' semantics |
| event_type column with 380 values | each domain reuses the column for its own taxonomy |
| is_test column NULL for 80% of rows | one domain's is_test semantic is not the others' |
| Single PR queue with 18 reviewers | one repo for four domain teams' work |
Code (sketch).
-- Centralised schema: every domain crammed into one table
CREATE TABLE fact_events (
    event_id      STRING,
    domain        STRING,        -- 'marketing' | 'payments' | 'inventory' | 'support'
    event_type    STRING,        -- 380 distinct values across all domains
    user_id       STRING,
    ts            TIMESTAMP_TZ,
    payload_json  STRING,        -- 47 union'd keys
    is_test       BOOLEAN
);

-- Mesh schema: one product table per domain, each with its own contract
CREATE TABLE marketing.product_lead_event (
    lead_id      STRING,
    campaign_id  STRING,
    user_id      STRING,
    ts           TIMESTAMP_TZ,
    score        DECIMAL(5, 2)
);

CREATE TABLE payments.product_payment_event (
    payment_id  STRING,
    order_id    STRING,
    amount      DECIMAL(12, 2),
    currency    STRING,
    ts          TIMESTAMP_TZ,
    status      STRING
);
Step-by-step explanation.
- The centralised payload_json blob is the literal manifestation of Conway's law: four domains forced into one schema produce one column that must encode all four. Splitting into four domain-owned tables collapses 47 union keys into four cohesive schemas.
- The 380-value event_type becomes a per-domain enum of 30-80 values, each documented in the domain's own data contract.
- The is_test NULL contract is now per-domain (the payments product can require non-NULL, the marketing product can default to FALSE), and there is no longer a "what does NULL mean in this row" mystery.
- The single 18-reviewer PR queue is split into four domain queues of 4-6 reviewers each, all of whom already understand the domain. PR review time drops from days to hours.
Output.
| Layer | Central warehouse | Mesh restructure |
|---|---|---|
| Tables for events | 1 (fact_events) | 4 (one per domain) |
| Schema columns | 7 + 47-key JSON | 5-8 per table |
| Distinct event_types | 380 across one column | 30-80 per domain, separately typed |
| PR queue | 1 × 18 reviewers | 4 × 4-6 reviewers |
| Domain-context owner | central team (none) | each domain team |
Rule of thumb. If the central warehouse already has a fact_everything table with a JSON blob and 300+ enum values in one column, the org has been silently telling you it needs a mesh for a year. The schema is the symptom; the org is the cause.
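To make the restructure concrete, here is a minimal sketch of how rows from the centralised fact_events table could be routed into per-domain product records. It assumes each domain's typed columns currently live either at the top level or inside payload_json; the `PRODUCT_COLUMNS` mapping, the `split_by_domain` helper, and the sample rows are hypothetical, with column names taken from the SQL sketch above.

```python
import json
from collections import defaultdict

# Hypothetical mapping of domain -> product-table columns (from the SQL sketch).
PRODUCT_COLUMNS = {
    "marketing": ["lead_id", "campaign_id", "user_id", "ts", "score"],
    "payments": ["payment_id", "order_id", "amount", "currency", "ts", "status"],
}


def split_by_domain(rows):
    """Project fact_events rows onto each domain's own product schema."""
    products = defaultdict(list)
    for row in rows:
        columns = PRODUCT_COLUMNS.get(row["domain"])
        if columns is None:
            continue  # domain not migrated yet; leave it in the monolith
        payload = json.loads(row.get("payload_json") or "{}")
        top_level = {k: v for k, v in row.items() if k != "payload_json"}
        merged = {**payload, **top_level}  # top-level columns win on conflict
        products[row["domain"]].append({c: merged.get(c) for c in columns})
    return dict(products)


SAMPLE_ROWS = [
    {"event_id": "e1", "domain": "marketing", "user_id": "u1",
     "ts": "2026-01-05T10:00:00Z",
     "payload_json": '{"lead_id": "l1", "campaign_id": "c9", "score": 87.5}'},
    {"event_id": "e2", "domain": "payments", "user_id": "u1",
     "ts": "2026-01-05T10:02:00Z",
     "payload_json": '{"payment_id": "p1", "order_id": "o1", "amount": "19.99",'
                     ' "currency": "EUR", "status": "captured"}'},
]

for domain, records in split_by_domain(SAMPLE_ROWS).items():
    print(domain, records)
```

Each output record carries only the keys its own domain defines, so the 47-key union never appears in a product table.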
Worked example — when mesh is wrong (the honest 200-engineer line)
Detailed explanation. The most senior signal in a mesh interview is the willingness to say "no, do not do mesh here." Mesh costs are real — platform team headcount, contract tooling, federated governance setup, multi-quarter migration — and below a certain org size they outweigh the benefits.
Question. A 60-engineer SaaS company with three product domains has a 5-engineer central data team and a six-week metric SLA. Should they adopt data mesh? Justify with four concrete checks.
Input — org snapshot.
| Variable | Value |
|---|---|
| Total engineers | 60 |
| Product domains | 3 |
| Central data team | 5 |
| Current metric SLA | 6 weeks |
| Platform team budget | 0 engineers |
| Domain teams' data fluency | low (no embedded analysts) |
Code (back-of-envelope cost model).
# Cost / benefit estimate for a 60-engineer org
mesh_setup_cost_eng_quarters = (
    4    # platform team build-out (2 eng × 2 quarters)
    + 3  # per-domain onboarding (1 eng-quarter × 3 domains)
    + 2  # governance tooling (OPA + contract CLI)
)  # = 9 eng-quarters
# A 6-week SLA is bad, but there are only ~30 metric requests/quarter at this size;
# the overflow costs about 2.5 eng-quarters per year of extra capacity.
current_central_pain_eng_quarters_saved_per_year = 2.5
# Payback period in years. The savings figure is already per year,
# so the setup cost is divided by it directly (no extra factor of 4).
payback_years = (
    mesh_setup_cost_eng_quarters
    / current_central_pain_eng_quarters_saved_per_year
)
print(f"payback: {payback_years:.1f} years")  # 9 / 2.5 = 3.6
Step-by-step explanation.
- Mesh setup cost for a small org is roughly 9 engineer-quarters: a 2-person platform team for 2 quarters, plus domain onboarding, plus governance tooling.
- The pain the central team is currently absorbing is ~2.5 engineer-quarters per year of overflow. At that rate the 9 engineer-quarter setup takes about 3.6 years (9 / 2.5) to recoup.
- None of the three domains has embedded analysts yet. Pushing data ownership onto teams that have never owned a pipeline is adding a queue, not removing one.
- The right answer for this org is: hire one more central engineer, invest in self-serve metric tooling, and revisit mesh when the org passes 150 engineers and at least four domains have embedded data engineers.
Output.
| Check | Threshold | This org | Verdict |
|---|---|---|---|
| Domain count | ≥ 4 domains with embedded DEs | 3 domains, 0 embedded DEs | fail |
| Engineer count | ≥ 150-200 product engineers | 60 | fail |
| Platform budget | ≥ 2 platform engineers funded | 0 | fail |
| Current SLA pain | central team utilisation ≥ 0.9 | 6-week SLA but underloaded | fail |
Rule of thumb. The rough industry threshold is 200 product engineers and 4+ domains with embedded data engineers. Below that line, mesh setup costs dwarf central-team pain. Above that line, the central team is mathematically incapable of keeping up and mesh is the only path.
Architecture interview question on scaling the central data team
A senior interviewer often opens with: "Your central data team of 20 is drowning. The CFO asks whether mesh is the answer. Walk me through the diagnostic — utilisation, Conway's law symptoms, org size — and give a clear yes / no with the three follow-up investments." It blends queueing math, org-design intuition, and the honesty to say "not yet."
Solution Using a four-axis diagnostic before recommending mesh
def mesh_readiness(org):
    """Score an org against four data-mesh prerequisites.

    Returns a verdict string plus the list of failed checks.
    """
    checks = {
        "scale": org.product_engineers >= 200,
        "domains": org.domains_with_embedded_de >= 4,
        "platform_budget": org.platform_engineers_funded >= 2,
        "central_pain": org.central_team_utilization >= 0.9,
    }
    failed = [k for k, ok in checks.items() if not ok]
    if not failed:
        return "ready", []
    if len(failed) <= 1:
        return "almost: fix the gap first", failed
    return "not yet: hire central, invest in self-serve", failed
Step-by-step trace.
| Org | engineers | embedded DEs | platform budget | utilisation | failed checks | verdict |
|---|---|---|---|---|---|---|
| 60-eng SaaS | 60 | 0 | 0 | 0.6 | scale, domains, platform_budget, central_pain | not yet |
| 150-eng fintech | 150 | 2 | 1 | 0.95 | scale, domains, platform_budget | not yet |
| 400-eng retail | 400 | 5 | 3 | 0.92 | (none) | ready |
| 800-eng marketplace | 800 | 7 | 4 | 0.96 | (none) | ready |
The diagnostic gives the CFO a clear verdict plus a specific gap list. "Not yet" comes with the next investment ("hire central, build self-serve"); "ready" comes with the platform-team build-out timeline.
Output:
| Org | Verdict | Next investment |
|---|---|---|
| 60-eng SaaS | not yet | +1 central engineer; build self-serve metric tooling |
| 150-eng fintech | not yet | embed DEs in 2 more domains; fund platform team |
| 400-eng retail | ready | spin up 3-engineer platform team; pilot 1 domain |
| 800-eng marketplace | ready | full mesh rollout over 4 quarters |
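The trace can be reproduced by running the diagnostic over the four example orgs. The sketch below restates the function with shortened verdict strings so the block is self-contained; the `Org` dataclass is a hypothetical container whose field names match the attributes the function reads.

```python
from dataclasses import dataclass


@dataclass
class Org:
    name: str
    product_engineers: int
    domains_with_embedded_de: int
    platform_engineers_funded: int
    central_team_utilization: float


def mesh_readiness(org: Org):
    """Return (verdict, failed_checks) for the four readiness gates."""
    checks = {
        "scale": org.product_engineers >= 200,
        "domains": org.domains_with_embedded_de >= 4,
        "platform_budget": org.platform_engineers_funded >= 2,
        "central_pain": org.central_team_utilization >= 0.9,
    }
    failed = [name for name, ok in checks.items() if not ok]
    if not failed:
        return "ready", failed
    if len(failed) == 1:
        return "almost", failed
    return "not yet", failed


ORGS = [
    Org("60-eng SaaS", 60, 0, 0, 0.60),
    Org("150-eng fintech", 150, 2, 1, 0.95),
    Org("400-eng retail", 400, 5, 3, 0.92),
    Org("800-eng marketplace", 800, 7, 4, 0.96),
]

for org in ORGS:
    verdict, failed = mesh_readiness(org)
    print(f"{org.name}: {verdict} {failed}")
```

Note that the 60-engineer org fails all four gates here, because a utilisation of 0.6 is below the 0.9 central-pain threshold.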
Why this works — concept by concept:
- Scale gate — mesh setup cost is roughly fixed (platform team + tooling + governance). The cost / benefit only makes sense at orgs large enough to keep that team busy. Industry rule of thumb is 200+ product engineers.
- Domain readiness gate: a domain that has never owned a pipeline cannot suddenly own a product tier with a contract and an on-call rotation. Embedded data engineers are the precondition.
- Platform funding gate: without a paid platform team, "self-serve" turns into "everyone for themselves." Mesh assumes the central team converts from model-writers to platform-builders, not disappears.
- Central-pain gate — if the central team is not saturated, mesh is solving a non-problem. The pain itself is the signal that decentralisation will pay back.
- Cost — mesh setup is 8-15 engineer-quarters depending on org size. Below the readiness line, simpler interventions (more central headcount, self-serve metric layers) pay back faster.
2. The four principles of data mesh, made concrete
Mesh is four principles, four artifacts, and four failure modes — name each pair or you are buzzword-engineering
The mental model in one line: data mesh is four principles (domain ownership, data as product, self-serve platform, federated computational governance) that each map to a concrete artifact (domain repo, contract YAML, platform CLI, OPA policy) and a specific failure mode if you skip the artifact. Once you can quote the principle, the artifact, and the failure mode for all four, you can defend the architecture in any review meeting.
The four principles in one table.
| # | Principle | One-line definition | Concrete artifact | Typical failure mode |
|---|---|---|---|---|
| 1 | Domain ownership | the team that owns the business logic owns the data | one repo per domain, one on-call rotation | shadow IT in marketing — domains write pipelines outside platform |
| 2 | Data as a product | datasets have SLAs, versioning, consumers, a discoverable interface | product.contract.yaml in domain repo + catalog page | data dumped in S3 with no contract: "is this even fresh?" |
| 3 | Self-serve platform | central team provides substrate, not models | platform CLI, standard CI, golden paths for new domains | "self-serve" platform that needs platform-team tickets to use |
| 4 | Federated governance | central policy-as-code, domains comply automatically | OPA policies in git, Unity Catalog tags | Confluence pages nobody reads — governance via vibes |
Lakehouse as the substrate.
- One storage layer, many engines. Delta / Iceberg / Hudi tables live in S3 / GCS / ABFS. Each domain owns its bucket prefix. Compute engines (Spark, Trino, Snowflake, BigQuery) read the same files — domains are not forced to use one engine.
- One catalog, many domains. Unity Catalog, Polaris, or Gravitino provides cross-domain discovery, access control, and tag propagation. Each domain owns a schema (catalog → schema → table); the catalog is the cross-domain marketplace.
- One identity layer, many policies. Workload identity (OIDC, IAM Roles for Service Accounts) ties pipeline runs to a domain. Policies then reason about "which domain is asking" without per-engine permission matrices.
The "anti-mesh" pattern — what real mesh is not.
- "We abandoned governance." Domain ownership without federated governance is anarchy. The whole point of the federated qualifier is that central guardrails coexist with domain autonomy.
- "Every team rolls their own stack." That is the absence of a self-serve platform. Domain ownership of data does not mean domain ownership of infrastructure. The platform team's job is to make sure every domain can ship a new product tier in one day, not one quarter.
- "We renamed the central data team." Renaming the central warehouse "the mesh" without changing who writes the SQL is theatre. Mesh requires org change as much as architecture change.
Common architecture-interview probes.
- "Name the four principles in order." — required answer. Listing them out of order is a yellow flag.
- "Map each principle to an artifact." — senior signal. Knowing the artifact means you have actually built one.
- "Name the typical failure mode for each principle." — staff signal. Knowing the failure means you have seen one in production.
- "Which principle does Unity Catalog implement?" — federated governance plus self-serve platform (it is the substrate for both). Knowing it spans two principles is the architect's answer.
Worked example — the four-principle audit on an e-commerce org
Detailed explanation. Apply the four-principle / four-artifact / four-failure rubric to a concrete e-commerce organisation with four domains (orders, inventory, customer, payments). The output is an honest audit that surfaces which principles the org has half-implemented.
Question. Given an e-commerce org with four domains, walk through each of the four principles and identify the current state (implemented / half-implemented / missing) plus the next investment.
Input — domain inventory.
| Domain | Owns business logic? | Has own repo? | Publishes contract? | On-call? |
|---|---|---|---|---|
| orders | yes | yes | half (schema only) | yes |
| inventory | yes | yes | no | no |
| customer | yes | shared with auth | no | no |
| payments | yes | yes | yes | yes |
Code — the audit harness.
PRINCIPLES = [
    ("domain_ownership",    ["owns_logic", "own_repo", "on_call"]),
    ("data_as_product",     ["publishes_contract"]),
    ("self_serve_platform", ["uses_platform_cli", "uses_standard_ci"]),
    ("federated_gov",       ["opa_policies_pass", "tagged_in_catalog"]),
]

def audit_domain(domain):
    return {
        principle: all(getattr(domain, f) for f in fields)
        for principle, fields in PRINCIPLES
    }
Step-by-step explanation.
- orders owns its logic and has a repo + on-call but only publishes a schema, not a full contract. Half-implemented data_as_product. Next investment: ship the SLA and semantics sections of the contract.
- inventory owns the logic and has a repo, but no contract and no on-call. Two of four principles broken. Next investment: name a domain lead and fund an on-call rotation before publishing cross-domain.
- customer shares its repo with auth. That is a broken domain_ownership boundary: the bounded context is fuzzy. Next investment: split the repo or formalise the joint ownership.
- payments is fully implemented across all four principles. Use it as the lighthouse domain when onboarding the other three.
Output.
| Domain | Ownership | Product | Self-serve | Governance | Verdict |
|---|---|---|---|---|---|
| orders | yes | half | yes | yes | promote to full once contract complete |
| inventory | half | no | yes | no | needs on-call + contract |
| customer | half | no | half | no | split repo or formalise joint ownership |
| payments | yes | yes | yes | yes | lighthouse |
Rule of thumb. Score each domain quarterly against the four-principle rubric. Promote one "lighthouse" domain to full implementation first, then use it as the migration template — fastest way to convert the rest of the org.
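A runnable version of the audit harness needs a domain object that exposes the flags it reads. The sketch below uses a hypothetical `Domain` dataclass; the ownership and contract flags come from the input table, while the platform and governance flags are assumptions chosen to match the output table. A boolean harness like this collapses "half" into a fail.

```python
from dataclasses import dataclass

PRINCIPLES = [
    ("domain_ownership", ["owns_logic", "own_repo", "on_call"]),
    ("data_as_product", ["publishes_contract"]),
    ("self_serve_platform", ["uses_platform_cli", "uses_standard_ci"]),
    ("federated_gov", ["opa_policies_pass", "tagged_in_catalog"]),
]


@dataclass
class Domain:
    name: str
    owns_logic: bool
    own_repo: bool
    on_call: bool
    publishes_contract: bool
    uses_platform_cli: bool      # assumed, not in the input table
    uses_standard_ci: bool       # assumed, not in the input table
    opa_policies_pass: bool      # assumed, not in the input table
    tagged_in_catalog: bool      # assumed, not in the input table


def audit_domain(domain: Domain) -> dict:
    """A principle passes only when every one of its flags is true."""
    return {
        principle: all(getattr(domain, flag) for flag in flags)
        for principle, flags in PRINCIPLES
    }


payments = Domain("payments", True, True, True, True, True, True, True, True)
inventory = Domain("inventory", True, True, False, False, True, True, False, False)

for domain in (payments, inventory):
    print(domain.name, audit_domain(domain))
```

If the quarterly review needs to distinguish "half" from "no", a per-flag count is a better fit than `all(...)`.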
Worked example — mapping a principle to its artifact for a new domain
Detailed explanation. A new loyalty domain is being stood up. The platform team's job is to make sure every principle has a concrete artifact at day-one — not "we will add the contract later." Skipping any artifact at onboarding is how mesh devolves into pre-mesh chaos with a new name.
Question. Walk through the platform-team checklist for onboarding a new loyalty domain — name the artifact for each of the four principles and the day-one acceptance test.
Input — onboarding checklist (skeleton).
| Principle | Artifact | Day-one acceptance test |
|---|---|---|
| domain ownership | repo + CODEOWNERS + on-call schedule | PR auto-assigns to domain lead |
| data as product | loyalty/product.contract.yaml | contract validates in CI |
| self-serve platform | platform-cli init loyalty | CI passes on main within 30 minutes |
| federated governance | OPA policies attached, catalog tags set | CI fails if PII column unmasked |
Code — onboarding script.
# One-command domain onboarding via the platform CLI
$ platform-cli init loyalty --owner @loyalty-team --on-call loyalty-pager
# Creates:
# • git repo loyalty/ with CODEOWNERS pre-filled
# • product.contract.yaml stub (name, version, owner, schema, sla, semantics)
# • Standard CI workflow (dbt build → contract validate → OPA check → publish)
# • Catalog schema `loyalty` registered in Unity Catalog
# • OPA policy bundle attached (pii_masking.rego, retention.rego, cross_region.rego)
# • On-call rotation wired to PagerDuty
Step-by-step explanation.
- The platform CLI is the artifact for self-serve platform. One command stands up every artifact the new domain needs. If the CLI does not exist, every onboarding becomes a multi-day ticket — and self-serve is theatre.
- The product.contract.yaml stub is the artifact for data as product. It is intentionally incomplete (just the schema sketch); the domain team fills in the SLA and semantics in the first sprint.
- The CODEOWNERS file and on-call rotation are the artifacts for domain ownership. From day one, PRs route to the domain team, and incidents page the domain on-call, not the central team.
- The OPA policy bundle is the artifact for federated governance. The same set of policies that runs against orders, payments, etc. is attached to the new domain. The central team never has to manually review each PR; the policies do it.
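The "contract validates in CI" acceptance test can be approximated with a required-field check. This is a minimal sketch that assumes the contract YAML has already been parsed into a dict; the six field names come from the stub the CLI creates, while `missing_fields` and the example values are hypothetical.

```python
# Six fields listed in the product.contract.yaml stub.
REQUIRED_FIELDS = ("name", "version", "owner", "schema", "sla", "semantics")


def missing_fields(contract: dict) -> list:
    """Return required fields that are absent or still empty."""
    return [field for field in REQUIRED_FIELDS if not contract.get(field)]


# Hypothetical day-one stub: schema sketched, SLA and semantics not yet filled in.
stub = {
    "name": "loyalty_points",
    "version": "0.1.0",
    "owner": "@loyalty-team",
    "schema": [{"column": "member_id", "type": "STRING"}],
    "sla": None,
    "semantics": None,
}

print(missing_fields(stub))  # ['sla', 'semantics']
```

A CI step could fail the publish stage while this list is non-empty, which lets the stub exist on day one without letting an incomplete contract reach consumers.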
Output.
| Hour | Artifact created | Day-one acceptance test |
|---|---|---|
| 0:00 | git repo + CODEOWNERS | PR auto-assigns |
| 0:05 | contract YAML stub | validate-contract passes |
| 0:10 | CI workflow installed | main build green |
| 0:15 | catalog schema registered | SHOW SCHEMAS includes loyalty |
| 0:20 | OPA policies attached | mock PR with PII fails CI |
| 0:30 | on-call rotation live | PagerDuty test ping reaches lead |
Rule of thumb. If onboarding a new domain takes more than half a day, the platform is not self-serve. Treat the onboarding time as the single most important platform-team KPI — anything over 4 hours means the CLI is missing a step.
Worked example — the "self-serve that needs platform tickets" anti-pattern
Detailed explanation. Mesh fails most often not because a principle is missing but because it is named without being implemented. The classic failure: the platform team builds a "self-serve" platform that requires a platform-team ticket to create a new dataset. Domain teams treat it like a ticket queue and the central pain returns under a new name.
Question. A platform team claims their platform is self-serve, but the onboarding doc has 9 manual steps and a "file a ticket with platform" item. Identify the three concrete fixes and the metric that proves the fix worked.
Input — current onboarding doc.
| Step | Owner | Manual? |
|---|---|---|
| 1. Create AWS sub-account | platform | yes (ticket) |
| 2. Register Unity Catalog schema | platform | yes (ticket) |
| 3. Set up dbt project | domain | yes |
| 4. Add to CI pipeline | platform | yes (ticket) |
| 5. Attach OPA policy bundle | platform | yes (ticket) |
| 6. Register in catalog | domain | yes |
| 7. PagerDuty rotation | platform | yes (ticket) |
| 8. Backstage entry | platform | yes (ticket) |
| 9. SLO dashboard | platform | yes (ticket) |
Code — the fix.
# Replace the 7 platform-owned ticket steps with one CLI call
$ platform-cli init loyalty --owner @loyalty-team --on-call loyalty-pager
# Internally automates: sub-account, catalog schema, CI workflow,
# policy bundle, PagerDuty, Backstage, SLO dashboard.
Step-by-step explanation.
- Every platform-owned step marked "yes (ticket)" is a queue. Seven tickets across one onboarding means lead time is at least the sum of the seven SLAs, and they cannot run in parallel because they have dependencies.
- The fix is automation, not delegation. Each ticket-step becomes a Terraform module or platform-CLI subcommand. The domain team triggers the whole sequence with one command.
- The metric that proves the fix worked is median onboarding lead time. Pre-fix: 9-14 days (7 tickets × up to 2 days each, run serially because of dependencies). Post-fix: 30 minutes (one CLI invocation).
- The platform team's new ticket queue is not onboarding; it is platform feature requests ("can the CLI also set up a Trino catalog?"). That queue is bounded and predictable — the onboarding queue was not.
Output.
| Metric | Before | After |
|---|---|---|
| Manual platform-team tickets per onboarding | 7 | 0 |
| Median onboarding lead time | 9-14 days | 30 minutes |
| Platform-team onboarding workload / quarter | 7 × N domains | 1 (CLI maintenance) |
| Domain-team frustration | high | low |
Rule of thumb. The platform team's success metric is "self-serve onboarding lead time" — measured at the p50. If a new domain cannot stand up its first product table without a ticket, the platform is mis-named.
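Tracking the p50 is a one-liner once onboarding durations are logged. The sketch below uses made-up sample durations purely for illustration, and the 240-minute target reflects the four-hour threshold mentioned earlier; `meets_target` is a hypothetical helper.

```python
from statistics import median

# Hypothetical onboarding durations, in minutes, for recent domains.
before_minutes = [d * 24 * 60 for d in (9, 10, 11, 12, 14)]  # days -> minutes
after_minutes = [25, 28, 30, 30, 45]


def meets_target(samples_minutes, target_minutes=240):
    """True when median onboarding lead time is within the target."""
    return median(samples_minutes) <= target_minutes


print(median(before_minutes), meets_target(before_minutes))
print(median(after_minutes), meets_target(after_minutes))
```

The median is used rather than the mean so that one unusually slow onboarding does not hide a platform that is otherwise self-serve.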
Architecture interview question on the four principles
A senior interviewer often frames this as: "Map a hypothetical e-commerce platform's four domains against the four mesh principles. Score each cell. Name the next investment that buys the biggest reduction in central-team workload." It probes whether you can use the rubric as a real diagnostic instead of as architecture-deck filler.
Solution Using the four-by-four mesh-readiness matrix
def mesh_audit(domain):
    """Score a domain on the four mesh principles."""
    return {
        "domain_ownership": score(domain.has_repo, domain.has_oncall, domain.codeowners),
        "data_as_product": score(domain.has_contract, domain.has_versioning, domain.has_consumers),
        "self_serve_platform": score(domain.uses_cli, domain.uses_standard_ci, domain.onboarding_minutes < 60),
        "federated_governance": score(domain.opa_passing, domain.tags_set, domain.pii_lineage_clean),
    }

def score(*flags):
    return sum(1 for f in flags if f)
Step-by-step trace.
| Domain | ownership / 3 | product / 3 | self-serve / 3 | governance / 3 | total / 12 |
|---|---|---|---|---|---|
| orders | 3 | 2 | 3 | 3 | 11 |
| inventory | 2 | 0 | 3 | 1 | 6 |
| customer | 1 | 0 | 2 | 1 | 4 |
| payments | 3 | 3 | 3 | 3 | 12 |
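Summing each column across the four domains gives ownership 9 / 12, product 5 / 12, self-serve 11 / 12, and governance 8 / 12. Product is the lowest total, but closing it takes per-domain contract work; governance is the weakest column that a single central investment can fix. Investing one quarter to ship the OPA pii_masking.rego policy bundle across all four domains lifts governance to 12 / 12 and removes the most expensive central-team workload (manual PII review).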
Output:
| Investment | Total score lift | Central-team load saved |
|---|---|---|
| Ship OPA PII policy bundle | +4 (governance column) | ~3 reviews / week |
| Add SLA section to inventory contract | +1 (product column) | ~1 ticket / week |
| Split customer repo from auth | +1 (ownership column) | ~1 review / week |
| Complete the audit quarterly | (process) | catches regressions early |
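The column totals behind the investment ranking can be computed directly from the trace table. This sketch hard-codes the scores from that table; `column_totals` is a hypothetical helper name.

```python
# Scores copied from the step-by-step trace (each out of 3).
SCORES = {
    "orders":    {"ownership": 3, "product": 2, "self_serve": 3, "governance": 3},
    "inventory": {"ownership": 2, "product": 0, "self_serve": 3, "governance": 1},
    "customer":  {"ownership": 1, "product": 0, "self_serve": 2, "governance": 1},
    "payments":  {"ownership": 3, "product": 3, "self_serve": 3, "governance": 3},
}


def column_totals(scores):
    """Sum each principle's score across all domains."""
    totals = {}
    for row in scores.values():
        for principle, value in row.items():
            totals[principle] = totals.get(principle, 0) + value
    return totals


totals = column_totals(SCORES)
max_per_column = 3 * len(SCORES)
for principle, total in sorted(totals.items(), key=lambda item: item[1]):
    print(f"{principle}: {total} / {max_per_column}")
```

Sorting ascending puts the weakest columns first, which is the order a platform team would review them in when planning the next quarter.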
Why this works — concept by concept:
- Four-axis matrix — turns "are we doing mesh right?" from a vibe into a number. Each axis is a column in the matrix; each domain is a row.
- Score per principle — three sub-checks per principle prevents binary "yes / no" gaming. A domain with a repo but no on-call is not "yes" on ownership; it is 1 / 3.
- Lighthouse pattern: the highest-scoring domain (here payments) becomes the template. The platform team converts other domains into copies of the lighthouse, not greenfield each time.
- Investment = weakest centrally fixable column: among the columns with poor aggregate scores, pick the one where one centralised investment (a policy bundle, a CLI feature) buys the biggest lift across the whole org.
- Cost — running the audit takes one engineer-day per quarter; the data it produces drives the platform team's roadmap for the entire next quarter. Cheap insurance against silent mesh erosion.
3. Domain bounded contexts — drawing the lines
A bounded context is the unit of ownership — only the product tier crosses the line
The mental model in one line: a domain owns three internal tiers (raw, derived, product), but only the product tier is consumable by other domains — every cross-domain read goes through the catalog, never reaches into raw or derived storage. Once you can say "raw is private, derived is private, product is the API," the entire bounded-context interview surface collapses to enforcing one rule in CI.
The three-tier model per domain.
- Raw tier (private). Whatever the source systems send: Kafka topics, CDC log files, vendor CSV drops. Schema may change daily. Only the domain pipeline reads it.
- Derived tier (private). Cleaned, conformed, deduplicated, partitioned — the domain's internal working sets. May be joined heavily, may contain PII, may be expensive to recompute. Only the domain pipeline reads it.
- Product tier (public, contract-bound). What other domains consume. Schema is locked by a contract. Versioned with semver. PII is masked or hashed. Documented in the catalog. Has an SLA and an on-call rotation.
The cross-domain consumption rule.
- Cross-domain reads only touch the product tier. Marketing reads orders.order_facts (product); marketing never reads orders.orders_kafka_landing (raw) or orders.orders_clean_partitioned (derived).
- Cross-domain reads go through the catalog. Unity Catalog / Polaris / Gravitino resolves the table name, applies access control, applies row / column policies, propagates tags. No direct S3 reads across domains.
- Cross-domain joins on raw are banned. A platform-level OPA policy blocks any pipeline whose dependency graph reads another domain's raw or derived tier. The CI rejects the PR with a clear message.
The subscription model.
- Consumers pin to a major.minor version. Marketing pins to order_facts >= 2.1, < 3.0. They get patch upgrades automatically; minor upgrades are additive; major upgrades require an explicit pin bump.
- Producers signal breaking changes via PR. A schema change that breaks downstream consumers fails CI unless the major version is bumped AND a deprecation notice was posted ≥ 1 quarter ago.
- The catalog tracks subscribers. Every product table page lists its current downstream consumers — making "who is reading this?" a one-page lookup instead of a Slack archaeology dig.
Common architecture-interview probes on bounded contexts.
- "Can the marketing domain JOIN orders.orders_kafka_landing for performance?" — no, that breaks the bounded context. The right answer is publishing the needed fields as a product table or extending an existing product. Saying "yes for performance" is an automatic fail signal.
- "Who owns dim_date?" — the platform team, almost always. Conformed dimensions (dim_date, dim_geo, dim_currency) are platform-tier products, not domain-tier products. Treating them as just-another-domain creates a circular ownership problem.
- "What happens when two domains need to join their product tables?" — they JOIN through the catalog, both reading product. If the join is hot, model it once at the platform tier as a curated cross-domain mart.
- "How do you stop a consumer from reaching into raw?" — OPA policy on lineage scan: if a downstream pipeline's lineage includes another domain's raw or derived tier, CI fails.
Worked example — drawing the bounded-context map for an e-commerce platform
Detailed explanation. Walk through a concrete e-commerce org's bounded contexts. Each domain owns a verb (placing orders, tracking inventory, knowing customers, processing payments). The product tier is the cross-domain language; the raw and derived tiers are private vocabulary.
Question. Define the four domains for an e-commerce platform, list their three tiers each, and identify two cross-domain consumption flows that must use the product tier.
Input — domain inventory.
| Domain | Verb | Source systems | Product tables |
|---|---|---|---|
| orders | place + fulfill orders | order-service Kafka, OMS CDC | order_facts v2.1 |
| inventory | track stock | inventory-service Kafka, warehouse RFID | sku_inventory v1.3 |
| customer | know who the customer is | auth-service CDC, support tickets | customer_dim v3.0 |
| payments | process money movement | stripe webhooks, ledger CDC | payment_facts v2.0 |
Code — the tier layout (Iceberg / Delta naming).
-- orders domain (Unity Catalog)
CREATE TABLE orders.raw.orders_kafka_landing (...);    -- private
CREATE TABLE orders.derived.orders_clean (...);        -- private
CREATE TABLE orders.product.order_facts (              -- public, contract-bound
    order_id    STRING,
    customer_id STRING,
    order_ts    TIMESTAMP_TZ,
    amount      DECIMAL(12, 2),
    currency    STRING,
    status      STRING
);

-- inventory domain
CREATE TABLE inventory.raw.inventory_kafka_landing (...);  -- private
CREATE TABLE inventory.derived.inventory_clean (...);      -- private
CREATE TABLE inventory.product.sku_inventory (...);        -- public

-- Cross-domain flow: marketing reads orders.product.order_facts
-- joined with customer.product.customer_dim
SELECT
    c.segment,
    SUM(o.amount) AS total_revenue
FROM orders.product.order_facts o
JOIN customer.product.customer_dim c
    ON o.customer_id = c.customer_id
WHERE o.order_ts >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY c.segment;
Step-by-step explanation.
- Each domain owns its raw + derived + product schemas under its catalog namespace. The schema namespace (orders.raw.*) makes ownership unambiguous and the access-control story trivial — grant the orders group write on orders.*, grant everyone else read on orders.product.* only.
- The marketing query joins two product tables — that is the legal cross-domain pattern. The query never touches orders.raw.* or orders.derived.*, so the bounded context holds.
- If marketing needs a new field (e.g. discount_amount) and it lives in orders.derived.orders_clean but not in orders.product.order_facts, the answer is not "let marketing read derived." It is "open a PR against the orders domain to extend the order_facts product schema as a v2.2 minor version."
- The platform team owns platform.product.dim_date, platform.product.dim_geo, and other conformed dimensions. Every domain joins to those — domain teams do not each maintain their own date dimension.
Output.
| Layer | Schemas | Cross-domain access |
|---|---|---|
| Raw | <domain>.raw.* | domain-internal only |
| Derived | <domain>.derived.* | domain-internal only |
| Product | <domain>.product.* | discoverable, contract-bound |
| Platform conformed dims | platform.product.dim_* | shared by every domain |
| Cross-domain marts | platform.curated.* | platform-managed joins of hot product tables |
Rule of thumb. Build the catalog with a <domain>.<tier>.* namespace from day one. Reading from another domain's raw or derived is a CI failure — and the namespace makes the violation obvious in the SQL itself.
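Because the tier is part of every table name, a violation can be spotted by scanning the SQL text alone. The sketch below is a heuristic, assuming every table is written as a lowercase three-part name; TABLE_REF and find_violations are hypothetical names, and a real check would use the lineage graph instead of a regex (comments and quoted identifiers would fool this one):

```python
import re

# Assumes tables are always written as <domain>.<tier>.<table> in lowercase.
TABLE_REF = re.compile(r"\b([a-z_]+)\.(raw|derived|product)\.([a-z_0-9]+)\b")


def find_violations(sql: str, reader_domain: str) -> list[str]:
    """Return references to another domain's raw or derived tier found in `sql`."""
    violations = []
    for domain, tier, table in TABLE_REF.findall(sql):
        if domain != reader_domain and tier != "product":
            violations.append(f"{domain}.{tier}.{table}")
    return violations


legal = """
SELECT c.segment, SUM(o.amount) AS total_revenue
FROM orders.product.order_facts o
JOIN customer.product.customer_dim c ON o.customer_id = c.customer_id
GROUP BY c.segment
"""
illegal = legal.replace("orders.product.order_facts", "orders.derived.orders_clean")

print(find_violations(legal, "marketing"))    # no cross-domain raw/derived reads
print(find_violations(illegal, "marketing"))  # flags the derived-tier read
```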
Worked example — the cross-domain consumer subscription with version pinning
Detailed explanation. The marketing analytics domain wants to consume orders.product.order_facts. Without a contract and version-pinning, every schema change in orders silently breaks marketing. With the subscription model, the relationship is explicit, the pinning is mechanical, and breaking changes require a one-quarter deprecation window.
Question. Walk through the subscription handshake when marketing consumes orders.product.order_facts. Show how a minor version bump is silent for the consumer, and how a major version bump goes through a deprecation flow.
Input — initial state.
| Producer | Consumer | Pinned version | Current version |
|---|---|---|---|
| orders.product.order_facts | marketing.derived.orders_enriched | ^2.1 (any 2.x ≥ 2.1) | 2.1.4 |
Code — the marketing consumer config.
# marketing/dbt_project.yml (snippet)
consumers:
  - source: orders.product.order_facts
    pin: "^2.1"             # any 2.x at or above 2.1
    on_breaking: "fail_ci"  # bumping major requires explicit consumer-side PR
Step-by-step explanation.
- orders ships 2.1.5 — patch only (bug fix). Marketing's pin (^2.1) accepts it. No PR, no review, no notification. Patches are silent.
- orders ships 2.2.0 — minor (additive: new column discount_amount appended). Marketing's pin (^2.1) accepts it. The new column is ignored by marketing until they choose to use it. Additive minor bumps are also silent.
- orders opens a PR for 3.0.0 — major (breaking: amount renamed to gross_amount). The PR includes a 90-day deprecation notice. Marketing's CI now displays a "your producer is sunsetting ^2.1 in 90 days" warning on every run.
- Marketing opens its own PR to update the pin to ^3.0 and rename the column reference. The two PRs merge in coordinated order. The deprecation window prevents the producer from breaking the consumer mid-quarter.
Output.
| Producer ship | Marketing's pin | Marketing PR needed? | Lead time |
|---|---|---|---|
| 2.1.4 → 2.1.5 | ^2.1 | no | 0 (silent) |
| 2.1.5 → 2.2.0 | ^2.1 | no | 0 (silent) |
| 2.2.0 → 3.0.0 | ^2.1 | yes | up to 1 quarter |
Rule of thumb. Always pin with caret (^major.minor), never just latest. The caret lets patches and additive minors flow through, but turns major bumps into deliberate, reviewable events with a quarter of lead time.
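The caret rule is small enough to state as code. A minimal sketch of the matching logic, assuming versions are plain MAJOR.MINOR[.PATCH] strings with a major of 1 or higher (caret semantics for 0.x releases differ in most package managers and are not handled here); parse_version and caret_accepts are illustrative names:

```python
def parse_version(text: str) -> tuple[int, int, int]:
    """Parse 'MAJOR.MINOR[.PATCH]' into a 3-tuple; a missing patch counts as 0."""
    parts = [int(p) for p in text.split(".")]
    while len(parts) < 3:
        parts.append(0)
    major, minor, patch = parts[:3]
    return major, minor, patch


def caret_accepts(pin: str, version: str) -> bool:
    """True if `version` satisfies a caret pin such as '^2.1' (>= 2.1.0 and < 3.0.0)."""
    if not pin.startswith("^"):
        raise ValueError(f"expected a caret pin, got {pin!r}")
    floor = parse_version(pin[1:])
    candidate = parse_version(version)
    # Same major, and at or above the pinned minor/patch.
    return candidate[0] == floor[0] and candidate >= floor


for shipped in ("2.1.5", "2.2.0", "3.0.0"):
    print(shipped, caret_accepts("^2.1", shipped))
```

The three versions in the loop are the three producer ships from the output table: the first two are accepted, the major bump is not.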
Worked example — the "cross-domain JOIN on raw" anti-pattern, caught in CI
Detailed explanation. A junior engineer in marketing finds a join on orders.derived.orders_clean is 4x faster than the same join on orders.product.order_facts (the product tier has extra masking and view overhead). They submit a PR that reads from derived. The OPA policy catches the violation in CI and fails the PR with a clear message.
Question. Show the OPA policy that detects cross-domain reads outside the product tier, and the CI failure message a violating PR would receive.
Input — the violating SQL.
-- BAD — marketing reaching into orders' derived tier for performance
SELECT m.campaign_id, SUM(o.amount) AS revenue
FROM marketing.derived.lead_attribution m
JOIN orders.derived.orders_clean o        -- ← bounded-context violation
    ON o.order_id = m.order_id
GROUP BY m.campaign_id;
Code — the OPA policy (Rego).
package mesh.bounded_context

# Deny any pipeline whose lineage includes another domain's raw or derived tier.
# Written in pre-1.0 Rego syntax; OPA 1.0+ expects `deny contains msg if { ... }`.
deny[msg] {
    pipeline_domain := input.pipeline.domain
    upstream := input.lineage[_]
    upstream.domain != pipeline_domain
    upstream.tier != "product"
    msg := sprintf(
        "pipeline %v (domain=%v) reads %v.%v.%v — only the product tier of another domain is consumable",
        [input.pipeline.name, pipeline_domain, upstream.domain, upstream.tier, upstream.table],
    )
}
Step-by-step explanation.
- The OPA policy inspects the pipeline's lineage manifest (produced by the dbt / lineage scanner during CI).
- For every upstream dependency, it checks: is the upstream in a different domain and in a tier other than product? If yes, the policy fires.
- The CI step opa eval returns non-zero (run it with --fail-defined so that a non-empty deny set fails the step), the PR comment shows the formatted message, and the merge button is greyed out.
- The fix is for marketing to open a PR against orders requesting a new field on the product tier (a minor version bump), then update its query to read from product.
Output — the CI failure message.
| CI check | Result | Message |
|---|---|---|
| dbt build | pass | 12 models built |
| contract validate | pass | schema matches contract |
| bounded-context (OPA) | fail | pipeline marketing.derived.lead_attribution (domain=marketing) reads orders.derived.orders_clean — only the product tier of another domain is consumable |
| pii_masking (OPA) | pass | no PII columns detected |
Rule of thumb. Codify the bounded-context rule as a policy that runs in CI, not as a Confluence page. The message needs to be specific enough that the engineer who tripped it can fix it in one PR — vague "this is not allowed" failures produce platform-team tickets, not fixes.
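Before wiring the policy into CI it helps to pin down the shape of the input document it expects. The sketch below mirrors the Rego rule in Python so the manifest format can be unit-tested on its own; the manifest layout (pipeline.name, pipeline.domain, lineage[].domain/tier/table) is inferred from the fields the Rego rule reads, and the marketing.raw.campaign_events entry is a hypothetical same-domain read added for contrast:

```python
def deny(input_doc: dict) -> list[str]:
    """One message per upstream that sits outside the product tier of another domain."""
    pipeline = input_doc["pipeline"]
    messages = []
    for upstream in input_doc["lineage"]:
        if upstream["domain"] != pipeline["domain"] and upstream["tier"] != "product":
            messages.append(
                "pipeline {} (domain={}) reads {}.{}.{}".format(
                    pipeline["name"], pipeline["domain"],
                    upstream["domain"], upstream["tier"], upstream["table"],
                )
            )
    return messages


# Hypothetical lineage manifest for the violating marketing PR.
manifest = {
    "pipeline": {"name": "marketing.derived.lead_attribution", "domain": "marketing"},
    "lineage": [
        {"domain": "marketing", "tier": "raw", "table": "campaign_events"},  # same domain: allowed
        {"domain": "orders", "tier": "derived", "table": "orders_clean"},    # cross-domain derived: denied
    ],
}

for message in deny(manifest):
    print(message)
```

Each message names the pipeline, its domain, and the exact upstream table, which is the level of specificity the rule of thumb asks for.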
Architecture interview question on cross-domain consumption
A senior interviewer often frames this as: "Your marketing domain needs a field that exists in the orders derived tier but not in the order_facts product. Walk me through the three options, recommend one, and explain the trade-offs." It probes whether you understand that the bounded-context rule has a cost and that mesh requires a clear escalation path.
Solution Using a product-extension PR with a minor version bump
# orders/product/order_facts.contract.yaml — proposed minor bump 2.1 → 2.2
name: orders.product.order_facts
version: "2.2.0"   # minor: additive only
owner: "@orders-team"
schema:
  - { name: order_id, type: string, required: true }
  - { name: customer_id, type: string, required: true }
  - { name: order_ts, type: timestamp_tz, required: true }
  - { name: amount, type: "decimal(12,2)", required: true }
  - { name: currency, type: string, required: true }
  - { name: status, type: string, required: true }
  - { name: discount_amount, type: "decimal(12,2)", required: false }   # NEW
sla:
  freshness: "15m"
  completeness: "99.9%"
  accuracy: "99%"
semantics:
  discount_amount: "USD discount applied; 0 if no discount; NULL only for legacy < 2.2"
Step-by-step trace.
| Option | Cost | Time | Verdict |
|---|---|---|---|
| 1. Marketing reads derived | 0 (today) | 1 day | breaks bounded context — fail CI |
| 2. Marketing forks the cleaning pipeline | high (duplicate logic) | 2 weeks | duplicates ownership; smell |
| 3. Extend order_facts to v2.2 with discount_amount | low (one schema change) | 3 days | recommended |
The recommended path is a minor version bump on orders.product.order_facts. Adding an optional field is additive — no consumer breaks because the new field is opt-in. The orders team ships the change as v2.2.0; marketing updates its consumer in a follow-up PR; the bounded context holds.
Output:
| Step | Owner | Artifact |
|---|---|---|
| 1. PR to extend contract | orders + marketing co-authored | product.contract.yaml |
| 2. CI: schema diff, semver check | platform CI | green / red |
| 3. Producer ships v2.2 | orders | new column in order_facts |
| 4. Consumer updates query | marketing | reads new column |
| 5. Catalog page auto-updates | platform | docs reflect new schema |
Why this works — concept by concept:
- Additive minor is the cheap escape valve — adding an optional column never breaks consumers. The contract semver rules turn "I need a new field" into a one-PR change instead of a cross-domain negotiation.
- Bounded context preserved — marketing never reaches into orders' derived tier. The interview-grade signal is recognising that the performance argument for reaching into derived is short-term thinking that breaks the org.
- Co-authored PR — the consumer (marketing) and producer (orders) co-author the PR. That is the right collaboration shape; it documents the consumer's need in the producer's repo and aligns review.
- Catalog as marketplace — the catalog page is the source of truth for what order_facts v2.2 looks like. Search, schema, owner, freshness — all in one place. Marketing finds the new column there, not in Slack.
- Cost — one schema change, one minor version bump, one PR each side. The alternative ("just read derived") costs a CI failure plus future architecture debt every time the schema drifts.
4. Data contracts — the API of a data product
A data contract is the YAML that turns a table into an API — six required fields, semver, git-reviewed
The mental model in one line: a data contract is a versioned YAML file (product.contract.yaml) that declares six fields — name, version, owner, schema, sla, semantics — lives in the domain's git repo, is reviewed via PR, and is enforced in CI on every write. Once the contract exists, the product table behaves like a REST API: callers know what they get, breaking changes are visible, and "is this field nullable?" is a one-line lookup, not a Slack archaeology dig.
The six required fields.
- name — fully-qualified table name (orders.product.order_facts). Unique across the catalog.
- version — semver string (2.1.0). Patch = bug fix, minor = additive, major = breaking.
- owner — the domain team's git handle + on-call rotation (@orders-team + orders-pager). PRs auto-assign; incidents auto-page.
- schema — column list with name, type, nullable, required, default. The structural promise.
- sla — freshness window, completeness threshold, accuracy bound. The runtime promise.
- semantics — what each column means in business terms: units, NULL semantics, business rules. The interpretive promise.
Schema enforcement at write-time.
- dbt contracts (since 1.5) — contract: enforced: true in the model YAML makes dbt fail the run if the materialised table does not match the declared schema.
- Great Expectations / Soda / Schemata — assertion frameworks that run in CI; check column types, ranges, uniqueness, freshness.
- Lakehouse-level constraints — Delta enforces NOT NULL and CHECK constraints at the table level; Iceberg enforces required (NOT NULL) fields but has no native CHECK constraint, so value rules there run as pipeline tests. Primary keys are typically declared but not enforced (informational in Unity Catalog, identifier fields in Iceberg). The contract maps to whichever of these the table format supports.
- Catalog-level pinning — Unity Catalog / Polaris records the contract version per table; readers see the schema as of the version they pinned to.
Semver rules — read aloud.
- Patch (2.1.0 → 2.1.1). Bug fix, no schema change. Consumers do nothing.
- Minor (2.1.0 → 2.2.0). Additive only — new optional columns, broader nullable values. Existing consumers' queries still compile and produce the same answer.
- Major (2.1.0 → 3.0.0). Breaking — column rename, type narrow, drop, NULL contract change. Requires a deprecation window (typically ≥ 1 quarter) and explicit consumer-side opt-in.
SLAs as enforceable thresholds.
- Freshness — freshness: "15m" means the latest row's event_ts is within 15 minutes of now(). Monitored by the platform; consumers see "stale" status in the catalog.
- Completeness — completeness: "99.9%" means at most 0.1% of expected rows are missing in a window. Computed against a reference (CDC source, upstream count, etc.).
- Accuracy — accuracy: "99%" means at most 1% of rows fail a domain-defined sample test (e.g. amount > 0, currency in standard_codes).
- All three are observable — the platform publishes them as Prometheus metrics; the catalog page shows green / amber / red.
Common architecture-interview probes on contracts.
- "What are the minimum fields in a data contract?" — name, version, owner, schema, sla, semantics. Six. Missing any is half a contract.
- "Where does the contract live?" — in the domain's git repo, reviewed via PR. Never in a UI nobody opens. A contract in a UI is a Confluence page.
- "What is the difference between patch, minor, and major?" — patch = bug, minor = additive, major = breaking. Treat the rule mechanically; do not "feel out" what's breaking.
- "How do you stop a producer from shipping a breaking change in patch / minor?" — CI runs a schema-diff against the previous version; if the diff is non-additive and the semver bump is not major, CI fails.
Worked example — the full order_facts.contract.yaml
Detailed explanation. A complete, ship-it-today contract for order_facts v2.1 in the orders domain. Each field is filled with concrete values that pass a CI dry-run.
Question. Write the complete YAML for orders.product.order_facts v2.1 including all six required fields. Show how the producer-side dbt model references the contract.
Input — the schema sketch.
| Column | Type | Nullable | Notes |
|---|---|---|---|
| order_id | string | no | UUID v4 |
| customer_id | string | no | FK → customer_dim |
| order_ts | timestamp_tz | no | UTC, source-system event time |
| amount | decimal(12, 2) | no | USD, gross (before discount) |
| currency | string | no | ISO 4217 |
| status | string | no | placed / shipped / cancelled / paid |
Code — the contract YAML.
# orders/product/order_facts.contract.yaml
name: orders.product.order_facts
version: "2.1.0"
owner:
  team: "@orders-team"
  on_call: "orders-pager"
  reviewers: ["@alice", "@bob"]
schema:
  - { name: order_id, type: string, required: true, description: "UUID v4 of the order" }
  - { name: customer_id, type: string, required: true, description: "FK to customer_dim.customer_id" }
  - { name: order_ts, type: timestamp_tz, required: true, description: "UTC source-system event time" }
  - { name: amount, type: "decimal(12,2)", required: true, description: "Gross amount in `currency`, before discount" }
  - { name: currency, type: string, required: true, description: "ISO 4217 code; whitelisted to USD/EUR/GBP/JPY/INR" }
  - { name: status, type: string, required: true, description: "Lifecycle: placed | shipped | cancelled | paid" }
sla:
  freshness: "15m"        # latest order_ts ≥ now − 15 minutes
  completeness: "99.9%"   # (CDC source count − product count) / CDC source count ≤ 0.1%
  accuracy: "99%"         # sample-test pass rate ≥ 99% over 24h window
semantics:
  amount: "Gross amount in USD-equivalent (before discount). NULL is invalid."
  currency: "Restricted to {USD, EUR, GBP, JPY, INR}. Other codes route to investigation."
  status: "Lifecycle state at write time. Late-arriving status transitions emit a new event."
  null_rule: "Every required field MUST be non-NULL at write time. CI fails on violation."
The dbt model wires the contract.
# orders/models/product/order_facts.yml
version: 2
models:
  - name: order_facts
    config:
      contract:
        enforced: true
    columns:
      - name: order_id
        data_type: string
        constraints:
          - type: not_null
      - name: amount
        data_type: "decimal(12,2)"
        constraints:
          - type: not_null
          - type: check
            expression: "amount >= 0"
Step-by-step explanation.
- The contract YAML is the source of truth. The dbt model references the same column names, types, and NOT NULL constraints — CI verifies the two are in sync.
- contract: enforced: true makes dbt fail the run if the materialised table's column names or data types do not match the YAML.
- The check constraint encodes a piece of the semantics section (amount >= 0) into a lakehouse-level rule. Delta enforces it at write time; Iceberg has no native CHECK constraint, so on Iceberg the same rule needs a dbt test or a pipeline-side assertion.
- The contract version is the lakehouse table property contract_version=2.1.0. Readers' pinning logic queries the catalog for this property and refuses to read newer-major versions silently.
Output.
| Field | Filled? | Enforced where? |
|---|---|---|
| name | yes | catalog (must match table name) |
| version | yes | catalog property + git tag |
| owner | yes | CODEOWNERS + PagerDuty rotation |
| schema | yes | dbt contract + Iceberg / Delta constraints |
| sla | yes | platform monitoring (Prometheus + catalog status badge) |
| semantics | yes | written docs + sample-test rules |
Rule of thumb. A contract that has all six fields and zero TODOs is the bar. A "draft" contract with a missing SLA section is not a contract — consumers cannot pin against it because the runtime promise is undefined.
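The "all six fields, zero gaps" bar is easy to check mechanically. A minimal sketch, assuming the contract YAML has already been loaded into a plain dict by whatever YAML parser the platform uses; REQUIRED_FIELDS and missing_fields are illustrative names:

```python
REQUIRED_FIELDS = ("name", "version", "owner", "schema", "sla", "semantics")


def missing_fields(contract: dict) -> list[str]:
    """Return required contract fields that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not contract.get(field)]


# A "draft" contract with no SLA section, as described in the rule of thumb.
draft = {
    "name": "orders.product.order_facts",
    "version": "2.1.0",
    "owner": {"team": "@orders-team", "on_call": "orders-pager"},
    "schema": [{"name": "order_id", "type": "string", "required": True}],
    "semantics": {"null_rule": "Every required field MUST be non-NULL at write time."},
}

print(missing_fields(draft))  # ['sla']
```

An empty section counts as missing here, so a contract with `sla: {}` is rejected the same way as one with no sla key at all.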
Worked example — semver in action across one quarter
Detailed explanation. Walk through a representative quarter of order_facts evolution. Bug fix patches, additive minor bumps, and one major bump that follows the deprecation process. Each step shows the contract diff, the consumer impact, and the CI gate.
Question. Show how order_facts evolves from 2.1.0 to 3.0.0 over one quarter, naming the version bump type for each change and the consumer impact.
Input — change log.
| Week | Change | Bump type |
|---|---|---|
| 1 | Fix bug: amount was occasionally negative for refunds. Add CHECK >= 0 | patch |
| 4 | Add discount_amount column (optional, decimal(12,2)) | minor |
| 6 | Add payment_method column (optional, string) | minor |
| 9 | Rename amount → gross_amount; introduce required net_amount | major |
Code — the contract diff for week 9 (major bump).
# Before (v2.3)
- { name: amount, type: "decimal(12,2)", required: true }
# After (v3.0) — RENAME is a breaking change
- { name: gross_amount, type: "decimal(12,2)", required: true }
- { name: net_amount, type: "decimal(12,2)", required: true }   # NEW required column
deprecation:
  previous_version: "2.x"
  sunset_date: "2026-09-30"   # ≥ 1 quarter out
  migration_notes: "Rename `amount` → `gross_amount`. Add `net_amount = gross_amount - discount_amount`."
Step-by-step explanation.
- Week 1: patch. Schema unchanged. Adding a CHECK constraint that the current data satisfies is non-breaking. Consumers do nothing.
- Week 4: minor. New optional column. Consumers' queries still compile (they don't reference the new column). Catalog page auto-updates.
- Week 6: minor again. Same rule — new optional column.
- Week 9: major. The rename of amount would break every consumer; CI on the producer side fails the PR unless the version is bumped to 3.0 and a deprecation block is present with a sunset date ≥ 1 quarter out.
- The producer ships 2.4 (final 2.x) and 3.0 concurrently for 1 quarter, so consumers can migrate at their own pace.
Output — the catalog timeline.
| Date | Version | Type | Consumers affected | Sunset date |
|---|---|---|---|---|
| Week 1 | 2.1.1 | patch | none | — |
| Week 4 | 2.2.0 | minor | none | — |
| Week 6 | 2.3.0 | minor | none | — |
| Week 9 | 2.4.0 (final 2.x) + 3.0.0 | major | all 2.x consumers | 2026-09-30 |
Rule of thumb. Treat semver as a mechanical rule, not a judgment call. If the schema diff is "add an optional column," it is automatically minor. If the diff is "rename or drop or narrow a type," it is automatically major. Removing the judgment call removes the most common contract violation.
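The mechanical rule can be written as a pure function over two schema snapshots. The sketch below is one conservative reading of the rules above, not a reference implementation: it assumes each schema is a dict of column name to {"type", "required"}, treats any type change as breaking (the text only names narrowing), and treats a new required column as breaking; required_bump is an illustrative name:

```python
def required_bump(old: dict, new: dict) -> str:
    """Classify a schema diff as 'patch', 'minor', or 'major'.

    `old` and `new` map column name -> {"type": str, "required": bool}.
    """
    relaxed = False
    for name, spec in old.items():
        if name not in new:
            return "major"              # dropped or renamed column
        if new[name]["type"] != spec["type"]:
            return "major"              # assumption: any type change is breaking
        if new[name]["required"] and not spec["required"]:
            return "major"              # optional -> required tightens the NULL contract
        if spec["required"] and not new[name]["required"]:
            relaxed = True              # required -> optional broadens nullability
    added = [name for name in new if name not in old]
    if any(new[name]["required"] for name in added):
        return "major"                  # assumption: a new required column is breaking
    return "minor" if (added or relaxed) else "patch"


v21 = {"order_id": {"type": "string", "required": True},
       "amount": {"type": "decimal(12,2)", "required": True}}
v22 = {**v21, "discount_amount": {"type": "decimal(12,2)", "required": False}}
v30 = {"order_id": v21["order_id"],
       "gross_amount": {"type": "decimal(12,2)", "required": True}}

print(required_bump(v21, v21), required_bump(v21, v22), required_bump(v22, v30))
```

The three calls correspond to the week 1, week 4, and week 9 changes in the change log.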
Worked example — the contract-validation CI step
Detailed explanation. A producer ships a PR that bumps order_facts from 2.1.0 to 2.2.0 and adds a column. The CI step validate-contract checks four things: the YAML is well-formed, the diff against main's version is semver-consistent, the dbt model matches the YAML, and the deprecation block is present if the bump is major.
Question. Show the four CI checks the platform runs on every contract change PR and the failure messages that surface in the GitHub UI.
Input — the PR diff (extract).
- version: "2.1.0"
+ version: "2.2.0"
  schema:
    - { name: order_id, type: string, required: true }
+   - { name: discount_amount, type: "decimal(12,2)", required: false }
Code — the validation harness.
# Platform CI invoked on every PR touching a *.contract.yaml
$ contract-cli validate \
    --yaml orders/product/order_facts.contract.yaml \
    --baseline main \
    --model orders/models/product/order_facts.yml \
    --policies policies/
# Internally runs:
#   1. JSON Schema validation of the YAML
#   2. Semver diff against main (additive → minor required)
#   3. dbt model ↔ YAML schema sync check
#   4. OPA policies (PII tags, retention, cross-region rules)
Step-by-step explanation.
- JSON Schema validation rejects malformed YAML, missing required fields, and bad types. Catches typos before the diff stage.
- Semver diff compares main's contract to the PR's contract. If the diff is non-additive but the bump is patch / minor, CI fails with a specific message: "non-additive diff requires major version bump."
- The dbt model is parsed and its column list compared to the YAML's. Mismatches fail CI with "schema drift between dbt model and contract YAML."
- OPA policies run against the new schema. A new column tagged PII: true in the schema but missing a mask: true directive fails CI immediately.
Output — the CI report card.
| Check | Result | Detail |
|---|---|---|
| YAML well-formed | pass | 6 fields present |
| Semver consistency | pass | diff is additive; minor bump matches |
| dbt model sync | pass | column list matches |
| OPA policies | pass | no PII / retention / cross-region violations |
Rule of thumb. Make the contract-validation CI step required for merge. A contract that exists but is not enforced in CI is the worst of both worlds — consumers think they have a contract, producers think they have flexibility. Enforce or do not bother.
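The semver-consistency check in that CI step comes down to comparing the bump the PR declares with the bump the diff requires. A minimal sketch, assuming the required bump has already been computed by a schema-diff step and that over-bumping (declaring major for an additive change) is allowed; declared_bump and semver_gate are illustrative names, and the messages are examples rather than the output of any particular tool:

```python
ORDER = {"patch": 0, "minor": 1, "major": 2}


def declared_bump(old_version: str, new_version: str) -> str:
    """Name the bump implied by moving from old_version to new_version."""
    old = tuple(int(p) for p in old_version.split("."))
    new = tuple(int(p) for p in new_version.split("."))
    if new <= old:
        raise ValueError("version must increase")
    if new[0] != old[0]:
        return "major"
    if new[1] != old[1]:
        return "minor"
    return "patch"


def semver_gate(old_version: str, new_version: str, required: str) -> tuple[bool, str]:
    """Pass when the declared bump is at least as large as the required one."""
    declared = declared_bump(old_version, new_version)
    if ORDER[declared] < ORDER[required]:
        return False, f"diff requires a {required} bump, but the PR declares {declared}"
    return True, f"diff requires {required}; declared {declared} bump is sufficient"


print(semver_gate("2.1.0", "2.2.0", "minor"))
print(semver_gate("2.1.0", "2.2.0", "major"))
```

The second call is the case the gate exists for: a breaking diff shipped under a minor version number.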
Architecture interview question on contract design
A senior interviewer often frames this as: "Design the contract for a payments.product.payment_facts table that pays out across five currencies. Walk me through each of the six fields, then add one SLA and one semantic rule that you would not have included six months ago." It probes whether you write contracts as documents or as enforceable promises.
Solution Using a complete six-field contract with hard-won semantic rules
name: payments.product.payment_facts
version: "2.0.0"
owner:
  team: "@payments-team"
  on_call: "payments-pager"
schema:
  - { name: payment_id, type: string, required: true, description: "UUID v4 of the payment intent" }
  - { name: order_id, type: string, required: true, description: "FK to orders.product.order_facts" }
  - { name: amount_native, type: "decimal(18,2)", required: true, description: "Amount in the original currency" }
  - { name: amount_usd, type: "decimal(18,2)", required: true, description: "USD-converted amount at rate captured at write time" }
  - { name: currency, type: string, required: true, description: "ISO 4217" }
  - { name: fx_rate, type: "decimal(12,6)", required: true, description: "FX rate used for amount_usd; non-NULL even for USD-native" }
  - { name: status, type: string, required: true, description: "captured | failed | refunded | partially_refunded" }
  - { name: captured_ts, type: timestamp_tz, required: true, description: "UTC capture time" }
sla:
  freshness: "5m"
  completeness: "99.95%"
  accuracy: "99.5%"
  # NEW (hard-won): late-arriving rows must arrive within 24h or be flagged
  late_arrival_window: "24h"
semantics:
  amount_native: "Always positive. Refunds use status='refunded', NOT a negative amount."
  amount_usd: "Computed at write time using fx_rate. Never re-derived downstream."
  fx_rate: "Even for USD-native rows, set to 1.000000 — never NULL. NULL fx_rate is a CI failure."
  status: "State at capture time; later state transitions emit a new row with the same payment_id."
Step-by-step trace.
| Field | Why this exists |
|---|---|
| name | uniqueness across catalog |
| version | semver pin for consumers |
| owner | who pages on incident |
| schema | structural promise |
| sla | runtime promise |
| semantics | interpretive promise |
The two hard-won additions: late_arrival_window: 24h and the fx_rate semantic rule "never NULL even for USD-native." The first came from a Q2 incident where reconciliation ran on partial data; the second from a Q3 incident where a downstream pipeline divided by NULL and silently produced zeroes.
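Semantic rules like these only pay off once they run as sample tests. A minimal sketch of a row-level checker for the amount_native and fx_rate rules, assuming rows arrive as dicts with Decimal amounts; payment_row_violations is an illustrative name, not part of any test framework:

```python
from decimal import Decimal


def payment_row_violations(row: dict) -> list[str]:
    """Return the semantic rules a single payment_facts row breaks."""
    problems = []
    if row["amount_native"] <= 0:
        problems.append("amount_native must be positive; refunds use status")
    if row["fx_rate"] is None:
        problems.append("fx_rate must never be NULL")
    elif row["currency"] == "USD" and row["fx_rate"] != Decimal("1.000000"):
        problems.append("USD-native rows must carry fx_rate 1.000000")
    return problems


good = {"amount_native": Decimal("25.00"), "currency": "USD", "fx_rate": Decimal("1.000000")}
null_rate = {**good, "fx_rate": None}
negative_refund = {**good, "amount_native": Decimal("-25.00")}

for row in (good, null_rate, negative_refund):
    print(payment_row_violations(row))
```

Running the checker over a sample of each batch gives the pass rate that the accuracy SLA is measured against.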
Output:
| Promise | Mechanism |
|---|---|
| Structural | dbt contract + Iceberg constraints |
| Runtime (SLA) | platform monitoring + catalog status badge |
| Interpretive (semantics) | CI sample tests + reviewer checklist |
Why this works — concept by concept:
- Six required fields — anything fewer is an incomplete promise. Schema without SLA leaves consumers guessing about freshness; SLA without semantics leaves them guessing about meaning.
- Semantics column-by-column — the most under-documented field in 90% of "contracts" in the wild. Spelling out "refunds use status, not negative amount" prevents a class of downstream bugs.
- Late-arrival window — explicit late-arrival policy turns the "is the data complete?" question into a deterministic check. Without it, reconciliations are timing-dependent and flaky.
- Non-NULL fx_rate — encoding hard-won bugs as semantic rules turns institutional knowledge into machine-enforceable promises. Every NULL-fx_rate bug ever debugged should add a contract clause.
- Cost — one engineer-week per domain to ship the first contract; ~half a day per minor version after that. Cheap insurance against the entire family of "I thought this column was non-NULL" outages.
5. Federated computational governance — policy as code
The central team stops writing models and starts writing policies — domains comply automatically through CI
The mental model in one line: federated computational governance means the central platform team writes policies-as-code (OPA, Unity Catalog rules, lakehouse constraints), and the CI on every domain repo evaluates those policies on every PR — domain autonomy + central guardrails coexist because the guardrail is a check, not a ticket. Once the policies are codified, the platform team's KPI flips from "how many tickets we shipped" to "% of compliance enforced automatically vs ticket-based."
The compliance loop in five steps.
- Step 1. Central platform team writes a policy (e.g. "any column tagged `PII: true` must be masked in the `product` tier").
- Step 2. The policy lives in a platform git repo, versioned and PR-reviewed by central + delegated reviewers from each domain.
- Step 3. A domain team opens a PR in their own repo (adding a column, changing a schema).
- Step 4. CI on the domain repo invokes `opa eval` against the platform policies. Violations fail the PR with a specific message and a link to the policy.
- Step 5. Pass → merge. Fail → fix or open a "new policy needed?" issue against the platform repo. The feedback loop is < 60 seconds.
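Step 4 can be sketched as a small CI wrapper. This is a minimal sketch under stated assumptions, not the platform's actual workflow: the `policies/` directory, the `contract.json` input file, the query `data.mesh.pii_masking.deny`, and all function names are hypothetical. It builds an `opa eval` command and turns a JSON result document into a pass/fail exit code.

```python
import json

def build_opa_command(policy_dir: str, input_path: str, query: str) -> list[str]:
    """Assemble an `opa eval` invocation (paths and query are hypothetical)."""
    return ["opa", "eval", "--format", "json",
            "--data", policy_dir, "--input", input_path, query]

def violations_from_result(raw_json: str) -> list[str]:
    """Flatten the value of every evaluated expression into one list of messages."""
    doc = json.loads(raw_json)
    messages = []
    for result in doc.get("result", []):
        for expr in result.get("expressions", []):
            messages.extend(expr.get("value") or [])
    return messages

def ci_gate(raw_json: str) -> int:
    """Return a process exit code: 0 lets the PR merge, 1 blocks it."""
    violations = violations_from_result(raw_json)
    for msg in violations:
        print(f"POLICY VIOLATION: {msg}")
    return 1 if violations else 0

# Hand-written result document, shaped like the JSON that `opa eval` is assumed to emit.
sample = json.dumps({"result": [{"expressions": [{"value": [
    "column email is tagged PII but has no `mask:` directive"]}]}]})
cmd = build_opa_command("policies/", "contract.json", "data.mesh.pii_masking.deny")
exit_code = ci_gate(sample)
```

In a real pipeline the command would be executed (for example with `subprocess.run(cmd, capture_output=True, text=True)`) and its standard output passed to `ci_gate`; the non-zero exit code is what makes the check block the merge.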
Policies that scale (the canonical set).
- PII masking. Any column whose lineage tag includes `PII: true` must be masked, hashed, or tokenised in the `product` tier. Catches accidental exposure of `email`, `ssn`, `phone`.
- Retention. Any table tagged `customer_data` must have a `retention_days` property ≤ 730. Drives automatic vacuum / time-travel pruning.
- Cross-region. Reads of EU-tagged tables from non-EU compute require an approved exception. Catches GDPR / data-residency violations.
- Query-pattern. Pipelines whose CPU-per-row exceeds a threshold get flagged in CI for review. Cheap defence against runaway costs.
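The retention rule in the list above reduces to a check on table metadata. A minimal sketch, assuming each table is described by a dict with hypothetical `tags` and `properties` keys (the real shape depends on your catalog); the table names other than `customer_dim` and `dim_date` are made up for illustration.

```python
MAX_RETENTION_DAYS = 730  # ceiling stated in the retention policy

def retention_violations(tables):
    """Return one message per customer_data table with missing or excessive retention."""
    problems = []
    for table in tables:
        if "customer_data" not in table.get("tags", []):
            continue  # the policy only applies to tagged tables
        days = table.get("properties", {}).get("retention_days")
        if days is None:
            problems.append(f"{table['name']}: retention_days is not set")
        elif days > MAX_RETENTION_DAYS:
            problems.append(
                f"{table['name']}: retention_days={days} exceeds {MAX_RETENTION_DAYS}")
    return problems

# Hypothetical catalog entries for illustration only.
tables = [
    {"name": "customer.product.customer_dim", "tags": ["customer_data"],
     "properties": {"retention_days": 365}},
    {"name": "customer.product.support_tickets", "tags": ["customer_data"],
     "properties": {"retention_days": 1095}},
    {"name": "customer.product.sessions", "tags": ["customer_data"], "properties": {}},
    {"name": "platform.product.dim_date", "tags": [], "properties": {}},
]
problems = retention_violations(tables)
```

Treating a missing `retention_days` as a violation is a design choice in this sketch: it keeps the rule from being bypassed by simply omitting the property.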
Tag inheritance through lineage.
- Producer tags the source. The raw column `customer.raw.users.email` gets the `PII: true` tag once, by the `customer` domain.
- Lineage scanner propagates. OpenLineage / Marquez / Datafold scan the dbt manifest and CI artifacts, build the column-level lineage graph, and propagate tags downstream automatically.
- Derived and product inherit. Every downstream column derived from `email` inherits the `PII: true` tag — including hashed forms (`SHA-256(email)`), which are still PII under GDPR.
- Policies key off the tag, not the column name. That decoupling means the policy survives column renames and propagation through unioned / joined / aggregated derivations.
The platform team's new KPI.
- Before mesh. "Tickets shipped per quarter." Linear with team size. Caps out.
- After mesh. "Percent of compliance enforced automatically." Bounded by 100%. Each new policy moves it up; each manual review replaced by a CI check moves it up.
- The conversation with leadership changes. "We are at 92% automated compliance. The remaining 8% is the cross-region approval workflow which is intentionally manual."
Common architecture-interview probes on governance.
- "How does a PII column propagate through derivations?" — through column-level lineage with tag inheritance. Hashed or tokenised PII is still PII for policy purposes.
- "What stops a domain from opting out of governance?" — the platform CI workflow is a required check on every domain repo. The domain cannot merge to `main` without it passing. Platform writes the workflow template; domains import it.
- "When does a policy get exception-approved instead of enforced?" — policies have an `exception_allowed: true` flag for cases like one-off analytics that need a 90-day exemption. The exemption is auditable, time-bound, and shows in the catalog.
- "Is mesh compatible with strict regulatory regimes (SOX, GDPR, HIPAA)?" — more compatible than centralised, because the audit trail is built into the policy-as-code git history. Every compliance decision has a PR, a reviewer, and a timestamp.
Worked example — the PII-masking OPA policy
Detailed explanation. Write the canonical PII-masking policy. Any column whose tags include PII: true must have a mask: <method> directive in the contract. The policy runs on every PR that touches a *.contract.yaml.
Question. Write the Rego OPA policy that fails CI when a PII column in the product tier lacks a masking directive, and show the contract YAML that satisfies it.
Input — the contract excerpt.
```yaml
schema:
  - { name: customer_id, type: string, required: true, tags: ["PII"], mask: "sha256" }
  - { name: email, type: string, required: true, tags: ["PII"], mask: "tokenize" }
  - { name: order_total, type: "decimal(12,2)", required: true }  # no PII tag
```
Code — the Rego policy.
```rego
package mesh.pii_masking

# Deny any product-tier column tagged PII that lacks a mask directive.
deny[msg] {
    input.tier == "product"
    col := input.schema[_]
    "PII" == col.tags[_]
    not col.mask
    msg := sprintf(
        "column %v is tagged PII but has no `mask:` directive (allowed: sha256 | tokenize | redact)",
        [col.name]
    )
}

# Allow only an approved list of masking methods.
allowed_masks := {"sha256", "tokenize", "redact"}

deny[msg] {
    col := input.schema[_]
    "PII" == col.tags[_]
    col.mask
    not allowed_masks[col.mask]
    msg := sprintf(
        "column %v uses unsupported masking method %v",
        [col.name, col.mask]
    )
}
```
Step-by-step explanation.
- The first rule fires when a PII-tagged column in a `product`-tier contract has no `mask:` field. It composes a precise message — naming the offending column and listing the allowed methods.
- The second rule fires when a `mask:` field exists but uses an unapproved value. The allowed set is a single Rego value, easy to extend.
- Both rules are evaluated by `opa eval` in CI on every PR touching the contract. Failures block the merge.
- The policy is layered with the column lineage scanner: if a downstream `product` column is derived from an upstream PII column but the downstream column lacks the `PII` tag itself, a second policy fires on the lineage manifest. Together they catch both direct and propagated PII exposure.
Output — CI on a violating PR.
| Check | Result | Message |
|---|---|---|
| YAML well-formed | pass | 6 fields present |
| Semver consistency | pass | minor bump |
| dbt model sync | pass | columns match |
| PII masking (OPA) | fail | column email is tagged PII but has no `mask:` directive (allowed: sha256 \| tokenize \| redact) |
Rule of thumb. Write the policy once. Apply it to every domain. The platform team's role is "policy author," not "PR reviewer for PII." The reviewer role is delegated to the CI.
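To watch both deny rules react to a concrete contract, the logic can be mirrored in plain Python for a quick local check. This is a sketch, not a replacement for the Rego policy: the function name `pii_mask_violations` and the dict layout are assumptions that follow the contract excerpt above, and the `tier` key is assumed to sit at the top level of the contract.

```python
ALLOWED_MASKS = {"sha256", "tokenize", "redact"}

def pii_mask_violations(contract):
    """Mirror of the two deny rules: missing mask (product tier) and unsupported mask."""
    messages = []
    for col in contract.get("schema", []):
        if "PII" not in col.get("tags", []):
            continue  # untagged columns are out of scope
        mask = col.get("mask")
        if mask is None:
            # Rule 1 only applies to product-tier contracts.
            if contract.get("tier") == "product":
                messages.append(
                    f"column {col['name']} is tagged PII but has no mask directive")
        elif mask not in ALLOWED_MASKS:
            # Rule 2 applies regardless of tier.
            messages.append(
                f"column {col['name']} uses unsupported masking method {mask}")
    return messages

compliant = {"tier": "product", "schema": [
    {"name": "customer_id", "tags": ["PII"], "mask": "sha256"},
    {"name": "email", "tags": ["PII"], "mask": "tokenize"},
    {"name": "order_total"},
]}
violating = {"tier": "product", "schema": [
    {"name": "email", "tags": ["PII"]},                      # no mask at all
    {"name": "customer_id", "tags": ["PII"], "mask": "md5"}, # mask not on the allow-list
]}
ok = pii_mask_violations(compliant)
bad = pii_mask_violations(violating)
```

The compliant contract yields no messages, while the violating one yields one message per rule, which is the shape of failure the CI table above reports.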
Worked example — tag propagation through column lineage
Detailed explanation. A new derivation in the marketing domain joins customer.product.customer_dim.email_hash with campaign data. Even though the column is named email_hash and is already a SHA-256, the tag inheritance system propagates the PII: true tag automatically — and the platform's downstream policies enforce masking in marketing's product tier too.
Question. Show the column-level lineage graph and demonstrate how the PII: true tag flows from customer.raw.users.email through three layers of derivation.
Input — lineage manifest.
```yaml
columns:
  - name: customer.raw.users.email
    tags: [PII]
  - name: customer.derived.users_clean.email_lower
    derived_from: [customer.raw.users.email]
  - name: customer.product.customer_dim.email_hash
    derived_from: [customer.derived.users_clean.email_lower]
    transform: "sha256"
  - name: marketing.derived.lead_attribution.hashed_lead_email
    derived_from: [customer.product.customer_dim.email_hash]
  - name: marketing.product.campaign_lead_stats.unique_hashed_emails
    derived_from: [marketing.derived.lead_attribution.hashed_lead_email]
    transform: "count_distinct"
```
Code — the inheritance rule (Rego).
```rego
package mesh.tag_inheritance

import future.keywords.in

# A column inherits any tag that any of its lineage ancestors has.
# input.ancestors is assumed to map a column name to its list of ancestor
# columns, with ancestors behind PII-stripping transforms already excluded.
inherited_tags(col) = tags {
    tags := {tag |
        ancestor := input.ancestors[col.name][_]
        tag := ancestor.tags[_]
    }
}

# Deny: derived column missing inherited PII tag.
deny[msg] {
    col := input.columns[_]
    "PII" in inherited_tags(col)
    not "PII" in col.tags
    msg := sprintf(
        "column %v inherits PII tag from upstream but does not declare it",
        [col.name]
    )
}
```
Step-by-step explanation.
- `email` is tagged PII once, at the raw source. Every derivation inherits the tag automatically through the lineage graph.
- The SHA-256 transform on `email_hash` does not strip the PII tag. Hashed PII is still PII (GDPR Article 4(5)). The system encodes that legal fact in the policy.
- `marketing.derived.lead_attribution.hashed_lead_email` inherits PII transitively. If marketing's contract for `lead_attribution.hashed_lead_email` does not declare `tags: [PII]`, CI fails the inheritance check.
- `count_distinct` is an aggregating transform that produces a non-PII output (a count). The platform's transform-classification table marks `count_distinct` as PII-stripping; the output column does not inherit the tag. The policy author maintains this table.
Output.
| Column | Inherited tags | Declared tags | CI |
|---|---|---|---|
| `email_lower` | {PII} | {PII} | pass |
| `email_hash` | {PII} | {PII} | pass |
| `hashed_lead_email` | {PII} | {PII} | pass |
| `unique_hashed_emails` | {} (transform strips) | {} | pass |
Rule of thumb. Tag once at the source. Let lineage do the inheritance. Aggregating transforms (count, sum, count_distinct over hashes) strip PII tags; passing transforms (lower, trim, sha256, tokenize) preserve them.
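The rule of thumb can be traced over the worked example's manifest with a short Python sketch. It assumes the lineage graph is acyclic and that the set of PII-stripping transforms is the one named in the rule of thumb; `effective_tags` is an illustrative name, not part of any lineage tool.

```python
PII_STRIPPING = {"count", "sum", "count_distinct"}  # aggregating transforms

# Lineage manifest from the worked example, as Python dicts.
columns = [
    {"name": "customer.raw.users.email", "tags": ["PII"]},
    {"name": "customer.derived.users_clean.email_lower",
     "derived_from": ["customer.raw.users.email"]},
    {"name": "customer.product.customer_dim.email_hash",
     "derived_from": ["customer.derived.users_clean.email_lower"],
     "transform": "sha256"},
    {"name": "marketing.derived.lead_attribution.hashed_lead_email",
     "derived_from": ["customer.product.customer_dim.email_hash"]},
    {"name": "marketing.product.campaign_lead_stats.unique_hashed_emails",
     "derived_from": ["marketing.derived.lead_attribution.hashed_lead_email"],
     "transform": "count_distinct"},
]

def effective_tags(columns):
    """Declared tags plus tags inherited through lineage (assumes no cycles)."""
    by_name = {c["name"]: c for c in columns}
    cache = {}

    def resolve(name):
        if name in cache:
            return cache[name]
        col = by_name[name]
        tags = set(col.get("tags", []))
        for parent in col.get("derived_from", []):
            inherited = resolve(parent)
            if col.get("transform") in PII_STRIPPING:
                inherited = inherited - {"PII"}  # aggregates drop the PII tag
            tags |= inherited
        cache[name] = tags
        return tags

    return {name: resolve(name) for name in by_name}

tags = effective_tags(columns)
```

Every column up to `hashed_lead_email` resolves to `{"PII"}`, while the `count_distinct` output resolves to an empty set, matching the output table above.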
Worked example — the cross-region read policy
Detailed explanation. A payments domain analyst in the US opens a PR that reads customer.product.customer_dim — which is tagged region: EU because GDPR. The cross-region policy fires in CI: the read is blocked until an exception is granted (or until the analyst rewrites the query to use a US-resident aggregate).
Question. Write the cross-region policy in Rego and show how an exception is granted via a time-bound annotation.
Input — the violating PR.
```sql
-- Pipeline runs in compute_region=us-east-1 reading EU-tagged data
SELECT customer_id, signup_date
FROM customer.product.customer_dim
WHERE country = 'DE';
```
Code — the cross-region policy.
```rego
package mesh.cross_region

# An exception only counts if it names a granter and has not expired.
valid_exception {
    ex := input.exceptions["cross_region_eu"]
    ex.granted_by != ""
    time.parse_rfc3339_ns(ex.expires) > time.now_ns()
}

# Deny: pipeline reads EU-tagged data from non-EU compute, no valid exception.
# Assumes EU compute regions share the "eu-" prefix (e.g. eu-west-1).
deny[msg] {
    not startswith(input.compute_region, "eu-")
    upstream := input.lineage[_]
    "region:EU" == upstream.tags[_]
    not valid_exception
    msg := sprintf(
        "pipeline %v in %v reads EU-resident %v — request exception via /platform-exceptions",
        [input.pipeline.name, input.compute_region, upstream.name]
    )
}

# Allow time-bound exceptions, audited in git.
allow[msg] {
    valid_exception
    ex := input.exceptions["cross_region_eu"]
    msg := sprintf("cross-region exception in effect until %v", [ex.expires])
}
```
Step-by-step explanation.
- The pipeline manifest declares `compute_region: us-east-1`. The lineage scan finds `customer.product.customer_dim` tagged `region: EU`. The `deny` rule fires.
- The exception block is a YAML file in the domain repo, e.g. `exceptions/cross_region_eu.yaml`, granted by an authorised reviewer and expiring on a date. Without the file, CI fails.
- Granting an exception is a PR against the exception file, not against the policy. That PR is reviewed by the platform compliance reviewer, time-bound, and audited.
- The policy is data-residency enforcement as code — the same rule that supports compliance with GDPR Article 44 (cross-border transfers) lives in git, runs in CI, and is auditable forever.
Output.
| State | CI result | Note |
|---|---|---|
| No exception | fail | PR blocked, message links to /platform-exceptions |
| Exception granted, valid | pass | exception_expires emitted as warning |
| Exception granted, expired | fail | CI recomputes on every run; expiry is mechanical |
Rule of thumb. Encode every compliance rule (GDPR, HIPAA, SOX) as a policy with a time-bound exception mechanism. The auditor's job becomes "review the policy repo," not "interview the team." Of all the compliance costs the mesh removes, that shift eliminates the most expensive one.
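The three states in the output table can be reproduced with a small Python sketch of the expiry logic. It is a sketch under assumptions: the exception record uses hypothetical `granted_by` and `expires` keys, EU compute regions are assumed to start with `eu-`, and the clock is passed in explicitly so the example is reproducible.

```python
from datetime import datetime, timezone

def exception_state(exception, now):
    """Classify a cross-region exception record as 'none', 'valid', or 'expired'."""
    if not exception or not exception.get("granted_by"):
        return "none"
    expires = datetime.fromisoformat(exception["expires"])
    return "valid" if expires > now else "expired"

def ci_result(compute_region, reads_eu_data, exception, now):
    """Return 'pass' or 'fail' for a pipeline, following the output table."""
    if compute_region.startswith("eu-") or not reads_eu_data:
        return "pass"  # no cross-region read, nothing to enforce
    return "pass" if exception_state(exception, now) == "valid" else "fail"

# Fixed clock and made-up exception records for illustration.
now = datetime(2025, 6, 1, tzinfo=timezone.utc)
granted = {"granted_by": "platform-compliance", "expires": "2025-09-01T00:00:00+00:00"}
lapsed = {"granted_by": "platform-compliance", "expires": "2025-03-01T00:00:00+00:00"}

results = {
    "no exception": ci_result("us-east-1", True, None, now),
    "valid": ci_result("us-east-1", True, granted, now),
    "expired": ci_result("us-east-1", True, lapsed, now),
}
```

Because the expiry comparison runs on every evaluation, an exception lapses without anyone having to revoke it, which is what "expiry is mechanical" means in the table.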
Worked example — measuring the federated-governance KPI
Detailed explanation. Define and compute the platform team's federated-governance KPI: percent of compliance enforced automatically vs ticket-based. Walk through a quarter where the team starts at 60% and finishes at 92% — naming each policy that moved the number.
Question. Compute the KPI from a quarter's data — total compliance actions, automated CI catches, manual reviews. Identify the two policies that moved the number the most.
Input.
| Quarter | Compliance actions | CI catches | Manual reviews |
|---|---|---|---|
| Q1 | 1000 | 600 | 400 |
| Q2 | 1100 | 760 | 340 |
| Q3 | 1180 | 920 | 260 |
| Q4 | 1260 | 1160 | 100 |
Code — the KPI calculation.
```python
def federated_gov_kpi(ci_catches, manual_reviews):
    """Share of compliance actions that were caught automatically in CI."""
    total = ci_catches + manual_reviews
    return ci_catches / total if total else 0.0

# (CI catches, manual reviews) for Q1..Q4
q = [(600, 400), (760, 340), (920, 260), (1160, 100)]
for i, (c, m) in enumerate(q, 1):
    print(f"Q{i}: {federated_gov_kpi(c, m):.0%} automated ({c}/{c + m})")
```
Step-by-step explanation.
- Q1: 60% automated. The platform repo had PII masking and retention policies; cross-region was manual via Slack.
- Q2: 69% — adding the `cross_region` policy lifted CI catches from 600 to 760 and cut manual reviews from 400 to 340.
- Q3: 78% — adding the `query_cost_pattern` policy caught another 160 cases.
- Q4: 92% — adding `tag_inheritance` automated the 200+ "did this derivation propagate PII?" reviews.
- The remaining 8% is intentional: cross-region exceptions, novel-policy requests, and the quarterly auditor review. Those are appropriately manual.
Output.
| Quarter | KPI | Top mover |
|---|---|---|
| Q1 | 60% | (baseline) |
| Q2 | 69% | cross_region.rego |
| Q3 | 78% | query_cost_pattern.rego |
| Q4 | 92% | tag_inheritance.rego |
Rule of thumb. Report the KPI every quarter. Each new policy is a line in the change log; each policy that moves the number proves the platform's investment is paying back. The KPI is the platform team's most defensible budget argument.
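To rank the policies by how far each moved the KPI, the quarter-over-quarter change can be computed in percentage points. This sketch attributes each quarter's whole gain to that quarter's top mover, which is a simplification taken from the output table rather than a measured attribution; the `movers` helper is illustrative.

```python
quarters = [
    ("Q1", 600, 400, "(baseline)"),
    ("Q2", 760, 340, "cross_region.rego"),
    ("Q3", 920, 260, "query_cost_pattern.rego"),
    ("Q4", 1160, 100, "tag_inheritance.rego"),
]

def kpi(ci_catches, manual_reviews):
    """Share of compliance actions caught automatically."""
    total = ci_catches + manual_reviews
    return ci_catches / total if total else 0.0

def movers(quarters):
    """Return (policy, percentage-point gain) per quarter, largest gain first."""
    gains = []
    for prev, cur in zip(quarters, quarters[1:]):
        delta_pp = (kpi(cur[1], cur[2]) - kpi(prev[1], prev[2])) * 100
        gains.append((cur[3], round(delta_pp, 1)))
    return sorted(gains, key=lambda g: g[1], reverse=True)

ranking = movers(quarters)
for policy, gain in ranking:
    print(f"{policy}: +{gain} pp")
```

On these numbers `tag_inheritance.rego` contributes the largest gain (about 14.1 points), followed by `cross_region.rego` (about 9.1) and `query_cost_pattern.rego` (about 8.9), so the first two are the answer to the "two biggest movers" question.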
Architecture interview question on federated governance
A senior interviewer often frames this as: "Walk me through how a new PII column added in the customer domain gets enforced across marketing, payments, and orders without anyone filing a ticket." It tests whether you understand that federated governance is a loop, not a one-off policy.
Solution Using policy-as-code + lineage tag inheritance + CI enforcement
```yaml
# 1. customer domain adds PII column, tags it in the contract
#    customer/product/customer_dim.contract.yaml
schema:
  - { name: phone_number, type: string, required: true, tags: ["PII"], mask: "tokenize" }

# 2. Platform OPA policy (already in place) enforces PII masking
#    policies/pii_masking.rego applies to every domain's CI

# 3. Lineage scanner propagates PII tag to every downstream column
#    OpenLineage manifest emitted by every CI run

# 4. Marketing / payments / orders CI fails on any unmasked downstream
#    without anyone filing a ticket — the policy is the ticket
```
Step-by-step trace.
| Step | Actor | Action | Latency |
|---|---|---|---|
| 1 | customer domain | adds `phone_number` PII column with `mask: tokenize` | 1 day |
| 2 | platform CI | validates contract, OPA passes | 60s |
| 3 | lineage scanner | propagates `PII: true` tag to downstream derivations across domains | next CI run |
| 4 | marketing CI | fails any downstream pipeline that exposes unmasked `phone_number` | 60s per PR |
| 5 | marketing domain | adds masking + re-runs | 1 day |
| 6 | platform team | observes KPI tick: +1 automated catch | passive |
The platform team did nothing in the loop. The policy did the work. That is "federated computational governance" working as designed.
Output:
| Outcome | Mechanism |
|---|---|
| Producer added new PII column | self-serve via contract YAML |
| Downstream domains caught violations | CI + lineage tag inheritance |
| Compliance audit trail | git history of policies + contracts |
| Platform team workload | zero PRs reviewed manually |
Why this works — concept by concept:
- Policy-as-code — turns "compliance is a process" into "compliance is a CI step." Same idea as terraform plan / apply for infra, applied to data.
- Tag inheritance — solves the propagation problem mechanically. No engineer has to remember "this is downstream of PII." The lineage scanner does it.
- Required CI workflow — every domain repo imports the platform's CI workflow. The platform team writes the policy; the domain team's CI runs it.
- Exception as PR — exceptions are not "ask in Slack"; they are PRs against a versioned exception file. Auditors love this; engineers tolerate this.
- Cost — the platform team writes ~20-40 policies over the first year. After that, the marginal cost of a new domain is zero compliance-wise, because the policies already work for it. The cost curve flattens as domains are added, which is the opposite shape from the central-team queue.
Cheat sheet — data mesh implementation recipes
- One repo per domain, one CI pipeline per repo. CODEOWNERS routes PRs; the standard CI workflow imports platform OPA policies. Onboarding a new domain = one CLI command, < 30 minutes.
- Publish only the `product` tier cross-domain. Raw and derived are private to the domain. Cross-domain reads of `raw` or `derived` are CI failures, not negotiation.
- Every product table has a `<table>.contract.yaml`. Six required fields: `name`, `version`, `owner`, `schema`, `sla`, `semantics`. Reviewed via PR. No "draft" contracts in production.
- Semver as a mechanical rule. Patch = bug fix (no schema change). Minor = additive (new optional columns). Major = breaking (rename, drop, narrow). Deprecation window ≥ 1 quarter on major.
- Pin consumers with caret (`^major.minor`). Patches and additive minors flow through silently; majors require explicit consumer PR.
- Use Unity Catalog / Polaris / Gravitino for cross-domain discovery. Search, schema, owner, freshness, downstream consumers — all on the catalog page. Slack is not a catalog.
- Policies-as-code in OPA, in git, PR-reviewed. Confluence pages are not policies. Policies that do not run in CI do not enforce anything.
- Tag PII once at the source; let lineage inheritance propagate. Aggregating transforms (`count`, `sum`) strip the tag; passing transforms (`lower`, `sha256`, `tokenize`) preserve it.
- Domain teams own their on-call rotation. Producer pages on freshness or accuracy SLA breach. Central platform pages only on substrate (catalog, CI, OPA) outages.
- The platform team's KPI is "% compliance automated." Each new policy moves the number up. Report it every quarter; it is the platform team's budget argument.
- Conformed dimensions (`dim_date`, `dim_geo`, `dim_currency`) live in `platform.product.*`. Never duplicate them per domain. Owning them centrally is the platform team's product-tier contribution.
- Cross-domain hot joins are platform-managed marts. When two domains' product tables join often, model the join once in `platform.curated.*` instead of replicating the join in every consumer.
- Migration from central warehouse is one-domain-at-a-time, not big-bang. Pick the most painful domain first (highest ticket count to central). Stand it up as a mesh domain. Use it as the lighthouse. Repeat.
- "Self-serve" means < 30-minute onboarding. If onboarding requires a platform-team ticket, the platform is mis-named. Onboarding lead time is the single most important platform KPI.
- Below 200 product engineers, don't do mesh. Hire one more central engineer and invest in self-serve metric layers. Mesh setup costs dwarf central-team pain below that line.
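The semver and caret-pinning recipes above are mechanical enough to express in a few lines. A minimal sketch, assuming three-part `major.minor.patch` versions with major ≥ 1 and a pin written as `^major.minor`; the helper names are hypothetical and not taken from any packaging library.

```python
def parse_version(text):
    """Parse 'major.minor.patch' into a tuple of ints."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch

def bump_type(old, new):
    """Classify a contract version change as 'major', 'minor', or 'patch'."""
    o, n = parse_version(old), parse_version(new)
    if n[0] != o[0]:
        return "major"
    if n[1] != o[1]:
        return "minor"
    return "patch"

def satisfies_caret(pin, published):
    """True if `published` is accepted by a caret pin such as '^1.2'."""
    pin_major, pin_minor = (int(part) for part in pin.lstrip("^").split(".")[:2])
    major, minor, _ = parse_version(published)
    # Same major, and at least the pinned minor: patches and additive minors flow through.
    return major == pin_major and minor >= pin_minor

accepted = [v for v in ["1.1.9", "1.2.0", "1.2.5", "1.3.0", "2.0.0"]
            if satisfies_caret("^1.2", v)]
```

A consumer pinned to `^1.2` picks up `1.2.0`, `1.2.5`, and `1.3.0` automatically, while `2.0.0` requires an explicit consumer PR to move the pin.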
Frequently asked questions
When is my org big enough to need data mesh?
The rough industry threshold is around 200 product engineers and at least 4 domains with embedded data engineers. Below that line, the central data team usually still scales — adding 1-2 engineers and investing in self-serve metric tooling pays back faster than the 8-15 engineer-quarter mesh setup cost. Above that line, the central team's utilisation passes 0.9, lead times blow past one quarter, and Conway's-law symptoms (one giant fact_everything table, 380 enum values in one column) appear in the warehouse schema. The honest answer in an interview is to refuse to recommend mesh without first running the four-axis diagnostic on scale, domain readiness, platform budget, and central-team utilisation.
What's the difference between data mesh and data fabric?
Data mesh is a socio-technical pattern (org + architecture) emphasising domain ownership of data products with federated governance. Data fabric is a technology pattern (mostly architecture) emphasising a unified metadata / orchestration layer that automates data integration, lineage, and governance across heterogeneous sources. In practice they are complementary, not competing: a real mesh implementation typically uses fabric-style metadata tooling (catalog, lineage, automated governance) as part of its self-serve platform substrate. The shorthand is "mesh is who owns the data; fabric is how the metadata flows." Most modern lakehouse platforms (Databricks Unity Catalog, Snowflake Polaris, Apache Gravitino) ship both: domain-namespaced ownership for mesh plus fabric-style automated lineage and policy propagation.
How do I migrate from a central warehouse to a mesh without big-bang rewrites?
Migrate one domain at a time, in pain-priority order. Pick the domain that files the most tickets against the central team — that is where the org will feel the win first. Stand up its repo, its product tier with a contract, its on-call rotation, and its OPA-enforced CI inside one quarter. Publish the lighthouse — every other domain converts by copying that domain's pattern (the platform CLI bakes the template). Keep the central warehouse running in parallel; consumers cut over to the new product tables on their own timeline using the version-pinning subscription model. Plan on 4-8 quarters for full migration of 5-10 domains, with the first quarter spent almost entirely on the platform-team substrate (CLI, CI templates, OPA bundle, catalog onboarding script) — that investment is what makes the remaining quarters fast.
Do I need a lakehouse to do data mesh?
You do not strictly need a lakehouse, but it makes mesh dramatically cheaper. The lakehouse architecture (Delta / Iceberg / Hudi on object storage with a cross-engine catalog like Unity Catalog or Polaris) gives you one storage layer that every domain's compute engine can read — Spark, Trino, Snowflake, BigQuery, DuckDB — without copying data. That is the technical precondition that makes "one domain, one substrate, many consumers" feasible. Without it, you end up with per-engine permission matrices, data duplication, and a fabric-style integration layer that becomes its own bottleneck. Modern mesh implementations almost universally use lakehouse formats as the substrate; older warehouse-only stacks (pure Snowflake or pure BigQuery) can still implement mesh but require more careful per-engine access policy plumbing.
Who owns shared dimensions like dim_date in a mesh?
The platform team owns conformed shared dimensions — dim_date, dim_geo, dim_currency, dim_organization. They live in platform.product.* namespace and are consumed by every business domain. Treating shared dimensions as "just another domain" creates a circular ownership problem (which domain owns "geography"?) and a duplication problem (every domain rolls its own dim_date with subtle inconsistencies). The platform team's product-tier contribution is precisely these conformed dimensions plus any hot cross-domain marts in platform.curated.*. That keeps the principle "domain owns business logic" intact for business domains while assigning the genuinely cross-cutting reference data to the team whose mandate is "make every other team 10x faster."
How do I prevent "mesh" from becoming "anarchy"?
The two non-negotiable guardrails are data contracts and federated computational governance — both enforced in CI, both versioned in git, both producing audit trails. The anti-mesh failure mode is "we adopted the domain ownership principle without the federated governance principle" — domains start publishing data without contracts, without SLAs, without PII tagging, and the org ends up with a hundred private warehouses and no auditor-friendly trail. The discipline is: domain autonomy lives inside the policy guardrails the platform team writes once. Every PR runs OPA. Every product table has a contract. Every cross-domain read goes through the catalog with masking applied. Every PII column is tagged at the source and inherited downstream. If any of those four invariants is missing, what you have is not data mesh — it is the central team's old pain rebranded across N teams.
Practice on PipeCode
- Drill the data modeling practice library → for domain-modeling and dimensional schema problems that map onto mesh `product` tiers.
- Rehearse on dimensional modeling problems → when the interviewer wants fact / dimension trade-offs for a conformed-dimension layer.
- Sharpen ETL design drills → for the producer-side pipelines that publish a `product` tier from raw and derived.
- Layer the event modeling library → for the source-system event schemas that feed each domain's raw tier.
- Stack the design library → for the system-design surface around catalogs, governance, and policy enforcement.
- For the broader surface, read top data engineering interview questions →.
- Stack the prerequisites with the only 5 skills you need to become a data engineer →.
- Sharpen the modeling axis with the data modelling for DE interviews course →.
- For platform-engineering depth, work through ETL system design for data engineering interviews →.
Pipecode.ai is Leetcode for Data Engineering — every mesh principle above ships with hands-on practice rooms where you design the domain bounded contexts, draft the `product.contract.yaml`, and reason about federated governance loops against real graded prompts. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so you never have to wonder whether your lakehouse data mesh blueprint will survive contact with the staff-level interviewer who actually built one in production.