Data contracts are the machine-readable, version-controlled agreements that finally answer the question "who broke the pipeline" without a Slack channel, a war room, or a 3 AM page. By 2026 they have moved from thought-leadership decks into concrete, production-shipped YAML files that every serious data platform now checks into git. The pre-contract world is grim: a producer team quietly renames user_id to customer_id, the change lands in the source-system release, twelve downstream models fail overnight, the on-call scans dashboards for two hours before someone finally traces the rename through a diff, and every consumer team files an incident against a producer team that had no idea it had any consumers at all. A data contract collapses that entire failure mode into a single artefact: the producer commits a YAML file that declares the schema, the SLA, the quality checks, and the ownership; consumers subscribe to that file; the CI pipeline blocks any producer change that violates it.
This guide is the senior-DE walkthrough you wished existed the first time an interviewer asked "walk me through an Open Data Contract Standard file" or "what does BACKWARD compatibility mean in a schema registry and which producer-side change breaks it?" or "how does a producer-consumer SLA actually get enforced in CI vs at runtime?" It walks through why data contracts became inevitable given the shape of modern data teams; the anatomy of an ODCS YAML with schema, sla, quality, and roles blocks; the Confluent / Apicurio schema registry model with subject-naming strategies and compatibility modes; the two enforcement gates (pre-merge lint plus runtime consumer-side validation) that turn a contract from a wish into a rule; and the rollout ladder (warn-only, block-on-new, block-everywhere) that lets a large organisation introduce contracts without shipping a wave of overnight breakages. Each section pairs a teaching block with a Solution-Tail interview answer: code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.
On this page
- Why data contracts became the answer to "who broke the pipeline"
- Open Data Contract Standard (ODCS) — the schema
- Schema Registry integration — Avro, Protobuf, JSON
- Contract enforcement — CI + runtime
- SLAs + ownership + rollout
- Cheat sheet — data contract recipes
- Frequently asked questions
- Practice on PipeCode
1. Why data contracts became the answer to "who broke the pipeline"
From tribal knowledge to version-controlled YAML — the producer-consumer bargain that stops silent breakage
The one-sentence invariant: a data contract is a machine-readable, version-controlled, producer-signed agreement that declares the schema, the SLA, the quality checks, and the ownership of a dataset — so any producer-side change that violates any of those axes is blocked before it lands, and every consumer team can reason about their upstream without asking a human. The pre-contract world runs on tribal knowledge — the producer team knows some downstream exists but not which, the consumer team knows the shape of yesterday's data but not tomorrow's, and every "silent schema drift" incident traces back to the same root cause: the agreement lived only in someone's head.
The four axes interviewers actually probe.
- Schema. Column names, types, nullability, PII flags, unique constraints. The most obvious axis and the one most breakages come from. A rename, a type change, or a column drop without a deprecation window is a contract violation.
- SLA. Freshness (how stale can the data be), availability (what fraction of the day is it queryable), and quality (null rate, uniqueness, range). SLAs give the contract teeth — a schema-conformant but four-hours-stale table is still a broken contract if the freshness SLA is 15 minutes.
- Quality. Assertions the producer commits to keep true: order_total >= 0, country_code IN (ISO-3166 list), null(order_id) == 0. Enforced at write time by the producer or at read time by the consumer; violations either block or route to a dead-letter queue.
- Ownership. The producer team, the consumer team(s), the escalation path. A contract without owners is a contract nobody enforces; a contract with owners is the single artefact the on-call opens when things go wrong.
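The quality axis is the easiest one to make concrete. The sketch below is a minimal, assumed implementation of evaluating three of the assertions named above against a batch of rows and sorting failures by severity; the helper names (null_rate, is_unique, evaluate) and the sample batch are hypothetical and do not come from any particular tool.

```python
def null_rate(rows, column):
    """Fraction of rows where `column` is missing or None."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(column) is None) / len(rows)


def is_unique(rows, column):
    """True when no non-null value of `column` appears twice."""
    values = [r[column] for r in rows if r.get(column) is not None]
    return len(values) == len(set(values))


def evaluate(rows, assertions):
    """Return (blocked, warnings): names of failed assertions by severity."""
    blocked, warnings = [], []
    for name, severity, check in assertions:
        if check(rows):
            continue
        (blocked if severity == "block" else warnings).append(name)
    return blocked, warnings


# Hypothetical assertion set: (name, severity, predicate over the batch).
ASSERTIONS = [
    ("null_rate(order_id) == 0", "block",
     lambda rows: null_rate(rows, "order_id") == 0),
    ("unique(order_id)", "block",
     lambda rows: is_unique(rows, "order_id")),
    ("order_total >= 0", "warn",
     lambda rows: all(r["order_total"] >= 0
                      for r in rows if r.get("order_total") is not None)),
]

batch = [
    {"order_id": "o-1", "order_total": 10.0},
    {"order_id": "o-2", "order_total": -5.0},
    {"order_id": "o-2", "order_total": 3.5},  # duplicate id
]

blocked, warnings = evaluate(batch, ASSERTIONS)
print("blocked:", blocked)    # a non-empty list would fail the write
print("warnings:", warnings)  # would be logged or routed to a dashboard
```

A producer-side writer would refuse the batch when blocked is non-empty; a consumer-side reader would more likely route the offending rows to a dead-letter queue.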
Why 2026 is the "data contracts have won" year.
- ODCS v3.x is the emerging standard. The Open Data Contract Standard shipped 3.x in 2025 with widespread adoption across Bitol / Linux Foundation projects. Vendor-neutral YAML shape means the same file drives dbt contracts, schema registry subjects, Great Expectations quality checks, and BI-tool metadata.
- Confluent Schema Registry + dbt contracts complement. Streaming schemas live in the registry (Avro / Protobuf / JSON with compatibility modes); batch/warehouse schemas live in dbt model contracts. ODCS is the single YAML that either side derives from.
- The tooling is mature. GitHub Actions + odcs-lint catch pre-merge violations; Confluent's compatibility check catches producer-side incompatible schemas; consumer-side validators (fastavro, protobuf runtime, JSON Schema) reject rogue payloads at runtime.
- The organisational pattern is understood. "Producer-signed, consumer-consumed, platform-enforced" is now the shorthand: the producer owns the contract file in their repo, consumers depend on the version, the platform team runs the enforcement infrastructure.
What a contract actually contains.
- id and version. A stable identifier (orders.public) plus a semver version. Consumers pin to a major version; producers bump minor for backwards-compatible additions and major for breaking changes.
- schema. Column list with name, type, description, nullable, PII flag, and unique/primary-key constraints. The rendered form drives Avro / Protobuf / dbt contract generation.
- sla. Freshness threshold (e.g. max_freshness: PT15M), availability target (e.g. availability_pct: 99.9), and query latency SLO if the table backs an interactive surface.
- quality. List of assertions: null-rate ceilings, uniqueness constraints, range checks, referential integrity. Each assertion has a severity (warn or block) and a rollout phase.
- roles. Producer team, consumer team(s), platform steward, on-call rotation. Free-form YAML but tools like Backstage and DataHub already parse these into service-catalogue entries.
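As a rough illustration of what "contains" means in practice, the sketch below checks that a parsed contract (already loaded into a dict) carries an id, a semver version, and the four blocks listed above. The block names follow this article's simplified contract shape; the function name check_contract and the sample contracts are hypothetical.

```python
import re

REQUIRED_BLOCKS = ("schema", "sla", "quality", "roles")
SEMVER = re.compile(r"^\d+\.\d+\.\d+$")


def check_contract(contract):
    """Return a list of human-readable problems; empty means structurally OK."""
    problems = []
    if not contract.get("id"):
        problems.append("missing id")
    if not SEMVER.match(str(contract.get("version", ""))):
        problems.append("version is not MAJOR.MINOR.PATCH")
    for block in REQUIRED_BLOCKS:
        if not contract.get(block):
            problems.append(f"missing or empty block: {block}")
    for column in contract.get("schema") or []:
        for field in ("name", "type"):
            if field not in column:
                problems.append(f"schema entry lacks {field}: {column}")
    if not (contract.get("roles") or {}).get("producer"):
        problems.append("roles.producer is not set")
    return problems


good = {
    "id": "orders.public",
    "version": "1.3.0",
    "schema": [{"name": "order_id", "type": "STRING", "required": True}],
    "sla": {"max_freshness": "PT15M"},
    "quality": [{"assertion": "unique(order_id)", "severity": "block"}],
    "roles": {"producer": "team-checkout"},
}
bad = {"id": "orders.public", "version": "1.3", "schema": [{"name": "order_id"}]}

print(check_contract(good))
for problem in check_contract(bad):
    print("-", problem)
```

This only checks shape; it says nothing about whether a change between two versions is breaking, which is a separate diff-based check.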
What interviewers listen for.
- Do you name the four axes schema, SLA, quality, ownership without prompting? — senior signal.
- Do you distinguish schema-only agreements (registry) from full contracts (ODCS)? — senior signal.
- Do you describe the contract as a producer-signed, version-controlled artefact rather than a "documentation page"? — required.
- Do you name CI + runtime as the two enforcement points? — required.
- Do you mention deprecation windows and rollout phases when discussing schema changes? — senior signal.
Worked example — the silent-rename incident and how a contract prevents it
Detailed explanation. The canonical failure story: an upstream microservice team renames user_id to customer_id in the orders topic to align with a company-wide naming convention. The change ships in a Friday afternoon release. By Monday morning, twelve downstream dbt models have failed overnight, three ML feature pipelines are producing NULLs, and the executive dashboard is empty. The on-call spends four hours tracing the failure back to the rename. Walk an interviewer through what would have happened with a data contract in place.
- The symptom. Downstream models fail with column "user_id" does not exist.
- The root cause. Producer renamed a column with no deprecation notice.
- The pre-contract discovery time. Four hours of Slack, grep, and git bisect.
- The post-contract discovery time. Zero: the producer's PR fails CI with "removes column user_id from contract orders.public.v1: this is a breaking change; open a v2 with deprecation window or use deprecated: true."
Question. Given a producer team, a consumer team, and a shared orders topic, design the data-contract flow that catches the rename at PR time and offers the producer a paved path to deprecate the old name.
Input.
| Party | Before contract | After contract |
|---|---|---|
| Producer team | Renames column, ships Friday | Opens PR against orders/contract.yaml; CI fails on breaking change |
| Consumer team (dbt) | Discovers failure Monday | Sees deprecation notice via contract diff subscription |
| Platform team | Fields angry incident tickets | Runs lint + registry compat check; ships alerts |
| Data | Silently broken over weekend | Continues to flow; new column added; old kept until v2 |
Code.
# orders/contract.yaml: ODCS-shaped data contract, committed to the producer repo
apiVersion: v3.0.0
kind: DataContract
id: orders.public
version: 1.3.0
info:
  title: Orders - public model
  owner: team-checkout
  status: active
  description: One row per confirmed order in the checkout flow.
schema:
  - name: order_id
    type: STRING
    required: true
    unique: true
    description: Stable id of the order.
  - name: user_id
    type: STRING
    required: true
    pii: true
    description: Foreign key to users.public.
    deprecated: false
  - name: order_total
    type: DECIMAL(12,2)
    required: true
    description: Order total in USD.
  - name: created_at
    type: TIMESTAMP
    required: true
    description: UTC creation timestamp.
sla:
  max_freshness: PT15M
  availability_pct: 99.9
quality:
  - assertion: null_rate(order_id) == 0
    severity: block
  - assertion: unique(order_id)
    severity: block
  - assertion: null_rate(user_id) < 0.001
    severity: warn
roles:
  producer: team-checkout
  consumers:
    - team-analytics
    - team-fraud
    - team-marketing
  steward: platform-data
# .github/workflows/contract-lint.yml: the CI gate
name: contract-lint
on: [pull_request]
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          # Assumption: the diff step reads main from git history,
          # which the default shallow clone does not include.
          fetch-depth: 0
      - name: Install odcs-lint
        run: pip install odcs-lint==0.9.0
      - name: Validate contract shape
        run: odcs-lint orders/contract.yaml
      - name: Diff against main branch
        run: odcs-lint diff --base main --path orders/contract.yaml
        # If diff removes a column or narrows a type without a version bump,
        # odcs-lint exits non-zero and the PR is blocked.
Step-by-step explanation.
- The producer opens a PR that renames user_id → customer_id. The diff on orders/contract.yaml removes one column entry and adds another. odcs-lint diff --base main classifies this as a breaking change: removing a column that consumers depend on.
- The lint job exits non-zero. The GitHub required-status-check blocks the merge. A comment on the PR reads: Breaking: column user_id removed. Options: (a) bump version to 2.0.0 with a deprecation window; (b) keep user_id and add customer_id in parallel; (c) mark user_id deprecated: true for 30 days.
- The producer follows option (b): they keep user_id, add customer_id, and bump the contract minor version to 1.4.0 (additive change). The PR now passes.
- Consumer teams subscribed to the contract file (via GitHub Actions or DataHub webhook) receive a diff notification. They have until the eventual 2.0.0 removal to migrate their queries.
- In parallel, a scheduled job posts a deprecated: true marker on user_id after 30 days, and the eventual 2.0.0 removes the column. Every consumer had a paved migration path; no Monday-morning outage.
Output.
| Metric | Pre-contract | Post-contract |
|---|---|---|
| Time to discover breaking rename | 4 hours | 0 (PR-time) |
| Downstream models broken | 12 | 0 |
| Producer-consumer trust | eroded | preserved |
| Migration path | ad-hoc | 30-day deprecation window |
Rule of thumb. A data contract is not "documentation" — it is a code-checked artefact in the producer's repo whose diff is a breaking-change classifier. Anything that changes column presence, types, or nullability without a version bump is a CI failure. The producer chooses between "bump minor with additive change" and "bump major with deprecation window"; they cannot silently ship a rename.
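To make the "diff is a breaking-change classifier" idea tangible, here is a minimal sketch of such a classifier over two schema lists. It is not the lint tool's actual logic: the function name classify_schema_change is hypothetical, and it conservatively assumes that any type change is breaking (a real tool might allow widening).

```python
def classify_schema_change(old, new):
    """Compare two schema lists; return ("breaking" | "additive" | "none", reasons)."""
    old_cols = {c["name"]: c for c in old}
    new_cols = {c["name"]: c for c in new}
    breaking, additive = [], []

    for name, before in old_cols.items():
        if name not in new_cols:
            breaking.append(f"column {name} removed")
            continue
        after = new_cols[name]
        if after["type"] != before["type"]:
            breaking.append(
                f"column {name} type changed {before['type']} -> {after['type']}")
        if after.get("required", False) and not before.get("required", False):
            breaking.append(f"column {name} became required")

    for name, col in new_cols.items():
        if name not in old_cols:
            if col.get("required", False):
                breaking.append(f"new required column {name}")
            else:
                additive.append(f"new optional column {name}")

    if breaking:
        return "breaking", breaking
    if additive:
        return "additive", additive
    return "none", []


v1_3 = [
    {"name": "order_id", "type": "STRING", "required": True},
    {"name": "user_id", "type": "STRING", "required": True},
]
# The silent rename: user_id disappears, customer_id appears.
renamed = [
    {"name": "order_id", "type": "STRING", "required": True},
    {"name": "customer_id", "type": "STRING", "required": True},
]
# Option (b): keep user_id and add customer_id in parallel as optional.
parallel = v1_3 + [{"name": "customer_id", "type": "STRING", "required": False}]

print(classify_schema_change(v1_3, renamed))
print(classify_schema_change(v1_3, parallel))
```

A rename shows up as a removal plus an addition, which is why the classifier flags it without needing to know the producer's intent.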
Worked example — the freshness SLA violation nobody noticed
Detailed explanation. A less obvious but equally common failure mode: the schema is unchanged but the freshness SLA is silently violated. A batch pipeline that was supposed to refresh every 15 minutes starts refreshing every 4 hours because someone changed the Airflow cron. The dashboard shows "up-to-date" because the freshness is not measured. Executives make decisions on stale data. Discovery happens weeks later when a finance reconciliation fails.
- The pre-SLA world. Freshness is a footnote in the model description; nobody polls it.
- The post-SLA world. The contract declares max_freshness: PT15M; a runtime probe compares max(created_at) against now() every 5 minutes and pages on violation.
- The signal. A conformance test that runs on a schedule and asserts contract freshness, not a schema check.
Question. Given the same orders contract with max_freshness: PT15M, design the runtime probe that detects the 4-hour cron drift and pages the producer within 20 minutes.
Input.
| Element | Value |
|---|---|
| Contract freshness SLA | 15 minutes |
| Probe cadence | 5 minutes |
| Alert threshold | freshness > 15 minutes for 3 consecutive probes |
| Pager destination | team-checkout (producer) |
Code.
# freshness_probe.yml: a scheduled job that reads the contract SLA
# and compares against actual max(created_at)
apiVersion: v1
kind: FreshnessProbe
contract: orders.public
version: 1.4.0
sla:
  max_freshness: PT15M
probe:
  cadence: PT5M
  query: |
    SELECT
      EXTRACT(EPOCH FROM (NOW() - MAX(created_at))) AS staleness_seconds
    FROM warehouse.orders;
  alert_after_consecutive_failures: 3
alert:
  channel: pagerduty
  service: team-checkout
# probe_runner.py: parses the contract SLA and runs the probe
import time

import psycopg2
import requests
import yaml
from isodate import parse_duration


def load_sla(contract_path):
    """Load the contract YAML; the SLA lives under contract["sla"]."""
    with open(contract_path) as f:
        return yaml.safe_load(f)


def evaluate(contract, warehouse_conn_str):
    """Return (breach, staleness_seconds) for the orders table."""
    sla = parse_duration(contract["sla"]["max_freshness"])
    conn = psycopg2.connect(warehouse_conn_str)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT EXTRACT(EPOCH FROM (NOW() - MAX(created_at))) FROM warehouse.orders;")
            (staleness_s,) = cur.fetchone()
    finally:
        conn.close()  # this runs in a loop, so always release the connection
    if staleness_s is None:
        # An empty table yields NULL; treat "no data at all" as a breach.
        return True, None
    staleness_s = float(staleness_s)
    return staleness_s > sla.total_seconds(), staleness_s


def page(service):
    # routing_key is passed through as-is from the caller
    requests.post("https://events.pagerduty.com/v2/enqueue",
                  json={"routing_key": service, "event_action": "trigger",
                        "payload": {"summary": "orders freshness SLA violated",
                                    "severity": "error", "source": "contract-probe"}},
                  timeout=10)


if __name__ == "__main__":
    consecutive = 0
    while True:
        contract = load_sla("orders/contract.yaml")
        breach, staleness_s = evaluate(contract, "postgres://…")
        if breach:
            consecutive += 1
            if consecutive >= 3:
                page("team-checkout")
        else:
            consecutive = 0
        time.sleep(300)  # probe cadence: 5 minutes
Step-by-step explanation.
- The probe reads the same orders/contract.yaml the producer signed. There is exactly one source of truth for the freshness SLA, so the contract and the alert threshold cannot drift apart.
- Every 5 minutes, the probe measures actual staleness against MAX(created_at). PT15M (ISO-8601) parses to 900 seconds via isodate; the probe compares staleness_seconds > 900.
- To suppress noise from single-scrape flakes, the alert fires only after 3 consecutive breaches, which at a 5-minute cadence means the violation has persisted for at least 10 minutes. Transient blips (a slow refresh cycle finishing 30 seconds late) do not page.
- When the cron drift lands, the producer is paged roughly 10 to 15 minutes after the SLA is first breached, inside the 20-minute target. Discovery time drops from "weeks after finance reconciliation" to "before executives log in Tuesday morning."
- The contract shape means the alert code is generic: any dataset with an ODCS contract that declares sla.max_freshness gets a probe by convention. New datasets do not require a new alert; they inherit the probe.
Output.
| Metric | Pre-SLA probe | Post-SLA probe |
|---|---|---|
| Time to detect 4-hour cron drift | weeks | 20 minutes |
| Alert configuration | per-dataset ad hoc | generic, contract-driven |
| Downstream consumer confidence | low (staleness invisible) | high (SLA published) |
| Executive dashboard trust | broken by unmeasured drift | restored by probes |
Rule of thumb. A contract without runtime SLA probes is a contract only for schema. To turn a schema-contract into a full data-contract, every declared SLA (freshness, availability, quality) needs an automated probe that reads the contract and pages the producer on breach. The probe code is generic; the SLA is data.
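The paging delay implied by "15-minute SLA, 5-minute cadence, 3 consecutive breaches" is easy to get wrong by a probe interval, so it is worth working through. The sketch below is a small simulation under stated assumptions: the data stops refreshing at t = 0 and does not recover before the page, and the first probe lands some offset into the cadence. The function name minutes_until_page is hypothetical.

```python
def minutes_until_page(sla_min, cadence_min, consecutive, first_probe_offset_min=0):
    """Minutes from the last successful refresh (t = 0) until the page fires."""
    streak = 0
    t = first_probe_offset_min
    while True:
        staleness = t  # nothing has refreshed since t = 0
        if staleness > sla_min:
            streak += 1
            if streak >= consecutive:
                return t
        else:
            streak = 0
        t += cadence_min


# Try every possible alignment of the probe schedule against the last refresh.
for offset in range(5):
    paged_at = minutes_until_page(15, 5, 3, offset)
    print(f"offset {offset} min: paged at t={paged_at} min, "
          f"{paged_at - 15} min after the SLA was first breached")
```

Under these assumptions the page always lands between 26 and 30 minutes after the last refresh, which is 11 to 15 minutes after the 15-minute SLA was first exceeded.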
Worked example — the ownership handshake
Detailed explanation. Contracts fail not because the tech is wrong but because nobody signs. A common anti-pattern: the platform team writes the contract, checks it in, and considers the job done. The producer team never reviews it; when the contract is violated, the producer says "we didn't sign this" and refuses to fix. Show how the roles block plus a governance flow makes the sign-off explicit and auditable.
- The failure. Contract exists but ownership is unenforced; producer disowns it.
- The fix. The contract's roles.producer field is a CODEOWNERS entry; the producer team is a required reviewer on the contract file; the sign-off is git-history evidence.
- The escalation. If the producer refuses to sign, the platform team escalates to a data governance council; the alternative is deprecating the dataset entirely.
Question. Design the GitHub CODEOWNERS + branch-protection setup that forces producer sign-off on every contract change, and shows the audit trail.
Input.
| Component | Value |
|---|---|
| Producer team | team-checkout |
| Consumer teams | team-analytics, team-fraud, team-marketing |
| Contract file | orders/contract.yaml |
| Required reviewers | 1 from team-checkout + 1 from platform-data |
Code.
# .github/CODEOWNERS
# The producer team owns the contract file; every change must be signed by them.
orders/contract.yaml @company/team-checkout @company/platform-data

# .github/branch-protection.yml: enforced via probot or Terraform
branch: main
required_status_checks:
  - contract-lint
  - registry-compat-check
required_pull_request_reviews:
  required_approving_review_count: 2
  require_code_owner_reviews: true
enforce_admins: true

# audit-log-example: captured from GitHub's PR API and archived to warehouse
pr:
  number: 4217
  title: "orders: add customer_email column, bump to 1.5.0"
  author: producer-eng-1
  approvers:
    - alice@company (team-checkout, code owner)
    - bob@company (platform-data, code owner)
  merged_at: 2026-07-01T14:22:03Z
  ci_status: contract-lint=pass registry-compat-check=pass
Step-by-step explanation.
- CODEOWNERS binds the contract file to two GitHub teams: team-checkout (the producer) and platform-data (the steward). Note that GitHub treats multiple owners on one line as alternatives, so an approval from any one code owner satisfies the code-owner requirement, and requiring two approving reviews does not by itself guarantee one from each team. To require both teams, add a check (for example a required status check) that inspects the approvers.
- Branch protection enforces the code-owner review requirement plus the two CI checks (contract-lint and registry-compat-check). Admins cannot bypass because enforce_admins: true.
- Every contract change now has a git-history sign-off from the producer team. If the producer later disowns a violation, the audit log shows the exact reviewer who approved the contract at that version. Escalation goes to a governance council with evidence.
- Consumers do not need to review the contract file — they subscribe to it via releases. Requiring consumer sign-off on every contract change would slow the producer to a crawl; the deprecation window is how consumer interests are protected instead.
- The audit trail is archived to the warehouse for compliance queries — "who approved orders.public v1.4 → v1.5" is a SQL question, not a "please check GitHub" ticket.
Output.
| Governance layer | Enforcement mechanism |
|---|---|
| Producer sign-off | CODEOWNERS + required review |
| Steward sign-off | CODEOWNERS + required review |
| Schema safety | contract-lint CI gate |
| Registry safety | registry-compat-check CI gate |
| Audit trail | git history + warehouse archive |
Rule of thumb. A contract without an owner is a wish. Bind the contract file to a CODEOWNERS entry, require review from that team, and archive the audit trail. The producer cannot claim ignorance; the consumer has a paper trail; the platform team can escalate with evidence.
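GitHub accepts an approval from any listed code owner, so the "1 from team-checkout + 1 from platform-data" requirement needs its own check. A minimal sketch of that check follows; it assumes the approver logins and team memberships have already been fetched (for example by a CI step), and the names missing_signoffs and TEAM_MEMBERS as well as the membership data are hypothetical.

```python
def missing_signoffs(approvers, team_members):
    """Return the teams that have no approving member, sorted by name."""
    approved = set(approvers)
    return sorted(team for team, members in team_members.items()
                  if not (members & approved))


# Hypothetical membership snapshot for the two required teams.
TEAM_MEMBERS = {
    "team-checkout": {"alice", "producer-eng-1"},
    "platform-data": {"bob"},
}

print(missing_signoffs(["alice", "bob"], TEAM_MEMBERS))    # both teams signed
print(missing_signoffs(["alice", "carol"], TEAM_MEMBERS))  # steward missing
```

Wired into CI as a required status check, a non-empty result would fail the job and keep the merge button disabled until both teams have approved.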
Senior interview question on the four axes of a data contract
A senior interviewer often opens with: "You're introducing data contracts to an organisation that has never had them. Walk me through the four axes you'd insist every contract cover, the enforcement gates, the rollout phases, and the failure modes you'd guard against in the first quarter."
Solution Using the four-axis contract + phased rollout + dual-gate enforcement
# The four-axis contract template: every dataset gets one of these
apiVersion: v3.0.0
kind: DataContract
id: <domain>.<name>
version: <semver>
info:
  title: <human title>
  owner: <team-slug>
  status: active | deprecated | draft

# 1. SCHEMA: the shape
schema:
  - name: <column>
    type: <STRING | INT | DECIMAL | TIMESTAMP | STRUCT | ARRAY>
    required: <bool>
    unique: <bool>
    pii: <bool>
    description: <text>
    deprecated: <bool>

# 2. SLA: the timing and availability
sla:
  max_freshness: <ISO-8601 duration>
  availability_pct: <float 0..100>
  query_latency_p99: <ISO-8601 duration>

# 3. QUALITY: the assertions
quality:
  - assertion: <expression>
    severity: warn | block
    rollout: warn_only | block_on_new | block_all

# 4. OWNERSHIP: the humans
roles:
  producer: <team-slug>
  consumers:
    - <team-slug>
  steward: <team-slug>
  on_call_rotation: <schedule>
Rollout phases: introducing contracts to an org
================================================
Phase 1 (weeks 0-4): warn-only
  All contracts run in observe mode; violations logged, no blocking
  Purpose: discover how bad the current state is; shame nobody

Phase 2 (weeks 4-8): block on new datasets
  Every net-new dataset requires a signed contract before launch
  Existing datasets stay in warn-only

Phase 3 (weeks 8-16): block on new contract violations
  Existing datasets: any producer PR that would violate the contract is blocked
  Existing violations grandfathered with a 30-day deprecation timer

Phase 4 (week 16+): block everywhere
  Every dataset has a signed contract; every PR is gated
  Runtime SLA probes cover every declared freshness/availability threshold
Step-by-step trace.
| Axis | Enforcement point | Failure mode caught |
|---|---|---|
| Schema | contract-lint CI + registry-compat-check | column rename / drop / type narrow |
| SLA — freshness | runtime probe, pages producer | cron drift, backfill hang |
| SLA — availability | uptime probe, external synthetic | source system outage |
| Quality — assertion | dbt test / Great Expectations at write | null spike, uniqueness break |
| Ownership | CODEOWNERS + required review | disowned contract, missing sign-off |
The four-axis template plus phased rollout ships to production without organisational whiplash. Warn-only reveals how bad the current state is (usually shocking); the phased blocking lets producers migrate at their own pace; by phase 4 every dataset is contracted and every SLA is probed.
Output:
| Quarter milestone | Result |
|---|---|
| End of week 4 | Warn-only visibility live; violation baseline measured |
| End of week 8 | New datasets 100% contracted |
| End of week 16 | Existing datasets 100% contracted; PR gating live |
| End of quarter | Runtime probes on every SLA; incident rate down ~70% |
Why this works — concept by concept:
- Four axes are load-bearing: schema alone is a subset of a contract, not the whole thing. SLA, quality, and ownership carry the same weight; treating any of them as optional creates the exact incidents contracts were introduced to prevent.
- Dual-gate enforcement: CI catches the producer before the change ships; runtime probes catch the payload after the change ships. Neither gate is sufficient alone. CI protects against intent; runtime protects against reality.
- Phased rollout: going straight from "no contracts" to "block everywhere" produces a wave of overnight breakages and destroys organisational trust. The four-phase ladder (warn → block-new → block-existing → block-all) lets the org adapt without pain.
- CODEOWNERS is the ownership primitive: the contract's roles block is documentation; the CODEOWNERS binding is enforcement. Producers cannot claim ignorance when git history shows their team-lead approved the contract.
- Cost: contract files are cheap (a few hundred lines of YAML per dataset); the enforcement infrastructure is one-time (CI actions, a probe runner, a lint tool). The avoided cost of one "silent-rename-broke-twelve-models" incident (~40 engineer hours) pays for the entire quarter's setup.
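The four-phase ladder boils down to a small decision function that the CI gate can consult before choosing to warn or block. The sketch below is one assumed encoding of the phases described above; the enum members, the gate_decision name, and the two boolean inputs are hypothetical.

```python
from enum import Enum


class Phase(Enum):
    WARN_ONLY = 1             # weeks 0-4
    BLOCK_NEW_DATASETS = 2    # weeks 4-8
    BLOCK_NEW_VIOLATIONS = 3  # weeks 8-16
    BLOCK_EVERYWHERE = 4      # week 16+


def gate_decision(phase, *, dataset_is_new, violation_is_grandfathered):
    """Return "warn" or "block" for a contract violation in the given phase."""
    if phase is Phase.WARN_ONLY:
        return "warn"
    if phase is Phase.BLOCK_NEW_DATASETS:
        return "block" if dataset_is_new else "warn"
    if phase is Phase.BLOCK_NEW_VIOLATIONS:
        if dataset_is_new:
            return "block"
        # Pre-existing violations keep warning until their deprecation timer ends.
        return "warn" if violation_is_grandfathered else "block"
    return "block"  # BLOCK_EVERYWHERE


for phase in Phase:
    decision = gate_decision(phase, dataset_is_new=False,
                             violation_is_grandfathered=True)
    print(f"{phase.name}: existing dataset, grandfathered violation -> {decision}")
```

Keeping the phase as configuration rather than code means moving the whole organisation up a rung is a one-line change.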
2. Open Data Contract Standard (ODCS) — the schema
One YAML per producer — the ODCS v3.x shape that drives every downstream tool
The mental model in one line: ODCS v3.x is a vendor-neutral YAML schema whose top-level blocks are info, schema, sla, quality, roles, and servers — one file per dataset, checked into the producer's repo, and consumed by every downstream tool (dbt contracts, Confluent Schema Registry, Great Expectations, DataHub, Backstage) as the single source of truth. Every ODCS interview question is a variant of "walk me through the file" or "what's the semver rule for a breaking change."
The top-level blocks.
- apiVersion + kind + id + version. The four identity fields. apiVersion: v3.0.0 locks the schema shape; kind: DataContract disambiguates from other Bitol kinds; id is a stable dotted identifier (orders.public); version is semver.
- info. Human metadata: title, owner, status (draft / active / deprecated), description, tags. Rendered by DataHub / Backstage as the catalogue entry.
- schema. The column list. Each entry is name + type + required + unique + pii + description + deprecated. This block is the source that dbt contracts and Avro / Protobuf generators consume.
- sla. Freshness (max_freshness), availability (availability_pct), query latency (query_latency_p99), and retention (retention: P90D). All ISO-8601 durations and percentages.
- quality. List of assertion + severity + rollout entries. Each assertion is a boolean expression the producer commits to keep true. Severity is warn or block; rollout is the phase (warn_only, block_on_new, block_all).
- roles. Producer + consumers + steward + on-call rotation. Free-form YAML; tools parse into service catalogues.
- servers. Physical resolution: where the data lives. Warehouse table (snowflake://…), Kafka topic (kafka://cluster/topic), object store (s3://bucket/prefix). Multiple entries when the same logical dataset has multiple materialisations.
The version semantics.
- Patch (1.4.0 → 1.4.1). Description-only change; no schema, SLA, or quality change.
- Minor (1.4.0 → 1.5.0). Additive change: new column with required: false, new quality assertion with severity: warn, new consumer added to roles.consumers. Backwards compatible; existing consumers are unaffected.
- Major (1.4.0 → 2.0.0). Breaking change: column removed, type narrowed, required: false → required: true, SLA tightened (freshness reduced), or quality severity raised. Consumers must migrate; a deprecation window is required.
- The lint gate. odcs-lint diff --base main inspects the diff between the incoming PR and main; if the change is breaking and the version bump is not major, the CI fails.
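The lint gate's final step, checking that the declared version bump is large enough for the kind of change, can be sketched in a few lines. This is an illustrative assumption about how such a check might work, not the lint tool's implementation; the names parse_version and bump_is_sufficient and the change kinds ("breaking", "additive", "docs") are hypothetical.

```python
def parse_version(version):
    """Turn "1.4.0" into (1, 4, 0)."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def bump_is_sufficient(old, new, change_kind):
    """True when the old -> new bump is at least what `change_kind` requires."""
    o, n = parse_version(old), parse_version(new)
    if change_kind == "breaking":
        return n[0] > o[0]       # needs a major bump
    if change_kind == "additive":
        return n[:2] > o[:2]     # needs at least a minor bump
    if change_kind == "docs":
        return n > o             # any forward bump will do
    raise ValueError(f"unknown change kind: {change_kind}")


checks = [
    ("1.4.0", "1.5.0", "breaking"),  # column removed but only a minor bump
    ("1.4.0", "2.0.0", "breaking"),
    ("1.4.0", "1.5.0", "additive"),
    ("1.4.0", "1.4.1", "additive"),  # new column hidden in a patch bump
    ("1.4.0", "1.4.1", "docs"),
]
for old, new, kind in checks:
    verdict = "ok" if bump_is_sufficient(old, new, kind) else "CI fails"
    print(f"{old} -> {new} ({kind}): {verdict}")
```

Tuple comparison does the heavy lifting: (1, 5) > (1, 4) covers the minor bump, and a major bump satisfies the additive rule as well.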
The type system.
- Primitives. STRING, BOOLEAN, INT (with subtypes TINYINT / SMALLINT / BIGINT), FLOAT, DOUBLE, DECIMAL(p,s), TIMESTAMP (with optional timezone), DATE, TIME.
- Structured. STRUCT<...> for nested records; ARRAY<T> for lists; MAP<K,V> for maps. Nested schemas can carry their own pii flags and required semantics.
- Semantic tags. format: email, format: iso-country-code, pii: true, sensitive: true. Used by DLP tooling and catalogues.
- Bindings. Each primitive has a canonical mapping into Avro, Protobuf, JSON Schema, and warehouse SQL types. ODCS lint verifies the binding is unambiguous.
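As an illustration of what a binding looks like, the sketch below maps a handful of the primitives above onto Avro types and renders a record schema. The Avro type and logicalType names are standard Avro; the particular mapping choices (for example TIMESTAMP as timestamp-micros, optional columns as a null union) and the function names avro_type and to_avro_record are assumptions made for this example, not a mapping prescribed by the standard.

```python
import json
import re

PRIMITIVES = {
    "STRING": "string",
    "BOOLEAN": "boolean",
    "INT": "int",
    "BIGINT": "long",
    "FLOAT": "float",
    "DOUBLE": "double",
    "DATE": {"type": "int", "logicalType": "date"},
    "TIMESTAMP": {"type": "long", "logicalType": "timestamp-micros"},
}
DECIMAL = re.compile(r"^DECIMAL\((\d+),\s*(\d+)\)$")


def avro_type(contract_type):
    """Map one contract primitive to its Avro type, or raise if unmapped."""
    match = DECIMAL.match(contract_type)
    if match:
        return {"type": "bytes", "logicalType": "decimal",
                "precision": int(match.group(1)), "scale": int(match.group(2))}
    if contract_type not in PRIMITIVES:
        raise ValueError(f"no Avro binding for {contract_type}")
    return PRIMITIVES[contract_type]


def to_avro_record(record_name, columns):
    """Render a contract column list as an Avro record schema (a dict)."""
    fields = []
    for col in columns:
        base = avro_type(col["type"])
        if col.get("required", False):
            fields.append({"name": col["name"], "type": base})
        else:
            # Optional columns become a nullable union with a null default.
            fields.append({"name": col["name"], "type": ["null", base],
                           "default": None})
    return {"type": "record", "name": record_name, "fields": fields}


columns = [
    {"name": "order_id", "type": "STRING", "required": True},
    {"name": "order_total", "type": "DECIMAL(12,2)", "required": True},
    {"name": "coupon_code", "type": "STRING", "required": False},
]
print(json.dumps(to_avro_record("Orders", columns), indent=2))
```

Raising on an unmapped type, rather than guessing, is what keeps the binding unambiguous: structured types such as STRUCT or MAP would need their own explicit rules.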
Common interview probes on ODCS.
- "Walk me through the top-level blocks." — required.
- "What's the semver rule for adding a new optional column?" — minor bump; additive.
- "What's the rule for tightening a freshness SLA?" — major bump; consumers may not survive tighter constraints.
- "How does ODCS relate to Schema Registry?" — ODCS is the source; Schema Registry is a derived materialisation for streaming.
Worked example — the 40-line orders contract
Detailed explanation. The canonical example: a public orders dataset produced by the checkout team and consumed by three downstream teams. Show the full 40-line ODCS YAML with every block filled in, and explain each field.
- The dataset. One row per confirmed order, materialised in Snowflake and mirrored to a Kafka topic.
- The SLA. 15-minute freshness (dashboards refresh every 15 min), 99.9% availability, retention 90 days.
- The quality. Non-null order_id, unique order_id, order_total >= 0, country_code in ISO-3166.
Question. Produce the full ODCS YAML for this dataset with each block explicitly filled and inline comments explaining the interviewer-signal fields.
Input.
| Attribute | Value |
|---|---|
| Producer | team-checkout |
| Consumers | team-analytics, team-fraud, team-marketing |
| Storage | Snowflake (analytics.public.orders), Kafka (orders.public.v1) |
| Freshness | 15 minutes |
| Availability | 99.9% |
| Retention | 90 days |
Code.
# orders/contract.yaml: the 40-line reference
apiVersion: v3.0.0
kind: DataContract
id: orders.public
version: 1.4.0
info:
  title: Orders - public model
  owner: team-checkout
  status: active
  description: One row per confirmed order in the checkout flow.
  tags: [core, revenue, tier1]
schema:
  - name: order_id
    type: STRING
    required: true
    unique: true
    description: Stable id of the order.
  - name: user_id
    type: STRING
    required: true
    pii: true
    description: FK to users.public.
  - name: order_total
    type: DECIMAL(12,2)
    required: true
    description: Order total in USD.
  - name: country_code
    type: STRING
    required: true
    format: iso-country-code
    description: ISO-3166 alpha-2.
  - name: created_at
    type: TIMESTAMP
    required: true
    description: UTC creation timestamp.
sla:
  max_freshness: PT15M      # ISO-8601 duration: 15 minutes
  availability_pct: 99.9
  retention: P90D           # 90 days
quality:
  - assertion: null_rate(order_id) == 0
    severity: block
    rollout: block_all
  - assertion: unique(order_id)
    severity: block
    rollout: block_all
  - assertion: order_total >= 0
    severity: block
    rollout: block_on_new
  - assertion: country_code IN ('ISO-3166 list')
    severity: warn
    rollout: warn_only
roles:
  producer: team-checkout
  consumers: [team-analytics, team-fraud, team-marketing]
  steward: platform-data
  on_call_rotation: pagerduty/team-checkout-primary
servers:
  - type: snowflake
    dsn: snowflake://analytics.public.orders
  - type: kafka
    dsn: kafka://prod-cluster/orders.public.v1
Step-by-step explanation.
- The `apiVersion` and `kind` lock the file's shape to ODCS v3.x. Every ODCS-aware tool (lint, dbt-contract-generator, registry-sync) reads these two fields first and refuses non-matching files.
- `id: orders.public` is the stable identifier: the producer never renames it; only the version changes. Downstream tools reference this id in their model definitions, so changing it would break every reference in the org.
- `version: 1.4.0` is semver. The .4 minor version reflects four additive changes since 1.0.0; if the producer removes `country_code` next, they must bump to 2.0.0 and offer a deprecation window.
- The `schema` block enumerates every column with type, required-ness, and PII flag. `user_id` carries `pii: true`; DLP tooling reads this and applies masking rules automatically. `order_total` uses `DECIMAL(12,2)`, an exact monetary type, never `FLOAT`.
- `sla.max_freshness: PT15M` is the ISO-8601 duration. The freshness probe reads this exactly and asserts `now() - MAX(created_at) < PT15M`. `availability_pct: 99.9` sets a monthly uptime target: ~43 minutes of downtime per month is the budget.
- Each `quality` assertion carries a severity and rollout phase. Block-severity assertions fail the producer's write; warn-severity assertions log and route to a dashboard. Rollout phases let the producer introduce a new assertion in `warn_only` mode first, promote to `block_on_new` for freshly-arriving rows, and finally `block_all` for existing rows once cleanup is done.
- `roles.producer: team-checkout` is the CODEOWNERS anchor. The consumers list is the notification target for release announcements; the steward is the neutral third party who mediates escalations; the on-call rotation is where the freshness-SLA pager fires.
- `servers` maps the logical dataset to physical resources. The same `orders.public` contract covers both a Snowflake table and a Kafka topic; consumers can subscribe to either surface and get the same guarantees.
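The freshness and availability figures in the `sla` block can be sanity-checked with a few lines of Python. This is a minimal sketch, not the probe any ODCS tool ships: `parse_duration`, `is_fresh`, and `monthly_downtime_budget_minutes` are hypothetical helper names, the parser covers only the day/hour/minute/second subset of ISO-8601 durations used in this contract, and a 30-day month is assumed for the downtime budget.

```python
import re
from datetime import datetime, timedelta, timezone

# Subset of ISO-8601 durations: P<n>D and/or T<n>H<n>M<n>S (assumption: no weeks, months, years).
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def parse_duration(text: str) -> timedelta:
    """Parse durations such as 'PT15M' or 'P90D' into a timedelta."""
    match = _DURATION.match(text)
    parts = {k: int(v) for k, v in match.groupdict().items() if v is not None} if match else {}
    if not parts:
        raise ValueError(f"unsupported ISO-8601 duration: {text!r}")
    return timedelta(**parts)


def is_fresh(max_created_at: datetime, max_freshness: str, now: datetime) -> bool:
    """True when now - MAX(created_at) is strictly below the contract's max_freshness."""
    return now - max_created_at < parse_duration(max_freshness)


def monthly_downtime_budget_minutes(availability_pct: float, days_in_month: int = 30) -> float:
    """Minutes of allowed downtime per month for a given availability target."""
    return days_in_month * 24 * 60 * (1 - availability_pct / 100)


now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
print(is_fresh(now - timedelta(minutes=10), "PT15M", now))   # True
print(is_fresh(now - timedelta(minutes=20), "PT15M", now))   # False
print(round(monthly_downtime_budget_minutes(99.9), 1))       # 43.2
```

Passing `now` explicitly keeps the check deterministic in tests; a real probe would take `MAX(created_at)` from the warehouse or topic.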
Output.
| Block | Fields | Interviewer signal |
|---|---|---|
| info | title, owner, status, tags | core discovery metadata |
| schema | column list with types + pii | binds Avro / dbt contracts |
| sla | freshness + availability + retention | runtime probes drive from here |
| quality | assertions + severity + rollout | producer commitments |
| roles | producer + consumers + steward | governance anchor |
| servers | physical materialisation | consumer resolution target |
Rule of thumb. Every ODCS contract fits on a screen. If yours is 400 lines, you're over-modelling — split into multiple contracts per logical dataset. The 40-line reference above covers 90% of production tables; the exceptions are dimensional models with lots of columns.
Worked example — semver bump decisions
Detailed explanation. The single most common interview probe on ODCS: "given this change, what's the version bump?" Walk through eight realistic changes and classify each.
- Rule 1. Additive + optional = minor.
- Rule 2. Additive + required = major.
- Rule 3. Any column drop = major.
- Rule 4. Type narrowing = major; type widening = minor.
- Rule 5. Tightening SLA = major; loosening SLA = minor.
Question. Given eight changes, classify each as patch, minor, or major, with a one-line justification.
Input.
| # | Change |
|---|---|
| 1 | Add optional column `promo_code` |
| 2 | Add required column `channel` (no default) |
| 3 | Remove column `legacy_ref` |
| 4 | Change `order_total` from DECIMAL(10,2) to DECIMAL(12,2) |
| 5 | Change `order_total` from DECIMAL(12,2) to DECIMAL(10,2) |
| 6 | Tighten `max_freshness` from PT1H to PT15M |
| 7 | Loosen `max_freshness` from PT15M to PT1H |
| 8 | Fix a typo in `description` |
Code.
# Diff shape 1: additive optional (minor 1.4 → 1.5)
+  - name: promo_code
+    type: STRING
+    required: false
# Diff shape 2: additive required (major 1.4 → 2.0)
+  - name: channel
+    type: STRING
+    required: true   # existing consumers have no default → breaking
# Diff shape 3: column drop (major 1.4 → 2.0)
-  - name: legacy_ref
-    type: STRING
# Diff shape 4: type widen (minor 1.4 → 1.5)
-  - name: order_total
-    type: DECIMAL(10,2)
+  - name: order_total
+    type: DECIMAL(12,2)   # larger precision → readers survive
# Diff shape 5: type narrow (major 1.4 → 2.0)
-  - name: order_total
-    type: DECIMAL(12,2)
+  - name: order_total
+    type: DECIMAL(10,2)   # smaller precision → risk of truncation
# Diff shape 6: tighten SLA (major)
-sla:
-  max_freshness: PT1H
+sla:
+  max_freshness: PT15M
# Diff shape 7: loosen SLA (minor)
-sla:
-  max_freshness: PT15M
+sla:
+  max_freshness: PT1H
# Diff shape 8: description typo (patch 1.4.0 → 1.4.1)
-    description: Order totall in USD.
+    description: Order total in USD.
Step-by-step explanation.
- Change 1 adds an optional column — every existing consumer's query still works because the new column is unreferenced. Minor bump; no deprecation window needed.
- Change 2 adds a required column with no default. Existing producer writes that don't set the column will fail; existing consumer queries that reference the schema (Avro / Protobuf) will have a new required field with no old-data value. Major bump; producer must also backfill or provide a `default: <value>`.
- Change 3 removes a column. Any consumer referencing it breaks. Major bump; the correct pattern is deprecate first (mark `deprecated: true` in 1.5.0), then remove in 2.0.0 after the deprecation window (typically 30–90 days).
- Change 4 widens `DECIMAL(10,2)` to `DECIMAL(12,2)`. Every value that fit the old type still fits the new type. Minor bump; consumers whose downstream types are inferred at read time survive without action.
- Change 5 narrows the same field. Existing values with 11 or 12 digits of precision will truncate or throw at write time. Major bump; producer must migrate historical data or accept the loss.
- Change 6 tightens the freshness SLA. Consumers that tolerated up to 1 hour of staleness are not broken by fresher data, but the producer's promised behaviour changes, and downstream SLAs derived from it (for example a chain of "upstream SLA plus a 5-minute buffer") may need recalculating. The convention used in this guide therefore treats tightening as a major bump.
- Change 7 loosens the SLA, so data may be older than before. Every consumer's downstream freshness expectations degrade, and from the consumer's perspective that can be breaking (their own downstream SLAs may be violated). Under the convention used here, loosening is still a minor bump: the producer's promise stays truthful, though weaker, and consumers can renegotiate.
- Change 8 is a description fix. No schema, SLA, or quality change. Patch bump.
Output.
| # | Change | Bump | Why |
|---|---|---|---|
| 1 | Add optional column | minor (1.5.0) | additive, non-breaking |
| 2 | Add required column | major (2.0.0) | old writes fail |
| 3 | Remove column | major (2.0.0) | consumers break |
| 4 | Widen type | minor (1.5.0) | existing values fit |
| 5 | Narrow type | major (2.0.0) | truncation risk |
| 6 | Tighten SLA | major (2.0.0) | promised behaviour changes |
| 7 | Loosen SLA | minor (1.5.0) | promise weaker but truthful |
| 8 | Typo in description | patch (1.4.1) | metadata only |
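Rules 1 to 5 are mechanical enough to sketch as a classifier. The code below is an illustration under stated assumptions, not the logic of any real ODCS tool: contracts are plain dicts, `classify` and its helpers are hypothetical names, only `DECIMAL(p,s)` widening is recognised as a safe type change, and durations are limited to `PT<n>M` / `PT<n>H`.

```python
import re

RANK = {"patch": 0, "minor": 1, "major": 2}


def _decimal(type_name):
    """Return (precision, scale) for DECIMAL(p,s), else None."""
    m = re.fullmatch(r"DECIMAL\((\d+),\s*(\d+)\)", type_name)
    return (int(m.group(1)), int(m.group(2))) if m else None


def _minutes(duration):
    """Convert PT<n>M or PT<n>H to minutes (the only forms this sketch supports)."""
    m = re.fullmatch(r"PT(\d+)([HM])", duration)
    if not m:
        raise ValueError(f"unsupported duration: {duration!r}")
    return int(m.group(1)) * (60 if m.group(2) == "H" else 1)


def _type_change(old_type, new_type):
    old_dec, new_dec = _decimal(old_type), _decimal(new_type)
    if old_dec and new_dec and new_dec[1] == old_dec[1] and new_dec[0] >= old_dec[0]:
        return "minor"  # Rule 4: widening keeps every old value representable
    return "major"      # Rule 4: narrowing, or a change this sketch cannot prove safe


def classify(old, new):
    """Return the smallest semver bump ('patch', 'minor', 'major') that covers the diff."""
    bumps = ["patch"]  # metadata-only edits such as descriptions stay at patch
    old_cols = {c["name"]: c for c in old["schema"]}
    new_cols = {c["name"]: c for c in new["schema"]}
    if old_cols.keys() - new_cols.keys():
        bumps.append("major")                                  # Rule 3: column drop
    for name in new_cols.keys() - old_cols.keys():
        col = new_cols[name]
        breaking = col.get("required") and "default" not in col
        bumps.append("major" if breaking else "minor")         # Rules 1 and 2
    for name in old_cols.keys() & new_cols.keys():
        if old_cols[name]["type"] != new_cols[name]["type"]:
            bumps.append(_type_change(old_cols[name]["type"], new_cols[name]["type"]))
    old_sla, new_sla = old["sla"]["max_freshness"], new["sla"]["max_freshness"]
    if old_sla != new_sla:                                     # Rule 5
        bumps.append("major" if _minutes(new_sla) < _minutes(old_sla) else "minor")
    return max(bumps, key=RANK.get)


base = {
    "schema": [
        {"name": "order_id", "type": "STRING", "required": True},
        {"name": "legacy_ref", "type": "STRING", "required": False},
        {"name": "order_total", "type": "DECIMAL(10,2)", "required": True,
         "description": "Order totall in USD."},
    ],
    "sla": {"max_freshness": "PT1H"},
}


def with_column(column):
    return {**base, "schema": base["schema"] + [column]}


dropped = {**base, "schema": [c for c in base["schema"] if c["name"] != "legacy_ref"]}
print(classify(base, with_column({"name": "promo_code", "type": "STRING", "required": False})))  # minor
print(classify(base, with_column({"name": "channel", "type": "STRING", "required": True})))      # major
print(classify(base, dropped))                                                                   # major
```

Taking the maximum across all detected changes mirrors the "when in doubt, bump major" stance: one breaking change in a PR decides the bump regardless of how many harmless ones accompany it.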
Rule of thumb. When in doubt, bump major and offer a deprecation window. The cost of a major bump is one PR; the cost of an unannounced breaking change is a war room. Producers never regret a conservative bump.
Worked example — the ODCS lint tool and its diff mode
Detailed explanation. The lint tool is the CI muscle that enforces the semver rules mechanically. Show its two modes — shape validation (does the YAML parse as a valid ODCS v3.x contract?) and diff validation (does the incoming PR conform to semver?) — and the failure output shape.
- Shape mode. `odcs-lint <path>` verifies the file parses, all required blocks are present, types are canonical, and the semver format is valid.
- Diff mode. `odcs-lint diff --base main --path <file>` compares the PR file against main; classifies changes as patch / minor / major; asserts the version bump matches.
- Exit codes. 0 = pass; 1 = shape error; 2 = version mismatch.
Question. Show a producer PR that mistakenly removes a column while bumping only the minor version, and the lint failure output that catches it.
Input.
| Element | Value |
|---|---|
| Baseline version | 1.4.0 |
| PR version | 1.5.0 (should be 2.0.0) |
| Change | removed column `legacy_ref` |
Code.
# contract.yaml on main (before PR)
version: 1.4.0
schema:
  - name: order_id
    type: STRING
  - name: legacy_ref
    type: STRING
  - name: order_total
    type: DECIMAL(12,2)
# contract.yaml in PR (buggy: minor bump but breaking change)
version: 1.5.0
schema:
  - name: order_id
    type: STRING
  # legacy_ref removed: breaking change!
  - name: order_total
    type: DECIMAL(12,2)
$ odcs-lint diff --base main --path orders/contract.yaml
[odcs-lint v0.9.0] validating orders/contract.yaml
SHAPE: pass — file conforms to apiVersion v3.0.0
DIFF:
changes detected:
- schema.remove: column "legacy_ref" removed → severity: BREAKING
version bump: 1.4.0 → 1.5.0 (minor)
required bump for BREAKING: major
FAIL: version bump (minor) does not match change severity (BREAKING).
Options:
(a) bump to 2.0.0 and add a deprecation window;
(b) restore column "legacy_ref";
(c) mark "legacy_ref" as deprecated: true and keep for one release cycle.
exit code: 2
Step-by-step explanation.
- `odcs-lint` first runs shape validation: it checks the file parses as YAML, has the required top-level blocks, and every schema entry has the mandatory fields. This step catches typos and structural mistakes.
- In `diff` mode, the tool loads both versions (main and PR), computes the set difference on the schema block, and classifies each change. Removing an entry is classified `BREAKING`; adding an optional entry is `MINOR`; changing a description is `PATCH`.
- The tool computes the maximum severity across all changes: `BREAKING`. It then compares to the version bump: `1.4.0 → 1.5.0` is a minor bump. The maximum severity requires a major bump. Mismatch. Exit code 2.
- The failure message lists three remediation options in decreasing order of preference. Option (a) is the correct one for genuine intentional breaking change; option (b) is for accidental removals; option (c) is the deprecation path that gives consumers a migration window.
- The GitHub required-status-check consumes exit code 2 as a failure. The merge button is disabled; the PR author must pick one of the three options before the check can pass.
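The comparison at the heart of diff mode (declared bump versus required bump) fits in a few lines. This is a hedged sketch of that one step only: `bump_kind` and `lint_diff` are hypothetical names, shape validation (exit code 1) is not modelled, and it assumes a bump larger than strictly required is allowed to pass.

```python
RANK = {"none": 0, "patch": 1, "minor": 2, "major": 3}
REQUIRED_BUMP = {"PATCH": "patch", "MINOR": "minor", "BREAKING": "major"}


def bump_kind(old_version: str, new_version: str) -> str:
    """Classify the declared version change as none / patch / minor / major."""
    old = tuple(int(p) for p in old_version.split("."))
    new = tuple(int(p) for p in new_version.split("."))
    if new <= old:
        return "none"
    if new[0] != old[0]:
        return "major"
    if new[1] != old[1]:
        return "minor"
    return "patch"


def lint_diff(old_version: str, new_version: str, severities: list) -> int:
    """Return 0 when the declared bump covers the worst change severity, else 2."""
    required = max((REQUIRED_BUMP[s] for s in severities), key=RANK.get, default="none")
    actual = bump_kind(old_version, new_version)
    return 0 if RANK[actual] >= RANK[required] else 2


print(lint_diff("1.4.0", "1.5.0", ["BREAKING"]))  # 2: the buggy PR from the example
print(lint_diff("1.4.0", "2.0.0", ["BREAKING"]))  # 0: option (a), bump to major
print(lint_diff("1.4.0", "1.5.0", ["MINOR"]))     # 0: additive change, minor bump
```

In CI the return value would become the process exit code, which is what the required status check consumes.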
Output.
| Lint stage | Result | Action |
|---|---|---|
| Shape | pass | continue |
| Diff — remove column | BREAKING | require major |
| Version bump | 1.5.0 (minor) | mismatch |
| Overall | FAIL exit 2 | PR blocked |
Rule of thumb. The lint tool is the mechanical enforcer of the semver rules. Trusting humans to bump correctly is a bug waiting to happen — every producer eventually forgets, and the diff-classifier is the safety net. Ship odcs-lint diff as a required status check on day one.
Senior interview question on the ODCS shape and versioning
A senior interviewer might ask: "Design the ODCS contract for a payments dataset that includes card-network-classified PII, has a 5-minute freshness SLA, is consumed by a fraud model with strict quality requirements, and needs to go through a schema migration that renames card_bin to card_iin. Walk me through the initial contract, the deprecation contract, and the final contract."
Solution Using an initial v1 + a deprecation-window v1.5 + a final v2 with the rename
# ---------- v1.0.0: initial contract ----------
apiVersion: v3.0.0
kind: DataContract
id: payments.public
version: 1.0.0
info:
  title: Payments - public model
  owner: team-payments
  status: active
schema:
  - name: payment_id
    type: STRING
    required: true
    unique: true
  - name: card_bin
    type: STRING
    required: true
    pii: true
    sensitive: true
    description: First 6 digits of card number (Bank Identification Number).
  - name: amount
    type: DECIMAL(12,2)
    required: true
  - name: created_at
    type: TIMESTAMP
    required: true
sla:
  max_freshness: PT5M
  availability_pct: 99.95
quality:
  - assertion: null_rate(payment_id) == 0
    severity: block
    rollout: block_all
  - assertion: unique(payment_id)
    severity: block
    rollout: block_all
  - assertion: amount > 0
    severity: block
    rollout: block_all
roles:
  producer: team-payments
  consumers: [team-fraud, team-finance]
  steward: platform-data
servers:
  - type: kafka
    dsn: kafka://prod-cluster/payments.public.v1
---
# ---------- v1.5.0: deprecation-window contract ----------
apiVersion: v3.0.0
kind: DataContract
id: payments.public
version: 1.5.0
info:
  status: active
  description: card_iin introduced as canonical name for BIN; card_bin retained for backward compatibility until 2.0.0 (deprecation window 45 days).
schema:
  - name: payment_id
    type: STRING
    required: true
    unique: true
  - name: card_bin
    type: STRING
    required: true
    pii: true
    sensitive: true
    deprecated: true   # signal to consumers
    description: DEPRECATED. Use card_iin. Removed in 2.0.0 on 2026-08-16.
  - name: card_iin
    type: STRING
    required: false    # optional in 1.5.0 so producers can migrate at their pace
    pii: true
    sensitive: true
    description: Issuer Identification Number (canonical name for what was BIN).
  - name: amount
    type: DECIMAL(12,2)
    required: true
  - name: created_at
    type: TIMESTAMP
    required: true
# sla and block-severity quality checks unchanged; one warn-only migration assertion added
sla:
  max_freshness: PT5M
  availability_pct: 99.95
quality:
  - assertion: null_rate(payment_id) == 0
    severity: block
    rollout: block_all
  - assertion: unique(payment_id)
    severity: block
    rollout: block_all
  - assertion: amount > 0
    severity: block
    rollout: block_all
  - assertion: card_bin == card_iin OR card_iin IS NULL
    severity: warn
    rollout: warn_only   # producer commits to both being equal during migration
roles:
  producer: team-payments
  consumers: [team-fraud, team-finance]
---
# ---------- v2.0.0: final contract with rename complete ----------
apiVersion: v3.0.0
kind: DataContract
id: payments.public
version: 2.0.0
info:
  status: active
schema:
  - name: payment_id
    type: STRING
    required: true
    unique: true
  - name: card_iin   # canonical; card_bin removed
    type: STRING
    required: true
    pii: true
    sensitive: true
  - name: amount
    type: DECIMAL(12,2)
    required: true
  - name: created_at
    type: TIMESTAMP
    required: true
sla:
  max_freshness: PT5M
  availability_pct: 99.95
quality:
  - assertion: null_rate(payment_id) == 0
    severity: block
    rollout: block_all
  - assertion: unique(payment_id)
    severity: block
    rollout: block_all
  - assertion: amount > 0
    severity: block
    rollout: block_all
roles:
  producer: team-payments
  consumers: [team-fraud, team-finance]
Step-by-step trace.
| Version | Key change | Consumer action |
|---|---|---|
| 1.0.0 | initial contract, `card_bin` required, PII flagged | subscribe |
| 1.5.0 (deprecation) | `card_bin` marked deprecated; `card_iin` added optional | migrate reads to `card_iin`, keep fallback to `card_bin` |
| 2.0.0 (final) | `card_bin` removed; `card_iin` now required | consumers must have migrated by now |
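The deprecation window is the 45-day interval covered by the middle contract. During those 45 days the producer populates both columns as its writers migrate (`card_iin` is still optional, so it may be NULL on some rows), and the warn-only assertion `card_bin == card_iin OR card_iin IS NULL` flags any row where the two disagree. Because the assertion is warn-only it surfaces drift rather than blocking writes. Consumers migrate their queries at their own pace; by day 45 every consumer should be reading `card_iin`, and 2.0.0 can then remove `card_bin` safely.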
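On the consumer side, "migrate reads to `card_iin`, keep fallback to `card_bin`" can be as small as the sketch below. It assumes rows arrive as plain dicts; `read_iin` and `migration_assertion_holds` are hypothetical helper names, and the second function simply restates the contract's warn-only assertion in Python.

```python
def read_iin(row: dict):
    """Prefer card_iin; fall back to card_bin while the deprecation window is open."""
    iin = row.get("card_iin")
    return iin if iin is not None else row.get("card_bin")


def migration_assertion_holds(row: dict) -> bool:
    """Python form of: card_bin == card_iin OR card_iin IS NULL."""
    return row.get("card_iin") is None or row.get("card_bin") == row.get("card_iin")


rows = [
    {"payment_id": "p-1", "card_bin": "411111"},                        # 1.0.0-shaped row
    {"payment_id": "p-2", "card_bin": "411111", "card_iin": None},      # 1.5.0, not yet migrated
    {"payment_id": "p-3", "card_bin": "411111", "card_iin": "411111"},  # 1.5.0, migrated
    {"payment_id": "p-4", "card_iin": "550000"},                        # 2.0.0-shaped row
]

for row in rows:
    print(row["payment_id"], read_iin(row))

# A row the warn-only check would flag during the window: the two columns disagree.
print(migration_assertion_holds({"card_bin": "411111", "card_iin": "550000"}))  # False
```

Because `read_iin` works on all three row shapes, a consumer can deploy it at any point in the window and leave it in place until after 2.0.0 lands.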
Output:
| Metric | 1.0.0 | 1.5.0 | 2.0.0 |
|---|---|---|---|
| card_bin present | yes | yes (deprecated) | no |
| card_iin present | no | yes (optional) | yes (required) |
| Consumers still on card_bin | 100% | migrating | 0% |
| Breaking? | — | no (additive) | yes (drop) |
Why this works — concept by concept:
- Deprecation-window contract: the 1.5.0 middle step is what makes the rename safe. Both columns are available; consumers can migrate at their own pace; the warn-only assertion flags any row where the two columns disagree during the window.
- Semver bumps match change semantics: 1.0.0 → 1.5.0 is minor (additive `card_iin`, additive `deprecated` flag). 1.5.0 → 2.0.0 is major (removing `card_bin`). The lint tool passes both bumps.
- PII + sensitive flags carried forward: every version flags the BIN/IIN column (`card_bin`, then `card_iin`) as `pii: true, sensitive: true`. DLP tooling continues to mask it in query results and encrypt at rest.
- SLA and quality unchanged: the rename is a schema-only migration. The 5-minute freshness SLA and the block-severity quality assertions stay stable across all three versions; consumers' downstream freshness expectations do not need renegotiating.
- Cost — one deprecation window (45 days) plus three PRs. The alternative (rename in-place, break every consumer) costs one weekend war room per consumer plus the trust hit. The paved path is O(days) of calendar time and O(1) engineer-hours; the reckless path is O(engineers × hours × consumers).
3. Schema Registry integration — Avro, Protobuf, JSON
Confluent and Apicurio Schema Registry — subjects, versions, and compatibility modes are the streaming-side of ODCS
The mental model in one line: a schema registry is a centralised, versioned, compatibility-checked store of streaming schemas — every producer publishes an Avro / Protobuf / JSON Schema to a subject, the registry enforces a compatibility mode (BACKWARD, FORWARD, FULL, NONE) that either accepts or rejects the new schema, and consumers read the schema by version to deserialise payloads. ODCS is the source-of-truth YAML; the registry is the runtime-enforcement mirror for the streaming path.
The three moving pieces.
- Subject. A namespaced string that identifies the schema slot, e.g. `orders.public.v1-value` for the value schema of the `orders.public.v1` topic. Every schema publication goes to a subject; consumers query the registry by subject to fetch the schema.
- Version. An integer per subject, incrementing every time a compatible schema is registered. `v1` is the first schema; `v2` is the next accepted schema; `v3` the next; and so on.
- Compatibility mode. The rule the registry enforces when a new schema is proposed. Modes decide whether a proposed schema is compatible with the previous version(s), and reject it at publication time if not.
The compatibility modes in detail.
- BACKWARD. New schema can be used to read data written by the previous version. Producers can add optional fields (fields with a default); adding a required field with no default breaks BACKWARD. Deleting a field also passes the BACKWARD check, because the new reader simply ignores it, although consumers still pinned to the old schema can break, which is why the contract's semver rules treat removal as major. This is the default for most streaming setups: consumers are upgraded first, then producers.
- BACKWARD_TRANSITIVE. Same as BACKWARD but the new schema must be compatible with all previous versions, not just the previous one. Prevents drift across long-lived subjects.
- FORWARD. Previous schema can read data written by the new schema. Producers can be upgraded first; consumers stay on the old schema. Useful when producers ship faster than consumers.
- FORWARD_TRANSITIVE. Same but against all previous versions.
- FULL. Both directions — BACKWARD and FORWARD simultaneously. Neither party has to upgrade first; migrations are safest but changes are most restricted.
- FULL_TRANSITIVE. Both directions, all previous versions.
- NONE. No compatibility checks; anything goes. Only used during initial development or emergency schema rewrites.
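All of these modes reduce to one question asked in one or both directions: can a given reader schema decode data written with a given writer schema? The sketch below models that for record schemas under a deliberately simplified rule (a reader field must either exist in the writer schema or carry a default; type changes are ignored). `can_read` and `is_compatible` are hypothetical helpers, not the registry's actual checker.

```python
def can_read(reader: list, writer: list) -> bool:
    """A reader can decode writer data if every reader field is present or has a default."""
    writer_names = {f["name"] for f in writer}
    return all(f["name"] in writer_names or "default" in f for f in reader)


def is_compatible(mode: str, new: list, previous_versions: list) -> bool:
    """Check `new` against the latest version, or all versions for *_TRANSITIVE modes."""
    if mode == "NONE":
        return True
    base = mode.replace("_TRANSITIVE", "")
    targets = previous_versions if mode.endswith("_TRANSITIVE") else previous_versions[-1:]
    for old in targets:
        backward = can_read(reader=new, writer=old)   # new schema reads old data
        forward = can_read(reader=old, writer=new)    # old schema reads new data
        ok = {"BACKWARD": backward, "FORWARD": forward, "FULL": backward and forward}[base]
        if not ok:
            return False
    return True


v1 = [{"name": "order_id"}, {"name": "user_id"}, {"name": "order_total"}]
add_optional = v1 + [{"name": "discount_code", "default": None}]
add_required = v1 + [{"name": "channel"}]
drop_field = v1[:2]

print(is_compatible("BACKWARD", add_optional, [v1]))  # True
print(is_compatible("BACKWARD", add_required, [v1]))  # False: new reader has no value for old data
print(is_compatible("FORWARD", add_required, [v1]))   # True: old reader ignores the extra field
print(is_compatible("BACKWARD", drop_field, [v1]))    # True: new reader ignores the dropped field
print(is_compatible("FORWARD", drop_field, [v1]))     # False: old reader still needs order_total
print(is_compatible("FULL", add_optional, [v1]))      # True
```

The transitive variants only differ in which earlier versions are checked, which is why a single `targets` slice is enough to model them.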
The subject-naming strategies.
- TopicNameStrategy (default). `<topic>-key` and `<topic>-value`. One schema per topic; simplest model; requires every event on the topic to share the same schema.
- RecordNameStrategy. `<record-fullname>` (e.g. `com.company.OrderCreated`). One schema per record type; useful when the same topic carries multiple event types.
- TopicRecordNameStrategy. `<topic>-<record-fullname>`. Combines the two: one schema per (topic, record-type) pair; useful for topics carrying multiple event types with topic-scoped versioning.
The Avro-Protobuf-JSON trade-off.
- Avro. Binary encoding, compact. With the Confluent serializers each message carries only a schema id in a short payload prefix rather than the full schema, and encoding and decoding are generally fast. Ideal for high-throughput streaming.
- Protobuf. Binary encoding, similar compactness to Avro, wire format is language-neutral, gRPC-native. Ideal for polyglot organisations with existing Protobuf ecosystems.
- JSON Schema. Text encoding, human-readable, larger on the wire, easier debugging. Ideal for internal-tools topics where debuggability trumps performance.
Common interview probes on schema registry.
- "Explain BACKWARD compatibility with a producer-side example." — required.
- "What's the difference between BACKWARD and FORWARD?" — who upgrades first.
- "What subject-naming strategy would you pick for a multi-event-type topic?" — RecordName or TopicRecordName.
- "How does the registry integrate with ODCS?" The ODCS `schema` block is the source; a sync tool projects it into the registry as an Avro / Protobuf schema.
Worked example — BACKWARD-compatible Avro evolution
Detailed explanation. The canonical evolution story: an orders Avro schema v1 has order_id, user_id, order_total. The producer wants to add a discount_code field. Show the two-line Avro schema evolution, the registry BACKWARD check that passes, and the consumer that survives without recompilation.
- v1 schema. Three required fields.
- v2 schema. Three original fields + `discount_code` as optional (nullable with default null).
- BACKWARD check. Does a v2 reader read v1 data? Yes: the missing `discount_code` in v1 data is filled with the v2 default.
- Producer + consumer upgrade order. Upgrade consumers first (they can read both v1 and v2); then producers (they start writing v2).
Question. Produce the Avro schema for v1 and v2 of the orders topic value, the registry publish command, and the consumer code (Python) that reads both versions.
Input.
| Element | Value |
|---|---|
| Topic | orders.public |
| Subject strategy | TopicNameStrategy |
| Subject | orders.public-value |
| Compatibility | BACKWARD |
| Registry | Confluent Schema Registry |
Code.
// orders_v1.avsc: value schema v1
{
  "type": "record",
  "name": "Order",
  "namespace": "com.company.orders",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "user_id", "type": "string"},
    {"name": "order_total", "type": {"type": "bytes", "logicalType": "decimal", "precision": 12, "scale": 2}}
  ]
}
// orders_v2.avsc: value schema v2 (BACKWARD-compatible with v1)
{
  "type": "record",
  "name": "Order",
  "namespace": "com.company.orders",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "user_id", "type": "string"},
    {"name": "order_total", "type": {"type": "bytes", "logicalType": "decimal", "precision": 12, "scale": 2}},
    {"name": "discount_code", "type": ["null", "string"], "default": null}
  ]
}
# Publish v1 and v2 to the registry with BACKWARD compatibility
curl -X PUT http://registry:8081/config/orders.public-value \
-H "Content-Type: application/vnd.schemaregistry.v1+json" \
-d '{"compatibility":"BACKWARD"}'
# Register v1
curl -X POST http://registry:8081/subjects/orders.public-value/versions \
-H "Content-Type: application/vnd.schemaregistry.v1+json" \
-d @<(jq -Rs '{schema: .}' < orders_v1.avsc)
# Register v2 — passes BACKWARD check (adds optional field with default)
curl -X POST http://registry:8081/subjects/orders.public-value/versions \
-H "Content-Type: application/vnd.schemaregistry.v1+json" \
-d @<(jq -Rs '{schema: .}' < orders_v2.avsc)
# consumer.py: reads both v1 and v2 messages transparently
from confluent_kafka import Consumer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroDeserializer
from confluent_kafka.serialization import MessageField, SerializationContext

registry = SchemaRegistryClient({"url": "http://registry:8081"})

# The deserializer reads the schema id from the payload prefix (magic byte + 4-byte id),
# fetches the writer schema, and projects onto the current reader schema.
with open("orders_v2.avsc") as f:
    reader_schema = f.read()  # consumer knows v2
avro_deserializer = AvroDeserializer(registry, reader_schema)

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "orders-consumer",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders.public"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    ctx = SerializationContext(msg.topic(), MessageField.VALUE)
    order = avro_deserializer(msg.value(), ctx)
    # order["discount_code"] is None for v1 messages, populated for v2 messages
    print(order["order_id"], order.get("discount_code"))
Step-by-step explanation.
- The v1 schema declares three required fields. Every producer writing v1 must populate all three; consumers deserialising v1 receive all three.
- The v2 schema adds `discount_code` as a union of `null` and `string` with a default of `null`. In Avro, adding a nullable field with a default is the canonical BACKWARD-compatible change.
- Setting the subject-level compatibility to BACKWARD via `PUT /config/orders.public-value` establishes the rule the registry will enforce. Any subsequent `POST /subjects/.../versions` runs the compatibility check first.
- Registering v2 succeeds because the BACKWARD check passes: a reader using v2's schema can read v1 data (the missing `discount_code` is filled with the default `null`). If the producer had instead added `discount_code` as required, the registry would return `409 Conflict` with a `Schema being registered is incompatible with an earlier schema` error.
- The consumer code declares v2 as its reader schema. The `AvroDeserializer` reads each message's writer schema id from the Confluent wire-format prefix at the start of the payload (a magic byte followed by a 4-byte schema id), fetches the writer schema from the registry, and projects into the reader schema. For v1 messages, `discount_code` is filled with `null`; for v2 messages, it's populated.
- The upgrade order is: (a) update consumers to use the v2 reader schema first, since they can still read v1 messages; (b) once consumers are deployed, update producers to write v2. BACKWARD only guarantees that new readers can read old data; it makes no promise about old readers and new data, so consumers-first is the safe order. (For this particular change a v1 reader would in fact skip the unknown `discount_code` field during Avro schema resolution, but that is a property of this change, not of the mode.)
Output.
| Message written by | Schema id | Consumer reads with v2 reader | discount_code value |
|---|---|---|---|
| producer v1 | 1 | success | null (default) |
| producer v2 | 2 | success | "SPRING2026" |
| producer v2 (no discount) | 2 | success | null |
Rule of thumb. BACKWARD is the streaming default. Add optional fields with defaults; upgrade consumers first, then producers. Reject any Avro change that adds a required field without a default: it is not BACKWARD-compatible and the registry will (correctly) refuse to register the schema. Removing a field passes the BACKWARD check, so gate removals through the contract's deprecation window rather than relying on the registry to stop them.
Worked example — Protobuf schema evolution with FULL compatibility
Detailed explanation. Protobuf's approach to schema evolution differs from Avro. All fields carry a field number; unknown fields are preserved on read; field numbers must never be reused. Show a Protobuf schema evolution under FULL compatibility (both BACKWARD and FORWARD), demonstrating why Protobuf's design choices make FULL easier than Avro's.
- Protobuf design. Fields are optional by default in proto3; adding a new optional field is always FULL-compatible.
- The forbidden change. Renaming a field is fine (field number is the identity); reusing a field number for a new field is forbidden.
- Wire format. Unknown fields are preserved during deserialise + reserialise, so a consumer on old schema can pass through data from a new-schema producer without losing information.
Question. Show a proto3 schema evolution from v1 to v2 (adding a discount_code field) under FULL compatibility, plus the registry compatibility check.
Input.
| Element | Value |
|---|---|
| Topic | orders.public.proto |
| Subject | orders.public.proto-value |
| Compatibility | FULL |
| Format | Protobuf (proto3) |
Code.
// orders_v1.proto
syntax = "proto3";
package com.company.orders;
message Order {
string order_id = 1;
string user_id = 2;
string order_total = 3; // decimal serialised as string for exactness
}
// orders_v2.proto — adds discount_code with a new field number
syntax = "proto3";
package com.company.orders;
message Order {
string order_id = 1;
string user_id = 2;
string order_total = 3;
string discount_code = 4; // new field, new number
}
# Publish with FULL compatibility
curl -X PUT http://registry:8081/config/orders.public.proto-value \
-H "Content-Type: application/vnd.schemaregistry.v1+json" \
-d '{"compatibility":"FULL"}'
# Register v1 and v2 — both pass FULL check for proto3 with new field number
for v in orders_v1.proto orders_v2.proto; do
curl -X POST http://registry:8081/subjects/orders.public.proto-value/versions \
-H "Content-Type: application/vnd.schemaregistry.v1+json" \
-d @<(jq -Rs '{schema: ., schemaType: "PROTOBUF"}' < "$v")
done
// Consumer — Protobuf's design means the v1 consumer can read v2 messages
// (unknown fields preserved) and the v2 consumer can read v1 messages (missing
// fields default to empty string in proto3).
Order v1Order = Order.parseFrom(bytes);
// If bytes came from a v2 producer, the v1-generated Order class has no getDiscountCode() accessor,
// but v1 can still process the fields it knows about, and unknown-field bytes survive
// a round-trip through v1's serialiser.
Step-by-step explanation.
- Proto3's design has all fields optional by default. There is no concept of "required" in proto3: a field either has a value or its default (empty string, 0, false).
- Adding a new field with a new number (`discount_code = 4`) is always FULL-compatible in proto3. The v1 reader ignores field 4 (unknown field, preserved on the wire); the v2 reader deserialises it. The v1 reader can also write; it just doesn't emit field 4.
- The registry's FULL check on the `POST` that registers v2 succeeds because the change is bidirectionally safe. If the producer had removed field 3 or reused field number 3 for a new field, FULL would fail.
- Consumers written against v1 keep working when producers upgrade to v2; the unknown `discount_code` bytes are preserved and round-tripped. This is the "proto3 unknown fields" behaviour.
- Compared to Avro's BACKWARD-only default, Protobuf's design makes FULL compatibility achievable for a broader class of changes. The trade-off: Protobuf writes a tag for every populated field, so densely populated records tend to be slightly larger than their Avro equivalents, and it requires more careful management of field numbers (never reuse).
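The unknown-field behaviour follows directly from the Protobuf wire format, where each field is written as a varint tag (`field_number << 3 | wire_type`) followed by its payload. The sketch below hand-rolls just enough of that format to show a v1 reader skipping field 4 while keeping its bytes for re-serialisation. It is an illustration, not a replacement for the protobuf library: it handles only length-delimited fields (wire type 2), which is all the string-only `Order` message needs, and every function name is hypothetical.

```python
def encode_varint(value: int) -> bytes:
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int):
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_string_field(number: int, text: str) -> bytes:
    data = text.encode("utf-8")
    return encode_varint((number << 3) | 2) + encode_varint(len(data)) + data


def parse(buf: bytes, known: dict):
    """Return (decoded known fields, raw bytes of unknown fields)."""
    fields, unknown, pos = {}, bytearray(), 0
    while pos < len(buf):
        start = pos
        tag, pos = decode_varint(buf, pos)
        number, wire_type = tag >> 3, tag & 0x07
        if wire_type != 2:
            raise ValueError("sketch only handles length-delimited fields")
        length, pos = decode_varint(buf, pos)
        end = pos + length
        if number in known:
            fields[known[number]] = buf[pos:end].decode("utf-8")
        else:
            unknown += buf[start:end]  # keep tag + length + payload untouched
        pos = end
    return fields, bytes(unknown)


V1_FIELDS = {1: "order_id", 2: "user_id", 3: "order_total"}
v2_message = b"".join(
    encode_string_field(n, v)
    for n, v in [(1, "o-42"), (2, "u-7"), (3, "19.99"), (4, "SPRING2026")]
)

fields, unknown = parse(v2_message, V1_FIELDS)
print(fields)   # {'order_id': 'o-42', 'user_id': 'u-7', 'order_total': '19.99'}

# Re-serialise what v1 knows and append the preserved unknown bytes.
round_tripped = b"".join(encode_string_field(n, fields[name]) for n, name in V1_FIELDS.items()) + unknown
print(round_tripped == v2_message)   # True
```

Field numbers are the only identity on the wire, which is also why reusing a number for a different field silently corrupts older readers.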
Output.
| Change | Avro BACKWARD | Protobuf FULL |
|---|---|---|
| Add optional field with default | pass | pass |
| Add required field | fail | (no required in proto3) |
| Remove field | pass (new reader ignores it) | fail (breaking) |
| Rename field | fail (name matters) | pass (number matters) |
| Reuse field number for new field | N/A | forbidden |
Rule of thumb. For polyglot organisations or where FULL compatibility matters, prefer Protobuf. For maximum compactness and where BACKWARD is sufficient, prefer Avro. JSON Schema is fine for internal-tools topics where debuggability wins. Never mix formats within a subject — pick one per topic.
Worked example — Subject-naming strategies for multi-event topics
Detailed explanation. A topic that carries multiple event types (e.g. orders.events carrying OrderCreated, OrderPaid, OrderShipped) cannot use TopicNameStrategy because there'd be one schema slot for three different record types. RecordNameStrategy assigns a subject per record type; TopicRecordNameStrategy scopes it further per topic. Walk through the three strategies with a concrete example.
- TopicNameStrategy. `orders.events-value`: one slot; wrong for multi-event topics.
- RecordNameStrategy. `com.company.orders.OrderCreated`, `com.company.orders.OrderPaid`: three slots; same schema shared across every topic that carries the record.
- TopicRecordNameStrategy. `orders.events-com.company.orders.OrderCreated`: three slots scoped per topic; different topics can evolve the same record type independently.
Question. Show a producer config and a consumer config for a topic orders.events carrying OrderCreated, OrderPaid, and OrderShipped, using TopicRecordNameStrategy.
Input.
| Component | Value |
|---|---|
| Topic | orders.events |
| Event types | OrderCreated, OrderPaid, OrderShipped |
| Subject strategy | TopicRecordNameStrategy |
| Compatibility | BACKWARD (per subject) |
Code.
// OrderCreated.avsc
{
  "type": "record",
  "name": "OrderCreated",
  "namespace": "com.company.orders",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "created_at", "type": {"type": "long", "logicalType": "timestamp-millis"}}
  ]
}
// Producer — java, Kafka + Confluent client
Properties props = new Properties();
props.put("bootstrap.servers", "kafka:9092");
props.put("schema.registry.url", "http://registry:8081");
// Subject strategy — one subject per (topic, record-type)
props.put("value.subject.name.strategy",
"io.confluent.kafka.serializers.subject.TopicRecordNameStrategy");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props);
// Publishing an OrderCreated event to orders.events
GenericRecord created = new GenericData.Record(orderCreatedSchema);
created.put("order_id", "o-42");
created.put("created_at", System.currentTimeMillis());
producer.send(new ProducerRecord<>("orders.events", "o-42", created));
// Registry subject: orders.events-com.company.orders.OrderCreated
# Registry state after publishing all three event types
subjects:
orders.events-com.company.orders.OrderCreated: [v1]
orders.events-com.company.orders.OrderPaid: [v1]
orders.events-com.company.orders.OrderShipped: [v1]
Step-by-step explanation.
- The producer sets `value.subject.name.strategy = TopicRecordNameStrategy`. The Kafka Avro serializer inspects each message's Avro schema, extracts the fullname (`com.company.orders.OrderCreated`), and constructs the subject `orders.events-com.company.orders.OrderCreated`.
- The three event types register as three distinct subjects. Each subject has its own version history and compatibility rule; evolving `OrderPaid` does not touch `OrderCreated`'s history.
- The consumer subscribes to `orders.events` and receives interleaved messages of all three types. For each message, it reads the schema id from the payload prefix, fetches the writer schema from the registry, and deserialises. The consumer then dispatches based on the schema fullname to type-specific handlers.
- If the team later adds a fourth event type `OrderCancelled`, it gets its own subject `orders.events-com.company.orders.OrderCancelled` automatically: no config change on the producer, no coordination.
- The catch is quota: Confluent Schema Registry counts subjects for its subject-limit quota. Multi-event topics multiply subject counts; teams with many topics and many event types can hit the limit and need to negotiate a quota bump.
Output.
| Strategy | Subject granularity | Cross-topic sharing | Use case |
|---|---|---|---|
| TopicNameStrategy | 1 subject / topic | not applicable | single-event topics |
| RecordNameStrategy | 1 subject / record type | yes | shared record across topics |
| TopicRecordNameStrategy | 1 subject / (topic, record) | no | multi-event topics |
Rule of thumb. Default to TopicNameStrategy for single-event topics (simplest, matches ODCS 1:1). Escalate to TopicRecordNameStrategy for multi-event topics — it's the most flexible and matches the "one contract per (topic, event)" mental model. Reserve RecordNameStrategy for the rare case where the same record type flows through multiple topics.
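The three strategies in the table differ only in how the subject string is assembled. A minimal sketch of that derivation, assuming the Confluent naming conventions for each strategy; the function name subject_for is illustrative and not part of any client library:

```python
def subject_for(strategy: str, topic: str, record_fullname: str,
                is_key: bool = False) -> str:
    """Return the registry subject a serializer would use under each strategy."""
    if strategy == "TopicNameStrategy":
        # One subject per topic: <topic>-key or <topic>-value
        return f"{topic}-{'key' if is_key else 'value'}"
    if strategy == "RecordNameStrategy":
        # One subject per record type, shared by every topic that carries it
        return record_fullname
    if strategy == "TopicRecordNameStrategy":
        # One subject per (topic, record type) pair
        return f"{topic}-{record_fullname}"
    raise ValueError(f"unknown strategy: {strategy}")


print(subject_for("TopicRecordNameStrategy", "orders.events",
                  "com.company.orders.OrderCreated"))
```

Under TopicRecordNameStrategy the call reproduces the subject shown in the registry state for OrderCreated; switching the first argument shows how the same event would be filed under the other two strategies.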
Senior interview question on schema registry integration with ODCS
A senior interviewer might ask: "You have an ODCS contract for orders.public version 1.4.0. Walk me through how you'd sync that contract into the Confluent Schema Registry, what compatibility mode you'd set, how you'd handle a v1.5.0 additive change and a v2.0.0 breaking change, and what happens to consumers pinned to v1 during a v2 rollout."
Solution Using an ODCS-to-registry sync tool with BACKWARD-TRANSITIVE + per-version consumer pinning
# sync_odcs_to_registry.py: the reference sync tool
import json
from typing import Any, Dict, Tuple

import requests
import yaml

# Map ODCS types to Avro types
ODCS_TO_AVRO = {
    "STRING": "string",
    "INT": "int",
    "BIGINT": "long",
    "BOOLEAN": "boolean",
    "TIMESTAMP": {"type": "long", "logicalType": "timestamp-millis"},
    # DECIMAL(p,s) handled specially
}


def odcs_column_to_avro_field(col: Dict[str, Any]) -> Dict[str, Any]:
    if col["type"].startswith("DECIMAL"):
        # DECIMAL(12,2) → bytes with logicalType
        precision, scale = _parse_decimal(col["type"])
        avro_type = {"type": "bytes", "logicalType": "decimal",
                     "precision": precision, "scale": scale}
    else:
        # Unmapped ODCS types fall back to string
        avro_type = ODCS_TO_AVRO.get(col["type"], "string")
    field = {"name": col["name"]}
    if col.get("required", False):
        field["type"] = avro_type
    else:
        # Optional column → nullable union; "null" goes first so the default can be null
        field["type"] = ["null", avro_type]
        field["default"] = None
    if col.get("description"):
        field["doc"] = col["description"]
    return field


def _parse_decimal(t: str) -> Tuple[int, int]:
    inside = t[len("DECIMAL("):-1]
    p, s = inside.split(",")
    return int(p), int(s)


def odcs_to_avro(contract: Dict[str, Any]) -> Dict[str, Any]:
    return {
        "type": "record",
        "name": "Value",
        "namespace": contract["id"],
        "fields": [odcs_column_to_avro_field(c) for c in contract["schema"]],
    }


def sync(contract_path: str, registry_url: str, compat: str = "BACKWARD_TRANSITIVE"):
    with open(contract_path) as f:
        contract = yaml.safe_load(f)
    # One subject per major version, e.g. orders.public.v1-value
    major = contract["version"].split(".")[0]
    subject = f"{contract['id']}.v{major}-value"
    # 1. set compatibility
    requests.put(f"{registry_url}/config/{subject}",
                 json={"compatibility": compat}).raise_for_status()
    # 2. compute the Avro schema from the ODCS schema
    avro = odcs_to_avro(contract)
    # 3. register the schema (registry runs its own compatibility check first)
    r = requests.post(f"{registry_url}/subjects/{subject}/versions",
                      json={"schema": json.dumps(avro)})
    if r.status_code == 409:
        raise SystemExit(f"registry rejected schema (incompatible): {r.text}")
    r.raise_for_status()  # surface any other registry error instead of a KeyError below
    return r.json()["id"]


if __name__ == "__main__":
    print(sync("orders/contract.yaml", "http://registry:8081"))
Rollout plan — v1.5.0 (additive) → v2.0.0 (breaking)
======================================================
1.4.0 → 1.5.0 (additive: add discount_code optional)
  [producer PR]  odcs-lint diff → passes (MINOR)
  [CI step]      sync_odcs_to_registry orders/contract.yaml
                 → subject: orders.public.v1-value
                 → registry check: BACKWARD_TRANSITIVE passes
                 → schema id incremented
  [consumers]    still on 1.4.0-compatible Avro reader schema
                 new discount_code ignored; no code change required
1.5.0 → 2.0.0 (breaking: rename card_bin → card_iin, drop card_bin)
  [producer PR]  odcs-lint diff → BREAKING (drop card_bin)
                 version bump matches: 1.x → 2.0.0 OK
  [subject]      NEW subject: orders.public.v2-value
                 compatibility: BACKWARD_TRANSITIVE within v2's history
  [producers]    dual-write for 45 days:
                 write v1 schema to orders.public.v1
                 write v2 schema to orders.public.v2
  [consumers]    pinned to v1 keep working (v1 topic still flowing)
                 migrate to v2 topic at their own pace
  [cutover day]  producers stop writing to v1; v1 topic reaches retention → deprecated
Step-by-step trace.
| Change | Subject | Compat check | Consumer action |
|---|---|---|---|
| 1.4 → 1.5 additive | orders.public.v1-value | BACKWARD_TRANSITIVE passes | none |
| 1.5 → 2.0 breaking (rename) | new: orders.public.v2-value | fresh subject, own history | migrate over 45 days |
| Cutover | drop orders.public.v1 topic | v1 retention expires | consumers all on v2 |
The major-version bump gets its own registry subject (v2-value). Producers dual-write for the deprecation window; v1 consumers keep working on the old topic; v2 consumers use the new topic. On cutover day, producers stop writing to v1; the topic ages out through retention; no v1 message ever fails a consumer.
Output:
| Rollout day | v1 topic | v2 topic | v1 consumers | v2 consumers |
|---|---|---|---|---|
| Day 0 (1.5 → 2.0 PR) | active | active | reading v1 | reading v2 |
| Day 30 | active | active | some migrated | growing |
| Day 45 (deprecation ends) | producers stop writing | active | zero (all migrated) | 100% |
| Day 90 (retention expires) | closed | active | — | 100% |
Why this works — concept by concept:
- ODCS-as-source — the producer maintains one YAML; the sync tool projects it into whatever runtime materialisation the streaming stack needs (Avro, Protobuf, JSON Schema). Because the sync runs in CI on every contract change, drift between the contract and the registry is caught at merge time, provided nobody registers schemas by hand outside the sync.
- Major version = new subject — instead of trying to squeeze breaking changes into an existing subject with tortured compat rules, ODCS + registry integrations create a fresh subject per major version. The v1 subject keeps flowing; v2 flows in parallel; consumers migrate on their timeline.
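- BACKWARD_TRANSITIVE within a major — inside a major version, additive changes remain compatible across all minors. Plain BACKWARD checks a new schema only against the latest registered version; BACKWARD_TRANSITIVE checks it against every earlier version in the subject, so a change that is compatible with 1.5 but cannot read data written under 1.4 is still rejected.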
- Dual-write during deprecation — the producer writes v1 to the v1 topic and v2 to the v2 topic for the deprecation window. Consumers on either version keep working; the cost is 2x write throughput during the window.
- Cost — the sync tool is ~150 lines of Python; the dual-write imposes 2x storage during deprecation. The avoided cost of a coordinated org-wide upgrade (dozens of consumer teams migrating on the same weekend) more than pays for the throughput doubling. The runtime cost is bounded by the length of the deprecation window; the coordination cost is close to zero.
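The difference between BACKWARD and BACKWARD_TRANSITIVE is easiest to see on a three-version history. The sketch below is a deliberately simplified model, not the registry's algorithm: a schema is reduced to a mapping of field name to "has a default", and type changes are ignored. The names can_read, backward, and backward_transitive are illustrative.

```python
from typing import Dict, List

Schema = Dict[str, bool]  # field name -> True if the field declares a default


def can_read(reader: Schema, writer: Schema) -> bool:
    """True if `reader` can decode data written with `writer`.

    Simplified rule: any field the writer did not produce must have a default.
    """
    return all(has_default
               for name, has_default in reader.items()
               if name not in writer)


def backward(new: Schema, history: List[Schema]) -> bool:
    # Non-transitive: compare only against the latest registered version
    return not history or can_read(new, history[-1])


def backward_transitive(new: Schema, history: List[Schema]) -> bool:
    # Transitive: compare against every registered version
    return all(can_read(new, old) for old in history)


v1 = {"order_id": False}
v2 = {"order_id": False, "discount_code": True}   # optional field added
v3 = {"order_id": False, "discount_code": False}  # default later dropped

print(backward(v3, [v1, v2]))             # True: v2 data always carries discount_code
print(backward_transitive(v3, [v1, v2]))  # False: v1 data has no discount_code
```

The v3 change slips past the non-transitive check because it is only compared with v2, yet a consumer on v3 could not replay v1-era messages still sitting in the topic.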
4. Contract enforcement — CI + runtime
Two gates catch two failure modes — CI blocks bad intent; runtime blocks bad reality
The mental model in one line: CI enforcement catches the producer before the change ships (pre-merge lint against the contract, pre-merge registry compatibility check, pre-merge dbt contract test); runtime enforcement catches the payload after the change ships (consumer-side deserialisation validation, streaming producer-side schema-id header, batch loader constraint check). Neither gate suffices alone; both together give the contract teeth.
The two enforcement gates.
- CI gate. Runs against every producer PR before merge. Steps: (1) contract-lint validates the ODCS YAML shape; (2) contract-diff checks the semver bump matches change severity; (3) registry-compat-check (for streaming datasets) verifies the projected Avro / Protobuf schema passes the subject's compatibility mode; (4) dbt-contract-check (for warehouse datasets) verifies the dbt model's declared columns match the contract.
- Runtime gate. Runs against every payload. For streaming: the consumer-side deserialiser reads the schema ID from the payload's wire-format prefix, fetches the writer schema, and projects into the consumer's reader schema; incompatible payloads throw a SerializationException that routes to a DLQ. For batch: the loader (Airflow / dbt / Fivetran) checks the incoming schema against the contract before writing; incompatible files land in a quarantine bucket.
The CI failure classifier.
- Level 1 — YAML shape. File doesn't parse or missing required blocks. Immediate lint failure with a pointer to the offending line.
- Level 2 — Semver mismatch. Diff detects a breaking change; version bump is not major. Fail with the three remediation options (bump major / restore / deprecate).
- Level 3 — Registry incompatibility. ODCS says the schema should be BACKWARD-compatible, but the projected Avro schema is not. Fail with the registry's specific error message.
- Level 4 — dbt contract mismatch. The dbt model contract's columns block does not match the ODCS schema block. Fail with the diff.
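The Level 2 check can be sketched in a few lines. This is an illustrative classifier, not the behaviour of any particular lint tool: it assumes columns are given as a name-to-type mapping, treats every added column as optional (hence MINOR), and treats removals and type changes as BREAKING. All function names are hypothetical.

```python
from typing import Dict, List, Tuple

SEVERITY_RANK = {"NONE": 0, "MINOR": 1, "BREAKING": 2}
BUMP_ORDER = ["none", "patch", "minor", "major"]


def classify(old_cols: Dict[str, str], new_cols: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (change kind, column, severity) for each schema difference."""
    changes = []
    for name, old_type in old_cols.items():
        if name not in new_cols:
            changes.append(("schema.remove", name, "BREAKING"))
        elif new_cols[name] != old_type:
            changes.append(("schema.retype", name, "BREAKING"))
    for name in new_cols:
        if name not in old_cols:
            # Assumption: added columns are optional, so they are additive
            changes.append(("schema.add", name, "MINOR"))
    return changes


def bump_kind(old: str, new: str) -> str:
    o = [int(p) for p in old.split(".")]
    n = [int(p) for p in new.split(".")]
    if n[0] != o[0]:
        return "major" if n[0] > o[0] else "none"
    if n[1] != o[1]:
        return "minor" if n[1] > o[1] else "none"
    return "patch" if n[2] > o[2] else "none"


def bump_matches(changes, old_version: str, new_version: str) -> bool:
    """True if the version bump is at least as large as the worst change requires."""
    worst = max((sev for _, _, sev in changes), key=SEVERITY_RANK.get, default="NONE")
    required = {"BREAKING": "major", "MINOR": "minor", "NONE": "none"}[worst]
    actual = bump_kind(old_version, new_version)
    return BUMP_ORDER.index(actual) >= BUMP_ORDER.index(required)


old = {"order_id": "STRING", "user_id": "STRING"}
new = {"order_id": "STRING", "customer_id": "STRING"}
changes = classify(old, new)
print(changes)
print(bump_matches(changes, "1.4.0", "1.5.0"))  # False: a rename needs a major bump
```

A rename shows up as one removal plus one addition, which is why the worst severity wins: the addition alone would have justified the minor bump, the removal does not.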
The runtime failure taxonomy.
- Deserialisation error. Consumer receives a message whose writer schema is incompatible with its reader schema. Options: (a) fail-fast — throw and let the consumer die; (b) route to DLQ; (c) skip and log. Production default: DLQ + alert.
- Quality assertion violation. Payload deserialised successfully but a quality assertion (null-rate ceiling, uniqueness, range) is violated. Options: (a) block the write (severity=block); (b) log-and-continue (severity=warn); (c) route to a quarantine table.
- SLA violation. Freshness or availability probe detects a breach; page the producer.
- PII leak. DLP scanner detects unmasked PII in a queryable surface (dashboard, downstream table, log line). Immediate quarantine + security ticket.
The dead-letter queue pattern.
- When. Consumer-side deserialisation error or quality violation with severity=block.
- Where. A dedicated Kafka topic (<original-topic>.dlq) or an object-store prefix (s3://…/dlq/<topic>/<date>/).
- What. The original payload bytes + the error message + timestamp + correlation id.
- Who. Producer team owns the DLQ; consumer team monitors DLQ growth as a signal.
- How long. DLQ retention typically 14–30 days; long enough for a producer to diagnose and replay, not so long it becomes a data-leak surface.
Common interview probes on enforcement.
- "Where does contract enforcement happen — CI or runtime?" — both; describe the split.
- "What happens when a schema-registry-BACKWARD check fails in CI?" — PR blocked, remediation options.
- "Consumer receives an incompatible payload — what does it do?" — DLQ + alert; don't crash.
- "How do you enforce a quality assertion?" — dbt test (batch) or streaming validator (Flink / Beam); severity + rollout phase.
Worked example — CI catches a schema-breaking rename
Detailed explanation. Reprise the silent-rename story, but this time the CI catches it. Walk through the exact GitHub Actions run, the odcs-lint output, the registry-compat-check, and the required remediation.
- The proposed change. Producer renames user_id → customer_id and bumps 1.4.0 → 1.5.0.
- The CI pipeline. contract-lint (shape) → contract-diff (semver) → registry-compat-check (BACKWARD).
- The failure. contract-diff exits 2 (BREAKING vs minor bump); registry-compat-check also fails (renamed field breaks BACKWARD).
Question. Show the GitHub Actions workflow, the failing step outputs, and the PR comment bot that summarises remediation.
Input.
| Element | Value |
|---|---|
| Contract file | orders/contract.yaml |
| Baseline version | 1.4.0 |
| Proposed version | 1.5.0 |
| Change | rename user_id → customer_id |
| Registry | Confluent (BACKWARD) |
Code.
# .github/workflows/contract-enforcement.yml
name: contract-enforcement
on:
  pull_request:
    paths:
      - '**/contract.yaml'
permissions:
  contents: read
  pull-requests: write   # lets the failure step comment on the PR
jobs:
  lint-and-diff:
    runs-on: ubuntu-latest
    defaults:
      run:
        shell: bash   # explicit bash runs with -eo pipefail, so a failure upstream of `tee` fails the step
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install tooling
        run: pip install odcs-lint==0.9.0 avro requests
      - name: Shape validation
        run: |
          for f in $(git diff --name-only origin/main...HEAD | grep 'contract\.yaml$'); do
            echo "--- Linting $f ---"
            odcs-lint "$f"
          done
      - name: Diff validation (semver)
        id: diff
        run: |
          for f in $(git diff --name-only origin/main...HEAD | grep 'contract\.yaml$'); do
            odcs-lint diff --base origin/main --path "$f" | tee -a /tmp/diff-out
          done
      - name: Registry compatibility check
        if: ${{ !cancelled() }}   # still run when the diff step failed, so both results are reported
        run: python .github/scripts/registry_compat_check.py
      - name: Post PR comment on failure
        if: failure()
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const output = fs.existsSync('/tmp/diff-out')
              ? fs.readFileSync('/tmp/diff-out', 'utf8')
              : '(no diff output captured)';
            const body = `## Contract enforcement failed\n\n\`\`\`\n${output}\n\`\`\`\n\n**Remediation:** bump version to 2.0.0 with a deprecation window, restore the removed column, or mark it \`deprecated: true\` for one release cycle.`;
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: body
            });
--- Linting orders/contract.yaml ---
SHAPE: pass — file conforms to apiVersion v3.0.0
--- Diff against origin/main ---
DIFF:
  changes detected:
    - schema.remove: column "user_id" → severity: BREAKING
    - schema.add: column "customer_id" → severity: MINOR
  version bump: 1.4.0 → 1.5.0 (minor)
  required bump for BREAKING: major
FAIL: version bump (minor) does not match change severity (BREAKING).
Options:
  (a) bump to 2.0.0 with deprecation window
  (b) restore column "user_id"
  (c) mark "user_id" as deprecated: true for one release cycle
exit code: 2
--- Registry compatibility check ---
subject: orders.public.v1-value
compatibility: BACKWARD
projected Avro schema differs from registry v1:
  removed field: user_id
  added field: customer_id
FAIL: schema is not BACKWARD-compatible.
  To pass BACKWARD, retain user_id (optionally as deprecated).
exit code: 1
Step-by-step explanation.
- The GitHub Actions workflow triggers on any PR that touches a contract.yaml file. Three steps run sequentially: shape validation, diff validation, registry compatibility check. Each step exits non-zero on failure.
- Shape validation passes: the file is still valid YAML and has all required blocks. The failure is in the diff.
- Diff validation reads the base branch's contract.yaml and the PR's file, computes the schema delta, and classifies each change. The remove is BREAKING (removing a column consumers depend on); the add is MINOR (additive optional column). The maximum severity is BREAKING; the version bump is 1.4.0 → 1.5.0 (minor); mismatch → exit 2.
- The registry-compat-check projects the ODCS schema into Avro (using the sync tool from section 3), fetches the current registry schema for orders.public.v1-value, and runs Confluent's compatibility check locally (or via API). The rename fails BACKWARD because a consumer on the new schema cannot read messages produced under the old one: customer_id is a new required field with no default, and old messages do not carry it.
- The PR comment bot posts the failure output plus a remediation guide. The PR author sees exactly what the fix must be; no guessing, no waiting for a human reviewer.
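The workflow calls .github/scripts/registry_compat_check.py without showing it. One way such a script could ask the registry for a verdict is the Schema Registry REST compatibility endpoint, which tests a candidate schema against a registered version without registering it. The sketch below is an assumption about what that script might contain, not its actual contents; the helper names and the injectable post parameter (there so the function can be exercised without a live registry) are illustrative.

```python
import json
import urllib.request
from typing import Callable


def _http_post(url: str, body: dict) -> dict:
    """POST JSON to the registry and return the decoded JSON reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def is_compatible(registry_url: str, subject: str, avro_schema: dict,
                  post: Callable[[str, dict], dict] = _http_post) -> bool:
    """Ask the registry whether avro_schema is compatible with the latest version."""
    url = f"{registry_url}/compatibility/subjects/{subject}/versions/latest"
    # The registry expects the schema as a JSON-encoded string inside the body
    reply = post(url, {"schema": json.dumps(avro_schema)})
    return bool(reply.get("is_compatible", False))


# Offline demonstration with a stubbed transport that mimics a rejection
def fake_post(url: str, body: dict) -> dict:
    return {"is_compatible": False}


candidate = {"type": "record", "name": "Value", "namespace": "orders.public",
             "fields": [{"name": "customer_id", "type": "string"}]}
print(is_compatible("http://registry:8081", "orders.public.v1-value",
                    candidate, post=fake_post))
```

In CI the script would exit non-zero when the function returns False, which is what produces the "exit code: 1" shown in the step output.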
Output.
| Step | Exit | Comment | Merge? |
|---|---|---|---|
| Shape | 0 | pass | continue |
| Diff | 2 | BREAKING vs minor | blocked |
| Registry | 1 | BACKWARD fails | blocked |
| Overall | fail | PR comment posted | no |
Rule of thumb. The CI gate is the gate producers experience. Ship it early, ship it with a clear PR-comment output, and never bypass it — a bypass on Friday is an incident on Monday. The registry check is the streaming-specific belt-and-braces; the contract-diff is the format-agnostic backstop.
Worked example — dbt model contract enforcement for a warehouse table
Detailed explanation. dbt's contract: enforced: true feature is the warehouse-side counterpart to the registry's runtime enforcement. When a dbt model contract is enforced, dbt (a) verifies the model's SQL yields columns matching the declared columns block, (b) fails the build before materialising the table if a column is missing, unexpected, or has the wrong type, (c) generates a CREATE TABLE with the declared types and constraints, and (d) leaves constraint enforcement at write time to the warehouse. On Snowflake that means not_null is enforced, while primary_key is recorded as metadata but not enforced, so uniqueness still needs a test.
- The dbt model. A dim_orders model that materialises the orders.public contract into the warehouse.
- The dbt contract block. Copy-paste of the ODCS schema's columns with type mappings.
- The enforcement points. dbt build (preflight column check, then constraints in the generated DDL), dbt test (quality assertions).
Question. Produce the dbt model with a contract that mirrors the ODCS orders.public schema, and show the compile-time failure when a producer PR breaks the contract.
Input.
| Component | Value |
|---|---|
| dbt project | analytics |
| Model | dim_orders |
| Materialisation | table |
| Contract | enforced: true |
| Warehouse | Snowflake |
Code.
# models/marts/dim_orders.yml
version: 2
models:
  - name: dim_orders
    description: Dimensional table for confirmed orders.
    config:
      contract:
        enforced: true
    columns:
      - name: order_id
        data_type: varchar
        constraints:
          - type: not_null
          - type: primary_key
      - name: user_id
        data_type: varchar
        constraints:
          - type: not_null
      - name: order_total
        data_type: decimal(12,2)
        constraints:
          - type: not_null
      - name: country_code
        data_type: varchar(2)
        constraints:
          - type: not_null
      - name: created_at
        data_type: timestamp_tz
        constraints:
          - type: not_null
    tests:
      - dbt_utils.expression_is_true:
          expression: "order_total >= 0"
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns: ['order_id']
-- models/marts/dim_orders.sql
-- No trailing semicolon: dbt wraps this SELECT in its own CREATE TABLE statement
{{ config(materialized='table') }}
SELECT
    order_id::VARCHAR           AS order_id,
    user_id::VARCHAR            AS user_id,
    order_total::DECIMAL(12,2)  AS order_total,
    country_code::VARCHAR(2)    AS country_code,
    created_at::TIMESTAMP_TZ    AS created_at
FROM {{ source('raw', 'orders') }}
WHERE status = 'confirmed'
# What happens when a producer PR renames the source column
$ dbt build --select dim_orders
Running with dbt=1.7.10
Found 1 model, 2 tests, ...
Compiling model.analytics.dim_orders
CONTRACT ERROR in model dim_orders:
Column 'user_id' declared in contract is missing from the model.
Model yields: order_id, customer_id, order_total, country_code, created_at
Contract declares: order_id, user_id, order_total, country_code, created_at
Missing: user_id
Unexpected: customer_id
dbt build FAILED.
Step-by-step explanation.
- The dbt model contract mirrors the ODCS schema. The data_type values are Snowflake-specific translations of the ODCS types (STRING → varchar, DECIMAL(12,2) → decimal(12,2), TIMESTAMP → timestamp_tz). A sync tool can auto-generate this from the ODCS YAML.
- contract: enforced: true tells dbt to verify the model's output columns match the declared columns exactly. Before materialising the model, dbt runs a preflight check that inspects the column names and data types the model's SELECT would return, without building the table, and compares them to the contract.
- When the source table's user_id is renamed to customer_id and the model's SELECT follows the rename, the model yields customer_id instead of user_id. dbt's contract check detects the mismatch: user_id declared but missing, customer_id present but not declared. The build fails. (If the SELECT still names user_id explicitly, the warehouse rejects the query with an invalid-identifier error first; either way the build stops before the table is replaced.)
- The failure happens in dbt build, which runs in the analytics team's CI as a required check. The producer PR that caused the rename to propagate through the raw table is blocked at the point where it would break the dim table.
- Fixing forward: either update the source loader to keep populating user_id (aliasing from the renamed column), or update the dbt model + contract simultaneously in one PR. The contract change goes through the ODCS lint gate before the model change ships, ensuring the two stay in sync.
Output.
| Step | dbt outcome | Producer PR |
|---|---|---|
| dbt parse | pass | — |
| dbt compile | pass | — |
| dbt build (contract preflight) | FAIL: column mismatch | blocked |
| dbt build (materialise table) | not reached | blocked |
| dbt test | not reached | blocked |
Rule of thumb. For warehouse-materialised contracts, dbt contract: enforced: true is the runtime gate. Sync the ODCS schema into the dbt contract block via a code-generator; keep them 1:1. The dbt gate is what stops a producer-side rename from silently corrupting the dim table.
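A code generator for the dbt contract block can be very small. The sketch below assumes the same ODCS column shape the sync tool in section 3 reads (name, type, required) and the Snowflake type translations listed in the explanation above; the function name to_dbt_columns is hypothetical, and the output is a plain Python structure that a YAML writer of your choice would serialise.

```python
from typing import Any, Dict, List

# Snowflake translations named in this section; extend as the contract grows
ODCS_TO_SNOWFLAKE = {"STRING": "varchar", "TIMESTAMP": "timestamp_tz"}


def to_dbt_columns(odcs_schema: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Project ODCS columns into the `columns:` list of a dbt model contract."""
    columns = []
    for col in odcs_schema:
        odcs_type = col["type"]
        if odcs_type.startswith("DECIMAL"):
            # DECIMAL(12,2) carries over with the same precision and scale
            data_type = odcs_type.lower()
        else:
            # A KeyError on an unmapped type is deliberate: fail loudly, never guess
            data_type = ODCS_TO_SNOWFLAKE[odcs_type]
        entry: Dict[str, Any] = {"name": col["name"], "data_type": data_type}
        if col.get("required", False):
            entry["constraints"] = [{"type": "not_null"}]
        columns.append(entry)
    return columns


odcs_schema = [
    {"name": "order_id", "type": "STRING", "required": True},
    {"name": "order_total", "type": "DECIMAL(12,2)", "required": True},
    {"name": "created_at", "type": "TIMESTAMP", "required": True},
]
for column in to_dbt_columns(odcs_schema):
    print(column)
```

Running the generator in the same CI job that lints the contract, and failing when its output differs from the committed dim_orders.yml, is one way to keep the two files 1:1.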
Worked example — consumer-side runtime validator with DLQ routing
Detailed explanation. A streaming consumer that reads Avro-encoded messages needs to defend against payloads that don't conform to its reader schema. Show a Kafka consumer that (a) reads the schema-id, (b) fetches the writer schema from the registry, (c) attempts deserialisation, (d) routes failures to a DLQ topic with the original payload + error message + trace metadata.
- The failure mode. Producer accidentally publishes a message whose schema id points to a subject the registry no longer serves (deleted subject, disconnected cluster).
- The graceful handling. DLQ the message; alert the consumer team; keep the consumer running.
- The reckless handling. Throw uncaught; consumer dies; lag grows; on-call woken up.
Question. Implement a Python Kafka consumer that reads Avro-encoded messages from orders.public.v1, validates against the contract's reader schema, and routes deserialisation failures to orders.public.v1.dlq. Include the metadata payload.
Input.
| Component | Value |
|---|---|
| Source topic | orders.public.v1 |
| DLQ topic | orders.public.v1.dlq |
| Registry | http://registry:8081 |
| Consumer group | orders-consumer |
Code.
import json
import logging

from confluent_kafka import Consumer, Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroDeserializer
from confluent_kafka.serialization import SerializationError, MessageField, SerializationContext

log = logging.getLogger("consumer")
registry = SchemaRegistryClient({"url": "http://registry:8081"})

# The reader schema the consumer expects (from the ODCS contract, projected via sync tool)
with open("orders_reader.avsc") as f:
    READER_SCHEMA = f.read()
avro_deser = AvroDeserializer(registry, READER_SCHEMA)

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "orders-consumer",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,  # commit only on successful process
})
consumer.subscribe(["orders.public.v1"])
dlq = Producer({"bootstrap.servers": "kafka:9092"})


def route_to_dlq(msg, error: Exception):
    envelope = {
        "original_payload_hex": msg.value().hex() if msg.value() else None,
        "original_key": msg.key().decode("utf-8") if msg.key() else None,
        "topic": msg.topic(),
        "partition": msg.partition(),
        "offset": msg.offset(),
        "timestamp_ms": msg.timestamp()[1],
        "error_type": type(error).__name__,
        "error_message": str(error),
        "consumer_group": "orders-consumer",
        "trace_id": _extract_trace_id(msg),
    }
    dlq.produce(topic="orders.public.v1.dlq",
                key=msg.key(),
                value=json.dumps(envelope).encode("utf-8"))
    dlq.flush()  # block until the DLQ write is delivered
    log.warning("routed message offset=%s to DLQ: %s", msg.offset(), error)


def _extract_trace_id(msg):
    # confluent_kafka returns headers as a list of (str, bytes) tuples
    hdrs = dict(msg.headers() or [])
    raw = hdrs.get("trace_id")
    return raw.decode("utf-8") if raw else None


def process(order):
    # Real business logic; deliberately simple in the example
    log.info("processed order_id=%s total=%s", order["order_id"], order["order_total"])


try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            log.error("consumer error: %s", msg.error())
            continue
        try:
            ctx = SerializationContext(msg.topic(), MessageField.VALUE)
            order = avro_deser(msg.value(), ctx)
            process(order)
            consumer.commit(msg)
        except SerializationError as e:
            route_to_dlq(msg, e)
            consumer.commit(msg)  # commit past the poison pill so we don't loop
        except Exception as e:
            log.exception("unexpected error, routing to DLQ: %s", e)
            route_to_dlq(msg, e)
            consumer.commit(msg)
except KeyboardInterrupt:
    pass
finally:
    consumer.close()  # leave the group cleanly so partitions rebalance promptly
# DLQ envelope schema: the shape landed in orders.public.v1.dlq
type: object
properties:
  original_payload_hex: {type: [string, "null"]}
  original_key: {type: [string, "null"]}
  topic: {type: string}
  partition: {type: integer}
  offset: {type: integer}
  timestamp_ms: {type: integer}
  error_type: {type: string}
  error_message: {type: string}
  consumer_group: {type: string}
  trace_id: {type: [string, "null"]}
Step-by-step explanation.
- The consumer reader schema is the projection of the ODCS contract into Avro. The sync tool from section 3 generates it; the consumer reads it as a static file at startup.
- On each poll, the deserialiser reads the 5-byte wire-format prefix (a magic byte followed by the 4-byte schema ID), fetches the writer schema from the registry (cached), and projects the payload into the reader schema. Compatible payloads deserialise normally; incompatible ones throw SerializationError.
- On a SerializationError, the DLQ router builds an envelope containing the original bytes (as hex for JSON safety), the topic/partition/offset for replay reconstruction, the timestamp, the error class and message, and any trace-id from message headers. The envelope is JSON to keep the DLQ debuggable.
- Publishing to the DLQ topic uses a synchronous flush to guarantee the DLQ write lands before the consumer commits the source offset. Without the flush, a consumer crash could lose the DLQ message.
- consumer.commit(msg) after DLQ routing advances the source offset past the poison pill. Without this, the consumer would loop forever on the same bad message. The trade-off: the message is technically "processed" from the consumer's perspective; the DLQ is the durable record.
- Alerting is external: a small side-car scrapes the DLQ topic and emits Prometheus metrics; on-call gets paged when DLQ growth exceeds a threshold.
Output.
| Message | Deserialise result | Action |
|---|---|---|
| Compatible payload (any 1.x minor) | success | processed + committed |
| Payload with unknown schema-id | SerializationError | DLQ + committed |
| Payload with corrupt bytes | SerializationError | DLQ + committed |
| Valid payload, process() raises | success, then unexpected exception | DLQ + committed |
Rule of thumb. Every streaming consumer needs a DLQ. Not sometimes — every one. The DLQ is the shock absorber that keeps a single poison pill from taking down the consumer; it's also the audit trail that lets producer teams diagnose their own bugs. Ship the DLQ + envelope + alert wiring on day one; retrofitting is 10× the work.
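The envelope only earns the name "replayable" if the original bytes can be recovered from it. A minimal sketch of the decoding half of a replay tool, assuming the JSON envelope shape produced by route_to_dlq above; the function name decode_envelope is illustrative, and actually re-producing the message to a topic is left out.

```python
import json
from typing import Optional, Tuple


def decode_envelope(raw: bytes) -> Tuple[Optional[bytes], Optional[bytes], dict]:
    """Recover (key, value, metadata) from a JSON DLQ envelope."""
    env = json.loads(raw)
    hex_payload = env.get("original_payload_hex")
    value = bytes.fromhex(hex_payload) if hex_payload else None
    key_text = env.get("original_key")
    key = key_text.encode("utf-8") if key_text is not None else None
    # Everything else is diagnostic metadata: offsets, error class, trace id
    meta = {k: v for k, v in env.items()
            if k not in ("original_payload_hex", "original_key")}
    return key, value, meta


# Round trip: build an envelope the way the consumer does, then decode it
payload = b"\x00\x00\x00\x00\x2a\x08o-42"
raw = json.dumps({
    "original_payload_hex": payload.hex(),
    "original_key": "o-42",
    "topic": "orders.public.v1",
    "offset": 7,
}).encode("utf-8")

key, value, meta = decode_envelope(raw)
print(key, value == payload, meta)
```

Because the value comes back byte-for-byte, a producer team can feed it through a fixed deserialiser or republish it once the underlying bug is resolved.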
Senior interview question on contract enforcement
A senior interviewer might ask: "Design the full enforcement stack for a batch-plus-streaming dataset governed by an ODCS contract. Walk me through what CI runs, what runtime does, where the DLQ lives, how a producer PR gets from open-to-merged, and what happens when a rogue payload hits the runtime gate."
Solution Using a four-check CI + a two-gate runtime + envelope-shaped DLQ
Full enforcement stack — batch + streaming dataset
==================================================
CI (pre-merge, GitHub Actions)
  Check 1  odcs-lint <path>                          # YAML shape
  Check 2  odcs-lint diff --base main --path <path>  # semver
  Check 3  sync_odcs_to_registry --dry-run <path>    # registry compat
  Check 4  dbt parse + dbt build --select <model>    # dbt contract match (against a CI target)
Runtime: streaming path
  Gate A   producer-side: KafkaAvroSerializer talks to registry;
           with auto.register.schemas=false, an unregistered schema fails the PRODUCE
  Gate B   consumer-side: AvroDeserializer + reader schema;
           on SerializationError → DLQ envelope → alert
Runtime: batch path
  Gate C   loader (Airflow/dbt) reads schema of incoming file/table;
           validates it against the ODCS-derived contract (dbt enforces this during build);
           fail → quarantine bucket; pass → load
DLQ envelope (both streaming and batch)
  {original_bytes, error_class, error_msg, topic/table, offset/row_id,
   timestamp, trace_id, consumer_group}
Alerting
  metric dlq_growth_rate_5m > 0            → PagerDuty producer team
  metric contract_lint_fail_rate_daily > 0 → Slack platform-data
  metric freshness_probe_breach > 0        → PagerDuty producer team
# .github/workflows/enforce.yml: the full CI wiring
name: enforce
on:
  pull_request:
    paths:
      - '**/contract.yaml'
      - 'models/**'
jobs:
  ci:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with: {fetch-depth: 0}
      - uses: actions/setup-python@v5
        with: {python-version: '3.11'}
      - run: pip install odcs-lint==0.9.0 dbt-core==1.7.10 dbt-snowflake==1.7.4
      - name: 1. odcs-lint (shape)
        run: odcs-lint orders/contract.yaml
      - name: 2. odcs-lint (diff / semver)
        run: odcs-lint diff --base origin/main --path orders/contract.yaml
      - name: 3. registry compat (dry-run)
        env:
          REGISTRY_URL: ${{ secrets.REGISTRY_URL }}
        run: python .github/scripts/registry_compat_check.py --dry-run
      - name: 4. dbt contracts
        # The contract preflight runs during build, so this step needs a CI warehouse target
        run: |
          dbt deps
          dbt parse
          dbt build --select dim_orders
Step-by-step trace.
| Stage | Check | What it catches |
|---|---|---|
| CI check 1 | shape lint | YAML typos, missing blocks |
| CI check 2 | semver diff | breaking change without major bump |
| CI check 3 | registry dry-run | Avro/Proto incompatibility |
| CI check 4 | dbt contract | warehouse schema mismatch |
| Runtime gate A | producer serializer | writing with unregistered schema |
| Runtime gate B | consumer deserializer | receiving incompatible payload |
| Runtime gate C | batch loader | file/table with wrong schema |
| DLQ | envelope | preserves poison pill + metadata |
| Alert 1 | DLQ growth | consumer sees bad payloads |
| Alert 2 | Freshness probe | SLA breach |
The stack has three layers: intent (CI), reality (runtime), and observability (DLQ + alerts). Every failure mode has a home: CI catches producer intent errors; runtime catches producer or infrastructure reality errors; DLQ preserves everything with enough metadata to replay; alerts wake the right team.
Output:
| Failure mode | Where caught | Blast radius |
|---|---|---|
| Producer PR renames column | CI check 2 | 0 rows corrupted |
| Producer PR breaks BACKWARD | CI check 3 | 0 rows corrupted |
| Producer PR breaks dbt contract | CI check 4 | 0 rows corrupted |
| Producer writes with unregistered schema | Runtime gate A | 0 rows landed |
| Consumer receives incompatible payload | Runtime gate B → DLQ | 1 message quarantined |
| Batch loader receives bad file | Runtime gate C → quarantine bucket | 1 file quarantined |
| Freshness SLA breach | probe → PagerDuty | consumers still on stale data but aware |
Why this works — concept by concept:
- Four CI checks, not one — each check owns a distinct failure mode. Skipping any one leaves a class of bugs uncaught. The four-check pattern is what makes the CI gate credible.
- Two runtime gates — producer-side (Gate A) and consumer-side (Gate B) form a belt-and-braces. Gate A relies on the registry being reachable; Gate B protects against Gate A being bypassed or misconfigured.
- DLQ envelope structure — the envelope stores enough metadata (original bytes, offset, trace id, error) to replay after fixing the producer. A DLQ without full metadata is a lossy drop; with metadata it's a replayable audit log.
- Alerts tied to blast radius — DLQ growth pages the producer immediately (their bug); freshness breach pages the producer (their SLA); lint failure just posts to Slack (developer-loop, not production).
- Cost — the CI adds ~30 seconds to a producer PR; runtime gates add ~1 ms per message (registry lookup is cached); the DLQ costs ~1% of source topic storage. The avoided cost is measured in hours of incident time — one prevented Monday-morning war room pays for years of enforcement infrastructure.
5. SLAs + ownership + rollout
Freshness, availability, quality — the three SLA dials plus a phased rollout ladder from warn to block
The mental model in one line: the SLA block of an ODCS contract has three dials — freshness (max_freshness: PT15M), availability (availability_pct: 99.9), quality (per-assertion severity + rollout phase) — and a healthy rollout moves each assertion up a three-rung ladder (warn-only → block-on-new → block-everywhere) instead of shipping in enforcement mode from day one. SLA teeth without patience break more than they fix; SLA teeth with patience are how large organisations adopt contracts without whiplash.
The three SLA dials in detail.
- Freshness. max_freshness: PT15M is the maximum age of the newest row when the consumer reads. Enforced by a scheduled probe that compares now() - MAX(created_at) (or similar) against the declared threshold. Producers own the alert.
- Availability. availability_pct: 99.9 is the fraction of the SLA window (typically monthly) during which the dataset is queryable. Enforced by uptime probes (batch: dataset present and non-empty; streaming: topic reachable and consumer lag bounded).
- Quality. Per-assertion checks: null_rate(x) < 0.001, unique(id), x IN (allowed set). Enforced by dbt tests (batch) or streaming validators (Flink / Beam / consumer-side runtime). Each assertion has a severity (warn / block) and a rollout phase.
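The freshness and availability dials each reduce to one comparison. A minimal sketch using only the standard library, assuming a hypothetical parse_iso_duration helper that understands just the day/hour/minute/second subset of ISO 8601 durations used in this guide (PT15M, P90D); the function names and the probe-count inputs are illustrative, and a real probe would read them from the warehouse or the uptime monitor.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical helper: supports only the PnDTnHnMnS subset of ISO 8601 durations.
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def parse_iso_duration(text: str) -> timedelta:
    match = _DURATION.match(text)
    parts = {}
    if match:
        parts = {k: int(v) for k, v in match.groupdict().items() if v is not None}
    if not parts:
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(**parts)


def freshness_ok(newest_row_at: datetime, max_freshness: str, now: datetime) -> bool:
    """True when the newest row is no older than the declared max_freshness."""
    return now - newest_row_at <= parse_iso_duration(max_freshness)


def availability_ok(up_probes: int, total_probes: int, availability_pct: float) -> bool:
    """True when the share of successful uptime probes meets the declared percentage."""
    return 100.0 * up_probes / total_probes >= availability_pct


if __name__ == "__main__":
    now = datetime(2026, 7, 4, 12, 0, tzinfo=timezone.utc)
    print(freshness_ok(now - timedelta(minutes=10), "PT15M", now))
    print(availability_ok(9995, 10000, 99.9))
```

Keeping the threshold as the contract's own string (PT15M) and parsing it at probe time means the probe never drifts from the value the producer signed.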
The rollout ladder — warn → block-new → block-all.
- Rung 1 — warn-only. The assertion is evaluated; violations are logged and dashboarded; nothing is blocked. Purpose: measure the current state without breaking anything. Duration: 2–4 weeks typically.
- Rung 2 — block-on-new. New rows that violate the assertion are rejected (streaming) or quarantined (batch); existing rows are grandfathered. Purpose: stop the bleeding without invalidating historical data. Duration: 2–4 weeks.
- Rung 3 — block-everywhere. The assertion is treated as a hard constraint; any read returning violating rows fails the query. Purpose: full enforcement. Existing violations must be cleaned up before promoting to this rung.
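The three rungs differ only in what happens to a violating row. A small sketch of that decision, assuming hypothetical names (Rung, enforcement_action) and assuming that grandfathered rows under block-on-new are still logged; the source only says they are not rejected.

```python
from enum import Enum


class Rung(Enum):
    WARN_ONLY = "warn_only"
    BLOCK_ON_NEW = "block_on_new"
    BLOCK_ALL = "block_all"


def enforcement_action(rung: Rung, violates: bool, is_new_row: bool) -> str:
    """Return 'pass', 'log', or 'reject' for one row at the given rung."""
    if not violates:
        return "pass"
    if rung is Rung.WARN_ONLY:
        return "log"  # measured and dashboarded, never blocked
    if rung is Rung.BLOCK_ON_NEW:
        # Existing rows are grandfathered; logging them is an assumption of this sketch.
        return "reject" if is_new_row else "log"
    return "reject"  # BLOCK_ALL: hard constraint for every row


if __name__ == "__main__":
    for rung in Rung:
        print(rung.value, enforcement_action(rung, violates=True, is_new_row=False))
```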
The ownership matrix.
- Producer. Signs the contract; commits to the SLAs and quality assertions; owns the pager when SLA probes breach.
- Consumers. Subscribe to the contract; agree on the version they read; represent their needs in negotiation (asking for tighter SLAs, more assertions).
- Steward (platform / data-gov team). Owns the enforcement infrastructure (lint, sync, probes, DLQ); mediates producer-consumer disputes; maintains the ODCS shape and the tooling.
- On-call rotation. The specific PagerDuty (or equivalent) schedule that receives the pages. Usually the producer team, but the contract can specify a delegate for off-hours.
The deprecation contract.
- When. Any major-version bump that removes or breaks a field must ship a deprecation-window contract first.
- How long. Typically 30–90 days depending on the consumer count and their release velocity. High-consumer datasets get longer windows; low-consumer datasets can shorten.
- What. The intermediate contract marks the affected field with deprecated: true and adds a warn-only quality assertion that flags any consumer still reading the field.
- Who. The producer files the deprecation; the platform team communicates to consumers; the consumers acknowledge and migrate.
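The "when" and "how long" rules above lend themselves to a lint-style check. A minimal sketch, assuming schema blocks are lists of dicts with name and an optional deprecated flag (the shape used in this guide's contracts); both function names are hypothetical.

```python
from datetime import date, timedelta


def removed_without_deprecation(previous_schema, next_schema):
    """List fields dropped in next_schema that were never marked deprecated."""
    next_names = {col["name"] for col in next_schema}
    return [
        col["name"]
        for col in previous_schema
        if col["name"] not in next_names and not col.get("deprecated", False)
    ]


def earliest_cutover(announced_on: date, window_days: int) -> date:
    """First day the breaking major version may merge."""
    return announced_on + timedelta(days=window_days)


if __name__ == "__main__":
    previous = [
        {"name": "order_id"},
        {"name": "country_code", "deprecated": True},
        {"name": "region"},
    ]
    proposed = [{"name": "order_id"}]
    print(removed_without_deprecation(previous, proposed))
    print(earliest_cutover(date(2026, 7, 4), 60))
```

A CI job could fail the major-version PR whenever the first function returns a non-empty list or today's date is earlier than the second function's result.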
Common interview probes on SLAs + rollout.
- "How do you introduce a new quality assertion to a dataset with existing violations?" Answer: warn-only → block-on-new → block-all, over 6–12 weeks.
- "Freshness SLA: who owns the pager?" Answer: the producer team, always.
- "How do you handle a contract change with 20 downstream consumers?" Answer: deprecation window; 60–90 days; explicit consumer acknowledgement.
- "What's the ODCS field for retention?" Answer: sla.retention: P90D.
Worked example — introducing a new quality assertion to a dirty dataset
Detailed explanation. The team wants to enforce null_rate(user_id) == 0 on the orders.public dataset. Current reality: 0.3% of historical rows have a null user_id (an old bug). Shipping this in block mode immediately would fail every query. Walk through the three-rung rollout that lands the assertion in block mode without any downstream breakage.
- Week 0. Add assertion as severity: block, rollout: warn_only. Every consumer sees the assertion in the contract but nothing blocks.
- Week 4. Producer team ships a fix that populates user_id for new rows.
- Week 4. Promote assertion to rollout: block_on_new. Only rows written after this rung are subject to enforcement.
- Week 6. Backfill script fixes historical null rows.
- Week 8. Promote assertion to rollout: block_all. Full enforcement.
Question. Show the three contract versions (weeks 0, 4, 8), the CI check that verifies the rollout, and the streaming validator that enforces block-on-new.
Input.
| Rung | Duration | Producer action | Consumer action |
|---|---|---|---|
| warn_only | weeks 0-4 | fix producer code | none |
| block_on_new | weeks 4-8 | backfill historical | none |
| block_all | week 8+ | ongoing enforcement | none |
Code.
# Week 0: assertion in warn-only
version: 1.5.0
quality:
  - assertion: null_rate(user_id) == 0
    severity: block
    rollout: warn_only        # rollout phase; not yet blocking
    since: "2026-07-04T00:00:00Z"

# Week 4: assertion promoted to block-on-new
version: 1.6.0
quality:
  - assertion: null_rate(user_id) == 0
    severity: block
    rollout: block_on_new
    since: "2026-07-04T00:00:00Z"
    block_cutoff: "2026-08-01T00:00:00Z"   # rows before this stay grandfathered

# Week 8: assertion promoted to block-all
version: 1.7.0
quality:
  - assertion: null_rate(user_id) == 0
    severity: block
    rollout: block_all
    since: "2026-07-04T00:00:00Z"
# streaming_validator.py: enforces block_on_new using the block_cutoff
import yaml
from datetime import datetime
from confluent_kafka import Consumer, Producer  # used by the consumer loop sketched at the bottom

with open("orders/contract.yaml") as f:
    contract = yaml.safe_load(f)
assertions = contract["quality"]


def evaluate(record):
    """Run each block-rated assertion; return list of violations."""
    violations = []
    for a in assertions:
        if a["severity"] != "block":
            continue
        if a["rollout"] == "warn_only":
            continue  # measured but not enforced
        if a["rollout"] == "block_on_new":
            # Stripping the trailing Z yields a naive datetime, so this assumes
            # record["created_at"] is also a naive UTC datetime.
            cutoff = datetime.fromisoformat(a["block_cutoff"].rstrip("Z"))
            record_ts = record["created_at"]
            if record_ts < cutoff:
                continue  # grandfathered
        if not _check_expr(a["assertion"], record):
            violations.append(a["assertion"])
    return violations


def _check_expr(expr: str, record):
    # Simplified: a real implementation parses the assertion DSL
    if "null_rate(user_id) == 0" in expr:
        return record.get("user_id") is not None
    return True


# In the consumer/validator loop:
#   violations = evaluate(record)
#   if violations: route_to_dlq(record, violations)
#   else: process(record)
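The question also asks for a CI check that verifies the rollout. A minimal sketch of one, assuming a hypothetical check_promotion function that receives the previous and the proposed contract as already-parsed dicts. The rules it enforces (new assertions start in warn_only, at most one rung per version, a minor bump per promotion, block_on_new carries a block_cutoff) are read off the week-by-week plan above and are not the documented behaviour of any lint tool.

```python
RUNGS = ["warn_only", "block_on_new", "block_all"]


def _major_minor(version: str):
    major, minor, _patch = (int(part) for part in version.split("."))
    return major, minor


def check_promotion(old: dict, new: dict) -> list[str]:
    """Return a list of rollout errors; an empty list means the PR may merge."""
    errors = []
    old_major, old_minor = _major_minor(old["version"])
    previous = {a["assertion"]: a for a in old["quality"]}
    for a in new["quality"]:
        name = a["assertion"]
        before = previous.get(name)
        if before is None:
            if a["rollout"] != "warn_only":
                errors.append(f"{name}: new assertions must start in warn_only")
            continue
        step = RUNGS.index(a["rollout"]) - RUNGS.index(before["rollout"])
        if step > 1:
            errors.append(f"{name}: skips a rung")
        if step == 1 and _major_minor(new["version"]) != (old_major, old_minor + 1):
            errors.append(f"{name}: a promotion requires a minor version bump")
        if a["rollout"] == "block_on_new" and "block_cutoff" not in a:
            errors.append(f"{name}: block_on_new requires block_cutoff")
    return errors


if __name__ == "__main__":
    expr = "null_rate(user_id) == 0"
    week0 = {"version": "1.5.0", "quality": [{"assertion": expr, "rollout": "warn_only"}]}
    week8 = {"version": "1.6.0", "quality": [{"assertion": expr, "rollout": "block_all"}]}
    print(check_promotion(week0, week8))  # jumping straight to block_all is flagged
```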
Step-by-step explanation.
- Week 0 ships the assertion in rollout: warn_only. The validator evaluates the assertion, logs violations, and updates a Prometheus counter, but does not block. The producer team sees a null_rate(user_id) = 0.003 gauge; they diagnose and fix the producer code.
- Week 4 the producer fix is deployed. New rows have user_id populated. The team promotes the assertion to rollout: block_on_new with block_cutoff: 2026-08-01. From this timestamp forward, any new row with user_id = null fails the assertion and routes to DLQ. Historical rows (before the cutoff) stay in the dataset untouched.
- Week 6 a backfill script fixes the 0.3% of historical rows. The team runs SELECT COUNT(*) FROM orders WHERE user_id IS NULL to confirm zero. The backfill is one SQL migration + a dbt macro; producer team owns it.
- Week 8 the team promotes to rollout: block_all. The block_cutoff field is removed. Any row that returns user_id = null in query results (there should be zero now) fails the assertion. Consumers can rely on the guarantee.
- Each promotion is a version bump with a lint-checked contract. The rollout ladder is the semver-safe path: warn_only → block_on_new → block_all is always a minor bump per rung; producers can move quickly without breaking the contract semantics.
Output.
| Week | Contract version | Rollout phase | Violations blocked? | Producer status |
|---|---|---|---|---|
| 0 | 1.5.0 | warn_only | no | fixing code |
| 4 | 1.6.0 | block_on_new | new rows only | code fixed |
| 6 | 1.6.0 | block_on_new | new rows only | backfilling |
| 8 | 1.7.0 | block_all | all rows | steady state |
Rule of thumb. Never introduce a quality assertion in block mode against a dirty dataset. The three-rung ladder is the paved path — warn-only to discover the mess, block-on-new to stop the bleeding, block-all after cleanup. Shortcuts here are the number-one cause of "we tried contracts and they broke everything."
Worked example — a freshness SLA with a per-partition twist
Detailed explanation. A dataset ingested from multiple sources needs a per-partition freshness SLA — the global MAX(created_at) is misleading if one active partition is fresh but a low-volume partition is 8 hours stale. Show a partition-aware freshness probe with per-partition alerting.
- The pitfall. Global freshness looks fine; per-partition freshness reveals the failure.
- The fix. Probe reads MAX(created_at) per partition; alerts on any partition older than SLA.
- The contract. Same max_freshness: PT15M but the probe evaluates per-partition.
Question. Design the probe query and the alerting logic that catches a stale partition without paging on healthy ones.
Input.
| Element | Value |
|---|---|
| Partitioning column | source_id |
| SLA | 15 minutes per partition |
| Sources | 8 sources with wildly different volumes |
| Alert dedup window | 10 minutes |
Code.
# Contract addition: per-partition freshness
sla:
  max_freshness: PT15M
  freshness_scope:
    granularity: per_partition
    partition_column: source_id
-- Per-partition freshness probe
WITH per_partition AS (
    SELECT
        source_id,
        MAX(created_at) AS latest_row,
        EXTRACT(EPOCH FROM (NOW() - MAX(created_at))) AS staleness_seconds,
        COUNT(*) FILTER (WHERE created_at > NOW() - INTERVAL '1 hour') AS rows_last_1h
    FROM warehouse.orders
    GROUP BY source_id
)
SELECT
    source_id,
    latest_row,
    staleness_seconds,
    rows_last_1h,
    CASE
        WHEN staleness_seconds > 900 AND rows_last_1h = 0
            THEN 'stale_and_idle'    -- likely broken source
        WHEN staleness_seconds > 900 AND rows_last_1h > 0
            THEN 'stale_and_active'  -- lag; may recover
        ELSE 'ok'
    END AS status
FROM per_partition
ORDER BY staleness_seconds DESC;
# probe_scheduler.py: reads contract, runs probe, alerts per partition
import time

import psycopg2
import requests
import yaml
from isodate import parse_duration
from psycopg2 import sql

with open("orders/contract.yaml") as f:
    contract = yaml.safe_load(f)
sla_s = parse_duration(contract["sla"]["max_freshness"]).total_seconds()
partition_col = contract["sla"]["freshness_scope"]["partition_column"]

conn = psycopg2.connect("postgres://…")
conn.autocommit = True  # without this, NOW() stays frozen at the first transaction's start
cur = conn.cursor()

PAGERDUTY_URL = "https://events.pagerduty.com/v2/enqueue"


def probe():
    # Quote the column as an identifier instead of interpolating raw text into the SQL.
    cur.execute(sql.SQL("""
        SELECT {col},
               EXTRACT(EPOCH FROM (NOW() - MAX(created_at)))
        FROM warehouse.orders
        GROUP BY {col};
    """).format(col=sql.Identifier(partition_col)))
    return cur.fetchall()


def send_event(action, partition, summary=None):
    event = {
        "routing_key": contract["roles"]["on_call_rotation"],
        "event_action": action,
        "dedup_key": f"orders-freshness-{partition}",
    }
    if summary:
        event["payload"] = {"summary": summary, "severity": "error", "source": "contract-probe"}
    requests.post(PAGERDUTY_URL, json=event, timeout=10)


# Track consecutive breaches per partition
consecutive = {}
while True:
    for partition, staleness in probe():
        if staleness > sla_s:
            consecutive[partition] = consecutive.get(partition, 0) + 1
            if consecutive[partition] == 3:  # page once per breach cycle
                send_event("trigger", partition,
                           f"orders partition {partition} freshness > SLA "
                           f"({staleness:.0f}s > {sla_s:.0f}s)")
        else:
            if consecutive.get(partition, 0) >= 3:
                send_event("resolve", partition)  # close the incident on recovery
            consecutive[partition] = 0
    time.sleep(300)
Step-by-step explanation.
- The contract now declares freshness_scope.granularity: per_partition with partition_column: source_id. The probe reads this and pivots from a single global freshness measurement to a group-by-partition query.
- The SQL probe returns one row per partition with staleness_seconds. The added status column disambiguates two failure modes: stale_and_idle (no new rows in the last hour, so probably a broken source) versus stale_and_active (some rows arriving, but the latest is old, so probably a queue lag).
- The Python scheduler tracks consecutive breaches per partition to avoid paging on a single-scrape flake. Three consecutive breaches on a 5-minute probe interval amount to roughly 15 minutes of sustained lag, and that is when the page fires.
- dedup_key: orders-freshness-<partition> scopes deduplication per partition. If two partitions are stale at once, two independent incidents fire; each auto-resolves when its partition recovers.
- When a partition recovers (staleness drops back below SLA), consecutive[partition] resets to zero and the next breach starts a new counter. The alert path is one-shot per breach cycle; the on-call sees exactly one incident per genuine stale period per partition.
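The counting logic in the steps above can be isolated from the database and the pager so it is easy to unit-test. A minimal sketch, assuming a hypothetical BreachTracker class; it returns the event a caller should send ("trigger", "resolve", or nothing) and leaves the actual HTTP call to the scheduler.

```python
from collections import defaultdict


class BreachTracker:
    """Counts consecutive SLA breaches per partition and decides when to alert."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.consecutive = defaultdict(int)

    def observe(self, partition: str, staleness_s: float, sla_s: float):
        """Return 'trigger', 'resolve', or None for one probe reading."""
        if staleness_s > sla_s:
            self.consecutive[partition] += 1
            # Fire exactly once, on the probe that reaches the threshold.
            if self.consecutive[partition] == self.threshold:
                return "trigger"
            return None
        was_alerting = self.consecutive[partition] >= self.threshold
        self.consecutive[partition] = 0
        return "resolve" if was_alerting else None


if __name__ == "__main__":
    tracker = BreachTracker()
    for reading in (950, 950, 950, 950, 100):
        print(reading, tracker.observe("src_b", reading, sla_s=900))
```

Keeping the state machine pure means the one-page-per-breach-cycle guarantee can be asserted in CI without a running warehouse.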
Output.
| Partition | Staleness | rows_last_1h | Status | Alert? |
|---|---|---|---|---|
| src_a | 40 s | 12000 | ok | no |
| src_b | 950 s | 20 | stale_and_active | if sustained 15 min |
| src_c | 30000 s | 0 | stale_and_idle | immediate (adjust threshold) |
| src_d | 60 s | 1200 | ok | no |
Rule of thumb. Global freshness SLAs hide per-partition failures. If your dataset has any partitioning that matters, encode it in the SLA scope and probe accordingly. The extra alerting fidelity is what catches a broken source-of-the-month before the executive dashboard notices.
Worked example — the deprecation announcement flow
Detailed explanation. A producer needs to deprecate the country_code column and rename to country_alpha2. Twenty downstream consumers depend on the field. Walk through the deprecation announcement flow — the contract diff, the automated consumer notification, the acknowledgement tracking, and the cutover.
- The change. Rename country_code → country_alpha2; deprecation window 60 days.
- The announcement. Automated Slack + email + PR comment to every consumer team listed in roles.consumers.
- The tracking. A dashboard shows per-consumer acknowledgement status.
- The cutover. After 60 days, the 2.0.0 contract removes country_code.
Question. Design the announcement + tracking flow. Show the CODEOWNERS + Slack + tracking-dashboard integration.
Input.
| Element | Value |
|---|---|
| Contract | orders.public 1.7.0 (before) → 1.8.0 (deprecation) → 2.0.0 (cutover) |
| Consumers | 20 teams |
| Deprecation window | 60 days |
| Notification channels | Slack, email, PR comment |
Code.
# 1.8.0 deprecation-window contract
version: 1.8.0
info:
  status: active
  changelog:
    - version: 1.8.0
      date: "2026-07-04"
      change: "Deprecate country_code; introduce country_alpha2 as canonical."
      cutover_at: "2026-09-02T00:00:00Z"
      breaking_in: 2.0.0
schema:
  - name: country_code
    type: STRING
    deprecated: true
    description: "DEPRECATED. Use country_alpha2. Removed in 2.0.0 on 2026-09-02."
  - name: country_alpha2
    type: STRING
    required: false
    description: "Canonical ISO-3166 alpha-2 name."
# announce_deprecation.py: triggered by the CI on merge of the 1.8.0 contract
import os

import requests
import yaml

with open("orders/contract.yaml") as f:
    contract = yaml.safe_load(f)
consumers = contract["roles"]["consumers"]
changelog = contract["info"]["changelog"][-1]  # newest entry is last

msg = (f":warning: *Contract deprecation* :warning:\n"
       f"Dataset: `{contract['id']}` v{contract['version']}\n"
       f"Change: {changelog['change']}\n"
       f"Cutover: {changelog['cutover_at']} (v{changelog['breaking_in']})\n"
       f"Please migrate before cutover. Acknowledge via `/contract ack {contract['id']}`.")

for team in consumers:
    # Assumes the webhook may override the destination channel; webhooks created
    # for a Slack app are bound to one channel and ignore this field.
    resp = requests.post(os.environ["SLACK_WEBHOOK_URL"], json={
        "channel": f"#{team}",
        "text": msg,
    }, timeout=10)
    resp.raise_for_status()  # fail the CI job if a notification was not delivered
-- Tracking dashboard: per-consumer acknowledgement status
-- Expands d.consumers (an array column) so teams that have not acked still get a row.
SELECT
    c.consumer_team,
    a.acked_at,
    CASE WHEN a.acked_at IS NULL THEN 'PENDING' ELSE 'ACKED' END AS status
FROM contract_deprecations d
CROSS JOIN LATERAL unnest(d.consumers) AS c(consumer_team)
LEFT JOIN contract_acks a
    ON a.contract_id = d.contract_id
    AND a.consumer_team = c.consumer_team
WHERE d.contract_id = 'orders.public'
    AND d.deprecation_version = '1.8.0'
ORDER BY a.acked_at NULLS FIRST;
Step-by-step explanation.
- The 1.8.0 contract's info.changelog records the deprecation intent, the cutover date, and the target major version. The country_code column is marked deprecated: true in the schema block; country_alpha2 is added as optional.
- On merge of the 1.8.0 PR, a CI job runs announce_deprecation.py. It reads the consumers list, formats a Slack message with the changelog, and posts to each consumer team's channel. Email and per-team PR comments follow the same template.
- The tracking dashboard reads a contract_deprecations table (populated by the CI job) and a contract_acks table (populated by a /contract ack Slack command). The join produces per-consumer status: PENDING or ACKED.
- As consumers migrate, they run /contract ack orders.public in Slack; the command writes to contract_acks. The dashboard updates in real time. The steward reviews it weekly; teams that don't ack after 30 days get escalated to their manager.
- On cutover day (60 days after the deprecation contract merged), the producer merges the 2.0.0 contract that removes country_code. The CI checks: (a) 2.0.0 major bump matches the removal (odcs-lint OK); (b) every consumer in roles.consumers has an ACKED status in the tracking table; if any is PENDING, the CI blocks with a "cannot cut over: pending consumers" error.
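Check (b) in the last step is a set difference between the contract's consumer list and the acknowledgement table. A minimal sketch, assuming the acks have already been loaded into a dict of team name to acknowledgement timestamp; pending_consumers and cutover_gate are hypothetical names, and the error wording follows the step above.

```python
def pending_consumers(consumers, acks):
    """Teams listed in roles.consumers with no recorded acknowledgement."""
    return sorted(team for team in consumers if not acks.get(team))


def cutover_gate(consumers, acks):
    """Return (ok, message); a CI job would exit non-zero when ok is False."""
    pending = pending_consumers(consumers, acks)
    if pending:
        return False, "cannot cut over: pending consumers: " + ", ".join(pending)
    return True, "all consumers acknowledged; cutover may proceed"


if __name__ == "__main__":
    consumers = ["analytics", "billing", "growth"]
    acks = {"analytics": "2026-07-20T09:00:00Z", "billing": None}
    print(cutover_gate(consumers, acks))
```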
Output.
| Day | Contract | Consumer acked | Cutover blocked? |
|---|---|---|---|
| 0 | 1.8.0 | 0/20 | — |
| 15 | 1.8.0 | 8/20 | not yet |
| 45 | 1.8.0 | 18/20 | 2 pending → escalate |
| 60 | 2.0.0 PR | 20/20 | no; cutover proceeds |
Rule of thumb. Deprecation windows without tracked acknowledgements are wishful thinking. Wire the /contract ack command, dashboard, and CI gate on day one — the producer never has to guess whether consumers are ready; the CI answers definitively.
Senior interview question on SLAs, ownership, and rollout
A senior interviewer might ask: "You're introducing data contracts to an org with 300 datasets, 40 teams, and a history of silent schema breaks. Walk me through the first-quarter rollout plan — how you'd sequence the SLAs, the enforcement, the consumer acknowledgements, and the on-call ownership handoff."
Solution Using the four-phase rollout + explicit owner handoff + SLA measurement lag
Quarter-1 rollout plan — data contracts across 300 datasets
============================================================
Weeks 0-2 — Foundation
- Ship odcs-lint, sync tool, registry-compat-check as GitHub Actions
- Ship dbt contract enforcement in warehouse-side CI
- Deploy freshness probe framework
- Stand up ownership dashboard
Weeks 2-6 — Contract generation (all in warn-only)
- Auto-generate ODCS contracts from existing dbt models + Avro schemas
- Each contract begins in status: draft, all assertions warn_only
- Producer teams review + edit + status: active
- Consumer teams subscribe by adding themselves to roles.consumers
Weeks 4-8 — SLA measurement (still warn-only)
- Every contract with an SLA gets a probe
- Freshness / availability / quality metrics gathered
- Producer teams see their "actual vs promised" gap
- Some teams negotiate looser SLAs to match reality
Weeks 6-10 — Rollout ladder
- Contract quality assertions promote warn_only → block_on_new
- Producer teams fix or backfill dirty data
- Consumer teams unaffected (they're already reading grandfathered data)
Weeks 10-13 — Full enforcement
- Assertions promote block_on_new → block_all where data is clean
- Remaining dirty datasets stay block_on_new with a scheduled cleanup
- Ownership handoff: on-call rotation lives in producer team calendar
Ongoing
- Every new dataset must ship with a contract (block_on_new for schema; warn_only for quality)
- Deprecation window minimum: 30 days for internal; 60 days for cross-team; 90 days for external
- Quarterly SLA review: producer + consumer sit down; re-negotiate if needed
Step-by-step trace.
| Week | Milestone | Datasets in phase |
|---|---|---|
| 2 | Tooling live | 0 contracts, 300 datasets uncontracted |
| 6 | Contracts drafted | 300 in draft, warn_only |
| 8 | SLAs measured | 300 with measured metrics |
| 10 | Block-on-new active | 200 promoted, 100 still measuring |
| 13 | Block-all active | 150 fully enforced, 150 block_on_new |
| 13 (Q1 end) | Ownership dashboard | Every dataset has a producer + on-call |
The plan sequences discovery before enforcement, measurement before promises, and paved paths before hard rules. Quarter 1 ends with every dataset contracted; enforcement rolls forward through quarters 2 and 3 as data cleanup completes.
Output:
| Quarter-1 milestone | Result |
|---|---|
| Contracts drafted | 300/300 |
| SLAs measured | 300/300 |
| Producer ownership assigned | 300/300 |
| Consumers acknowledged | 300/300 |
| Assertions in block-all | 150/300 |
| Assertions in block-on-new | 150/300 |
| Incident rate (silent schema breaks) | ~75% lower vs pre-rollout |
Why this works — concept by concept:
- Sequence discovery before enforcement — measuring the current state (warn-only) before promising anything (block) is what makes SLAs credible. Teams cannot commit to a freshness they never measured.
- Auto-generation seeds contracts — hand-writing 300 ODCS files is a non-starter; auto-generation from existing dbt models + Avro schemas produces first-draft contracts in hours, not weeks. Producers edit and sign; the marginal cost per contract drops from days to minutes.
- Rollout ladder is dataset-scoped — each dataset moves through the ladder at its own pace based on data cleanliness. The org's rollout is the median dataset's rollout; slow ones stay in block_on_new longer without holding back the fast ones.
- Ownership dashboard as the anchor — the dashboard tracks producer + on-call + consumer acknowledgement. It's the single artefact leadership reviews weekly; a dataset without an owner is immediately visible and immediately actionable.
- Cost — quarter of platform-engineer time + roughly one producer-engineer day per contract. The avoided cost is the incident lift: pre-rollout the org averaged 12 silent-break incidents per quarter at ~8 hours each (96 hours); post-rollout the number drops to ~3 at the same 8 hours (24 hours). The saved 72 hours per quarter dwarfs the setup cost by month three.
Cheat sheet — data contract recipes
- The four axes. Schema (columns + types + PII + unique), SLA (freshness + availability + quality + retention), quality (assertions + severity + rollout), ownership (producer + consumers + steward + on-call). A contract missing any axis is a schema-registry entry, not a data contract.
- The 20-line ODCS template. apiVersion: v3.0.0, kind: DataContract, id: <domain>.<name>, version: <semver>, info: {title, owner, status}, schema: [{name, type, required, unique, pii, description, deprecated}], sla: {max_freshness, availability_pct, retention}, quality: [{assertion, severity, rollout}], roles: {producer, consumers, steward, on_call_rotation}, servers: [{type, dsn}]. Every contract fits on one screen.
- Semver rules. Additive optional column = minor; additive required column = major; drop column = major; type widen = minor; type narrow = major; tighten SLA = major; loosen SLA = minor; description-only = patch. When in doubt, bump major and offer a deprecation window.
- Registry compatibility modes. BACKWARD (consumer upgrades first; producer can add optional fields) is the streaming default; FORWARD (producer upgrades first; consumer keeps old schema); FULL (both directions); TRANSITIVE variants require compat across all previous versions. Use BACKWARD_TRANSITIVE for high-longevity subjects.
- Avro BACKWARD-safe changes. Add optional field with default; widen type (int → long, float → double); rename via aliases. Never add a field without a default and never narrow a type: the registry will reject both. Removing a field passes a BACKWARD check (the new reader simply ignores it) but still breaks any consumer that reads it, so treat it as a major bump with a deprecation window.
- Protobuf FULL-safe changes. Add optional field with new field number; add new enum values (readers preserve unknowns); rename via option deprecated = true while keeping the number. Never reuse a field number; never remove a required field.
- Subject-naming strategies. TopicNameStrategy for single-event topics (default; ODCS 1:1); TopicRecordNameStrategy for multi-event topics; RecordNameStrategy for shared record types across topics. Pick per-topic; never mix within a topic.
- CI enforcement stack. (1) odcs-lint shape, (2) odcs-lint diff semver, (3) sync_odcs_to_registry --dry-run, (4) dbt parse + dbt compile --contracts. Four checks as required status checks on every producer PR.
- Runtime enforcement stack. Producer-side: KafkaAvroSerializer talks to registry, unknown schema-id fails PRODUCE. Consumer-side: AvroDeserializer + reader schema, SerializationError → DLQ envelope. Batch loader: schema check against contract before write; failure → quarantine bucket.
- DLQ envelope. {original_payload_hex, original_key, topic, partition, offset, timestamp_ms, error_type, error_message, consumer_group, trace_id}. Enough metadata to replay after fix; JSON so it stays debuggable.
- Rollout ladder for a quality assertion. Week 0-4 warn_only (measure current state); week 4-8 block_on_new (stop the bleeding, grandfather old rows); week 8+ block_all (full enforcement after cleanup). Never ship in block_all against a dirty dataset.
- Deprecation window. Internal 30 days; cross-team 60 days; external 90 days. Intermediate contract marks the field deprecated: true; announcement bot pings every consumer; /contract ack command tracks acknowledgement; cutover CI blocks if any consumer is PENDING.
- Freshness probe. now() - MAX(created_at) < parse_duration(sla.max_freshness). Runs every 5 minutes; alerts after 3 consecutive breaches (≥15 min sustained). Per-partition scope if the dataset partitions matter.
- Ownership binding. roles.producer → CODEOWNERS entry on the contract file → required PR review. Add on_call_rotation for the pager destination. Audit trail = git log. Producer cannot claim ignorance.
- dbt contract. contract: enforced: true on the model config; columns block mirrors ODCS schema block with warehouse-specific type names. Sync tool auto-generates. dbt compile fails on column mismatch; dbt build fails on constraint violation.
- Change-signal channels. Slack for consumers listed in roles.consumers; PR comment auto-posted with changelog; email for offline audiences; DataHub / Backstage for catalogue subscribers. Never expect consumers to poll git.
- Contract quarterly review. Producer + consumers + steward meet quarterly; review SLA hits/misses; renegotiate loosen/tighten; retire deprecated fields; add new consumers. Cadence prevents drift.
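The semver rules in the cheat sheet map directly onto a lookup table. A minimal sketch, assuming hypothetical change-kind labels (add_optional_column and so on) that a diff tool would emit; unknown kinds fall back to a major bump, following the "when in doubt, bump major" rule.

```python
# Change kinds are illustrative labels; the bump levels come from the cheat sheet.
BUMP_FOR_CHANGE = {
    "add_optional_column": "minor",
    "add_required_column": "major",
    "drop_column": "major",
    "widen_type": "minor",
    "narrow_type": "major",
    "tighten_sla": "major",
    "loosen_sla": "minor",
    "description_only": "patch",
}
_RANK = {"patch": 0, "minor": 1, "major": 2}


def required_bump(changes) -> str:
    """Largest bump demanded by any change; unknown change kinds count as major."""
    levels = (BUMP_FOR_CHANGE.get(change, "major") for change in changes)
    return max(levels, key=_RANK.__getitem__, default="patch")


def next_version(current: str, changes) -> str:
    major, minor, patch = (int(part) for part in current.split("."))
    bump = required_bump(changes)
    if bump == "major":
        return f"{major + 1}.0.0"
    if bump == "minor":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"


if __name__ == "__main__":
    print(next_version("1.7.0", ["add_optional_column", "description_only"]))
    print(next_version("1.8.0", ["drop_column"]))
```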
Frequently asked questions
What is a data contract and how is it different from a schema?
A data contract is a machine-readable, version-controlled, producer-signed agreement that declares the schema, SLA, quality assertions, and ownership of a dataset — all four axes together. A schema (Avro, Protobuf, JSON Schema, dbt model columns) is just the shape — the column names, types, and nullability. A contract wraps the schema with promises about timing (freshness, availability), content (null-rate ceilings, uniqueness, range constraints), and humans (producer team, consumer teams, steward, on-call rotation). The Open Data Contract Standard (ODCS) v3.x is the vendor-neutral YAML shape most 2026 platforms have converged on; Confluent Schema Registry and dbt model contracts are downstream materialisations that a sync tool derives from the ODCS source. In one line: the schema is what shape the data has; the contract is what promises the producer commits to keep.
What is the Open Data Contract Standard (ODCS)?
ODCS is a Linux Foundation / Bitol-hosted, vendor-neutral YAML specification for data contracts, currently at v3.x. The top-level blocks are info (metadata), schema (columns with types + PII + constraints), sla (freshness / availability / retention / query latency), quality (assertions with severity and rollout phase), roles (producer + consumers + steward + on-call), and servers (physical materialisations — Snowflake table, Kafka topic, S3 prefix). Version is semver: patch for description-only, minor for additive optional, major for breaking. The file is checked into the producer's repo, linted in CI (odcs-lint), and synced into downstream tooling (registry, dbt contracts, Great Expectations, DataHub). Adoption grew rapidly in 2025-2026 because it's the first data-contract shape that vendors don't control — teams can adopt without lock-in.
Do I still need a Schema Registry if I have data contracts?
Yes — the schema registry and the data contract play different roles. The data contract (ODCS YAML) lives in git; it's the design-time source of truth reviewed by humans and enforced by CI. The schema registry (Confluent, Apicurio) lives in the streaming stack; it's the runtime source of truth consulted by producers and consumers on every message. A sync tool projects the ODCS schema block into the registry as an Avro / Protobuf schema and enforces the registry's compatibility mode (BACKWARD, FORWARD, FULL). The registry catches producer-side bugs at the point of PRODUCE (the registry rejects an incompatible schema); the contract catches producer-side intent at the point of merge (CI blocks the PR). Together they form the belt-and-braces: contract for intent, registry for wire.
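To make the registry's side of that split concrete, here is a deliberately simplified BACKWARD check for flat records with primitive types. It is a sketch of the rule (the new schema must be able to read data written with the old one), not the registry's actual algorithm; the dict shape and the backward_violations name are assumptions, and unions, nested records, and aliases are ignored.

```python
# Promotions a reader may apply to writer data, as listed in the Avro specification.
PROMOTIONS = {
    "int": {"long", "float", "double"},
    "long": {"float", "double"},
    "float": {"double"},
    "string": {"bytes"},
    "bytes": {"string"},
}


def backward_violations(old_fields: dict, new_fields: dict) -> list[str]:
    """old_fields / new_fields map field name -> {"type": str, optional "default": ...}."""
    problems = []
    for name, new in new_fields.items():
        old = old_fields.get(name)
        if old is None:
            # A field the old writer never produced needs a default to fill in.
            if "default" not in new:
                problems.append(f"{name}: added without a default")
            continue
        if new["type"] != old["type"] and new["type"] not in PROMOTIONS.get(old["type"], set()):
            problems.append(f"{name}: {old['type']} cannot be read as {new['type']}")
    # Fields present only in old_fields are ignored by the new reader, so removal passes.
    return problems


if __name__ == "__main__":
    old = {"order_id": {"type": "string"}, "amount": {"type": "int"}}
    new = {
        "order_id": {"type": "string"},
        "amount": {"type": "long"},
        "user_id": {"type": "string"},
    }
    print(backward_violations(old, new))
```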
How do I enforce a data contract at runtime?
Runtime enforcement has three gates. Producer-side (streaming): the KafkaAvroSerializer (or Protobuf equivalent) looks up or registers its schema in the registry the first time it serializes for a subject and caches the returned schema id for later messages; if the schema is not registered or is incompatible with the subject's compatibility mode, PRODUCE fails and the producer surfaces the error. Consumer-side (streaming): the AvroDeserializer reads the schema id from the wire-format prefix of the payload (a magic byte followed by a 4-byte schema id), fetches the writer schema, and projects into the consumer's reader schema (from the contract); incompatible payloads throw SerializationError and route to a DLQ topic with an envelope containing the original bytes + error + trace-id + offset. Batch: the loader (Airflow, dbt, Fivetran) checks the incoming file's schema against the contract before write; failures land in a quarantine bucket. Complement all three with runtime probes for the SLA block: freshness probes read now() - MAX(created_at) against sla.max_freshness and page the producer on breach. The rule of thumb: CI catches producer intent; runtime catches producer or infrastructure reality; DLQ + quarantine preserve every rogue payload for replay.
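The DLQ envelope mentioned in that answer is small enough to build with the standard library. A minimal sketch, assuming the field names from the cheat sheet's envelope recipe; build_dlq_envelope and replay_payload are hypothetical helpers, and producing the envelope to the DLQ topic is left to the surrounding consumer code.

```python
import json
import time
import uuid


def build_dlq_envelope(raw_value: bytes, raw_key, topic: str, partition: int, offset: int,
                       error: Exception, consumer_group: str,
                       trace_id=None, timestamp_ms=None) -> str:
    """Wrap a payload that failed deserialization in a replayable JSON envelope."""
    envelope = {
        "original_payload_hex": raw_value.hex(),  # lossless copy of the original bytes
        "original_key": raw_key.decode("utf-8", "replace") if raw_key else None,
        "topic": topic,
        "partition": partition,
        "offset": offset,
        "timestamp_ms": timestamp_ms if timestamp_ms is not None else int(time.time() * 1000),
        "error_type": type(error).__name__,
        "error_message": str(error),
        "consumer_group": consumer_group,
        "trace_id": trace_id or uuid.uuid4().hex,
    }
    return json.dumps(envelope)


def replay_payload(envelope_json: str) -> bytes:
    """Recover the exact original bytes for re-publishing after the producer fix."""
    return bytes.fromhex(json.loads(envelope_json)["original_payload_hex"])


if __name__ == "__main__":
    env = build_dlq_envelope(b"\x00\x00\x00\x00\x07bad", b"order-42", "orders.public",
                             3, 1201, ValueError("unknown field"), "orders-loader")
    print(json.loads(env)["error_type"], replay_payload(env))
```

Hex-encoding the payload keeps the envelope valid JSON while guaranteeing a byte-for-byte replay.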
Who owns a data contract — the producer team or the consumer team?
The producer team owns the contract — always. The producer is the party making the promises (schema stability, freshness SLA, quality assertions); they sign it via a CODEOWNERS binding on the contract file that requires their team's approval on every change. The steward team (usually platform-data or a data-governance team) owns the enforcement infrastructure (lint, sync, probes, DLQ) and mediates producer-consumer disputes; they are the neutral third party. Consumers subscribe to the contract, are listed in roles.consumers, and get notified on every change; they don't sign the contract but they do need to acknowledge deprecations via a tracked /contract ack command. When a contract is violated, the producer's on-call rotation (roles.on_call_rotation) gets paged. Consumers who want tighter SLAs or additional assertions negotiate with the producer during quarterly contract reviews; the steward mediates if they can't agree. Contract ownership without a producer is a wish; producer ownership without a steward is a silo.
How do I introduce data contracts to an organisation that has never had them?
Sequence discovery before enforcement. Quarter 1: stand up the tooling (odcs-lint, dbt contract, registry-sync, freshness probes, DLQ framework, ownership dashboard). Auto-generate ODCS contracts from existing dbt models and Avro schemas; every contract begins in status: draft with all assertions in warn_only. Producer teams review and sign; consumers subscribe. Quarter 2: measure the SLAs — run probes, gather freshness / availability / quality metrics, let producers see their "actual vs promised" gap. Some teams renegotiate looser SLAs to match reality; that's healthy. Quarter 3: promote assertions up the rollout ladder — warn_only → block_on_new → block_all — one rung at a time, one dataset at a time. Datasets with clean data promote fast; dirty ones stay in block_on_new while backfills run. Quarter 4: ownership handoff — every dataset has a producer, an on-call, a consumer list, and a quarterly review cadence. New datasets ship with a contract by default. The most common failure is going straight from "no contracts" to "block-everywhere in CI" — that produces a wave of overnight breakages and destroys organisational trust. The phased ladder is what makes the rollout survivable.
Practice on PipeCode
- Drill the ETL practice library for the producer-consumer, schema-evolution, and contract-testing problems senior interviewers use to probe data-contracts intuition.
- Rehearse on the streaming practice library for the Avro / Protobuf / schema-registry, subject-strategy, and BACKWARD-compat problems every senior streaming role expects.
- Sharpen the schema-design axis with the optimization practice library for the contract-shape, rollout-ladder, and SLA-tuning problems that come up in system-design rounds.
- Stack the prerequisites against the broader 450+ data-engineering catalogue to anchor the ODCS + registry + enforcement intuition against real graded inputs.
Lock in data-contract muscle memory
Vendor docs explain fields. PipeCode drills explain the decision — when a rename becomes a major bump, when BACKWARD compatibility rejects a schema, when a quality assertion belongs in warn-only versus block-all, when a deprecation window has to run 90 days. Pipecode.ai is Leetcode for Data Engineering — pattern-first practice tuned for the production trade-offs senior data engineers actually face.