Databricks API & CLI for Data Engineers: Jobs, Clusters, Repos & CI/CD

The Databricks API looks like a single product surface on the marketing page. In reality it is a layered stack of REST 2.x endpoint groups, a Go CLI binary, a YAML deployment format, and four different authentication modes, and the line between "I know Databricks" and "I can ship Databricks to production" is whether you can drive every one of those layers from a Git repository with no clicks. Click-ops is fine for exploration; a scheduled prod job that nobody can redeploy without opening a browser is a resume-limiting outage waiting to happen.

This guide is the cheat sheet you wished existed the first time a Databricks CLI upgrade replaced the Python tool with a Go binary and broke half your Makefile. It walks through the REST API endpoint map across Jobs 2.x, Clusters 2.0, Repos 2.0, Secrets 2.0, Workspace 2.0, DBSQL 2.0 and Unity Catalog 2.1; the twenty CLI commands worth memorising; the Databricks Asset Bundles CI/CD pattern with GitHub Actions and a staging → prod approval gate; and the auth-pattern matrix for PAT, OAuth U2M, OAuth M2M with a service principal, and notebook-context auth. Each section pairs a teaching block with a Solution-Tail interview answer: code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.

When you want hands-on reps immediately after reading, drill the API integration practice library, warm up on Databricks company problems, and rehearse system design drills for the deploy-pipeline interview.

On this page

Why the API + CLI matter — the GUI ceiling
REST API surface — endpoint groups every DE uses
CLI cheat sheet — the 20 commands worth memorising
CI/CD with Databricks Asset Bundles
Authentication patterns — when each one fits
Cheat sheet — API + CLI recipes
Frequently asked questions
Practice on PipeCode

1. Why the API + CLI matter — the GUI ceiling

The Databricks UI is great for exploration and terrible for production — the API + CLI are the only way to ship idempotent, reviewable, Git-backed change

The one-sentence invariant: every production-grade Databricks operation lives behind the REST API; the CLI and Asset Bundles are thin clients on top of the same endpoints; and any workflow you cannot reproduce from a Git checkout is technical debt waiting to page you at 2am. Once you internalise "the GUI is a renderer over the API," the entire conversation about CI/CD, drift detection, and on-call rotations becomes a sequence of REST calls plus YAML.

The four things only the API/CLI give you.

Bulk operations. Creating 40 clusters or rotating 200 secrets is a for loop, not 40 right-click menus. Anything more than three resources is faster scripted than clicked.
CI/CD and code review. A databricks.yml lives in Git, gets diffed in pull requests, and deploys atomically. A clicked job has no Git history — you cannot diff what changed last Thursday.
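Drift detection. The bundle in Git is the declared state. databricks bundle validate --target prod checks the configuration without deploying, and the next databricks bundle deploy --target prod re-applies the declared state to bundle-managed resources, so a UI edit shows up as a difference from Git and is overwritten rather than living on silently. The GUI alone has no concept of "drift."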
Scheduled rotations. Token rotation, secret rotation, cluster-policy edits, library version bumps — every "every 90 days" task is a cron-friendly CLI call. The UI cannot run on a schedule.

The click-ops debt symptom.

You inherit a prod job whose YAML you cannot find anywhere in Git. The only way to redeploy it is to open the UI, screenshot the config, and rebuild from memory. That is "click-ops debt" — and the cure is to export the current job spec via databricks jobs get <id> --output JSON > resources/jobs/legacy.json and check it into Git the same hour you discover the gap.

API vs CLI vs Terraform vs Asset Bundles — when each tool fits.

REST API directly (via curl, Python requests, or httpx) — when you need a one-off automation outside the CLI's surface area, or when you are writing a deeper library. Higher fidelity, lower ergonomics.
databricks CLI (the Go binary) — the default for interactive ops, scripted one-shots, and ad-hoc deploys. Wraps the REST API with a clean --output JSON | jq flow.
Terraform — when you also need to manage cloud infra (S3 buckets, IAM roles, networking) in the same plan. Great for the whole-stack infrastructure layer; overkill for "just deploy a job."
Databricks Asset Bundles (DAB) — the 2025 default for Databricks-native CI/CD. Declarative YAML, atomic deploys, native staging/prod targets, drift detection.

Authentication surface — four modes, one production answer.

Personal Access Token (PAT) — the legacy mode. Fast to issue, per-user, no automatic rotation. Fine for an ad-hoc human script; never for a service.
OAuth U2M (user-to-machine) — short-lived browser login for laptops. The default for the modern CLI on a developer machine.
OAuth M2M (machine-to-machine) with a service principal — the production default. Client ID + client secret, rotated every 90 days, no human in the loop.
Notebook-context auth — the dbutils token inherited inside a running job, scoped to the job's run-as identity. Available only inside a notebook task.
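
As a rough illustration of the M2M mode above, the token exchange is an OAuth client-credentials request. The sketch below only builds that request and does not send it. It assumes the workspace-level token endpoint is /oidc/v1/token and that the scope is all-apis; check both against your workspace's OAuth documentation. build_m2m_token_request, the host, and the credentials are hypothetical names for illustration.

```python
import base64
from urllib.parse import urlencode


def build_m2m_token_request(host: str, client_id: str, client_secret: str) -> dict:
    """Describe a client-credentials token request without sending it."""
    # Assumption: HTTP Basic auth carries the service principal's client ID and secret.
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    return {
        "method": "POST",
        # Assumption: workspace-level token endpoint path.
        "url": f"{host.rstrip('/')}/oidc/v1/token",
        "headers": {
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        "body": urlencode({"grant_type": "client_credentials", "scope": "all-apis"}),
    }


req = build_m2m_token_request(
    "https://acme.cloud.databricks.com", "my-client-id", "my-secret"
)
print(req["url"])
print(req["body"])
```

The access_token field of the response would then go into the usual Authorization: Bearer header. Keeping the request builder separate from the HTTP call makes the flow easy to unit test without a live workspace.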

The CLI generational split.

The 2025 databricks Go binary — the only supported CLI. Installed via brew install databricks/tap/databricks or the prebuilt binary release. Compatible with --output JSON, has top-level groups like jobs, clusters, bundle, auth, workspace, secrets, fs, current-user.
The legacy Python databricks-cli — deprecated since 2024. Still installable from PyPI, still shows up in old Makefiles. Anything you read on an old StackOverflow answer that says databricks workspace ls /Users and not databricks workspace list /Users is from the legacy tool. The new CLI uses different sub-command names for some groups; rewrite your scripts.

Versioning in the REST URL.

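Every endpoint carries its version in the path, either 2.0 or 2.x: POST /api/2.1/jobs/create, not POST /api/jobs/create. Choose the version deliberately and keep it consistent across a script, because request and response shapes differ between versions (the Jobs 2.0 and 2.1 formats are not interchangeable), and a path copied from an old snippet can return a structure your parser does not expect.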
The bundle deploy command pins versions for you. Hand-rolled curl does not.
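
One way to keep the version explicit in a hand-rolled client is to make it a required argument of the URL builder, so no call site can omit it. This is a minimal sketch; api_url and ALLOWED_VERSIONS are hypothetical names, and the allowed set simply mirrors the versions named in this guide.

```python
ALLOWED_VERSIONS = {"2.0", "2.1", "2.2"}


def api_url(host: str, version: str, path: str) -> str:
    """Build a versioned REST URL and reject versions this script has not vetted."""
    if version not in ALLOWED_VERSIONS:
        raise ValueError(f"unexpected API version: {version!r}")
    return f"{host.rstrip('/')}/api/{version}/{path.lstrip('/')}"


url = api_url("https://acme.cloud.databricks.com/", "2.1", "/jobs/create")
print(url)
```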

What interviewers listen for.

Do you reach for DAB or Terraform, not raw curl, when asked to "deploy this job from CI"? — senior signal.
Do you mention OAuth M2M with a service principal when asked about prod auth, not PAT? — required answer.
Do you mention drift detection when asked how you keep prod stable? — senior signal.
Do you mention --output JSON + jq when asked to script the CLI? — required answer.

Worked example — the click-ops to GitOps migration

Detailed explanation. A new team inherits a Databricks workspace full of jobs created in the UI over three years. There is no Git repo of definitions. The first migration step is to export every job to JSON via the API, commit the dump to Git, convert the highest-frequency jobs to Asset Bundle YAML, and deploy them back via the CLI — proving the round trip works before retiring the GUI version.

Question. Given an inherited Databricks workspace with 12 clicked jobs, write the API calls and CLI commands you would run to discover, export, version-control, and re-deploy each job from a databricks.yml bundle, with no downtime.

Input.

job_id	name	trigger	owner_lost?
101	nightly_etl	schedule	yes
102	hourly_ingest	schedule	yes
103	ad_hoc_reprocess	manual	yes
104	dq_audit	schedule	yes

Code.

# 1) Discover every job and capture the spec
databricks jobs list --output JSON | jq -r '.[].job_id' > jobs.txt
mkdir -p exported_jobs resources/jobs
while IFS= read -r job_id; do
  # the Go CLI takes the job ID as a positional argument
  databricks jobs get "$job_id" --output JSON > "exported_jobs/${job_id}.json"
done < jobs.txt

# 2) Commit the raw export
git add exported_jobs/ && git commit -m "snapshot inherited Databricks jobs"

# 3) Convert each JSON to a resources/jobs/<name>.yml file under databricks.yml
#    (templated by a script; one resource block per job; the output directory is
#    an argument because a shell redirect cannot target a directory)
python3 scripts/json_to_bundle.py exported_jobs/ resources/jobs/

# 4) Validate, bind each bundle resource to its existing job ID, then deploy
databricks bundle validate --target prod
databricks bundle deployment bind nightly_etl 101 --target prod   # repeat per job
databricks bundle deploy --target prod

Step-by-step explanation.

databricks jobs list --output JSON returns every job in the workspace as a JSON array. jq -r '.[].job_id' extracts just the IDs into a text file for iteration.
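databricks jobs get <id> --output JSON dumps the job's settings, including tasks, schedule, libraries, tags, and run-as identity. One file per job. Job permissions are served by the separate Permissions API, so export them as well if you want them in the baseline.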
Committing the raw JSON gives you an audit baseline — even before the YAML conversion, you have the "what is in prod today" snapshot.
The json_to_bundle.py script translates each Jobs 2.x JSON into the equivalent Asset Bundle YAML stanza under resources/jobs/. Most fields map one-to-one; cluster references rewrite from inline cluster specs into shared cluster pools.
databricks bundle validate checks YAML schema and Unity Catalog references without deploying. Catch the typos here.
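databricks bundle deployment bind <resource-key> <job-id> links each bundle resource to the job that already exists, and databricks bundle deploy --target prod then updates those jobs in place. Without the bind step, the first deploy creates new jobs next to the originals, because a bundle tracks resources through its own deployment state rather than by job name. With it, no new job IDs are created and schedules keep firing without a gap.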
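
The conversion step can be sketched as follows. This is not the json_to_bundle.py the walkthrough refers to (its contents are not shown); it is a hypothetical minimal version that assumes the exported JSON has a top-level settings object and that a bundle job resource accepts those settings under resources.jobs.<key>. The output is printed as JSON, which YAML parsers also accept.

```python
import json
import re


def job_to_bundle_resource(job: dict) -> dict:
    """Wrap one exported job's settings in a bundle-style resources.jobs mapping."""
    settings = job["settings"]
    # Derive a resource key from the job name: lowercase, underscores only.
    key = re.sub(r"[^0-9a-zA-Z_]+", "_", settings["name"]).strip("_").lower()
    return {"resources": {"jobs": {key: settings}}}


# Hypothetical, heavily trimmed export for illustration.
exported = {"job_id": 101, "settings": {"name": "nightly-etl", "max_concurrent_runs": 1}}
bundle_fragment = job_to_bundle_resource(exported)
print(json.dumps(bundle_fragment, indent=2))
```

The CLI also ships databricks bundle generate job --existing-job-id <id>, which is likely a better starting point than a hand-written converter for real migrations.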

Output.

Step	What was created	Where
Step 1	`jobs.txt` (12 IDs)	local
Step 2	`exported_jobs/*.json` (12 files)	Git
Step 3	initial commit "snapshot…"	Git history
Step 4	`resources/jobs/*.yml` (12 files)	Git
Step 5	validation report	local stdout
Step 6	`Updated 12 jobs in workspace …`	CLI output + workspace

Rule of thumb. Every migration from click-ops to GitOps starts with jobs list + jobs get and ends with bundle deploy --target prod. Never rewrite a job by hand from a screenshot — the JSON export is canonical and lossless.

Worked example — picking between API, CLI, Terraform, and DAB

Detailed explanation. Picking the right tool for a Databricks change is a constant interview probe. The answer is rarely "one tool for everything" — it is "API for fine-grained automation, CLI for ergonomics, Terraform for cross-cloud infra, DAB for Databricks-native CI/CD." Saying so out loud separates seniors from juniors.

Question. For each of these tasks, which Databricks tool would you pick and why: (a) deploy a new job from a PR, (b) rotate 40 secret values nightly, (c) provision an S3 bucket plus a new Databricks workspace, (d) update a job's libraries field on an emergency basis?

Input.

Task	Frequency	Cross-cloud?	Latency tolerance
(a) Deploy a job from PR	per PR	no	minutes
(b) Rotate 40 secrets nightly	nightly	no	minutes
(c) Provision bucket + workspace	one-off	yes	hours
(d) Patch one job's libraries	rare emergency	no	seconds

Code.

# (a) Deploy a new job from a PR — Asset Bundle
databricks bundle deploy --target staging
databricks bundle deploy --target prod   # after approval

# (b) Rotate 40 secrets nightly — CLI in a cron / scheduled GitHub Action
for scope in prod-keys; do
  for key in $(databricks secrets list-secrets "$scope" --output JSON | jq -r '.[].key'); do
    new=$(./scripts/mint_secret.sh "$key")
    databricks secrets put-secret "$scope" "$key" --string-value "$new"
  done
done

# (c) Provision an S3 bucket + Databricks workspace — Terraform
terraform plan
terraform apply

# (d) Patch one job's libraries field — REST API direct
curl -X POST "${HOST}/api/2.1/jobs/reset" \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d @new_job.json

Step-by-step explanation.

(a) is the canonical Asset Bundle case: declarative, diffable, atomic deploy, staging-then-prod with manual approval. Anything CI-pipeline-shaped that lives entirely inside one Databricks workspace is a bundle.
(b) is a bulk-secret rotation. DAB does not model secrets-as-data well (you do not check secret values into Git), so the CLI in a scheduled action is the right tool. The loop is shell, the lift is the CLI.
(c) crosses Databricks and the cloud provider (S3 + IAM + a new workspace). Terraform is the one tool that can span both, with one plan and one state. DAB cannot create a workspace; the CLI can only act inside one.
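(d) is a one-shot emergency patch, the kind of "fix one field in prod right now" where the CLI is overkill and Terraform is too slow. Hit the REST API with curl and document the change after. Because jobs/reset overwrites the whole spec, new_job.json must carry the job_id plus the complete new_settings; for a true partial change, jobs/update accepts only the fields being modified.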

Output.

Task	Tool	Why
Deploy job from PR	DAB	declarative, atomic, staging → prod
Rotate 40 secrets	CLI	bulk + script-friendly + secrets-not-in-Git
Provision workspace + bucket	Terraform	cross-cloud, one state
Emergency one-field patch	REST API direct	lowest latency, minimal cognitive cost

Rule of thumb. Pick the thinnest tool that handles the task. DAB for the 80% recurring deploy path; CLI for ad-hoc and bulk; Terraform when the change leaves the Databricks workspace; raw REST only when you need surgical precision on an undocumented or edge-case field.

Interview question on the click-ops vs GitOps shift

A senior interviewer often frames this as: "Your team inherits a Databricks workspace with no Git repository of job definitions. Walk me through the first month of work to bring it under CI/CD without breaking any running schedule."

Solution Using a discover → snapshot → bundle → deploy migration plan

# Week 1 — discover
databricks current-user me                       # confirm auth + workspace identity
databricks jobs list --output JSON > jobs.json   # snapshot every job
databricks clusters list --output JSON > clusters.json
databricks repos list --output JSON > repos.json

# Week 2 — snapshot every job spec into Git
mkdir -p exported_jobs
for id in $(jq -r '.[].job_id' jobs.json); do
  databricks jobs get "$id" --output JSON > "exported_jobs/$id.json"
done
git add exported_jobs/ jobs.json clusters.json repos.json
git commit -m "snapshot: workspace inventory"

# Week 3 — author databricks.yml + per-job YAML, validate
databricks bundle validate --target staging

# Week 4 — atomic deploy to staging then prod with manual approval
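databricks bundle deploy --target staging
# Run smoke test job: databricks bundle run --target staging assert_row_count
# Bind each prod resource to its existing job ID so the deploy updates it in place
databricks bundle deployment bind nightly_etl 101 --target prod
databricks bundle deploy --target prod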

Step-by-step trace.

Week	Action	Reversible?	Risk
1	snapshot jobs/clusters/repos	yes (read-only)	none
2	commit snapshots to Git	yes	none
3	author + validate bundle	yes (no deploy yet)	none
4a	deploy to staging	yes (rollback by reverting)	low
4b	smoke test in staging	yes	low
4c	deploy to prod	yes (re-deploy old SHA)	medium

The plan is intentionally bottom-up: every step is reversible, and the highest-risk operation (prod deploy) only happens after a smoke test in staging passes. No schedule is interrupted, provided each bundle resource is first bound to its existing job ID (databricks bundle deployment bind), so the deploy updates that job in place instead of creating a duplicate.

Output:

Artifact	Created in	Used for
`exported_jobs/*.json`	Week 2	rollback baseline
`databricks.yml`	Week 3	declarative source of truth
`resources/jobs/*.yml`	Week 3	per-job declarative spec
`staging` workspace job updates	Week 4a	smoke testing
`prod` workspace job updates	Week 4c	GitOps in production

Why this works — concept by concept:

Read-only discovery first — jobs list and jobs get mutate nothing, so the snapshot phase carries zero outage risk and builds confidence in the API.
Snapshot is canonical — the JSON exports are the authoritative "what is in prod today" record before any YAML translation. If the bundle gets the YAML wrong, you can always re-deploy from the JSON via curl + jobs/reset.
Bundle validation before deploy — databricks bundle validate catches schema errors, missing UC references, and undefined cluster pools without touching the workspace. Fail fast, fail cheap.
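Atomic update of existing job IDs: once a bundle resource is bound to an existing job with databricks bundle deployment bind, bundle deploy updates that job ID in place. A bundle tracks resources through its own deployment state, not by matching the name field, so the bind step is what prevents a duplicate job. Schedules keep firing across the migration; there is no "the job vanished for ten minutes" window.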
Staging → prod with smoke test — the smoke test is a cheap assert_row_count notebook that proves the deploy actually works. Treat staging as a real gate, not a vanity environment.
Cost — O(jobs) API calls for the snapshot; O(1) deploy from then on. The migration is one engineer-week per ~50 jobs, after which every future change is a PR.

2. REST API surface — endpoint groups every DE uses

The Databricks REST API is eight groups, one Bearer token, and one pagination contract — knowing the eight groups by name lets you script anything

The mental model in one line: every Databricks resource you can manage from the UI has a /api/2.x/<group>/<verb> endpoint, every endpoint takes a Bearer token, and the catalogue of groups is small enough to memorise — Jobs, Clusters, Repos, Secrets, Workspace, DBSQL, Unity Catalog, and Workflows. Once you can name the eight groups and the top three verbs in each, every "how would you script this?" question has an immediate skeleton answer.

The eight groups in one table.

Group	Version	Top verbs	Typical use case
Jobs	2.1 / 2.2	`create`, `reset`, `run-now`, `runs/get-output`, `runs/repair`	scheduled and on-demand work
Clusters	2.0	`create`, `edit`, `start`, `restart`, `delete` (terminate), `events`, `libraries`	compute lifecycle
Repos	2.0	`create`, `update`, `list`, `delete`	Git-backed working copies in the workspace
Secrets	2.0	`scopes/create`, `put`, `acls/put`, `scopes/list`	sensitive config + ACLs
Workspace	2.0	`import`, `export`, `mkdirs`, `list`	the legacy notebook FS API
DBSQL	2.0	`warehouses`, `queries`, `dashboards`, `alerts`	the SQL persona surface
Unity Catalog	2.1	`catalogs`, `schemas`, `tables`, `grants`, `external-locations`, `storage-credentials`	governance + lineage
Workflows	(via Jobs)	task types: `notebook`, `python_wheel`, `dlt`, `sql`, `run_job`	the orchestration view of jobs

Conventions you can rely on.

Auth. Every call carries Authorization: Bearer <token>. The token is a PAT, a U2M access token, an M2M access token, or the notebook context token — all four are interchangeable to the API.
Rate limits. Hit a quota, get HTTP 429 with a Retry-After header. The CLI auto-retries with backoff; hand-rolled clients must do the same.
Idempotency. jobs/run-now accepts an idempotency_token field. Use it on every retry so that a repeated request cannot start a duplicate run.
Pagination. List endpoints return next_page_token; pass it back as page_token on the next call. Keep going until next_page_token is absent.
Versioning. Put 2.0, 2.1, or 2.2 in every URL on purpose. Field names and response shapes differ between versions, so a script should never mix them by accident.
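
The rate-limit convention can be sketched as a small retry wrapper. This is a hedged illustration, not the CLI's own retry logic: call_with_retry is a hypothetical helper, the send callable stands in for whatever HTTP client you use, and Retry-After is assumed to be a number of seconds.

```python
import time
from typing import Callable


def call_with_retry(
    send: Callable[[], tuple[int, dict, dict]],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[int, dict]:
    """Call send() until it returns a non-retryable status.

    send() must return (status_code, response_headers, json_body).
    """
    status = None
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        if status != 429 and status < 500:
            return status, body
        if attempt < max_attempts:
            # Prefer the server's hint; otherwise back off exponentially.
            sleep(float(headers.get("Retry-After", 2 ** attempt)))
    raise RuntimeError(f"gave up after {max_attempts} attempts (last status {status})")


# Simulated server: one 429, then success.
responses = iter([(429, {"Retry-After": "1"}, {}), (200, {}, {"jobs": []})])
sleeps: list[float] = []
result = call_with_retry(lambda: next(responses), sleep=sleeps.append)
print(result, sleeps)
```

Injecting sleep keeps the helper testable without real waiting; in production the default time.sleep applies.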

Jobs 2.x in three sentences.

POST /api/2.1/jobs/create accepts a JSON spec of tasks, clusters, libraries, schedule, and access control. The response is a job_id — the immortal handle.
POST /api/2.1/jobs/reset replaces the entire job spec atomically (no patch — full overwrite). This is what bundle deploy emits per job.
POST /api/2.1/jobs/run-now triggers an immediate run, optionally with parameters; the response is a run_id.

Clusters 2.0 in three sentences.

POST /api/2.0/clusters/create accepts a cluster spec (node type, runtime, num workers, autoscale, libraries) and returns a cluster_id.
POST /api/2.0/clusters/edit mutates the spec in place, which is much faster than create-then-destroy. It is accepted for a cluster in the RUNNING or TERMINATED state; a running cluster restarts so the new attributes take effect.
POST /api/2.0/clusters/events with a JSON body of {"cluster_id": "<id>"} is the only way to get the cluster's full event log including autoscale events and driver crashes, which makes it vital for debugging.

Repos 2.0 — the Git-backed primitive.

POST /api/2.0/repos with a url, provider, and optional path creates a Git-backed working copy inside /Repos/<user>/<name> or /Repos/<service-principal>/<name>.
PATCH /api/2.0/repos/<id> with a branch or tag field syncs the working copy. This is the "pull latest main into the workspace" call CI uses.
The Repos API is not the right place for production code — code in /Repos is mutable and not idempotent. Use Workspace Files + bundles for prod.

Secrets 2.0 — never log a value.

Scopes are namespaces; secrets are key-value entries inside a scope; ACLs control which principals can read which scope.
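Every write endpoint accepts the value as a string, but the listing endpoint, GET /api/2.0/secrets/list, returns only key names and timestamps, never values (by design). The standard way to read a value is at run time from a notebook or job via dbutils.secrets.get(scope, key), where printed values are redacted in the output.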

Unity Catalog 2.1 — governance is a first-class API.

Catalogs, schemas, tables, volumes, models, and grants all live behind /api/2.1/unity-catalog/....
PATCH /api/2.1/unity-catalog/permissions/<securable_type>/<full_name> adjusts grants. This is the API that drives "self-service grant requests" tooling.
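
A self-service grant tool would mostly be a thin wrapper around that PATCH. The sketch below assumes the request body is a changes list whose entries carry principal, add, and remove; the helper name, the table name, and the group name are hypothetical.

```python
def build_grant_request(
    host: str,
    securable_type: str,
    full_name: str,
    principal: str,
    add: tuple[str, ...] = (),
    remove: tuple[str, ...] = (),
) -> dict:
    """Describe a Unity Catalog permissions change without sending it."""
    return {
        "method": "PATCH",
        "url": f"{host.rstrip('/')}/api/2.1/unity-catalog/permissions/{securable_type}/{full_name}",
        "json": {"changes": [{"principal": principal, "add": list(add), "remove": list(remove)}]},
    }


grant = build_grant_request(
    "https://acme.cloud.databricks.com",
    "table",
    "main.sales.orders",
    "data-analysts",
    add=("SELECT",),
)
print(grant["url"])
print(grant["json"])
```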

Common interview probes on the REST surface.

"Which API version do you target for Jobs?" — 2.1 or 2.2 in 2025–2026. 2.0 is legacy.
"How do you make jobs/run-now retry-safe?" — pass an idempotency_token; deduplicates retries at the server.
"How do you list every job in a workspace with 5000 jobs?" — paginate via page_token + next_page_token.
"Why is clusters/edit faster than create-then-delete?" — preserves the cluster ID, the metastore attachments, and the warm-cache plan. Cheaper for users referencing the cluster by ID.

Worked example — listing every job with pagination

Detailed explanation. A workspace with 5000 jobs cannot be listed in one call. The Jobs 2.1 list endpoint returns one limited page at a time (the limit parameter defaults to 20 and is capped at 100) and includes a next_page_token. The script must loop until the token is absent. This is the canonical "use pagination correctly" interview probe.

Question. Write a Python script that lists every Databricks job in a workspace using the Jobs 2.1 API, handling pagination correctly.

Input.

param	value
host	`https://acme.cloud.databricks.com`
token	`dapi...` (PAT) or an OAuth M2M access token
total_jobs	5000
page_size	25 (set explicitly via `limit`)

Code.

import os
import httpx

HOST  = os.environ["DATABRICKS_HOST"]
TOKEN = os.environ["DATABRICKS_TOKEN"]

def list_all_jobs() -> list[dict]:
    jobs: list[dict] = []
    page_token: str | None = None
    while True:
        params = {"limit": 25}
        if page_token:
            params["page_token"] = page_token
        r = httpx.get(
            f"{HOST}/api/2.1/jobs/list",
            headers={"Authorization": f"Bearer {TOKEN}"},
            params=params,
            timeout=30.0,
        )
        r.raise_for_status()
        body = r.json()
        jobs.extend(body.get("jobs", []))
        page_token = body.get("next_page_token")
        if not page_token:
            break
    return jobs

if __name__ == "__main__":
    all_jobs = list_all_jobs()
    print(f"Total jobs: {len(all_jobs)}")

Step-by-step explanation.

The first iteration calls /api/2.1/jobs/list?limit=25 with no page_token. The response contains up to 25 jobs and a next_page_token opaque cursor.
Each subsequent iteration passes the previous next_page_token as the page_token query param. The API returns the next 25 jobs.
The loop exits when the response does not contain next_page_token — the server is telling you "there is no next page."
The script accumulates every page into the jobs list and returns the full set. For 5000 jobs at 25 per page, that is 200 API calls — small.

Output.

Iteration	Jobs returned	next_page_token?
1	25	yes
2	25	yes
…	…	…
200	25	no
Total	5000	—

Rule of thumb. Treat every Databricks list endpoint as paginated. Write the loop once as a reusable paginate(endpoint, params) helper and you never have to think about it again. Never assume "the workspace is small enough to fit in one page": the workspace you inherit next quarter will not be.

Worked example — idempotent `run-now` with a retry-safe token

Detailed explanation. Triggering a job from CI is straightforward — until the network blip mid-call. Without an idempotency token, retrying the request after a timeout can run the job twice. With one, the server deduplicates: two POSTs with the same token map to the same run_id.

Question. Write a bash snippet that triggers a Databricks job via jobs/run-now, supplies an idempotency_token, and retries on transient errors without risking a duplicate run.

Input.

field	value
job_id	12345
host	`$DATABRICKS_HOST`
token	`$DATABRICKS_TOKEN`
idempotency_token	string derived from the current Git SHA + date (at most 64 characters)

Code.

#!/usr/bin/env bash
set -euo pipefail
JOB_ID=12345
# Stable across retries of the same CI run on the same day (49 characters)
IDEMP=$(printf '%s-%s' "$(git rev-parse HEAD)" "$(date -u +%Y%m%d)")
PAYLOAD=$(jq -n --argjson id "$JOB_ID" --arg t "$IDEMP" '{
  job_id: $id,
  idempotency_token: $t,
  notebook_params: {date: "2026-06-04"}
}')

STATUS=""
for attempt in 1 2 3; do
  # If curl itself fails (timeout, DNS), record status 000 and retry
  RESPONSE=$(curl -sS -w '\n%{http_code}' -X POST \
    "$DATABRICKS_HOST/api/2.1/jobs/run-now" \
    -H "Authorization: Bearer $DATABRICKS_TOKEN" \
    -H "Content-Type: application/json" \
    -d "$PAYLOAD") || RESPONSE=$'\n000'
  STATUS=$(printf '%s' "$RESPONSE" | tail -n1)
  if [ "$STATUS" = "200" ]; then break; fi
  echo "Attempt $attempt got HTTP $STATUS, retrying" >&2
  sleep $((2 ** attempt))
done

if [ "$STATUS" != "200" ]; then
  echo "run-now failed after 3 attempts" >&2
  exit 1
fi

# sed '$d' drops the trailing status line and works with GNU and BSD tools alike
RUN_ID=$(printf '%s' "$RESPONSE" | sed '$d' | jq -r '.run_id')
echo "Run started: $RUN_ID"

Step-by-step explanation.

The idempotency token is derived from the Git SHA + date — stable across retries of the same CI run, distinct across deploys.
The payload includes idempotency_token and notebook_params. The server stores the token; if a second request with the same token arrives, the server returns the original run_id instead of starting a new run. The token may be at most 64 characters.
The loop retries up to 3 times with exponential backoff. Because the token is fixed across attempts, even if the first call partially succeeded (started the run but the response did not reach the client), the second call returns the same run_id.
The -w '\n%{http_code}' option appends the status code on its own line after the JSON body. tail -n1 reads that status line, and the final pipeline strips it again so that jq parses only the body.

Output.

Scenario	Attempts	Final state
Clean call	1	run started, `run_id=R1`
Network blip on first attempt	2	second attempt returns the same `run_id=R1` (deduped)
Server 5xx on first two attempts	3	third attempt returns the original or a new `run_id` depending on whether the original landed

Rule of thumb. Every CI script that calls jobs/run-now should supply an idempotency_token. The cost is a few characters in the payload; the benefit is "the network can fail and the job still runs exactly once."
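
The same token derivation can be sketched in Python for CI steps that are not shell. This assumes the 64-character limit on idempotency_token and falls back to a SHA-256 digest when the raw string would be too long; the helper name and inputs are hypothetical.

```python
import hashlib


def idempotency_token(git_sha: str, run_date: str) -> str:
    """Return a deterministic token of at most 64 characters for one deploy and day."""
    raw = f"{git_sha}-{run_date}"
    if len(raw) <= 64:
        return raw
    # A SHA-256 hex digest is exactly 64 characters and still deterministic.
    return hashlib.sha256(raw.encode()).hexdigest()


sha = "a" * 40  # stand-in for `git rev-parse HEAD`
token = idempotency_token(sha, "20260604")
print(token, len(token))
```

Because the function is pure, every retry within the same CI run computes the same token, which is what lets the server deduplicate.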

Interview question on REST surface design

A senior interviewer often asks: "Suppose I gave you only the REST API — no CLI, no DAB — and asked you to deploy a job, run it, tail its output, and clean up afterwards. Walk me through every endpoint call you would make."

Solution Using a Jobs 2.1 lifecycle with `create` → `run-now` → `runs/get` → `runs/get-output`

# 1) Create the job
JOB_ID=$(curl -sS -X POST "$HOST/api/2.1/jobs/create" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d @job_spec.json | jq -r '.job_id')

# 2) Trigger an immediate run
RUN_ID=$(curl -sS -X POST "$HOST/api/2.1/jobs/run-now" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"job_id\": $JOB_ID, \"idempotency_token\": \"$(uuidgen)\"}" \
  | jq -r '.run_id')

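# 3) Poll until terminal state, giving up after one hour
DEADLINE=$(( $(date +%s) + 3600 ))
while true; do
  STATE=$(curl -sS "$HOST/api/2.1/jobs/runs/get?run_id=$RUN_ID" \
    -H "Authorization: Bearer $TOKEN" | jq -r '.state.life_cycle_state')
  case "$STATE" in TERMINATED|SKIPPED|INTERNAL_ERROR) break ;; esac
  if [ "$(date +%s)" -ge "$DEADLINE" ]; then echo "timed out waiting for run $RUN_ID" >&2; exit 1; fi
  sleep 10
done

# 4) Fetch output for the (single-task) run; for a multi-task job, pass the
#    task-level run_id from .tasks[].run_id of the runs/get response instead
curl -sS "$HOST/api/2.1/jobs/runs/get-output?run_id=$RUN_ID" \
  -H "Authorization: Bearer $TOKEN" | jq '.notebook_output.result'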

# 5) Optionally delete the job
curl -sS -X POST "$HOST/api/2.1/jobs/delete" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"job_id\": $JOB_ID}"

Step-by-step trace.

Step	Endpoint	Method	Returns
1	`/api/2.1/jobs/create`	POST	`{job_id}`
2	`/api/2.1/jobs/run-now`	POST	`{run_id}`
3	`/api/2.1/jobs/runs/get`	GET	`{state: {life_cycle_state}}` (polled)
4	`/api/2.1/jobs/runs/get-output`	GET	`{notebook_output: {result}}`
5	`/api/2.1/jobs/delete`	POST	`{}`

The lifecycle is intentionally CRUD-shaped: create returns a handle, trigger returns a run handle, polling reads state, output read returns the artifact, delete cleans up. The same five-step pattern works for any job, any task type, any cluster.

Output:

Stage	Identifier	Comment
create	`job_id`	persistent handle
run-now	`run_id`	per-execution handle
poll	`life_cycle_state`	for example PENDING, RUNNING, TERMINATING, TERMINATED, SKIPPED, INTERNAL_ERROR
get-output	`notebook_output.result`	task output payload
delete	(none)	resource gone from workspace

Why this works — concept by concept:

Resource handles are immortal — job_id survives runs and config edits; run_id survives the run regardless of cluster state. Hold the IDs, not the JSON specs.
Idempotency on triggers — run-now accepts a token so a retried POST does not start a duplicate run. Every CI call should send one.
Polling, not webhooks — Databricks does not push job-completion webhooks by default; you poll runs/get every 10–30 seconds. Set a timeout to avoid runaway polling.
Output is endpoint-specific — runs/get returns metadata; runs/get-output returns the notebook return value. Two endpoints for the same run, different payloads.
Delete is permanent: jobs/delete removes the job, and afterwards neither its details nor its run history is visible through the Jobs UI or API. Export anything you need for the audit trail (run metadata, output, logs) before you delete.
Cost — five HTTP calls per lifecycle; poll loop adds one call every 10s. Negligible compared to the work the job does.
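
The polling advice above, including the timeout, can be sketched as follows. The get_state callable stands in for a runs/get call that returns state.life_cycle_state; wait_for_run and the set of terminal states are assumptions for illustration.

```python
import time
from typing import Callable

TERMINAL_STATES = {"TERMINATED", "SKIPPED", "INTERNAL_ERROR"}


def wait_for_run(
    get_state: Callable[[], str],
    timeout_s: float = 3600,
    interval_s: float = 10,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_state() until a terminal state or until timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if clock() >= deadline:
            raise TimeoutError(f"run still {state} after {timeout_s}s")
        sleep(interval_s)


# Simulated run that finishes on the third poll; sleep is a no-op here.
states = iter(["PENDING", "RUNNING", "TERMINATED"])
final_state = wait_for_run(lambda: next(states), sleep=lambda s: None)
print(final_state)
```

A TERMINATED life cycle state only says the run ended; a caller would still inspect the result state to tell success from failure.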

3. CLI cheat sheet — the 20 commands worth memorising

The `databricks` CLI is six categories of about twenty commands — auth, clusters, jobs, workspace, repos/secrets, bundle — and `--output JSON | jq` makes every one scriptable

The mental model in one line: the CLI is a Go binary that wraps the REST API with consistent flags (--profile, --output JSON, --json for input), six top-level groups (auth, clusters, jobs, workspace, repos/secrets, bundle), and a default profile in ~/.databrickscfg. Once you memorise the twenty commands below, eighty percent of daily DE work on Databricks is a single CLI invocation.

Install + first run.

# macOS / Linux — the supported install path
brew install databricks/tap/databricks
# or download the prebuilt binary release and add to PATH

# Confirm version (0.205 or later is the Go CLI)
databricks version

# First-time setup: prompts for host + personal access token
databricks configure

Auth — six commands.

databricks configure                            # legacy PAT prompt; writes ~/.databrickscfg
databricks auth login --host https://acme.cloud.databricks.com  # OAuth U2M
databricks auth profiles                         # list known profiles
databricks auth describe --profile prod          # see auth mode of one profile
databricks current-user me                       # smoke test: who am I?
databricks --profile prod jobs list              # use a named profile per call

Clusters — five commands.

databricks clusters list --output JSON
databricks clusters create --json @cluster.json
databricks clusters start    <cluster-id>
databricks clusters restart  <cluster-id>
databricks clusters events   <cluster-id> --output JSON | jq '.[].type'

Jobs — five commands.

databricks jobs list --output JSON | jq -r '.[].job_id'
databricks jobs get   <id>
databricks jobs run-now --json '{"job_id": <id>, "notebook_params": {"date": "2026-06-04"}}'
databricks jobs get-run <run-id> --output JSON | jq '.state'
databricks jobs repair-run <run-id> --rerun-all-failed-tasks

Workspace + FS — four commands.

databricks workspace list /Users/me/proj
databricks workspace import-dir ./src /Users/me/proj
databricks workspace export-dir /Users/me/proj ./backup
databricks fs ls dbfs:/databricks-datasets/samples/

Repos + Secrets — five commands.

databricks repos create   https://github.com/me/repo gitHub
databricks repos update   <repo-id> --branch main
databricks secrets create-scope  prod-keys
databricks secrets put-secret    prod-keys snowflake_pwd --string-value "$VALUE"
databricks secrets list-secrets  prod-keys

Asset Bundles — four commands.

databricks bundle validate                      # schema + UC ref check, no deploy
databricks bundle deploy   --target prod        # upload code + create/update resources
databricks bundle run      --target prod my_job # trigger a deployed job
databricks bundle destroy  --target dev         # remove everything the bundle owns

Two universal flags.

--profile <name> — selects a named section in ~/.databrickscfg. Lets you keep dev / staging / prod profiles side by side and never type a host URL twice.
--output JSON — switches the default human-friendly table format to JSON, suitable for | jq. Every list / get / run command supports it.
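
As an illustration of how profile sections map to --profile, the sketch below writes a hypothetical three-profile config to a temp file (not the real ~/.databrickscfg) and reads it back with sed and awk. The hosts, the profile names, and the list_profiles / host_for helpers are assumptions made for the example; databricks auth profiles remains the supported way to list profiles.

```shell
#!/usr/bin/env bash
# Sketch only: a throwaway config file with placeholder hosts and profile names.
CFG="$(mktemp)"
cat > "$CFG" <<'EOF'
[dev]
host = https://dev.cloud.databricks.com

[staging]
host = https://staging.cloud.databricks.com

[prod]
host = https://prod.cloud.databricks.com
EOF

# list_profiles FILE: print every [section] name, one per line (hypothetical helper)
list_profiles() {
  sed -n 's/^\[\(.*\)\]$/\1/p' "$1"
}

# host_for FILE PROFILE: print the host configured under one section (hypothetical helper)
host_for() {
  awk -v want="[$2]" '
    /^\[/ { in_section = ($0 == want) }
    in_section && /^host[ \t]*=/ { sub(/^host[ \t]*=[ \t]*/, ""); print }
  ' "$1"
}

list_profiles "$CFG"      # the names you can pass to --profile
host_for "$CFG" prod      # the workspace that --profile prod would target
```

Each section name is exactly the string passed to --profile, which is why one script can target dev, staging, or prod by changing a single flag.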

Three non-obvious habits.

Pipe everything through jq. The CLI's JSON output is stable; the table output is for humans. CI should always read JSON.
Use --json @file.json for create-style commands. The JSON file checks into Git; the spec is reviewable; the deploy is reproducible.
The CLI does its own retries on transient 5xx, so wrap your scripts with explicit error handling only for client-side errors (4xx).
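
One way to add that explicit error handling is a small wrapper that records which step failed and with which exit code. This is a minimal sketch: run_step is a hypothetical helper, not a CLI feature, and the demo runs true and false in place of real databricks calls.

```shell
#!/usr/bin/env bash
# run_step LABEL CMD...: run a command; on failure, report the label and exit code.
# The "|| status=$?" form also keeps working when the caller uses "set -e".
run_step() {
  label="$1"; shift
  status=0
  "$@" || status=$?
  if [ "$status" -ne 0 ]; then
    echo "step '$label' failed with exit code $status" >&2
  fi
  return "$status"
}

# Stand-ins for real calls such as:
#   run_step "create cluster" databricks clusters create --json @cluster.json
run_step "always-ok" true && echo "ok: continuing"
run_step "always-fails" false || echo "handled: stopping before the next call"
```

The wrapper passes the original exit code through, so the calling script can still branch on it.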

Common interview probes.

"How do you switch between dev and prod Databricks workspaces in the CLI?" — --profile dev vs --profile prod, profiles stored in ~/.databrickscfg.
"How do you script the CLI to feed downstream tools?" — --output JSON | jq on every read; --json @spec.json on every write.
"What is the difference between the new databricks binary and the old databricks-cli?" — the new one is a Go binary, supported, ships with bundle and auth login; the old one is Python, deprecated, and has different sub-command names.
"How do you authenticate the CLI in GitHub Actions?" — install the binary, then export DATABRICKS_HOST plus either DATABRICKS_TOKEN (a PAT) or, preferably, DATABRICKS_CLIENT_ID + DATABRICKS_CLIENT_SECRET for OAuth M2M. databricks auth login is interactive (it opens a browser), so it does not fit a CI runner.

Worked example — scripting "list all jobs and find the slowest ones"

Detailed explanation. A common SRE-style ask is "rank the slowest jobs so we know what to optimise." The CLI plus jq plus a little shell is more than enough; no Python required. Ranking by each job's most recent run keeps the script to two calls per job.

Question. Write a one-shot shell pipeline that lists every job, fetches the most recent run of each, and prints the top 10 jobs by the duration of that run.

Input.

job_id	name	last_run_id	duration_ms (last_run)
101	nightly_etl	r1	4 200 000
102	hourly_ingest	r2	380 000
103	dq_audit	r3	1 200 000
104	reprocess	r4	9 800 000

Code.

#!/usr/bin/env bash
set -euo pipefail

# 1) Get every job_id
JOB_IDS=$(databricks jobs list --output JSON --profile prod | jq -r '.[].job_id')

# 2) For each, fetch the most recent run and join the duration
{
  for id in $JOB_IDS; do
    NAME=$(databricks jobs get "$id" --output JSON --profile prod \
           | jq -r '.settings.name')
    LAST=$(databricks jobs list-runs --job-id "$id" --limit 1 --output JSON --profile prod \
           | jq -r '.runs[0].execution_duration // 0')
    printf '%s\t%s\t%s\n' "$id" "$LAST" "$NAME"
  done
} | sort -k2 -n -r | head -n 10 | column -t -s$'\t'

Step-by-step explanation.

jobs list --output JSON | jq -r '.[].job_id' extracts every job_id in the workspace as plain text, one per line.
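The loop iterates every ID. For each, two calls: jobs get to read the human-friendly name, jobs list-runs --limit 1 to read the most recent run's execution_duration (in milliseconds). For multi-task jobs the top-level execution_duration is reported as 0, so read run_duration instead if the workspace uses them.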
The // 0 default in jq handles "no runs yet" — those jobs sort to the bottom.
The tab-separated lines feed sort -k2 -n -r (numerical, descending by column 2), then head -n 10 keeps the top ten, and column -t -s$'\t' aligns columns for human reading.
Total API calls = 2 × number_of_jobs (one get, one list-runs). For a workspace with 200 jobs that is 400 calls, well under the rate limit.
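
The ranking stage can be tried without a workspace. The sketch below feeds the sample rows from the Input table through the same sort-and-truncate step; rank_by_duration is a hypothetical helper name, and it sorts strictly on column 2 with an explicit tab delimiter, a slightly stricter variant of the sort call in the script.

```shell
#!/usr/bin/env bash
# Input columns: job_id <TAB> duration_ms <TAB> name
rank_by_duration() {
  # -t sets TAB as the field delimiter; -k2,2nr sorts on column 2 only, numeric, descending
  sort -t "$(printf '\t')" -k2,2nr | head -n 10
}

# Sample rows stand in for the output of the for-loop over live jobs
printf '%s\t%s\t%s\n' \
  101 4200000 nightly_etl \
  102 380000  hourly_ingest \
  103 1200000 dq_audit \
  104 9800000 reprocess \
  | rank_by_duration
```

Restricting the key to -k2,2 keeps job names that contain spaces or digits from influencing the order.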

Output.

Rank	job_id	duration (ms)	name
1	104	9 800 000	reprocess
2	101	4 200 000	nightly_etl
3	103	1 200 000	dq_audit
4	102	380 000	hourly_ingest

Rule of thumb. Anything that "could be a one-off Python script" can usually be the CLI + jq + sort and run in 20 lines of bash. Reach for Python only when you need libraries (pandas, pydantic, retries with policy), or when the shell quoting becomes a maintenance hazard.

Worked example — creating a cluster from a JSON spec

Detailed explanation. Hand-typing clusters create arguments is fragile. The robust pattern is to keep cluster.json in Git, edit it as a code review, and deploy via --json @cluster.json. The spec is reviewable; the deploy is reproducible; the cluster is one CLI call away.

Question. Given a checked-in cluster.json describing a single-node 14.3.x LTS cluster, create it with the CLI and print the new cluster ID.

Input — cluster.json.

{"cluster_name":"dq-audit-cluster","spark_version":"14.3.x-scala2.12","node_type_id":"i3.xlarge","num_workers":0,"autotermination_minutes":30,"spark_conf":{"spark.databricks.cluster.profile":"singleNode","spark.master":"local[*]"},"custom_tags":{"ResourceClass":"SingleNode"}}

Code.

CLUSTER_ID=$(databricks clusters create \
  --json @cluster.json \
  --profile prod \
  --output JSON \
  | jq -r '.cluster_id')
echo "Created $CLUSTER_ID"

Step-by-step explanation.

--json @cluster.json reads the spec from disk and POSTs it to /api/2.0/clusters/create. The @ prefix tells the CLI "treat the next argument as a path, not a literal."
--profile prod selects the prod section in ~/.databrickscfg so the call lands on the right workspace without an env-var dance.
--output JSON | jq -r '.cluster_id' extracts the new cluster_id from the response.
The same cluster.json is the source of truth. The next time you edit the cluster spec, change the file, commit, and re-run; or, to update the existing cluster in place, add its cluster_id to the spec and run databricks clusters edit --json @cluster.json.

Output.

Field	Value
`cluster_id`	`0604-093425-abcd1234`
state (after create)	PENDING → RUNNING
billable	from RUNNING transition

Rule of thumb. Keep every cluster spec in a JSON file under clusters/. Never invoke clusters create with inline flags in production — the audit trail is the JSON file's Git history.

Interview question on CLI-driven automation

The interviewer often asks: "How would you list every Databricks job in a workspace, then disable any that have not run in 30 days, from a one-shot script with no UI clicks?"

Solution Using `jobs list`, `jobs list-runs`, and `jobs update` with `--output JSON | jq`

#!/usr/bin/env bash
set -euo pipefail
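# Cutoff in epoch milliseconds; plain arithmetic runs on both GNU (Linux) and BSD (macOS) date
THIRTY_DAYS_AGO_MS=$(( ($(date -u +%s) - 30 * 24 * 3600) * 1000 ))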

databricks jobs list --output JSON --profile prod \
  | jq -r '.[].job_id' \
  | while read -r JOB_ID; do
      LAST_RUN_MS=$(databricks jobs list-runs \
                    --job-id "$JOB_ID" --limit 1 --output JSON --profile prod \
                    | jq -r '.runs[0].start_time // 0')
      if [ "$LAST_RUN_MS" -lt "$THIRTY_DAYS_AGO_MS" ]; then
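        # Top-level new_settings fields are replaced wholesale, so resend the full
        # schedule with only pause_status changed; jobs without a schedule are skipped
        PATCH=$(databricks jobs get "$JOB_ID" --output JSON --profile prod \
                | jq -c 'select(.settings.schedule != null)
                         | {job_id, new_settings: {schedule: (.settings.schedule + {pause_status: "PAUSED"})}}')
        [ -n "$PATCH" ] || continue
        echo "Pausing job $JOB_ID (last run = $LAST_RUN_MS)"
        databricks jobs update --json "$PATCH" --profile prod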
      fi
    done

Step-by-step trace.

job_id	last_run_ms	< cutoff?	action
101	yesterday	no	no-op
102	45 days ago	yes	pause
103	never (0)	yes	pause
104	5 days ago	no	no-op

The script reads the workspace, computes the cutoff as "now minus 30 days in milliseconds," compares each job's most recent run timestamp against the cutoff, and pauses the laggards in place — without deleting them.

Output:

Job_ID	Final state	Why
101	UNPAUSED	recent activity
102	PAUSED	inactive 45d
103	PAUSED	never ran
104	UNPAUSED	recent activity

Why this works — concept by concept:

--output JSON everywhere — the table output is for humans; JSON is the contract for scripts. Never grep table output.
jq does the data manipulation — no Python needed for "extract this field" / "filter this list." jq is the lingua franca of REST-API automation.
jobs update is non-destructive — pausing is reversible: re-run with pause_status: UNPAUSED and the schedule resumes. Compare with jobs delete, which is irreversible.
Profile flag — --profile prod keeps the script portable: drop it on any laptop with that profile in ~/.databrickscfg and it runs unchanged. No env-var hygiene required.
Cost — one jobs list + N × (list-runs + maybe update) calls. For 200 jobs that is at most ~600 calls — well under the rate limit, no backoff needed.
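
The cutoff comparison is the part most worth testing before pointing the script at prod. A minimal dry-run sketch, assuming a fixed clock (NOW_S is an arbitrary placeholder epoch) and hypothetical helpers cutoff_ms and is_stale, replays the four jobs from the trace table without calling the API:

```shell
#!/usr/bin/env bash
# cutoff_ms DAYS [NOW_S]: epoch milliseconds DAYS days before NOW_S (default: current time)
cutoff_ms() {
  now_s="${2:-$(date -u +%s)}"
  echo $(( (now_s - $1 * 86400) * 1000 ))
}

# is_stale LAST_RUN_MS CUTOFF_MS: succeeds when the last run is older than the cutoff
is_stale() {
  [ "$1" -lt "$2" ]
}

# Dry run over "job_id last_run_ms" rows; 0 means the job has never run
NOW_S=1780000000
CUTOFF=$(cutoff_ms 30 "$NOW_S")
while read -r job_id last_run_ms; do
  if is_stale "$last_run_ms" "$CUTOFF"; then
    echo "$job_id pause"
  else
    echo "$job_id no-op"
  fi
done <<EOF
101 $(( (NOW_S - 1 * 86400) * 1000 ))
102 $(( (NOW_S - 45 * 86400) * 1000 ))
103 0
104 $(( (NOW_S - 5 * 86400) * 1000 ))
EOF
```

Passing the clock in as an argument is what makes the comparison repeatable; the production script simply omits it and uses the current time.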

4. CI/CD with Databricks Asset Bundles

Databricks Asset Bundles are the YAML deployment format for the Jobs / Pipelines / Clusters / Permissions surface — one bundle, three targets, one approval gate, one auditable promotion path

The mental model in one line: a Databricks Asset Bundle is a databricks.yml plus a resources/ directory describing every job, pipeline, cluster, permission, and dashboard you want deployed; bundle validate runs schema + UC reference checks; bundle deploy --target <env> uploads the code and creates or updates resources in place; and the GitHub Actions pattern is validate-on-PR, deploy-to-staging-on-merge, manual-approval-then-prod. Once you internalise that one YAML drives three workspaces, every CI/CD interview question reduces to "show me the bundle and the workflow file."

The DAB mental model.

Root file. databricks.yml declares bundle.name, the list of include: paths to per-resource YAML files, and a targets: block defining each environment (dev / staging / prod) with its host and run-as identity.
Resources directory. Each file under resources/ is a typed declaration: resources/jobs/my_job.yml, resources/pipelines/dlt.yml, resources/clusters/shared.yml. Strict schema, validated locally.
Variables. variables: declares parameterised inputs — catalog name, schema name, warehouse ID. Overridden per target.
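Single deploy step. databricks bundle deploy --target prod applies every resource change in one run and records each resource in the bundle's deployment state, so jobs keep their job_ids across deploys. It is not a transaction: a deploy that fails midway can leave some resources updated, and re-running it converges to the declared state.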

A minimal databricks.yml.

bundle:
  name: pipecode-dab-demo

include:
  - resources/jobs/*.yml
  - resources/pipelines/*.yml

variables:
  catalog:
    description: "Unity Catalog name"
    default: "dev_catalog"

targets:
  dev:
    workspace:
      host: https://dev.cloud.databricks.com
    variables:
      catalog: dev_catalog

  staging:
    workspace:
      host: https://staging.cloud.databricks.com
    variables:
      catalog: staging_catalog

  prod:
    mode: production
    workspace:
      host: https://prod.cloud.databricks.com
      root_path: /Shared/.bundle/prod/${bundle.name}
    variables:
      catalog: prod_catalog
    run_as:
      service_principal_name: 11111111-2222-3333-4444-555555555555

A resources/jobs/my_job.yml.

resources:
  jobs:
    nightly_etl:
      name: "nightly_etl_${bundle.target}"
      schedule:
        quartz_cron_expression: "0 0 2 * * ?"   # 02:00 daily (sec min hour day-of-month month day-of-week)
        timezone_id: "UTC"
        pause_status: "UNPAUSED"
      tasks:
        - task_key: ingest
          notebook_task:
            notebook_path: ../../src/notebooks/ingest.py   # relative to this YAML file
            base_parameters:
              catalog: ${var.catalog}
          new_cluster:
            spark_version: "14.3.x-scala2.12"
            node_type_id: "i3.xlarge"
            num_workers: 2

The three guardrails.

bundle validate — schema check + Unity Catalog reference check. Fails locally before any deploy.
Smoke test — a tiny assert_row_count notebook that reads from the freshly-deployed pipeline and confirms non-zero rows. Pass before promoting from staging to prod.
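Drift detection — a scheduled check that compares the live prod resources with the bundle definition and flags any field a human edited via the UI. This guide wires up bundle validate --target prod for that check; because validate is primarily a configuration check, confirm that your CLI version actually reports live-state differences, and fall back to diffing databricks jobs get output against the bundle if it does not. Treat any non-empty drift output as an alert.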

The GitHub Actions pattern.

On pull_request — bundle validate --target staging. Catches typos before merge.
On push to main — bundle deploy --target staging, then run smoke test.
On workflow_dispatch (manual) — bundle deploy --target prod after a human approval click in the GitHub Environments protection rule.
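
The three triggers above reduce to a lookup from event name to bundle command. The sketch below spells that mapping out; bundle_step is a hypothetical helper for illustration (inside a workflow, GitHub exposes the event name as $GITHUB_EVENT_NAME), and it only prints the command rather than running it.

```shell
#!/usr/bin/env bash
# bundle_step EVENT: print the bundle command that each trigger should run
bundle_step() {
  case "$1" in
    pull_request)      echo "databricks bundle validate --target staging" ;;
    push)              echo "databricks bundle deploy --target staging" ;;
    workflow_dispatch) echo "databricks bundle deploy --target prod" ;;
    *)                 echo "no bundle step for event: $1" >&2; return 1 ;;
  esac
}

for event in pull_request push workflow_dispatch; do
  printf '%-18s -> %s\n' "$event" "$(bundle_step "$event")"
done
```

In practice the mapping lives in the workflow YAML's on: and if: keys; the function form is just a compact way to see that no event other than a manual dispatch can reach prod.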

Auth in the CI step.

The CI runner needs an M2M service principal with workspace admin (or scoped permissions). Set DATABRICKS_HOST, DATABRICKS_CLIENT_ID, DATABRICKS_CLIENT_SECRET as GitHub Actions secrets. The CLI reads them automatically.
Never use a personal PAT in a CI runner. PAT scopes to a user; if that user leaves the company, the pipeline breaks. Service principal scopes to a role, which outlives any individual.
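
A cheap guard for the CI step is to fail fast when any of the three variables is missing, before the CLI produces a less obvious auth error. This is a sketch under the assumption that the variable names above are the ones in use; require_env is a hypothetical helper, and it reports names only, never values.

```shell
#!/usr/bin/env bash
# require_env NAME...: fail, naming every unset or empty variable, without printing any values
require_env() {
  missing=""
  for name in "$@"; do
    # eval-based indirect lookup keeps this POSIX-sh compatible
    eval "value=\${$name:-}"
    [ -n "$value" ] || missing="$missing $name"
  done
  if [ -n "$missing" ]; then
    echo "missing required env vars:$missing" >&2
    return 1
  fi
}

# In a CI step this would run before any databricks command
require_env DATABRICKS_HOST DATABRICKS_CLIENT_ID DATABRICKS_CLIENT_SECRET \
  || echo "would abort the CI step here"
```

Listing every missing name in one message saves a round trip per variable when a new repository is first wired up.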

Common interview probes.

"How do you make a staging-vs-prod parameter (like catalog name) flow through one bundle?" — declare variables: in databricks.yml, override per target, reference as ${var.name} in resource YAML.
"How do you guarantee a prod deploy does not start until staging passed a smoke test?" — separate workflows or stages: deploy-to-staging job runs bundle run smoke_test, only on success does the promote-to-prod job run (with environment protection rules).
"What happens if someone edits a bundle-managed job in the UI?" — the next drift check flags the difference (confirm that bundle validate --target prod surfaces it on your CLI version), and the next bundle deploy --target prod overwrites the manual edit. The bundle is the source of truth.
"How do you roll back a bad bundle deploy?" — re-run bundle deploy from the previous Git SHA. Because a deploy updates existing resources in place and preserves their IDs, rollback is a re-deploy of the older commit.

Worked example — the minimal validate-on-PR GitHub Actions workflow

Detailed explanation. The cheapest possible CI step is bundle validate on every PR. Even before any deploy automation, this catches the most common configuration mistakes: missing fields, wrong cluster references, undefined variables, broken UC references.

Question. Write a GitHub Actions workflow that runs databricks bundle validate --target staging on every PR, authenticated via an M2M service principal stored in GitHub Secrets.

Input.

GitHub secret	maps to env var
DATABRICKS_HOST	DATABRICKS_HOST
DATABRICKS_CLIENT_ID	DATABRICKS_CLIENT_ID
DATABRICKS_CLIENT_SECRET	DATABRICKS_CLIENT_SECRET

Code.

name: bundle-validate
on:
  pull_request:
    paths:
      - "databricks.yml"
      - "resources/**"
      - "src/**"
      - ".github/workflows/bundle-validate.yml"

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install Databricks CLI
        run: |
          curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

      - name: Validate bundle
        env:
          DATABRICKS_HOST:          ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_CLIENT_ID:     ${{ secrets.DATABRICKS_CLIENT_ID }}
          DATABRICKS_CLIENT_SECRET: ${{ secrets.DATABRICKS_CLIENT_SECRET }}
        run: |
          databricks bundle validate --target staging

Step-by-step explanation.

The on: pull_request trigger plus the paths: filter ensures the job only runs when bundle-relevant files change. Saves CI minutes.
actions/checkout@v4 makes the repo available; the bundle CLI needs to read databricks.yml + resources/**.
The install script drops a Linux build of the databricks Go binary into /usr/local/bin. Two seconds.
The validate step receives three env vars. The CLI sees DATABRICKS_CLIENT_ID + DATABRICKS_CLIENT_SECRET and automatically performs an OAuth M2M token exchange — no PAT, no human in the loop.
bundle validate --target staging resolves variables to staging values, runs schema validation, dereferences UC catalog/schema/table names, and returns non-zero if anything is wrong. The PR fails the check on any issue.

Output.

PR scenario	validate exit	PR status
YAML syntax OK + UC refs OK	0	green
typo in `node_type_id`	non-zero	red
missing variable	non-zero	red
undefined UC catalog	non-zero	red

Rule of thumb. Add bundle validate as your first GitHub Actions step on day one of bundle adoption. It is the cheapest possible safety net and catches a wide class of bugs before any deploy ever happens.

Worked example — the staging-then-prod promote workflow

Detailed explanation. Once validate works, the second workflow is "deploy to staging on merge, run smoke test, then optionally promote to prod after a human approval." This is the canonical Databricks CI/CD shape.

Question. Write a GitHub Actions workflow that on merge to main deploys to staging, runs a smoke-test job, and gates a separate prod-deploy job behind a manual approval.

Input.

Trigger	Action
`push: main`	deploy + smoke staging
`workflow_dispatch` (with approval)	deploy prod

Code.

name: bundle-deploy
on:
  push:
    branches: [main]
  workflow_dispatch:

jobs:
  staging:
    runs-on: ubuntu-latest
    if: github.event_name == 'push'
    steps:
      - uses: actions/checkout@v4
      - run: curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
      - env:
          DATABRICKS_HOST:          ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_CLIENT_ID:     ${{ secrets.DATABRICKS_CLIENT_ID }}
          DATABRICKS_CLIENT_SECRET: ${{ secrets.DATABRICKS_CLIENT_SECRET }}
        run: |
          databricks bundle deploy --target staging
          databricks bundle run    --target staging assert_row_count

  prod:
    runs-on: ubuntu-latest
    # No "needs: staging" here: staging is skipped on a manual dispatch, and GitHub
    # skips any job whose needed job was skipped, so prod would never start.
    if: github.event_name == 'workflow_dispatch'
    environment: production    # GitHub Environment with required reviewers
    steps:
      - uses: actions/checkout@v4
      - run: curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
      - env:
          DATABRICKS_HOST:          ${{ secrets.DATABRICKS_PROD_HOST }}
          DATABRICKS_CLIENT_ID:     ${{ secrets.DATABRICKS_PROD_CLIENT_ID }}
          DATABRICKS_CLIENT_SECRET: ${{ secrets.DATABRICKS_PROD_CLIENT_SECRET }}
        run: |
          databricks bundle deploy --target prod

Step-by-step explanation.

On push: main, the staging job runs. It deploys the bundle to the staging workspace, then runs an assert_row_count job in staging as a smoke test. If the smoke test fails, the workflow stops — prod is never touched.
The prod job is gated by environment: production. GitHub Environments support "required reviewers" — a configured human must click "Approve" before the job starts. That click is the audit trail.
The two jobs use different secrets for staging vs prod. Two service principals, two scopes — staging credentials cannot mint a prod deploy even if they leak.
The prod job deliberately has no needs: staging. On a workflow_dispatch run the staging job is skipped by its if: filter, and GitHub Actions skips any job whose needs: dependency was skipped, so a needs: link would keep prod from ever running. The staging-before-prod ordering comes from the process instead: staging deploys on every merge, and prod deploys only on a manual dispatch that a reviewer approves.

Output.

Event	Stage outcome
PR opened	(handled by validate workflow)
Merge to main, smoke green	staging deployed; prod NOT deployed (awaiting dispatch)
Operator runs `workflow_dispatch` + approves	prod deployed
Smoke test fails in staging	workflow fails; alarms fire; nobody clicks approve

Rule of thumb. Treat the staging deploy + smoke as a gate, not a vanity step. If the smoke test ever passes when prod would have broken, fix the smoke test — it is your last line of defence before a manual prod approval.

Interview question on DAB drift detection

A senior interviewer often probes: "Your team adopted DAB six months ago, but engineers still occasionally edit jobs in the UI for hot-fixes. How do you detect and reconcile that drift automatically?"

Solution Using a scheduled `bundle validate --target prod` job that diffs against deployed state

name: drift-detect
on:
  schedule:
    - cron: "0 9 * * MON"        # every Monday 09:00 UTC
  workflow_dispatch:

jobs:
  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
      - id: validate
        env:
          DATABRICKS_HOST:          ${{ secrets.DATABRICKS_PROD_HOST }}
          DATABRICKS_CLIENT_ID:     ${{ secrets.DATABRICKS_PROD_CLIENT_ID }}
          DATABRICKS_CLIENT_SECRET: ${{ secrets.DATABRICKS_PROD_CLIENT_SECRET }}
        run: |
          set +e
          OUTPUT=$(databricks bundle validate --target prod 2>&1)
          EXIT_CODE=$?
          echo "$OUTPUT" > drift.txt
          echo "drifted=$([ "$EXIT_CODE" -ne 0 ] && echo true || echo false)" >> "$GITHUB_OUTPUT"
      - if: steps.validate.outputs.drifted == 'true'
        run: |
          gh issue create \
            --title "DAB drift detected on prod ($(date -u +%F))" \
            --body-file drift.txt \
            --label drift
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}

Step-by-step trace.

Day	bundle vs deployed	validate exit	action
Mon	identical	0	no-op
Tue	(someone edits a job in UI)	n/a	(no cron yet)
Mon next week	diff: schedule changed	non-zero	open GitHub issue, page on-call
(engineer reviews)	either commit the UI edit to YAML, or `bundle deploy` to reset	—	issue closed

The job runs weekly, compares the declared YAML to the live workspace, and opens an issue when drift appears. Engineers then either adopt the manual edit (commit it to YAML and merge) or reject it (re-deploy the bundle to restore the declared state).

Output:

Drift scenario	Issue title	Issue body
no drift	(none — issue not created)	—
job schedule edited	"DAB drift detected on prod (2026-06-08)"	full validate output diff
cluster libraries added in UI	"DAB drift detected on prod (2026-06-15)"	full validate output diff

Why this works — concept by concept:

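bundle validate is the assumed diff engine — this workflow relies on validate exiting non-zero when the deployed workspace state differs from the YAML. validate is primarily a schema and configuration check, so verify that behaviour on your CLI version; if it only validates configuration, swap the step for one that exports the live settings (for example databricks jobs get) and diffs them against the bundle.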
Scheduled, not push-triggered — drift is a state, not an event. A weekly cron catches it without depending on someone making a code change. Run nightly if your team allows.
Service principal scoped to prod — the drift workflow should run as its own M2M service principal with only read access on prod resources, separate from the deploy bot. Read-only credentials cannot be weaponised even if leaked.
GitHub issue, not silent fix — auto-reconciling drift by re-deploying would clobber a legitimate hot-fix. An issue forces a human review and a decision (adopt vs reject).
Adopt-or-reject loop — engineers either commit the UI edit to YAML (turning the manual fix into declarative code) or re-deploy the bundle to undo the edit. Either way the bundle becomes the source of truth again.
Cost — one CI minute per week + one read-only API scan. Negligible compared to the cost of a silent click-ops outage.

5. Authentication patterns — when each one fits

Databricks ships four authentication modes — PAT, OAuth U2M, OAuth M2M (service principal), and notebook context — and the production answer is OAuth M2M

The mental model in one line: PAT is fast and per-user; OAuth U2M is browser-login for laptops; OAuth M2M with a service principal is the answer for any scheduled or shared automation; notebook-context auth is the implicit token a job inherits inside its own notebook. Once you can name when each fits, the entire "how do I authenticate this script?" question becomes a four-way decision tree.

The four modes in one table.

Mode	Use case	Lifetime	Rotation	Scope
PAT	ad-hoc human, quick script	up to 90 days	manual	user identity
OAuth U2M	laptop CLI	1h access / 90d refresh	browser refresh	user identity
OAuth M2M (SP)	CI/CD, scheduled, shared automation	1h access	client_secret rotation every 90d	service principal
Notebook context	inside a running job	per-run	inherited from job run-as	run-as identity
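
The table collapses into a small decision function. The sketch below is one way to encode it; pick_auth_mode and its context labels (ci, laptop, notebook, adhoc and so on) are hypothetical names chosen for this example, not CLI or API concepts.

```shell
#!/usr/bin/env bash
# pick_auth_mode CONTEXT: map a calling context to the mode the table recommends
pick_auth_mode() {
  case "$1" in
    ci|scheduled|shared) echo "oauth-m2m" ;;        # no human at the keyboard
    laptop)              echo "oauth-u2m" ;;        # browser login, automatic refresh
    notebook)            echo "notebook-context" ;; # inherits the job's run-as identity
    adhoc)               echo "pat" ;;              # quick one-off poke at the API
    *)                   echo "unknown context: $1" >&2; return 1 ;;
  esac
}

for ctx in ci laptop notebook adhoc; do
  printf '%-9s -> %s\n' "$ctx" "$(pick_auth_mode "$ctx")"
done
```

Anything unattended lands on OAuth M2M, which matches the "production answer" framing of this section.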

PAT in detail.

How to mint. UI: User settings → Developer → Access tokens → Generate. Or via the /api/2.0/token/create endpoint.
Shape. dapi<32 hex chars>. Carries the user's identity and entitlements.
Lifetime. Configurable, capped at 90 days by workspace policy.
Use case. Ad-hoc — curl from your laptop, a one-off python script you run interactively. The "I just need to poke the API once" tool.
What it should never be. A CI runner credential, a shared team credential, a value committed to a repo.

OAuth U2M in detail.

How to mint. databricks auth login --host https://<workspace> opens a browser, the user logs in, and the CLI saves the profile to ~/.databrickscfg and caches the OAuth tokens locally (under ~/.databricks/token-cache.json in current CLI versions).
Lifetime. Each access token is 1 hour; the refresh token rotates and is valid 90 days.
Use case. Developer laptops running the CLI. Token rotation is automatic and invisible. If the user gets offboarded, the refresh token dies — no orphaned credentials.
What it should never be. A CI runner credential (interactive login required).

OAuth M2M with a service principal in detail.

How to mint. Create a service principal in the account console and generate an OAuth secret on it. The SP's application ID is the client_id and the generated secret is the client_secret. The CLI exchanges those for a 1-hour access token automatically.
Use case. GitHub Actions, scheduled CLI jobs, shared deploy automation. Anything that runs without a human at the keyboard.
Rotation. Rotate the client_secret every 90 days by policy. Mint the replacement with databricks account service-principal-secrets create <sp_id>, deploy the new secret to the CI store, then revoke the old secret.
What it should never be. A handout to individual developers — service principals are role-shaped, not user-shaped, and personal use undermines the audit trail.

Notebook-context auth in detail.

How it works. Inside a notebook running as a job task, dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get() returns a per-run token. Most code instead calls dbutils.secrets.get(scope, key) or uses an SDK client that picks up the context automatically.
Identity. Tied to the job's run_as (a user or a service principal). The token cannot escalate beyond run_as entitlements.
Lifetime. As long as the run. Cannot be exported, cannot be saved to a file.
Use case. Any in-job automation — "list other jobs in the workspace," "trigger a downstream run," "write a record to a metadata table."
What it should never be. Logged. Ever. print(token) in a notebook is a fireable mistake.

The four hard rules.

Never commit a PAT (or a client_secret, or any token) to a repository — even a private one. Treat tokens as radioactive.
Never log a secret value — not in print, not in dbutils.fs.put, not in stdout from a CLI. Every secrets API is designed so you do not need to.
Rotate service principal secrets every 90 days — automatable via a scheduled CLI job that mints, deploys, and revokes.
Audit via system.access.audit — every API call carries the principal name; the audit table shows who deployed what and when.
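
The first rule can be enforced mechanically. As a minimal sketch, assuming PATs follow the dapi-plus-32-hex shape described earlier, the hypothetical scan_for_pats helper below greps a tree for that pattern and prints only file:line, never the matched value. The demo directory and the token in it are fake placeholders.

```shell
#!/usr/bin/env bash
# scan_for_pats PATH...: print file:line for PAT-shaped strings; fail if any are found
scan_for_pats() {
  # -r recurse, -H always show the file name, -n line numbers, -E extended regex
  hits=$(grep -rHnE 'dapi[0-9a-f]{32}' "$@" 2>/dev/null | cut -d: -f1,2)
  if [ -n "$hits" ]; then
    echo "$hits"
    return 1
  fi
}

# Demo tree with one clean file and one file holding an obviously fake token
DEMO_DIR="$(mktemp -d)"
echo 'host = https://acme.cloud.databricks.com' > "$DEMO_DIR/clean.cfg"
echo 'token = dapi0123456789abcdef0123456789abcdef' > "$DEMO_DIR/leaky.cfg"
scan_for_pats "$DEMO_DIR" || echo "blocked: PAT-shaped string found"
```

Run as a pre-commit hook or an early CI step, a check like this complements (but does not replace) a dedicated secret scanner, since it only knows one token shape.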

Common interview probes on auth.

"What is the production answer for authenticating a CI runner?" — OAuth M2M with a service principal. PAT in CI is a code smell.
"How is OAuth U2M different from a PAT?" — U2M tokens are short-lived (1h), automatically refreshed via a 90-day refresh token, and die when the user is offboarded. PAT is a manually-issued long-lived bearer token.
"If a notebook inside a job needs to call the Jobs API, how does it authenticate?" — via notebook-context auth — the inherited token of the job's run-as identity. Use the Databricks SDK and it picks the token up automatically.
"How would you migrate from PAT to M2M without downtime?" — provision the service principal, grant it the same workspace permissions as the PAT's user, swap CI secrets to the new client_id/client_secret, revoke the old PAT.

Worked example — provisioning a service principal for GitHub Actions

Detailed explanation. The first M2M setup is fiddly because three layers must agree: the Databricks account console (create the SP), the workspace (grant it permissions), and GitHub (store the secret). Once the three layers are wired, the CI script just sees three env vars.

Question. Walk through the CLI commands to provision a service principal ci-deploy-bot, assign it to the workspace with USER permission, mint an OAuth secret, and verify that the credentials work.

Input.

field	value
account_id	`00000000-1111-2222-3333-444444444444`
workspace_id	12345
sp display name	`ci-deploy-bot`

Code.

# Run with an account-admin OAuth U2M session
databricks auth login --host https://accounts.cloud.databricks.com \
  --account-id 00000000-1111-2222-3333-444444444444 --profile account

# 1) Create the service principal at the account level
SP_JSON=$(databricks account service-principals create \
  --display-name "ci-deploy-bot" --profile account --output JSON)
SP_ID=$(echo "$SP_JSON" | jq -r '.id')                   # numeric principal ID
CLIENT_ID=$(echo "$SP_JSON" | jq -r '.application_id')   # doubles as the OAuth client_id

# 2) Mint an OAuth secret (returned once; capture it immediately)
CLIENT_SECRET=$(databricks account service-principal-secrets create "$SP_ID" \
  --profile account --output JSON | jq -r '.secret')

# 3) Assign the SP to the workspace (an account-level command)
databricks account workspace-assignment update 12345 "$SP_ID" \
  --json '{"permissions": ["USER"]}' --profile account

# 4) Verify by listing jobs via the SP (use a new profile)
cat <<EOF >> ~/.databrickscfg
[ci-deploy-bot]
host          = https://acme.cloud.databricks.com
client_id     = $CLIENT_ID
client_secret = $CLIENT_SECRET
EOF
databricks --profile ci-deploy-bot jobs list --output JSON | jq 'length'

Step-by-step explanation.

The first login is human (U2M as an account admin), which the account-level APIs require.
account service-principals create registers the SP at the account level. The response carries a numeric id (the principal ID the following commands use) and an application_id (a UUID that doubles as the OAuth client_id).
account service-principal-secrets create returns a one-time secret: store it immediately, because the API never returns it again. It is the client_secret that pairs with the application ID.
account workspace-assignment update grants the SP USER permission on the workspace. Without this, the SP can authenticate but cannot see any workspace resource.
Adding a [ci-deploy-bot] profile in ~/.databrickscfg lets you test the credentials locally. jobs list should return a non-empty array (assuming the SP has been granted CAN_VIEW on at least one job).
The same client_id / client_secret go into GitHub Actions secrets, and the CI script uses them directly via env vars without any local profile file.

Output.

Step	Output
login	(no output; interactive login)
1	`SP_ID` (numeric), `CLIENT_ID` (application ID)
2	`client_secret` (shown once)
3	workspace assignment with `USER` permission
4	jobs count (e.g. 12)

Rule of thumb. Provision one service principal per deployment role — ci-deploy-bot for prod deploys, ci-validate-bot for read-only validates, drift-bot for read-only drift checks. Never reuse one SP across roles; if one credential leaks, you only need to rotate that one role's secret.

Worked example — rotating a service principal secret on a 90-day schedule

Detailed explanation. This guide's policy is to rotate SP secrets every 90 days, regardless of the maximum lifetime the secret was issued with. The rotation has three steps: mint a new secret, swap it into the CI store, then revoke the old secret. The trick is to overlap: do not revoke the old secret before the new one is live in CI, or the next deploy will fail.

Question. Write a bash script that mints a new SP secret, prints both the new and old secret IDs, and revokes the old one only after the user confirms the CI is using the new one.

Input.

field	value
SP_ID	`99999999-...`
current_secret_id	`aaaaaaaa-...` (from `databricks account service-principal-secrets list <sp_id>`)

Code.

SP_ID="99999999-1111-2222-3333-444444444444"
OLD_SECRET_ID="aaaaaaaa-1111-2222-3333-444444444444"

# 1) Mint a new secret
NEW=$(databricks account service-principal-secrets create "$SP_ID" \
       --output JSON)
NEW_SECRET_ID=$(echo "$NEW" | jq -r '.id')
NEW_CLIENT_SECRET=$(echo "$NEW" | jq -r '.secret')

echo "New secret ID: $NEW_SECRET_ID"
echo "New client_secret (store in GitHub Actions NOW): $NEW_CLIENT_SECRET"

# 2) Pause for the operator to update the CI secret store
read -p "Press ENTER after the new secret is live in CI..."

# 3) Verify CI can authenticate with the new secret
echo "Run a smoke test deploy in CI now. Did it succeed? [y/n]"
read CONFIRM
if [ "$CONFIRM" != "y" ]; then
  echo "Aborting — leaving both secrets active"
  exit 1
fi

# 4) Revoke the old secret
databricks account service-principal-secrets delete \
  --service-principal-id "$SP_ID" \
  --secret-id "$OLD_SECRET_ID"
echo "Old secret $OLD_SECRET_ID revoked"

Step-by-step explanation.

The new secret is minted first. Both new and old are valid simultaneously — that overlap window is the safety margin.
The script pauses for the operator to deploy the new client_secret to CI (GitHub Actions secrets, AWS Secrets Manager, whichever store the team uses). No automation here on purpose; CI store updates require a human.
The smoke test confirms the new secret actually works for CI deploys. If the operator says "no," the script aborts without revoking the old secret — the rotation can be retried.
Only on y does the script call service-principal-secrets delete against the old secret ID. After this call, anything still using the old client_secret breaks.

Output.

Phase	What is valid
Before rotation	old secret only
After step 1	old + new both valid (overlap)
After step 4 (revoke)	new secret only

Rule of thumb. Rotation always overlaps — mint, deploy, verify, then revoke. Never revoke first; never skip verify. A 30-minute window where both secrets work is the cost of "the deploys never fail at midnight on rotation day."
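
The mint, verify, revoke ordering can also be expressed as a small Python function, which makes the "never revoke first" rule easy to unit test. This is a minimal sketch: `mint`, `verify` and `revoke` are hypothetical callables supplied by the caller (in practice they would wrap the CLI calls and the CI smoke test), not part of any Databricks library.

```python
from typing import Callable


def rotate_secret(
    mint: Callable[[], str],
    verify: Callable[[str], bool],
    revoke: Callable[[str], None],
    old_secret_id: str,
) -> str:
    """Mint, verify, then revoke. Returns the secret ID CI should use afterwards.

    mint, verify and revoke are hypothetical hooks, injected so the ordering
    can be tested without touching a real account.
    """
    new_secret_id = mint()            # from here on, old and new are both valid
    if not verify(new_secret_id):     # smoke test failed or was not confirmed
        return old_secret_id          # abort: the old secret is never revoked
    revoke(old_secret_id)             # only reached after a passing verify
    return new_secret_id
```

If `verify` returns False the function exits before `revoke` is called, which mirrors the abort branch of the bash script.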

Interview question on the production auth answer

A senior interviewer often asks: "Your team is migrating from PATs to OAuth for all CI/CD. Walk me through the migration plan — what changes in CI, how do you handle the cutover, and what auditing do you add?"

Solution Using OAuth M2M with a service principal and `system.access.audit` reconciliation

# 1) Provision one SP per deploy role (already shown in prior example)
SP_ID=...

# 2) Grant the SP exactly the workspace permissions of the PAT user
#    - workspace USER access
#    - CAN_MANAGE on the bundle-owned jobs and pipelines

# 3) Add NEW CI secrets (do not delete the old PAT yet)
#    DATABRICKS_HOST, DATABRICKS_CLIENT_ID, DATABRICKS_CLIENT_SECRET

# 4) Update CI workflows to use M2M env vars
#    (do not rely on the CLI choosing between a PAT and client_id/client_secret
#     when both are set: remove DATABRICKS_TOKEN from the workflow env and
#     set DATABRICKS_AUTH_TYPE=oauth-m2m explicitly)

# 5) Cut traffic — re-trigger the staging workflow, confirm it passes
#    Then re-trigger prod with a manual dispatch

# 6) Run the audit query in DBSQL or via the REST API
SELECT
    user_identity.email AS principal,
    action_name,
    response.status_code,
    COUNT(*) AS n
FROM system.access.audit
WHERE event_time > current_date - INTERVAL 7 DAYS
  AND service_name = 'databricks-cli'
GROUP BY principal, action_name, response.status_code
ORDER BY n DESC;

# 7) Confirm the SP's principal name shows up, the old user does NOT
#    Revoke the old PAT via /api/2.0/token/delete

Step-by-step trace.

Day	Action	What's valid
0	provision SP + grants	PAT + SP both valid
0	add new CI secrets alongside the old PAT	PAT + SP both valid
1	update workflows to use SP env vars	PAT + SP both valid
2	run staging workflow, verify in audit	PAT + SP both valid
3	run prod workflow (manual), verify in audit	PAT + SP both valid
4	confirm only SP appears in audit	PAT + SP both valid
5	delete old PAT	SP only

Output:

audit row	principal	action_name	n
latest week	`ci-deploy-bot@.spn`	`bundleDeploy`	14
latest week	`ci-deploy-bot@.spn`	`jobRunNow`	320
latest week	(old user) — none	—	0

The audit confirms the SP is doing the work and the old user identity no longer appears, so the old PAT is safe to revoke.
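
The "SP present, old user absent" check can be automated once the audit query result is loaded into Python. The sketch below assumes each row is a mapping with the `principal` and `n` columns produced by the query above; the principal names in the usage line are placeholders, and how the rows are fetched (DBSQL connector, REST, export) is left out.

```python
from typing import Any, Iterable, Mapping


def migration_complete(
    rows: Iterable[Mapping[str, Any]],
    sp_principal: str,
    old_principal: str,
) -> bool:
    """True when the SP has audit activity and the old user has none.

    rows are assumed to carry the 'principal' and 'n' columns of the
    audit query; any other columns are ignored.
    """
    active = {str(row["principal"]) for row in rows if int(row["n"]) > 0}
    return sp_principal in active and old_principal not in active
```

For example, `migration_complete(rows, "ci-deploy-bot", "alice@example.com")` would gate the final PAT revocation step in a script.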

Why this works — concept by concept:

Overlap, don't cut over — provisioning the SP first and keeping the PAT live until the SP has handled real traffic is the only safe migration. Never delete the old credential first.
Per-role service principals — ci-deploy-bot, drift-bot, validate-bot are separate. If one credential leaks, you rotate one role's secret, not all of them.
system.access.audit reconciliation — the audit table is the source of truth for "who actually called the API." If the SP shows up and the old user does not, the migration is genuinely complete.
Automatic token refresh — the M2M flow auto-refreshes the 1-hour access token from the client_id/client_secret pair. No human in the loop, no scheduled "renew" job needed for the access token (only the secret rotates every 90 days).
Auditing as a first-class step — the migration is not done when CI is green; it is done when the audit log shows only the SP. Treat the audit query as a checklist item.
Cost — the migration is one engineer-week of setup; ongoing cost is one rotation script every 90 days. Compared to the cost of a leaked PAT on a public repo, the math is trivially in favour of M2M.
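
To make the "automatic token refresh" point concrete, here is a minimal sketch of a cache that re-fetches a short-lived access token shortly before it expires. It illustrates the behaviour only and is not the CLI's or SDK's actual implementation; `fetch` is a hypothetical callable that would perform the client-credentials exchange and return `(access_token, lifetime_seconds)`.

```python
import time
from typing import Callable, Optional, Tuple


class TokenCache:
    """Caches an access token and refreshes it shortly before expiry."""

    def __init__(
        self,
        fetch: Callable[[], Tuple[str, float]],   # hypothetical token fetcher
        clock: Callable[[], float] = time.monotonic,
        skew_seconds: float = 60.0,               # refresh this early
    ) -> None:
        self._fetch = fetch
        self._clock = clock
        self._skew = skew_seconds
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        now = self._clock()
        # Refresh on first use, or once we are inside the skew window.
        if self._token is None or now >= self._expires_at - self._skew:
            token, lifetime = self._fetch()
            self._token = token
            self._expires_at = now + lifetime
        return self._token
```

Injecting the clock keeps the class testable without sleeping; the 60-second skew is an arbitrary choice for the sketch.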

Cheat sheet — API + CLI recipes

List every job in prod (just IDs). databricks jobs list --profile prod --output JSON | jq -r '.[].job_id'. Pipe to wc -l for a count, or to a while read loop for per-job actions.
Trigger a run with parameters. databricks jobs run-now --job-id <id> --notebook-params '{"date":"2026-06-04"}' --profile prod. Add --idempotency-token "$(uuidgen)" in CI.
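Tail a run until terminal state. while STATE=$(databricks jobs get-run --run-id <id> --output JSON | jq -r '.state.life_cycle_state'); [ "$STATE" != "TERMINATED" ] && [ "$STATE" != "SKIPPED" ] && [ "$STATE" != "INTERNAL_ERROR" ]; do sleep 10; done. All three are terminal life_cycle_state values; checking only TERMINATED loops forever on a run that was skipped or hit an internal error.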
Deploy a bundle to prod. databricks bundle deploy --target prod. Add --force-lock only when you know nobody else is mid-deploy.
Validate a bundle without deploying. databricks bundle validate --target prod. Treat any non-zero exit as a deploy blocker.
Create a cluster from a checked-in JSON spec. databricks clusters create --json @cluster.json --profile prod --output JSON | jq -r '.cluster_id'.
Restart a cluster with new libraries. databricks clusters restart --cluster-id <id> after databricks libraries install --cluster-id <id> --pypi-package "mylib==1.2.3".
Rotate a service principal secret. databricks account service-principal-secrets create --service-principal-id <sp_id> — capture the new secret, deploy to CI, verify, then delete --secret-id <old_id>.
Find drift between bundle and prod. databricks bundle validate --target prod. Diff the output against the previous run; schedule weekly as a GitHub Action.
Curl a raw REST endpoint. curl -X POST "$DATABRICKS_HOST/api/2.1/jobs/run-now" -H "Authorization: Bearer $DATABRICKS_TOKEN" -H "Content-Type: application/json" -d @payload.json | jq.
Smoke test a freshly-deployed bundle job. databricks bundle run --target staging assert_row_count — fails non-zero if the notebook raises, gating prod promotion.
Switch profiles for one call. databricks --profile prod <command>. Same binary, different workspace, different identity.
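
The "tail a run" recipe is the one most often ported into Python test harnesses. A minimal sketch follows, assuming `get_state` is a hypothetical callable that returns the run's `life_cycle_state` string (for example by calling `runs/get`); the poll count and interval are arbitrary defaults.

```python
import time
from typing import Callable

# Life-cycle states after which a run will not change again.
TERMINAL_STATES = {"TERMINATED", "SKIPPED", "INTERNAL_ERROR"}


def wait_for_run(
    get_state: Callable[[], str],          # hypothetical state lookup
    poll_seconds: float = 10.0,
    max_polls: int = 360,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the run reaches a terminal state, then return that state."""
    for _ in range(max_polls):
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError(f"run not terminal after {max_polls} polls")
```

Bounding the loop with `max_polls` avoids a CI job that hangs forever when a run is stuck in a non-terminal state.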

Frequently asked questions

What is the difference between the Databricks API and the Databricks CLI?

The Databricks API is the canonical REST surface (/api/2.x/<group>/<verb>) — authenticated with a Bearer token, paginated with next_page_token, and the foundation everything else builds on. The Databricks CLI is a Go binary (databricks) that wraps the API with consistent flags (--profile, --output JSON, --json @file), six top-level command groups, and automatic OAuth token handling. They are the same surface; the CLI is just the ergonomic client. Use the CLI for daily ops and scripts; reach for raw HTTP calls only when you need a field the CLI does not expose or you are writing a deeper library. The legacy Python databricks-cli from before 2024 is deprecated — use the Go binary.
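
The `next_page_token` contract mentioned above can be sketched as a generator. This assumes the Jobs 2.1 list shape (a `jobs` array plus an optional `next_page_token`); `fetch_page` is a hypothetical callable that takes the page token (None for the first page) and returns the decoded JSON response.

```python
from typing import Any, Callable, Dict, Iterator, Optional


def iter_jobs(
    fetch_page: Callable[[Optional[str]], Dict[str, Any]],
) -> Iterator[Dict[str, Any]]:
    """Yield every job across all pages of a token-paginated list call."""
    page_token: Optional[str] = None
    while True:
        page = fetch_page(page_token)
        yield from page.get("jobs", [])      # an empty page has no 'jobs' key
        page_token = page.get("next_page_token")
        if not page_token:                   # missing or empty token: last page
            return
```

Keeping the HTTP call behind `fetch_page` means the same loop works with any client and can be tested against canned pages.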

PAT vs OAuth — which authentication should I use for CI/CD?

OAuth M2M with a service principal is the production answer for CI/CD. A PAT is per-user, manually rotated, and dies when the issuing user is offboarded — a single human's offboarding can break every CI pipeline. An OAuth M2M service principal is role-shaped: it has a client_id and client_secret, rotates every 90 days, and survives any individual leaving the team. In GitHub Actions, set DATABRICKS_HOST, DATABRICKS_CLIENT_ID, and DATABRICKS_CLIENT_SECRET as repository secrets and the CLI handles the token exchange automatically. PATs are fine for laptop scripts and ad-hoc human work; never for shared automation.
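
A CI job can fail fast on a misconfigured secret set before it ever calls the CLI. The function below is a hypothetical pre-flight check over the environment variable names used in this guide; it does not reproduce the CLI's own credential resolution and deliberately rejects the ambiguous case where both a PAT and M2M credentials are present.

```python
from typing import Mapping


def detect_auth(env: Mapping[str, str]) -> str:
    """Return 'oauth-m2m' or 'pat', or raise if the config is unusable."""
    if not env.get("DATABRICKS_HOST"):
        raise ValueError("DATABRICKS_HOST is not set")
    has_pat = bool(env.get("DATABRICKS_TOKEN"))
    has_m2m = bool(env.get("DATABRICKS_CLIENT_ID")) and bool(
        env.get("DATABRICKS_CLIENT_SECRET")
    )
    if has_pat and has_m2m:
        # Ambiguous: make the workflow pick one identity explicitly.
        raise ValueError("both a PAT and M2M credentials are set; remove one")
    if has_m2m:
        return "oauth-m2m"
    if has_pat:
        return "pat"
    raise ValueError("no Databricks credentials found")
```

Calling `detect_auth(os.environ)` as the first workflow step turns a confusing mid-deploy auth error into a clear configuration error.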

What are Databricks Asset Bundles and why are they better than raw API calls?

Databricks Asset Bundles are a YAML deployment format that describes the full set of jobs, pipelines, clusters, permissions, and dashboards a project wants in a workspace. A bundle is a databricks.yml root file plus typed YAML files under resources/, with multiple targets: (dev, staging, prod) sharing the same structure. The CLI commands are bundle validate (schema and Unity Catalog reference check), bundle deploy --target <env> (atomic deploy that updates existing resource IDs in place), and bundle run (trigger a deployed job). Bundles beat raw API calls on three axes: they are declarative (the YAML is the source of truth), atomic (one transaction per deploy), and diffable (Git history is your audit trail). Use raw API calls only for surgical one-off fixes the bundle does not cover.

Can I manage Unity Catalog catalogs and grants via the REST API?

Yes — Unity Catalog has its own endpoint group at /api/2.1/unity-catalog/... covering catalogs, schemas, tables, volumes, models, external locations, storage credentials, and grants. The most common automated calls are POST /unity-catalog/catalogs to create a catalog, POST /unity-catalog/schemas to create a schema, and PATCH /unity-catalog/permissions/<securable_type>/<full_name> to adjust grants. The CLI exposes these as databricks catalogs, databricks schemas, databricks grants, and Asset Bundles support resources/grants/ blocks for declarative grant management. Anything you can do in the Unity Catalog UI is exposed via REST; treat grants as code and apply them via bundles.
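
As an illustration of the grants call, the sketch below builds the path and JSON body for a PATCH on the permissions endpoint named above. It assumes the request body uses a `changes` list with `principal`, `add` and `remove` fields; the helper name, principal and schema are made up for the example, and sending the request is left to whichever HTTP client you use.

```python
import json
from typing import Iterable, Tuple
from urllib.parse import quote


def build_grant_request(
    securable_type: str,
    full_name: str,
    principal: str,
    add: Iterable[str] = (),
    remove: Iterable[str] = (),
) -> Tuple[str, str]:
    """Return (path, json_body) for a Unity Catalog permissions PATCH."""
    path = (
        "/api/2.1/unity-catalog/permissions/"
        f"{securable_type}/{quote(full_name, safe='')}"
    )
    change = {
        "principal": principal,
        "add": sorted(add),        # sorted so the payload is stable in diffs
        "remove": sorted(remove),
    }
    return path, json.dumps({"changes": [change]})
```

A call such as `build_grant_request("schema", "main.sales", "analysts", add=["SELECT", "USE_SCHEMA"])` yields a payload that can be reviewed in a pull request before it is applied.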

How do I idempotently create a job from a GitHub Actions workflow?

The cleanest pattern is to use Databricks Asset Bundles. A bundle's bundle deploy --target staging creates the job on first run and updates the same job ID in place on every subsequent run, so there is no "create vs update" distinction for the CI script to manage. The job's identity comes from its resource name in the bundle YAML, not from a per-deploy generated ID. If you must use raw API calls instead, hit POST /api/2.1/jobs/reset (full overwrite of an existing job) when you already know the job_id, and fall back to POST /api/2.1/jobs/create only when no job with that name exists yet. Note that jobs/create takes no idempotency_token (that field belongs to run-now and runs/submit) and job names are not unique, so a retried create produces a duplicate job: look the job up by name with GET /api/2.1/jobs/list first, and record the returned job_id yourself. The bundle path is dramatically simpler.
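
The raw-API fallback can be sketched as a planning function that decides between `jobs/reset` and `jobs/create` from a name lookup. It assumes `existing_jobs` is the `jobs` array returned by `jobs/list` (each entry carrying `job_id` and `settings.name`); the function name is hypothetical and the actual POST is left to the caller.

```python
from typing import Any, Dict, List, Tuple


def plan_job_upsert(
    existing_jobs: List[Dict[str, Any]],
    settings: Dict[str, Any],
) -> Tuple[str, Dict[str, Any]]:
    """Return (endpoint, body) that creates the job or overwrites it in place."""
    name = settings["name"]
    matches = [
        job for job in existing_jobs
        if job.get("settings", {}).get("name") == name
    ]
    if len(matches) > 1:
        # Job names are not unique; refuse to guess which one to overwrite.
        raise ValueError(f"{len(matches)} jobs are named {name!r}")
    if matches:
        body = {"job_id": matches[0]["job_id"], "new_settings": settings}
        return "/api/2.1/jobs/reset", body
    return "/api/2.1/jobs/create", settings
```

Separating the decision from the HTTP call keeps the create-or-update rule testable, and the duplicate-name guard surfaces the case a bundle would otherwise handle for you.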

How do I migrate from the legacy `databricks-cli` to the new `databricks` CLI?

Three substitutions cover most of it. First, change the installed tool: brew install databricks/tap/databricks (or the official prebuilt binary) replaces pip install databricks-cli. Second, rewrite sub-command names: databricks workspace ls /Users becomes databricks workspace list /Users; databricks fs ls dbfs:/ is unchanged. Third, switch auth from the legacy ~/.databrickscfg PAT-only format to either OAuth U2M (databricks auth login --host <workspace-url>) for laptops or M2M (client_id/client_secret env vars) for CI. The new CLI's bundle group is brand new and has no legacy equivalent. Update every script's databricks ... invocation in a single PR and run a CI smoke test before merging; the new CLI's exit codes are mostly compatible, but a handful of edge cases changed.

Practice on PipeCode

Drill the API integration practice library → for the REST-call, pagination, and retry-shape probes interviewers love.
Warm up on Databricks company problems → for the company-specific SQL + Python + Spark surface.
Rehearse system design drills → for the "design the deploy pipeline" interview question.
Layer the data processing library → for the Jobs-API-driven ingest + transform shapes.
Sharpen the SQL axis with the SQL for data engineering interviews course →.
Stack the Spark internals with the Apache Spark internals course → — every Databricks job runs on Spark.
For broader pipeline craft, work through the ETL system design course →.
For the overall surface, read top data engineering interview questions →.
Stack the prerequisites with the only 5 skills you need to become a data engineer →.

Pipecode.ai is Leetcode for Data Engineering — every API and CLI recipe above ships with hands-on practice rooms where you write the `jobs/run-now` retry, the `bundle deploy` GitHub Actions workflow, and the OAuth M2M service principal cutover against real graded inputs. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so you never have to wonder whether your `databricks bundle deploy --target prod` actually maps to the same idempotent behaviour interviewers expect on the whiteboard.

Practice Databricks problems now →
API integration drills →
