The Databricks API looks like a single product surface on the marketing page. The reality is a layered stack of REST 2.x endpoint groups, a Go CLI binary, a YAML deployment format, and four different authentication modes, and the line between "I know Databricks" and "I can ship Databricks to production" is whether you can drive every one of those layers from a Git repository with no clicks. Click-ops is fine for exploration; a scheduled prod job that nobody can redeploy without opening a browser is a resume-limiting outage waiting to happen.
This guide is the cheat sheet you wished existed the first time a Databricks CLI upgrade replaced the Python tool with a Go binary and broke half your Makefile. It walks through the REST API endpoint map across Jobs 2.x, Clusters 2.0, Repos 2.0, Secrets 2.0, Workspace 2.0, DBSQL 2.0 and Unity Catalog 2.1; the twenty CLI commands worth memorising; the Databricks Asset Bundles CI/CD pattern with GitHub Actions and a staging → prod approval gate; and the auth-pattern matrix for PAT, OAuth U2M, OAuth M2M with a service principal, and notebook-context auth. Each section pairs a teaching block with a Solution-Tail interview answer: code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.
When you want hands-on reps immediately after reading, drill the API integration practice library, warm up on Databricks company problems, and rehearse system design drills for the deploy-pipeline interview.
On this page
- Why the API + CLI matter — the GUI ceiling
- REST API surface — endpoint groups every DE uses
- CLI cheat sheet — the 20 commands worth memorising
- CI/CD with Databricks Asset Bundles
- Authentication patterns — when each one fits
- Cheat sheet — API + CLI recipes
- Frequently asked questions
- Practice on PipeCode
1. Why the API + CLI matter — the GUI ceiling
The Databricks UI is great for exploration and terrible for production — the API + CLI are the only way to ship idempotent, reviewable, Git-backed change
The one-sentence invariant: every production-grade Databricks operation lives behind the REST API; the CLI and Asset Bundles are thin clients on top of the same endpoints; and any workflow you cannot reproduce from a Git checkout is technical debt waiting to page you at 2am. Once you internalise "the GUI is a renderer over the API," the entire conversation about CI/CD, drift detection, and on-call rotations becomes a sequence of REST calls plus YAML.
The four things only the API/CLI give you.
- Bulk operations. Creating 40 clusters or rotating 200 secrets is a `for` loop, not 40 right-click menus. Anything more than three resources is faster scripted than clicked.
- CI/CD and code review. A `databricks.yml` lives in Git, gets diffed in pull requests, and deploys atomically. A clicked job has no Git history, so you cannot diff what changed last Thursday.
- Drift detection. `databricks bundle validate --target prod` checks the declared state before a deploy, and diffing a fresh `databricks jobs get` export against the spec in Git flags anything a human edited in the UI. The GUI alone has no concept of "drift."
- Scheduled rotations. Token rotation, secret rotation, cluster-policy edits, library version bumps: every "every 90 days" task is a cron-friendly CLI call. The UI cannot run on a schedule.
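One way to picture the drift check is a plain comparison of two settings objects. The sketch below is illustrative only: `diff_job_settings` is a hypothetical helper, and the two dictionaries stand in for the spec kept in Git and the `settings` object of a fresh job export.

```python
import json


def diff_job_settings(declared: dict, deployed: dict) -> dict:
    """Return {field: (declared_value, deployed_value)} for top-level fields that differ."""
    drift = {}
    for key in sorted(set(declared) | set(deployed)):
        if declared.get(key) != deployed.get(key):
            drift[key] = (declared.get(key), deployed.get(key))
    return drift


# Hypothetical data: what Git says versus what the workspace currently holds
declared = {"name": "nightly_etl", "max_concurrent_runs": 1, "timeout_seconds": 3600}
deployed = {"name": "nightly_etl", "max_concurrent_runs": 3, "timeout_seconds": 3600}

drift = diff_job_settings(declared, deployed)
print(json.dumps(drift))
```

An empty result means the deployed job still matches Git; anything else is a candidate for either a revert or a pull request that records the change.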
The click-ops debt symptom.
You inherit a prod job whose YAML you cannot find anywhere in Git. The only way to redeploy it is to open the UI, screenshot the config, and rebuild from memory. That is "click-ops debt" — and the cure is to export the current job spec via databricks jobs get <id> --output JSON > resources/jobs/legacy.json and check it into Git the same hour you discover the gap.
API vs CLI vs Terraform vs Asset Bundles — when each tool fits.
- REST API directly (via `curl`, Python `requests`, or `httpx`): when you need a one-off automation outside the CLI's surface area, or when you are writing a deeper library. Higher fidelity, lower ergonomics.
- `databricks` CLI (the Go binary): the default for interactive ops, scripted one-shots, and ad-hoc deploys. Wraps the REST API with a clean `--output JSON | jq` flow.
- Terraform: when you also need to manage cloud infra (S3 buckets, IAM roles, networking) in the same plan. Great for the whole-stack infrastructure layer; overkill for "just deploy a job."
- Databricks Asset Bundles (DAB): the 2025 default for Databricks-native CI/CD. Declarative YAML, atomic deploys, native staging/prod targets, drift detection.
Authentication surface — four modes, one production answer.
- Personal Access Token (PAT) — the legacy mode. Fast to issue, per-user, no automatic rotation. Fine for an ad-hoc human script; never for a service.
- OAuth U2M (user-to-machine) — short-lived browser login for laptops. The default for the modern CLI on a developer machine.
- OAuth M2M (machine-to-machine) with a service principal — the production default. Client ID + client secret, rotated every 90 days, no human in the loop.
- Notebook-context auth: the `dbutils` token inherited inside a running job, scoped to the job's run-as identity. Available only inside a notebook task.
The CLI generational split.
- The 2025 `databricks` Go binary: the only supported CLI. Installed via `brew install databricks/tap/databricks` or the prebuilt binary release. Compatible with `--output JSON`, has top-level groups like `jobs`, `clusters`, `bundle`, `auth`, `workspace`, `secrets`, `fs`, `current-user`.
- The legacy Python `databricks-cli`: deprecated since 2024. Still installable from PyPI, still shows up in old Makefiles. Anything you read on an old StackOverflow answer that says `databricks workspace ls /Users` and not `databricks workspace list /Users` is from the legacy tool. The new CLI uses different sub-command names for some groups; rewrite your scripts.
Versioning in the REST URL.
- Every endpoint is `2.0` or `2.x`. Pin the version in every call: `POST /api/2.1/jobs/create`, not `POST /api/jobs/create`. Otherwise you get whichever version is the current default, which is usually fine and sometimes catastrophic when 2.x is released and renames a field.
- The `bundle deploy` command pins versions for you. Hand-rolled `curl` does not.
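For hand-rolled clients, a small URL builder can make the version pin impossible to forget. This is a minimal sketch; `api_url` is a hypothetical helper, not part of any Databricks SDK, and the version pattern it enforces is an assumption based on the `2.0` / `2.1` / `2.2` versions named in this guide.

```python
import re


def api_url(host: str, version: str, path: str) -> str:
    """Build a REST URL that always carries an explicit API version segment."""
    if not re.fullmatch(r"2\.\d+", version):
        raise ValueError(f"expected a version like '2.1', got {version!r}")
    return f"{host.rstrip('/')}/api/{version}/{path.lstrip('/')}"


url = api_url("https://acme.cloud.databricks.com/", "2.1", "/jobs/create")
print(url)
```

Routing every call through one builder also gives you a single place to grep when a version bump is due.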
What interviewers listen for.
- Do you reach for DAB or Terraform, not raw `curl`, when asked to "deploy this job from CI"? Senior signal.
- Do you mention OAuth M2M with a service principal when asked about prod auth, not PAT? Required answer.
- Do you mention drift detection when asked how you keep prod stable? Senior signal.
- Do you mention `--output JSON` + `jq` when asked to script the CLI? Required answer.
Worked example — the click-ops to GitOps migration
Detailed explanation. A new team inherits a Databricks workspace full of jobs created in the UI over three years. There is no Git repo of definitions. The first migration step is to export every job to JSON via the API, commit the dump to Git, convert the highest-frequency jobs to Asset Bundle YAML, and deploy them back via the CLI — proving the round trip works before retiring the GUI version.
Question. Given an inherited Databricks workspace with 12 clicked jobs, write the API calls and CLI commands you would run to discover, export, version-control, and re-deploy each job from a databricks.yml bundle, with no downtime.
Input.
| job_id | name | trigger | owner_lost? |
|---|---|---|---|
| 101 | nightly_etl | schedule | yes |
| 102 | hourly_ingest | schedule | yes |
| 103 | ad_hoc_reprocess | manual | yes |
| 104 | dq_audit | schedule | yes |
Code.
# 1) Discover every job and capture the spec
databricks jobs list --output JSON | jq -r '.[].job_id' > jobs.txt
mkdir -p exported_jobs
while read -r job_id; do
  # The Go CLI takes the job ID as a positional argument
  databricks jobs get "$job_id" --output JSON > "exported_jobs/${job_id}.json"
done < jobs.txt
# 2) Commit the raw export
git add exported_jobs/ && git commit -m "snapshot inherited Databricks jobs"
# 3) Convert each JSON to a resources/jobs/<name>.yml file under databricks.yml
# (templated by a script; one resource block per job; the output directory is an
# argument because a shell redirect cannot target a directory)
mkdir -p resources/jobs
python3 scripts/json_to_bundle.py exported_jobs/ resources/jobs/
# 4) Validate, bind each bundle resource to its existing job ID, then deploy
databricks bundle validate --target prod
databricks bundle deployment bind nightly_etl 101 --target prod # repeat per job
databricks bundle deploy --target prod
Step-by-step explanation.
- `databricks jobs list --output JSON` returns every job in the workspace as a JSON array. `jq -r '.[].job_id'` extracts just the IDs into a text file for iteration.
- `databricks jobs get <id> --output JSON` dumps the full spec including tasks, schedule, libraries, tags, permissions, and run-as identity. One file per job.
- Committing the raw JSON gives you an audit baseline. Even before the YAML conversion, you have the "what is in prod today" snapshot.
- The `json_to_bundle.py` script translates each Jobs 2.x JSON into the equivalent Asset Bundle YAML stanza under `resources/jobs/`. Most fields map one-to-one; cluster references rewrite from inline cluster specs into shared cluster pools.
- `databricks bundle validate` checks YAML schema and Unity Catalog references without deploying. Catch the typos here.
- `databricks bundle deploy --target prod` updates the existing jobs in place, provided each bundle resource was first bound to its existing job ID with `databricks bundle deployment bind`. A bundle tracks resources through its own deployment state rather than by matching the `name` field, so an unbound resource would be created as a new job. With the binding in place, no new job IDs are created and schedules keep firing without a gap.
Output.
| Step | What was created | Where |
|---|---|---|
| Step 1 | jobs.txt (12 IDs) | local |
| Step 2 | exported_jobs/*.json (12 files) | Git |
| Step 3 | initial commit "snapshot…" | Git history |
| Step 4 | resources/jobs/*.yml (12 files) | Git |
| Step 5 | validation report | local stdout |
| Step 6 | Updated 12 jobs in workspace … | CLI output + workspace |
Rule of thumb. Every migration from click-ops to GitOps starts with jobs list + jobs get and ends with bundle deploy --target prod. Never rewrite a job by hand from a screenshot — the JSON export is canonical and lossless.
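The conversion step in the example above can be sketched in a few lines. This is not the author's `json_to_bundle.py`; it is a hypothetical illustration that assumes each export has the `job_id` and `settings` top-level fields returned by a job lookup, and it stops at a Python mapping (writing YAML would need a library such as PyYAML).

```python
import json
import re


def resource_key(job_name: str) -> str:
    """Turn a job name into a bundle-friendly key (lowercase, underscores)."""
    key = re.sub(r"[^0-9a-zA-Z]+", "_", job_name).strip("_").lower()
    return key or "job"


def to_bundle_resource(exported: dict) -> dict:
    """Wrap the settings of one exported job as a resources.jobs mapping."""
    settings = exported["settings"]
    return {"resources": {"jobs": {resource_key(settings["name"]): settings}}}


# Hypothetical export, trimmed to two fields for readability
exported = {"job_id": 101, "settings": {"name": "Nightly ETL", "max_concurrent_runs": 1}}
bundle = to_bundle_resource(exported)
print(json.dumps(bundle, indent=2))
```

Keeping the `job_id` out of the generated resource is deliberate in this sketch: the ID belongs to the workspace, while the settings belong in Git.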
Worked example — picking between API, CLI, Terraform, and DAB
Detailed explanation. Picking the right tool for a Databricks change is a constant interview probe. The answer is rarely "one tool for everything" — it is "API for fine-grained automation, CLI for ergonomics, Terraform for cross-cloud infra, DAB for Databricks-native CI/CD." Saying so out loud separates seniors from juniors.
Question. For each of these tasks, which Databricks tool would you pick and why: (a) deploy a new job from a PR, (b) rotate 40 secret values nightly, (c) provision an S3 bucket plus a new Databricks workspace, (d) update a job's libraries field on an emergency basis?
Input.
| Task | Frequency | Cross-cloud? | Latency tolerance |
|---|---|---|---|
| (a) Deploy a job from PR | per PR | no | minutes |
| (b) Rotate 40 secrets nightly | nightly | no | minutes |
| (c) Provision bucket + workspace | one-off | yes | hours |
| (d) Patch one job's libraries | rare emergency | no | seconds |
Code.
# (a) Deploy a new job from a PR: Asset Bundle
databricks bundle deploy --target staging
databricks bundle deploy --target prod # after approval
# (b) Rotate 40 secrets nightly: CLI in a cron / scheduled GitHub Action
for scope in prod-keys; do
  # The Go CLI takes scope and key as positional arguments
  for key in $(databricks secrets list-secrets "$scope" --output JSON | jq -r '.[].key'); do
    new=$(./scripts/mint_secret.sh "$key")
    databricks secrets put-secret "$scope" "$key" --string-value "$new"
  done
done
# (c) Provision an S3 bucket + Databricks workspace: Terraform
terraform plan
terraform apply
# (d) Patch one job's libraries field: REST API direct
# HOST already includes the https:// scheme, as in the other examples
curl -X POST "${HOST}/api/2.1/jobs/reset" \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d @new_job.json
Step-by-step explanation.
- (a) is the canonical Asset Bundle case: declarative, diffable, atomic deploy, staging-then-prod with manual approval. Anything CI-pipeline-shaped that lives entirely inside one Databricks workspace is a bundle.
- (b) is a bulk-secret rotation. DAB does not model secrets-as-data well (you do not check secret values into Git), so the CLI in a scheduled action is the right tool. The loop is shell, the lift is the CLI.
- (c) crosses Databricks and the cloud provider (S3 + IAM + a new workspace). Terraform is the one tool that can span both, with one plan and one state. DAB cannot create a workspace; the CLI can only act inside one.
- (d) is a one-shot emergency patch, the kind of "fix one field in prod right now" where the CLI is overkill and Terraform is too slow. Hit the REST API with `curl` against `jobs/reset`, document the change after.
Output.
| Task | Tool | Why |
|---|---|---|
| Deploy job from PR | DAB | declarative, atomic, staging → prod |
| Rotate 40 secrets | CLI | bulk + script-friendly + secrets-not-in-Git |
| Provision workspace + bucket | Terraform | cross-cloud, one state |
| Emergency one-field patch | REST API direct | lowest latency, minimal cognitive cost |
Rule of thumb. Pick the thinnest tool that handles the task. DAB for the 80% recurring deploy path; CLI for ad-hoc and bulk; Terraform when the change leaves the Databricks workspace; raw REST only when you need surgical precision on an undocumented or edge-case field.
Interview question on the click-ops vs GitOps shift
A senior interviewer often frames this as: "Your team inherits a Databricks workspace with no Git repository of job definitions. Walk me through the first month of work to bring it under CI/CD without breaking any running schedule."
Solution Using a discover → snapshot → bundle → deploy migration plan
# Week 1: discover
databricks current-user me # confirm auth + workspace identity
databricks jobs list --output JSON > jobs.json # snapshot every job
databricks clusters list --output JSON > clusters.json
databricks repos list --output JSON > repos.json
# Week 2: snapshot every job spec into Git
mkdir -p exported_jobs
for id in $(jq -r '.[].job_id' jobs.json); do
  # The Go CLI takes the job ID as a positional argument
  databricks jobs get "$id" --output JSON > "exported_jobs/$id.json"
done
git add exported_jobs/ jobs.json clusters.json repos.json
git commit -m "snapshot: workspace inventory"
# Week 3: author databricks.yml + per-job YAML, validate
databricks bundle validate --target staging
# Week 4: deploy to staging, smoke test, then prod with manual approval
# (bind each resource to its existing job once, before the first deploy:
#  databricks bundle deployment bind <resource-key> <job-id> --target <target>)
databricks bundle deploy --target staging
databricks bundle run --target staging assert_row_count # smoke test job
databricks bundle deploy --target prod
Step-by-step trace.
| Week | Action | Reversible? | Risk |
|---|---|---|---|
| 1 | snapshot jobs/clusters/repos | yes (read-only) | none |
| 2 | commit snapshots to Git | yes | none |
| 3 | author + validate bundle | yes (no deploy yet) | none |
| 4a | deploy to staging | yes (rollback by reverting) | low |
| 4b | smoke test in staging | yes | low |
| 4c | deploy to prod | yes (re-deploy old SHA) | medium |
The plan is intentionally bottom-up: every step is reversible, and the highest-risk operation (prod deploy) only happens after a smoke test in staging passes. No schedule is interrupted, because once each bundle resource is bound to its existing job ID the deploy updates that job in place instead of creating a new one.
Output:
| Artifact | Created in | Used for |
|---|---|---|
| exported_jobs/*.json | Week 2 | rollback baseline |
| databricks.yml | Week 3 | declarative source of truth |
| resources/jobs/*.yml | Week 3 | per-job declarative spec |
| staging workspace job updates | Week 4a | smoke testing |
| prod workspace job updates | Week 4c | GitOps in production |
Why this works — concept by concept:
- Read-only discovery first: `jobs list` and `jobs get` mutate nothing, so the snapshot phase carries zero outage risk and builds confidence in the API.
- Snapshot is canonical: the JSON exports are the authoritative "what is in prod today" record before any YAML translation. If the bundle gets the YAML wrong, you can always re-deploy from the JSON via `curl` + `jobs/reset`.
- Bundle validation before deploy: `databricks bundle validate` catches schema errors, missing UC references, and undefined cluster pools without touching the workspace. Fail fast, fail cheap.
- In-place update of existing job IDs: once each bundle resource is bound to its existing job with `databricks bundle deployment bind`, the deploy updates that job in place rather than creating a new one. A bundle tracks resources through its deployment state, not by matching the `name` field. Schedules keep firing across the migration; there is no "the job vanished for ten minutes" window.
- Staging → prod with smoke test: the smoke test is a cheap `assert_row_count` notebook that proves the deploy actually works. Treat staging as a real gate, not a vanity environment.
- Cost: O(jobs) API calls for the snapshot; O(1) deploy from then on. The migration is one engineer-week per ~50 jobs, after which every future change is a PR.
2. REST API surface — endpoint groups every DE uses
The Databricks REST API is eight groups, one Bearer token, and one pagination contract — knowing the eight groups by name lets you script anything
The mental model in one line: every Databricks resource you can manage from the UI has a /api/2.x/<group>/<verb> endpoint, every endpoint takes a Bearer token, and the catalogue of groups is small enough to memorise — Jobs, Clusters, Repos, Secrets, Workspace, DBSQL, Unity Catalog, and Workflows. Once you can name the eight groups and the top three verbs in each, every "how would you script this?" question has an immediate skeleton answer.
The eight groups in one table.
| Group | Version | Top verbs | Typical use case |
|---|---|---|---|
| Jobs | 2.1 / 2.2 | create, reset, run-now, runs/get-output, runs/repair | scheduled and on-demand work |
| Clusters | 2.0 | create, edit, start, restart, terminate, events, libraries | compute lifecycle |
| Repos | 2.0 | create, update, list, delete | Git-backed working copies in the workspace |
| Secrets | 2.0 | scopes/create, put, acls/put, list-scopes | sensitive config + ACLs |
| Workspace | 2.0 | import, export, mkdirs, list | the legacy notebook FS API |
| DBSQL | 2.0 | warehouses, queries, dashboards, alerts | the SQL persona surface |
| Unity Catalog | 2.1 | catalogs, schemas, tables, grants, external-locations, storage-credentials | governance + lineage |
| Workflows | (via Jobs) | task types: notebook, python_wheel, dlt, sql, run_job | the orchestration view of jobs |
Conventions you can rely on.
- Auth. Every call carries `Authorization: Bearer <token>`. The token is a PAT, a U2M access token, an M2M access token, or the notebook context token; all four are interchangeable to the API.
- Rate limits. Hit a quota, get HTTP 429 with a `Retry-After` header. The CLI auto-retries with backoff; hand-rolled clients must do the same.
- Idempotency. Long-running `jobs/run-now` accepts an `idempotency_token` field. Use it on every retry to avoid duplicate runs.
- Pagination. List endpoints return `next_page_token`; pass it back as `page_token` on the next call. Keep going until `next_page_token` is absent.
- Versioning. Pin `2.0`, `2.1`, or `2.2` in every URL. The "current default" can change.
Jobs 2.x in three sentences.
- `POST /api/2.1/jobs/create` accepts a JSON spec of tasks, clusters, libraries, schedule, and access control. The response is a `job_id`, the immortal handle.
- `POST /api/2.1/jobs/reset` replaces the entire job spec atomically (no patch, full overwrite). This is what bundle deploy emits per job.
- `POST /api/2.1/jobs/run-now` triggers an immediate run, optionally with parameters; the response is a `run_id`.
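Because `jobs/reset` is a full overwrite, its request body is just the job ID plus the complete settings object. The sketch below shows one way to build that body from an exported job; `reset_payload` is a hypothetical helper and the sample export is invented for illustration.

```python
import json


def reset_payload(exported: dict) -> dict:
    """Build a jobs/reset request body from one exported job (job_id + settings)."""
    return {"job_id": exported["job_id"], "new_settings": exported["settings"]}


# Hypothetical export, trimmed to two settings fields
exported_job = {
    "job_id": 101,
    "creator_user_name": "someone@example.com",
    "settings": {"name": "nightly_etl", "max_concurrent_runs": 1},
}
payload = reset_payload(exported_job)
print(json.dumps(payload))
```

Read-only fields from the export, such as the creator, are dropped on purpose; only the settings are sent back.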
Clusters 2.0 in three sentences.
- `POST /api/2.0/clusters/create` accepts a cluster spec (node type, runtime, num workers, autoscale) and returns a `cluster_id`. Libraries are installed separately through `/api/2.0/libraries/install`.
- `POST /api/2.0/clusters/edit` mutates the spec in place, which is much faster than create-then-destroy. It is legal on a running or terminated cluster; a running cluster restarts to pick up the change.
- `POST /api/2.0/clusters/events` with the `cluster_id` in the JSON body is the only way to get the cluster's full event log including autoscale events and driver crashes, which is vital for debugging.
Repos 2.0 — the Git-backed primitive.
- `POST /api/2.0/repos` with a `url`, `provider`, and optional `path` creates a Git-backed working copy inside `/Repos/<user>/<name>` or `/Repos/<service-principal>/<name>`.
- `PATCH /api/2.0/repos/<id>` with a `branch` or `tag` field syncs the working copy. This is the "pull latest main into the workspace" call CI uses.
- The Repos API is not the right place for production code: code in `/Repos` is mutable and not idempotent. Use Workspace Files + bundles for prod.
Secrets 2.0 — never log a value.
- Scopes are namespaces; secrets are key-value entries inside a scope; ACLs control which principals can read which scope.
- The put endpoint accepts the value as a string, but `GET /api/2.0/secrets/list` returns only key names and metadata, never values (by design). At runtime, read a value from a notebook via `dbutils.secrets.get(scope, key)`.
Unity Catalog 2.1 — governance is a first-class API.
- Catalogs, schemas, tables, volumes, models, and grants all live behind `/api/2.1/unity-catalog/...`.
- `PATCH /api/2.1/unity-catalog/permissions/<securable_type>/<full_name>` adjusts grants. This is the API that drives "self-service grant requests" tooling.
Common interview probes on the REST surface.
- "Which API version do you target for Jobs?" — 2.1 or 2.2 in 2025–2026. 2.0 is legacy.
- "How do you make `jobs/run-now` retry-safe?" Pass an `idempotency_token`; it deduplicates retries at the server.
- "How do you list every job in a workspace with 5000 jobs?" Paginate via `page_token` + `next_page_token`.
- "Why is `clusters/edit` faster than create-then-delete?" It preserves the cluster ID, the metastore attachments, and the warm-cache plan. Cheaper for users referencing the cluster by ID.
Worked example — listing every job with pagination
Detailed explanation. A workspace with 5000 jobs cannot be listed in one call: the API caps the page size and includes a next_page_token whenever more results remain. This example requests 25 jobs per page. The script must loop until the token is empty. This is the canonical "use pagination correctly" interview probe.
Question. Write a Python script that lists every Databricks job in a workspace using the Jobs 2.1 API, handling pagination correctly.
Input.
| param | value |
|---|---|
| host | https://acme.cloud.databricks.com |
| token | dapi... (a PAT; an OAuth M2M access token is sent the same way) |
| total_jobs | 5000 |
| page_size | 25 (set via `limit`) |
Code.
import os

import httpx

HOST = os.environ["DATABRICKS_HOST"]
TOKEN = os.environ["DATABRICKS_TOKEN"]


def list_all_jobs() -> list[dict]:
    """Collect every job by following next_page_token until it is absent."""
    jobs: list[dict] = []
    page_token: str | None = None
    while True:
        params = {"limit": 25}
        if page_token:
            params["page_token"] = page_token
        r = httpx.get(
            f"{HOST}/api/2.1/jobs/list",
            headers={"Authorization": f"Bearer {TOKEN}"},
            params=params,
            timeout=30.0,
        )
        r.raise_for_status()
        body = r.json()
        jobs.extend(body.get("jobs", []))
        page_token = body.get("next_page_token")
        if not page_token:
            break
    return jobs


if __name__ == "__main__":
    all_jobs = list_all_jobs()
    print(f"Total jobs: {len(all_jobs)}")
Step-by-step explanation.
- The first iteration calls `/api/2.1/jobs/list?limit=25` with no `page_token`. The response contains up to 25 jobs and a `next_page_token` opaque cursor.
- Each subsequent iteration passes the previous `next_page_token` as the `page_token` query param. The API returns the next 25 jobs.
- The loop exits when the response does not contain `next_page_token`; the server is telling you "there is no next page."
- The script accumulates every page into the `jobs` list and returns the full set. For 5000 jobs at 25 per page, that is 200 API calls, which is small.
Output.
| Iteration | Jobs returned | next_page_token? |
|---|---|---|
| 1 | 25 | yes |
| 2 | 25 | yes |
| … | … | … |
| 200 | 25 | no |
| Total | 5000 | — |
Rule of thumb. Every Databricks list endpoint paginates. Write the loop once as a reusable paginate(endpoint, params) helper and you never have to think about it again. Never assume "the workspace is small enough to fit in one page" — the workspace you inherit next quarter will not be.
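A reusable helper along those lines might look like the sketch below. The signature differs slightly from the `paginate(endpoint, params)` named above: to keep it runnable without a workspace, this hypothetical version takes a `fetch` callable (page token in, response body out) and the name of the list field, and the demo uses canned pages instead of HTTP.

```python
from typing import Callable, Iterator, Optional


def paginate(fetch: Callable[[Optional[str]], dict], items_key: str) -> Iterator[dict]:
    """Yield items from every page, following next_page_token until it is absent."""
    page_token: Optional[str] = None
    while True:
        body = fetch(page_token)
        yield from body.get(items_key, [])
        page_token = body.get("next_page_token")
        if not page_token:
            return


# Canned pages standing in for two API responses
_pages = {
    None: {"jobs": [{"job_id": 1}, {"job_id": 2}], "next_page_token": "p2"},
    "p2": {"jobs": [{"job_id": 3}]},
}
job_ids = [job["job_id"] for job in paginate(lambda token: _pages[token], "jobs")]
print(job_ids)
```

In a real client, `fetch` would issue the GET with `page_token` as a query parameter; the generator itself never needs to know which endpoint it is walking.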
Worked example — idempotent run-now with a retry-safe token
Detailed explanation. Triggering a job from CI is straightforward — until the network blip mid-call. Without an idempotency token, retrying the request after a timeout can run the job twice. With one, the server deduplicates: two POSTs with the same token map to the same run_id.
Question. Write a bash snippet that triggers a Databricks job via jobs/run-now, supplies an idempotency_token, and retries on transient errors without risking a duplicate run.
Input.
| field | value |
|---|---|
| job_id | 12345 |
| host | $DATABRICKS_HOST |
| token | $DATABRICKS_TOKEN |
| idempotency_token | string derived from the current Git SHA + date |
Code.
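#!/usr/bin/env bash
set -euo pipefail
JOB_ID=12345
# Stable across retries of the same CI run, distinct across commits and days
IDEMP=$(printf '%s-%s' "$(git rev-parse HEAD)" "$(date -u +%Y%m%d)")
PAYLOAD=$(jq -n --argjson id "$JOB_ID" --arg t "$IDEMP" '{
  job_id: $id,
  idempotency_token: $t,
  notebook_params: {date: "2026-06-04"}
}')
STATUS=""
for attempt in 1 2 3; do
  # If curl itself fails, record a placeholder status of 000
  RESPONSE=$(curl -sS -w '\n%{http_code}' -X POST \
    "$DATABRICKS_HOST/api/2.1/jobs/run-now" \
    -H "Authorization: Bearer $DATABRICKS_TOKEN" \
    -H "Content-Type: application/json" \
    -d "$PAYLOAD") || RESPONSE=$'\n000'
  STATUS=$(printf '%s' "$RESPONSE" | tail -n1)
  if [ "$STATUS" = "200" ]; then break; fi
  echo "Attempt $attempt got HTTP $STATUS, retrying"
  sleep $((2 ** attempt))
done
# Give up instead of parsing an error body as if it were a run
if [ "$STATUS" != "200" ]; then echo "run-now failed after 3 attempts" >&2; exit 1; fi
# sed '$d' drops the trailing status line (portable; head -n-1 is GNU-only)
RUN_ID=$(printf '%s' "$RESPONSE" | sed '$d' | jq -r '.run_id')
echo "Run started: $RUN_ID"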
Step-by-step explanation.
- The idempotency token is derived from the Git SHA + date — stable across retries of the same CI run, distinct across deploys.
- The payload includes `idempotency_token` and `notebook_params`. The server stores the token; if a second request with the same token arrives, the server returns the existing `run_id` instead of starting a new run.
- The loop retries up to 3 times with exponential backoff. Because the token is fixed across attempts, even if the first call partially succeeded (started the run but the response did not reach the client), the second call returns the same `run_id`.
- The final line parses the JSON body. `-w '\n%{http_code}'` appended the status code on its own line, so `tail -n1` reads the status and the last command drops that trailing line before piping the body to `jq`.
Output.
| Scenario | Attempts | Final state |
|---|---|---|
| Clean call | 1 | run started, run_id=R1 |
| Network blip on first attempt | 2 | second attempt returns the same run_id=R1 (deduped) |
| Server 5xx on first two attempts | 3 | third attempt returns the original or a new run_id depending on whether the original landed |
Rule of thumb. Every CI script that calls jobs/run-now should supply an idempotency_token. The cost is a few characters in the payload; the benefit is "the network can fail and the job still runs exactly once."
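The same token scheme can be expressed in Python for CI steps that are not shell scripts. This is a sketch under two stated assumptions: the Jobs API caps `idempotency_token` at 64 characters, and hashing an over-long token is an acceptable way to stay under that cap. The `job_key` parameter is a hypothetical extra so that two jobs triggered from the same commit do not share a token.

```python
import hashlib
from datetime import date

MAX_TOKEN_LENGTH = 64  # assumed Jobs API limit on idempotency_token length


def idempotency_token(git_sha: str, run_date: date, job_key: str = "") -> str:
    """Derive a token that is stable across retries and distinct per commit, day, and job."""
    parts = [git_sha, run_date.strftime("%Y%m%d")]
    if job_key:
        parts.append(job_key)
    raw = "-".join(parts)
    if len(raw) <= MAX_TOKEN_LENGTH:
        return raw
    # Too long: fall back to a SHA-256 hex digest, which is exactly 64 characters
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


short_token = idempotency_token("a" * 40, date(2026, 6, 4))
long_token = idempotency_token("a" * 40, date(2026, 6, 4), "very_long_job_name_for_hourly_ingest")
print(short_token, len(long_token))
```

The function is pure, so a retry that recomputes the token from the same inputs always sends the same value.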
Interview question on REST surface design
A senior interviewer often asks: "Suppose I gave you only the REST API — no CLI, no DAB — and asked you to deploy a job, run it, tail its output, and clean up afterwards. Walk me through every endpoint call you would make."
Solution Using a Jobs 2.1 lifecycle with create → run-now → runs/get → runs/get-output
# 1) Create the job
JOB_ID=$(curl -sS -X POST "$HOST/api/2.1/jobs/create" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d @job_spec.json | jq -r '.job_id')
# 2) Trigger an immediate run
RUN_ID=$(curl -sS -X POST "$HOST/api/2.1/jobs/run-now" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d "{\"job_id\": $JOB_ID, \"idempotency_token\": \"$(uuidgen)\"}" \
| jq -r '.run_id')
# 3) Poll until terminal state
while true; do
STATE=$(curl -sS "$HOST/api/2.1/jobs/runs/get?run_id=$RUN_ID" \
-H "Authorization: Bearer $TOKEN" | jq -r '.state.life_cycle_state')
if [ "$STATE" = "TERMINATED" ] || [ "$STATE" = "INTERNAL_ERROR" ]; then break; fi
sleep 10
done
# 4) Fetch output for the (single-task) run
curl -sS "$HOST/api/2.1/jobs/runs/get-output?run_id=$RUN_ID" \
-H "Authorization: Bearer $TOKEN" | jq '.notebook_output.result'
# 5) Optionally delete the job
curl -sS -X POST "$HOST/api/2.1/jobs/delete" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d "{\"job_id\": $JOB_ID}"
Step-by-step trace.
| Step | Endpoint | Method | Returns |
|---|---|---|---|
| 1 | /api/2.1/jobs/create | POST | {job_id} |
| 2 | /api/2.1/jobs/run-now | POST | {run_id} |
| 3 | /api/2.1/jobs/runs/get | GET | {state: {life_cycle_state}} (polled) |
| 4 | /api/2.1/jobs/runs/get-output | GET | {notebook_output: {result}} |
| 5 | /api/2.1/jobs/delete | POST | {} |
The lifecycle is intentionally CRUD-shaped: create returns a handle, trigger returns a run handle, polling reads state, output read returns the artifact, delete cleans up. The same five-step pattern works for any job, any task type, any cluster.
Output:
| Stage | Identifier | Comment |
|---|---|---|
| create | job_id | persistent handle |
| run-now | run_id | per-execution handle |
| poll | life_cycle_state | e.g. PENDING, RUNNING, TERMINATING, TERMINATED, SKIPPED, INTERNAL_ERROR |
| get-output | notebook_output.result | task output payload |
| delete | (none) | resource gone from workspace |
Why this works — concept by concept:
- Resource handles are immortal: `job_id` survives runs and config edits; `run_id` survives the run regardless of cluster state. Hold the IDs, not the JSON specs.
- Idempotency on triggers: `run-now` accepts a token so a retried POST does not start a duplicate run. Every CI call should send one.
- Polling, not webhooks: Databricks does not push job-completion webhooks by default; you poll `runs/get` every 10–30 seconds. Set a timeout to avoid runaway polling.
- Output is endpoint-specific: `runs/get` returns metadata; `runs/get-output` returns the notebook return value. Two endpoints for the same run, different payloads.
- Delete is permanent: `jobs/delete` removes the job spec but historical `run_id`s and their logs remain queryable. Safer than it looks for hygiene; the audit trail survives.
- Cost: five HTTP calls per lifecycle; the poll loop adds one call every 10s. Negligible compared to the work the job does.
3. CLI cheat sheet — the 20 commands worth memorising
The databricks CLI is six categories of about twenty commands — auth, clusters, jobs, workspace, repos/secrets, bundle — and --output JSON | jq makes every one scriptable
The mental model in one line: the CLI is a Go binary that wraps the REST API with consistent flags (--profile, --output JSON, --json for input), six top-level groups (auth, clusters, jobs, workspace, repos/secrets, bundle), and a default profile in ~/.databrickscfg. Once you memorise the twenty commands below, eighty percent of daily DE work on Databricks is a single CLI invocation.
Install + first run.
# macOS / Linux — the supported install path
brew install databricks/tap/databricks
# or download the prebuilt binary release and add to PATH
# Confirm version (the Go CLI is 0.205 or later; 0.18 and below is the legacy Python tool)
databricks version
# First-time setup: prompts for host + personal access token
databricks configure
Auth — six commands.
databricks configure # legacy PAT prompt; writes ~/.databrickscfg
databricks auth login --host https://acme.cloud.databricks.com # OAuth U2M
databricks auth profiles # list known profiles
databricks auth describe --profile prod # see auth mode of one profile
databricks current-user me # smoke test: who am I?
databricks --profile prod jobs list # use a named profile per call
Clusters — five commands.
databricks clusters list --output JSON
databricks clusters create --json @cluster.json
databricks clusters start <cluster-id>
databricks clusters restart <cluster-id>
databricks clusters events <cluster-id> --output JSON | jq '.events[].type'
Jobs — five commands.
databricks jobs list --output JSON | jq -r '.[].job_id'
# The Go CLI takes job and run IDs as positional arguments
databricks jobs get <id>
databricks jobs run-now --json '{"job_id": <id>, "notebook_params": {"date": "2026-06-04"}}'
databricks jobs get-run <run-id> --output JSON | jq '.state'
databricks jobs repair-run <run-id> --rerun-all-failed-tasks
Workspace + FS — four commands.
databricks workspace list /Users/me/proj
databricks workspace import-dir ./src /Users/me/proj
databricks workspace export-dir /Users/me/proj ./backup
databricks fs ls dbfs:/databricks-datasets/samples/
Repos + Secrets — five commands.
# The Go CLI takes URL, provider, scope, and key as positional arguments
databricks repos create https://github.com/me/repo gitHub
databricks repos update <repo-id> --branch main
databricks secrets create-scope prod-keys
databricks secrets put-secret prod-keys snowflake_pwd --string-value "$VALUE"
databricks secrets list-secrets prod-keys
Asset Bundles — four commands.
databricks bundle validate # schema + UC ref check, no deploy
databricks bundle deploy --target prod # upload code + create/update resources
databricks bundle run --target prod my_job # trigger a deployed job
databricks bundle destroy --target dev # remove everything the bundle owns
Two universal flags.
- --profile <name>: selects a named section in ~/.databrickscfg. Lets you keep dev / staging / prod profiles side by side and never type a host URL twice.
- --output JSON: switches the default human-friendly table format to JSON, suitable for | jq. Every list / get / run command supports it.
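As a rough illustration of what --profile selects, the sketch below writes a throwaway INI file with three sections and lists the section names. The file path is a temp file chosen for the example and the helper names are hypothetical; the real CLI reads ~/.databrickscfg, and databricks auth profiles remains the supported way to list profiles.

```shell
#!/usr/bin/env bash
# Sketch only: writes a sample config, it does not touch ~/.databrickscfg.
write_sample_cfg() {
  # $1 = path of the sample config file to create
  cat > "$1" <<'EOF'
[dev]
host = https://dev.cloud.databricks.com

[staging]
host = https://staging.cloud.databricks.com

[prod]
host = https://prod.cloud.databricks.com
EOF
}

# Print each [section] name in an INI-style config, one per line.
list_profiles() {
  sed -n 's/^\[\(.*\)\]$/\1/p' "$1"
}

# Example:
#   cfg=$(mktemp); write_sample_cfg "$cfg"; list_profiles "$cfg"
```

Each name printed is a value you could pass to --profile, which is why one file can hold all three environments side by side.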
Three non-obvious habits.
- Pipe everything through jq. The CLI's JSON output is stable; the table output is for humans. CI should always read JSON.
- Use --json @file.json for create-style commands. The JSON file checks into Git; the spec is reviewable; the deploy is reproducible.
- The CLI does its own retries on transient 5xx, so wrap your scripts with explicit error handling only for client-side errors (4xx).
Common interview probes.
- "How do you switch between dev and prod Databricks workspaces in the CLI?" Use --profile dev vs --profile prod, with the profiles stored in ~/.databrickscfg.
- "How do you script the CLI to feed downstream tools?" Use --output JSON | jq on every read and --json @spec.json on every write.
- "What is the difference between the new databricks binary and the old databricks-cli?" The new one is a Go binary, supported, and ships with bundle and auth login; the old one is Python, deprecated, and has different sub-command names.
- "How do you authenticate the CLI in GitHub Actions?" Install the binary, then export DATABRICKS_HOST + DATABRICKS_CLIENT_ID + DATABRICKS_CLIENT_SECRET for OAuth M2M (preferred), or DATABRICKS_HOST + DATABRICKS_TOKEN for a PAT. databricks auth login is the interactive browser flow, so it does not fit a CI runner.
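A CI step can fail fast when the credentials are incomplete instead of letting the first CLI call produce a confusing error. The sketch below is one way to do that; detect_ci_auth is a hypothetical helper name, and it only inspects the environment variables named in the answer above.

```shell
#!/usr/bin/env bash
# Report which credential set the environment provides.
# Prints "oauth-m2m" or "pat"; returns 1 when the host or credentials are missing.
detect_ci_auth() {
  if [ -z "${DATABRICKS_HOST:-}" ]; then
    echo "DATABRICKS_HOST is not set" >&2
    return 1
  fi
  if [ -n "${DATABRICKS_CLIENT_ID:-}" ] && [ -n "${DATABRICKS_CLIENT_SECRET:-}" ]; then
    echo "oauth-m2m"
  elif [ -n "${DATABRICKS_TOKEN:-}" ]; then
    echo "pat"
  else
    echo "no Databricks credentials found in the environment" >&2
    return 1
  fi
}

# Example CI usage:
#   mode=$(detect_ci_auth) || exit 1
#   echo "authenticating with $mode"
```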
Worked example — scripting "list all jobs and find the slowest ones"
Detailed explanation. A common SRE-style ask is "rank the slowest jobs over the last 24 hours so we know what to optimise." The CLI plus jq plus a little shell is more than enough — no Python required.
Question. Write a one-shot shell pipeline that lists every job, fetches the most recent run of each, and prints the top 10 jobs by total runtime over the last 24 hours.
Input.
| job_id | name | last_run_id | duration_ms (last_run) |
|---|---|---|---|
| 101 | nightly_etl | r1 | 4 200 000 |
| 102 | hourly_ingest | r2 | 380 000 |
| 103 | dq_audit | r3 | 1 200 000 |
| 104 | reprocess | r4 | 9 800 000 |
Code.
#!/usr/bin/env bash
set -euo pipefail
# 1) Get every job_id
JOB_IDS=$(databricks jobs list --output JSON --profile prod | jq -r '.[].job_id')
# 2) For each, fetch the most recent run and join the duration
{
  for id in $JOB_IDS; do
    # job_id is a positional argument in the Go CLI
    NAME=$(databricks jobs get "$id" --output JSON --profile prod \
      | jq -r '.settings.name')
    # list commands emit a JSON array, so the newest run is element 0
    LAST=$(databricks jobs list-runs --job-id "$id" --limit 1 --output JSON --profile prod \
      | jq -r '.[0].execution_duration // 0')
    printf '%s\t%s\t%s\n' "$id" "$LAST" "$NAME"
  done
} | sort -k2 -n -r | head -n 10 | column -t -s$'\t'
Step-by-step explanation.
- jobs list --output JSON | jq -r '.[].job_id' extracts every job_id in the workspace as plain text, one per line.
- The loop iterates every ID. For each, two calls: jobs get to read the human-friendly name, jobs list-runs --limit 1 to read the most recent run's execution_duration (in milliseconds).
- The // 0 default in jq handles "no runs yet"; those jobs sort to the bottom.
- The tab-separated lines feed sort -k2 -n -r (numerical, descending by column 2), then head -n 10 keeps the top ten, and column -t -s$'\t' aligns columns for human reading.
- Total API calls = 1 + 2 × number_of_jobs (one list, then one get and one list-runs per job). For a workspace with 200 jobs that is about 400 calls, well under the rate limit.
Output.
| Rank | job_id | duration (ms) | name |
|---|---|---|---|
| 1 | 104 | 9 800 000 | reprocess |
| 2 | 101 | 4 200 000 | nightly_etl |
| 3 | 103 | 1 200 000 | dq_audit |
| 4 | 102 | 380 000 | hourly_ingest |
Rule of thumb. Anything that "could be a one-off Python script" can usually be the CLI + jq + sort and run in 20 lines of bash. Reach for Python only when you need libraries (pandas, pydantic, retries with policy), or when the shell quoting becomes a maintenance hazard.
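The ranking stage of that pipeline can be exercised without a workspace. The sketch below feeds rows shaped like the Input table through sort and head; rank_slowest and sample_rows are hypothetical helper names, and the key is restricted to column 2 with an explicit tab separator so job names that contain spaces cannot affect the ordering.

```shell
#!/usr/bin/env bash
# Rank tab-separated "job_id<TAB>duration_ms<TAB>name" rows read from stdin,
# slowest first, keeping the top N (default 10).
rank_slowest() {
  local top_n="${1:-10}"
  sort -t "$(printf '\t')" -k2,2nr | head -n "$top_n"
}

# Rows mirroring the Input table of the worked example.
sample_rows() {
  printf '%s\t%s\t%s\n' \
    101 4200000 nightly_etl \
    102 380000 hourly_ingest \
    103 1200000 dq_audit \
    104 9800000 reprocess
}

# Example:
#   sample_rows | rank_slowest 10
```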
Worked example — creating a cluster from a JSON spec
Detailed explanation. Hand-typing clusters create arguments is fragile. The robust pattern is to keep cluster.json in Git, edit it as a code review, and deploy via --json @cluster.json. The spec is reviewable; the deploy is reproducible; the cluster is one CLI call away.
Question. Given a checked-in cluster.json describing a single-node 14.3.x LTS cluster, create it with the CLI and print the new cluster ID.
Input — cluster.json.
{"cluster_name":"dq-audit-cluster","spark_version":"14.3.x-scala2.12","node_type_id":"i3.xlarge","num_workers":0,"autotermination_minutes":30,"spark_conf":{"spark.databricks.cluster.profile":"singleNode","spark.master":"local[*]"},"custom_tags":{"ResourceClass":"SingleNode"}}
Code.
CLUSTER_ID=$(databricks clusters create \
  --json @cluster.json \
  --profile prod \
  --output JSON \
  | jq -r '.cluster_id')
echo "Created $CLUSTER_ID"
Step-by-step explanation.
- --json @cluster.json reads the spec from disk and POSTs it to /api/2.0/clusters/create. The @ prefix tells the CLI "treat the next argument as a path, not a literal."
- --profile prod selects the prod section in ~/.databrickscfg so the call lands on the right workspace without an env-var dance.
- --output JSON | jq -r '.cluster_id' extracts the new cluster_id from the response.
- The same cluster.json is the source of truth. The next time you edit the cluster spec, change the file, commit, and re-run, or update the existing cluster in place with databricks clusters edit and the same spec (the edit call also needs the cluster_id).
Output.
| Field | Value |
|---|---|
| cluster_id | 0604-093425-abcd1234 |
| state (after create) | PENDING → RUNNING |
| billable | from RUNNING transition |
Rule of thumb. Keep every cluster spec in a JSON file under clusters/. Never invoke clusters create with inline flags in production — the audit trail is the JSON file's Git history.
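Scripts that create a cluster usually need to wait for the PENDING → RUNNING transition before submitting work. A minimal polling sketch follows, assuming a state-reading command is supplied by the caller; wait_for_state and cluster_state are hypothetical helpers, and the attempt count and sleep interval are placeholders to tune.

```shell
#!/usr/bin/env bash
# Poll a state-producing command until it prints the wanted state.
# Usage: wait_for_state WANTED MAX_ATTEMPTS SLEEP_SECONDS CMD [ARGS...]
# CMD must print the current state (for example RUNNING) on stdout.
wait_for_state() {
  local wanted="$1" max_attempts="$2" sleep_s="$3"
  shift 3
  local attempt state
  for ((attempt = 1; attempt <= max_attempts; attempt++)); do
    state="$("$@")" || return 2   # the state command itself failed
    if [ "$state" = "$wanted" ]; then
      return 0
    fi
    case "$state" in
      TERMINATED|ERROR) echo "cluster entered $state" >&2; return 1 ;;
    esac
    sleep "$sleep_s"
  done
  echo "timed out waiting for $wanted (last state: $state)" >&2
  return 1
}

# Hypothetical state reader built from commands used in this guide.
cluster_state() {
  databricks clusters get "$1" --output JSON | jq -r '.state'
}

# Example:
#   wait_for_state RUNNING 60 10 cluster_state "$CLUSTER_ID"
```

Passing the state reader as an argument keeps the loop testable with a stub and reusable for job runs or pipelines.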
Interview question on CLI-driven automation
The interviewer often asks: "How would you list every Databricks job in a workspace, then disable any that have not run in 30 days, from a one-shot script with no UI clicks?"
Solution Using jobs list, jobs list-runs, and jobs update with --output JSON | jq
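#!/usr/bin/env bash
set -euo pipefail
# 30 days ago in epoch ms: GNU date first, BSD/macOS date as the fallback
CUTOFF_S=$(date -u -d '30 days ago' +%s 2>/dev/null || date -u -v-30d +%s)
THIRTY_DAYS_AGO_MS=$((CUTOFF_S * 1000))
databricks jobs list --output JSON --profile prod \
  | jq -r '.[].job_id' \
  | while read -r JOB_ID; do
      # list-runs emits a JSON array; element 0 is the newest run (0 = never ran)
      LAST_RUN_MS=$(databricks jobs list-runs \
        --job-id "$JOB_ID" --limit 1 --output JSON --profile prod \
        | jq -r '.[0].start_time // 0')
      if [ "$LAST_RUN_MS" -lt "$THIRTY_DAYS_AGO_MS" ]; then
        # jobs update replaces top-level fields, so resend the whole schedule
        # with only pause_status changed; jobs without a schedule are skipped
        PATCH=$(databricks jobs get "$JOB_ID" --output JSON --profile prod \
          | jq -c --argjson id "$JOB_ID" 'select(.settings.schedule != null)
              | {job_id: $id, new_settings: {schedule: (.settings.schedule + {pause_status: "PAUSED"})}}')
        [ -n "$PATCH" ] || continue
        echo "Pausing job $JOB_ID (last run = $LAST_RUN_MS)"
        databricks jobs update --json "$PATCH" --profile prod
      fi
    done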
Step-by-step trace.
| job_id | last_run_ms | < cutoff? | action |
|---|---|---|---|
| 101 | yesterday | no | no-op |
| 102 | 45 days ago | yes | pause |
| 103 | never (0) | yes | pause |
| 104 | 5 days ago | no | no-op |
The script reads the workspace, computes the cutoff as "now minus 30 days in milliseconds," compares each job's most recent run timestamp against the cutoff, and pauses the laggards in place — without deleting them.
Output:
| Job_ID | Final state | Why |
|---|---|---|
| 101 | UNPAUSED | recent activity |
| 102 | PAUSED | inactive 45d |
| 103 | PAUSED | never ran |
| 104 | UNPAUSED | recent activity |
Why this works — concept by concept:
- --output JSON everywhere. The table output is for humans; JSON is the contract for scripts. Never grep table output.
- jq does the data manipulation. No Python is needed for "extract this field" / "filter this list." jq is the lingua franca of REST-API automation.
- jobs update is non-destructive. Pausing is reversible: re-run with pause_status: UNPAUSED and the schedule resumes. Compare with jobs delete, which is irreversible.
- Profile flag. --profile prod keeps the script portable: drop it on any laptop with that profile in ~/.databrickscfg and it runs unchanged. No env-var hygiene required.
- Cost. One jobs list plus a handful of calls per job (list-runs, then the update path only for stale jobs). For 200 jobs that is at most ~600 calls, well under the rate limit, so no backoff is needed.
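The cutoff comparison is the part of the script most worth testing on its own, since an off-by-a-factor-of-1000 mistake would pause every job. A small sketch, assuming epoch-millisecond timestamps as in the Jobs API start_time field; is_stale is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Return 0 (stale) when last_run_ms is older than now_ms minus max_age_days.
# All timestamps are epoch milliseconds; a last_run_ms of 0 means "never ran".
is_stale() {
  local last_run_ms="$1" now_ms="$2" max_age_days="${3:-30}"
  local cutoff_ms=$(( now_ms - max_age_days * 24 * 60 * 60 * 1000 ))
  [ "$last_run_ms" -lt "$cutoff_ms" ]
}

# Example:
#   now_ms=$(( $(date -u +%s) * 1000 ))
#   if is_stale "$LAST_RUN_MS" "$now_ms" 30; then echo "would pause"; fi
```

Running the loop once with an echo in place of the update call gives a dry-run list to review before anything is paused.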
4. CI/CD with Databricks Asset Bundles
Databricks Asset Bundles are the YAML deployment format for the Jobs / Pipelines / Clusters / Permissions surface — one bundle, three targets, one approval gate, one auditable promotion path
The mental model in one line: a Databricks Asset Bundle is a databricks.yml plus a resources/ directory describing every job, pipeline, cluster, permission, and dashboard you want deployed; bundle validate runs schema + UC reference checks; bundle deploy --target <env> uploads the code and creates or updates resources atomically; and the GitHub Actions pattern is validate-on-PR, deploy-to-staging-on-merge, manual-approval-then-prod. Once you internalise that one YAML drives three workspaces, every CI/CD interview question reduces to "show me the bundle and the workflow file."
The DAB mental model.
- Root file. databricks.yml declares bundle.name, the list of include: paths to per-resource YAML files, and a targets: block defining each environment (dev / staging / prod) with its host and run-as identity.
- Resources directory. Each file under resources/ is a typed declaration: resources/jobs/my_job.yml, resources/pipelines/dlt.yml, resources/clusters/shared.yml. Strict schema, validated locally.
- Variables. variables: declares parameterised inputs such as catalog name, schema name, and warehouse ID. Overridden per target.
- Atomic deploy. databricks bundle deploy --target prod performs every resource update in one logical transaction, with name-based identity preservation (jobs keep their job_ids across deploys).
A minimal databricks.yml.
bundle:
  name: pipecode-dab-demo
include:
  - resources/jobs/*.yml
  - resources/pipelines/*.yml
variables:
  catalog:
    description: "Unity Catalog name"
    default: "dev_catalog"
targets:
  dev:
    workspace:
      host: https://dev.cloud.databricks.com
    variables:
      catalog: dev_catalog
  staging:
    workspace:
      host: https://staging.cloud.databricks.com
    variables:
      catalog: staging_catalog
  prod:
    mode: production
    workspace:
      host: https://prod.cloud.databricks.com
      root_path: /Shared/.bundle/prod/${bundle.name}
    variables:
      catalog: prod_catalog
    run_as:
      service_principal_name: 11111111-2222-3333-4444-555555555555
A resources/jobs/my_job.yml.
resources:
  jobs:
    nightly_etl:
      name: "nightly_etl_${bundle.target}"
      schedule:
        quartz_cron_expression: "0 0 2 * * ?"
        timezone_id: "UTC"
        pause_status: "UNPAUSED"
      tasks:
        - task_key: ingest
          notebook_task:
            # relative to this YAML file (resources/jobs/), not the bundle root
            notebook_path: ../../src/notebooks/ingest.py
            base_parameters:
              catalog: ${var.catalog}
          new_cluster:
            spark_version: "14.3.x-scala2.12"
            node_type_id: "i3.xlarge"
            num_workers: 2
The three guardrails.
- bundle validate: schema check + Unity Catalog reference check. Fails locally before any deploy.
- Smoke test: a tiny assert_row_count notebook that reads from the freshly-deployed pipeline and confirms non-zero rows. Pass before promoting from staging to prod.
- Drift detection: bundle validate --target prod compared to the deployed state highlights any field a human edited via the UI. Treat any non-empty drift output as an alert.
The GitHub Actions pattern.
- On pull_request: bundle validate --target staging. Catches typos before merge.
- On push to main: bundle deploy --target staging, then run smoke test.
- On workflow_dispatch (manual): bundle deploy --target prod after a human approval click in the GitHub Environments protection rule.
Auth in the CI step.
- The CI runner needs an M2M service principal with workspace admin (or scoped permissions). Set DATABRICKS_HOST, DATABRICKS_CLIENT_ID, DATABRICKS_CLIENT_SECRET as GitHub Actions secrets. The CLI reads them automatically.
- Never use a personal PAT in a CI runner. A PAT scopes to a user; if that user leaves the company, the pipeline breaks. A service principal scopes to a role, which outlives any individual.
Common interview probes.
- "How do you make a staging-vs-prod parameter (like catalog name) flow through one bundle?" Declare variables: in databricks.yml, override per target, reference as ${var.name} in resource YAML.
- "How do you guarantee a prod deploy does not start until staging passed a smoke test?" Use separate workflows or stages: the deploy-to-staging job runs bundle run smoke_test, and only on success does the promote-to-prod job run (with environment protection rules).
- "What happens if someone edits a bundle-managed job in the UI?" The next bundle validate --target prod reports drift; the next bundle deploy --target prod overwrites the manual edit. The bundle is the source of truth.
- "How do you roll back a bad bundle deploy?" Re-run bundle deploy from the previous Git SHA. Because deploys are atomic and update existing resource IDs in place, rollback is a re-deploy of the older commit.
Worked example — the minimal validate-on-PR GitHub Actions workflow
Detailed explanation. The cheapest possible CI step is bundle validate on every PR. Even before any deploy automation, this catches 80% of typos: missing fields, wrong cluster references, undefined variables, broken UC references.
Question. Write a GitHub Actions workflow that runs databricks bundle validate --target staging on every PR, authenticated via an M2M service principal stored in GitHub Secrets.
Input.
| GitHub secret | maps to env var |
|---|---|
| DATABRICKS_HOST | DATABRICKS_HOST |
| DATABRICKS_CLIENT_ID | DATABRICKS_CLIENT_ID |
| DATABRICKS_CLIENT_SECRET | DATABRICKS_CLIENT_SECRET |
Code.
name: bundle-validate
on:
  pull_request:
    paths:
      - "databricks.yml"
      - "resources/**"
      - "src/**"
      - ".github/workflows/bundle-validate.yml"
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Databricks CLI
        run: |
          curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
      - name: Validate bundle
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_CLIENT_ID: ${{ secrets.DATABRICKS_CLIENT_ID }}
          DATABRICKS_CLIENT_SECRET: ${{ secrets.DATABRICKS_CLIENT_SECRET }}
        run: |
          databricks bundle validate --target staging
Step-by-step explanation.
- The on: pull_request trigger plus the paths: filter ensures the job only runs when bundle-relevant files change. Saves CI minutes.
- actions/checkout@v4 makes the repo available; the bundle CLI needs to read databricks.yml + resources/**.
- The install script drops a Linux build of the databricks Go binary into /usr/local/bin. Two seconds.
- The validate step receives three env vars. The CLI sees DATABRICKS_CLIENT_ID + DATABRICKS_CLIENT_SECRET and automatically performs an OAuth M2M token exchange: no PAT, no human in the loop.
- bundle validate --target staging resolves variables to staging values, runs schema validation, dereferences UC catalog/schema/table names, and returns non-zero if anything is wrong. The PR fails the check on any issue.
Output.
| PR scenario | validate exit | PR status |
|---|---|---|
| YAML syntax OK + UC refs OK | 0 | green |
| typo in node_type_id | non-zero | red |
| missing variable | non-zero | red |
| undefined UC catalog | non-zero | red |
Rule of thumb. Add bundle validate as your first GitHub Actions step on day one of bundle adoption. It is the cheapest possible safety net and catches a wide class of bugs before any deploy ever happens.
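The same path filter can be approximated locally, for example in a pre-push hook, so that validate runs before the PR is even opened. The sketch below is an assumption-laden mirror of the workflow's paths: list, not something GitHub provides; bundle_files_changed is a hypothetical helper and the git diff range in the usage comment is only an example.

```shell
#!/usr/bin/env bash
# Read changed file paths on stdin; succeed if any would match the workflow's
# paths: filter (databricks.yml, resources/**, src/**, the workflow file itself).
bundle_files_changed() {
  local path
  while IFS= read -r path; do
    case "$path" in
      databricks.yml|resources/*|src/*|.github/workflows/bundle-validate.yml)
        return 0 ;;
    esac
  done
  return 1
}

# Example pre-push usage:
#   if git diff --name-only origin/main...HEAD | bundle_files_changed; then
#     databricks bundle validate --target staging
#   fi
```

In a shell case pattern the * also matches slashes, so resources/* covers nested files the way resources/** does in the workflow.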
Worked example — the staging-then-prod promote workflow
Detailed explanation. Once validate works, the second workflow is "deploy to staging on merge, run smoke test, then optionally promote to prod after a human approval." This is the canonical Databricks CI/CD shape.
Question. Write a GitHub Actions workflow that on merge to main deploys to staging, runs a smoke-test job, and gates a separate prod-deploy job behind a manual approval.
Input.
| Trigger | Action |
|---|---|
| push: main | deploy + smoke staging |
| workflow_dispatch (with approval) | deploy prod |
Code.
name: bundle-deploy
on:
  push:
    branches: [main]
  workflow_dispatch:
jobs:
  staging:
    runs-on: ubuntu-latest
    if: github.event_name == 'push'
    steps:
      - uses: actions/checkout@v4
      - run: curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
      - env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_CLIENT_ID: ${{ secrets.DATABRICKS_CLIENT_ID }}
          DATABRICKS_CLIENT_SECRET: ${{ secrets.DATABRICKS_CLIENT_SECRET }}
        run: |
          databricks bundle deploy --target staging
          databricks bundle run --target staging assert_row_count
  prod:
    runs-on: ubuntu-latest
    # No "needs: staging" here: staging is skipped on workflow_dispatch, and
    # GitHub skips any job whose needed job was skipped.
    if: github.event_name == 'workflow_dispatch'
    environment: production   # GitHub Environment with required reviewers
    steps:
      - uses: actions/checkout@v4
      - run: curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
      - env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_PROD_HOST }}
          DATABRICKS_CLIENT_ID: ${{ secrets.DATABRICKS_PROD_CLIENT_ID }}
          DATABRICKS_CLIENT_SECRET: ${{ secrets.DATABRICKS_PROD_CLIENT_SECRET }}
        run: |
          databricks bundle deploy --target prod
Step-by-step explanation.
- On push: main, the staging job runs. It deploys the bundle to the staging workspace, then runs an assert_row_count job in staging as a smoke test. If the smoke test fails, the workflow fails and prod is never touched.
- The prod job is gated by environment: production. GitHub Environments support "required reviewers": a configured human must click "Approve" before the job starts. That click is the audit trail.
- The two jobs use different secrets for staging vs prod. Two service principals, two scopes: staging credentials cannot mint a prod deploy even if they leak.
- The two jobs are selected by event, not chained with needs:. On a workflow_dispatch run the staging job is skipped by its if: filter, and a prod job that declared needs: staging would be skipped along with it. The ordering guarantee comes from process instead: the operator dispatches prod only after the push-triggered staging run is green.
Output.
| Event | Stage outcome |
|---|---|
| PR opened | (handled by validate workflow) |
| Merge to main, smoke green | staging deployed; prod NOT deployed (awaiting dispatch) |
| Operator runs workflow_dispatch + approves | prod deployed |
| Smoke test fails in staging | workflow fails; alarms fire; nobody clicks approve |
Rule of thumb. Treat the staging deploy + smoke as a gate, not a vanity step. If the smoke test ever passes when prod would have broken, fix the smoke test — it is your last line of defence before a manual prod approval.
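The deploy-then-smoke gate can also be wrapped in one function so that a laptop run and the CI step behave the same way. A minimal sketch follows, using only the bundle deploy and bundle run commands shown in the workflow; deploy_and_smoke is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Deploy a target, then run its smoke-test job; stop at the first failure.
# $1 = bundle target, $2 = resource key of the smoke-test job.
deploy_and_smoke() {
  local target="$1" smoke_job="$2"
  databricks bundle deploy --target "$target" || {
    echo "deploy to $target failed; smoke test skipped" >&2
    return 1
  }
  databricks bundle run --target "$target" "$smoke_job" || {
    echo "smoke test $smoke_job failed on $target; do not promote" >&2
    return 1
  }
  echo "$target is deployed and smoke-tested"
}

# Example:
#   deploy_and_smoke staging assert_row_count
```

A non-zero return from either step fails the CI job, which is what keeps a red staging run visible to whoever is about to approve prod.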
Interview question on DAB drift detection
A senior interviewer often probes: "Your team adopted DAB six months ago, but engineers still occasionally edit jobs in the UI for hot-fixes. How do you detect and reconcile that drift automatically?"
Solution Using a scheduled bundle validate --target prod job that diffs against deployed state
name: drift-detect
on:
  schedule:
    - cron: "0 9 * * MON"   # every Monday 09:00 UTC
  workflow_dispatch:
permissions:
  contents: read
  issues: write             # lets GITHUB_TOKEN open the drift issue
jobs:
  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
      - id: validate
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_PROD_HOST }}
          DATABRICKS_CLIENT_ID: ${{ secrets.DATABRICKS_PROD_CLIENT_ID }}
          DATABRICKS_CLIENT_SECRET: ${{ secrets.DATABRICKS_PROD_CLIENT_SECRET }}
        run: |
          set +e
          OUTPUT=$(databricks bundle validate --target prod 2>&1)
          EXIT_CODE=$?
          echo "$OUTPUT" > drift.txt
          echo "drifted=$([ $EXIT_CODE -ne 0 ] && echo true || echo false)" >> "$GITHUB_OUTPUT"
      - if: steps.validate.outputs.drifted == 'true'
        run: |
          gh issue create \
            --title "DAB drift detected on prod ($(date -u +%F))" \
            --body-file drift.txt \
            --label drift
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
Step-by-step trace.
| Day | bundle vs deployed | validate exit | action |
|---|---|---|---|
| Mon | identical | 0 | no-op |
| Tue | (someone edits a job in UI) | n/a | (no cron yet) |
| Mon next week | diff: schedule changed | non-zero | open GitHub issue, page on-call |
| (engineer reviews) | either commit the UI edit to YAML, or bundle deploy to reset | n/a | issue closed |
The job runs weekly, compares the declared YAML to the live workspace, and opens an issue when drift appears. Engineers then either adopt the manual edit (commit it to YAML and merge) or reject it (re-deploy the bundle to restore the declared state).
Output:
| Drift scenario | Issue title | Issue body |
|---|---|---|
| no drift | (none — issue not created) | — |
| job schedule edited | "DAB drift detected on prod (2026-06-08)" | full validate output diff |
| cluster libraries added in UI | "DAB drift detected on prod (2026-06-15)" | full validate output diff |
Why this works — concept by concept:
- bundle validate is the check this workflow keys on. It validates the YAML schema and the resolved prod configuration and exits non-zero on errors. Before treating a zero exit as "no drift," confirm on your CLI version that a UI edit to a bundle-managed resource actually changes its output.
- Scheduled, not push-triggered. Drift is a state, not an event. A weekly cron catches it without depending on someone making a code change. Run nightly if your team allows.
- Service principal scoped to prod — the drift workflow's M2M token only has read access on prod resources. Read-only credentials cannot be weaponised even if leaked.
- GitHub issue, not silent fix — auto-reconciling drift by re-deploying would clobber a legitimate hot-fix. An issue forces a human review and a decision (adopt vs reject).
- Adopt-or-reject loop — engineers either commit the UI edit to YAML (turning the manual fix into declarative code) or re-deploy the bundle to undo the edit. Either way the bundle becomes the source of truth again.
- Cost — one CI minute per week + one read-only API scan. Negligible compared to the cost of a silent click-ops outage.
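The exit-code handling inside the workflow step is easy to get wrong under the runner's default fail-fast shell, so it can help to isolate it. The sketch below runs any check command, keeps its output as the issue body, and records a drifted flag; record_drift is a hypothetical helper, and it falls back to stdout when GITHUB_OUTPUT is not set (for example on a laptop).

```shell
#!/usr/bin/env bash
# Run a check command, save its combined output to a report file, and append
# "drifted=true|false" to the file named by GITHUB_OUTPUT (stdout if unset).
# Usage: record_drift REPORT_FILE CMD [ARGS...]
record_drift() {
  local report="$1"
  shift
  local status=0
  "$@" > "$report" 2>&1 || status=$?
  local drifted=false
  if [ "$status" -ne 0 ]; then
    drifted=true
  fi
  echo "drifted=$drifted" >> "${GITHUB_OUTPUT:-/dev/stdout}"
}

# Example:
#   record_drift drift.txt databricks bundle validate --target prod
```

The function itself always returns success, so the step stays green and the follow-up step decides what to do with the flag.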
5. Authentication patterns — when each one fits
Databricks ships four authentication modes — PAT, OAuth U2M, OAuth M2M (service principal), and notebook context — and the production answer is OAuth M2M
The mental model in one line: PAT is fast and per-user; OAuth U2M is browser-login for laptops; OAuth M2M with a service principal is the answer for any scheduled or shared automation; notebook-context auth is the implicit token a job inherits inside its own notebook. Once you can name when each fits, the entire "how do I authenticate this script?" question becomes a four-way decision tree.
The four modes in one table.
| Mode | Use case | Lifetime | Rotation | Scope |
|---|---|---|---|---|
| PAT | ad-hoc human, quick script | up to 90 days | manual | user identity |
| OAuth U2M | laptop CLI | 1h access / 90d refresh | browser refresh | user identity |
| OAuth M2M (SP) | CI/CD, scheduled, shared automation | 1h access | client_secret rotation every 90d | service principal |
| Notebook context | inside a running job | per-run | inherited from job run-as | run-as identity |
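Read as a decision tree, the table reduces to a single lookup on where the code runs. The sketch below encodes that lookup; the context labels (ci, scheduled, shared, laptop, adhoc, notebook) are made up for illustration and are not CLI values.

```shell
#!/usr/bin/env bash
# Map a caller context to the auth mode the table above recommends.
pick_auth_mode() {
  case "$1" in
    ci|scheduled|shared) echo "oauth-m2m" ;;
    laptop)              echo "oauth-u2m" ;;
    adhoc)               echo "pat" ;;
    notebook)            echo "notebook-context" ;;
    *) echo "unknown context: $1" >&2; return 1 ;;
  esac
}

# Example:
#   pick_auth_mode ci
```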
PAT in detail.
- How to mint. UI: User settings → Developer → Access tokens → Generate. Or via the /api/2.0/token/create endpoint.
- Shape. dapi<32 hex chars>. Carries the user's identity and entitlements.
- Lifetime. Configurable, capped at 90 days by workspace policy.
- Use case. Ad-hoc: curl from your laptop, a one-off python script you run interactively. The "I just need to poke the API once" tool.
- What it should never be. A CI runner credential, a shared team credential, a value committed to a repo.
OAuth U2M in detail.
- How to mint. databricks auth login --host https://<workspace> opens a browser, the user logs in, and the CLI records the profile in ~/.databrickscfg and caches the OAuth tokens locally.
- Lifetime. Each access token is 1 hour; the refresh token rotates and is valid 90 days.
- Use case. Developer laptops running the CLI. Token rotation is automatic and invisible. If the user gets offboarded, the refresh token dies, so there are no orphaned credentials.
- What it should never be. A CI runner credential (interactive login required).
OAuth M2M with a service principal in detail.
- How to mint. Create a service principal in the account console and generate an OAuth secret on it. The service principal's application ID is the client_id and the generated secret is the client_secret. The CLI exchanges those for a 1-hour access token automatically.
- Use case. GitHub Actions, scheduled CLI jobs, shared deploy automation. Anything that runs without a human at the keyboard.
- Rotation. The client_secret rotates every 90 days by policy. Rotate via databricks account service-principal-secrets create <sp_id>, deploy the new secret to the CI store, then revoke the old secret.
- What it should never be. A handout to individual developers. Service principals are role-shaped, not user-shaped, and personal use undermines the audit trail.
Notebook-context auth in detail.
- How it works. Inside a notebook running as a job task, dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get() returns a per-run token. Most code instead calls dbutils.secrets.get(scope, key) or uses an SDK client that picks up the context automatically.
- Identity. Tied to the job's run_as (a user or a service principal). The token cannot escalate beyond run_as entitlements.
- Lifetime. As long as the run. Cannot be exported, cannot be saved to a file.
- Use case. Any in-job automation: "list other jobs in the workspace," "trigger a downstream run," "write a record to a metadata table."
- What it should never be. Logged. Ever. print(token) in a notebook is a fireable mistake.
The four hard rules.
- Never commit a PAT (or a client_secret, or any token) to a repository, even a private one. Treat tokens as radioactive.
- Never log a secret value: not in print, not in dbutils.fs.put, not in stdout from a CLI. Every secrets API is designed so you do not need to.
- Rotate service principal secrets every 90 days. This is automatable via a scheduled CLI job that mints, deploys, and revokes.
- Audit via system.access.audit. Every API call carries the principal name; the audit table shows who deployed what and when.
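The first rule can be backed by a cheap scan before commit. The sketch below greps for strings shaped like the PAT format described earlier (dapi followed by 32 hex characters); find_pat_like_tokens is a hypothetical helper, the pattern is an assumption based on that description, and it will not catch client secrets or other token formats.

```shell
#!/usr/bin/env bash
# List files that contain something shaped like a PAT: "dapi" followed by
# 32 lowercase hex characters. Returns 0 when at least one match is found.
# Usage: find_pat_like_tokens PATH [PATH...]
find_pat_like_tokens() {
  grep -rlE 'dapi[0-9a-f]{32}' -- "$@"
}

# Example pre-commit usage:
#   if find_pat_like_tokens .; then
#     echo "possible Databricks token found; aborting commit" >&2
#     exit 1
#   fi
```

A dedicated secret scanner covers far more formats; this only shows how little it takes to catch the most common mistake.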
Common interview probes on auth.
- "What is the production answer for authenticating a CI runner?" — OAuth M2M with a service principal. PAT in CI is a code smell.
- "How is OAuth U2M different from a PAT?" — U2M tokens are short-lived (1h), automatically refreshed via a 90-day refresh token, and die when the user is offboarded. PAT is a manually-issued long-lived bearer token.
- "If a notebook inside a job needs to call the Jobs API, how does it authenticate?" Via notebook-context auth, the inherited token of the job's run-as identity. Use the Databricks SDK and it picks the token up automatically.
- "How would you migrate from PAT to M2M without downtime?" Provision the service principal, grant it the same workspace permissions as the PAT's user, swap CI secrets to the new client_id/client_secret, revoke the old PAT.
Worked example — provisioning a service principal for GitHub Actions
Detailed explanation. The first M2M setup is fiddly because three layers must agree: the Databricks account console (create the SP), the workspace (grant it permissions), and GitHub (store the secret). Once the three layers are wired, the CI script just sees three env vars.
Question. Walk through the CLI commands to provision a service principal ci-deploy-bot, grant it USER access on the workspace, mint an OAuth secret, and verify the secret works for a deploy.
Input.
| field | value |
|---|---|
| account_id | 00000000-1111-2222-3333-444444444444 |
| workspace_id | 12345 |
| sp display name | ci-deploy-bot |
Code.
# Run with an account-admin OAuth U2M session
databricks auth login --host https://accounts.cloud.databricks.com \
  --account-id 00000000-1111-2222-3333-444444444444
# 1) Create the service principal at the account level
SP_JSON=$(databricks account service-principals create \
  --display-name "ci-deploy-bot" --output JSON)
SP_ID=$(echo "$SP_JSON" | jq -r '.id')
# The SP's application ID is the OAuth client_id
CLIENT_ID=$(echo "$SP_JSON" | jq -r '.applicationId')
# 2) Mint an OAuth secret (the secret value is returned only once)
SECRET_JSON=$(databricks account service-principal-secrets create "$SP_ID" --output JSON)
CLIENT_SECRET=$(echo "$SECRET_JSON" | jq -r '.secret')
# 3) Grant the SP access on the workspace (account-level command)
databricks account workspace-assignment update 12345 "$SP_ID" \
  --json '{"permissions": ["USER"]}'
# 4) Verify by listing jobs via the SP (use a new profile)
cat <<EOF >> ~/.databrickscfg
[ci-deploy-bot]
host = https://acme.cloud.databricks.com
client_id = $CLIENT_ID
client_secret = $CLIENT_SECRET
EOF
databricks --profile ci-deploy-bot jobs list --output JSON | jq 'length'
Step-by-step explanation.
- The first login is human (U2M, account-admin scope), which is required to call account-level APIs.
- account service-principals create registers the SP at the account level. The response carries two identifiers: id, which the account commands below take, and applicationId, the UUID that serves as the OAuth client_id.
- service-principal-secrets create returns a one-time client_secret. Store it immediately; the API never returns it again.
- workspace-assignment update grants the SP USER access on the workspace. Without this, the SP can authenticate but cannot see any workspace resource.
- Adding a [ci-deploy-bot] profile in ~/.databrickscfg lets you test the credentials locally. jobs list should return a non-empty array (assuming the SP has been granted CAN_VIEW on at least one job).
- The same client_id / client_secret go into GitHub Actions secrets, and the CI script uses them directly via env vars without any local profile file.
Output.
| Step | Output |
|---|---|
| login | (no output; interactive login) |
| 1 | SP_ID = 99999999-..., plus the client_id |
| 2 | client_secret (shown once) |
| 3 | assignment with permissions USER |
| 4 | jobs count (e.g. 12) |
Rule of thumb. Provision one service principal per deployment role — ci-deploy-bot for prod deploys, ci-validate-bot for read-only validates, drift-bot for read-only drift checks. Never reuse one SP across roles; if one credential leaks, you only need to rotate that one role's secret.
Worked example — rotating a service principal secret on a 90-day schedule
Detailed explanation. The policy in this guide is to rotate SP secrets every 90 days. The rotation has three steps: mint a new secret, swap it into the CI store, then revoke the old secret. The trick is to overlap: do not revoke the old secret before the new one is live in CI, or the next deploy will fail.
Question. Write a bash script that mints a new SP secret, prints both the new and old secret IDs, and revokes the old one only after the user confirms the CI is using the new one.
Input.
| field | value |
|---|---|
| SP_ID | 99999999-... |
| current_secret_id | aaaaaaaa-... (from ~/.databrickscfg or audit log) |
Code.
SP_ID="99999999-1111-2222-3333-444444444444"
OLD_SECRET_ID="aaaaaaaa-1111-2222-3333-444444444444"
# 1) Mint a new secret (IDs are positional arguments in the Go CLI)
NEW=$(databricks account service-principal-secrets create "$SP_ID" --output JSON)
NEW_SECRET_ID=$(echo "$NEW" | jq -r '.id')
NEW_CLIENT_SECRET=$(echo "$NEW" | jq -r '.secret')
echo "New secret ID: $NEW_SECRET_ID"
echo "New client_secret (store in GitHub Actions NOW): $NEW_CLIENT_SECRET"
# 2) Pause for the operator to update the CI secret store
read -r -p "Press ENTER after the new secret is live in CI..."
# 3) Verify CI can authenticate with the new secret
echo "Run a smoke test deploy in CI now. Did it succeed? [y/n]"
read -r CONFIRM
if [ "$CONFIRM" != "y" ]; then
  echo "Aborting: leaving both secrets active"
  exit 1
fi
# 4) Revoke the old secret
databricks account service-principal-secrets delete "$SP_ID" "$OLD_SECRET_ID"
echo "Old secret $OLD_SECRET_ID revoked"
Step-by-step explanation.
- The new secret is minted first. Both new and old are valid simultaneously — that overlap window is the safety margin.
- The script pauses for the operator to deploy the new client_secret to CI (GitHub Actions secrets, AWS Secrets Manager, whichever store the team uses). No automation here on purpose; CI store updates require a human.
- The smoke test confirms the new secret actually works for CI deploys. If the operator says "no," the script aborts without revoking the old secret, so the rotation can be retried.
- Only on y does the script call service-principal-secrets delete against the old secret ID. After this call, anything still using the old client_secret breaks.
Output.
| Phase | What is valid |
|---|---|
| Before rotation | old secret only |
| After step 1 | old + new both valid (overlap) |
| After step 4 (revoke) | new secret only |
Rule of thumb. Rotation always overlaps — mint, deploy, verify, then revoke. Never revoke first; never skip verify. A 30-minute window where both secrets work is the cost of "the deploys never fail at midnight on rotation day."
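To decide when to start the overlap window, a scheduled job can compare a secret's age against its lifetime. The following is a minimal sketch under stated assumptions: `rotation_status` is a hypothetical helper (not part of the Databricks CLI), the creation time is supplied by the caller as epoch seconds, and the 90-day lifetime and 14-day warning threshold are illustrative values.

```shell
# Hypothetical helper: decide whether a secret should be rotated.
# Arguments: created_epoch now_epoch max_age_days warn_days
# Prints "rotate" when the secret is within warn_days of expiry (or past it),
# otherwise prints "ok".
rotation_status() {
  created=$1
  now=$2
  max_age_days=$3
  warn_days=$4
  age_days=$(( (now - created) / 86400 ))
  days_left=$(( max_age_days - age_days ))
  if [ "$days_left" -le "$warn_days" ]; then
    echo "rotate"
  else
    echo "ok"
  fi
}
```

A weekly CI job could call `rotation_status "$CREATED_EPOCH" "$(date +%s)" 90 14` and open a ticket whenever it prints `rotate`, which leaves the operator time to run the mint, deploy, verify, revoke sequence well before the old secret expires.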
Interview question on the production auth answer
A senior interviewer often asks: "Your team is migrating from PATs to OAuth for all CI/CD. Walk me through the migration plan — what changes in CI, how do you handle the cutover, and what auditing do you add?"
Solution Using OAuth M2M with a service principal and system.access.audit reconciliation
# 1) Provision one SP per deploy role (already shown in prior example)
SP_ID=...
# 2) Grant the SP exactly the workspace permissions of the PAT user
# - workspace USER access
# - CAN_MANAGE on the bundle-owned jobs and pipelines
# 3) Add NEW CI secrets (do not delete the old PAT yet)
# DATABRICKS_HOST, DATABRICKS_CLIENT_ID, DATABRICKS_CLIENT_SECRET
# 4) Update CI workflows to use M2M env vars
#    (do not rely on the CLI choosing between a PAT and client credentials
#    when both are set; set DATABRICKS_AUTH_TYPE=oauth-m2m explicitly)
# 5) Cut traffic: re-trigger the staging workflow and confirm it passes,
#    then re-trigger prod with a manual dispatch
# 6) Run the audit query in DBSQL or via the REST API
SELECT
user_identity.email AS principal,
action_name,
response.status_code,
COUNT(*) AS n
FROM system.access.audit
WHERE event_time > current_date - INTERVAL 7 DAYS
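AND user_agent LIKE '%cli%' -- assumption: CLI calls are identifiable by user_agent; service_name holds the API service (jobs, clusters, ...), not the client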
GROUP BY principal, action_name, response.status_code
ORDER BY n DESC;
# 7) Confirm the SP's principal name shows up, the old user does NOT
# Revoke the old PAT via /api/2.0/token/delete
Step-by-step trace.
| Day | Action | What's valid |
|---|---|---|
| 0 | provision SP + grants | PAT + SP both valid |
| 0 | add new CI secrets alongside the old PAT | PAT + SP both valid |
| 1 | update workflows to use SP env vars | PAT + SP both valid |
| 2 | run staging workflow, verify in audit | PAT + SP both valid |
| 3 | run prod workflow (manual), verify in audit | PAT + SP both valid |
| 4 | confirm only SP appears in audit | PAT + SP both valid |
| 5 | delete old PAT | SP only |
Output:
| audit row | principal | action_name | n |
|---|---|---|---|
| latest week | ci-deploy-bot@.spn | bundleDeploy | 14 |
| latest week | ci-deploy-bot@.spn | jobRunNow | 320 |
| latest week | (old user) — none | — | 0 |
The audit confirms the SP is doing the work and the old user identity no longer appears, so the old PAT is safe to revoke.
Why this works — concept by concept:
- Overlap, don't cut over: provisioning the SP first and keeping the PAT live until the SP has handled real traffic is the only safe migration. Never delete the old credential first.
- Per-role service principals: `ci-deploy-bot`, `drift-bot`, and `ci-validate-bot` are separate. If one credential leaks, you rotate one role's secret, not all of them.
- `system.access.audit` reconciliation: the audit table is the source of truth for "who actually called the API." If the SP shows up and the old user does not, the migration is genuinely complete.
- Automatic token refresh: the M2M flow auto-refreshes the 1-hour access token from the client_id/client_secret pair. No human in the loop, and no scheduled "renew" job is needed for the access token (only the secret rotates every 90 days).
- Auditing as a first-class step: the migration is not done when CI is green; it is done when the audit log shows only the SP. Treat the audit query as a checklist item.
- Cost: the migration is one engineer-week of setup; ongoing cost is one rotation script every 90 days. Compared to the cost of a leaked PAT on a public repo, the math is trivially in favour of M2M.
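Treating the audit query as a checklist item is easier when the check is scripted. The sketch below assumes the audit query result has been exported as tab-separated rows of `principal`, `action_name`, `n`; `audit_is_clean` is a hypothetical helper name, and the export format is an assumption rather than anything the audit table or CLI produces on its own.

```shell
# Hypothetical gate: succeed only when the old principal has no audit rows
# with a positive count.
# Usage: audit_is_clean OLD_PRINCIPAL < rows.tsv
# Input rows (assumed format): principal<TAB>action_name<TAB>n
audit_is_clean() {
  old_principal=$1
  awk -F '\t' -v old="$old_principal" '
    # Flag every row where the old identity still made calls
    $1 == old && $3 + 0 > 0 { print "still active: " $0; bad = 1 }
    # Exit status 1 if anything was flagged, 0 otherwise
    END { exit bad }
  '
}
```

Wired into the day-4 step of the trace, a non-zero exit from `audit_is_clean` would block the PAT deletion until the remaining caller has been found and migrated.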
Cheat sheet — API + CLI recipes
- List every job in prod (just IDs). `databricks jobs list --profile prod --output JSON | jq -r '.[].job_id'`. Pipe to `wc -l` for a count, or to a `while read` loop for per-job actions.
- Trigger a run with parameters. `databricks jobs run-now --job-id <id> --notebook-params '{"date":"2026-06-04"}' --profile prod`. Add `--idempotency-token "$(uuidgen)"` in CI.
- Tail a run until terminal state. `until databricks jobs get-run --run-id <id> --output JSON | jq -r '.state.life_cycle_state' | grep -qE '^(TERMINATED|SKIPPED|INTERNAL_ERROR)$'; do sleep 10; done`. Checking only for `TERMINATED` loops forever when a run ends in `SKIPPED` or `INTERNAL_ERROR`; afterwards read `.state.result_state` to tell success from failure.
- Deploy a bundle to prod. `databricks bundle deploy --target prod`. Add `--force-lock` only when you know nobody else is mid-deploy.
- Validate a bundle without deploying. `databricks bundle validate --target prod`. Treat any non-zero exit as a deploy blocker.
- Create a cluster from a checked-in JSON spec. `databricks clusters create --json @cluster.json --profile prod --output JSON | jq -r '.cluster_id'`.
- Restart a cluster with new libraries. `databricks clusters restart --cluster-id <id>` after `databricks libraries install --cluster-id <id> --pypi-package "mylib==1.2.3"`.
- Rotate a service principal secret. `databricks account service-principal-secrets create --service-principal-id <sp_id>`: capture the new `secret`, deploy to CI, verify, then `delete --secret-id <old_id>`.
- Find drift between bundle and prod. `databricks bundle validate --target prod`. Diff the output against the previous run; schedule weekly as a GitHub Action.
- Curl a raw REST endpoint. `curl -X POST "$DATABRICKS_HOST/api/2.1/jobs/run-now" -H "Authorization: Bearer $DATABRICKS_TOKEN" -H "Content-Type: application/json" -d @payload.json | jq`.
- Smoke test a freshly-deployed bundle job. `databricks bundle run --target staging assert_row_count` exits non-zero if the notebook raises, which gates prod promotion.
- Switch profiles for one call. `databricks --profile prod <command>`. Same binary, different workspace, different identity.
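The "tail a run" recipe can be wrapped in a reusable function with a poll limit so a stuck run cannot hang CI forever. This is a sketch, not the CLI's own behaviour: `is_terminal_state` and `wait_for_terminal` are hypothetical names, and the state-fetching command is supplied by the caller (for example a small function wrapping the `jobs get-run ... | jq` pipeline from the recipe). The three terminal values are the Jobs API life-cycle states `TERMINATED`, `SKIPPED`, and `INTERNAL_ERROR`.

```shell
# Return 0 when the life-cycle state is one a run cannot leave.
is_terminal_state() {
  case "$1" in
    TERMINATED|SKIPPED|INTERNAL_ERROR) return 0 ;;
    *) return 1 ;;
  esac
}

# Poll a caller-supplied command until it reports a terminal state.
# Usage: wait_for_terminal FETCH_CMD INTERVAL_SECONDS MAX_POLLS
# FETCH_CMD must print the current life-cycle state on stdout.
wait_for_terminal() {
  fetch_cmd=$1
  interval=$2
  max_polls=$3
  polls=0
  while [ "$polls" -lt "$max_polls" ]; do
    state=$($fetch_cmd)
    if is_terminal_state "$state"; then
      echo "$state"
      return 0
    fi
    polls=$((polls + 1))
    sleep "$interval"
  done
  echo "timed out after $max_polls polls" >&2
  return 1
}
```

Passing the fetch command as an argument keeps the loop independent of any particular CLI syntax and makes it easy to exercise with a stub before pointing it at a real workspace.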
Frequently asked questions
What is the difference between the Databricks API and the Databricks CLI?
The Databricks API is the canonical REST surface (/api/2.x/<group>/<verb>) — authenticated with a Bearer token, paginated with next_page_token, and the foundation everything else builds on. The Databricks CLI is a Go binary (databricks) that wraps the API with consistent flags (--profile, --output JSON, --json @file), six top-level command groups, and automatic OAuth token handling. They are the same surface; the CLI is just the ergonomic client. Use the CLI for daily ops and scripts; reach for raw HTTP calls only when you need a field the CLI does not expose or you are writing a deeper library. The legacy Python databricks-cli from before 2024 is deprecated — use the Go binary.
PAT vs OAuth — which authentication should I use for CI/CD?
OAuth M2M with a service principal is the production answer for CI/CD. A PAT is per-user, manually rotated, and dies when the issuing user is offboarded — a single human's offboarding can break every CI pipeline. An OAuth M2M service principal is role-shaped: it has a client_id and client_secret, rotates every 90 days, and survives any individual leaving the team. In GitHub Actions, set DATABRICKS_HOST, DATABRICKS_CLIENT_ID, and DATABRICKS_CLIENT_SECRET as repository secrets and the CLI handles the token exchange automatically. PATs are fine for laptop scripts and ad-hoc human work; never for shared automation.
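A CI job can fail fast with a readable message when one of the three M2M variables is missing, instead of failing later with an opaque auth error. The sketch below is an illustration only: `require_m2m_env` is a hypothetical helper, and the variable names are the ones listed in the answer above.

```shell
# Hypothetical preflight check for OAuth M2M in CI.
# Returns 1 and names the missing variables when any are unset or empty.
require_m2m_env() {
  missing=""
  for name in DATABRICKS_HOST DATABRICKS_CLIENT_ID DATABRICKS_CLIENT_SECRET; do
    # Indirect lookup of the variable called $name (names are fixed literals)
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  if [ -n "$missing" ]; then
    echo "missing required variables:$missing" >&2
    return 1
  fi
  return 0
}
```

Calling `require_m2m_env || exit 1` as the first step of a deploy job turns a misconfigured secret store into a one-line error that names exactly what is absent.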
What are Databricks Asset Bundles and why are they better than raw API calls?
Databricks Asset Bundles are a YAML deployment format that describes the full set of jobs, pipelines, clusters, permissions, and dashboards a project wants in a workspace. A bundle is a `databricks.yml` root file plus typed YAML files under `resources/`, with multiple `targets:` (dev, staging, prod) sharing the same structure. The CLI commands are `bundle validate` (schema and Unity Catalog reference check), `bundle deploy --target <env>` (atomic deploy that updates existing resource IDs in place), and `bundle run` (trigger a deployed job). Bundles beat raw API calls on three axes: they are declarative (the YAML is the source of truth), atomic (one transaction per deploy), and diffable (Git history is your audit trail). Use raw API calls only for surgical one-off fixes the bundle does not cover.
Can I manage Unity Catalog catalogs and grants via the REST API?
Yes — Unity Catalog has its own endpoint group at /api/2.1/unity-catalog/... covering catalogs, schemas, tables, volumes, models, external locations, storage credentials, and grants. The most common automated calls are POST /unity-catalog/catalogs to create a catalog, POST /unity-catalog/schemas to create a schema, and PATCH /unity-catalog/permissions/<securable_type>/<full_name> to adjust grants. The CLI exposes these as databricks catalogs, databricks schemas, databricks grants, and Asset Bundles support resources/grants/ blocks for declarative grant management. Anything you can do in the Unity Catalog UI is exposed via REST; treat grants as code and apply them via bundles.
How do I idempotently create a job from a GitHub Actions workflow?
The cleanest pattern is to use Databricks Asset Bundles. A bundle's bundle deploy --target staging creates the job on first run and updates the same job ID in place on every subsequent run — there is no "create vs update" distinction for the CI script to manage. The job's identity comes from its resource name in the bundle YAML, not from a per-deploy generated ID. If you must use raw API calls instead, hit POST /api/2.1/jobs/reset (full overwrite of an existing job) when you already know the job_id, or POST /api/2.1/jobs/create with an idempotency_token in the body to dedupe accidental retries — though jobs/create does not natively support a stable name-based lookup; you have to record the returned job_id yourself. The bundle path is dramatically simpler.
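For the raw-API path, "record the returned job_id yourself" can be as small as a tab-separated state file. The following sketch is one possible shape under stated assumptions: the function names and the file format are hypothetical, and only the two endpoint paths come from the answer above.

```shell
# Hypothetical state file format: job_name<TAB>job_id, one job per line.

# Print the recorded job_id for JOB_NAME, or nothing when it is unknown.
# Usage: lookup_job_id STATE_FILE JOB_NAME
lookup_job_id() {
  [ -f "$1" ] || return 0
  awk -F '\t' -v name="$2" '$1 == name { print $2; exit }' "$1"
}

# Append a name-to-id mapping after a successful jobs/create call.
# Usage: record_job_id STATE_FILE JOB_NAME JOB_ID
record_job_id() {
  printf '%s\t%s\n' "$2" "$3" >> "$1"
}

# Choose the endpoint: reset when the job is already known, create otherwise.
# Usage: endpoint_for STATE_FILE JOB_NAME
endpoint_for() {
  if [ -n "$(lookup_job_id "$1" "$2")" ]; then
    echo "/api/2.1/jobs/reset"
  else
    echo "/api/2.1/jobs/create"
  fi
}
```

The state file has to live somewhere durable and shared between CI runs, which is exactly the bookkeeping a bundle deploy takes off your hands.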
How do I migrate from the legacy databricks-cli to the new databricks CLI?
Three substitutions cover most of it. First, change the installed tool: brew install databricks/tap/databricks (or the official prebuilt binary) replaces pip install databricks-cli. Second, rewrite sub-command names: databricks workspace ls /Users becomes databricks workspace list /Users; databricks fs ls dbfs:/ is unchanged. Third, switch auth from the legacy ~/.databrickscfg PAT-only format to either OAuth U2M (databricks auth login --host) for laptops or M2M (client_id/client_secret env vars) for CI. The new CLI's bundle group is brand new and has no legacy equivalent. Update every script's databricks ... invocation in a single PR and run a CI smoke test before merging — the new CLI's exit codes are mostly compatible but a handful of edge cases changed.
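The sub-command rename described above can be applied mechanically before the single migration PR is reviewed. This sketch covers only the one substitution the answer names (`workspace ls` to `workspace list`); `rewrite_legacy_invocations` is a hypothetical helper, and any other renames in your scripts would need their own rules.

```shell
# Hypothetical filter: read script text on stdin and rewrite the legacy
# `databricks workspace ls` sub-command to `databricks workspace list`.
# Other invocations, such as `databricks fs ls`, pass through unchanged.
rewrite_legacy_invocations() {
  sed -E 's/databricks workspace ls([[:space:]]|$)/databricks workspace list\1/g'
}
```

Running `rewrite_legacy_invocations < Makefile | diff Makefile -` shows the proposed changes without touching the file, which keeps the rewrite reviewable.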
Practice on PipeCode
- Drill the API integration practice library → for the REST-call, pagination, and retry-shape probes interviewers love.
- Warm up on Databricks company problems → for the company-specific SQL + Python + Spark surface.
- Rehearse system design drills → for the "design the deploy pipeline" interview question.
- Layer the data processing library → for the Jobs-API-driven ingest + transform shapes.
- Sharpen the SQL axis with the SQL for data engineering interviews course →.
- Stack the Spark internals with the Apache Spark internals course → — every Databricks job runs on Spark.
- For broader pipeline craft, work through the ETL system design course →.
- For the overall surface, read top data engineering interview questions →.
- Stack the prerequisites with the only 5 skills you need to become a data engineer →.
Pipecode.ai is Leetcode for Data Engineering — every API and CLI recipe above ships with hands-on practice rooms where you write the `jobs/run-now` retry, the `bundle deploy` GitHub Actions workflow, and the OAuth M2M service principal cutover against real graded inputs. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so you never have to wonder whether your `databricks bundle deploy --target prod` actually maps to the same idempotent behaviour interviewers expect on the whiteboard.