etl tools comparison has become a category — not a single decision — because the modern etl tool stack spans four very different jobs: workflow orchestration (Airflow), in-warehouse transformation (dbt), saas data ingest (Fivetran), and full-platform etl (AWS Glue, Talend, Informatica). The best etl tools 2026 lists keep clumping these together as if they competed head-to-head, but in production they layer on top of each other. This deep-guide counterpart to a feature-matrix comparison walks the etl tools list one tool at a time — DAGs, executors, models, materializations, connectors, MAR billing, DPU-hour pricing, IPU consumption, and the etl architecture patterns that bind them — so you can defend each pick the way a staff data engineer defends one in a system-design round.
Every section below uses the data engineering interview questions teaching shape: a one-paragraph etl explained preamble dense with keyword phrases, two or three ### Title sub-sections that unpack the moving parts (DAG anatomy, executor choices, model layering, schema-drift handling, the Glue Data Catalog, the Talend Studio canvas, the IDMC console), and at least one #### Worked example block that ships a runnable snippet plus a step-by-step trace. We close with a ### Solution flavoured pricing comparison worked example (a 1 TB / day pipeline costed across all six tools), a ### Decision framework ASCII tree, and three reference data engineering stack patterns (Fivetran + dbt + Airflow, pure AWS with Glue + Step Functions, legacy enterprise with Informatica IDMC) so the etl pipeline tools decision lands as a concrete pick, not a vendor brochure.
When you want hands-on reps on the SQL, modeling, and ETL skills every data engineering loop tests, browse ETL practice library →, drill data-modeling drills →, sharpen Python ETL patterns →, rehearse aggregation problems →, reinforce joins drills →, or widen coverage on the full SQL practice library →.
On this page
- Why the ETL toolscape is fragmented in 2026
- Orchestration deep-dive — Airflow (with Prefect + Dagster sidenotes)
- Transformation deep-dive — dbt (with SQLMesh sidenote)
- EL / SaaS ingest deep-dive — Fivetran (with Airbyte + Stitch sidenotes)
- Full ETL platforms — AWS Glue · Talend · Informatica
- Pricing and licensing models — comparing 6 cost shapes
- Decision framework + 2026 production stack patterns
- Choosing the right ETL tool (cheat sheet)
- Frequently asked questions
- Practice on PipeCode
1. Why the ETL toolscape is fragmented in 2026
The etl toolscape is fragmented because there are four distinct jobs inside a modern pipeline — ingest, orchestrate, transform, and govern — and no single vendor has won all four cleanly. The etl tools list that interview prep articles love to recite (Airflow, dbt, Fivetran, Glue, Talend, Informatica, Airbyte, Stitch, Matillion, Prefect, Dagster, SQLMesh) is really four small lists glued together by the etl pipeline label. Once you see the four-category taxonomy, every etl tools comparison prompt — "should we use airflow vs dbt?", "is fivetran vs airbyte about cost or control?", "glue vs talend vs informatica for a regulated enterprise?" — collapses into "which category does this question live in?"
This first section frames the etl tool taxonomy, explains the modern data stack shape that drives 2026 defaults, and names the etl architecture trade-offs the rest of the guide unpacks. Treat it as the map; sections 2–5 are the territory.
The four-category etl tool taxonomy — orchestration · transformation · el ingest · full etl
The etl tool selection decision starts with a category, not a tool. Map every tool onto exactly one of the four layers below — that is the etl tools comparison mental model every senior data engineer carries.
-
Orchestration (
workflow orchestration) — schedules and monitors other jobs.Airflowis the dominant choice;PrefectandDagsterare modern alternatives. Hands off the actual data movement to operators (Python, Spark, dbt, Fivetran triggers). The unit of work is adag(directed acyclic graph) oftasks. -
In-warehouse transformation (
transformation tools) — runs SQL inside the warehouse.dbtis the standard;SQLMeshis the newer challenger. Reads from raw tables, writes to mart tables, never moves data across systems. The unit of work is amodel(aSELECTstatement materialised as a view or table). -
EL / SaaS ingest (
saas data ingest) — pulls data from operational sources into the warehouse.Fivetranis the canonical managed SaaS;Airbyteis the open-source alternative;Stitchis the simpler budget option. Pre-builtconnectorsare the moat — every connector you don't write is engineer-time saved. Pairs withdbtfor the T (EL + T, notETL). -
Full ETL platforms (
enterprise etl) — ship ingest, transform, and orchestration in one product.AWS Glueis the AWS-native serverless Spark choice;Talendis the long-tenured GUI-first option;Informaticais the Fortune-500 enterprise incumbent (IDMCSaaS,PowerCenterlegacy). These platforms trade composability for convenience.
Why the taxonomy beats a one-dimensional ranking.
-
Airflowanddbtare not competitors — they live in different layers; many real pipelines run both. -
FivetranandAirfloware not competitors — Fivetran moves data in, Airflow schedules the downstream T and load. -
GlueandAirfloware not competitors either —Step Functionsorchestrates Glue jobs in pure-AWS stacks;Airfloworchestrates Glue jobs in multi-cloud or hybrid stacks. -
TalendandInformaticaare near-competitors — both are GUI-first full-ETL platforms; the pick is usually ecosystem + price. -
FivetranandAirbyteare direct competitors — same shape (managed connectors), different commercial model (closed SaaS vs open-source).
Why no single ETL tool wins — the four pressures
A single etl tools list ranking always lies because four different pressures pull buyers in different directions.
-
Composability vs convenience — code-first composable stacks (
Airflow + dbt + Fivetran) win for engineering-led teams; GUI-first all-in-one platforms (Talend,Informatica) win for low-code teams. -
Cloud lock vs portability —
Glueis AWS-only;Fivetranis cloud-only SaaS;Airflow/dbt/Talend/Informaticarun multi-cloud or hybrid. -
Pricing unit — free OSS, per-developer seat,
MAR(Monthly Active Rows),$/DPU-hour, per-user license,IPU(Informatica Processing Unit) consumption — five different cost shapes that don't compare apples-to-apples. -
Learning curve —
dbt(SQL + YAML) is fastest;AirflowandGlueare medium (Python / PySpark);TalendandInformaticacarry the steepest GUI learning curves.
The 2026 default — Fivetran + dbt + Airflow as the modern data stack
The modern data stack has standardised around Fivetran for ingest + dbt for transform + Airflow for orchestration on top of a cloud warehouse (Snowflake, BigQuery, Databricks, or Redshift). Greenfield mid-market teams default here because:
-
Connectorsare someone else's problem —Fivetranships 500+, you ship zero. -
Transformationstays inSQL— every SQL-fluent engineer can contribute, and the warehouse compute does the heavy lifting. -
Orchestrationis Python-first —Airflow DAGsare code, version-controlled, and reviewable like every other repo. -
Total tooling costis predictable —dbt Cloud+Fivetran Starter+AstronomerorMWAAlands in the $3–10k / month range for most teams.
Legacy Fortune-500 enterprises run a different default — Informatica IDMC + Airflow + cloud warehouse — because their regulated data-governance posture maps onto Informatica's catalog and lineage tooling. Pure-AWS shops often run Glue + Step Functions + Glue Data Catalog to avoid OSS Airflow operations entirely.
Worked example — placing eight real tools into the four-category taxonomy
Detailed explanation. A new data engineer joins a team and is told "we use Airflow, dbt, Fivetran, Snowflake, Looker, Glue, Soda, and Great Expectations." Before doing anything, the right first move is to place each tool into one of the four ETL categories (plus the adjacent warehouse, BI, and quality categories). This is the etl tool taxonomy drill every onboarding interview runs.
Question. Categorise the following eight tools and explain which layer of the pipeline they own: Airflow, dbt, Fivetran, Snowflake, Looker, AWS Glue, Soda, Great Expectations.
Input.
| Tool | Surface API | Cloud model | What it touches |
|---|---|---|---|
| Airflow | Python DAGs | OSS / MWAA / Astronomer | Schedules tasks |
| dbt | SQL + YAML | Core (CLI) / Cloud (SaaS) | Compiles SQL inside warehouse |
| Fivetran | Web UI | SaaS-only | Pulls from SaaS apps and DBs |
| Snowflake | SQL | SaaS warehouse | Stores raw + mart tables |
| Looker | LookML | SaaS BI | Reads marts, renders dashboards |
| AWS Glue | PySpark + visual | AWS-managed serverless | Crawls, transforms, loads |
| Soda | YAML checks | OSS / Cloud | Asserts data quality |
| Great Expectations | Python suites | OSS | Asserts data quality |
Code (a one-screen mental-model snippet you can keep in a README).
# Layer 1: Sources (operational systems — Salesforce, Stripe, Postgres, etc.)
# |
# v
# Layer 2: EL / SaaS ingest -----> Fivetran (managed) | Airbyte (OSS)
# |
# v
# Layer 3: Warehouse / Lakehouse -> Snowflake | BigQuery | Databricks | Redshift
# |
# v
# Layer 4: Transformation --------> dbt models (staging -> intermediate -> marts)
# | (or AWS Glue for AWS-native Spark ETL)
# v
# Layer 5: Orchestration ---------> Airflow DAGs trigger 2, 3, 4 (or Step Functions on AWS)
# |
# v
# Layer 6: Quality ---------------> Great Expectations / Soda assert at every layer
# |
# v
# Layer 7: BI / Activation -------> Looker / Mode / Tableau read marts
Step-by-step explanation.
-
Airflowis the only orchestration tool in the list — it sits across the pipeline and triggers everything below. -
Fivetranowns Layer 2 (EL / SaaS ingest) — pure ingest, no transforms. -
Snowflakeis the warehouse — not an ETL tool itself, but the substrate every layer above and below reads / writes. -
dbtowns Layer 4 (in-warehouse transformation) — runs SQL inside Snowflake; never moves data out. -
AWS Glueis a full-ETL platform — in this stack it's not used end-to-end becauseFivetran + dbtalready cover ingest + transform. It might run a one-off PySpark batch (e.g.crawl S3 logs into Snowflake). -
Lookeris BI (Layer 7) — reads dbt marts; not an ETL tool. -
SodaandGreat Expectationsare quality tools (Layer 6) — they check what the ETL tools produce; not ETL themselves.
Output (a one-line category map you can paste into a data-stack.md).
Airflow -> Orchestration
dbt -> Transformation (in-warehouse)
Fivetran -> EL / SaaS ingest
Snowflake -> Warehouse (adjacent, not ETL)
Looker -> BI (adjacent, not ETL)
AWS Glue -> Full ETL (used selectively for S3 + Spark batches)
Soda -> Quality (adjacent, not ETL)
Great Expectations -> Quality (adjacent, not ETL)
Rule of thumb: if you can't place a tool into the four-category etl tool taxonomy in one breath, you don't yet understand what problem it solves — and you won't defend it well in a system-design interview either.
SQL / Python
Topic — etl
ETL practice library
Data modeling
Topic — dimensional-modeling
Dimensional modeling drills
2. Orchestration deep-dive — Airflow (with Prefect + Dagster sidenotes)
apache airflow is the single most-tested workflow orchestration tool in data engineering interview questions, and for one reason: every other layer of the modern data stack hands off scheduling to it. Whether the actual work is a Fivetran sync, a dbt build, a PySpark batch, or a bash export, the airflow dag is what owns the when and the if it failed, retry it. If you can explain DAG, operator, executor, scheduler, metadata DB, and sensor in one minute, you've cleared the bar for most senior orchestration questions.
This section walks the airflow internals every senior interview round tests: DAG anatomy, the four operator families, the three executor choices (Celery, Kubernetes, Local), the scheduler + metadata DB pair, sensors, XComs, dynamic task mapping, and the three managed airflow hosting options (MWAA, Astronomer, Cloud Composer). It ends with sidenotes on Prefect and Dagster — the two modern challengers — and a worked example that wires a five-task daily_etl DAG.
DAG anatomy — tasks, operators, dependencies, schedule_interval
A dag (directed acyclic graph) is a Python file that defines what runs, when it runs, and what depends on what. Five primitives cover 95% of airflow dag interview questions.
-
DAGobject — the top-level container; carriesdag_id,schedule_interval,start_date,catchup,tags, anddefault_args. -
Task— a single unit of work; an instance of anoperatorclass. Tasks havetask_id, retries, retry_delay, and timeouts. -
Operator— the what of a task; the four families areBashOperator,PythonOperator,provider operators(SnowflakeOperator,DbtRunOperator,S3ToRedshiftOperator), andsensors(S3KeySensor,ExternalTaskSensor). -
Dependencies— set with>>(downstream) and<<(upstream);t1 >> t2 >> t3makes a three-task chain.set_upstream/set_downstreamare the explicit forms. -
schedule_interval— a cron expression ("0 7 * * *"), a preset ("@daily"), atimedelta(hours=1), orNone(manual / triggered).
The minimal DAG.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
def transform_callable(**ctx):
print(f"Transforming for execution_date={ctx['ds']}")
default_args = {
"owner": "data-eng",
"retries": 3,
"retry_delay": timedelta(minutes=5),
"email_on_failure": True,
}
with DAG(
dag_id="daily_etl",
description="Daily ETL — extract, transform, load, notify",
schedule_interval="@daily",
start_date=datetime(2026, 1, 1),
catchup=False,
default_args=default_args,
tags=["etl", "daily"],
) as dag:
extract = BashOperator(task_id="extract", bash_command="python /opt/scripts/extract.py")
transform = PythonOperator(task_id="transform", python_callable=transform_callable)
load = BashOperator(task_id="load", bash_command="python /opt/scripts/load.py")
notify = BashOperator(task_id="notify", bash_command="curl -X POST $SLACK_HOOK")
extract >> transform >> load >> notify
-
catchup=False— don't backfill missed runs on first deploy; almost always what you want. -
retries=3+retry_delay— survives transient infra blips without paging on-call. -
>>chain — reads top-to-bottom; the four tasks run sequentially. -
@daily— preset equivalent to"0 0 * * *"(midnight UTC);@hourly,@weekly,@monthlyexist too.
Executors — Local, Celery, Kubernetes, CeleryKubernetes
The airflow executor is the how and where tasks run. Pick by scale + ops budget.
-
LocalExecutor— runs tasks as subprocesses on the scheduler node; fine for <50 concurrent tasks, single-machine dev, small teams. -
CeleryExecutor— distributes tasks across a pool of worker nodes via a Celery broker (Redis or RabbitMQ); the historic production default for self-hosted Airflow. -
KubernetesExecutor— every task runs as a freshpodon a Kubernetes cluster; perfect isolation, auto-scale, per-task resource limits. The 2026 production default. -
CeleryKubernetesExecutor— hybrid; short tasks go to Celery workers, heavy tasks to fresh pods. Niche.
Why KubernetesExecutor won the 2026 default. Each task gets a clean pod with its own requirements.txt / Docker image, resources are billed per second, failed pods don't pollute the next run, and Kubernetes' built-in autoscaling means you pay zero for idle capacity overnight.
Scheduler + metadata DB — how Airflow remembers everything
Airflow has three long-running services and one database.
-
scheduler— parses DAG files everyscheduler_heartbeat_sec, decides which task instances are ready, and pushes them to the executor. The scheduler is the brain. -
webserver— Flask UI that reads the metadata DB; lets engineers trigger / pause / inspect DAGs. -
triggerer— runsdeferrable operators(async sensors that don't block a worker slot); a 2.2+ feature. -
metadata DB— Postgres (production) or MySQL; remembers every DAG run, task instance, XCom value, connection, variable, pool, and log location. The metadata DB is the source of truth — if you lose it, you lose state.
Sensors, XComs, dynamic task mapping — the three power features
-
Sensors— tasks that wait for an external condition (a file lands in S3, an upstream DAG finishes, a Snowflake row appears).S3KeySensor,ExternalTaskSensor,SqlSensorare the staples.mode="reschedule"frees the worker slot between pokes;deferrable=Truemoves the wait to the triggerer entirely. -
XComs(cross-communication) — small key-value blobs that one task pushes and another pulls. Limited to ~48KB per row (stored in the metadata DB); for larger payloads use S3 / GCS as the transport with the XCom holding a URI. -
Dynamic task mapping—MyOp.partial(...).expand(...)(Airflow 2.3+) fans out N parallel task instances from a list of inputs. The pythonic alternative to manually generating tasks at DAG-parse time.
Managed Airflow — MWAA, Astronomer, Cloud Composer
Self-hosting Airflow on Kubernetes is non-trivial. Three managed options dominate.
-
Amazon MWAA(Managed Workflows for Apache Airflow) — AWS-native; you pickmw1.small/medium/largeinstance sizes; metadata DB and scheduler are managed; you bring the DAG repo + plugins. -
Astronomer— vendor-neutral managed Airflow on Astronomer Cloud or your own Kubernetes; shipsAstro CLIandAstro Runtime(their hardened image); strong developer experience. -
Google Cloud Composer— GCP-native managed Airflow on GKE; tight integration withBigQuery,Dataflow, andCloud Functions.
Prefect + Dagster — the two modern challengers
-
Prefect— Python-first orchestrator; replaces DAGs with@flowand@taskdecorators;Prefect Cloudis the managed control plane. Strength: better dynamic-workflow ergonomics (task.map(...)first-class). Weakness: smaller ecosystem of provider operators. -
Dagster— opinionated aboutsoftware-defined assets; every materialised dataset is an asset, and the DAG is the asset graph. Strength: lineage and asset observability are first-class. Weakness: steeper learning curve; smaller community.
Both are credible alternatives in greenfield 2026 builds. Airflow still wins on ecosystem breadth (200+ provider packages, every cloud, every tool) and is what interviewers reach for by default.
Worked example — a five-task daily_etl Airflow DAG with sensor + dbt + Snowflake
Detailed explanation. The canonical production DAG shape is sense → load → transform → quality-check → publish. Each task uses a different operator family — S3KeySensor waits for the file, S3ToSnowflakeOperator loads it, DbtCloudRunJobOperator transforms, a PythonOperator runs a quality check, and a BashOperator notifies Slack. This worked example wires all five with the >> chain.
Question. Build an Airflow 2.7+ DAG daily_etl that runs at 07:00 UTC daily and (1) waits for s3://lake/raw/{ds}.csv to appear, (2) loads it into RAW.ORDERS in Snowflake, (3) triggers a dbt Cloud job, (4) runs a row-count quality check, (5) notifies Slack. The DAG must retry 3× with a 5-minute back-off and not backfill historical dates on first deploy.
Input.
| Source | Path / connection | Notes |
|---|---|---|
| S3 lake | s3://lake/raw/{ds}.csv |
One file per day |
| Snowflake |
SNOWFLAKE_DEFAULT connection |
Loads into RAW.ORDERS
|
| dbt Cloud | job id 12345
|
Runs the prod environment |
| Slack | webhook URL in Variable.get("slack_hook")
|
Notify on success |
Code.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.models import Variable
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from airflow.providers.snowflake.transfers.s3_to_snowflake import S3ToSnowflakeOperator
from airflow.providers.dbt.cloud.operators.dbt import DbtCloudRunJobOperator
from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook
def quality_check_callable(**ctx):
hook = SnowflakeHook(snowflake_conn_id="SNOWFLAKE_DEFAULT")
row_count = hook.get_first("SELECT COUNT(*) FROM RAW.ORDERS")[0]
if row_count < 100:
raise ValueError(f"Row count too low: {row_count}")
print(f"Quality check passed — {row_count} rows")
default_args = {
"owner": "data-eng",
"retries": 3,
"retry_delay": timedelta(minutes=5),
"email_on_failure": True,
}
with DAG(
dag_id="daily_etl",
schedule_interval="0 7 * * *",
start_date=datetime(2026, 1, 1),
catchup=False,
default_args=default_args,
tags=["etl", "daily", "snowflake"],
) as dag:
sense_source = S3KeySensor(
task_id="sense_source",
bucket_key="raw/{{ ds }}.csv",
bucket_name="lake",
aws_conn_id="aws_default",
mode="reschedule",
poke_interval=300,
timeout=60 * 60 * 6,
)
load_raw = S3ToSnowflakeOperator(
task_id="load_raw",
snowflake_conn_id="SNOWFLAKE_DEFAULT",
s3_keys=["raw/{{ ds }}.csv"],
table="RAW.ORDERS",
stage="LAKE_STAGE",
file_format="(TYPE = CSV, FIELD_DELIMITER=',', SKIP_HEADER=1)",
)
transform = DbtCloudRunJobOperator(
task_id="transform",
dbt_cloud_conn_id="dbt_cloud_default",
job_id=12345,
wait_for_termination=True,
)
quality_check = PythonOperator(
task_id="quality_check",
python_callable=quality_check_callable,
)
publish = BashOperator(
task_id="publish",
bash_command=(
"curl -X POST -H 'Content-Type: application/json'"
"-d '{\"text\":\"daily_etl finished for {{ ds }}\"}'"
f"{Variable.get('slack_hook')}"
),
)
sense_source >> load_raw >> transform >> quality_check >> publish
Step-by-step explanation.
-
S3KeySensorwaits fors3://lake/raw/2026-05-26.csvto appear;mode="reschedule"frees the worker slot between 5-minute pokes, so it doesn't burn a Celery / pod slot for 6 hours. -
S3ToSnowflakeOperatorissues aCOPY INTO RAW.ORDERS FROM @LAKE_STAGE/raw/2026-05-26.csvonce the sensor passes. -
DbtCloudRunJobOperatortriggers dbt Cloud job12345and polls until it finishes; failure here fails the task with the dbt run URL in the log. -
PythonOperatorruns an inlineSELECT COUNT(*)quality check; raises if fewer than 100 rows landed — exception kills the DAG before notification. -
BashOperatorposts to Slack with the{{ ds }}Jinja-templated date; runs only if the previous four succeeded.
Output (one DAG run in the Airflow UI).
2026-05-26 07:00:00 sense_source ── success (waited 0m17s)
2026-05-26 07:00:17 load_raw ── success (loaded 14,532 rows)
2026-05-26 07:00:42 transform ── success (dbt Cloud run 9876)
2026-05-26 07:03:11 quality_check ── success (14,532 rows >= 100)
2026-05-26 07:03:12 publish ── success
Total elapsed: 3m12s
Rule of thumb: every senior airflow dag ends with sense → do work → check → notify; tasks that skip the check step are why pipelines silently produce wrong dashboards.
Python
Topic — etl
Airflow + ETL drills
Python
Topic — design
Workflow design problems
3. Transformation deep-dive — dbt (with SQLMesh sidenote)
dbt (data build tool) is the in-warehouse transformation standard, and its design choice — every transformation is a SELECT statement compiled by dbt and executed by the warehouse — is what made it the second-most-tested tool after Airflow in modern data engineering interview questions. dbt does not move data across systems; it never spins up its own compute. The warehouse runs the SQL, dbt orchestrates the compilation, dependency graph, materialization, testing, and documentation. If Airflow is the when, dbt is the what.
This section unpacks the dbt project anatomy every interview round tests: models, sources, refs, the four materializations (view, table, incremental, ephemeral), the canonical staging → intermediate → marts layering, tests, snapshots, dbt Cloud vs dbt Core, and dbt Mesh. It closes with a SQLMesh sidenote — the modern challenger that adds first-class column-level lineage and virtual-environment dev/prod separation.
Models, refs, and sources — the three primitives of every dbt project
A dbt model is a .sql file containing a single SELECT. dbt compiles each model with ref() and source() macros to resolve dependencies and physical table names at runtime.
-
model—models/staging/stg_orders.sqlcontainsselect order_id, ... from {{ source('raw', 'orders') }}. dbt compiles this intoselect order_id, ... from raw.ordersand runsCREATE OR REPLACE VIEW analytics.stg_orders AS ...(or table, depending on materialization). -
source—sources.ymldeclares external tables that dbt does not own.{{ source('raw', 'orders') }}resolves toraw.ordersand letsdbt source freshnesscheck that the upstream data is fresh. -
ref—{{ ref('stg_orders') }}resolves to the physical table name of another model. Refs are how dbt builds the DAG; everyref()becomes an edge in the dependency graph.
Why ref() is the load-bearing primitive.
-
Environment isolation —
dev,staging,prodall resolveref()to the right schema (e.g.dev_alice.stg_ordersvsanalytics.stg_orders). -
Dependency graph —
dbt runruns models in topological order based onref()calls; you never hand-code an execution order. - No hard-coded table names — refactoring a model name updates every downstream consumer automatically.
Materializations — view, table, incremental, ephemeral
The dbt materialization decides how the compiled SQL becomes a physical artifact in the warehouse. Four choices cover 99% of projects.
-
view—CREATE OR REPLACE VIEW analytics.stg_orders AS SELECT .... Fastest to materialise (just metadata); slow to query (re-runs every time). Default for staging. -
table—CREATE OR REPLACE TABLE analytics.fct_orders AS SELECT .... Slowest to materialise (full scan + write); fastest to query. Default for marts. -
incremental— first run creates a table; subsequent runsMERGEonly new / changed rows. Required for any model on top of anappend-onlyraw table with >100M rows. Driven byis_incremental()macro and aunique_key. -
ephemeral— no physical artifact; the SQL is inlined as a CTE in downstream models. Saves storage; loses observability. Niche.
The incremental shape every dbt interview tests.
{{ config(
materialized = 'incremental',
unique_key = 'order_id',
on_schema_change = 'fail'
) }}
select
order_id,
customer_id,
amount,
order_ts,
current_timestamp() as dbt_loaded_at
from {{ source('raw', 'orders') }}
{% if is_incremental() %}
where order_ts > (select max(order_ts) from {{ this }})
{% endif %}
-
{% if is_incremental() %}— runs only after the first build; gates theWHEREfilter that bounds the merge to new rows. -
unique_key = 'order_id'— the natural key dbt uses toMERGE(UPDATEif exists,INSERTotherwise). -
on_schema_change = 'fail'— refuses to deploy a model whose schema changed without explicit migration; the safest default.
Staging → intermediate → marts — the canonical dbt layering
Every senior dbt project enforces the same three-layer shape.
-
staging(stg_*) — one-to-one with each source table; renames columns, casts types, light cleaning. Materialised asview. Naming:stg_<source>__<entity>(e.g.stg_stripe__charges). -
intermediate(int_*) — composes staging models; pre-joins, pre-aggregates, derives flags. Materialised asvieworephemeral. Hidden from the BI tool. -
marts(fct_*/dim_*) — business-facing facts and dimensions; the only layer BI tools query. Materialised astableorincremental. Naming:fct_orders,dim_users,fct_funnel.
Why the layering matters in interviews.
- Single source of truth — every concept lives in exactly one mart; downstream consumers reuse it.
-
Refactor safety — renaming a source column updates only the matching
stg_*model; downstream consumers don't break. -
Test isolation — schema tests on
stg_*catch source-data regressions early; tests onfct_*catch business-logic regressions.
Tests, snapshots, seeds — the dbt quality toolkit
-
tests— generic (unique,not_null,accepted_values,relationships) declared inschema.yml; or singular (tests/*.sql) one-offSELECTstatements that fail if any row is returned. Run withdbt test. -
snapshots— type-2 SCD (slowly changing dimension) capture;dbt snapshotwrites a row every time tracked columns change; gives you "what did this row look like on 2026-03-01?" queries. -
seeds— CSV files committed to the repo and loaded as tables viadbt seed; perfect for small reference tables (country_code → region, fee-tier lookups).
dbt Cloud vs dbt Core
-
dbt Core— the OSS CLI; free; runs locally or under Airflow / Dagster / Prefect. The base layer. -
dbt Cloud— the managed SaaS; adds an IDE, scheduling, hosted docs, CI/CD on PR, semantic layer, and lineage UI. Priced per-developer-seat (~$50/seat/month) plus a Team / Enterprise jump for advanced features. -
dbt Mesh— multi-project, multi-team support; lets amarketingproject depend on afinanceproject's published models via cross-projectref(). The 2024+ enterprise pattern.
SQLMesh — the modern challenger
SQLMesh is a newer transformation tool that fixes two dbt pain points: (1) first-class virtual environments so dev and prod share data physically but isolate logically (no more expensive dev clones), and (2) column-level lineage baked into the planner. The teaching shape is similar (models, refs, materializations), but the planner is more aggressive — it produces a plan you review before applying, like Terraform. Worth knowing for 2026 greenfield builds; dbt still wins on ecosystem and hiring market.
Worked example — a three-layer dbt project for an e-commerce funnel
Detailed explanation. This worked example ships the minimum production-shaped dbt project: one source declaration, two staging models, one intermediate model, one fact mart, and one schema test. The pattern is what every senior dbt project looks like — staging is thin and renames-only; intermediate composes; marts are business-facing.
Question. Given a raw schema with raw.orders(order_id, customer_id, amount_cents, status, order_ts) and raw.customers(customer_id, region, signup_ts), build a dbt project that produces analytics.fct_paid_orders_by_region with one row per region containing paid_orders and paid_revenue_usd.
Input.
| Source table | Columns | Volume |
|---|---|---|
raw.orders |
order_id, customer_id, amount_cents, status, order_ts |
~10M / day |
raw.customers |
customer_id, region, signup_ts |
~100M total |
Code.
models/staging/stg_orders.sql
{{ config(materialized='view') }}
select
order_id::varchar as order_id,
customer_id::varchar as customer_id,
amount_cents / 100.0 as amount_usd,
lower(status) as status,
order_ts::timestamp as order_ts
from {{ source('raw', 'orders') }}
models/staging/stg_customers.sql
{{ config(materialized='view') }}
select
customer_id::varchar as customer_id,
region::varchar as region,
signup_ts::timestamp as signup_ts
from {{ source('raw', 'customers') }}
models/intermediate/int_orders_enriched.sql
{{ config(materialized='ephemeral') }}
select
o.order_id,
o.customer_id,
o.amount_usd,
o.status,
o.order_ts,
c.region
from {{ ref('stg_orders') }} o
left join {{ ref('stg_customers') }} c
on c.customer_id = o.customer_id
models/marts/fct_paid_orders_by_region.sql
{{ config(materialized='table') }}
select
region,
count(*) as paid_orders,
sum(amount_usd) as paid_revenue_usd
from {{ ref('int_orders_enriched') }}
where status = 'paid'
group by region
models/marts/schema.yml
version: 2
models:
- name: fct_paid_orders_by_region
description: "Paidorders+revenueaggregatedbyregion."
columns:
- name: region
tests: [unique, not_null]
- name: paid_orders
tests: [not_null]
- name: paid_revenue_usd
tests: [not_null]
Step-by-step explanation.
-
stg_ordersis aviewthat renamesamount_cents → amount_usdand lowercasesstatus— pure normalisation, no joins. -
stg_customersis aviewthat re-typescustomer_idandsignup_ts— pure normalisation. -
int_orders_enrichedisephemeral— it joins the two staging models but never materialises; the mart inlines its CTE. -
fct_paid_orders_by_regionis atable— filters tostatus = 'paid', groups by region, sums revenue. The only model the BI tool queries. -
schema.ymldeclares three tests on the mart —unique+not_nullonregion,not_nullon the two metrics.dbt testfails the run if any test fails.
Output (the mart after dbt build).
| region | paid_orders | paid_revenue_usd |
|---|---|---|
| US | 14,532 | 1,247,850.00 |
| EU | 9,128 | 765,432.50 |
| APAC | 5,341 | 412,108.75 |
Rule of thumb: a senior dbt review reads three files — sources.yml (where data comes from), models/marts/*.sql (what the business sees), and schema.yml (what we promise is true). Everything else is plumbing.
SQL
Topic — joins
SQL joins drills
Data modeling
Topic — dimensional-modeling
dbt + dimensional modeling
4. EL / SaaS ingest deep-dive — Fivetran (with Airbyte + Stitch sidenotes)
Fivetran is the saas data ingest tool that taught the industry to say "EL, not ETL" — by separating extract + load (its job) from transform (dbt's job), Fivetran made ingest a buy-not-build decision for the first time. If you've ever copied a Salesforce API client into your repo and watched it rot, you understand the fivetran value proposition in one sentence: every connector you don't write is engineer-time saved, every schema-drift bug you don't debug is a Friday night saved.
This section unpacks the fivetran internals every modern data engineering interview questions round circles back to: the connector model, MAR (Monthly Active Rows) billing, schema drift handling, incremental sync, change data capture (CDC), when Fivetran beats a hand-rolled ingest, and the Airbyte + Stitch alternatives. It closes with a worked example that wires a Salesforce → Snowflake sync with a downstream dbt staging layer.
Connector model — 500+ pre-built integrations as the moat
A fivetran connector is a managed integration that handles three things on your behalf: authentication (OAuth, API keys, IAM roles), schema discovery (introspecting the source API to derive a table layout), and incremental sync (only pulling rows that changed since the last run). 500+ pre-built connectors span SaaS apps (Salesforce, HubSpot, Stripe, Zendesk), databases (Postgres CDC, MySQL, SQL Server, Oracle), event streams (Kafka, Kinesis), and files (S3, Google Drive).
The connector contract.
-
Authentication —
Fivetranstores credentials in its control plane; you configure once via the web UI. -
Schema discovery — connector introspects the source (Salesforce describe API, Postgres
information_schema) and creates one destination table per source object, with columns matching the source schema. -
Incremental sync — every connector tracks a per-table
cursor(a timestamp, a change-log offset, or a CDC LSN) and only pulls rows after that cursor on each run. - Sync frequency — configurable from 1 minute (Business Critical) to 24 hours; default 5 minutes.
MAR — Monthly Active Rows billing, explained
MAR (Monthly Active Rows) is Fivetran's billing unit and the source of every "why is our Fivetran bill huge?" Slack thread. A row counts as active in a month if it is inserted, updated, or deleted during that month — not the total row count of the table.
- A 10M-row dimension that doesn't change in a month = 0 MAR.
- A 1M-row order table where every row updates once = 1M MAR.
- A 100k-row table that fully refreshes daily = 3M MAR (100k × 30 days).
The four price tiers.
| Tier | What it adds | Pricing slope |
|---|---|---|
| Free | 500k MAR / month, 5-min sync | $0 |
| Starter | All Free + standard connectors | ~$1 per 1k MAR (declining with volume) |
| Standard | + Database connectors + 1-min sync | ~$1.50 per 1k MAR |
| Enterprise | + Business Critical + advanced security | ~$2 per 1k MAR |
MAR optimisation patterns.
-
Disable columns you don't need —
Fivetranlets you de-select columns per table; fewer columns sometimes means fewer MAR (depends on the connector). - Disable tables you don't need — every selected table contributes MAR; audit quarterly.
-
Prefer history mode off —
history modedoubles MAR (tracks updates as new rows); only enable on slowly-changing dimensions you need to audit. -
Use Postgres CDC carefully — high-write Postgres tables can blow MAR; consider
Airbyte CDC(cheaper) or a hand-rolledDebezium+Kafkapipeline for very high write volumes.
Schema drift handling — the load-bearing differentiator
Schema drift is what happens when a source adds a column, renames a column, changes a column's type, or drops a column. Hand-rolled ingest scripts break loudly when this happens; Fivetran handles it gracefully.
-
Added column —
Fivetranadds the column to the destination table on the next sync. -
Renamed column —
Fivetrantreats it as a new column + a dropped column (the old data stays under the old name; new writes go to the new name). -
Type change —
Fivetranwidens the destination type (e.g.INT → BIGINT,VARCHAR(50) → VARCHAR(255)) without dropping data. -
Dropped column — column stays in the destination, marked as deleted in
Fivetran's sync log; downstream models keep working until you explicitly drop them.
Incremental sync + CDC — what Fivetran actually does on each run
-
Cursor-based — for SaaS APIs with a
updated_atcolumn;FivetranqueriesWHERE updated_at > cursor. -
Log-based CDC — for
Postgres,MySQL,SQL Server;Fivetranreads the WAL / binlog / transaction log and replays inserts, updates, deletes. - Append-only — for event streams; every event is a new row.
- Full table refresh — fallback for tiny tables or sources without a cursor; the most expensive in MAR terms.
When Fivetran beats hand-rolled ingest — the four trigger conditions
- >5 SaaS sources — the engineer-hours saved on auth, schema, retries dominate the licensing cost.
-
Schema drift is frequent — Salesforce admins add custom fields constantly;
Fivetranhandles it. -
You don't have an on-call rotation —
Fivetranis the on-call. -
The MAR math works — pencil in $1 / 1k MAR; if MAR × $1 is less than 0.25× of an engineer-month,
Fivetranwins.
Airbyte + Stitch sidenotes
-
Airbyte— open-sourceFivetranalternative; 350+ connectors; self-host on Kubernetes (free) orAirbyte Cloud(managed). Strength: full source/code access, cheaper at high MAR. Weakness: thinner connector polish; more ops overhead. -
Stitch— the budget option (acquired by Talend in 2018); fewer connectors (~150); flat per-row pricing; perfect for tiny ingest needs.
Worked example — wiring Salesforce + Postgres CDC to Snowflake via Fivetran + dbt
Detailed explanation. The canonical 2026 ingest pattern is Fivetran lands raw → dbt staging cleans → dbt marts publish. This worked example wires two connectors (Salesforce + Postgres CDC), shows the destination table shape Fivetran lands, and shows the matching dbt stg_* model that turns the raw Fivetran output into a usable shape.
Question. A team has Salesforce (Accounts + Opportunities) and a Postgres application database (Users + Orders). Land everything into Snowflake via Fivetran, then build a stg_* layer in dbt that renames _FIVETRAN_* columns and lowercases ID prefixes.
Input.
| Source | Destination schema | Sync mode | Estimated MAR / month |
|---|---|---|---|
| Salesforce Accounts | raw_salesforce.accounts |
Cursor-based (SystemModstamp) |
50k |
| Salesforce Opportunities | raw_salesforce.opportunities |
Cursor-based | 200k |
Postgres users
|
raw_app.users |
Log-based CDC | 30k |
Postgres orders
|
raw_app.orders |
Log-based CDC | 4.5M |
Code (dbt staging model that cleans the Fivetran raw output).
-- models/staging/stg_salesforce__accounts.sql
{{ config(materialized='view') }}
select
id as account_id,
name as account_name,
industry,
annualrevenue::numeric as annual_revenue,
billingcountry as billing_country,
createddate::timestamp as created_at,
systemmodstamp::timestamp as updated_at,
_fivetran_synced::timestamp as dbt_loaded_at
from {{ source('raw_salesforce', 'accounts') }}
where _fivetran_deleted is false
-- models/staging/stg_app__orders.sql
{{ config(materialized='view') }}
select
id as order_id,
user_id,
amount_cents / 100.0 as amount_usd,
status,
created_at::timestamp as order_ts,
_fivetran_synced::timestamp as dbt_loaded_at
from {{ source('raw_app', 'orders') }}
where _fivetran_deleted is false
Step-by-step explanation.
-
Fivetranlandsraw_salesforce.accountswith the source column names plus three system columns:_fivetran_synced(when the row was loaded),_fivetran_deleted(soft-delete flag),_fivetran_id(deterministic primary key). -
dbt stg_salesforce__accountsrenames columns to snake_case, casts types, and filters out soft-deleted rows withwhere _fivetran_deleted is false. -
Postgres CDC behaves identically —
Fivetranlandsraw_app.orderswith the same three system columns; thestg_app__ordersview rents the same shape. -
MAR math —
50k + 200k + 30k + 4.5M = 4.78M MAR / month≈$4,780on the Standard tier (~$1.50 / 1k MAR with volume discount lands closer to$3.5k–4k). -
Downstream — every
int_*andfct_*modelref()s thestg_*views; the rest of the pipeline is identical to a hand-rolled ingest world.
Output (Snowflake INFORMATION_SCHEMA.TABLES after a few syncs).
TABLE_SCHEMA | TABLE_NAME | ROW_COUNT | BYTES
RAW_SALESFORCE | ACCOUNTS | 8,432 | 12 MB
RAW_SALESFORCE | OPPORTUNITIES | 31,587 | 48 MB
RAW_APP | USERS | 1,204,332 | 380 MB
RAW_APP | ORDERS | 28,402,118 | 12.4 GB
ANALYTICS | STG_SALESFORCE__ACCOUNTS | (view)
ANALYTICS | STG_APP__ORDERS | (view)
Rule of thumb: Fivetran wins when the source surface is wide (>5 connectors) and the team is small (<10 engineers); a hand-rolled ingest wins when the source surface is narrow, the volume is extreme, or the MAR math doesn't pencil.
SQL
Topic — etl
ETL ingest drills
SQL
Topic — data-transformation
Data-transformation patterns
5. Full ETL platforms — AWS Glue · Talend · Informatica
Full ETL platforms bundle ingest, transformation, orchestration, and governance into a single product — trading the composability of the modern Fivetran + dbt + Airflow stack for a one-vendor-throat-to-choke shape. Three platforms dominate this category, each anchored in a different ecosystem: AWS Glue (the AWS-native serverless Spark choice), Talend (the long-tenured GUI-first mid-market option), and Informatica (the Fortune-500 enterprise incumbent). The etl tools comparison question "should we go composable or all-in-one?" almost always lives here.
This section unpacks each platform's architecture: Glue (DPU autoscale, Glue Data Catalog, Glue Crawler, Glue Studio), Talend (Open Studio OSS, Talend Cloud SaaS, the Talend Studio drag-drop designer, Java code generation), and Informatica (IDMC SaaS, PowerCenter legacy, IPU consumption pricing, the Cloud Data Integration console). Each platform gets a worked example so the etl architecture trade-offs land as concrete picks.
AWS Glue — serverless Spark + Data Catalog + Crawler
AWS Glue is AWS's serverless spark etl service. You write PySpark (or Scala Spark) jobs, AWS provisions Spark clusters on demand, and you pay per DPU-hour (Data Processing Unit hour, 4 vCPU + 16 GB RAM). Four primitives carry the platform.
-
Glue Jobs— PySpark / Scala Spark scripts that read from S3 / RDS / Redshift / JDBC, transform, write back.Glue 4.0runs Spark 3.3 with Iceberg / Hudi / Delta support baked in. -
Glue Crawlers— scan an S3 prefix, infer schema, register the result as a table in theGlue Data Catalog. Run on a schedule or trigger. -
Glue Data Catalog— Hive-metastore-compatible metadata layer; every AWS analytics tool (Athena,EMR,Redshift Spectrum,Lake Formation) reads schema from it. -
Glue Studio— low-code visual editor for authoring Spark jobs; generates the underlying PySpark for you.
Glue pricing — $0.44 / DPU-hour (US-East), minimum 1-minute billing. A job that runs for 10 minutes on 10 DPUs costs 10 × (10/60) × $0.44 = $0.73. Autoscale (Glue 4.0) lowers idle DPU cost; flex execution discounts non-urgent jobs by ~30%.
When Glue wins.
-
Pure-AWS stacks — Glue +
Step Functions+Lake Formation+Athenakeeps everything inside one cloud. - S3-as-lakehouse workloads — Glue is the de-facto S3 Spark engine.
- Bursty workloads — serverless billing means you pay zero between runs.
Talend — GUI drag-drop + Java codegen + Open Studio vs Enterprise
Talend is the long-tenured (2005-founded; Qlik-acquired 2023) gui etl platform. The signature Talend Studio desktop app lets engineers drag components onto a canvas, draw arrows between them, and click Run — Talend generates Java code under the hood and executes it.
-
Talend Open Studio— the free OSS edition; runs on your laptop or any JVM; produces standalone.jarfiles you deploy where you like. -
Talend Data Integration(Enterprise) — the paid tier; adds scheduling, monitoring, change-management, MDM, team collaboration. -
Talend Cloud/Data Fabric— the SaaS replatform; cloud-managed runtime, governance, lineage; replaces the legacy on-premTAC(Talend Administration Center). -
Talend Studiocanvas — components liketFileInputDelimited,tMap,tFilterRow,tAggregateRow,tDBOutput; thetMapcomponent is the heart of every real Talend job (joins + transforms + conditional routing in one tile).
When Talend wins.
- Low-code mid-market — non-Python teams that want point-and-click ETL.
- On-prem + cloud hybrid — Talend Studio runs anywhere a JVM does.
- Mature data-governance needs — Talend Data Catalog + MDM is integrated.
Informatica — IDMC SaaS + PowerCenter legacy + IPU billing
Informatica has been the enterprise etl incumbent since 1993. Two products dominate today.
-
PowerCenter— the legacy on-prem product; massive installed base in Fortune 500 finance / healthcare / telco; mainframe + SAP integration is unmatched. Slowly being migrated to IDMC. -
Informatica IDMC(Intelligent Data Management Cloud) — the modern SaaS replacement; unified console forCloud Data Integration(the ETL service),Cloud Data Governance and Catalog,Cloud Mass Ingestion(the Fivetran-style EL competitor),Cloud Data Quality, andCloud MDM.
IDMC pricing — IPU (Informatica Processing Unit) consumption. One IPU is a unit of compute consumption; the IDMC console shows you IPU usage per service. Prepaid IPU bundles (e.g. 50k IPU / year) are the typical procurement shape; enterprise contracts run from $100k / year to multi-million for Fortune 50 deployments.
When Informatica wins.
-
Regulated industries —
Informatica's governance, lineage, and catalog satisfy banking / healthcare auditors out of the box. -
Mainframe + SAP integration —
PowerCenter's legacy connectors are unmatched. - VP-of-Data pickability — single vendor, single contract, single throat to choke.
-
Migrating from
PowerCenter— IDMC is the only path that re-uses existing mappings and skills.
Comparison at a glance
| Aspect | AWS Glue | Talend | Informatica |
|---|---|---|---|
| Origin | 2017 (AWS) | 2005 (OSS GUI) | 1993 (enterprise) |
| Surface API | PySpark + Glue Studio visual | Drag-drop Talend Studio (Java codegen) | Drag-drop IDMC + PowerCenter |
| Compute model | Serverless Spark (per DPU-hour) | JVM runtime (self or Cloud-managed) | IDMC SaaS (consumption) |
| Catalog | Glue Data Catalog (Hive-compatible) | Talend Data Catalog | IDMC Data Governance + Catalog |
| Pricing unit | $/DPU-hour | Per-user / per-server license + SaaS | IPU consumption |
| Cloud lock | AWS-only | Multi-cloud + on-prem | Multi-cloud + on-prem |
| Open source | No | Open Studio only | No |
| Strongest at | AWS-native S3 Spark ETL | Mid-market GUI ETL | Regulated enterprise governance |
| Weakest at | Non-AWS targets | Cloud-native polish vs Fivetran | Cost + learning curve |
Worked example — costing a 1 TB / day pipeline three ways
Detailed explanation. The fairest etl tools comparison is a head-to-head cost exercise on identical workload. This worked example takes a single 1 TB / day pipeline — 8 SaaS sources, 1 large Postgres DB, 3 transformation stages, 1 Snowflake target — and prices it on all three full-ETL platforms.
Question. Cost the same workload (1 TB / day raw, 8 SaaS sources, 1 Postgres CDC source, 3 transformation steps, Snowflake target) on (a) AWS Glue + Step Functions, (b) Talend Cloud, (c) Informatica IDMC. Use US-East list prices, 2026.
Input.
| Workload component | Volume / day | Notes |
|---|---|---|
| 8 SaaS source ingests | 4M MAR-equivalent | Salesforce, Stripe, Shopify, etc. |
| 1 Postgres CDC ingest | 4M MAR-equivalent | Order events |
| 3 transformation steps | 6 DPU-hours / day | Joins, aggregations, dedupe |
| Snowflake target | 1 TB / day landed | Warehouse cost excluded |
Code (the rough monthly cost math).
# AWS Glue + Step Functions (pure AWS)
# Ingest: 8 sources × 0.5 DPU-hr × 30 days = 120 DPU-hr -> $52.80
# CDC: 1 source × 1.0 DPU-hr × 30 days = 30 DPU-hr -> $13.20
# Transform: 6 DPU-hr × 30 days = 180 DPU-hr -> $79.20
# Step Fns: negligible (~$1)
# ---------------------------------------------
# Glue total: ~$145 / month (compute only; S3 + Snowflake separate)
# Talend Cloud
# Talend Cloud Data Integration: 3 named users × $1,170/mo = $3,510
# (or volume-based equivalent; pricing varies by deal)
# Talend runtime VMs (or Cloud runtime): ~$500 / mo
# ---------------------------------------------
# Talend total: ~$4,000 / month
# Informatica IDMC
# Estimated IPU consumption: ~8k IPU / month
# Pre-paid bundle: ~$4 / IPU
# ---------------------------------------------
# Informatica total: ~$32,000 / month (entry enterprise; quotes vary 2-5x)
Step-by-step explanation.
-
Glueis the cheapest in raw compute — at$0.44 / DPU-hour, 330 DPU-hours / month is ~$145. The catch: you wrote the PySpark, you own the operations, you fix the failures. -
Talend Cloudlands around ~$4k / month for a small team — three named users on Data Integration plus a runtime VM. The GUI saves engineer-time; the licensing is the cost. -
Informatica IDMCis the most expensive — entry deployments start around $30k / month and scale to $250k+ for Fortune 50. You're paying for governance, catalog, lineage, and a vendor relationship. - Why the spread is 200× — Glue is compute-only and you do the work; Informatica is a managed platform with governance built in.
-
Real picks —
Gluefor engineering-led AWS teams,Talendfor mid-market GUI shops,Informaticafor regulated Fortune 500 enterprises.
Output (monthly cost summary).
| Platform | Monthly cost (entry) | Engineer-hours / month | Total $ + opportunity cost |
|---|---|---|---|
| AWS Glue | ~$145 | ~40 hrs (own the PySpark) | ~$5–7k loaded |
| Talend Cloud | ~$4,000 | ~10 hrs (GUI maintenance) | ~$5–6k loaded |
| Informatica IDMC | ~$32,000 | ~5 hrs (vendor handles most) | ~$33–34k loaded |
Rule of thumb: Glue is "cheap if your team is senior"; Talend is "moderate, flat, predictable"; Informatica is "expensive but the auditors smile" — and the right pick depends on which line of that sentence describes your company.
SQL
Topic — etl
Full ETL practice
SQL
Topic — data-aggregation
Data aggregation drills
6. Pricing and licensing models — comparing 6 cost shapes
etl tools pricing is where every etl tool selection decision gets real — and where most procurement decks lie because the six tools use six different billing units that don't compare apples-to-apples. Airflow is free OSS; dbt Core is free OSS; dbt Cloud is per-seat; Fivetran is MAR (Monthly Active Rows); Glue is $ / DPU-hour; Talend is per-user-license + runtime; Informatica is IPU consumption. To compare, you have to flatten every cost into a monthly dollar figure on the same workload.
This section walks the six cost shapes, then runs the canonical workload (1 TB / day, 8 SaaS sources, 3 transform stages) across all six tools so the etl pricing comparison lands as a single side-by-side table you can show your VP.
The six cost shapes — what each tool actually charges for
-
Airflow (OSS) — free software; you pay for the underlying infrastructure (Kubernetes cluster, metadata Postgres, scheduler / webserver / workers). Self-hosted ~$500–2,000 / month for small teams; managed (
MWAA,Astronomer) ~$500–5,000 / month. - dbt Core (OSS) — free CLI; you run it under Airflow / Dagster; effectively zero incremental cost.
-
dbt Cloud (per-seat) —
Developer(free, single seat),Team(~$100 / developer / month),Enterprise(custom; usually $1,500+ / month all-in for SSO, MS Teams, etc.). -
Fivetran (MAR) — volume tiers; rough
$1–2 per 1k MARwith declining unit cost at scale; a 5M MAR workload lands ~$3,500–5,000 / month. -
AWS Glue (DPU-hour) —
$0.44 / DPU-hour; minimum 1-minute billing; serverless autoscale. A heavy daily pipeline runs ~$100–500 / month. -
Talend Cloud (per-user + runtime) —
Data Integrationnamed-user licenses (~$1,170 / user / month) plus Talend Cloud runtime; small teams ~$3–5k / month. - Informatica IDMC (IPU) — pre-paid IPU bundles; ~$3–5 / IPU consumed; mid-size enterprise ~$30k+ / month, large enterprise $250k+ / month.
The licensing posture each cost shape implies
-
OSS + infra (
Airflow,dbt Core) — flat infra cost, scales with cluster size, predictable. The "if our engineers are senior" pick. -
Per-seat (
dbt Cloud) — flat per developer; predictable; doesn't scale with data volume. The "10-engineer team" pick. -
Volume (
MAR,DPU-hour,IPU) — variable with workload; great when small, painful when big. The "we'll grow into the bill" pick. -
Per-license (
Talend) — flat per named user; predictable; doesn't scale with data volume but does scale with team. The "non-engineering teams co-build" pick.
Worked example — 1 TB / day, 8 SaaS sources, 3 transforms across all six tools
Detailed explanation. The fairest pricing comparison is a single workload priced across every option. We use the same workload as §5's worked example (1 TB / day, 8 SaaS + 1 CDC source, 3 transform stages, Snowflake target) and add Airflow orchestration plus dbt transformation rows.
Question. Estimate the monthly tooling cost (compute + license, excluding Snowflake) for the canonical workload across all six tools (or stacks involving them).
Input. Same workload as §5.
| Component | Volume / month |
|---|---|
| SaaS source MAR | ~5M MAR |
| Postgres CDC MAR | ~4M MAR |
| Glue DPU-hours | ~330 DPU-hr |
| dbt models | ~80 models, 6 runs / day |
| Airflow DAGs | ~10 DAGs, 24 runs / day |
| Team size | 3 data engineers |
Code (cost shape per tool / stack).
# Tool / stack # Monthly cost (entry)
Airflow (MWAA mw1.medium) ~$500
Airflow (Astronomer Hosted Standard) ~$1,500
dbt Core under Airflow $0
dbt Cloud (Team, 3 devs) ~$300
Fivetran Standard (9M MAR) ~$3,500-5,000
AWS Glue (330 DPU-hr / month) ~$145
Talend Cloud (3 named users + runtime) ~$4,000
Informatica IDMC (entry mid-size) ~$32,000
# Three canonical 2026 stacks:
# Stack A: Fivetran + dbt Cloud + MWAA ~$4,300-5,800 / mo
# Stack B: Glue + Step Functions only ~$200-300 / mo
# Stack C: Informatica IDMC end-to-end ~$32,000 / mo
Step-by-step explanation.
-
Stack A (modern data stack) —
Fivetran($4k) +dbt Cloud Team($300) +MWAA($500) = ~$4,800 / mo. Predictable, code-first, easy to hire for. -
Stack B (pure AWS) —
Glue($145) +Step Functions(~$1) +Glue Data Catalog(~$1) = ~$150 / mo. Cheapest by far; trades the buy with engineer-hours owning the PySpark. -
Stack C (legacy enterprise) —
Informatica IDMCend-to-end = ~$32k / mo. Buys governance, lineage, vendor relationship; 200× the cost of Stack A. -
Cost per TB ingested —
Fivetranis ~$160 / TB on this workload;Glueis ~$5 / TB;Informaticais ~$1,000 / TB. -
Hidden costs not shown — Snowflake compute (~$3–10k / month at this volume) is the same across all three stacks;
Airflowengineer-hours running self-hosted are ~10–20 hours / month;Informaticaadmin training and licensing add another 20–30%.
Output (apples-to-apples monthly bill summary).
| Stack | Tooling / month | Cost per TB ingested | Best for |
|---|---|---|---|
| Stack A — Fivetran + dbt Cloud + MWAA | ~$4,800 | ~$160 / TB | Modern data team, 3–20 engineers |
| Stack B — Glue + Step Functions | ~$150 | ~$5 / TB | Pure-AWS engineering-led shops |
| Stack C — Informatica IDMC | ~$32,000 | ~$1,000 / TB | Regulated enterprise, governance-first |
Rule of thumb: expect the modern data stack (Stack A) to be the default for greenfield 2026 builds, AWS-native (Stack B) for cost-pressured pure-AWS shops, and Informatica (Stack C) for regulated incumbents with existing PowerCenter footprints. Every other stack is a variant of one of these three.
SQL
Topic — etl
Cost-aware ETL drills
SQL
Topic — design
System-design ETL problems
7. Decision framework + 2026 production stack patterns
The etl tool selection decision is rarely "which is best" and almost always "which fits my team, ecosystem, and budget?" — so a decision framework beats a feature-matrix every time. This final section ships a two-question decision tree as ASCII art, then walks three realistic data engineering stack patterns you'll see in the wild: the modern data stack, the pure AWS stack, and the legacy enterprise stack. By the end you can defend a 2026 stack pick in a system-design round with the same confidence you defend an O(n) SQL plan.
The two-question decision tree
START — pick an ETL tool stack
|
v
Q1: Is your warehouse already cloud-native
(Snowflake / BigQuery / Databricks)?
|
+--------------+--------------+
| |
YES NO
| |
v v
Q2a: Is your team code-first Q2b: Is your team in a regulated
(Python / SQL / Git)? industry (banking / pharma /
| healthcare / insurance)?
+-----------+-----------+ |
| | +-------+--------+
YES NO | |
| | YES NO
v v | |
Stack A: Fivetran + Stack T: Talend v v
dbt + Airflow Cloud (GUI-first) Stack C: Stack B:
(modern data stack) Informatica AWS Glue +
IDMC Step Functions
(regulated (pure AWS,
enterprise) engineer-led)
Stack A — Fivetran + dbt + Airflow (the modern data stack)
The 2026 greenfield default. Pre-built Fivetran connectors land raw data into Snowflake / BigQuery / Databricks; dbt models clean, layer, and publish marts; Airflow orchestrates Fivetran syncs, dbt build, and downstream notifications.
- Strengths — code-first, Git-reviewable, easy to hire for, every layer is best-in-class.
-
Weaknesses —
Fivetran MARcost ramps with volume; you operate three tools, not one. - Reference cost — $3–6k / month for a small / mid-market team.
- Reference team size — 2–20 data engineers.
- Reference industries — SaaS, e-commerce, ad-tech, marketplaces, fintech.
Stack B — Glue + Step Functions + Glue Data Catalog (pure AWS)
The cost-conscious engineering-led AWS-native stack. Glue Crawlers register S3 schemas in the Glue Data Catalog; Glue Jobs run PySpark transforms; Step Functions orchestrates; Lake Formation adds row / column security; Athena / Redshift Spectrum queries the result.
- Strengths — cheapest by an order of magnitude, AWS-native security, zero idle cost.
-
Weaknesses — AWS-only, you own all the PySpark, less polish than
dbt. - Reference cost — $150–600 / month at modest volume.
- Reference team size — 3–10 senior engineers.
- Reference industries — AWS-native startups, ML-platform teams, gaming, IoT.
Stack C — Informatica IDMC + Airflow + cloud warehouse (legacy enterprise)
The Fortune-500 incumbent shape. IDMC Mass Ingestion replaces Fivetran for SaaS ingest; Cloud Data Integration runs ETL mappings; Cloud Data Governance ships lineage / catalog / quality as one console; Airflow runs alongside for non-Informatica jobs (Spark batches, custom Python).
- Strengths — governance / lineage / compliance auditors love it; mainframe + SAP connectors unmatched; single vendor relationship.
- Weaknesses — 10–100× the cost of Stack A; steeper learning curve; less Git-native dev workflow.
- Reference cost — $30k–500k / month depending on Fortune ranking.
- Reference team size — 20+ data engineers, dedicated Informatica admins.
- Reference industries — banking, insurance, healthcare, pharma, telco, public sector.
Picking — the staff-engineer one-liner
The senior interview answer always starts the same: "It depends on three things — your warehouse, your team profile, and your regulatory posture." If your warehouse is cloud-native and your team is code-first, Stack A. If you're AWS-only and cost-pressured, Stack B. If you're regulated and have a PowerCenter footprint, Stack C. Everything else is a variant.
SQL / Python
Topic — design
System-design practice
SQL
Topic — etl
End-to-end ETL drills
Choosing the right ETL tool (cheat sheet)
A one-screen cheat sheet for etl tool selection — pick the tool that matches your job-to-be-done.
| You want to … | Reach for | Notes |
|---|---|---|
| Orchestrate any kind of job |
Airflow (or Prefect / Dagster) |
DAGs as code; 200+ operators |
| Run SQL transforms inside a cloud warehouse |
dbt (Core or Cloud) |
Models, refs, materializations |
| Pull from 50+ SaaS sources without writing code | Fivetran |
MAR-billed; schema-drift safe |
| Same job as Fivetran but open-source | Airbyte |
350+ connectors; self-host |
| Run serverless Spark on AWS S3 | AWS Glue |
DPU-hour billing; Glue Data Catalog |
| Drag-and-drop ETL for a mixed-skill team | Talend Cloud |
Java codegen; named-user licenses |
| Enterprise governance + lineage + MDM | Informatica IDMC |
IPU consumption; PowerCenter migration path |
| Greenfield modern data stack | Fivetran + dbt + Airflow |
Stack A — the 2026 default |
| Pure-AWS cost-conscious stack | Glue + Step Functions + Lake Formation |
Stack B — cheapest at scale |
| Regulated enterprise rebuild | Informatica IDMC + Airflow |
Stack C — governance-first |
| Track data quality alongside any of the above |
Soda or Great Expectations
|
Layer 6 quality, not ETL itself |
| Pair Fivetran with a transform tool | Fivetran → dbt |
EL + T; the canonical 2026 ingest+transform |
| Replace dbt with a planner-style alternative | SQLMesh |
Virtual envs; column-level lineage |
| Add an asset-centric orchestrator | Dagster |
Software-defined assets; lineage-first |
| Cost a workload across tools |
MAR × $1 vs DPU-hr × $0.44 vs IPU × $4
|
Flatten to monthly $ on same workload |
| Avoid vendor lock-in | OSS stack: Airflow + dbt Core + Airbyte
|
Higher ops burden; full portability |
Frequently asked questions
Which ETL tool should I learn first as a data engineer in 2026?
Start with Airflow and dbt — together they cover the two highest-frequency interview topics and the two most-deployed tools in the modern data stack. Airflow gives you DAGs, operators, executors, sensors, and dynamic task mapping — the canonical orchestration vocabulary every senior round tests. dbt gives you models, refs, materializations, the staging → intermediate → marts layering, and tests — the canonical in-warehouse transformation pattern. After those two, learn Fivetran conceptually (you'll configure it through a UI, not code, so the learning curve is hours not weeks) and pick one full-ETL platform based on your target employers — AWS Glue if you're targeting cloud-native shops, Informatica or Talend if you're targeting regulated enterprises. Skip the "which is best" debate; the right answer in 2026 is "all of them serve different layers, and senior engineers know what each is for."
Is Airflow being replaced by Prefect or Dagster?
No, but the competitive field has matured. Airflow still dominates by ecosystem breadth — 200+ provider packages, every cloud, every operator, every interview prep deck. Prefect is a credible Python-first alternative with better dynamic-workflow ergonomics (task.map(...) is first-class), Prefect Cloud is the managed control plane, and many greenfield 2026 builds pick Prefect. Dagster is the most opinionated of the three; it elevates software-defined assets to first-class status so every materialised dataset is an asset and the DAG is the asset graph — strong for lineage and observability use cases. In interviews, Airflow is the default you should be fluent in; mentioning Prefect / Dagster as alternatives is a senior-signal move, but the loop will keep coming back to Airflow concepts (DAGs, operators, executors, sensors).
What's the practical difference between dbt and Fivetran?
dbt and Fivetran are not substitutes — they live in different layers of the pipeline and 2026's canonical stack runs both. Fivetran is EL / ingest — it pulls data from operational sources (Salesforce, Stripe, Postgres) into your warehouse, with pre-built connectors handling auth, schema discovery, and incremental sync; you never write SQL inside Fivetran. dbt is in-warehouse transformation — it runs SQL SELECT statements inside the warehouse to turn raw landed tables into clean marts; you never move data across systems with dbt. The canonical 2026 stack is Fivetran lands raw → dbt staging cleans → dbt marts publish → Airflow orchestrates. If anyone tells you to pick dbt vs Fivetran, they've misread the question — pick both, in different layers.
When does AWS Glue beat Fivetran + dbt for the same workload?
Three conditions. First, when your data is already on S3 (logs, click-streams, IoT, ML feature stores) — Glue Crawlers and Glue Jobs are built for S3-as-lakehouse workloads, while Fivetran has weaker S3-source connectors. Second, when you're AWS-only and your security team prefers IAM + Lake Formation over a third-party SaaS — Glue lives inside your VPC; Fivetran is external SaaS that needs source credentials. Third, when MAR math doesn't pencil — at very high write volumes (10s of millions of rows / day), Fivetran's MAR bill exceeds Glue's DPU-hour bill by 10–50×. The trade-off: Glue requires PySpark fluency and you own the operations, while Fivetran + dbt ships managed connectors and SQL transforms. The default for most teams is still Fivetran + dbt + Airflow (modern data stack), with Glue slotted in for specific S3-heavy workloads.
Is Informatica still relevant in 2026, or should we go cloud-native?
Informatica is still very relevant in regulated Fortune-500 enterprises — banking, healthcare, insurance, pharma, telco, public sector — and it isn't going away. PowerCenter still runs many Fortune-500 data warehouses; IDMC is the modern SaaS replacement that bundles Cloud Data Integration (ETL), Cloud Data Governance (catalog + lineage), and Cloud Mass Ingestion (the Fivetran-style EL competitor). The reason regulated enterprises stick with Informatica is the governance and lineage posture — auditors trust the platform, and the migration cost from PowerCenter to anything else is enormous. For greenfield builds in cloud-native sectors (SaaS, e-commerce, ad-tech, fintech, marketplaces), the cloud-native Fivetran + dbt + Airflow stack wins on cost, agility, and hiring. Pick Informatica when your industry forces it; pick the modern data stack when it doesn't.
How do I compare the price of ETL tools when they all use different billing units?
Flatten everything to monthly dollars on the same workload. Pick a single reference workload — for example, 1 TB / day landed, 8 SaaS sources, 1 Postgres CDC, 3 transform stages, 3 engineers — and compute the monthly cost across each option. Airflow OSS ≈ $500–1,500 / month (infra only); dbt Cloud Team ≈ $100 / developer / month; Fivetran Standard ≈ $1.50 per 1k MAR (with volume discounts at scale); AWS Glue ≈ $0.44 / DPU-hour × your DPU-hours; Talend Cloud ≈ $1,170 / named user / month plus runtime; Informatica IDMC ≈ $3–5 / IPU consumed. Then add up the stack: Fivetran + dbt + MWAA lands ~$4–6k / month; Glue + Step Functions lands ~$150–500 / month; Informatica IDMC end-to-end lands ~$30k+ / month. Compare on the same workload, and the apples-to-apples picture emerges immediately.
Practice on PipeCode
PipeCode ships 450+ data-engineering interview problems — including SQL, Python, ETL, data-modeling, aggregation, joins, window functions, dimensional modeling, and the system-design rounds where etl tool selection lands. Whether you're prepping for an Airflow + dbt modern-data-stack interview or a system-design round defending a Glue + Step Functions pick, the practice library mirrors the same four-category taxonomy this guide teaches.
Kick off via Explore practice →; drill the SQL practice lane →; fan out into the ETL practice lane →; reinforce data-modeling drills →; rehearse joins patterns →; sharpen aggregation problems →; reinforce Python practice →; or widen coverage on the full dimensional-modeling library →.