PhantomCodeAIPhantomCodeAI
FeaturesMock InterviewDashboardJobsPricing
FeaturesMock InterviewDashboardJobsPricing
HomeInterview QuestionsData Engineer
Updated for 2026 hiring loops

Data Engineer Interview Questions — What Senior Candidates Actually Get Asked

Data engineering interviews in 2026 look very different from the generic backend-engineer loop they were copied from a decade ago. The role has stabilised around three skill axes that every loop probes — deep SQL fluency (window functions, sweep-line patterns, query optimisation against real query plans), distributed data systems literacy (Spark, Kafka, Flink, and a lakehouse table format like Iceberg or Delta), and pipeline design taste (modeling, lineage, data contracts, quality assertions). The bar has moved up on all three. A SQL screen in 2018 was "find duplicates with GROUP BY HAVING"; today the median screen is a multi-step CTE with window functions and a follow-up about handling NULLs and ties. A system design round in 2018 was "draw a Kafka topic and a Spark job"; today you are expected to argue lakehouse vs warehouse, kappa vs lambda, and name the operational trade-offs of Iceberg vs Delta. The behavioural round increasingly probes data quality ownership and stakeholder management — because senior data engineers spend as much time saying "the right version of this request is different" as they do writing pipelines. What follows is the pattern of questions you should expect, with the structured answers interviewers grade against.

How data engineering loops are structured

  • SQL round. Almost always live SQL on a shared editor against a small fake schema. Expect window functions (LAG, LEAD, ROW_NUMBER, DENSE_RANK, PERCENTILE_CONT), CTEs, and at least one optimisation discussion. Talking out loud about the query plan matters more than getting the SQL syntactically perfect on the first pass.
  • Data system design. Usually one round, sometimes two for senior candidates. Pipelines, lakehouses, feature stores, real-time analytics, CDC ingestion are the common prompts. You drive — the interviewer will not feed you the architecture. Start with data volume, freshness SLA, query patterns, and consistency requirements before drawing anything.
  • Coding round (Python or Scala). Often a PySpark or pandas problem — read this file, dedupe by latest event_time per key, join with this dimension, write partitioned output. Tests basic data-manipulation fluency. Less algorithmically demanding than a SWE coding round, but you must know your dataframe API cold.
  • Behavioural / ownership round. Heavy emphasis on data quality incidents, model redesigns, and stakeholder pushback. The signal is whether you treat data quality and modeling as first-class engineering concerns rather than afterthoughts.

SQL Q&A

Window functions, sweep-line patterns, and optimisation. The interviewer is grading whether you reach for the right tool — and whether you ask about ties, NULLs, and dialect before writing the query.

Q1. Write a SQL query to find the second-highest salary in an Employee table. Then generalize to the Nth-highest.

This is one of the most common SQL screening questions because it tests window function fluency and edge-case thinking at the same time. There are three canonical approaches. The window function approach: SELECT salary FROM (SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rk FROM Employee) t WHERE rk = 2. The LIMIT/OFFSET approach: SELECT DISTINCT salary FROM Employee ORDER BY salary DESC LIMIT 1 OFFSET 1 — short but dialect-dependent and you must remember the DISTINCT or duplicates eat your offset. The correlated subquery approach (older but still asked in screens): SELECT MAX(salary) FROM Employee WHERE salary < (SELECT MAX(salary) FROM Employee). The real signal is how you handle ties — DENSE_RANK() treats two people earning the same top salary as both rank 1, so the second-highest distinct salary is rank 2; RANK() leaves gaps (1, 1, 3); ROW_NUMBER() arbitrarily picks one of the tied rows. Generalising to the Nth highest, swap the literal for a parameter and use DENSE_RANK() if you want distinct salary tiers, ROW_NUMBER() if you want the Nth specific employee. The L5 follow-up always lands on ties: ask the interviewer which version of Nth highest they want before you write a line.

Q2. Given a Sessions table with (user_id, session_start, session_end), write a query that returns the maximum number of concurrent sessions at any moment.

Sweep-line. Generate a +1 event at every session_start and a -1 event at every session_end, sort by timestamp, then take the running sum and pick its maximum. In SQL: WITH events AS (SELECT session_start AS ts, 1 AS delta FROM Sessions UNION ALL SELECT session_end AS ts, -1 AS delta FROM Sessions) SELECT MAX(running) FROM (SELECT ts, SUM(delta) OVER (ORDER BY ts, delta DESC ROWS UNBOUNDED PRECEDING) AS running FROM events) t. The ORDER BY ts, delta DESC matters — when a session ends at the same instant another begins, you want to count the overlap, so process the +1 before the -1 (or the opposite, depending on whether the interval is half-open). State the convention out loud. The common follow-up is concurrent sessions per minute: round the timestamp to the minute, group by minute, take the max within each group, or generate a calendar of minutes and self-join. The other follow-up is doing this at scale — on a 100B-row sessions table you cannot sort globally, so you partition by hour and process the boundary intervals separately.

Q3. Write a query that for each user returns the time gap between consecutive transactions. Then return users whose median gap exceeds 30 days.

Two-pass CTE pipeline. First pass uses LAG() to grab the previous transaction time per user: WITH gaps AS (SELECT user_id, txn_ts, DATEDIFF(day, LAG(txn_ts) OVER (PARTITION BY user_id ORDER BY txn_ts), txn_ts) AS gap_days FROM Transactions). The first transaction per user yields NULL from LAG — filter it out in the next step or PERCENTILE_CONT will skew. Second pass computes the median per user: SELECT user_id, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY gap_days) AS median_gap FROM gaps WHERE gap_days IS NOT NULL GROUP BY user_id HAVING PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY gap_days) > 30. Median is the tricky part — most dialects do not have a plain MEDIAN function. BigQuery and Postgres have PERCENTILE_CONT (continuous interpolation) and PERCENTILE_DISC (discrete pick); Snowflake has MEDIAN as a shortcut. If you are on a dialect without any of these, fall back to NTILE(2) or a self-rank approach: rank gaps per user, pick the middle one (or average the two middle ones for even counts). The senior signal is recognising you need a CTE pipeline, handling the LAG NULL, and naming the dialect-specific median function rather than waving at it.

Q4. Optimize a query that joins a 100B-row fact table with a 100M-row dimension table. The query is taking 30 minutes; reduce it to under 1 minute.

Walk through the toolkit in order of cheapest fix. (1) Partition pruning — the fact table should be partitioned by date (or whatever the high-cardinality filter dimension is); if the query has a date predicate, make sure partitions are actually being pruned by checking the query plan. (2) Broadcast join — 100M rows of dimension data is roughly a few GB depending on width, which fits in memory on a modern Spark executor; force a broadcast hint (BROADCAST in Spark, MAPJOIN in Hive, /*+ broadcast(d) */ in Trino) so the dimension ships to every executor instead of doing a shuffle. (3) Columnar storage — if the fact table is row-oriented (CSV, JSON, Avro for write-heavy paths), convert to Parquet or ORC; a 100B-row scan that only needs 5 columns drops by an order of magnitude on columnar formats. (4) Pre-aggregation — if the query is repeated, build a daily rollup table and query that instead; the L5 answer mentions materialised views or dbt incremental models. (5) File sizing — many small files kills Spark; aim for 128MB–1GB Parquet files. The L5 distinction is knowing when broadcast joins flip to sort-merge joins (when the dimension stops fitting in memory, usually around 8GB depending on cluster sizing) and when to switch the storage format entirely — Iceberg or Delta with Z-ordering on the join key beats raw Parquet on selective queries.

Data system design Q&A

Lakehouses, real-time analytics, feature stores. Start with volume, freshness, and query patterns before drawing boxes. Name the trade-offs — that is the senior signal.

Q1. Design a real-time analytics pipeline for an e-commerce site — events from web/mobile flow in, dashboards must show conversion funnels with under-5-minute latency.

Layered architecture. Ingest layer: web and mobile SDKs send events over HTTPS to a thin collector service that writes to Kafka (or Kinesis or Pub/Sub depending on cloud). Partition by user_id or session_id so events for the same session stay ordered. Stream processing: Flink or Spark Structured Streaming consumes the topic, does windowed aggregations (5-minute tumbling windows for funnel counts), enriches with dimension lookups (user attributes from a cached Redis layer), and writes to the serving store. Serving layer: Druid, ClickHouse, or Pinot — all three give you sub-second queries on freshly-ingested data and support the rollup-and-slice access pattern dashboards need. The harder problem is late-arriving events and event-time vs processing-time windowing. Mobile clients go offline and flush events hours later; if you window on processing time you mis-attribute. Use event-time windows with a watermark (Flink calls this allowed lateness) and decide policy: drop, side-output to a late-arrivals topic, or update previously-emitted windows. Lambda architecture (batch path correcting the speed path) was the standard answer in 2018; in 2026 most teams pick kappa (stream-only) because Flink with state checkpointing and replayable Kafka topics gives you correctness without the duplicated batch path. Mention the trade: kappa is operationally simpler but requires longer Kafka retention and burns more streaming compute on backfills.

Q2. Design a data lakehouse that supports 100PB of data, with both batch ETL and real-time queries.

Storage: object storage (S3, GCS, or Azure Blob) as the durable, cheap base layer — 100PB on S3 with intelligent tiering is the only economically sensible substrate. Table format: Apache Iceberg or Delta Lake (Hudi is the third option but loses on adoption). Both give you ACID semantics over object storage — atomic commits, schema evolution, time travel, hidden partitioning — without which you cannot safely run concurrent writers. Iceberg has the more open ecosystem (Trino, Spark, Flink, Snowflake all read it natively); Delta is tighter with Databricks. Batch processing: Spark for heavy ETL, dbt for SQL-based transformation orchestration on top of Trino/Snowflake/BigQuery — dbt is the modeling layer, not a compute engine, and it has won the transformation tier. Ad-hoc query: Trino or Athena for federated SQL across Iceberg tables. Streaming ingestion: Kafka into Flink writing Iceberg sinks (or a managed equivalent like Tabular's product); landing data in Iceberg lets streaming and batch share the same source of truth. Catalog: Hive Metastore is the legacy default; AWS Glue, Unity Catalog (Databricks), or Polaris (Snowflake) are the modern picks — the catalog choice is also a governance choice because it owns column-level lineage and access control. Data quality: dbt tests for declarative assertions, Great Expectations for richer profiling, and contracts at the producer side (Schema Registry for Kafka topics) so bad data never lands. The L5 thread is naming the trade-offs across these layers, not just listing tools.

Q3. Design a feature store for an ML platform — must support both training (point-in-time correct lookups across years of history) and online inference (sub-50ms feature retrieval).

Two storage tiers backed by one feature definition. Offline store: Parquet (or Iceberg) on S3, partitioned by event_time. Point-in-time correctness is the hard requirement — when generating training data for a model that predicts at t, every feature value must be as-of t, not the latest known value. This is an event-time join (sometimes called an as-of join), not a regular join. Implement via Spark with a join condition like feature.event_time <= label.event_time AND feature.event_time > label.event_time - lookback_window. Online store: Redis, DynamoDB, or ScyllaDB for sub-50ms lookups by entity_id (user_id, item_id). Keep only the latest value per (entity, feature) — online inference does not need history. Consistency: features computed in batch must match what is served online — this is the offline-online skew problem and it kills more model performance than any other operational issue. The clean solution is to define each feature once in a declarative DSL (Feast, Tecton, or in-house), then generate both the batch SQL/Spark job that backfills the offline store and the streaming job that updates the online store from the same definition. The dual-write problem still bites — if your batch backfill and your streaming update both write to the online store, you need ordering guarantees or you will overwrite fresh data with stale data. Solutions: write through a single ordered log (Kafka with topic compaction), or version every write with the event_time and have the online store reject older writes. Mention feature versioning (changing a feature's logic means a new feature name, not an in-place change, so old models keep working) and transformation reuse (the same join logic in training and serving).

Behavioural Q&A

Data quality incidents, model redesigns, and stakeholder pushback. Interviewers want specific stories with named trade-offs, not generalities.

Q1. Tell me about a data pipeline that broke in production.

Strong answers walk through the full incident lifecycle, not just the fix. Detection: how you found out — a data quality alert firing on a freshness check, a downstream dashboard showing impossible numbers (negative revenue, NULL conversion rate), an on-call page from a SLA breach. Diagnosis: how you traced it — checking lineage tools (dbt docs, OpenLineage, Marquez) to find the upstream source, running queries against intermediate tables to localise where the data went wrong, comparing schemas across runs to find a silent schema change. Immediate fix: backfill the affected window, replay the Kafka topic from an earlier offset, manually correct the dimension table while you investigate root cause. Durable fix: the part most candidates skip — what you changed so it does not recur. Add a dbt test or Great Expectations assertion for the failure mode (NOT NULL, unique, value-in-set, freshness), add an alert on the upstream schema change, add a contract at the producer side. The senior signal is owning data quality as a first-class concern rather than treating quality as an afterthought you handle once it has burned you. Avoid blaming the upstream team — even if they shipped the breaking change, your pipeline was the one that propagated bad data downstream, and your fix should make that harder regardless of upstream behaviour.

Q2. Describe a time you redesigned a data model.

Pick a real case and name what was wrong with the old schema. The shape: an existing model started causing problems — query times growing as data scaled, joins exploding (a five-way join across normalised tables that took 30 minutes), business logic becoming hard to express (every analyst writing the same 200-line CTE because the model did not match the question), or a slowly-changing dimension being handled wrong so historical reports drifted. You proposed an alternative — star schema (Kimball) for analytics-heavy workloads, Data Vault for highly regulated environments that need full auditability of source changes, One Big Table (OBT) for self-serve analytics where joins are the enemy. Migration matters more than the choice: incremental shadow writes (new model populated alongside old, validated by comparison queries for weeks), dual reads (dashboards switched one at a time with rollback in place), eventually retire the old. Measure the impact — query time dropped from 30 minutes to 90 seconds, analyst time-to-insight on common questions dropped from hours to minutes, the on-call alert volume on stale dimensions went to zero. Strong candidates name the trade-offs explicitly: star schema sacrifices source fidelity for query simplicity; Data Vault sacrifices query simplicity for auditability; OBT sacrifices storage efficiency and update simplicity for self-serve. What you gave up to gain what — say it out loud.

Q3. Tell me about a time you pushed back on a data request from a stakeholder.

Common at senior data engineering levels — managing demand from analytics, ML, and product teams is half the job. Pick a case where the literal ask was wrong but the underlying need was real. The pattern: the stakeholder asked for a specific table or dashboard ('give me a daily export of every user event'), you talked to them about the decision they were trying to make ('we want to know if engagement dropped after the redesign'), and you reframed — the real ask was a funnel metric with a comparison window, not a raw event dump. You set expectations on cost and time ('a daily 500GB export would cost roughly $X/month in egress and take two weeks of pipeline work; the funnel metric I am proposing is two days of work on existing tables'). You maintained the working relationship by making the trade-off visible rather than just refusing — give them options A, B, C with cost and time on each, and let them pick. Avoid making yourself the hero. Show collaboration: 'I sat down with the analyst, we whiteboarded what they actually needed, and we agreed the funnel metric was a better fit.' The interviewer is grading two things — can you say no to engineering work that should not exist, and can you do it without damaging the relationship you will still need next quarter.

Going deeper

Data engineering loops reward fluency that is hard to fake — you either know the window functions cold and have argued lakehouse vs warehouse before, or you do not. For SQL drills, the SQL interview questions deep dive walks through window functions, CTEs, and query optimisation patterns in detail. For the live coding portion of the loop — where you are writing PySpark or SQL under observation — PhantomCodeAI's coding copilot acts as a silent second set of eyes during the actual interview, surfacing the trade-offs the interviewer is grading against in real time.