Q: Given a Sessions table with (user_id, session_start, session_end), write a query that returns the maximum number of concurrent sessions at any moment.

Sweep-line. Generate a +1 event at every session_start and a -1 event at every session_end, sort by timestamp, then take the running sum and pick its maximum. In SQL: WITH events AS (SELECT session_start AS ts, 1 AS delta FROM Sessions UNION ALL SELECT session_end AS ts, -1 AS delta FROM Sessions) SELECT MAX(running) FROM (SELECT ts, SUM(delta) OVER (ORDER BY ts, delta DESC ROWS UNBOUNDED PRECEDING) AS running FROM events) t. The ORDER BY ts, delta DESC matters — when a session ends at the same instant another begins, you want to count the overlap, so process the +1 before the -1 (or the opposite, depending on whether the interval is half-open). State the convention out loud. The common follow-up is concurrent sessions per minute: round the timestamp to the minute, group by minute, take the max within each group, or generate a calendar of minutes and self-join. The other follow-up is doing this at scale — on a 100B-row sessions table you cannot sort globally, so you partition by hour and process the boundary intervals separately.

Q: Write a query that for each user returns the time gap between consecutive transactions. Then return users whose median gap exceeds 30 days.

Two-pass CTE pipeline. First pass uses LAG() to grab the previous transaction time per user: WITH gaps AS (SELECT user_id, txn_ts, DATEDIFF(day, LAG(txn_ts) OVER (PARTITION BY user_id ORDER BY txn_ts), txn_ts) AS gap_days FROM Transactions). The first transaction per user yields NULL from LAG — filter it out in the next step or PERCENTILE_CONT will skew. Second pass computes the median per user: SELECT user_id, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY gap_days) AS median_gap FROM gaps WHERE gap_days IS NOT NULL GROUP BY user_id HAVING PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY gap_days) > 30. Median is the tricky part — most dialects do not have a plain MEDIAN function. BigQuery and Postgres have PERCENTILE_CONT (continuous interpolation) and PERCENTILE_DISC (discrete pick); Snowflake has MEDIAN as a shortcut. If you are on a dialect without any of these, fall back to NTILE(2) or a self-rank approach: rank gaps per user, pick the middle one (or average the two middle ones for even counts). The senior signal is recognising you need a CTE pipeline, handling the LAG NULL, and naming the dialect-specific median function rather than waving at it.

Question 1

Write a SQL query to find the second-highest salary in an Employee table. Then generalize to the Nth-highest.

Accepted Answer

This is one of the most common SQL screening questions because it tests window function fluency and edge-case thinking at the same time. There are three canonical approaches. The window function approach: SELECT salary FROM (SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rk FROM Employee) t WHERE rk = 2. The LIMIT/OFFSET approach: SELECT DISTINCT salary FROM Employee ORDER BY salary DESC LIMIT 1 OFFSET 1 — short but dialect-dependent and you must remember the DISTINCT or duplicates eat your offset. The correlated subquery approach (older but still asked in screens): SELECT MAX(salary) FROM Employee WHERE salary < (SELECT MAX(salary) FROM Employee). The real signal is how you handle ties — DENSE_RANK() treats two people earning the same top salary as both rank 1, so the second-highest distinct salary is rank 2; RANK() leaves gaps (1, 1, 3); ROW_NUMBER() arbitrarily picks one of the tied rows. Generalising to the Nth highest, swap the literal for a parameter and use DENSE_RANK() if you want distinct salary tiers, ROW_NUMBER() if you want the Nth specific employee. The L5 follow-up always lands on ties: ask the interviewer which version of Nth highest they want before you write a line.

Question 2

Given a Sessions table with (user_id, session_start, session_end), write a query that returns the maximum number of concurrent sessions at any moment.

Accepted Answer

Sweep-line. Generate a +1 event at every session_start and a -1 event at every session_end, sort by timestamp, then take the running sum and pick its maximum. In SQL: WITH events AS (SELECT session_start AS ts, 1 AS delta FROM Sessions UNION ALL SELECT session_end AS ts, -1 AS delta FROM Sessions) SELECT MAX(running) FROM (SELECT ts, SUM(delta) OVER (ORDER BY ts, delta DESC ROWS UNBOUNDED PRECEDING) AS running FROM events) t. The ORDER BY ts, delta DESC matters — when a session ends at the same instant another begins, you want to count the overlap, so process the +1 before the -1 (or the opposite, depending on whether the interval is half-open). State the convention out loud. The common follow-up is concurrent sessions per minute: round the timestamp to the minute, group by minute, take the max within each group, or generate a calendar of minutes and self-join. The other follow-up is doing this at scale — on a 100B-row sessions table you cannot sort globally, so you partition by hour and process the boundary intervals separately.

Question 3

Write a query that for each user returns the time gap between consecutive transactions. Then return users whose median gap exceeds 30 days.

Accepted Answer

Two-pass CTE pipeline. First pass uses LAG() to grab the previous transaction time per user: WITH gaps AS (SELECT user_id, txn_ts, DATEDIFF(day, LAG(txn_ts) OVER (PARTITION BY user_id ORDER BY txn_ts), txn_ts) AS gap_days FROM Transactions). The first transaction per user yields NULL from LAG — filter it out in the next step or PERCENTILE_CONT will skew. Second pass computes the median per user: SELECT user_id, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY gap_days) AS median_gap FROM gaps WHERE gap_days IS NOT NULL GROUP BY user_id HAVING PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY gap_days) > 30. Median is the tricky part — most dialects do not have a plain MEDIAN function. BigQuery and Postgres have PERCENTILE_CONT (continuous interpolation) and PERCENTILE_DISC (discrete pick); Snowflake has MEDIAN as a shortcut. If you are on a dialect without any of these, fall back to NTILE(2) or a self-rank approach: rank gaps per user, pick the middle one (or average the two middle ones for even counts). The senior signal is recognising you need a CTE pipeline, handling the LAG NULL, and naming the dialect-specific median function rather than waving at it.

Question 4

Optimize a query that joins a 100B-row fact table with a 100M-row dimension table. The query is taking 30 minutes; reduce it to under 1 minute.

Accepted Answer

Walk through the toolkit in order of cheapest fix. (1) Partition pruning — the fact table should be partitioned by date (or whatever the high-cardinality filter dimension is); if the query has a date predicate, make sure partitions are actually being pruned by checking the query plan. (2) Broadcast join — 100M rows of dimension data is roughly a few GB depending on width, which fits in memory on a modern Spark executor; force a broadcast hint (BROADCAST in Spark, MAPJOIN in Hive, /*+ broadcast(d) */ in Trino) so the dimension ships to every executor instead of doing a shuffle. (3) Columnar storage — if the fact table is row-oriented (CSV, JSON, Avro for write-heavy paths), convert to Parquet or ORC; a 100B-row scan that only needs 5 columns drops by an order of magnitude on columnar formats. (4) Pre-aggregation — if the query is repeated, build a daily rollup table and query that instead; the L5 answer mentions materialised views or dbt incremental models. (5) File sizing — many small files kills Spark; aim for 128MB–1GB Parquet files. The L5 distinction is knowing when broadcast joins flip to sort-merge joins (when the dimension stops fitting in memory, usually around 8GB depending on cluster sizing) and when to switch the storage format entirely — Iceberg or Delta with Z-ordering on the join key beats raw Parquet on selective queries.

Question 5

Design a real-time analytics pipeline for an e-commerce site — events from web/mobile flow in, dashboards must show conversion funnels with under-5-minute latency.

Accepted Answer

Layered architecture. Ingest layer: web and mobile SDKs send events over HTTPS to a thin collector service that writes to Kafka (or Kinesis or Pub/Sub depending on cloud). Partition by user_id or session_id so events for the same session stay ordered. Stream processing: Flink or Spark Structured Streaming consumes the topic, does windowed aggregations (5-minute tumbling windows for funnel counts), enriches with dimension lookups (user attributes from a cached Redis layer), and writes to the serving store. Serving layer: Druid, ClickHouse, or Pinot — all three give you sub-second queries on freshly-ingested data and support the rollup-and-slice access pattern dashboards need. The harder problem is late-arriving events and event-time vs processing-time windowing. Mobile clients go offline and flush events hours later; if you window on processing time you mis-attribute. Use event-time windows with a watermark (Flink calls this allowed lateness) and decide policy: drop, side-output to a late-arrivals topic, or update previously-emitted windows. Lambda architecture (batch path correcting the speed path) was the standard answer in 2018; in 2026 most teams pick kappa (stream-only) because Flink with state checkpointing and replayable Kafka topics gives you correctness without the duplicated batch path. Mention the trade: kappa is operationally simpler but requires longer Kafka retention and burns more streaming compute on backfills.

Question 6

Design a data lakehouse that supports 100PB of data, with both batch ETL and real-time queries.

Accepted Answer

Storage: object storage (S3, GCS, or Azure Blob) as the durable, cheap base layer — 100PB on S3 with intelligent tiering is the only economically sensible substrate. Table format: Apache Iceberg or Delta Lake (Hudi is the third option but loses on adoption). Both give you ACID semantics over object storage — atomic commits, schema evolution, time travel, hidden partitioning — without which you cannot safely run concurrent writers. Iceberg has the more open ecosystem (Trino, Spark, Flink, Snowflake all read it natively); Delta is tighter with Databricks. Batch processing: Spark for heavy ETL, dbt for SQL-based transformation orchestration on top of Trino/Snowflake/BigQuery — dbt is the modeling layer, not a compute engine, and it has won the transformation tier. Ad-hoc query: Trino or Athena for federated SQL across Iceberg tables. Streaming ingestion: Kafka into Flink writing Iceberg sinks (or a managed equivalent like Tabular's product); landing data in Iceberg lets streaming and batch share the same source of truth. Catalog: Hive Metastore is the legacy default; AWS Glue, Unity Catalog (Databricks), or Polaris (Snowflake) are the modern picks — the catalog choice is also a governance choice because it owns column-level lineage and access control. Data quality: dbt tests for declarative assertions, Great Expectations for richer profiling, and contracts at the producer side (Schema Registry for Kafka topics) so bad data never lands. The L5 thread is naming the trade-offs across these layers, not just listing tools.

Question 7

Design a feature store for an ML platform — must support both training (point-in-time correct lookups across years of history) and online inference (sub-50ms feature retrieval).

Accepted Answer

Two storage tiers backed by one feature definition. Offline store: Parquet (or Iceberg) on S3, partitioned by event_time. Point-in-time correctness is the hard requirement — when generating training data for a model that predicts at t, every feature value must be as-of t, not the latest known value. This is an event-time join (sometimes called an as-of join), not a regular join. Implement via Spark with a join condition like feature.event_time <= label.event_time AND feature.event_time > label.event_time - lookback_window. Online store: Redis, DynamoDB, or ScyllaDB for sub-50ms lookups by entity_id (user_id, item_id). Keep only the latest value per (entity, feature) — online inference does not need history. Consistency: features computed in batch must match what is served online — this is the offline-online skew problem and it kills more model performance than any other operational issue. The clean solution is to define each feature once in a declarative DSL (Feast, Tecton, or in-house), then generate both the batch SQL/Spark job that backfills the offline store and the streaming job that updates the online store from the same definition. The dual-write problem still bites — if your batch backfill and your streaming update both write to the online store, you need ordering guarantees or you will overwrite fresh data with stale data. Solutions: write through a single ordered log (Kafka with topic compaction), or version every write with the event_time and have the online store reject older writes. Mention feature versioning (changing a feature's logic means a new feature name, not an in-place change, so old models keep working) and transformation reuse (the same join logic in training and serving).

Question 8

Tell me about a data pipeline that broke in production.

Accepted Answer

Strong answers walk through the full incident lifecycle, not just the fix. Detection: how you found out — a data quality alert firing on a freshness check, a downstream dashboard showing impossible numbers (negative revenue, NULL conversion rate), an on-call page from a SLA breach. Diagnosis: how you traced it — checking lineage tools (dbt docs, OpenLineage, Marquez) to find the upstream source, running queries against intermediate tables to localise where the data went wrong, comparing schemas across runs to find a silent schema change. Immediate fix: backfill the affected window, replay the Kafka topic from an earlier offset, manually correct the dimension table while you investigate root cause. Durable fix: the part most candidates skip — what you changed so it does not recur. Add a dbt test or Great Expectations assertion for the failure mode (NOT NULL, unique, value-in-set, freshness), add an alert on the upstream schema change, add a contract at the producer side. The senior signal is owning data quality as a first-class concern rather than treating quality as an afterthought you handle once it has burned you. Avoid blaming the upstream team — even if they shipped the breaking change, your pipeline was the one that propagated bad data downstream, and your fix should make that harder regardless of upstream behaviour.

Question 9

Describe a time you redesigned a data model.

Accepted Answer

Pick a real case and name what was wrong with the old schema. The shape: an existing model started causing problems — query times growing as data scaled, joins exploding (a five-way join across normalised tables that took 30 minutes), business logic becoming hard to express (every analyst writing the same 200-line CTE because the model did not match the question), or a slowly-changing dimension being handled wrong so historical reports drifted. You proposed an alternative — star schema (Kimball) for analytics-heavy workloads, Data Vault for highly regulated environments that need full auditability of source changes, One Big Table (OBT) for self-serve analytics where joins are the enemy. Migration matters more than the choice: incremental shadow writes (new model populated alongside old, validated by comparison queries for weeks), dual reads (dashboards switched one at a time with rollback in place), eventually retire the old. Measure the impact — query time dropped from 30 minutes to 90 seconds, analyst time-to-insight on common questions dropped from hours to minutes, the on-call alert volume on stale dimensions went to zero. Strong candidates name the trade-offs explicitly: star schema sacrifices source fidelity for query simplicity; Data Vault sacrifices query simplicity for auditability; OBT sacrifices storage efficiency and update simplicity for self-serve. What you gave up to gain what — say it out loud.

Question 10

Tell me about a time you pushed back on a data request from a stakeholder.

Accepted Answer

Common at senior data engineering levels — managing demand from analytics, ML, and product teams is half the job. Pick a case where the literal ask was wrong but the underlying need was real. The pattern: the stakeholder asked for a specific table or dashboard ('give me a daily export of every user event'), you talked to them about the decision they were trying to make ('we want to know if engagement dropped after the redesign'), and you reframed — the real ask was a funnel metric with a comparison window, not a raw event dump. You set expectations on cost and time ('a daily 500GB export would cost roughly $X/month in egress and take two weeks of pipeline work; the funnel metric I am proposing is two days of work on existing tables'). You maintained the working relationship by making the trade-off visible rather than just refusing — give them options A, B, C with cost and time on each, and let them pick. Avoid making yourself the hero. Show collaboration: 'I sat down with the analyst, we whiteboarded what they actually needed, and we agreed the funnel metric was a better fit.' The interviewer is grading two things — can you say no to engineering work that should not exist, and can you do it without damaging the relationship you will still need next quarter.

Data Engineer Interview Questions — What Senior Candidates Actually Get Asked

How data engineering loops are structured

SQL Q&A

Q1. Write a SQL query to find the second-highest salary in an Employee table. Then generalize to the Nth-highest.

Q2. Given a Sessions table with (user_id, session_start, session_end), write a query that returns the maximum number of concurrent sessions at any moment.

Q3. Write a query that for each user returns the time gap between consecutive transactions. Then return users whose median gap exceeds 30 days.

Q4. Optimize a query that joins a 100B-row fact table with a 100M-row dimension table. The query is taking 30 minutes; reduce it to under 1 minute.

Data system design Q&A

Q1. Design a real-time analytics pipeline for an e-commerce site — events from web/mobile flow in, dashboards must show conversion funnels with under-5-minute latency.

Q2. Design a data lakehouse that supports 100PB of data, with both batch ETL and real-time queries.

Q3. Design a feature store for an ML platform — must support both training (point-in-time correct lookups across years of history) and online inference (sub-50ms feature retrieval).

Behavioural Q&A

Q1. Tell me about a data pipeline that broke in production.

Q2. Describe a time you redesigned a data model.

Q3. Tell me about a time you pushed back on a data request from a stakeholder.

Going deeper