SQL Interview Questions — Joins, Window Functions, Optimization, and the Real Patterns Companies Test
SQL rounds appear in data science, data engineering, analytics, and backend SWE loops. The same LEFT JOIN problem comes up in every one of them. Knowing the patterns matters more than knowing every function in the docs.
A SQL round is rarely about syntax. The interviewer has a list of patterns they expect to see, and they probe whether you reach for the right one. Anti-joins, window functions with proper frames, gap-and-island, top-N-per-group, recursive hierarchy traversal, partial indexes, JSONB payloads — these are the actual rubric items. If you can name the pattern, write the query, and explain the trade-off in two sentences, you will pass almost any SQL round at almost any company.
This page is organized by round structure, not by difficulty. Each section opens with what the interviewer is actually testing, then walks through the canonical questions with full, runnable SQL and 100-200 word explanations. The queries target Postgres syntax by default; differences for MySQL, BigQuery, and Snowflake are called out where they matter.
What SQL rounds test
JOINs, especially LEFT and FULL OUTER. Anti-joins, multi-table fan-outs, self-joins. Cardinality bugs are the most common silent failure in production SQL, and interviewers know it.
Aggregations with HAVING. Where you filter, when you push predicates down, and when COUNT(DISTINCT) becomes the bottleneck. Three questions, one rubric.
Window functions. ROW_NUMBER versus RANK versus DENSE_RANK; LAG and LEAD for time series; running totals with explicit frames; FIRST_VALUE and LAST_VALUE without the frame trap.
CTEs, including recursive. Org hierarchies, session reconstruction, gap-and-island. CTEs are the readability tool; recursion is the senior signal.
Query optimization. Index design, EXPLAIN plan reading, anti-patterns. The expected answer is the mental loop, not the right query on the first try.
Occasionally, schema design. Open-ended entity decomposition. Walked through the same way you would walk a system design question — entities, access patterns, normalization trade-offs, partitioning.
JOINs
Every SQL round opens with a JOIN. The named JOIN types — INNER, LEFT, RIGHT, FULL OUTER, CROSS — are easy. The pattern recognition is what gets tested: anti-joins, self-joins, cardinality bugs, multi-table fan-outs.
Find users who have never placed an order.
SELECT u.user_id, u.email
FROM users u
LEFT JOIN orders o
ON o.user_id = u.user_id
WHERE o.order_id IS NULL;
The classic anti-join. The instinct most candidates show first is a NOT IN subquery, which silently breaks when the inner query returns NULLs (NOT IN with a NULL row evaluates to UNKNOWN and the outer row is dropped). LEFT JOIN with WHERE right.id IS NULL is the safe form and usually the most readable. NOT EXISTS is also correct and sometimes faster, because the planner can short-circuit on the first match. In an interview, name all three options, pick the LEFT JOIN, and call out the NOT IN NULL trap. That single observation distinguishes someone who has shipped SQL from someone who has only studied it.
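For reference, the NOT EXISTS form of the same anti-join — a sketch against the same users and orders tables:
-- NOT EXISTS anti-join; the planner can stop at the first matching order
SELECT u.user_id, u.email
FROM users u
WHERE NOT EXISTS (
    SELECT 1
    FROM orders o
    WHERE o.user_id = u.user_id
);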
Find pairs of users who live in the same city (no duplicates, no self-pairs).
SELECT a.user_id AS user_a,
b.user_id AS user_b,
a.city
FROM users a
JOIN users b
ON a.city = b.city
AND a.user_id < b.user_id;
Self-joins look strange the first time you see one, but they are just the same table aliased twice. The trick is the a.user_id < b.user_id condition, which eliminates the (a, a) self-pair and de-duplicates (a, b) versus (b, a). Without that condition you would emit every pair twice plus every user paired with themselves. If the city column is high-cardinality and the table is large, this query benefits from an index on city — the join becomes a lookup on equal city values. Mention the index when asked about scale; interviewers love it when you connect query shape to physical layout.
Find the third highest distinct salary in the employees table.
-- Window function approach (preferred)
SELECT salary
FROM (
SELECT salary,
DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
FROM employees
) t
WHERE rnk = 3;
-- Subquery approach (older portable form)
SELECT MAX(salary) AS third_highest
FROM employees
WHERE salary < (
SELECT MAX(salary)
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees)
);
Two valid answers, two different conversations. With DENSE_RANK you handle ties correctly — three people tied for second place still leave a meaningful third rank, whereas RANK would jump from the tie straight to rank 5 and ROW_NUMBER would break the tie arbitrarily. The subquery approach is the answer you give if the interviewer says no window functions. Always ask the clarifying question first: does third highest mean third distinct salary, or third row when ordered? That question alone earns points. If they want top-N for arbitrary N, the window-function form generalizes; the nested-MAX form does not.
Reconcile two systems: list every customer that exists in either CRM or Billing, with the source.
SELECT COALESCE(c.customer_id, b.customer_id) AS customer_id,
c.email AS crm_email,
b.email AS billing_email,
CASE
WHEN c.customer_id IS NULL THEN 'billing_only'
WHEN b.customer_id IS NULL THEN 'crm_only'
ELSE 'both'
END AS source
FROM crm_customers c
FULL OUTER JOIN billing_customers b
ON c.customer_id = b.customer_id;
FULL OUTER JOIN is the question that separates people who learned SQL on toy datasets from people who have done real reconciliation work. The COALESCE on the join key is essential — without it, the right-only rows have NULL on the left and you cannot reference a single customer_id column. The CASE classifies each row, which is exactly what data engineers do during system migrations. MySQL does not have FULL OUTER JOIN; if the interview is on MySQL, simulate it by unioning a LEFT JOIN with the right-only rows, as sketched below. Postgres, Snowflake, BigQuery, and SQL Server all support it natively.
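A sketch of the MySQL emulation, against the same two tables — a LEFT JOIN for everything on the CRM side, unioned with the billing-only rows:
-- MySQL has no FULL OUTER JOIN; emulate it with a LEFT JOIN plus the right-only rows
SELECT c.customer_id AS crm_id, b.customer_id AS billing_id,
       c.email AS crm_email, b.email AS billing_email
FROM crm_customers c
LEFT JOIN billing_customers b ON b.customer_id = c.customer_id
UNION ALL
SELECT c.customer_id, b.customer_id, c.email, b.email
FROM billing_customers b
LEFT JOIN crm_customers c ON c.customer_id = b.customer_id
WHERE c.customer_id IS NULL;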
A junior engineer writes JOIN a JOIN b JOIN c and gets ten times the expected row count. Why?
-- Wrong: order_items has many rows per order, payments has many rows per order
SELECT o.order_id, oi.product_id, p.amount
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
JOIN payments p ON p.order_id = o.order_id;
-- Right: aggregate before joining, or join along a single chain
SELECT o.order_id, oi.product_id, p_total.amount
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
JOIN (
SELECT order_id, SUM(amount) AS amount
FROM payments
GROUP BY order_id
) p_total ON p_total.order_id = o.order_id;
Row multiplication is the most common silent bug in production SQL. When you join a parent table to two child tables that each have many rows per parent, you produce the cross product of the two children for every parent — N order_items times M payments per order. The fix is either to aggregate one side first (so it becomes one row per order), or to query the two children separately and union or report them independently. The deeper lesson: every JOIN should be classified as one-to-one, one-to-many, or many-to-many before you write it, and many-to-many should always raise a flag.
Window functions
Window functions are the single highest-leverage skill in modern SQL. ROW_NUMBER, RANK, LAG, LEAD, FIRST_VALUE, plus running aggregates with frames. If a candidate cannot describe the difference between RANK and DENSE_RANK, the round usually ends there.
Compute a running total of order amount per user, ordered by order date.
SELECT user_id,
order_id,
order_date,
amount,
SUM(amount) OVER (
PARTITION BY user_id
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_total
FROM orders;
The running total is the textbook window function. PARTITION BY isolates each user; ORDER BY defines the running direction; the frame clause specifies the window of rows considered. Many candidates omit the frame clause and rely on the default — and the default is dangerous. For aggregate functions like SUM, the default frame when ORDER BY is present is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which behaves correctly until you hit ties on order_date, and then it lumps all tied rows together. Always specify ROWS explicitly when you want a row-by-row running total. Saying that out loud earns the round.
Rank products by total revenue within each category.
SELECT category,
product_id,
revenue,
RANK() OVER (
PARTITION BY category
ORDER BY revenue DESC
) AS revenue_rank
FROM (
SELECT p.category,
p.product_id,
SUM(oi.quantity * oi.unit_price) AS revenue
FROM products p
JOIN order_items oi ON oi.product_id = p.product_id
GROUP BY p.category, p.product_id
) t;
The aggregation has to happen before the ranking — you cannot RANK rows that have not been summed yet. So the inner query produces one row per (category, product) with its total revenue, and the outer window applies the rank. The choice between RANK, DENSE_RANK, and ROW_NUMBER matters. RANK leaves gaps after ties (1, 2, 2, 4); DENSE_RANK does not (1, 2, 2, 3); ROW_NUMBER breaks ties arbitrarily, which is non-deterministic and a common bug source. For top-product reporting, DENSE_RANK is usually the human-readable choice. Discuss the trade-off in the interview rather than picking silently.
Find users whose weekly session count has decreased for three consecutive weeks.
WITH weekly AS (
SELECT user_id,
DATE_TRUNC('week', session_start) AS week,
COUNT(*) AS sessions
FROM sessions
GROUP BY user_id, DATE_TRUNC('week', session_start)
),
flagged AS (
SELECT user_id,
week,
sessions,
LAG(sessions, 1) OVER (PARTITION BY user_id ORDER BY week) AS prev1,
LAG(sessions, 2) OVER (PARTITION BY user_id ORDER BY week) AS prev2,
LAG(sessions, 3) OVER (PARTITION BY user_id ORDER BY week) AS prev3
FROM weekly
)
SELECT user_id, week
FROM flagged
WHERE prev3 > prev2
AND prev2 > prev1
AND prev1 > sessions;
LAG and LEAD are the time-series Swiss army knife of SQL. LAG(sessions, 1) returns the previous row's sessions value within the partition, LAG(sessions, 2) the row before that, and so on. The CTE pattern keeps the query readable: aggregate first, attach lagged columns next, filter last. A subtle interview moment — the strict inequalities (prev3 > prev2 > prev1 > sessions) require that every week be present in the data; if a user has a week with zero sessions, that week is missing entirely and the LAG skips it. Mention generate_series or a calendar table to densify weeks if the interviewer pushes on edge cases.
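If the interviewer does push on missing weeks, one Postgres sketch of the densification step — build a per-user week grid with generate_series and LEFT JOIN the real counts so empty weeks appear as zero before the LAG step:
-- Densified weekly counts: one row per user per week, zero-session weeks included
WITH weekly AS (
    SELECT user_id,
           DATE_TRUNC('week', session_start) AS week,
           COUNT(*) AS sessions
    FROM sessions
    GROUP BY user_id, DATE_TRUNC('week', session_start)
),
bounds AS (
    SELECT user_id, MIN(week) AS first_week, MAX(week) AS last_week
    FROM weekly
    GROUP BY user_id
)
SELECT b.user_id,
       w.week,
       COALESCE(wk.sessions, 0) AS sessions
FROM bounds b
CROSS JOIN LATERAL generate_series(b.first_week, b.last_week, INTERVAL '1 week') AS w(week)
LEFT JOIN weekly wk
    ON wk.user_id = b.user_id AND wk.week = w.week;
Feed this result into the flagged CTE above in place of weekly and the strict inequalities behave as intended.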
Get the top 3 highest-paid employees per department.
SELECT department_id, employee_id, salary
FROM (
SELECT department_id,
employee_id,
salary,
ROW_NUMBER() OVER (
PARTITION BY department_id
ORDER BY salary DESC, employee_id
) AS rn
FROM employees
) t
WHERE rn <= 3;
The top-N-per-group pattern shows up in product analytics, leaderboards, and recommendation pipelines. ROW_NUMBER is the right window function here because we want exactly 3 rows per department even if there are salary ties; RANK or DENSE_RANK could return more than 3. Add a tie-breaker column (employee_id) to ORDER BY to make the result deterministic — without it, two runs on the same data can return different employees. On a large table, a covering index on (department_id, salary DESC, employee_id) lets the planner satisfy the window without a sort.
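A sketch of that covering index; the column order mirrors the window's PARTITION BY and ORDER BY (index name illustrative):
-- Matches the partition and sort of the top-3 window, so no separate sort is needed
CREATE INDEX idx_employees_dept_salary
ON employees (department_id, salary DESC, employee_id);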
For each user, return the first and last login timestamp in the same row.
SELECT DISTINCT
user_id,
FIRST_VALUE(login_at) OVER (
PARTITION BY user_id ORDER BY login_at
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS first_login,
LAST_VALUE(login_at) OVER (
PARTITION BY user_id ORDER BY login_at
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS last_login
FROM logins;
FIRST_VALUE and LAST_VALUE require an explicit frame to return the actual first and last in the partition; otherwise LAST_VALUE returns the current row's value because the default frame ends at the current row. This is the single most repeated bug in window-function code. The DISTINCT collapses the per-row duplicates that the window produces. An equivalent and arguably cleaner alternative is GROUP BY user_id with MIN(login_at) and MAX(login_at), which avoids the frame trap entirely. Bring up the alternative in the interview — showing two correct paths is stronger than showing one.
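The GROUP BY alternative, for comparison:
-- Same result, no window and no frame to get wrong
SELECT user_id,
       MIN(login_at) AS first_login,
       MAX(login_at) AS last_login
FROM logins
GROUP BY user_id;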
GROUP BY and HAVING
Most candidates know GROUP BY. Few can articulate when WHERE versus HAVING applies, why COUNT(DISTINCT) is expensive, and how to push predicates down. These three questions probe exactly that boundary.
Find users with more than 5 orders in the last 30 days.
SELECT user_id, COUNT(*) AS order_count
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY user_id
HAVING COUNT(*) > 5;
WHERE filters rows before grouping; HAVING filters groups after grouping. That distinction is on the rubric of every SQL screen. Putting COUNT(*) > 5 in WHERE is a syntax error because the count does not exist yet at row-evaluation time. Putting order_date filtering in HAVING is legal but slow — the engine has to group every row, including 5-year-old orders, before discarding them. Keep predicates that reference raw columns in WHERE, predicates that reference aggregates in HAVING. On large tables, also mention that an index on (user_id, order_date) lets the planner range-scan recent orders and stream the GROUP BY.
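A sketch of that index (name illustrative):
-- Supports a range scan on recent orders per user and a streamed GROUP BY
CREATE INDEX idx_orders_user_date ON orders (user_id, order_date);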
List departments whose average salary exceeds the company-wide average salary.
SELECT d.department_id,
d.department_name,
AVG(e.salary) AS dept_avg
FROM departments d
JOIN employees e ON e.department_id = d.department_id
GROUP BY d.department_id, d.department_name
HAVING AVG(e.salary) > (SELECT AVG(salary) FROM employees);
Two aggregates in one query: the per-department average in the GROUP BY, and the company-wide average in the scalar subquery. The subquery runs once and the planner caches its result, so the cost is the GROUP BY itself. A common variant — what if you also want each department's headcount and the percentage above the company average — extends naturally with COUNT(*) and arithmetic in the SELECT. Avoid putting AVG(salary) > AVG(...) inside HAVING with both halves aggregated over the same scope; the scalar subquery makes the second average independent and is the cleanest read.
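The headcount-and-percentage variant, sketched:
-- Adds headcount and how far above the company-wide average each department sits
SELECT d.department_id,
       d.department_name,
       COUNT(*) AS headcount,
       AVG(e.salary) AS dept_avg,
       100.0 * AVG(e.salary) / (SELECT AVG(salary) FROM employees) - 100 AS pct_above_company_avg
FROM departments d
JOIN employees e ON e.department_id = d.department_id
GROUP BY d.department_id, d.department_name
HAVING AVG(e.salary) > (SELECT AVG(salary) FROM employees);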
For each product category, count the number of distinct users who have bought from it, but only categories with at least 100 distinct buyers.
SELECT p.category,
COUNT(DISTINCT o.user_id) AS distinct_buyers
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
JOIN products p ON p.product_id = oi.product_id
GROUP BY p.category
HAVING COUNT(DISTINCT o.user_id) >= 100;
COUNT(DISTINCT) is correct but expensive — every group has to deduplicate the user_ids before counting. On warehouses like BigQuery and Snowflake this is fine; on Postgres or MySQL at large scale it can become the bottleneck. If you see the interviewer push on performance, offer APPROX_COUNT_DISTINCT (BigQuery, Snowflake) or HyperLogLog (Postgres extension) and explain that you trade a bounded error for an order-of-magnitude speedup. The HAVING clause filters categories after the distinct count is computed; you cannot push it into WHERE because the count does not exist yet.
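A hedged sketch of the approximate form as it would look on BigQuery or Snowflake (both expose APPROX_COUNT_DISTINCT; error bounds are engine-specific):
-- Approximate distinct buyers: bounded error in exchange for a much cheaper aggregation
SELECT p.category,
       APPROX_COUNT_DISTINCT(o.user_id) AS approx_buyers
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
JOIN products p ON p.product_id = oi.product_id
GROUP BY p.category
HAVING APPROX_COUNT_DISTINCT(o.user_id) >= 100;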
CTEs and recursion
CTEs make queries readable. Recursive CTEs solve hierarchy traversal, sessionization, and gap-and-island problems that look impossible without them. These are the questions that signal a senior IC.
Given an employees table with a manager_id column, return every employee's chain of command up to the CEO.
WITH RECURSIVE chain AS (
SELECT employee_id,
manager_id,
employee_name,
1 AS depth,
CAST(employee_name AS TEXT) AS path
FROM employees
WHERE manager_id IS NULL -- CEO
UNION ALL
SELECT e.employee_id,
e.manager_id,
e.employee_name,
c.depth + 1,
c.path || ' -> ' || e.employee_name
FROM employees e
JOIN chain c ON e.manager_id = c.employee_id
)
SELECT employee_id, employee_name, depth, path
FROM chain
ORDER BY depth, employee_name;
Recursive CTEs have an anchor (the WHERE manager_id IS NULL row that seeds the recursion) and a recursive term that joins back to the CTE itself. The engine repeats the recursive term until it produces zero new rows. The depth column tracks how many levels deep we are; the path column accumulates the human-readable chain. A real-world hazard: cycles. If two employees mistakenly manage each other, the recursion never terminates. Postgres lets you set a cycle clause; otherwise add a depth cap with a WHERE depth < 50 in the recursive term. Always mention cycle protection — interviewers love seeing engineers think about adversarial data.
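A sketch of the depth-cap guard, against the same table; it is the portable fallback when the CYCLE clause is not available:
-- Same traversal with a cap so a cyclic manager chain cannot recurse forever
WITH RECURSIVE chain AS (
    SELECT employee_id, manager_id, employee_name,
           1 AS depth,
           CAST(employee_name AS TEXT) AS path
    FROM employees
    WHERE manager_id IS NULL
    UNION ALL
    SELECT e.employee_id, e.manager_id, e.employee_name,
           c.depth + 1,
           c.path || ' -> ' || e.employee_name
    FROM employees e
    JOIN chain c ON e.manager_id = c.employee_id
    WHERE c.depth < 50   -- cycle guard: stop descending past 50 levels
)
SELECT employee_id, employee_name, depth, path
FROM chain;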
Reconstruct user sessions from raw event timestamps, where a new session starts after 30 minutes of inactivity.
WITH gaps AS (
SELECT user_id,
event_time,
CASE
WHEN event_time - LAG(event_time) OVER (
PARTITION BY user_id ORDER BY event_time
) > INTERVAL '30 minutes'
THEN 1 ELSE 0
END AS is_new_session
FROM events
),
sessioned AS (
SELECT user_id,
event_time,
SUM(is_new_session) OVER (
PARTITION BY user_id ORDER BY event_time
) AS session_index
FROM gaps
)
SELECT user_id,
session_index,
MIN(event_time) AS session_start,
MAX(event_time) AS session_end,
COUNT(*) AS event_count
FROM sessioned
GROUP BY user_id, session_index;
Sessionization is one of the highest-signal SQL questions for analytics and DE roles. The pattern is gap-and-island: flag where a new island begins (the 30-minute gap), then run a cumulative sum to assign a stable session_index to all events in the same island. The final GROUP BY collapses the events into one row per session. This pattern shows up in fraud detection, attribution, and user-journey analysis. If the interviewer pushes deeper, discuss the trade-offs of a fixed gap threshold versus session-aware models, and mention that warehouse dialects trim the plumbing — Snowflake and BigQuery both support QUALIFY, which filters on window results without an extra subquery.
Find users who have been active for at least 7 consecutive days.
WITH days AS (
SELECT DISTINCT user_id, DATE(event_time) AS day
FROM events
),
streaks AS (
SELECT user_id,
day,
day - INTERVAL '1 day' * ROW_NUMBER() OVER (
PARTITION BY user_id ORDER BY day
) AS streak_group
FROM days
)
SELECT user_id,
MIN(day) AS streak_start,
MAX(day) AS streak_end,
COUNT(*) AS streak_days
FROM streaks
GROUP BY user_id, streak_group
HAVING COUNT(*) >= 7;
The day - row_number trick is the islands pattern in disguise. For consecutive days, day minus its row number is constant across the streak — the moment a gap appears, the constant changes, and grouping by it collapses each streak into a row. This is one of those SQL idioms that looks obscure until you see it once, and then it shows up everywhere. Variants include consecutive months, consecutive logins, consecutive wins. Be ready to walk the interviewer through why the constant is constant — that explanation is the actual signal they are testing.
Query optimization
Indexes, EXPLAIN plans, anti-patterns. Optimization questions are where DE and senior backend SWE rounds separate from junior ones. The expected answer is not a correct query — it is the correct mental loop.
How would you index this slow query that scans 50M rows?
SELECT user_id, order_id, total
FROM orders
WHERE status = 'shipped'
AND created_at >= NOW() - INTERVAL '7 days'
ORDER BY created_at DESC
LIMIT 100;
Walk through it predicate by predicate. The two filters are status (low cardinality) and created_at (high cardinality, range). The ORDER BY is on created_at DESC. A composite index on (status, created_at DESC) is the strongest candidate — the planner can seek to status='shipped', then range-scan created_at in the desired order, and stop after 100 rows. Adding included columns (user_id, order_id, total) makes it a covering index and avoids the heap fetch entirely. Discuss the trade-off: indexes accelerate reads but cost on writes, especially on a hot table like orders. Always ask the read/write ratio before recommending index changes — that question alone is a senior signal.
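A sketch of that index in Postgres syntax; INCLUDE stores the payload columns in the index leaf pages without making them part of the key (index name illustrative):
-- Seek to status = 'shipped', walk created_at descending, stop after 100 rows
CREATE INDEX idx_orders_shipped_recent
ON orders (status, created_at DESC)
INCLUDE (user_id, order_id, total);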
What is wrong with this query?
SELECT *
FROM users
WHERE LOWER(email) = 'shekhar@example.com'
AND created_at::date = '2026-05-05'
AND id IN (SELECT user_id FROM banned_users);
Three anti-patterns in five lines. First, SELECT * pulls every column when you need two — it wastes network transfer and forces a heap fetch. Second, LOWER(email) and created_at::date wrap columns in functions, which kills any plain index on those columns; the planner cannot use a B-tree on email if it has to evaluate LOWER on every row. Fix by storing emails lower-cased on insert, or by creating a functional index on LOWER(email). For created_at::date, rewrite to a range: created_at >= '2026-05-05' AND created_at < '2026-05-06'. Third, IN (subquery) is correct but often slower than EXISTS or a JOIN, especially when the subquery returns many rows; show the planner an EXISTS and let it pick.
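One possible rewrite, sketched; the functional index assumes Postgres, and storing emails lower-cased at insert time would remove the need for it (index name illustrative):
-- Functional index so the lower-cased lookup can still use a B-tree
CREATE INDEX idx_users_email_lower ON users (LOWER(email));
-- Sargable rewrite: named columns, range predicate on created_at, EXISTS instead of IN
SELECT u.id, u.email, u.created_at
FROM users u
WHERE LOWER(u.email) = 'shekhar@example.com'
  AND u.created_at >= '2026-05-05'
  AND u.created_at < '2026-05-06'
  AND EXISTS (
      SELECT 1 FROM banned_users b WHERE b.user_id = u.id
  );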
An EXPLAIN plan shows a Seq Scan with a Filter and 80,000 rows returned out of 50M. What do you do?
EXPLAIN ANALYZE
SELECT order_id, total
FROM orders
WHERE customer_id = 12345 AND status = 'shipped';
-- Plan output excerpt:
-- Seq Scan on orders (cost=0.00..1234567.00 rows=80000 width=16)
-- Filter: ((customer_id = 12345) AND (status = 'shipped'))
-- Rows Removed by Filter: 49920000
Seq Scan on a 50M-row table to return 80K rows is a textbook missing-index situation. Read the plan top-down: the Seq Scan is the access method, the Filter is the post-scan predicate, and Rows Removed by Filter is the throwaway. A composite index on (customer_id, status) would convert the Seq Scan to an Index Scan or Index Only Scan and reduce the work to roughly 80K row touches plus the B-tree descent. After creating the index, run EXPLAIN ANALYZE again — note the difference between estimated and actual rows; a large divergence suggests stale statistics, and you should ANALYZE the table. Talk through the loop: read plan, identify dominant cost, propose index, verify, repeat. That loop is the actual job.
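The fix and the verification step, sketched (index name illustrative):
-- Composite index matching both equality predicates
CREATE INDEX idx_orders_customer_status ON orders (customer_id, status);
-- Refresh planner statistics, then re-read the plan
ANALYZE orders;
EXPLAIN ANALYZE
SELECT order_id, total
FROM orders
WHERE customer_id = 12345 AND status = 'shipped';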
Schema design
Schema design rounds are open-ended on purpose. The interviewer wants to see how you decompose entities, where you normalize, where you denormalize for read performance, and what you leave out for a follow-up.
Design a schema for a ride-sharing app.
-- Core entities
CREATE TABLE users (
user_id BIGSERIAL PRIMARY KEY,
phone TEXT UNIQUE NOT NULL,
email TEXT UNIQUE,
role TEXT NOT NULL CHECK (role IN ('rider','driver','both')),
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE TABLE drivers (
user_id BIGINT PRIMARY KEY REFERENCES users(user_id),
license_number TEXT UNIQUE NOT NULL,
rating NUMERIC(3,2),
status TEXT NOT NULL CHECK (status IN ('offline','available','on_trip'))
);
CREATE TABLE vehicles (
vehicle_id BIGSERIAL PRIMARY KEY,
driver_id BIGINT NOT NULL REFERENCES drivers(user_id),
plate TEXT UNIQUE NOT NULL,
vehicle_class TEXT NOT NULL -- 'standard','xl','luxury'
);
CREATE TABLE trips (
trip_id BIGSERIAL PRIMARY KEY,
rider_id BIGINT NOT NULL REFERENCES users(user_id),
driver_id BIGINT REFERENCES drivers(user_id),
vehicle_id BIGINT REFERENCES vehicles(vehicle_id),
origin_lat NUMERIC(9,6) NOT NULL,
origin_lng NUMERIC(9,6) NOT NULL,
dest_lat NUMERIC(9,6) NOT NULL,
dest_lng NUMERIC(9,6) NOT NULL,
requested_at TIMESTAMPTZ NOT NULL,
picked_up_at TIMESTAMPTZ,
completed_at TIMESTAMPTZ,
status TEXT NOT NULL, -- requested, matched, in_progress, completed, cancelled
fare_cents INTEGER
);
CREATE INDEX idx_trips_rider ON trips(rider_id, requested_at DESC);
CREATE INDEX idx_trips_driver ON trips(driver_id, requested_at DESC);
CREATE INDEX idx_trips_status ON trips(status) WHERE status IN ('requested','matched','in_progress');
Schema design questions reward a structured walk-through, not a dump. Start with entities — users, drivers, vehicles, trips, payments, ratings. Note that drivers extend users (a driver is a user with extra columns), so a one-to-one drivers table referencing users keeps the role flexible. Trips are the central event table; everything else hangs off them. Decide what is normalized versus denormalized — a partial index on active trip statuses keeps the dispatcher query fast even when historical trips dominate row count. Talk about hot-path latency: matching riders to drivers requires geospatial queries, so PostGIS or a separate Redis geo set sits alongside this schema. Ratings, payments, surge pricing, and promotions each get their own tables; do not stuff them into trips.
Design a schema for analytics events.
-- Wide event table, partitioned by day
CREATE TABLE events (
event_id UUID PRIMARY KEY,
event_time TIMESTAMPTZ NOT NULL,
event_name TEXT NOT NULL,
user_id BIGINT,
session_id UUID,
device_id TEXT,
app_version TEXT,
platform TEXT,
country TEXT,
-- semi-structured payload for evolving event-specific fields
properties JSONB
) PARTITION BY RANGE (event_time);
CREATE INDEX idx_events_user_time ON events (user_id, event_time DESC);
CREATE INDEX idx_events_name_time ON events (event_name, event_time DESC);
CREATE INDEX idx_events_props_gin ON events USING GIN (properties);
-- Daily partition
CREATE TABLE events_2026_05_05 PARTITION OF events
FOR VALUES FROM ('2026-05-05') TO ('2026-05-06');
Analytics events have two opposing pressures. They need to be wide enough to query without joins (every event has user_id, time, country, app_version inline), but flexible enough to accept new event types without schema migrations (the JSONB properties column). Partitioning by day is critical: most analytics queries filter by a time range, and partition pruning turns a 5-billion-row scan into a 50-million-row scan. The GIN index on properties enables key-based lookups on the JSONB blob. For a real product, the warehouse layer (BigQuery, Snowflake, ClickHouse) is usually the long-term home; OLTP is just the landing zone. Mention the lambda-style split — fast partitioned OLTP for the last 7 days, columnar warehouse for the long tail.
Practice these patterns under interview conditions
Reading SQL questions is the first 20% of preparation. The other 80% is writing them out under time pressure, with an interviewer probing the edges of your query — null handling, ties, cardinality, index usage. PhantomCode's interview copilot runs alongside live SQL rounds and surfaces the right pattern in real time, without ever appearing in screen capture.