Hybrid technical and business interview guide

Data Scientist Interview Questions — SQL, ML System Design, Statistics, Case Studies

Data scientist interviews are not software engineering interviews with a sprinkle of stats. They are a distinct discipline — fewer DSA puzzles, more SQL fluency, more applied statistics, and far more product judgment. This guide collects the data scientist interview questions we see most often across SQL, statistics, ML system design, product case studies, and behavioral rounds, with answers written in the hybrid technical-and-business voice that hiring managers actually want to hear.

How DS interviews are structured

At FAANG-tier and growth-stage companies, the data scientist interview loop is usually five to six rounds spread across one or two onsite days. The exact mix depends on whether the role is product DS, ML-engineer-leaning DS, or research-leaning DS, but a representative loop looks like this:

  • SQL round (45–60 min): two to four questions ranging from joins and aggregations to window functions, query optimization, and ambiguous business questions where the SQL is the easy part and the metric definition is the hard part.
  • Python / coding round (45–60 min): data manipulation with pandas or NumPy, sometimes a light algorithmic question, often a small modeling exercise where you fit and evaluate a model end-to-end.
  • Statistics and experimentation round (45–60 min): A/B test design, sample size estimation, p-values vs confidence intervals, variance reduction techniques, and how to handle messy real-world experiments.
  • ML system design (60 min): design a recommender, a fraud system, a search ranker, or a moderation pipeline. The interviewer cares less about the exact model and more about how you scope, decompose, and trade off.
  • Product case study (45–60 min): metric definition, root cause analysis on a metric drop, ROI estimation. This round is what most distinguishes DS from MLE — bring product judgment, not just modeling.
  • Behavioral round (45 min): cross-functional collaboration, stakeholder communication, navigating ambiguity, and one or two project deep-dives.

The defining feature of the loop is that no single round is purely technical. Even the SQL round will probe how you would communicate the result to a PM. Even the behavioral round will dig into a model decision you made. Treating the loop as a hybrid technical-business assessment from minute one is the single highest-leverage piece of advice in this guide.

SQL Q&A

Joins, window functions, query optimization — and the business framing that turns a query into an answer.

Q1. Write a SQL query to find users whose engagement is declining month-over-month for the last three months.

Engagement decline is a classic DS-style SQL question because the business definition matters more than the syntax. Start by clarifying the metric — sessions, time spent, or actions per user — and the cohort. Assuming a daily events table, aggregate to monthly action counts per user, then use LAG twice to compare the current month to the previous two. The query looks like: SELECT user_id FROM (SELECT user_id, DATE_TRUNC('month', event_ts) AS month, COUNT(*) AS actions, LAG(COUNT(*), 1) OVER (PARTITION BY user_id ORDER BY DATE_TRUNC('month', event_ts)) AS prev1, LAG(COUNT(*), 2) OVER (PARTITION BY user_id ORDER BY DATE_TRUNC('month', event_ts)) AS prev2 FROM events GROUP BY user_id, DATE_TRUNC('month', event_ts)) t WHERE actions < prev1 AND prev1 < prev2. Interviewers also probe whether you handle users who churned entirely (no row in the current month) — a left join from a calendar table, or a COALESCE over a generated user-month grid, is the rigorous answer. Mentioning the business follow-up — what threshold counts as a material decline, and how you would exclude seasonal drops — separates a senior DS from a junior one.
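
If the interviewer pivots to the Python round, the same logic is a short pandas exercise. Below is a minimal sketch for sanity-checking the query on a sample; the events, user_id, and event_ts names mirror the answer above, and the full user-month grid is what handles churned users.

    import pandas as pd

    # Hypothetical sample of the events table; user 1 declines, user 2 is flat.
    events = pd.DataFrame({
        "user_id": [1, 1, 1, 1, 1, 1, 2, 2, 2],
        "event_ts": pd.to_datetime([
            "2024-01-05", "2024-01-09", "2024-01-20",  # user 1: 3 events in Jan
            "2024-02-10", "2024-02-15",                # 2 in Feb
            "2024-03-02",                              # 1 in Mar
            "2024-01-07", "2024-02-11", "2024-03-20",  # user 2: 1 per month
        ]),
    })

    # Monthly action counts per user, like the GROUP BY in the query.
    monthly = (
        events.assign(month=events["event_ts"].dt.to_period("M"))
        .groupby(["user_id", "month"]).size().rename("actions")
    )

    # Densify onto a full user-month grid so churned months count as zero.
    grid = pd.MultiIndex.from_product(
        [events["user_id"].unique(),
         pd.period_range(events["event_ts"].min(), events["event_ts"].max(), freq="M")],
        names=["user_id", "month"],
    )
    monthly = monthly.reindex(grid, fill_value=0).reset_index()

    # LAG twice, then filter for a strict three-month decline.
    monthly["prev1"] = monthly.groupby("user_id")["actions"].shift(1)
    monthly["prev2"] = monthly.groupby("user_id")["actions"].shift(2)
    latest = monthly[monthly["month"] == monthly["month"].max()]
    print(latest[(latest["actions"] < latest["prev1"])
                 & (latest["prev1"] < latest["prev2"])]["user_id"].tolist())  # [1]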

Q2. Explain the difference between INNER JOIN, LEFT JOIN, and a self-join with a real DS example.

INNER JOIN keeps only rows present in both tables — useful when you need confirmed matches, like users who placed an order. LEFT JOIN keeps every row from the left table and fills nulls where the right side is missing — essential for funnel analysis where you want to count users who did not convert. A self-join joins a table to itself, typically aliased twice, to compare rows within the same entity — for example, finding pairs of users in the same household, or computing month-over-month deltas on databases that lack window functions. The DS-specific nuance is data quality: LEFT JOINs often inflate row counts when the right table has duplicates, so always verify uniqueness with a COUNT(DISTINCT) sanity check, as in the sketch below. In a real interview, I would walk through how an inner join silently dropped 12 percent of users in a churn analysis I once ran because the dimension table was incomplete, and how switching to a left join with a null bucket revealed an entire segment we had been ignoring.
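
The fan-out trap is cheap to guard against in code. A minimal pandas sketch, with illustrative fact and dim frames standing in for an event table and a dimension table:

    import pandas as pd

    fact = pd.DataFrame({"user_id": [1, 2, 2, 3], "orders": [1, 2, 1, 5]})
    dim = pd.DataFrame({"user_id": [1, 2, 3], "country": ["US", "DE", "BR"]})

    # A duplicated key on the right side would silently fan out the LEFT JOIN;
    # validate="many_to_one" makes pandas raise MergeError instead of inflating counts.
    joined = fact.merge(dim, on="user_id", how="left", validate="many_to_one")
    print(len(joined) == len(fact))  # True: row count preserved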

Q3. How do you compute a 7-day rolling average of daily revenue per product using window functions?

Rolling averages are bread-and-butter DS SQL. Use AVG with a ROWS BETWEEN frame: SELECT product_id, date, AVG(revenue) OVER (PARTITION BY product_id ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_7d FROM daily_revenue. Two traps are common in interviews. First, the frame must be ROWS, not RANGE, unless your date column is contiguous — RANGE on dates with gaps will silently behave differently across engines. Second, if days are missing in the source data (no sales on that product), the window will average over fewer than seven actual days. The rigorous fix is to densify with a calendar cross join before applying the window, so you average over a true seven-day window with zero-revenue days included. I would also mention that for very large tables, partitioning on product_id with ORDER BY date can be expensive — consider precomputing into a daily aggregate table if the dashboard refreshes hourly.
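
In pandas, the calendar densification plus the true seven-day window looks like the sketch below; the daily_revenue-style frame is illustrative and deliberately has a missing day.

    import pandas as pd

    # Illustrative daily revenue data with a gap on 2024-01-03.
    daily = pd.DataFrame({
        "product_id": ["A"] * 4,
        "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-04", "2024-01-05"]),
        "revenue": [100.0, 120.0, 90.0, 110.0],
    })

    # Densify onto a full calendar per product so missing days count as zero,
    # mirroring the calendar cross join described above.
    full_idx = pd.date_range(daily["date"].min(), daily["date"].max(), freq="D")
    dense = (
        daily.set_index("date")
        .groupby("product_id")["revenue"]
        .apply(lambda s: s.reindex(full_idx, fill_value=0.0))
        .rename_axis(["product_id", "date"])
        .reset_index(name="revenue")
    )

    # 7-day window over a gap-free calendar (early days are partial on purpose).
    dense["rolling_7d"] = (
        dense.groupby("product_id")["revenue"]
        .transform(lambda s: s.rolling(7, min_periods=1).mean())
    )
    print(dense)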

Q4. How would you optimize a slow query that joins a 10-billion-row events table with a 100-million-row users table?

First, profile before optimizing — pull the EXPLAIN plan and identify whether the bottleneck is a full scan, a hash spill, or a shuffle in a distributed engine. Common wins include filtering events down by date partition before the join, broadcasting the smaller table if your engine supports it (BigQuery, Spark, and Snowflake all do), and ensuring the join key is the partition or cluster key on at least one side. If the events table is partitioned by date and clustered by user_id, the join becomes a bucketed merge instead of a full shuffle. Second, materialize intermediate results — if you join events to users multiple times in a CTE, compute the join once into a temp table. Third, push down aggregations — if you only need user-level counts, GROUP BY user_id on events first, then join the much smaller aggregate to the user dimension. The DS-specific framing in an interview is to mention cost: a 10-billion-row scan on BigQuery is real money, so query optimization is also FinOps.
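
A sketch of the same three moves in PySpark. The table names, the event_date partition column, and the date filter are all illustrative assumptions:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    events = spark.read.table("warehouse.events")  # 10B rows, partitioned by event_date
    users = spark.read.table("warehouse.users")    # 100M rows

    # 1. Prune partitions before the join, never after it.
    recent = events.filter(F.col("event_date") >= "2024-01-01")

    # 2. Push the aggregation below the join: shrink billions of events
    #    to at most one row per user before touching the dimension.
    per_user = recent.groupBy("user_id").agg(F.count("*").alias("event_count"))

    # 3. Broadcast only when the dimension genuinely fits in executor memory;
    #    a 100M-row users table usually does not, so let the engine run a
    #    sort-merge join on the clustered key. F.broadcast(users) is the hint
    #    for dimensions that do fit.
    result = per_user.join(users, "user_id", "left")
    result.explain()  # read the plan before and after every change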

Q5. Write a query to find the top three categories by revenue in each region, broken down by month.

This is a top-N-per-group question, the canonical use case for ROW_NUMBER. The query: WITH ranked AS (SELECT region, month, category, SUM(revenue) AS rev, ROW_NUMBER() OVER (PARTITION BY region, month ORDER BY SUM(revenue) DESC) AS rn FROM sales GROUP BY region, month, category) SELECT * FROM ranked WHERE rn <= 3. Use ROW_NUMBER for a strict top three, RANK if ties should all be included, and DENSE_RANK if you want the top three distinct revenue values regardless of ties. The follow-up I always bring up unprompted is edge cases: if revenue can be negative due to refunds, ORDER BY DESC still ranks correctly, but interviewers reward you for flagging it. Also worth noting: if month is stored as a string, the partition still works, but you cannot easily compute month-over-month rank deltas; convert it to a date for that.
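
A pandas cross-check of the same top-N-per-group logic, on an illustrative sales frame: sort once, then take the head of each group.

    import pandas as pd

    sales = pd.DataFrame({
        "region": ["NA", "NA", "NA", "NA", "EU", "EU"],
        "month": ["2024-01"] * 6,
        "category": ["toys", "books", "tools", "garden", "toys", "books"],
        "revenue": [500.0, 300.0, 200.0, 100.0, 250.0, 400.0],
    })

    top3 = (
        sales.groupby(["region", "month", "category"], as_index=False)["revenue"].sum()
        .sort_values("revenue", ascending=False)
        .groupby(["region", "month"])
        .head(3)  # strict top three per group, like ROW_NUMBER ... WHERE rn <= 3
    )
    print(top3)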

Q6. How do you detect and handle duplicate rows in an events table without losing real duplicate events?

Duplicates fall into two classes: ingestion duplicates (the same event written twice by an at-least-once pipeline) and legitimate repeats (a user really did click twice). Distinguishing them requires a unique event identifier — most modern event schemas include an event_id UUID for exactly this reason. Use SELECT event_id, COUNT(*) FROM events GROUP BY event_id HAVING COUNT(*) > 1 to detect ingestion dupes, then deduplicate with ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY ingestion_ts) and keep rn = 1. If event_id is missing, you fall back to a composite key like user_id plus event_type plus second-truncated timestamp, but this is lossy — two genuine clicks in the same second collapse to one. Always communicate this trade-off to the interviewer. The downstream impact matters: deduplicating a clickstream wrong can shift conversion metrics by several percentage points, which is exactly the kind of silent data bug that makes A/B tests untrustworthy.
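
The same two steps in pandas, on an illustrative frame containing one ingestion duplicate:

    import pandas as pd

    # Illustrative events with one ingestion duplicate: e1 written twice.
    events = pd.DataFrame({
        "event_id": ["e1", "e1", "e2", "e3"],
        "user_id": [1, 1, 1, 2],
        "ingestion_ts": pd.to_datetime([
            "2024-01-01 00:00:01", "2024-01-01 00:00:05",
            "2024-01-01 00:00:02", "2024-01-01 00:00:03",
        ]),
    })

    # Detect: same check as GROUP BY event_id HAVING COUNT(*) > 1.
    print(events[events.duplicated("event_id", keep=False)]["event_id"].unique())  # ['e1']

    # Dedup keeping the earliest ingestion, mirroring ROW_NUMBER ... keep rn = 1.
    deduped = events.sort_values("ingestion_ts").drop_duplicates("event_id", keep="first")
    print(len(deduped))  # 3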

Statistics Q&A

A/B test design, sample size, p-values, and confidence intervals applied to real product decisions.

Q1. How would you design an A/B test to evaluate a new ranking algorithm on a feed?

Start by defining the primary metric, ideally one tied to long-term value rather than the click — sessions per user per week, or seven-day retention, are common. State the hypothesis crisply: the new ranker increases sessions per user by at least two percent over a four-week test window. Compute the sample size using the existing baseline mean and variance, the desired minimum detectable effect, alpha of 0.05, and power of 0.8. For a metric with high variance like sessions, this often means tens of millions of users — say so explicitly. Randomize at the user level, not the session level, to avoid contamination from a single user being exposed to both arms. Pre-register guardrail metrics — latency, error rate, content diversity — so a regression on those can stop the test even if the primary metric improves. Run for at least one full weekly cycle to absorb day-of-week effects, and apply CUPED or stratification to reduce variance if the platform supports it. Finally, plan the analysis before launch: a single primary metric test, Bonferroni-corrected secondary metrics, and a documented decision rule.
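
CUPED is worth being able to sketch on demand. A minimal simulation: the covariate x stands in for a pre-experiment metric such as last month's sessions, and all numbers are synthetic.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000

    # Pre-experiment covariate and in-experiment metric, deliberately correlated.
    x = rng.gamma(shape=2.0, scale=3.0, size=n)
    y = 0.8 * x + rng.normal(0.0, 2.0, size=n)

    # CUPED: theta = cov(X, Y) / var(X), then Y' = Y - theta * (X - mean(X)).
    theta = np.cov(x, y)[0, 1] / x.var()
    y_cuped = y - theta * (x - x.mean())

    # Same mean, much lower variance, so the same MDE needs far fewer users.
    print(f"mean unchanged: {y.mean():.3f} vs {y_cuped.mean():.3f}")
    print(f"variance reduction: {1 - y_cuped.var() / y.var():.1%}")  # ~74%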

Q2. What is a p-value, what is it not, and how do you explain it to a product manager?

A p-value is the probability of observing data at least as extreme as what was observed, assuming the null hypothesis is true. It is not the probability that the null is true, not the probability the result is due to chance, and not a measure of effect size. To a PM, I would phrase it as: if our new feature truly does nothing, how surprised would we be by the lift we measured? A p-value of 0.03 means: if the feature did nothing, we would see a lift this big or bigger only three times in a hundred — surprising enough that we usually act on it. The honest follow-up is that p-values say nothing about how big the effect is or whether it matters to the business. A 0.1 percent lift can be statistically significant in a billion-user test and economically irrelevant. I always pair the p-value with a confidence interval and a practical-significance threshold the team agreed on before the test.
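
The definition is also checkable with a ten-line simulation: simulate worlds where the feature truly does nothing and count how often a gap as large as the observed one appears (all numbers illustrative).

    import numpy as np

    rng = np.random.default_rng(42)
    n, p_null = 10_000, 0.05       # users per arm, shared true conversion rate
    observed_diff = 0.006          # the lift we measured in the real test

    # 20,000 null worlds: both arms draw from the same conversion rate.
    sims = 20_000
    a = rng.binomial(n, p_null, size=sims) / n
    b = rng.binomial(n, p_null, size=sims) / n

    # Two-sided p-value: how often the null produces a gap at least this extreme.
    print(f"p ~ {np.mean(np.abs(b - a) >= observed_diff):.3f}")  # ~0.05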

Q3. How do you compute the required sample size for an A/B test on a binary conversion metric?

For a two-proportion z-test, the standard formula is n per arm equals two times (z_alpha/2 plus z_beta) squared times p times one-minus-p divided by delta squared, where p is the baseline conversion rate and delta is the minimum detectable absolute lift. For a baseline of 5 percent conversion and a target lift of 0.5 percentage points, alpha 0.05 two-sided, power 0.8, that comes out to roughly 30,000 users per arm. The DS judgment call is choosing the minimum detectable effect: too small and the test runs forever, too large and you miss real but modest wins. I tie the MDE to business value — what lift would justify shipping this — rather than picking a number out of thin air. For ratio metrics (revenue per user) the formula uses variance instead of p times one-minus-p, and for skewed distributions like revenue I prefer a bootstrap or a delta method to get honest confidence intervals.
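
The arithmetic is worth carrying into the interview as code. A direct translation of the formula above, reproducing the rough 30,000 figure:

    from scipy.stats import norm

    p = 0.05            # baseline conversion rate
    delta = 0.005       # minimum detectable absolute lift (0.5 pp)
    alpha, power = 0.05, 0.80

    z_alpha = norm.ppf(1 - alpha / 2)   # 1.96
    z_beta = norm.ppf(power)            # 0.84

    n_per_arm = 2 * (z_alpha + z_beta) ** 2 * p * (1 - p) / delta ** 2
    print(round(n_per_arm))  # ~29,800 per arm, i.e. "roughly 30,000"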

Q4. Explain the difference between a confidence interval and a credible interval.

A 95 percent frequentist confidence interval means: if we repeated this experiment many times, 95 percent of the intervals we constructed would contain the true parameter. It is a statement about the procedure, not about the parameter itself — given a single interval, the parameter is either in it or not. A 95 percent Bayesian credible interval means: given the data and our prior, there is a 95 percent probability the parameter lies in this interval. It is the statement most stakeholders intuitively want. In practice they often agree numerically when priors are weak and sample sizes are large, but they diverge with strong priors or small samples. In a DS interview I would mention that for product decisions I prefer Bayesian framing because it answers the question executives actually ask — what is the probability the new variant is better — without the awkward frequentist hedge.
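
A Beta-Binomial sketch makes the contrast concrete. With a flat prior the posterior has closed form, and the probability-B-beats-A question is one line of sampling; the counts are illustrative.

    import numpy as np
    from scipy.stats import beta

    conv_a, n_a = 480, 10_000   # control conversions / users
    conv_b, n_b = 540, 10_000   # treatment

    # Beta(1, 1) flat prior + binomial likelihood -> Beta posterior in closed form.
    post_a = beta(1 + conv_a, 1 + n_a - conv_a)
    post_b = beta(1 + conv_b, 1 + n_b - conv_b)

    # 95% credible interval: a probability statement about the parameter itself.
    print(post_b.interval(0.95))

    # The question executives actually ask: P(variant B is better than A).
    rng = np.random.default_rng(7)
    draws_a = post_a.rvs(100_000, random_state=rng)
    draws_b = post_b.rvs(100_000, random_state=rng)
    print((draws_b > draws_a).mean())  # ~0.97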

Q5. When would you use a non-parametric test instead of a t-test?

Use non-parametric tests when the assumptions of the t-test break down — heavy skew, heavy tails, small samples that prevent invoking the central limit theorem, or ordinal data without a meaningful mean. Revenue per user is the classic example: a long right tail driven by whales means the sample mean has high variance and the t-test confidence interval can be misleading. The Mann-Whitney U test compares distributions without assuming normality, and the bootstrap is even more flexible — resample the data, recompute the difference, and read off the percentile interval. For paired data, the Wilcoxon signed-rank test is the non-parametric analog of the paired t-test. The trade-off is power: when assumptions hold, the t-test is more powerful; when they fail badly, non-parametric methods are more honest. In production A/B testing I default to t-tests with CUPED for regular metrics and bootstrap for revenue or session-length metrics with heavy tails.
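
A percentile bootstrap for a heavy-tailed revenue metric takes a few lines; the zero-inflated exponential below is a crude synthetic stand-in for real revenue-per-user data.

    import numpy as np

    rng = np.random.default_rng(0)

    # ~90% zeros plus an exponential tail: crudely whale-shaped revenue per user.
    control = rng.exponential(5.0, 5_000) * rng.binomial(1, 0.1, 5_000)
    treatment = rng.exponential(5.5, 5_000) * rng.binomial(1, 0.1, 5_000)

    # Percentile bootstrap: resample, recompute the lift, read off the interval.
    boots = np.empty(10_000)
    for i in range(boots.size):
        c = rng.choice(control, control.size, replace=True)
        t = rng.choice(treatment, treatment.size, replace=True)
        boots[i] = t.mean() - c.mean()

    lo, hi = np.percentile(boots, [2.5, 97.5])
    print(f"95% bootstrap CI for the lift: [{lo:.3f}, {hi:.3f}]")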

ML System Design Q&A

Recommender systems, fraud detection, search ranking, and content moderation at scale.

Q1. Design a recommendation system for a streaming platform with 100 million users and 50 million items.

Start by clarifying the objective: are we optimizing for watch time, completion rate, retention, or content diversity? The metric drives every downstream choice. The standard architecture is a two-stage funnel — a candidate generator that retrieves a few hundred items from millions, followed by a heavier ranker that orders them. For retrieval, two-tower neural networks trained with contrastive loss are the modern default: a user tower encodes recent watch history and demographics, an item tower encodes title metadata and embeddings, and the inner product approximates affinity. Index the item embeddings in an ANN store like ScaNN or FAISS for sub-10ms lookup. For ranking, run a gradient-boosted model or a wide-and-deep network over the few hundred candidates, combining click probability and predicted watch duration into a multi-objective score. Cover the cold start path explicitly: new users get popularity-weighted recommendations within their demographic, new items get content-based retrieval until they accumulate engagement. Discuss feedback loops — popular items get more exposure, get more engagement, get even more exposure — and the mitigations: exploration via Thompson sampling or off-policy correction. Close with the metrics framework: offline NDCG and recall at K, online A/B on retention and watch time, with diversity and freshness as guardrails.
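
The retrieval stage is easy to prototype. A minimal sketch with FAISS, where random vectors stand in for trained two-tower embeddings; at 50 million items the exact IndexFlatIP would be swapped for a true ANN index (IVF, HNSW, or ScaNN).

    import numpy as np
    import faiss  # pip install faiss-cpu

    d, n_items = 64, 100_000
    rng = np.random.default_rng(0)

    # Stand-ins for item-tower embeddings, refreshed offline on a schedule.
    item_embs = rng.standard_normal((n_items, d)).astype("float32")

    index = faiss.IndexFlatIP(d)   # exact inner-product search
    index.add(item_embs)

    # Online path: user-tower embedding in, a few hundred candidates out.
    user_emb = rng.standard_normal((1, d)).astype("float32")
    scores, candidate_ids = index.search(user_emb, 500)
    print(candidate_ids.shape)  # (1, 500), handed to the heavier ranker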

Q2. Design a fraud detection system for a payments platform processing 10,000 transactions per second.

Fraud detection is a real-time, highly imbalanced, adversarial problem — three properties that shape every design choice. The architecture has three layers. First, deterministic rules — velocity checks, blocklists, geo-impossible patterns — catch obvious fraud at single-digit milliseconds and provide explainability for compliance. Second, a real-time ML model scores every transaction in under 100ms — gradient-boosted trees on hundreds of features (transaction amount, merchant category, device fingerprint, hours-since-last-transaction, historical user behavior). Features come from a feature store with both online (low-latency Redis or DynamoDB) and offline (batch) materialization to guarantee training-serving consistency. Third, a slower graph-based model runs on a sliding window to detect coordinated rings — accounts sharing devices, addresses, or money flow. On imbalance: positive rates are below 0.5 percent, so the right metric is precision-at-K or recall at a fixed precision target, never raw accuracy. Use focal loss or class weighting and calibrate probabilities post-hoc. On adversarial drift: fraudsters adapt within days, so retrain weekly, monitor PSI on key features daily, and shadow-deploy challenger models. The action layer matters too: high-risk transactions are declined, medium-risk go to step-up auth, low-risk pass — three tiers, each with measurable false-positive and false-negative cost.
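
Evaluation is where fraud interviews get concrete. A minimal precision-at-K sketch on synthetic imbalanced data; make_classification is a stand-in for real transaction features.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split

    # ~0.5% positives, mimicking fraud base rates.
    X, y = make_classification(n_samples=50_000, n_features=20,
                               weights=[0.995], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    model = GradientBoostingClassifier().fit(X_tr, y_tr)
    scores = model.predict_proba(X_te)[:, 1]

    # Precision at K: quality of the K transactions we can afford to act on.
    # Raw accuracy would be ~99.5% even for a model that flags nothing.
    k = 200
    top_k = np.argsort(scores)[::-1][:k]
    print(f"precision@{k}: {y_te[top_k].mean():.2%}")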

Q3. Design a search ranking system for a marketplace with hundreds of millions of listings.

Search ranking is a learning-to-rank problem layered on top of an inverted-index retrieval. Stage one is retrieval: an inverted index (Elasticsearch or a custom Lucene-based system) returns the top few thousand listings matching the query terms, augmented by semantic retrieval via a dense embedding ANN store for queries the lexical index misses. Stage two is ranking: a gradient-boosted LambdaMART or a deep neural ranker scores those candidates on hundreds of features — query-listing text match, listing quality (reviews, photos, response rate), personalization (user history, location), and listing freshness. Train on click and conversion data using pairwise or listwise loss. Stage three is business logic and diversification — boost geographic relevance, ensure category diversity in the top ten, apply marketplace-specific constraints. The unique DS challenges here are position bias (users click top results because they are top, not because they are best) which requires inverse propensity weighting or randomized exploration; and fairness across sellers which requires explicit exposure constraints. Evaluate offline on NDCG and online on conversion rate, search-to-purchase, and a satisfaction proxy like dwell time on the listing page.
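
Position-bias correction via inverse propensity weighting fits in a few lines. The propensities below would come from a small randomization experiment; every number is illustrative.

    import numpy as np

    # Examination propensities by display position, estimated from randomized
    # traffic: position 1 is examined ten times more often than position 10.
    propensity = np.array([1.0, 0.62, 0.45, 0.33, 0.26,
                           0.21, 0.17, 0.14, 0.12, 0.10])

    # Logged training data: position each listing was shown at, and the click.
    positions = np.array([0, 0, 1, 3, 5, 9, 2, 0, 4, 1])
    clicks = np.array([1, 0, 1, 0, 1, 1, 0, 1, 0, 0])

    # IPW: up-weight clicks at rarely examined positions so the ranker is not
    # just taught to reproduce whatever the old ranker already put on top.
    weights = clicks / propensity[positions]
    print(weights.round(2))  # a click at position 10 counts 10x one at position 1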

Q4. Design a content moderation ML system for a social platform with 500 million daily uploads.

Content moderation is multi-modal (text, image, video, audio), multi-policy (violence, hate, spam, sexual content, misinformation), and high-stakes — false negatives harm users, false positives erode trust. The pipeline has three tiers. Tier one is real-time, low-cost screening on every upload — perceptual hashing against known-bad content, lightweight CNN classifiers, keyword filters — designed for sub-second decisions and 99.9 percent uptime. Tier two is heavier ML for ambiguous content: vision-language models for image and video, fine-tuned LLMs for text, with separate per-policy heads since training a single multi-policy model dilutes signal. Tier three is human review for the highest-uncertainty content, with the human labels feeding back into the next training cycle. Critical design choices: action thresholds are per-policy and per-region (legal definitions vary), model outputs feed actions on a graded scale (remove, demote, label, age-gate), and an appeals path is mandatory. On evaluation: precision and recall by policy, with a focus on rare but catastrophic categories, and per-language and per-demographic slicing to catch fairness gaps. Monitor drift aggressively — adversaries actively probe the system, so weekly retrains on red-team examples are standard. The honest framing in an interview is that no system is perfect at this scale; the goal is to balance harm reduction against expression and to make the trade-offs auditable.
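
The graded action layer is worth sketching explicitly. In the minimal sketch below, the policies, regions, and every threshold are illustrative placeholders for values that policy and legal teams would actually set.

    # Per-policy, per-region thresholds; all numbers are illustrative.
    THRESHOLDS = {
        ("violence", "EU"): {"remove": 0.95, "demote": 0.80, "label": 0.60},
        ("spam", "EU"): {"remove": 0.90, "demote": 0.70, "label": 0.50},
    }

    def route(policy: str, region: str, score: float) -> str:
        """Map one policy head's score to a graded action."""
        t = THRESHOLDS[(policy, region)]
        if score >= t["remove"]:
            return "remove"
        if score >= t["demote"]:
            return "demote"
        if score >= t["label"]:
            return "label"
        return "allow"  # still eligible for random sampling into human review

    print(route("spam", "EU", 0.74))  # demote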

Product Case Q&A

Metric definition, root cause analysis, and ROI estimation — the product judgment rounds.

Q1. A PM asks you to define success for a new in-app messaging feature. How do you choose the metric?

Metric definition is the single highest-leverage skill in product DS. I start by asking what the feature is supposed to do for the user — connect them faster with friends, drive richer conversations, replace a worse external channel — because the metric should map to the user benefit. From there I propose a three-tier framework. The North Star is the metric most directly tied to long-term value — for messaging, that is usually weekly active sender-receiver pairs, since both ends are required for value. Tier two is input metrics that move the North Star — messages sent per active user, reply rate, time-to-first-reply. Tier three is guardrails — overall app retention, time spent on other surfaces, notification opt-out rate. I would explicitly reject vanity metrics like total messages sent, which can be gamed by spammy notifications, and propose a sensitivity analysis showing how each metric responds to plausible changes. The final piece is committing to a measurement plan before launch — what is the success threshold, how long do we wait, what would make us roll back — so the metric drives a decision instead of post-hoc rationalization.

Q2. Daily active users dropped 4 percent week-over-week. Walk me through how you would diagnose it.

I structure the investigation as a funnel of hypotheses, eliminating the cheapest first. Step one: confirm the metric is real — check for data pipeline issues, schema changes, recent deploys that affected event logging, and time-zone artifacts. About a third of overnight metric drops in my experience turn out to be instrumentation. Step two: segment the drop. Slice by platform (iOS, Android, web), country, app version, user tenure, and acquisition channel. A drop concentrated in one segment points to a localized cause — a specific app version, a regional outage, a paid-channel quality drop. Step three: correlate with internal events — recent releases, experiment ramp-ups, infrastructure incidents, marketing changes. Step four: external context — a competitor launch, news cycle, holiday, or seasonal pattern. Step five, if there is no smoking gun: build a counterfactual from historical data with day-of-week, holiday, and seasonality controls, and test whether the drop is statistically beyond normal noise. Communicate continuously with the PM and engineering — diagnosis is iterative, and the worst outcome is a quiet investigation that finds the answer two days late.
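
Step five is the only part that needs code. A minimal counterfactual check comparing today against same-weekday history, with synthetic numbers standing in for a real DAU series:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(1)

    # Eight weeks of DAU with a weekly pattern plus noise, then today's reading.
    base = np.tile([100, 98, 102, 99, 101, 97, 100], 8) + rng.normal(0, 1.5, 56)
    dau = pd.Series(np.append(base, 93.0),
                    index=pd.date_range("2024-01-01", periods=57, freq="D"))

    # Compare today only against the same weekday to control day-of-week effects.
    today = dau.index[-1]
    history = dau[:-1]
    same_weekday = history[history.index.dayofweek == today.dayofweek]

    z = (dau.iloc[-1] - same_weekday.mean()) / same_weekday.std()
    print(f"z vs same-weekday baseline: {z:.1f}")  # beyond ~3: not normal noise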

Q3. How would you estimate the ROI of a new model that improves recommendation click-through rate by 3 percent?

ROI estimation forces you to translate model gains into business dollars and engineering cost. On the gain side, model the funnel: a 3 percent relative CTR lift, multiplied by current daily impressions, multiplied by the conversion rate from click to monetizable action, multiplied by revenue per action, summed over the relevant time horizon. Apply a discount for the difference between A/B-test lift and long-run lift, since novelty effects fade and downstream surfaces partially cannibalize. On the cost side, count engineering time to ship, ongoing infra cost (the new model may need bigger instances or more frequent retraining), and the risk-adjusted cost of regressions on guardrails like latency or content diversity. The output is a range, not a point estimate — a tenth-percentile lower bound, an expected value, and a ninetieth-percentile upper bound. I would also compute the payback period and compare against the next-best alternative use of the same engineering quarter, because ROI in isolation is meaningless without opportunity cost. The final deliverable is a one-page memo with the numbers, the assumptions, and the sensitivities, so the leader making the ship decision can override any assumption and see the answer recompute.
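
The funnel arithmetic fits in a dozen lines, which is exactly how I would present it; every input below is an explicit assumption the decision-maker can override.

    # Back-of-envelope ROI for a 3% relative CTR lift; every number is an
    # assumption to be replaced with real funnel data.
    daily_impressions = 50_000_000
    baseline_ctr = 0.02
    ctr_lift = 0.03               # relative lift from the A/B test
    click_to_action = 0.10
    revenue_per_action = 1.50
    longrun_discount = 0.60       # novelty fade + cannibalization haircut
    horizon_days = 365

    incremental_clicks = daily_impressions * baseline_ctr * ctr_lift  # 30,000/day
    daily_gain = (incremental_clicks * click_to_action
                  * revenue_per_action * longrun_discount)
    annual_gain = daily_gain * horizon_days

    annual_cost = 400_000         # eng time + serving infra, illustrative
    print(f"expected annual gain: ${annual_gain:,.0f}")            # ~$985,500
    print(f"payback period: {annual_cost / daily_gain:.0f} days")  # ~148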

Q4. A new feature shows a 1 percent lift in the primary metric but flat retention. Do you ship it?

This is a classic short-term-vs-long-term tension, and the answer is rarely a clean yes or no. First, check statistical significance and confidence intervals on both metrics — flat retention with a wide CI may hide a real movement in either direction. Second, examine the mechanism: does the primary lift come from existing engaged users doing more, or from a one-time surge that will decay? Heterogeneous treatment effect analysis across user segments often reveals that the lift concentrates in power users while marginal users are unaffected or harmed. Third, check guardrails I did not run the test against — support tickets, app store rating, opt-outs — for soft signals retention is too noisy to catch. Fourth, weigh the cost of the feature against the primary lift on its own merits. My default position is to ship if the primary lift is meaningful, the mechanism is healthy, and there are no negative guardrail signals, while flagging that retention is the metric to watch in the next quarter. If the primary lift is small, or the mechanism looks like attention-hijacking that boosts the metric without genuine value, I argue against shipping even with statistical significance.

Behavioral Q&A

Cross-functional disagreement, executive communication, and projects you would do differently.

Q1. Tell me about a time you disagreed with an engineer or PM on a model or metric decision.

The framing I recommend is situation, my position, the other side, the resolution, and the lesson. A real example: I once pushed back on a PM who wanted to ship a model with a 4 percent lift on click-through but a measurable drop in 30-day retention in one user segment. The PM wanted to ship because the launch deadline was tied to a marketing campaign. I disagreed because the segment harmed was our highest-LTV cohort, and the retention loss in dollar terms outweighed the click lift. I documented the analysis with confidence intervals and a counterfactual, walked the PM through it, and proposed a compromise — ship to the segments where the model was clean, hold out the harmed segment, and iterate the model in parallel. The PM agreed. The lesson I took away is that disagreement is most productive when you arrive with the analysis already done, propose a path forward, and frame the trade-off in the PM's metric — campaign timing risk versus long-term LTV — rather than insisting on the pure DS view. Strong opinions, weakly held, written down.

Q2. Describe a time you had to communicate a complex statistical result to a non-technical executive.

I had to present a multi-armed bandit experiment result to a CFO who was skeptical of any analysis that did not look like a P&L. The challenge was that the bandit had non-uniform exposure across arms, so a naive comparison would mislead. Instead of explaining inverse propensity weighting, I built three artifacts. One, a single-sentence answer — variant B is the winner with 92 percent probability and an expected 6 percent revenue lift. Two, a chart showing the expected revenue under each arm with a credible interval, no p-values or test names. Three, a bullet list of risks — small sample on weekends, novelty effect possible, recommend confirmation in a frozen-allocation A/B test. The CFO asked one clarifying question — what evidence would change my conclusion that B is the winner — and approved the rollout. The lesson is that translation is not dumbing down; it is choosing the smallest set of information sufficient for a decision, and being explicit about what would falsify your conclusion. Executives reward candor about uncertainty more than they reward false confidence.

Q3. Tell me about a project where you would do something differently in retrospect.

I led a churn-prediction model build for a subscription product. The model performed well offline — strong AUC, calibrated probabilities — and the retention team was excited. We shipped it as the input to a discount-targeting system. Three months in, the retention lift was about half of what we had predicted, and the discount budget was overrun. The post-mortem revealed two mistakes. First, I had optimized for prediction accuracy rather than uplift — predicting who churns is not the same as predicting who can be saved by a discount. Many high-churn-risk users would have churned regardless. Second, I had not partnered closely with the retention team on the action policy, so we targeted users the model scored as high-risk rather than users who would actually respond to the discount. In retrospect I would have framed the problem as causal uplift modeling from the start, run a small holdout at every stage to measure incrementality rather than prediction quality, and built the action policy and the model in the same sprint. The broader lesson is that DS deliverables are only as good as the decision they drive, and the decision context belongs in the problem definition, not the post-launch retrospective.

Practice these questions live with PhantomCode

Reading answers is not the same as delivering them under pressure. PhantomCode runs an undetectable interview copilot that listens to the live interview audio, transcribes the question, and surfaces a structured answer for you in real time — SQL queries, system design diagrams, statistical reasoning, and product-case frameworks, all tailored to a data scientist loop.

Try the Interview Copilot

Looking for other roles? Explore our guides for Google, Apple, and other companies in our full interview questions library.