Question 1

Explain the difference between L1 and L2 regularization. When would you choose each?

Accepted Answer

L1 (lasso) adds the sum of absolute values of the weights to the loss; L2 (ridge) adds the sum of squared weights. The geometric intuition is the cleanest way to explain why they behave differently. L1’s constraint region is a diamond (a cross-polytope in higher dimensions): when the unregularized optimum lies outside that region, the constrained optimum tends to land on a corner or edge, which means several coefficients are exactly zero. L2’s constraint region is a circle (or hypersphere), so the optimum lands on a smooth surface and coefficients shrink toward zero but rarely become exactly zero. Practically: L1 wins when you have many features, you suspect most are noise, and you want the model to do feature selection for you, as in high-dimensional sparse problems like genomics, text bag-of-words, or large categorical feature spaces. L2 wins as a general-purpose overfitting control where you have no strong sparsity prior and you want every feature to contribute a little. Elastic net combines both with a mixing parameter, which is what you reach for when L1 is unstable across resamples (it tends to arbitrarily pick one of a group of correlated features). One detail interviewers like: the L1 penalty is non-differentiable at zero, so vanilla gradient descent is not well defined there. Plain subgradient methods do work, but they converge slowly and rarely land on exact zeros; in practice you use coordinate descent or proximal gradient (soft-thresholding), which do produce exact zeros. That is why scikit-learn’s Lasso uses coordinate descent while its Ridge can use closed-form and other smooth solvers.
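To make the "exact zeros versus smooth shrinkage" point concrete, here is a minimal sketch of the two proximal steps. The function names `soft_threshold` and `ridge_shrink` and the example weights are illustrative assumptions, not part of any library.

```python
import numpy as np

def soft_threshold(w, t):
    """Proximal operator of t * ||w||_1: shrink each weight by t, clipping at zero."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def ridge_shrink(w, t):
    """Proximal operator of (t / 2) * ||w||_2^2: scale every weight by 1 / (1 + t)."""
    return w / (1.0 + t)

# Hypothetical weight vector with two large and two small entries.
w = np.array([3.0, -0.4, 0.2, -2.0])
l1 = soft_threshold(w, 0.5)  # small entries are set exactly to zero
l2 = ridge_shrink(w, 0.5)    # every entry shrinks, none reaches zero
```

With a threshold of 0.5, the two small weights become exactly zero under the L1 step, while the L2 step keeps all four weights nonzero and only rescales them.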

Question 2

What is dropout and why does it work? When does it hurt?

Accepted Answer

Dropout randomly zeroes activations during training with probability p, then scales the remaining activations by 1/(1-p) so the expected magnitude stays the same; at inference it is a no-op. The standard intuition is that each forward pass trains a different sub-network, so the full network at inference behaves like an approximate ensemble average over exponentially many thinned networks, and ensembles reduce variance. A more honest framing is that dropout penalizes co-adaptation: a neuron cannot rely on any specific other neuron always being present, so it must learn features that are useful on their own. When it hurts: models that are already underfitting (too little capacity, or already heavily regularized), where you are throwing away signal; convolutional layers, where adjacent activations are highly correlated and zeroing a single activation barely changes anything (DropBlock or SpatialDropout work better); and the well-known interaction with batch normalization. When dropout sits before BN, the BN running statistics are accumulated at train time on activations that include dropped-and-rescaled units, but at test time dropout is off, so the variance BN sees shifts between train and test. The usual mitigation is to place dropout after BN, or skip dropout entirely in BN-heavy architectures. Modern transformers use dropout sparingly, typically only on attention weights and on the residual path, because layer norm plus large pretraining datasets already controls overfitting. If a candidate just says “dropout always helps,” they have not deployed enough models.
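The mask-and-rescale mechanism described above can be sketched in a few lines of NumPy. This is an illustrative version of inverted dropout under the stated definition; the function name `dropout` and its signature are assumptions for the example, not a framework API.

```python
import numpy as np

def dropout(x, p, training, rng):
    """Inverted dropout: zero each activation with probability p during training
    and rescale survivors by 1 / (1 - p); return x unchanged at inference."""
    if not training or p == 0.0:
        return x
    keep = rng.random(x.shape) >= p  # True with probability 1 - p
    return x * keep / (1.0 - p)

rng = np.random.default_rng(0)
x = np.ones((1000, 100))
train_out = dropout(x, p=0.3, training=True, rng=rng)
eval_out = dropout(x, p=0.3, training=False, rng=rng)
```

Because the rescaling happens at train time, the mean activation stays close to its original value during training and the inference path needs no correction.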

Question 3

Explain the bias-variance trade-off in concrete terms, with a deployment scenario.

Accepted Answer

Bias is the systematic error from a model that is too simple to capture the true relationship — it is wrong in a consistent direction. Variance is the sensitivity of the model’s predictions to which specific training set it saw — a high-variance model would fit a different shape if you resampled the data. Total expected error decomposes into bias squared plus variance plus irreducible noise; you are trading one against the other when you pick model complexity. Concrete scenario: you are predicting click-through rate for an ads ranker. A single global average — “average CTR is 1.2%” — is extreme high bias, near-zero variance. It is wrong for almost every (user, ad) pair but it does not move around. At the other end, a model with millions of user-specific weights and no regularization will fit the training set perfectly but predict wildly for any user with few historical impressions — high variance. The middle is what you want: a model with enough capacity to learn user and ad features but with regularization, hierarchical priors (shrinkage toward user-segment averages), or enough data so variance is naturally controlled. The deployment payoff: bias-variance also tells you when more data will help. If your model is high bias (training and validation error both bad), more data does nothing — you need a bigger model or better features. If it is high variance (training error low, validation error high), more data is the cleanest fix.
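For reference, the decomposition mentioned above can be written out for squared-error loss. This assumes the usual setup, which the answer does not spell out: y = f(x) + ε with zero-mean noise of variance σ², and a model \hat f_D fit on a random training set D.

```latex
\mathbb{E}_{D,\varepsilon}\!\left[\big(y - \hat f_D(x)\big)^2\right]
= \underbrace{\big(f(x) - \mathbb{E}_D[\hat f_D(x)]\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}_D\!\left[\big(\hat f_D(x) - \mathbb{E}_D[\hat f_D(x)]\big)^2\right]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The identity is exact only for squared error; for log loss, as in a CTR model, the same intuition applies but the terms do not separate this cleanly.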

Question 4

Implement gradient descent for logistic regression from scratch, then discuss when to use SGD vs Adam vs LBFGS.

Accepted Answer

Logistic regression predicts p = sigmoid(X·theta), where sigmoid(z) = 1/(1+exp(-z)). The loss is binary cross-entropy: L = -(1/N) · sum( y·log(p) + (1-y)·log(1-p) ). The gradient with respect to theta works out cleanly to (1/N) · X^T · (p - y). This is the only derivation you need to write on the board: the sigmoid derivative p·(1-p) cancels against the 1/p and 1/(1-p) factors that come from differentiating the log terms. The update rule is theta := theta - lr · grad. In code, that is: for each epoch, compute p = sigmoid(X @ theta), grad = X.T @ (p - y) / N, theta -= lr * grad. Add L2 regularization by adding lambda · theta to the gradient (skipping the bias term). Optimizer choice in practice: SGD with momentum is still a common default for very large-scale deep learning where the model fits in memory but the dataset does not. It often generalizes slightly better than adaptive methods on vision benchmarks, and the noise from mini-batches acts as implicit regularization. Adam is what almost everyone reaches for first because it converges fast with default hyperparameters (lr=1e-3, betas=(0.9, 0.999)) and is forgiving about poorly scaled gradients. Its catch is weight decay: if you fold an L2 term into the gradient, as is conventional, the penalty gets rescaled by Adam’s per-parameter adaptive step and no longer behaves like true weight decay, which is why AdamW (decoupled weight decay) is now the standard for transformers. LBFGS is a quasi-Newton method that approximates the inverse Hessian from a short history of parameter and gradient differences. It converges in very few iterations but stores that history window (memory cost), needs full-batch gradients, and does not tolerate stochasticity. Use it for small problems where the data fits in memory and you want fast, near-second-order convergence, e.g. fitting a logistic regression on tabular data, or fine-tuning a small model. Mention learning rate schedules: cosine decay with warmup is the default for modern LLM training, and a flat-then-decay schedule is fine for most other things.
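The update rule above translates almost line for line into NumPy. The following is a minimal full-batch sketch, assuming the bias is handled as a trailing column of ones in X; the function name `fit_logreg`, the synthetic data, and the hyperparameters are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    # Clip the logit so exp() cannot overflow for very large |z|.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def fit_logreg(X, y, lr=0.1, epochs=2000, lam=0.0):
    """Full-batch gradient descent on binary cross-entropy with optional L2.
    Assumes the last column of X is a column of ones (the bias term)."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        p = sigmoid(X @ theta)
        grad = X.T @ (p - y) / n   # gradient of the mean cross-entropy
        reg = lam * theta
        reg[-1] = 0.0              # do not regularize the bias
        theta -= lr * (grad + reg)
    return theta

# Synthetic, linearly separable data (an assumption for the demo).
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(500, 2))
y = (X_raw[:, 0] + 2.0 * X_raw[:, 1] > 0).astype(float)
X = np.hstack([X_raw, np.ones((500, 1))])

theta = fit_logreg(X, y, lr=0.5, epochs=2000, lam=1e-3)
acc = np.mean((sigmoid(X @ theta) > 0.5) == (y == 1))
```

The small L2 term keeps the weights finite on separable data, where unregularized logistic regression would otherwise keep growing theta.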

Question 5

Design a recommendation system for a video streaming platform — 200M users, 100M videos, must serve recommendations in under 100ms.

Accepted Answer

Two-stage architecture: retrieval narrows 100M videos to a few hundred candidates, then ranking scores those candidates with a heavier model. You do this because no single model can score 100M items per request inside a 100ms budget.

Retrieval stage: train a two-tower model, with a user tower that produces a user embedding from user features (watch history, demographics, recent context) and a video tower that produces a video embedding from video features (content, metadata, aggregate engagement). Train with sampled-softmax or in-batch negatives so positive pairs (user watched video) have higher dot product than random pairs. At serving time, run the user tower online (fast, a few ms) and pre-compute all video embeddings offline. Use an ANN index (HNSW or ScaNN) to find the top ~500 nearest videos to the user embedding in single-digit milliseconds.

Ranking stage: take those ~500 candidates and score each one with a deeper model that uses the full feature set: cross features between user and video, sequence features (the order of recent watches), real-time context (time of day, device). Typical ranking models are wide-and-deep, DLRM, or a transformer over user history. You can afford this because the ranker scores all ~500 candidates in one batched forward pass of roughly 10ms, which fits the budget; 500 sequential 10ms calls would not.

Infrastructure to call out: a feature store (e.g. Feast, Tecton, or in-house) with both an offline store for training and an online store for serving, with the same transformation code running in both paths to avoid train-serve skew. Training loop: log impressions and outcomes, build training data from those logs with appropriate negative sampling (uniformly random negatives are easy and let the model lean on popularity; in-batch negatives are harder but over-penalize popular items unless you correct for sampling frequency), and retrain the ranking model daily or hourly. A/B testing infrastructure: every model change ships behind a feature flag with traffic-splitting and metric monitoring (click-through, watch time, downstream retention).

Cold start: new users get popularity-based or demographically-segmented recommendations until you have enough interaction signal; new videos use content-based features (genre, tags, the video tower applied to metadata) until they accumulate engagement. The non-obvious answer is that ranking metrics often disagree: optimizing for click-through rate increases clickbait, and optimizing for watch time biases toward long videos. You typically combine multiple objectives with a learned or hand-tuned weighting and validate on long-horizon metrics.
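The retrieve-then-rank flow can be sketched end to end in NumPy. Everything here is a stand-in chosen for illustration: random vectors replace the trained tower outputs, brute-force dot products replace the ANN index, and the `rank` scorer is a hypothetical placeholder for a real ranking model.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_VIDEOS, DIM, K_RETRIEVE, K_FINAL = 10_000, 32, 500, 10

# In a real system these come from the video tower, computed offline.
video_emb = rng.normal(size=(NUM_VIDEOS, DIM)).astype(np.float32)

def retrieve(user_emb, k):
    """Stage 1: top-k videos by dot product (brute force stands in for ANN)."""
    scores = video_emb @ user_emb
    return np.argpartition(-scores, k)[:k]

def rank(user_emb, candidate_ids, k):
    """Stage 2: score every candidate in one batched call, keep the best k."""
    cand = video_emb[candidate_ids]
    # Placeholder scorer: dot product plus a nonlinear user-video cross term.
    scores = cand @ user_emb + 0.1 * np.tanh(cand * user_emb).sum(axis=1)
    order = np.argsort(-scores)[:k]
    return candidate_ids[order], scores[order]

user_emb = rng.normal(size=DIM).astype(np.float32)  # user tower output, online
candidates = retrieve(user_emb, K_RETRIEVE)
top_ids, top_scores = rank(user_emb, candidates, K_FINAL)
```

The design point is that the expensive scorer only ever sees the ~500 retrieved candidates, and sees them as one batch rather than as 500 separate calls.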

Question 6

Design the training infrastructure for a 100B-parameter language model — must train on 1000+ GPUs efficiently.

Accepted Answer

At 100B parameters, the training state does not fit on a single GPU. With Adam it is roughly 1.6 TB at about 16 bytes per parameter, whether you count pure fp32 (4 bytes each for weights and gradients, 8 for the two Adam moments) or standard mixed precision (2 + 2 bytes for bf16 weights and gradients, plus 12 for fp32 master weights and moments); the bf16 weights alone are about 200 GB. So the question becomes how you split the model across devices. There are three orthogonal axes (data parallelism, tensor parallelism, and pipeline parallelism), and large training runs combine all three.

Data parallelism: each GPU holds a full copy of the model and processes a different micro-batch, then gradients are all-reduced across GPUs. Simplest to reason about, but on its own it only works while the model fits on one GPU.

Tensor parallelism: split each matrix multiplication across GPUs, e.g. split the columns of the weight matrix, let each GPU compute its slice of the output, then all-gather. This requires intra-layer communication on every forward and backward pass, so it is bandwidth-hungry and you typically only do it within a single node (8 GPUs connected by NVLink).

Pipeline parallelism: split the layers into stages, with different GPUs holding different layers; mini-batches are broken into micro-batches that flow through the pipeline like an assembly line. This adds a “bubble” of idle time at the start and end of each batch; interleaved 1F1B (one-forward-one-backward) scheduling reduces it.

ZeRO / FSDP (fully sharded data parallel): rather than every data-parallel rank holding a full copy of the optimizer state, gradients, and parameters, you shard them across ranks and gather what you need just before computation. ZeRO-3 / FSDP shards all three; the cost is extra communication.

Most 100B-class runs use 3D parallelism: tensor parallelism within a node (size 8), pipeline parallelism across a few nodes (size 8–16), and data parallelism across the rest. At this scale the bottleneck is often communication rather than compute: NCCL all-reduce, all-gather, and reduce-scatter time can dominate a step, so you overlap communication with computation aggressively.

Training fault tolerance is its own subsystem: a single GPU failure on a multi-week run cannot kill the job, so you checkpoint every ~30 minutes to fast distributed storage (Lustre, GCS, or a custom system) with asynchronous offload, and you have automatic restart-from-checkpoint with a smaller world size if a node dies. Gradient accumulation lets you simulate a larger effective batch size when your per-step batch is constrained by memory. Mixed precision (bfloat16 for weights, activations, and gradients in the forward and backward pass; fp32 for master weights and optimizer state) is mandatory; flash-attention or its variants are effectively mandatory for the attention layers. Mention learning rate warmup followed by cosine decay, gradient clipping at norm 1.0, and the fact that loss spikes are normal and recoverable with a small learning rate dip plus skipped batches.
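A quick back-of-envelope helps when sizing this on a whiteboard. The sketch below assumes Adam and the common mixed-precision accounting of 16 bytes per parameter (2 + 2 bytes for bf16 weights and gradients, 4 + 4 + 4 for fp32 master weights and the two Adam moments), ignores activations, and uses the standard bubble estimate for a non-interleaved pipeline schedule. The helper names are illustrative.

```python
def training_state_gb(num_params, bytes_per_param):
    """Memory in GB (1e9 bytes) for a given per-parameter byte count."""
    return num_params * bytes_per_param / 1e9

N = 100e9  # 100B parameters

# Assumed per-parameter byte counts for mixed precision with Adam.
BF16_WEIGHTS, BF16_GRADS = 2, 2
FP32_MASTER, FP32_ADAM_M, FP32_ADAM_V = 4, 4, 4

weights_and_grads_gb = training_state_gb(N, BF16_WEIGHTS + BF16_GRADS)
optimizer_gb = training_state_gb(N, FP32_MASTER + FP32_ADAM_M + FP32_ADAM_V)
total_gb = weights_and_grads_gb + optimizer_gb

def per_gpu_gb_zero3(state_gb, dp_ranks):
    """ZeRO-3 / FSDP shards parameters, gradients, and optimizer state evenly."""
    return state_gb / dp_ranks

def pipeline_bubble_fraction(stages, micro_batches):
    """Idle fraction of a non-interleaved pipeline: (p - 1) / (m + p - 1)."""
    return (stages - 1) / (micro_batches + stages - 1)

sharded_gb = per_gpu_gb_zero3(total_gb, dp_ranks=64)
bubble = pipeline_bubble_fraction(stages=8, micro_batches=64)
```

Under these assumptions the bf16 weights and gradients come to about 400 GB and the fp32 optimizer-side state adds about 1.2 TB, so sharding across 64 ranks leaves roughly 25 GB of state per GPU before activations. With 8 stages and 64 micro-batches the pipeline bubble is just under 10% of step time.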

Question 7

Design a production model-serving system that handles 1M QPS with sub-50ms p99 latency, supports rolling model updates without downtime, and provides observability into drift.

Accepted Answer

Inference server: pick one of Triton, TorchServe, BentoML, vLLM (for LLMs), or a custom Rust/C++ server if you have the engineering budget. The serving binary handles model loading, dynamic batching, GPU memory management, and request routing. Dynamic batching is the single biggest throughput lever: queue incoming requests for a few ms, form a batch, run the model once, return responses individually. This trades a small amount of latency for much higher GPU utilization; at 1M QPS you cannot afford to run one request per forward pass. Tune the batching window per model (longer for high-throughput batch endpoints, shorter for latency-sensitive ones).

Routing: place a lightweight layer-7 (L7) router in front (Envoy, custom) that knows model versions and routes by request metadata. CPU-bound models go to CPU pools; large models go to GPU pools. For some workloads, request shape matters: long sequences in an LLM are routed differently from short ones to maintain consistent batch shapes.

Model updates: never hot-swap weights in a serving binary; instead deploy a new version of the binary on a fresh pool and shift traffic with the router. Shadow deployment first: copy 100% of real traffic to the new version and compare outputs (the new model receives traffic but does not serve responses), which is useful for catching numerical regressions and feature pipeline mismatches. Then canary at 1%, 5%, 25%, 100%, gated by automated metric thresholds on latency, error rate, and a business proxy metric. Rollback must be one command and tested regularly.

Observability has four layers. Infrastructure (latency p50/p95/p99, error rate, GPU utilization, queue depth) is the table-stakes layer. Feature distribution drift: compute population stability index (PSI) or KL divergence on each input feature between training data and live traffic, and alert when any feature drifts more than a threshold. This catches upstream pipeline bugs faster than anything else. Prediction drift: track the distribution of output scores; a sudden shift means either inputs changed or the model is misbehaving. Outcome metrics: whatever the business cares about (conversion, click-through, retention) measured downstream and joined back to model predictions, ideally automatically attributed by request ID. The outcome layer has the longest delay (sometimes days) but is the only one that catches the case where features and predictions look fine but the model is making bad decisions.

Final hard problem: train-serve feature consistency. Features computed in the training pipeline must match features computed in the serving pipeline, and the most common production bug in ML systems is when they silently diverge. Use the same library, the same code path, the same materialization logic in both, or accept that you will spend a lot of debugging time.
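The feature-drift check mentioned above is simple to compute. Here is a minimal PSI sketch for one numeric feature, assuming quantile bins taken from the training sample and a small epsilon to avoid dividing by empty bins; the function name `psi`, the bin count, and the synthetic data are illustrative choices.

```python
import numpy as np

def psi(expected, actual, bins=10, eps=1e-6):
    """Population stability index of `actual` against the reference `expected`.
    Bins are quantiles of the reference; the outer bins are open-ended."""
    interior = np.quantile(expected, np.linspace(0.0, 1.0, bins + 1))[1:-1]
    e_counts = np.bincount(np.searchsorted(interior, expected), minlength=bins)
    a_counts = np.bincount(np.searchsorted(interior, actual), minlength=bins)
    e_frac = np.clip(e_counts / len(expected), eps, None)
    a_frac = np.clip(a_counts / len(actual), eps, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=20_000)    # reference: training data
same = rng.normal(0.0, 1.0, size=20_000)     # live traffic, no drift
shifted = rng.normal(1.0, 1.0, size=20_000)  # live traffic, mean shifted by 1 sd

psi_same = psi(train, same)
psi_shifted = psi(train, shifted)
```

A common rule of thumb treats PSI below about 0.1 as stable and above about 0.25 as a significant shift, but those cutoffs are conventions; alert thresholds are usually tuned per feature.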

Question 8

Tell me about a model that didn’t work as well in production as in offline evaluation.

Accepted Answer

This is the canonical MLE behavioral question because almost every model has this story, and how you tell it shows whether you understand production ML or you have only done notebooks. A strong answer has four parts. First, the concrete gap — “offline AUC was 0.87, but online click-through rate moved 0.3% instead of the projected 4%, and engagement on the segment we cared about actually dropped.” Numbers, not vibes. Second, the diagnosis with a named category: train-serve skew (the feature pipeline at training time was not the same as at serving time), label leakage (a feature was correlated with the label in a way that did not exist at prediction time), selection bias (the training set was conditioned on past system behavior — you were only learning from items the previous model showed), distribution shift (the world changed between training and serving), or specification gaming (the offline metric did not actually measure what the business cared about). Naming the category quickly is itself the signal — it shows you have seen this failure mode before. Third, the fix — what you changed in the system (a feature store with shared transformation code, an online evaluation framework, a counterfactual logging pipeline, or sometimes simply a different metric). Fourth, the durable lesson — something you now do differently on every project, not just the next iteration of this one. Avoid two anti-patterns: claiming the model was perfect and blaming data quality (shows no ownership), or treating the failure as a one-off random event (shows no system thinking). The interviewer is also listening for whether you had online metrics at all — many candidates ship models with no production telemetry beyond latency, which itself is a signal about seniority.

Question 9

Describe a time you killed an ML project.

Accepted Answer

Senior MLE signal — this question is grading whether you can make rigorous cost-benefit decisions against your own work. The shape of a good answer: you were running a project (your project or one you owned a major component of), the underlying assumption that justified it started to look wrong, you ran the analysis to confirm, and you brought the recommendation to kill the project with concrete numbers attached. The numbers matter — “we were spending roughly $X per quarter on training and serving, the projected business lift from the next iteration was Y, and even with optimistic assumptions the ROI did not justify a third quarter of investment.” Or for a research project: “we had run three architectural variations and each plateaued at the same metric ceiling, which suggested the bottleneck was data quality or label noise rather than model capacity, and the data fix was a different project.” The harder part is the people piece — you had a team that had invested months, you had to write the kill memo without making it feel like blame, you had to redirect those engineers to something they could feel good about. Mention that you got it ratified by your manager, your director, or whoever the decision-maker was — solo-killing a project that other people care about is a political mistake even when the analysis is right. Avoid frames that signal you killed it because someone told you to (no ownership) or because you got bored (no rigor). The ideal close is a brief reflection on what would have caused you to make the call earlier — a leading indicator you now watch on every project.

Question 10

Tell me about a time you partnered with a non-ML team to ship something.

Accepted Answer

Cross-functional collaboration is what separates senior MLEs from research-only profiles. Pick a project where the ML was a means, not the end — and where the partner team was meaningfully non-ML (product, infra, ops, sales, legal). The shape: product wanted a capability (“we want to detect when a customer is about to churn,” “we want to auto-tag uploaded photos,” “we want to route support tickets to the right team”), and you had to do three things that are not about modeling at all. First, translate the product requirement into a learnable problem — what is the label, what is the prediction window, what does success look like, what is the baseline you will compare against. This step is where most projects die, because the product team has a vague intent and you have to push back on framing without sounding obstructionist. Second, coordinate on serving — the platform team probably has opinions about which inference stack you can use, what latency budgets exist, how you log predictions for downstream attribution. You either fit into the existing platform or you negotiate the additions, and you do it without making the platform team feel like you are dumping work on them. Third, set up monitoring with whoever owns the downstream metric — ops, support, or the product team. They need to be able to see model behavior in their own dashboards, not yours. The story should show that you understood the partner team’s constraints (their on-call rotation, their roadmap, their definition of success), that you adjusted scope to fit them, and that you ended up with a shipped thing rather than a research artifact. The anti-pattern to avoid: a story where you built a great model and threw it over the wall, and the partner team failed to integrate it — that frames the partner team as the obstacle, which is the wrong signal.

Machine Learning Engineer Interview Questions — What Senior MLE Candidates Actually Get Asked

How ML engineer loops are structured in 2026

ML fundamentals Q&A

Q1. Explain the difference between L1 and L2 regularization. When would you choose each?

Q2. What is dropout and why does it work? When does it hurt?

Q3. Explain the bias-variance trade-off in concrete terms, with a deployment scenario.

Q4. Implement gradient descent for logistic regression from scratch, then discuss when to use SGD vs Adam vs LBFGS.

ML system design Q&A

Q1. Design a recommendation system for a video streaming platform — 200M users, 100M videos, must serve recommendations in under 100ms.

Q2. Design the training infrastructure for a 100B-parameter language model — must train on 1000+ GPUs efficiently.

Q3. Design a production model-serving system that handles 1M QPS with sub-50ms p99 latency, supports rolling model updates without downtime, and provides observability into drift.

Behavioral Q&A

Q1. Tell me about a model that didn’t work as well in production as in offline evaluation.

Q2. Describe a time you killed an ML project.

Q3. Tell me about a time you partnered with a non-ML team to ship something.

Going deeper