Q1. Design a recommendation system for a streaming platform with 100 million users and 50 million items.
Start by clarifying the objective: are we optimizing for watch time, completion rate, retention, or content diversity? The metric drives every downstream choice. The standard architecture is a two-stage funnel — a candidate generator that retrieves a few hundred items from millions, followed by a heavier ranker that orders them. For retrieval, two-tower neural networks trained with contrastive loss are the modern default: a user tower encodes recent watch history and demographics, an item tower encodes title metadata and content embeddings, and the inner product approximates affinity. Index the item embeddings in an ANN store like ScaNN or FAISS for sub-10ms lookup. For ranking, run a gradient-boosted model or wide-and-deep network over the few hundred candidates, scored on click probability and predicted watch duration and combined into a single multi-objective score. Cover the cold start path explicitly: new users get popularity-weighted recommendations within their demographic, new items get content-based retrieval until they accumulate engagement. Discuss feedback loops — popular items get more exposure, get more engagement, get even more exposure — and the mitigations: exploration via Thompson sampling or off-policy correction. Close with the metrics framework: offline NDCG and recall at K; online A/B tests on retention and watch time, with diversity and freshness as guardrails.
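To make the retrieval stage concrete, here is a minimal two-tower sketch in PyTorch with an in-batch softmax form of the contrastive loss. It is a simplification under stated assumptions: the user tower only averages embeddings of recently watched item IDs (standing in for history plus demographics), the item tower embeds only the item ID (standing in for title metadata), and the layer sizes, temperature, and toy batch are invented for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTower(nn.Module):
    """User tower averages embeddings of recently watched items; item tower
    embeds the candidate item. The inner product approximates affinity."""
    def __init__(self, num_items, dim=64):
        super().__init__()
        self.history_embed = nn.EmbeddingBag(num_items, dim, mode="mean")  # user-tower input
        self.item_embed = nn.Embedding(num_items, dim)                     # item-tower input
        self.user_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.item_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, history_ids, history_offsets, item_ids):
        u = F.normalize(self.user_mlp(self.history_embed(history_ids, history_offsets)), dim=-1)
        v = F.normalize(self.item_mlp(self.item_embed(item_ids)), dim=-1)
        return u, v

def in_batch_contrastive_loss(u, v, temperature=0.05):
    # In-batch sampled softmax: each user's watched item is the positive,
    # every other item in the batch serves as a negative.
    logits = (u @ v.T) / temperature
    labels = torch.arange(u.size(0), device=u.device)
    return F.cross_entropy(logits, labels)

# Toy usage: batch of 3 users with flattened watch histories (small vocab for the demo).
model = TwoTower(num_items=1000)
history = torch.tensor([10, 42, 7, 99, 3])   # concatenated histories
offsets = torch.tensor([0, 2, 4])            # where each user's history starts
positives = torch.tensor([5, 123, 77])       # item each user actually watched
u, v = model(history, offsets, positives)
loss = in_batch_contrastive_loss(u, v)
```

The normalized item-tower outputs are what would be exported and indexed in ScaNN or FAISS; at serving time only the user tower runs, and the ANN index returns the nearest item embeddings as candidates.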
Q2. Design a fraud detection system for a payments platform processing 10,000 transactions per second.
Fraud detection is a real-time, highly imbalanced, adversarial problem — three properties that shape every design choice. The architecture has three layers. First, deterministic rules — velocity checks, blocklists, geo-impossible patterns — catch obvious fraud at single-digit milliseconds and provide explainability for compliance. Second, a real-time ML model scores every transaction in under 100ms — gradient-boosted trees on hundreds of features (transaction amount, merchant category, device fingerprint, hours-since-last-transaction, historical user behavior). Features come from a feature store with both online (low-latency Redis or DynamoDB) and offline (batch) materialization to guarantee training-serving consistency. Third, a slower graph-based model runs on a sliding window to detect coordinated rings — accounts sharing devices, addresses, or money flow. On imbalance: positive rates are below 0.5 percent, so the right metric is precision-at-K or recall at a fixed precision target, never raw accuracy. Use focal loss or class weighting and calibrate probabilities post-hoc. On adversarial drift: fraudsters adapt within days, so retrain weekly, monitor PSI on key features daily, and shadow-deploy challenger models. The action layer matters too: high-risk transactions are declined, medium-risk go to step-up auth, low-risk pass — three tiers, each with measurable false-positive and false-negative cost.
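A minimal sketch of the imbalance handling and the three-tier action layer, assuming a tabular feature matrix already materialized from the feature store. The scikit-learn estimators, the sample-weighting scheme, and the hard-coded thresholds are illustrative stand-ins, not a prescribed stack.

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.calibration import CalibratedClassifierCV

def train_fraud_model(X_train, y_train):
    # Upweight the rare positive class (sub-0.5% base rate) via sample weights,
    # then calibrate so scores behave like probabilities for thresholding.
    pos_rate = y_train.mean()
    weights = np.where(y_train == 1, (1 - pos_rate) / pos_rate, 1.0)
    base = HistGradientBoostingClassifier(max_iter=300, learning_rate=0.05)
    model = CalibratedClassifierCV(base, method="isotonic", cv=3)
    model.fit(X_train, y_train, sample_weight=weights)
    return model

def precision_at_k(y_true, scores, k=1000):
    # Rank transactions by risk score and measure precision in the top k —
    # the metric that matters when reviewers can only work k cases per day.
    top_k = np.argsort(scores)[::-1][:k]
    return np.asarray(y_true)[top_k].mean()

def action(score, decline_thr=0.9, stepup_thr=0.5):
    # Three-tier action layer; in practice the thresholds come from the measured
    # false-positive and false-negative costs, not hard-coded constants.
    if score >= decline_thr:
        return "decline"
    if score >= stepup_thr:
        return "step_up_auth"
    return "approve"
```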
Q3. Design a search ranking system for a marketplace with hundreds of millions of listings.
Search ranking is a learning-to-rank problem layered on top of inverted-index retrieval. Stage one is retrieval: an inverted index (Elasticsearch or a custom Lucene-based system) returns the top few thousand listings matching the query terms, augmented by semantic retrieval via a dense embedding ANN store for queries the lexical index misses. Stage two is ranking: a gradient-boosted LambdaMART model or a deep neural ranker scores those candidates on hundreds of features — query-listing text match, listing quality (reviews, photos, response rate), personalization (user history, location), and listing freshness. Train on click and conversion data using pairwise or listwise loss. Stage three is business logic and diversification — boost geographic relevance, ensure category diversity in the top ten, apply marketplace-specific constraints. The DS challenges unique to this setting are position bias (users click top results because they are at the top, not because they are best), which requires inverse propensity weighting or randomized exploration, and fairness across sellers, which requires explicit exposure constraints. Evaluate offline on NDCG and online on conversion rate, search-to-purchase rate, and a satisfaction proxy like dwell time on the listing page.
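For the ranking stage, here is a LambdaMART-style sketch using LightGBM's LGBMRanker, with a crude inverse-propensity weighting to illustrate the position-bias correction. The synthetic data, the assumed examination-propensity curve, and the hyperparameters are placeholders; real propensities would be estimated from randomized exploration or a click model.

```python
import numpy as np
import lightgbm as lgb

# X: (n_samples, n_features) query-listing feature matrix
# y: graded relevance labels derived from clicks and conversions
# group: number of candidate listings per query, in query order
X = np.random.rand(100, 8)
y = np.random.randint(0, 3, size=100)
group = [10] * 10                       # 10 queries, 10 candidates each

# Inverse-propensity weights: labels observed at lower positions (less likely
# to be examined) get larger weight, correcting for position bias. The
# log-shaped propensity curve is an assumption for illustration.
positions = np.tile(np.arange(10), 10)
propensity = 1.0 / np.log2(positions + 2)
sample_weight = 1.0 / propensity

ranker = lgb.LGBMRanker(
    objective="lambdarank",
    n_estimators=200,
    learning_rate=0.05,
    label_gain=[0, 1, 3],               # gain per relevance grade
)
ranker.fit(X, y, group=group, sample_weight=sample_weight,
           eval_set=[(X, y)], eval_group=[group], eval_at=[10])

scores = ranker.predict(X[:10])         # rank the candidates for one query
```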
Q4. Design a content moderation ML system for a social platform with 500 million daily uploads.
Content moderation is multi-modal (text, image, video, audio), multi-policy (violence, hate, spam, sexual content, misinformation), and high-stakes — false negatives harm users, false positives erode trust. The pipeline has three tiers. Tier one is real-time, low-cost screening on every upload — perceptual hashing against known-bad content, lightweight CNN classifiers, keyword filters — designed for sub-second decisions and 99.9 percent uptime. Tier two is heavier ML for ambiguous content: vision-language models for image and video, fine-tuned LLMs for text, with separate per-policy heads since training a single multi-policy model dilutes signal. Tier three is human review for the highest-uncertainty content, with the human labels feeding back into the next training cycle. Critical design choices: action thresholds are per-policy and per-region (legal definitions vary), model outputs feed actions on a graded scale (remove, demote, label, age-gate), and an appeals path is mandatory. On evaluation: precision and recall by policy, with a focus on rare but catastrophic categories, and per-language and per-demographic slicing to catch fairness gaps. Monitor drift aggressively — adversaries actively probe the system, so weekly retrains on red-team examples are standard. The honest framing in an interview is that no system is perfect at this scale; the goal is to balance harm reduction against expression and to make the trade-offs auditable.
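A sketch of the tier-one screening path and the per-policy, per-region action mapping, assuming the imagehash library for perceptual hashing. The policy names, regions, thresholds, and the uncertainty routing band are invented for illustration.

```python
from dataclasses import dataclass
from PIL import Image
import imagehash

KNOWN_BAD_HASHES = set()               # populated from prior enforcement actions

def phash_match(image_path, max_distance=4):
    """Flag near-duplicates of previously removed content via perceptual hash."""
    h = imagehash.phash(Image.open(image_path))
    return any(h - bad <= max_distance for bad in KNOWN_BAD_HASHES)

@dataclass
class PolicyThresholds:
    remove: float
    demote: float
    label: float

# Thresholds differ by policy and region because legal definitions differ;
# these numbers are placeholders, not real operating points.
THRESHOLDS = {
    ("violence", "EU"): PolicyThresholds(remove=0.95, demote=0.80, label=0.60),
    ("violence", "US"): PolicyThresholds(remove=0.97, demote=0.85, label=0.65),
}

def decide(policy, region, score, uncertainty, review_band=0.15):
    # Graded action scale; high-uncertainty cases route to tier-three human review,
    # whose labels feed the next training cycle.
    t = THRESHOLDS[(policy, region)]
    if uncertainty > review_band:
        return "human_review"
    if score >= t.remove:
        return "remove"
    if score >= t.demote:
        return "demote"
    if score >= t.label:
        return "label"
    return "allow"
```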