Round type · System design

System Design Interview Questions — Real Problems, Trade-Offs, and What Senior+ Interviewers Want to Hear

System design isn't memorization. It's articulating trade-offs out loud. The same problem has multiple correct answers; what separates a no-hire from a strong-hire is justification — naming the alternatives you considered, explaining why you rejected them, and acknowledging the failure modes of the design you picked.

What follows is the question bank seasoned engineers actually get asked at L5 and L6 loops, paired with the reasoning interviewers expect to hear. Not pattern flashcards. Trade-offs, alternatives, and the follow-ups that separate a recited design from a real one.

What system design rounds are actually testing

The signal differs by level. The same answer can be a strong-hire at L4 and a no-hire at L6 — not because the design is wrong, but because it lacks the depth of articulation the higher level demands.

L4 / SDE II

Solid implementation

Can build the obvious version. Knows what a load balancer, cache, and database are. Recognizes that a single instance won't scale and can describe horizontal scaling. Doesn't need to defend the choice of every component — just needs the design to be coherent and not make obvious mistakes.

L5 / Senior

Trade-off articulation

Names the trade-offs. Picks SQL vs. NoSQL with a reason that holds up under follow-up. Knows when async beats sync and explains the failure modes of each. Identifies the bottleneck before the interviewer prompts. When asked "what if traffic 10x'd?" — answers with the specific component that breaks first and how it scales.

L6+ / Staff

Alternatives explicitly rejected

Doesn't just propose a design — proposes two or three, explains why one wins, and names the conditions under which the answer would change. Surfaces organizational implications: who owns this service, how it evolves, what the on-call burden looks like. Treats every component as a future migration. The signal isn't cleverness — it's judgment under ambiguity.

The framework that holds up under pressure

Every system design book sells you a framework. This one is what actually works in a 45-minute round when the interviewer interrupts twice and your initial assumption turns out to be wrong.

  1. Clarify requirements (5 minutes)

    Functional first — what does the system do, who uses it, what's in scope. Then non-functional — read/write ratio, request volume, latency targets, data size, consistency requirements. Don't skip this. A wrong assumption here invalidates the rest of the round, and interviewers explicitly score on whether you asked.

  2. High-level design (10 minutes)

    Boxes and arrows. Client → load balancer → API → cache → database → queue → worker. Don't optimize yet. The point is to establish the shape so the interviewer agrees on the contract before you go deep on any one component.

  3. Detail the components (15 minutes)

    Pick the two or three components where the interesting decisions live and go deep. Schema choices, partitioning strategy, replication mode, cache eviction. This is where most of the score is awarded.

  4. Identify bottlenecks (10 minutes)

    Walk the request path. Where does traffic concentrate? What component fails first under 10x load? Hot keys, single points of failure, synchronous chains that should be async. Don't wait to be asked.

  5. Discuss alternatives (5 minutes)

    The senior-grade move. "I picked X. The reasonable alternative is Y. I'd switch to Y if requirement Z changed." This is the line that turns a passing round into a strong-hire. Most candidates skip it.

Classic problems

The canon. If you have done any system design prep, you have seen these. The bar is no longer 'can you describe a design' — it is 'can you name the three reasonable designs and articulate when each one is correct.'

Design a URL shortener (TinyURL, bit.ly).

Start by clarifying scale before drawing anything: read-to-write ratio is roughly 100:1, expected 100M new URLs/month, ~10B clicks/month. That sets the storage budget (~500GB for the mapping table over five years) and the read QPS (~4K/s steady, 40K/s peak). The high-level design is trivial — POST /shorten returns a 7-character code, GET /{code} 302-redirects. The trade-offs are where the round is actually scored. Code generation: counter + base62 vs. hash + collision check vs. pre-generated key pool. The counter is simplest but leaks volume to competitors and centralizes a bottleneck; the hash is stateless but you must handle collisions; the pool decouples generation from request path and is what most production systems pick. Storage: a relational store keyed by code is fine at this scale, and the typical wrong answer is reaching for Cassandra when a single Postgres with read replicas handles 40K/s of point lookups easily. Cache the hot tail aggressively — long-tail clicks dominate. Follow-ups: custom aliases (uniqueness check before insert), analytics (async write to a separate pipeline, never block the redirect), expiry (TTL field + lazy GC vs. cron sweeper). The point of this question is not the design; it is whether you can explain why you would not pick the more exotic option.
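A minimal sketch of the counter + base62 option — the alphabet ordering and 7-character width are assumptions here, and a production system would draw the counter from a distributed ID service or the pre-generated key pool described above:

```python
# Hypothetical sketch: counter -> fixed-width base62 short code.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode_base62(n: int, width: int = 7) -> str:
    """Encode a counter value as a fixed-width base62 short code."""
    chars = []
    while n > 0:
        n, rem = divmod(n, 62)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars)).rjust(width, ALPHABET[0])

# 62**7 ≈ 3.5 trillion codes — comfortably above 100M URLs/month for decades.
print(encode_base62(125))  # -> "0000021"
```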

Design Twitter's home timeline (feed).

The interesting decision is fanout-on-write vs. fanout-on-read. Fanout-on-write pre-computes each user's timeline by inserting into N follower timelines on every tweet — O(followers) write cost, O(1) read. Fanout-on-read computes the timeline at request time by pulling recent tweets from each followee — O(1) write, O(followees) read. Production Twitter uses a hybrid: fanout-on-write for normal users, fanout-on-read for celebrities (Lady Gaga has 80M followers; writing 80M timeline rows per tweet is wasteful when most followers will never log in that day). The articulation interviewers want is the explicit threshold — somewhere around 10K–1M followers depending on activity — and the merge step at read time that combines the pre-computed timeline with celebrity tweets fetched on demand. Storage is typically Redis lists capped at ~800 tweets per user. Ranking is a separate concern layered after retrieval. Follow-ups: how do you handle a user who follows 50K accounts (fanout-on-read for them too), how do you backfill a new follow (fetch last N tweets and merge), how do you delete a tweet without touching every follower's timeline (tombstone + filter at read). The wrong answer is picking one strategy and defending it; the right answer is naming both, explaining where each breaks, and building the hybrid.
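A minimal sketch of the read-time merge step, using plain dicts as stand-ins for the Redis timeline lists and the tweet store (all names here are hypothetical):

```python
import heapq
from typing import Dict, List, Set, Tuple

# In-memory stand-ins for the Redis timelines and the tweet store.
Tweet = Tuple[int, str]                        # (created_at, text)
precomputed: Dict[str, List[Tweet]] = {}       # user -> fanout-on-write timeline
recent_by_author: Dict[str, List[Tweet]] = {}  # celebrity -> recent tweets

def home_timeline(user: str, followed_celebs: Set[str], limit: int = 50) -> List[Tweet]:
    # Normal followees were already fanned out into `precomputed` at write time.
    candidates = list(precomputed.get(user, []))
    # Celebrity tweets are fetched at read time instead of fanned out on write.
    for celeb in followed_celebs:
        candidates.extend(recent_by_author.get(celeb, [])[:limit])
    # Merge both sources newest-first and keep the top `limit`.
    return heapq.nlargest(limit, candidates, key=lambda t: t[0])
```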

Design a distributed cache (Redis-cluster style).

Three concerns dominate: partitioning, replication, and consistency. Partitioning by consistent hashing with virtual nodes is the standard choice — when a node joins or leaves, only 1/N keys move, vs. N/N for modulo hashing. Virtual nodes (256+ per physical node) smooth out the load distribution. Replication: primary-replica per shard with async replication for performance, sync for safety. The trade-off is real: async means a primary failure can lose the last few writes; sync means every write waits for the slowest replica. Most caches accept the async loss because the source of truth is the database. Consistency: caches are eventually consistent by design, and the interview-grade answer names the staleness window and the invalidation strategy. Write-through, write-behind, and cache-aside each have failure modes. Cache-aside is the default — application reads cache, falls back to DB on miss, populates cache — but it has a thundering-herd problem on a popular missing key. The fix is request coalescing or a short negative cache. Eviction: LRU is fine for most workloads; LFU helps when you have a stable hot set; TTL is mandatory for anything user-facing. Follow-ups: hot-key handling (replicate the key, or add a per-key local cache), cluster resharding without downtime (slot migration with dual-read window), and monitoring (hit ratio, p99 latency, eviction rate are the three numbers that matter).
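A compact sketch of consistent hashing with virtual nodes — MD5 is used here only as a cheap stand-in for whatever keyspace hash the cluster actually uses:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Consistent hashing with virtual nodes: only ~1/N keys move on membership change."""

    def __init__(self, nodes, vnodes: int = 256):
        self.ring = []  # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):
                bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first vnode at or past the key's hash, wrapping around.
        idx = bisect.bisect(self.ring, (self._hash(key), "")) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["cache-a", "cache-b", "cache-c"])
print(ring.node_for("user:42"))
```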

Design a rate limiter.

The first question is where it runs — edge (CDN/API gateway), service mesh, or in-process. Each placement has different consistency requirements. Edge limiters can be approximate; per-user-per-API limiters need to be precise. The four classic algorithms are token bucket, leaky bucket, fixed window, and sliding window. Token bucket allows bursts up to bucket size, then sustained rate — best for user-facing APIs where occasional bursts are normal. Leaky bucket smooths output to a constant rate — best for protecting a downstream that cannot tolerate bursts. Fixed window is the simplest but has the boundary problem (2x burst across the window edge). Sliding window log is exact but expensive in memory; sliding window counter is the production sweet spot — interpolate across two adjacent fixed windows. Distributed implementation: a central Redis with INCR + EXPIRE works to ~100K req/s but becomes the bottleneck. The next step is per-node local buckets that periodically sync to a central store, accepting some over-allowance during the sync window. The senior-grade trade-off to articulate: strict global precision vs. availability under partition. If Redis is down, do you fail open (allow all traffic, risk overload) or fail closed (block all traffic, guaranteed outage)? Most production systems fail open with a local fallback limiter at a higher threshold. Follow-ups: per-tier limits (free vs. paid), per-endpoint limits, and how to surface the limit to the client (429 with Retry-After header).
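A single-node sketch of the sliding-window-counter algorithm described above; a distributed version would keep the two window counts in Redis per key rather than in a local dict:

```python
import time

class SlidingWindowCounter:
    """Sliding-window counter: interpolate across two adjacent fixed windows."""

    def __init__(self, limit: int, window_secs: int = 60):
        self.limit = limit
        self.window = window_secs
        self.counts = {}  # window start -> request count (old windows can be pruned)

    def allow(self, now: float | None = None) -> bool:
        now = now if now is not None else time.time()
        cur_start = int(now // self.window) * self.window
        prev_start = cur_start - self.window
        # Weight the previous window by how much of it the sliding window still covers.
        elapsed_frac = (now - cur_start) / self.window
        estimate = (self.counts.get(prev_start, 0) * (1 - elapsed_frac)
                    + self.counts.get(cur_start, 0))
        if estimate >= self.limit:
            return False
        self.counts[cur_start] = self.counts.get(cur_start, 0) + 1
        return True

limiter = SlidingWindowCounter(limit=100)
print(limiter.allow())  # True until ~100 requests land inside the sliding window
```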

Design a notification service (email, SMS, push).

This question rewards seeing it as a pipeline, not a service. The components are: a notification API that accepts events, a template/personalization layer, a routing layer that decides which channels to use per user, channel-specific senders, and a state machine tracking delivery. Decouple aggressively with a queue between every stage — a slow APNs response should not block email. Per-channel rate limits are mandatory; SES caps you at a sustained send rate, APNs throttles per-app-per-device, SMS providers charge per message and have country-specific rules. The interesting trade-offs: at-least-once vs. exactly-once delivery (exactly-once is essentially impossible across third parties; you accept dedup at the recipient via idempotency keys), batching vs. real-time (batch lifts throughput but adds latency; user-visible alerts must be real-time, digests can batch), priority lanes (a 2FA code cannot sit behind 100K marketing emails). User preferences are the silent complexity — opt-outs per channel per category, quiet hours, frequency caps, and timezone-aware sending. These cannot live as scattered if-statements; they belong in a preferences service consulted before send. Follow-ups: failure handling (retry with exponential backoff, then dead-letter queue with manual review), tracking (open and click pixels for email, delivery receipts for SMS, but never block the send on tracking writes), and template versioning (A/B testing requires version-pinned templates so analytics align).
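A sketch of dedup-at-the-sender via idempotency keys, with an in-memory set standing in for a durable store (Redis SETNX or a unique-key insert in production):

```python
import hashlib

sent: set[str] = set()  # durable in production; in-memory here for illustration

def send_once(user_id: str, event_id: str, channel: str, send_fn) -> bool:
    """Dedup at the sender: the same (user, event, channel) never sends twice."""
    key = hashlib.sha256(f"{user_id}:{event_id}:{channel}".encode()).hexdigest()
    if key in sent:
        return False   # duplicate delivery of an at-least-once event: skip
    send_fn()          # a crash between send and record re-sends — that is
    sent.add(key)      # exactly the at-least-once semantics accepted above
    return True
```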

Design a chat application (1:1 and group).

The protocol question comes first: long-poll, server-sent events, or WebSocket. WebSocket is the modern default — bidirectional, low overhead per message, persistent. The connection layer is stateful, which is the central architectural challenge. Sticky load balancing routes a user's connection to a specific gateway node; the gateway maintains an in-memory map of user-to-connection. When User A sends a message to User B, the system must locate B's gateway node — a presence service (Redis hash of user → gateway) does this in O(1). Storage: messages are append-only and time-ordered, which fits Cassandra or a sharded MySQL keyed by conversation_id. Group chats with 1000+ members add a fanout problem similar to Twitter — broadcast to every online member, queue for offline. Trade-offs to name: read receipts (per-user-per-message state explodes for large groups; cap or aggregate), typing indicators (ephemeral, never persist, throttle aggressively), end-to-end encryption (changes the entire design — the server cannot search, cannot generate previews, cannot do server-side spam filtering). Delivery semantics: at-least-once with client-side dedup via message_id is the practical choice; exactly-once is a fiction. Follow-ups: offline message queue (push notification + sync on reconnect), media attachments (upload to object storage, send the URL not the bytes), and history search (separate index — Elasticsearch — synced from the message store, never query the primary store for search).
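A minimal sketch of the presence-based routing step, with in-process dicts standing in for the Redis presence hash and the gateway RPC (all names hypothetical):

```python
from collections import defaultdict

# Production keeps presence in a Redis hash and forwards via RPC to the
# gateway node that owns the recipient's WebSocket.
presence: dict[str, str] = {}                        # user_id -> gateway node id
offline: defaultdict[str, list] = defaultdict(list)  # user_id -> queued messages

def on_connect(user_id: str, gateway_id: str) -> None:
    presence[user_id] = gateway_id                   # set on connect, cleared on disconnect

def route(recipient: str, message: str, deliver) -> None:
    gateway_id = presence.get(recipient)             # O(1) presence lookup
    if gateway_id is None:
        offline[recipient].append(message)           # flush on reconnect; fire a push
    else:
        deliver(gateway_id, recipient, message)      # forward to the owning gateway
```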

Streaming & real-time

Real-time systems force you to reason about the speed of light. Latency budgets, tick rates, and consistency-vs-availability trade-offs all come into sharp relief. These questions are favored at senior+ levels because they expose candidates who only know request/response.

Design a YouTube live streaming service.

Live streaming is fundamentally different from VOD because the source of truth is being generated in real time. Ingest: streamer pushes RTMP or SRT to a regional ingest endpoint. Transcoding: the ingest re-encodes into multiple bitrate ladders (1080p, 720p, 480p, 240p) — this is CPU-intensive and is where most of the cost lives. Packaging: HLS or DASH segments (typically 2–6 second chunks) are written to object storage with a manifest. Distribution: CDN serves the segments to viewers; the manifest is updated as new segments arrive. The latency–scale trade-off is the entire interview. Standard HLS gives 10–30 second glass-to-glass latency; LL-HLS and CMAF chunked transfer push it to 2–4 seconds; WebRTC pushes it to sub-second but at vastly higher per-viewer cost because WebRTC is not natively CDN-cacheable. The right answer names the use case: a sports broadcast tolerates 5 seconds; a live auction does not. Scale challenges: a viral stream can go from 100 to 1M concurrent viewers in minutes — the CDN must pre-warm, ingest must not become the bottleneck (regional ingest with global replication of the master), and chat must shard by stream ID with its own fanout. Follow-ups: DVR (keep the last N hours of segments retrievable), live-to-VOD (the stream becomes a video on demand at end-of-stream), and ABR (adaptive bitrate — the player picks the ladder rung based on measured throughput; the server just serves what is requested).
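A toy version of the client-side ABR decision — the ladder, bitrates, and safety margin below are assumptions; real players (hls.js, dash.js) use smoothed throughput estimates and buffer-occupancy rules:

```python
# Hypothetical ladder: rung label (~resolution) -> encoded bitrate in kbps.
LADDER_KBPS = [240, 480, 720, 1080]
BITRATE_KBPS = {240: 400, 480: 1_000, 720: 2_500, 1080: 5_000}

def pick_rung(measured_kbps: float, safety: float = 0.8) -> int:
    """Pick the highest rung whose bitrate fits the throughput budget."""
    budget = measured_kbps * safety  # leave headroom for throughput variance
    viable = [r for r in LADDER_KBPS if BITRATE_KBPS[r] <= budget]
    return max(viable) if viable else LADDER_KBPS[0]

print(pick_rung(3_200))  # -> 720: 2500 kbps fits the 3200 * 0.8 = 2560 budget
```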

Design WhatsApp messaging at billion-user scale.

WhatsApp is interesting because the constraints are extreme: 2B+ users, 100B+ messages/day, end-to-end encrypted, mostly mobile with intermittent connectivity. The encryption choice (Signal Protocol with per-conversation ratcheting keys) shapes everything downstream — the server cannot read content, so it cannot do server-side search, smart replies, or content-based spam filtering. What the server can do: route ciphertext to recipient devices, store offline messages, manage delivery receipts. Connection model: a single WebSocket per device to the nearest edge POP, with the edge proxying to a regional message router. Identity is phone number; device key bundles are published to a key server. Multi-device support adds significant complexity — every message must be encrypted separately to each of the recipient's active devices, and a new device must securely sync history (the QR-code-pairing flow). Delivery: at-least-once, with single tick (sent to server), double tick (delivered to device), and blue tick (read). Each tick is a separate event flowing back through the system. Group chats up to 1024 members fan out at the sender (or sender's primary device) — this preserves end-to-end encryption but means the sender uploads N copies of the ciphertext for an N-member group. Follow-ups: media (encrypted blob to object storage; URL + key sent over the message channel), voice/video calls (separate WebRTC stack with TURN relay fallback), and offline queue retention (typically 30 days, then drop).
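A sketch of the sender-side fanout cost for an encrypted group, where encrypt_for and upload are hypothetical stand-ins for the per-device Signal session and the message upload path:

```python
def group_send(plaintext: bytes, member_devices: dict[str, list[str]],
               encrypt_for, upload) -> int:
    """Encrypt the same plaintext separately for every active recipient device."""
    uploads = 0
    for member, devices in member_devices.items():
        for device_id in devices:  # multi-device: every active device gets its own copy
            upload(member, device_id, encrypt_for(device_id, plaintext))
            uploads += 1
    return uploads  # an N-member group costs the sender ~N ciphertext uploads;
                    # the server routes ciphertext only and never sees the plaintext
```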

Design an online multiplayer game backend.

The defining concern is tick rate and latency budget. A real-time shooter targets 60Hz simulation with sub-100ms perceived latency; a turn-based card game can tolerate seconds. Architecture splits along that axis. Real-time games use authoritative game servers (often UDP) that hold the simulation state in memory; clients send inputs, the server resolves, broadcasts state deltas. Lag compensation, client-side prediction, and server reconciliation are the three techniques every real-time game uses to hide the speed of light. Matchmaking is its own service — pull players from a queue, group them by skill (Elo or TrueSkill) and region, allocate a game server (typically from a pre-warmed pool because cold-starting a game server takes 10+ seconds and lobby abandonment spikes after 30 seconds of waiting). Game server allocation is usually done with Agones on Kubernetes or a similar orchestrator. Persistence: hot state lives in the game server's memory; on session end, results flush to the player profile DB and the match history DB. Cheating is the silent design constraint — the server is authoritative, clients are not trusted, and any state that could be manipulated client-side (currency, position validation, hit detection) must be re-validated server-side. Follow-ups: regional sharding (you cannot put EU and Asia players on the same server because RTT alone breaks the experience), spectator mode (add a read-only fanout from the authoritative server, often through a CDN-style relay), and replays (record the input stream, not the state stream — replays are deterministic re-simulations).
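A toy authoritative tick loop — the world model and speed clamp are assumptions, but the shape (buffer inputs, resolve server-side, broadcast deltas, sleep to the next tick) is the core pattern:

```python
import time
from dataclasses import dataclass, field

TICK_HZ = 60
TICK = 1.0 / TICK_HZ
MAX_SPEED = 5.0  # assumed server-side cap; clients are never trusted

@dataclass
class World:
    positions: dict = field(default_factory=dict)  # player -> x position

    def step(self, inputs: list, dt: float) -> dict:
        delta = {}
        for player, vx in inputs:
            vx = max(-MAX_SPEED, min(MAX_SPEED, vx))  # validate: no teleporting
            self.positions[player] = self.positions.get(player, 0.0) + vx * dt
            delta[player] = self.positions[player]
        return delta

def run(world: World, pending: list, broadcast, ticks: int = 3) -> None:
    next_tick = time.monotonic()
    for _ in range(ticks):
        delta = world.step(pending, TICK)  # resolve buffered inputs authoritatively
        pending.clear()
        broadcast(delta)                   # send state deltas, not full snapshots
        next_tick += TICK
        time.sleep(max(0.0, next_tick - time.monotonic()))

run(World(), [("p1", 3.0)], print)  # p1 moves 3.0 * (1/60) units on the first tick
```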

Storage & scale

Data-heavy designs. Sync, dedup, sharding, and the operational reality of 'what happens when this becomes 100x bigger.' These rounds reward candidates who have actually run something at scale, not just read about it.

Design Dropbox or Google Drive.

The interesting problems are not the upload — they are sync, conflict resolution, and storage efficiency. Files are chunked (typically 4MB blocks) before upload; each chunk is hashed; the client first asks the server which chunks it does not already have and only uploads the delta. This deduplication works both within a user's account (moving a file is metadata-only) and across users (a viral PDF is stored once). Metadata service holds the file tree and version history; chunk service holds the actual bytes in object storage. Sync is the hard part. The client maintains a local manifest, polls or holds a long connection for change notifications, and pulls deltas. Conflict resolution when two clients edit the same file offline: the server-side strategy is typically last-writer-wins with a conflicted-copy file generated for the loser, because three-way merge of arbitrary binary files is impossible. For text files, OT or CRDT-based collaboration (the Google Docs path) is a different architecture entirely. Storage tiers: hot data on SSD-backed object storage, cold data on archival storage with seconds-to-minutes retrieval latency. Encryption at rest is table-stakes; per-user encryption keys complicate dedup (you cannot dedup across users if each user encrypts with their own key — Dropbox accepts this, others trade convenience for privacy). Follow-ups: large file resumable upload (chunk-level retry), shared folders (permissions service, ACL inheritance, the messy edge cases of nested shares), and selective sync (the client manifest tracks which subtrees materialize locally).
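A minimal sketch of block-level dedup on upload, assuming fixed 4 MB chunks and a hypothetical put_chunk uploader:

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB fixed-size blocks, per the design above

def chunk_hashes(data: bytes) -> list[str]:
    """Split a file into blocks and hash each one for dedup lookups."""
    return [hashlib.sha256(data[i:i + CHUNK_SIZE]).hexdigest()
            for i in range(0, len(data), CHUNK_SIZE)]

def upload_delta(data: bytes, server_has: set[str], put_chunk) -> int:
    """Ask which chunks the server already stores; upload only the rest."""
    uploaded = 0
    for i, h in enumerate(chunk_hashes(data)):
        if h not in server_has:
            put_chunk(h, data[i * CHUNK_SIZE:(i + 1) * CHUNK_SIZE])
            uploaded += 1
    return uploaded  # 0 for a moved or already-known file: metadata-only
```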

Design Instagram's backend (photo upload, feed, search).

Three subsystems each with their own design. Upload: client uploads original to object storage via a pre-signed URL (never through the application server — that is the most common wrong answer); a worker generates thumbnail sizes asynchronously; metadata writes to the post DB. The async pipeline is critical because synchronous resize would tie up application threads and cap throughput. Feed: same fanout-on-write vs. fanout-on-read trade-off as Twitter, with the same celebrity-account hybrid. Instagram's feed is ranked, so retrieval pulls a candidate set (last N posts from followees) and a ranker scores them — the ranker is a separate ML service called per-feed-load. Cache the candidate set per user (Redis), invalidate or update on new posts from followees. Search: hashtag and user search go through Elasticsearch indexed asynchronously from the post DB; visual search (find similar photos) uses learned embeddings stored in a vector DB. Storage at scale: photo bytes dominate cost. The optimization layers are CDN edge caching (90%+ hit rate for popular content), regional object storage replication, and lifecycle policies that move old photos to colder tiers. The API itself shards by user ID. Follow-ups: stories (24-hour TTL on the content, separate ranking, separate engagement counters), reels (add a video pipeline — transcode, multiple bitrates, HLS packaging, similar to YouTube's VOD path), and DMs (separate messaging service, reuse the chat-application design).
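A sketch of the pre-signed-URL upload handshake with resize work enqueued asynchronously — sign_url, db, and enqueue are hypothetical interfaces:

```python
import json
import uuid

def start_upload(user_id: str, sign_url) -> dict:
    """Hand the client a pre-signed PUT URL; bytes never touch the app tier."""
    key = f"originals/{user_id}/{uuid.uuid4()}.jpg"
    return {"upload_url": sign_url(key), "object_key": key}

def on_upload_complete(user_id: str, object_key: str, db, enqueue) -> None:
    post_id = db.insert_post(user_id, object_key)     # metadata write only
    enqueue(json.dumps({"post": post_id, "key": object_key,
                        "sizes": [1080, 640, 320]}))  # resize happens off-thread
```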

Design an e-commerce checkout (cart, inventory, payment).

Checkout is where every distributed-systems failure mode shows up: race conditions on inventory, partial failures across cart, payment, and fulfillment, exactly-once charging, and cross-service consistency. The cart can be eventually consistent — losing a cart item is annoying, not catastrophic. Inventory is where strong consistency starts to matter, and the production answer is reservation, not decrement. When a user clicks checkout, the system reserves N units for M minutes (a row in the reservations table with TTL); reservations decrement available inventory transactionally. If payment succeeds, the reservation converts to a fulfilled order; if it fails or times out, the reservation expires and the inventory is released. This avoids the textbook race where two users buy the last item simultaneously. Payment integration: never hold a database transaction open across an external payment call — the call can take 30 seconds and you will exhaust the connection pool. Instead, persist the order in a pending state, call the payment provider asynchronously, then transition state on the webhook. Idempotency keys on every payment call are mandatory; without them a network retry double-charges. The senior-grade trade-off: distributed transactions across cart, inventory, payment, fulfillment, and notification are infeasible in practice; the production pattern is the saga — a sequence of local transactions with compensating actions for each rollback step (refund, release inventory, cancel notification). Follow-ups: tax and shipping (third-party services, cache results aggressively, never block checkout on a slow tax provider), fraud (asynchronous post-auth review, can cancel an order pre-fulfillment), and Black Friday (pre-warm everything, queue at the front door, accept a degraded-but-not-dropping experience).
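A toy version of the reservation pattern with lazy expiry — in production the reserve step is a single database transaction, not two dict mutations:

```python
import time
import uuid

RESERVATION_TTL = 10 * 60  # assumed: hold inventory for 10 minutes

available = {"sku-123": 1}                            # last unit in stock
reservations: dict[str, tuple[str, int, float]] = {}  # id -> (sku, qty, expires_at)

def reserve(sku: str, qty: int) -> str | None:
    """Move units from available stock into a TTL'd reservation, or fail fast."""
    if available.get(sku, 0) < qty:
        return None                  # sold out — reject at checkout, not at fulfillment
    available[sku] -= qty            # one transaction in a real database
    rid = str(uuid.uuid4())
    reservations[rid] = (sku, qty, time.time() + RESERVATION_TTL)
    return rid

def release_expired() -> None:
    """Lazy GC: return expired (unpaid) reservations to available stock."""
    now = time.time()
    for rid, (sku, qty, expires) in list(reservations.items()):
        if expires < now:
            available[sku] += qty
            del reservations[rid]
```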

ML & data systems

Production ML is mostly systems work — feature pipelines, online/offline parity, candidate generation, rankers, and the offline-vs-online metric gap. Asked at companies where ranking, ads, or search drive revenue.

Design a recommendation system (Netflix, Amazon, Spotify).

A production recommender has three layers: candidate generation, ranking, and re-ranking. Candidate generation reduces millions of items to ~1000 with a cheap, recall-oriented model (matrix factorization, two-tower neural network, or item-item collaborative filtering). The candidate set is computed offline for stable users, online for new ones. Ranking takes those 1000 and scores each with an expensive precision-oriented model (gradient-boosted trees or a deep network) using a much richer feature set — recent user behavior, item metadata, contextual features (time of day, device, current session). Re-ranking applies business rules (diversity, freshness, exclude-recently-shown) and is where you fix problems no model alone solves. The serving path must be fast — typically 50–100ms p99 for the entire stack — so feature stores are populated offline, embeddings are pre-computed, and the online ranker only does feature lookup + scoring. The cold-start problem (new user, new item) needs an explicit fallback — popularity, content-based, or hand-curated. Training pipeline: log all impressions and outcomes (click, watch, skip), join with item and user features, retrain ranker daily or hourly. The interview-grade trade-off is offline-vs-online metrics: an A/B test result is the only metric that matters in production; offline AUC improvements often do not translate. Follow-ups: feedback loops (the model learns from data the model influenced — this is a real bias source, addressed with exploration/randomization), debiasing position effects (top slots get more clicks regardless of relevance), and the multi-stakeholder problem (Netflix balances user satisfaction, content cost amortization, and creator promotion — single-objective ranking does not capture this).
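A sketch of the two-stage serving path — the dot-product recall stands in for whatever candidate generator is used, and ranker is a hypothetical expensive model applied only to the shortlist:

```python
import heapq

def recommend(user_vec, item_vecs: dict, ranker, k: int = 20,
              candidates: int = 1000) -> list:
    # Stage 1: candidate generation — cheap dot-product recall over
    # precomputed embeddings cuts millions of items to ~1000.
    scored = ((sum(u * v for u, v in zip(user_vec, vec)), item)
              for item, vec in item_vecs.items())
    shortlist = heapq.nlargest(candidates, scored)
    # Stage 2: ranking — the expensive model only ever sees the shortlist.
    ranked = sorted(shortlist, key=lambda s: ranker(user_vec, s[1]), reverse=True)
    return [item for _, item in ranked[:k]]
```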

Design an ad serving system.

Ads is a real-time auction at internet scale, with three things happening simultaneously: targeting (which ads is this user eligible for), pricing (second-price auction or VCG), and serving (return the winning ad in <100ms). Targeting starts with an inverted index — user attributes (location, demographics, interests, device) map to candidate campaigns whose targeting clauses match. The challenge is that targeting clauses are complex Boolean expressions, and brute-force evaluation across millions of campaigns is too slow. Solution: bitmap indexes per attribute, intersected at serve time, plus a learned candidate filter that prunes campaigns unlikely to win. Auction: each candidate has a bid (sometimes static, more often a learned bid from a campaign's value model), the auction picks the highest expected revenue (bid × predicted CTR × quality score). Pacing controls smooth a daily budget across the day so it does not exhaust by 9am. Frequency capping prevents one user from seeing the same ad 50 times — this requires per-user-per-campaign counters with low-latency increment, typically Redis with a write-back to durable storage. Click and impression logging is asynchronous — never block the ad return on logging — but must be durable because billing depends on it; dedup keys handle retries. Privacy is increasingly the constraint: with iOS ATT and third-party cookie deprecation, server-side modeling of cohorts replaces individual targeting, and the architecture has to support both. Follow-ups: brand safety (real-time content classification of the page the ad will appear on), creative serving (ads themselves are media — CDN them), and reporting (aggregations roll up through stream processing into per-campaign dashboards with hourly freshness).
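A toy second-price step over already-targeted candidates — the field names and the 0.01 price tick are assumptions:

```python
def run_auction(candidates: list[dict]) -> dict | None:
    """Rank by expected revenue; the winner pays just enough to beat the runner-up."""
    def ecpm(c):  # expected revenue per impression: bid x pCTR x quality
        return c["bid"] * c["pctr"] * c["quality"]

    ranked = sorted(candidates, key=ecpm, reverse=True)
    if not ranked:
        return None
    winner = ranked[0]
    runner_up = ecpm(ranked[1]) if len(ranked) > 1 else 0.0
    denom = winner["pctr"] * winner["quality"]
    # Second price: the minimum bid that would still have won, plus a tick.
    winner["price"] = round(runner_up / denom + 0.01, 4) if denom else 0.0
    return winner

ad = run_auction([
    {"id": "a", "bid": 2.0, "pctr": 0.03, "quality": 1.0},
    {"id": "b", "bid": 5.0, "pctr": 0.01, "quality": 0.9},
])
print(ad["id"], ad["price"])  # "a" wins on ecpm 0.06 vs 0.045; pays 1.51
```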

Design a search ranking system (web, product, or in-app search).

Search splits into two stages: retrieval (find documents that could match) and ranking (order them by relevance). Retrieval uses an inverted index — Lucene/Elasticsearch for text, with optional vector search for semantic similarity. The query goes through analysis (tokenization, stemming, synonyms), the index returns matching docs, and a first-pass scorer (BM25, or a learned lightweight model) reduces the candidate set to ~1000. Ranking applies a learned model — often a learning-to-rank pairwise or listwise model — using features that include text-match scores, document quality signals (PageRank-style for web, popularity and reviews for product), personalization signals, and freshness. Latency budget is harsh — most search products target sub-200ms. To meet it, the index is sharded by document ID with each shard contributing top-K candidates that merge at the broker. Caching is multi-level: query-result cache for exact repeats, candidate cache for partial query overlap. The interesting trade-off is recall vs. precision: aggressive recall surfaces more candidates but costs latency and risks irrelevant results; aggressive precision misses long-tail queries. The right balance depends on the surface — Google web search prioritizes recall (people will rephrase), product search prioritizes precision (showing irrelevant results loses the sale). Personalization is the wedge that separates good from great — recent click history, location, device, and (in commerce) purchase history reshape the ranking. Follow-ups: query understanding (spell correction, intent classification, entity recognition all happen pre-retrieval), zero-result handling (suggest related queries, broaden filters automatically), and multilingual (per-language analyzers, cross-language embeddings for queries that span languages).
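A minimal sketch of the shard scatter-gather merge, with a toy shard holding precomputed scores in place of a real inverted index queried per request:

```python
import heapq

class Shard:
    """Toy shard: doc_id -> relevance score; stands in for a Lucene shard."""

    def __init__(self, docs: dict[str, float]):
        self.docs = docs

    def top_k(self, k: int) -> list[tuple[float, str]]:
        return heapq.nlargest(k, ((s, d) for d, s in self.docs.items()))

def broker_search(shards: list[Shard], k: int = 10) -> list[tuple[float, str]]:
    # Scatter to every shard, gather each local top-k, merge to the exact
    # global top-k — no shard needs to return more than k candidates.
    return heapq.nlargest(k, (hit for s in shards for hit in s.top_k(k)))

shards = [Shard({"d1": 3.2, "d2": 1.1}), Shard({"d3": 2.7, "d4": 0.4})]
print(broker_search(shards, k=2))  # -> [(3.2, 'd1'), (2.7, 'd3')]
```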

How to practice these (so the prep actually transfers)

Reading designs is necessary but insufficient. The skill being tested is verbal articulation under interruption — and that only develops by speaking the design out loud, getting interrupted, and recovering coherently.

  • Whiteboard solo, then narrate. Draw the design silently. Then explain it out loud, end-to-end, in under 10 minutes. The first time you do this, you'll discover gaps you didn't know existed.
  • Force the alternative. For every component decision, write down the second-best option and why you didn't pick it. If you can't articulate the alternative, you don't yet understand your own choice.
  • Pressure-test with a 10x prompt. After every design, ask yourself: what breaks at 10x scale? At 100x? At 1/10x — would the design be over-engineered? The senior signal is identifying these inflection points without prompting.
  • Mock with a real engineer. Self-practice plateaus around 70% of interview-readiness. The remaining 30% is being interrupted mid-sentence, having an assumption challenged, and pivoting without losing the thread. That requires a human in the loop.
PhantomCode for system design

The hardest part isn't the design. It's the alternative you forgot to mention.

PhantomCode listens during your live system design round and surfaces the trade-offs you'd otherwise leave unsaid — the second-best option, the failure mode under partition, the scaling cliff at 10x. You decide what to say. It just makes sure the senior-grade alternative is on your tongue when the interviewer asks "why not X?"

See the interview copilot · Browse all interview questions