The canon. If you have done any system design prep, you have seen these. The bar is no longer 'can you describe a design' — it is 'can you name the three reasonable designs and articulate when each one is correct.'
Design a URL shortener (TinyURL, bit.ly).
Start by clarifying scale before drawing anything: read-to-write ratio is roughly 100:1, expected 100M new URLs/month, ~10B clicks/month. That sets the storage budget (~500GB for the mapping table over five years) and the read QPS (~4K/s steady, 40K/s peak). The high-level design is trivial — POST /shorten returns a 7-character code, GET /{code} 302-redirects. The trade-offs are where the round is actually scored. Code generation: counter + base62 vs. hash + collision check vs. pre-generated key pool. The counter is simplest but leaks volume to competitors and centralizes a bottleneck; the hash is stateless but you must handle collisions; the pool decouples generation from the request path and is what most production systems pick. Storage: a relational store keyed by code is fine at this scale, and the typical wrong answer is reaching for Cassandra when a single Postgres with read replicas handles 40K/s of point lookups easily. Cache hot codes aggressively — clicks skew heavily toward a small set of recent, popular URLs, so a modest cache absorbs most redirects. Follow-ups: custom aliases (uniqueness check before insert), analytics (async write to a separate pipeline, never block the redirect), expiry (TTL field + lazy GC vs. cron sweeper). The point of this question is not the design; it is whether you can explain why you would not pick the more exotic option.
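The code-generation trade-off is concrete enough to sketch. Below is a minimal Python illustration of the counter + base62 and hash + collision-check variants; the function names are mine, and a real system would pair the hash variant with an insert-time retry on a unique-constraint violation.

```python
import hashlib

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62(n: int) -> str:
    """Encode a non-negative integer in base62."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(ALPHABET[r])
    return "".join(reversed(out))

def code_from_counter(counter: int) -> str:
    # Counter approach: unique by construction, but sequential codes
    # leak creation volume; shuffle or offset the counter if that matters.
    return base62(counter).rjust(7, ALPHABET[0])

def code_from_hash(url: str, salt: str = "") -> str:
    # Hash approach: stateless, but two URLs can collide on a 7-char
    # code; on a collision at insert time, retry with a new salt.
    n = int.from_bytes(hashlib.sha256((salt + url).encode()).digest()[:8], "big")
    return base62(n).rjust(7, ALPHABET[0])[:7]
```

Seven characters over a 62-symbol alphabet give 62^7 ≈ 3.5 trillion codes, comfortably above the ~6B URLs that 100M/month accumulates over five years.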
Design Twitter's home timeline (feed).
The interesting decision is fanout-on-write vs. fanout-on-read. Fanout-on-write pre-computes each user's timeline by inserting into N follower timelines on every tweet — O(followers) write cost, O(1) read. Fanout-on-read computes the timeline at request time by pulling recent tweets from each followee — O(1) write, O(followees) read. Production Twitter uses a hybrid: fanout-on-write for normal users, fanout-on-read for celebrities (Lady Gaga has 80M followers; writing 80M timeline rows per tweet is wasteful when most followers will never log in that day). The articulation interviewers want is the explicit threshold — somewhere around 10K–1M followers depending on activity — and the merge step at read time that combines the pre-computed timeline with celebrity tweets fetched on demand. Storage is typically Redis lists capped at ~800 tweets per user. Ranking is a separate concern layered after retrieval. Follow-ups: how do you handle a user who follows 50K accounts (fanout-on-read for them too), how do you backfill a new follow (fetch last N tweets and merge), how do you delete a tweet without touching every follower's timeline (tombstone + filter at read). The wrong answer is picking one strategy and defending it; the right answer is naming both, explaining where each breaks, and building the hybrid.
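The merge step is where answers get hand-wavy, so it is worth sketching. A minimal Python illustration of the hybrid read path, assuming a hypothetical `store` facade over the Redis timeline cache and the tweet store; none of these method names are a real Twitter API.

```python
import heapq
import itertools

CELEBRITY_THRESHOLD = 100_000  # illustrative cutoff; tune by follower activity

def read_timeline(user_id: str, store, limit: int = 200) -> list:
    # Pre-computed part: fanout-on-write already pushed normal followees'
    # tweets into this user's cached list (capped at ~800 entries).
    precomputed = store.get_cached_timeline(user_id, limit)  # newest first

    # On-demand part: pull recent tweets only from followed celebrities,
    # whose tweets were never fanned out at write time. The store facade
    # is assumed to classify followees against CELEBRITY_THRESHOLD.
    celeb_streams = [
        store.recent_tweets(celeb_id, limit)
        for celeb_id in store.followed_celebrities(user_id)
    ]

    # Every stream is already sorted newest-first, so a k-way merge
    # produces the combined timeline without a full re-sort.
    merged = heapq.merge(precomputed, *celeb_streams,
                         key=lambda t: t.created_at, reverse=True)
    return list(itertools.islice(merged, limit))
```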
Design a distributed cache (Redis-cluster style).
Three concerns dominate: partitioning, replication, and consistency. Partitioning by consistent hashing with virtual nodes is the standard choice — when a node joins or leaves, only about 1/N of the keys move, whereas modulo hashing reshuffles nearly all of them. Virtual nodes (256+ per physical node) smooth out the load distribution. Replication: primary-replica per shard with async replication for performance, sync for safety. The trade-off is real: async means a primary failure can lose the last few writes; sync means every write waits for the slowest replica. Most caches accept the async loss because the source of truth is the database. Consistency: caches are eventually consistent by design, and the interview-grade answer names the staleness window and the invalidation strategy. Write-through, write-behind, and cache-aside each have failure modes. Cache-aside is the default — application reads cache, falls back to DB on miss, populates cache — but it has a thundering-herd problem on a popular missing key. The fix is request coalescing or a short negative cache. Eviction: LRU is fine for most workloads; LFU helps when you have a stable hot set; TTL is mandatory for anything user-facing. Follow-ups: hot-key handling (replicate the key, or add a per-key local cache), cluster resharding without downtime (slot migration with dual-read window), and monitoring (hit ratio, p99 latency, eviction rate are the three numbers that matter).
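A consistent hash ring with virtual nodes fits in a few dozen lines, and writing it out makes the 1/N claim concrete. A minimal sketch; MD5 for point placement is illustrative, and production implementations often prefer a faster non-cryptographic hash.

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, vnodes: int = 256):
        self.vnodes = vnodes
        self._hashes = []  # sorted virtual-node positions on the ring
        self._nodes = []   # parallel list: owning node for each position

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def add_node(self, node: str) -> None:
        # Each physical node gets `vnodes` points, smoothing the load.
        for i in range(self.vnodes):
            h = self._hash(f"{node}#{i}")
            idx = bisect.bisect(self._hashes, h)
            self._hashes.insert(idx, h)
            self._nodes.insert(idx, node)

    def remove_node(self, node: str) -> None:
        pairs = [(h, n) for h, n in zip(self._hashes, self._nodes) if n != node]
        self._hashes = [h for h, _ in pairs]
        self._nodes = [n for _, n in pairs]

    def node_for(self, key: str) -> str:
        if not self._hashes:
            raise KeyError("empty ring")
        # Walk clockwise to the first virtual node at or past the key.
        idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._hashes)
        return self._nodes[idx]
```

Adding a node inserts its virtual points and steals roughly 1/N of the keyspace from its neighbors; removing it hands those ranges back, which is exactly the property modulo hashing lacks.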
Design a rate limiter.
The first question is where it runs — edge (CDN/API gateway), service mesh, or in-process. Each placement has different consistency requirements. Edge limiters can be approximate; per-user-per-API limiters need to be precise. The four classic algorithms are token bucket, leaky bucket, fixed window, and sliding window. Token bucket allows bursts up to the bucket size while enforcing a sustained average rate — best for user-facing APIs where occasional bursts are normal. Leaky bucket smooths output to a constant rate — best for protecting a downstream that cannot tolerate bursts. Fixed window is the simplest but has the boundary problem (2x burst across the window edge). Sliding window log is exact but expensive in memory; sliding window counter is the production sweet spot — interpolate across two adjacent fixed windows. Distributed implementation: a central Redis with INCR + EXPIRE works up to ~100K req/s but becomes the bottleneck. The next step is per-node local buckets that periodically sync to a central store, accepting some over-allowance during the sync window. The senior-grade trade-off to articulate: strict global precision vs. availability under partition. If Redis is down, do you fail open (allow all traffic, risk overload) or fail closed (block all traffic, guaranteed outage)? Most production systems fail open with a local fallback limiter at a higher threshold. Follow-ups: per-tier limits (free vs. paid), per-endpoint limits, and how to surface the limit to the client (429 with Retry-After header).
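The sliding-window counter is easy to describe and easy to get subtly wrong, so a sketch helps. A minimal single-node version; in the distributed form the two counters become per-window Redis keys (INCR plus EXPIRE) rather than a local dict.

```python
import time
from typing import Optional

class SlidingWindowCounter:
    def __init__(self, limit: int, window_s: float = 60.0):
        self.limit = limit
        self.window_s = window_s
        self.counts = {}  # window index -> request count

    def allow(self, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        curr = int(now // self.window_s)
        frac = (now % self.window_s) / self.window_s

        # Interpolate: weight the previous window by how much of it still
        # overlaps the trailing window_s seconds, then add the current
        # window's count.
        estimate = (self.counts.get(curr - 1, 0) * (1 - frac)
                    + self.counts.get(curr, 0))
        if estimate >= self.limit:
            return False
        self.counts[curr] = self.counts.get(curr, 0) + 1
        # Keep only the current and previous windows to bound memory.
        self.counts = {w: c for w, c in self.counts.items() if w >= curr - 1}
        return True
```

The interpolation assumes requests were spread evenly across the previous window — that approximation is what makes it cheap; the sliding window log is what you reach for when it is unacceptable.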
Design a notification service (email, SMS, push).
This question rewards seeing it as a pipeline, not a service. The components are: a notification API that accepts events, a template/personalization layer, a routing layer that decides which channels to use per user, channel-specific senders, and a state machine tracking delivery. Decouple aggressively with a queue between every stage — a slow APNs response should not block email. Per-channel rate limits are mandatory; SES caps you at a sustained send rate, APNs throttles per-app-per-device, SMS providers charge per message and have country-specific rules. The interesting trade-offs: at-least-once vs. exactly-once delivery (exactly-once is essentially impossible across third parties; you accept dedup at the recipient via idempotency keys), batching vs. real-time (batch lifts throughput but adds latency; user-visible alerts must be real-time, digests can batch), priority lanes (a 2FA code cannot sit behind 100K marketing emails). User preferences are the silent complexity — opt-outs per channel per category, quiet hours, frequency caps, and timezone-aware sending. These cannot live as scattered if-statements; they belong in a preferences service consulted before send. Follow-ups: failure handling (retry with exponential backoff, then dead-letter queue with manual review), tracking (open and click pixels for email, delivery receipts for SMS, but never block the send on tracking writes), and template versioning (A/B testing requires version-pinned templates so analytics align).
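The idempotency-key dedup deserves a concrete shape. A sketch using a redis-py `SET NX` claim; `sender` stands in for a hypothetical channel adapter (APNs/SES/SMS client), and the key scheme is an assumption, not a standard.

```python
import hashlib

def notification_key(user_id: str, event_id: str, channel: str) -> str:
    # Deterministic: the same event, channel, and recipient always map
    # to the same key, so any retry collapses into the original send.
    return hashlib.sha256(f"{user_id}:{event_id}:{channel}".encode()).hexdigest()

def send_once(redis, sender, user_id, event_id, channel, payload, ttl_s=86_400):
    key = "notif:" + notification_key(user_id, event_id, channel)
    # SET NX returns falsy if the key already exists: this send was
    # already claimed by an earlier attempt or a concurrent worker.
    if not redis.set(key, "claimed", nx=True, ex=ttl_s):
        return False
    try:
        sender.send(user_id, payload)
        return True
    except Exception:
        # Release the claim so a retry can reattempt the send.
        redis.delete(key)
        raise
```

Be honest about the gap: a crash between the claim and the provider call can still drop or double a message, which is precisely why exactly-once across third parties is off the table and key-level dedup is the practical target.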
Design a chat application (1:1 and group).
The protocol question comes first: long-poll, server-sent events, or WebSocket. WebSocket is the modern default — bidirectional, low overhead per message, persistent. The connection layer is stateful, which is the central architectural challenge. Sticky load balancing routes a user's connection to a specific gateway node; the gateway maintains an in-memory map of user-to-connection. When User A sends a message to User B, the system must locate B's gateway node — a presence service (Redis hash of user → gateway) does this in O(1). Storage: messages are append-only and time-ordered, which fits Cassandra or a sharded MySQL keyed by conversation_id. Group chats with 1000+ members add a fanout problem similar to Twitter — broadcast to every online member, queue for offline. Trade-offs to name: read receipts (per-user-per-message state explodes for large groups; cap or aggregate), typing indicators (ephemeral, never persist, throttle aggressively), end-to-end encryption (changes the entire design — the server cannot search, cannot generate previews, cannot do server-side spam filtering). Delivery semantics: at-least-once with client-side dedup via message_id is the practical choice; exactly-once is a fiction. Follow-ups: offline message queue (push notification + sync on reconnect), media attachments (upload to object storage, send the URL not the bytes), and history search (separate index — Elasticsearch — synced from the message store, never query the primary store for search).
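The send path is short enough to sketch end to end. A minimal illustration of the presence lookup and the online/offline fork, assuming a redis-py client and a hypothetical `rpc` handle to the gateway fleet.

```python
import json
import time
import uuid

def register_connection(redis, user_id: str, gateway_id: str) -> None:
    # Presence: one hash maps each user to the gateway holding their
    # WebSocket; the gateway writes this on connect, deletes on close.
    redis.hset("presence", user_id, gateway_id)

def send_message(redis, rpc, sender_id: str, recipient_id: str, text: str) -> None:
    msg = json.dumps({
        "message_id": str(uuid.uuid4()),  # client-side dedup key
        "from": sender_id,
        "to": recipient_id,
        "text": text,
        "ts": time.time(),
    })
    gateway = redis.hget("presence", recipient_id)
    if gateway:
        # Online: forward to the gateway node that owns B's connection.
        rpc.call(gateway, "deliver", msg)
    else:
        # Offline: queue for sync on reconnect, then trigger a push.
        redis.rpush(f"offline:{recipient_id}", msg)
```

In the full design, the message is persisted to the conversation store before either branch runs; the presence map only decides the real-time delivery path.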