Updated for 2026 hiring loops

DevOps Engineer Interview Questions — SRE and Platform Loops Decoded

DevOps and SRE interviews look superficially similar to backend engineering loops, but they grade for a different shape of thinking. The interviewer wants to see operational judgement — the ability to reason about systems that are already running, under load, being modified by other people, and occasionally on fire. That changes which answers score. A coding round in a DevOps loop is usually a scripting or tooling problem (parse this log, write a controller that reconciles state, build a CLI), not a LeetCode puzzle. A system design round expects you to discuss capacity, failure modes, and rollout strategy — not just box-and-arrow architecture. The behavioural round is heavily weighted toward incident response and cross-team friction because that's most of the job. Be specific about what you've actually run in production. "I've used Kubernetes" is a tutorial answer; "I've run a 200-node EKS cluster with multi-tenant workloads, and here's the upgrade pain we hit" is a senior answer. Numbers, postmortems, and tradeoffs are the currency.

How DevOps and SRE loops are structured

  • Tools round. Live troubleshooting — broken Kubernetes cluster, a Terraform plan that won't apply, a CI pipeline that hangs. The grader wants to see your debugging loop: what you check first, how you read logs, how you isolate variables.
  • Infra system design. Build a logging pipeline, a deploy system, an observability stack. Capacity numbers are graded — "a lot of logs" is not an answer, "10TB/day, 600 MB/s peak, 100TB after compression" is.
  • Coding / automation. Write a controller, a config generator, a small CLI. Often in Go or Python. Idiomatic code matters; this is your tools-thinking showing through.
  • Incident response and behavioural. Almost always includes "tell me about an outage you owned". Specificity about detection, mitigation, and post-mortem follow-through is the signal.

Infrastructure & automation Q&A

These probe whether you understand what the tools actually do, not whether you can recite the docs. Discuss internals, failure modes, and tradeoffs.

Q1. Explain what happens when you run kubectl apply -f deployment.yaml. Walk through the components involved.

Start at the client: kubectl parses the YAML, resolves the current kubeconfig context, and sends an HTTPS POST or PATCH to the API server. The API server authenticates (token, cert, OIDC), authorises (RBAC), and runs the request through admission control: first the mutating webhooks (which can inject sidecars, default values, image-pull secrets), then schema validation, then validating webhooks (policy gates like OPA Gatekeeper or Kyverno). Only then does it persist the object to etcd. The deployment controller (inside the controller manager) sees the new Deployment via a watch, reconciles, and creates a ReplicaSet; the ReplicaSet controller sees that and creates the desired number of Pods. Pods land in etcd with nodeName empty. The scheduler watches for unscheduled pods, filters out nodes that cannot run them (resources, taints, affinity; historically called predicates), scores the remaining nodes (historically priorities), and binds each pod to a node by writing nodeName. The kubelet on that node sees a pod assigned to it, pulls the image via the container runtime (containerd/CRI-O), invokes the CNI plugin to set up the pod's network namespace and assign an IP, attaches volumes via CSI, and starts the containers. The kubelet keeps reporting status back to the API server. The L5 signal is naming the difference between apply and create: apply is declarative, and with client-side apply it merges your manifest into the live object via a three-way merge using the last-applied-configuration annotation, so it preserves fields you didn't set; create is imperative and errors if the object exists. Also mention the RollingUpdate strategy on the Deployment: maxSurge and maxUnavailable govern how many new pods can be created above desired count and how many old pods can be taken down at once, and the rollout stalls if new pods never pass their readiness probes, because old pods are only removed as new ones become ready.
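
The three-way merge behind apply can be sketched in a few lines. This is a simplified model under stated assumptions, not kubectl's implementation: it treats objects as plain nested dicts and ignores the list merge keys that a real strategic merge patch handles. The field names (replicas, image, paused, injectedSidecar) are hypothetical.

```python
from typing import Any


def three_way_merge(last_applied: dict[str, Any],
                    desired: dict[str, Any],
                    live: dict[str, Any]) -> dict[str, Any]:
    """Dict-only model of a client-side three-way merge."""
    # Start from the live object so fields set by others survive.
    merged = dict(live)

    # Fields we applied last time but have since removed from the
    # manifest are deleted from the live object.
    for key in last_applied:
        if key not in desired:
            merged.pop(key, None)

    # Fields in the new manifest are written, recursing into nested maps.
    for key, value in desired.items():
        if isinstance(value, dict) and isinstance(live.get(key), dict):
            previous = last_applied.get(key)
            merged[key] = three_way_merge(
                previous if isinstance(previous, dict) else {},
                value,
                live[key],
            )
        else:
            merged[key] = value
    return merged


last_applied = {"replicas": 3, "image": "web:1.0", "paused": False}
desired = {"replicas": 5, "image": "web:1.1"}
live = {"replicas": 3, "image": "web:1.0", "paused": False,
        "injectedSidecar": "proxy:2.3"}

result = three_way_merge(last_applied, desired, live)
print(result)
```

In this sketch the field added by a mutating webhook (injectedSidecar) survives because it never appeared in the last-applied record, while paused is removed because it was applied previously and then dropped from the manifest.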

Q2. Write a Terraform module that provisions an autoscaled web service across 3 AZs with an ALB. Discuss the state file.

The module needs: a VPC with public and private subnets across 3 AZs (or it accepts an existing VPC as input, which is better for production reuse), security groups for the ALB (allow 443 from 0.0.0.0/0) and for the instances (allow the ALB SG on the app port), a launch template (AMI, instance type, user_data for bootstrapping, IAM instance profile), an Auto Scaling Group spanning all 3 private subnets with target group attachment, an ALB with HTTPS listener + ACM cert + a target group with health checks, and a scaling policy (target tracking on CPU at 60% is a common default). Variables: min_size, max_size, desired_capacity, instance_type, ami_id, app_port, domain_name. Outputs: alb_dns_name, alb_zone_id (so a parent module can create a Route53 alias), asg_name. The state file is the real interview content. Local state (terraform.tfstate on disk) only works for solo experimentation: any lock it takes is local to one machine, and there is no sharing. For any team, use a remote backend: S3 for the state blob plus a DynamoDB table for state locking (recent Terraform releases can also lock natively in S3 through the backend's use_lockfile option). State locking matters because two engineers running apply concurrently will race: both read the same state, both compute plans against it, the second apply overwrites the first's changes, and the state diverges from reality. The lock prevents this: Terraform acquires it before any operation that could write state and releases it at the end. The other state problem is secrets: sensitive values that reach state are stored in plaintext (RDS passwords, private keys, API tokens), so the S3 bucket must be encrypted at rest, versioning must be enabled (so you can recover from a bad apply), and access must be tightly IAM-scoped. Mention workspaces vs separate state files per environment: workspaces share backend config but isolate state; separate backends per env are safer for production isolation.
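
The locking behaviour described above boils down to a conditional write: the lock record is created only if it does not already exist. A minimal in-memory sketch, assuming a hypothetical StateLock class standing in for the lock table (it is not Terraform's or DynamoDB's API):

```python
import threading


class StateLock:
    """In-memory stand-in for a lock table with conditional-put semantics."""

    def __init__(self) -> None:
        self._holders: dict[str, str] = {}
        self._mutex = threading.Lock()

    def acquire(self, state_key: str, owner: str) -> bool:
        # Succeeds only if nobody currently holds the lock for this state.
        with self._mutex:
            if state_key in self._holders:
                return False
            self._holders[state_key] = owner
            return True

    def release(self, state_key: str, owner: str) -> None:
        # Only the current holder may release the lock.
        with self._mutex:
            if self._holders.get(state_key) == owner:
                del self._holders[state_key]


lock = StateLock()
key = "prod/terraform.tfstate"  # hypothetical state path

alice_has_lock = lock.acquire(key, "alice")      # first apply takes the lock
bob_blocked = not lock.acquire(key, "bob")       # concurrent apply is refused
lock.release(key, "alice")                       # first apply finishes
bob_has_lock = lock.acquire(key, "bob")          # second apply can now proceed

print(alice_has_lock, bob_blocked, bob_has_lock)
```

The point to make in the interview is that the second engineer fails fast instead of planning against stale state; the lock serialises the read-plan-write cycle rather than merging concurrent changes.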

Q3. You have a CI pipeline that takes 45 minutes. Walk through how you'd reduce it to under 10 minutes.

Profile first: instrument the pipeline so you know which stages dominate. Usually it's tests, then image builds, then dependency install. Then attack in this order. Parallelisation: shard the test suite across N runners (many test runners support a --shard style flag, or you split by file count or historical timing). Run unit, integration, and lint stages as parallel jobs, not serially. Caching: Docker layer cache via BuildKit with a remote cache backend (registry cache or S3-backed cache), dependency cache (node_modules, .gradle, .m2, Go module cache) keyed on the lockfile hash, and build artifact cache for incremental compiles. Pre-warmed runners: bake the base image with system deps, language toolchains, and CLI tools pre-installed so cold start doesn't reinstall apt packages and JDKs every run. Test selection: only run tests affected by the diff. Bazel and Nx do this natively via the build graph; for less structured codebases you can use file-pattern heuristics ('changed files under packages/auth/** → run auth tests') as a 90% solution. Fail fast: run lint and a smoke subset before the full suite so an obvious typo aborts in 30 seconds, not 30 minutes. The senior signal is what NOT to optimise. Flaky tests will silently undo every speed gain: when a spurious failure retries the whole job, that run pays for a second full pass, and per-test flake rates compound across the suite (500 tests that each flake 0.1% of the time produce a spurious failure in roughly 40% of runs). Fix flakiness before optimising further. Retry budgets matter too, because automatic retries hide real failures and waste runner minutes. And the cost calculus: a $200k/yr engineer waiting 45 minutes vs 10 minutes 8 times a day is meaningful, but if the optimisation costs 6 engineer-weeks plus $50k/yr in cache infrastructure, you need to be honest about the payback period.
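
Sharding by historical timing, mentioned above, is commonly done with a greedy longest-first assignment. A minimal sketch, assuming hypothetical test names and durations in seconds; real runners may use a different balancing strategy:

```python
import heapq


def shard_by_timing(durations: dict[str, int], shards: int) -> list[list[str]]:
    """Assign each test to the currently least-loaded shard, longest tests first."""
    # Min-heap of (accumulated_seconds, shard_index).
    heap = [(0, index) for index in range(shards)]
    heapq.heapify(heap)
    plan: list[list[str]] = [[] for _ in range(shards)]

    # Sort by duration descending; break ties by name for determinism.
    for test, seconds in sorted(durations.items(), key=lambda kv: (-kv[1], kv[0])):
        load, index = heapq.heappop(heap)
        plan[index].append(test)
        heapq.heappush(heap, (load + seconds, index))
    return plan


# Hypothetical timings taken from previous pipeline runs.
durations = {
    "test_auth": 300,
    "test_billing": 240,
    "test_search": 180,
    "test_api": 120,
    "test_ui": 60,
    "test_utils": 60,
}

plan = shard_by_timing(durations, shards=3)
loads = [sum(durations[test] for test in shard) for shard in plan]
print(plan)
print(loads, "serial total:", sum(durations.values()))
```

With these numbers the slowest shard runs for 360 seconds against 960 seconds serially. The wall-clock time of a sharded stage is the slowest shard, which is why balancing by timing beats splitting by file count when test durations are skewed.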

Q4. Describe how you'd design a multi-region active-active deployment with database failover.

Three layers: replication topology, read/write routing, and failover orchestration. For Postgres, the usual route to active-active is bidirectional logical replication (publications and subscriptions per table), often via an extension or a commercial distribution, with conflict handling designed explicitly because core Postgres does not resolve write conflicts for you: last-write-wins for most tables, application-managed for unique constraints. The cleaner option is a globally distributed database (CockroachDB, Spanner, YugabyteDB) that does multi-region consensus and gives you serialisable transactions across regions, at the cost of write latency proportional to the inter-region RTT for the quorum. For routing: writes go to the primary in the region nearest the user (or to a single global primary in active-passive), reads can come from any replica but you must handle read-your-writes: either route the user's reads to the same region as their writes for a session, or use causality tokens (the client sends back the LSN it last saw, the replica waits to apply up to that LSN before serving). Failover: define your RTO (how fast you recover) and RPO (how much data you're willing to lose). Automated promotion via Patroni or Amazon Aurora's built-in failover can reach sub-minute RTO, with RPO near zero when replication is synchronous and bounded by replication lag when it is asynchronous; manual promotion is slower but avoids false failovers from a flapping network partition. Traffic management: DNS-based failover (Route53 health checks updating A records) is the simplest but TTL-bound, so clients with cached DNS hit dead regions for up to TTL seconds. Anycast IP with BGP withdrawal is faster but requires owning IP space and BGP relationships. Application-level routing via a smart client SDK or a service mesh gives the fastest cutover but couples failover into the application. Split-brain prevention is the critical hard part: when the network partitions, both regions may think they're the primary and accept conflicting writes. 
Solutions: quorum (require a majority of voters, floor(N/2)+1, to acknowledge a write, so the minority side of a partition can't commit; with two regions this needs a third voter or witness), fencing tokens (each generation of leadership has a monotonically increasing token, and writes from an old token are rejected), and STONITH (the new leader cuts power or revokes credentials to the old one before promoting). For globally distributed DBs, the consensus protocol handles this, but for hand-rolled replication you must explicitly design it in. Always test failover regularly with game days, because undocumented failover paths break under real pressure.
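
The fencing-token idea is small enough to sketch directly. This is an illustrative model, assuming a hypothetical FencedStore class in place of whatever storage layer enforces the check; the key and values are made up:

```python
class FencedStore:
    """Storage that rejects writes carrying a stale leadership token."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.data: dict[str, int] = {}

    def write(self, token: int, key: str, value: int) -> bool:
        # A token lower than one already seen belongs to a deposed leader.
        if token < self.highest_token:
            return False
        self.highest_token = token
        self.data[key] = value
        return True


store = FencedStore()

old_leader_first = store.write(token=1, key="balance", value=100)   # accepted
new_leader_write = store.write(token=2, key="balance", value=120)   # accepted
old_leader_late = store.write(token=1, key="balance", value=90)     # rejected

print(old_leader_first, new_leader_write, old_leader_late, store.data)
```

The important design point is that the check lives in the storage layer, not in the leader: a partitioned old primary cannot be trusted to know it has been replaced.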

Infrastructure system design Q&A

Infra system design rounds differ from product system design — capacity, cost, operability, and rollout safety are first-class. Start with numbers, then architect.

Q1. Design a centralized logging system for a 1000-microservice deployment generating 10TB/day of logs, with sub-second search and 30-day retention.

Size the problem first: 10TB/day is ~115 MB/s sustained, but logs are bursty, so plan for 5x peak, roughly 600 MB/s of ingest capacity. 30 days at 10TB/day is 300TB before compression, ~100TB compressed at about 3:1. Pipeline: agent → buffer → indexer → storage tiers. Agent: Fluent Bit (lightweight, ~10MB RAM per node) or Vector (Rust, better throughput) on every host or as a Kubernetes DaemonSet, tailing container stdout/stderr via /var/log/containers. Add structured fields (pod, namespace, service, trace_id) by enriching from the Kubernetes API. Buffer: Kafka with partitioning by service. This decouples ingest spikes from indexing: if the indexer falls behind, Kafka holds the backlog rather than dropping logs at the agent. Size Kafka for ~6 hours of buffer at peak rate. Indexing: two strategies. Elasticsearch/OpenSearch for full-text search is fast but expensive at scale; budget 1.5-2x storage overhead for indexes and replication. Loki (Grafana) for label-indexed log streams is much cheaper because it only indexes labels, not content; chunks live in S3 and grep happens on read. The realistic answer is hybrid: ES/OS for the last 7 days (hot tier on local NVMe with replicas), then roll over to S3 with Athena/Trino as the cold tier for days 8-30. Sub-second search on hot tier is achievable; cold-tier queries are minutes, which is fine for forensics. Cardinality is the silent killer: adding request_id or user_id as a Loki label (or as a Prometheus label on the metrics side) creates a new stream per value and explodes the index. Elasticsearch has its own version of the problem: unbounded dynamic field names cause mapping explosion, and aggregations over high-cardinality keyword fields are expensive. Keep such identifiers in the log body or in a small set of explicitly mapped fields rather than promoting them to labels or letting arbitrary keys become fields. Mention OpenTelemetry: the modern standard is OTel collectors as the agent/aggregator, emitting OTLP to whatever backend you choose, decoupling instrumentation from storage. Cost realism: at this scale, you're spending $200k-$1M/year on storage and compute; sampling (drop debug logs, keep INFO+, keep 100% of errors) and dropping noisy log lines at the agent often pays back in weeks.
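
The sizing arithmetic above is worth being able to reproduce on a whiteboard. A short sketch, using decimal units; the 3:1 compression ratio is the one implied by the 300TB and ~100TB figures, and the Kafka replication factor of 3 is an assumption that the text does not state:

```python
SECONDS_PER_DAY = 86_400
TB = 10**12
MB = 10**6

daily_ingest_bytes = 10 * TB
peak_factor = 5                 # burst multiplier from the text
retention_days = 30
compression_ratio = 3.0         # implied by 300TB raw -> ~100TB compressed
kafka_buffer_hours = 6
kafka_replication_factor = 3    # assumption, not stated in the text

# Ingest rate.
sustained_mb_s = daily_ingest_bytes / SECONDS_PER_DAY / MB
peak_mb_s = sustained_mb_s * peak_factor

# Retention footprint.
raw_retention_tb = daily_ingest_bytes * retention_days / TB
compressed_retention_tb = raw_retention_tb / compression_ratio

# Kafka buffer, sized at the rounded-up planning figure of 600 MB/s.
planned_peak_mb_s = 600
kafka_buffer_tb = planned_peak_mb_s * MB * kafka_buffer_hours * 3600 / TB
kafka_disk_tb = kafka_buffer_tb * kafka_replication_factor

print(f"sustained: {sustained_mb_s:.1f} MB/s, peak: {peak_mb_s:.1f} MB/s")
print(f"retention: {raw_retention_tb:.0f} TB raw, {compressed_retention_tb:.0f} TB compressed")
print(f"kafka buffer: {kafka_buffer_tb:.2f} TB, {kafka_disk_tb:.2f} TB with replication")
```

Six hours at 600 MB/s is about 13TB of buffer before replication, which is the number that tells you whether the Kafka tier is a handful of brokers or a cluster in its own right.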

Q2. Design a deploy pipeline that supports canary releases, automatic rollback, and traffic shifting for a 50-service microservices platform.

GitOps as the core: the desired cluster state lives in Git, and ArgoCD or Flux continuously reconciles the cluster to match. This means deploys are pull requests, not kubectl invocations — which gives you review, audit, and rollback for free (revert the PR). Layered on top: progressive delivery via Argo Rollouts or Flagger. Both extend the Deployment with strategies — canary (route N% of traffic to new version, hold, measure, increase), blue/green (deploy alongside, swap on success), and experiment (compare two variants under A/B traffic). Metric-based promotion is the differentiator from naive canaries: define SLOs (success rate, p95 latency, error budget burn) and the rollout queries Prometheus/Datadog at each step; if SLO is violated, it auto-rolls back to the previous ReplicaSet. Traffic shifting at the mesh layer — Istio's VirtualService weights, Linkerd's TrafficSplit — gives precise percentage control without LB reconfiguration. For services not in the mesh, do it at the ALB/Envoy LB level using weighted target groups, but you lose per-request granularity. The harder problems for a 50-service platform: (1) Cross-service deploys that need coordination — service A's new version requires service B's new API. Solution: enforce backward-compatible API changes (additive only), versioned endpoints, and feature flags to decouple deploy-time from release-time. (2) Schema migrations that aren't backward compatible — expand/contract pattern: add the new column, deploy code that writes both, backfill, deploy code that reads new only, drop the old column. Each step is a separate deploy; the rollout never reaches a state where rollback is impossible. (3) Human-in-the-loop checkpoints — high-risk services (payments, auth) get manual approval gates between canary stages; low-risk services auto-promote. Codify this in the Rollout spec so it's policy, not tribal. (4) Multi-cluster: ArgoCD ApplicationSets to deploy the same app across regions with per-region overrides. 
Track which regions are at which version so you can detect drift. Mention Crossplane or KCL for unifying app, infra, and config in the GitOps model.
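
Metric-based promotion reduces to a loop over traffic weights with an SLO check at each step. A minimal sketch under stated assumptions: the Slo and Metrics types, the thresholds, and the fetch callback are hypothetical stand-ins for what a rollout controller would read from its spec and query from a metrics backend; this is not the Argo Rollouts or Flagger API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Slo:
    min_success_rate: float
    max_p95_latency_ms: float


@dataclass(frozen=True)
class Metrics:
    success_rate: float
    p95_latency_ms: float


def violates(slo: Slo, metrics: Metrics) -> bool:
    return (metrics.success_rate < slo.min_success_rate
            or metrics.p95_latency_ms > slo.max_p95_latency_ms)


def run_canary(weights: list[int], slo: Slo,
               fetch: Callable[[int], Metrics]) -> tuple[str, int]:
    """Step through traffic weights; roll back at the first SLO violation."""
    for weight in weights:
        # A real controller would shift traffic to `weight` percent here,
        # hold for the analysis window, then query the metrics backend.
        if violates(slo, fetch(weight)):
            return ("rolled_back", weight)
    return ("promoted", weights[-1])


slo = Slo(min_success_rate=0.99, max_p95_latency_ms=300.0)
steps = [10, 25, 50, 100]


def healthy(weight: int) -> Metrics:
    return Metrics(success_rate=0.999, p95_latency_ms=180.0)


def degrades_under_load(weight: int) -> Metrics:
    # Simulated release that only misbehaves once it takes half the traffic.
    if weight < 50:
        return Metrics(success_rate=0.999, p95_latency_ms=180.0)
    return Metrics(success_rate=0.97, p95_latency_ms=420.0)


healthy_result = run_canary(steps, slo, healthy)
degraded_result = run_canary(steps, slo, degrades_under_load)
print(healthy_result, degraded_result)
```

The second scenario is the one worth narrating: a release that looks fine at 10% and 25% but breaches the SLO at 50% is exactly the failure a single-step canary would miss.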

Q3. Design an observability stack — metrics, logs, traces — for a global SaaS company at series C scale.

Three pillars, one common metadata schema. Metrics: Prometheus scraping at the cluster level, federated upward to a long-term store. The OSS path is Mimir (Grafana) or Thanos for horizontally scalable Prometheus-compatible storage with object-storage backends; the commercial path is Datadog, Honeycomb (which uses events not metrics, slightly different model), Chronosphere, or Grafana Cloud. At series C, the build-vs-buy decision usually lands on buy for the first observability stack and revisit at series D when the bill becomes painful (think $1M+/yr). Use the OpenMetrics format, expose /metrics endpoints from every service, and standardise on the four golden signals (latency, traffic, errors, saturation) plus business KPIs (orders/min, signups/hr). Traces: OpenTelemetry SDK in every service, instrumenting HTTP/gRPC/DB clients automatically. Spans flow to an OTel collector (deployed as DaemonSet for local aggregation plus a regional gateway tier for batching, sampling, and routing). Storage: Jaeger or Tempo (OSS) or Honeycomb/Datadog/Lightstep (commercial). Trace sampling is the cost lever: head-based sampling (decide at root span time, e.g. 1% of traces) is cheap but loses errors; tail-based sampling (collect all spans, decide at the end whether to keep) keeps 100% of errors and 100% of slow requests plus a sample of healthy ones, at the cost of a buffering tier in the collector. The right answer at series C scale is tail-based with rules: 100% of errors, 100% of traces slower than the p99 latency threshold, 1-5% of healthy. Logs: covered above; integrate by sharing trace_id across all three pillars so you can pivot from a slow trace to its logs in one click. The harder questions: cost. Metrics cardinality is the most common runaway. Series count is the product of label cardinalities, so a histogram with 10 buckets and three labels of 100 values each is already about 10M time series; one such metric can be 10% of your bill. Audit cardinality monthly; drop unused labels. 
Log volume — drop debug logs aggressively, keep INFO+ on hot tier, archive everything to S3 for compliance/forensics at $0.023/GB-mo. Trace storage — sample aggressively, drop noisy spans (healthchecks, metrics scrapes). SLO-driven prioritisation is the cultural piece: alerts fire on customer-impacting SLO burn, not on every CPU spike. Wire the alert routing to PagerDuty with multi-burn-rate windows (fast burn for outages, slow burn for degradation). Every team owns SLOs for the services they run; the platform team owns the observability infrastructure but not the alert definitions.
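
The tail-based sampling rules described above (keep all errors, keep all slow traces, keep a fraction of healthy ones) can be sketched as a single decision function. This is an illustration, not the OpenTelemetry Collector's implementation; the function name, threshold, and trace IDs are hypothetical. Hashing the trace ID makes the healthy-trace decision deterministic, so every collector instance agrees on the same trace.

```python
import hashlib


def keep_trace(trace_id: str, has_error: bool, duration_ms: float,
               slow_threshold_ms: float, healthy_sample_rate: float) -> bool:
    """Decide, after the trace has completed, whether to store it."""
    if has_error:
        return True                      # keep 100% of errors
    if duration_ms >= slow_threshold_ms:
        return True                      # keep 100% of slow traces
    # Map the trace ID to a stable value in [0, 1) and compare to the rate.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < healthy_sample_rate


SLOW_MS = 800.0   # hypothetical p99 latency threshold
RATE = 0.05       # keep 5% of healthy traces

error_kept = keep_trace("trace-a", True, 40.0, SLOW_MS, RATE)
slow_kept = keep_trace("trace-b", False, 1200.0, SLOW_MS, RATE)

# Fraction of healthy traces kept over a synthetic population of IDs.
healthy_ids = [f"trace-{n}" for n in range(10_000)]
kept = sum(keep_trace(t, False, 50.0, SLOW_MS, RATE) for t in healthy_ids)
print(error_kept, slow_kept, kept / len(healthy_ids))
```

The decision needs the whole trace, which is why tail-based sampling requires the buffering tier: spans must be held until the root span finishes and the error and duration fields are known.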

Behavioural Q&A

Heavy on incident ownership, cross-team negotiation, and long-horizon investment. Pick real stories with numbers; rehearsed STAR templates read as such.

Q1. Walk me through the worst outage you've owned.

The classic SRE behavioural question. Strong answers cover five beats. Detection time: when did the system actually start failing, and when did your team know? If customers reported it before your alerts fired, that itself is a signal about monitoring quality — acknowledge that. Diagnosis: what did you check, in what order, what dead ends did you go down? Be specific about tools (Grafana dashboards, Loki queries, tcpdump, kubectl describe pods) and about the moment you found the actual cause. Mitigation: what did you do to stop the bleeding — rollback, traffic shift away from the bad region, manual database failover, scaling up, flipping a feature flag? Note the gap between mitigation (customers stop being affected) and full resolution (root cause fixed). Duration and impact: be specific with numbers — 47 minutes, ~12k affected users, $X estimated revenue impact, Y% of requests failing at peak. Vague answers ('it was bad') read as someone who wasn't actually in the room. Post-mortem: this is where senior shows up. Talk about the blameless framing — what conditions in the system allowed this to happen, not who pushed the button. Action items with owners and due dates, follow-through measurement (did the action items actually ship within 90 days, or did they rot in the backlog?). The best answers also describe what was surprising about the outage — every real outage has at least one moment of 'wait, that's not supposed to be possible' that exposes a gap in the team's mental model. Finishing with what you carry forward — a habit, a new dashboard, a new gate in the deploy pipeline — lands well. Avoid: blaming a vendor or another team, narrating the outage in heroic terms, or claiming you 'fixed it in five minutes'.

Q2. Tell me about a time you had to make an infrastructure investment that paid off over 6+ months.

Senior infrastructure engineers think in years, not sprints. Pick a real project where you had to make the case upward and execute through to measurable outcome. Examples that land: a build-system rewrite (Bazel migration, Turborepo adoption), a Kubernetes migration off EC2/ECS, an observability overhaul replacing scattered Datadog dashboards with SLO-driven alerts, a CI rebuild that took your pipeline from 60 minutes to 8. The shape of the story: (1) concrete pain — name the metrics that were hurting (build time, MTTR, on-call pages per week, deploy frequency, lead time for changes — DORA metrics are good vocabulary here). Don't just say 'it was slow', say 'p50 build was 23 minutes and engineers were context-switching out of the deploy queue'. (2) The case to leadership — you wrote a doc with cost (engineer-months, infrastructure spend) vs benefit (engineer time saved, incident reduction, revenue protected). Quantify both. Mention you got executive sponsorship and named the person; that signals you understand org dynamics. (3) Execution with milestones — break the project into 30/60/90 day deliverables, each shipping value (don't have a 9-month invisible project). Describe what slipped and why; 'on time and on budget' for infra projects of this size is suspicious. (4) Outcome measured against the original metrics — 'build time dropped from 23 to 6 minutes, on-call pages decreased 40%, deploy frequency 2x'd'. The senior signal is also describing what you cut from scope to ship. 'Originally we planned to also migrate the monorepo to Bazel; we descoped that to phase 2 because we realised the CI win was 80% from caching alone, and Bazel adoption would have added 4 months for the remaining 20%.' Engineers who can't cut scope can't ship large projects.

Q3. Describe a time you pushed back on a team's deploy practices.

This tension is universal — feature teams want velocity, platform/SRE wants safety, and reflexively blocking deploys makes you the enemy. The shape of a strong answer: (1) specific risk — what did you see, with data? 'The auth service had 3 incidents in the last quarter, all traced to deploys without canary; they were using rolling updates with no SLO gates and 100% traffic shifted within 5 minutes.' Concrete pattern, not vibes. (2) How you presented it — start by listening to why they were doing it that way (often there's a real reason, like a previous bad experience with canaries, or a service that genuinely can't be canaried because it's not horizontally scaled), then bring the data — incident frequency, MTTR, the specific failure modes that canaries would have caught. Avoid the 'safety police' framing; frame it as 'we both want you to ship fast, and this is the change that lets you ship fast without rolling back on Fridays'. (3) The compromise — almost never a full ban on the practice. Usually a staged improvement: 'let's add a 10% canary stage for 10 minutes with automated SLO gating; if that passes, full rollout proceeds. We'll measure deploy duration before and after, and revisit in 6 weeks.' Make it cheap to try and easy to measure. (4) Long-term outcome — the team adopted canaries, deploy incidents dropped, and ideally other teams asked to adopt the same pattern. The best answers also acknowledge where you were partially wrong — maybe the canary added 8 minutes to every deploy and that was real friction, so you invested in faster metric evaluation to bring it under 3. Showing that you took their feedback and iterated, rather than entrenching, is the senior platform-engineer signal. Don't end with 'and the team learned to respect SRE more'; that lands as condescending.

Prep for the actual loop

Reading sample answers is the easy half. The harder half is performing under pressure while the interviewer is typing notes about how clearly you reason out loud. Practise live with PhantomCodeAI as your coding copilot for the tools and automation rounds, and review the system design question bank to drill capacity-first design thinking until it's reflex. Most candidates lose SRE loops not on knowledge but on structure — practise the loop, not just the answers.