Uber Interview Question

Design a Metrics Monitoring and Alerting System — Uber Interview

Hard · 22 min · Backend System Design

How Uber Tests This

Uber interviews focus on real-time location tracking, ride-matching algorithms, delivery logistics, geospatial systems, and mobile architecture. Both backend and Android system design questions are common for senior roles.

Interview focus: Real-time location, geospatial systems, delivery logistics, mobile architecture, and monitoring.

Key Topics
distributed systems · monitoring · alerting · prometheus · time series · observability

How to Architect a Metrics Monitoring and Alerting System

Every production system at scale has a monitoring system behind it. The one that wakes the on-call engineer at 3 AM. The one that tells you your API latency spiked before your customers notice. The one that stores five years of CPU metrics so you can compare this Black Friday to the last one.

Building that system is harder than it looks.

On the surface: accept metric samples, store them, query them, fire alerts when thresholds are crossed. In practice: time-series data at scale creates write volumes that destroy conventional databases. Labels on metrics create a cardinality explosion that can bring a storage system to its knees. Alerting that fires too often trains engineers to ignore it. And downsampling historical data is the only thing standing between you and petabytes of storage costs.

This question appears at Datadog, Google, Amazon, Netflix, Cloudflare, and Uber. It tests whether you understand observability as a discipline — not just which tools exist, but why their architectures make the trade-offs they do.


Step 1: Clarify the Scope

Interviewer: Design a metrics monitoring and alerting system.

Candidate: A few clarifying questions. Are we collecting infrastructure metrics — CPU, memory, network — or application-level metrics too, like request latency and error rates? What are the write and read volumes? How long do we retain data? Do we need dashboards and visualisation, or just the storage and alerting layer? And is this multi-tenant — serving many independent teams or customers with isolation between them?

Interviewer: Infrastructure and application metrics both. Assume millions of hosts, each emitting hundreds of metrics every 10 seconds. Retain data for 13 months. Dashboards are in scope at a high level. Multi-tenant — each team's metrics should be isolated.

Candidate: Good. Two things jump out immediately. At that scale, ingestion throughput is enormous — let me calculate that precisely. And 13-month retention with raw granularity would be prohibitively expensive, so we'll need a tiered storage and downsampling strategy. Let me start with the numbers and then build up from there.


Requirements

Functional

  • Collect metrics from millions of hosts and applications
  • Store metrics with their associated labels (metadata tags)
  • Support time-range queries and aggregations (PromQL-style)
  • Define alert rules: fire when a metric crosses a threshold for a sustained period
  • Route alerts to notification channels — PagerDuty, Slack, email
  • Support dashboards that query metrics in real time
  • Multi-tenancy: teams see only their own metrics

Non-Functional

  • High write throughput — tens of millions of data points per second
  • Low query latency — dashboard queries return in under a few seconds
  • High availability — a monitoring outage during an incident is the worst possible time to lose visibility
  • Cost-efficient storage — 13 months of raw data at this scale would cost a fortune; tiering is required
  • No data loss — dropped metrics mean blind spots during incidents

Back-of-the-Envelope Estimates

Interviewer: Walk me through the scale.

Candidate:

plaintext
Monitored hosts:            5 million
Metrics per host:           200 (CPU, memory, network, disk, app metrics)
Collection interval:        every 10 seconds
 
Data points per second:
  5M hosts × 200 metrics / 10s = 100 million data points/second
 
Each data point:
  metric_name + labels + timestamp + float64 value ≈ 50 bytes (raw)
  After compression (Gorilla/delta encoding): ~1.5 bytes/sample
 
Compressed ingestion rate:
  100M × 1.5 bytes = 150 MB/second
 
Raw storage at 10s resolution for 13 months:
  100M samples/s × 86,400s × 390 days × 1.5 bytes
  ≈ 5 PB
 
After downsampling (raw 15 days, 5-min 90 days, 1-hour 13 months):
  ≈ 100 TB — practical and affordable

The numbers make the case for downsampling immediately. At roughly $20/TB/month, 5 PB of raw storage costs about $1.2M per year just for metrics; the tiered ~100 TB costs closer to $24K per year. The storage architecture isn't an academic exercise: it's a 50× difference on the system's largest line item.


High-Level Architecture

plaintext
                      ┌──────────────────────────────────────┐
                      │            Data Sources               │
                      │  Hosts, Containers, Applications,     │
                      │  Databases, Load Balancers            │
                      └──────────────┬───────────────────────┘
                                     │ push (StatsD/OTLP) or
                                     │ pull (Prometheus scrape)
                      ┌──────────────▼───────────────────────┐
                      │         Ingestion Layer               │
                      │  Collector Fleet / Scrape Pool        │
                      │  (Prometheus agents / OTel Collector) │
                      └──────────────┬───────────────────────┘

                      ┌──────────────▼───────────────────────┐
                      │       Streaming Buffer (Kafka)        │
                      │  Decouples ingestion from storage     │
                      │  Partitioned by metric name hash      │
                      └──────┬───────────────────────────────┘

          ┌──────────────────┼──────────────────────────┐
          │                  │                          │
┌─────────▼──────────┐  ┌────▼────────────────┐  ┌──────▼──────────────┐
│   Hot Storage       │  │   Warm Storage      │  │  Cold Storage       │
│   (In-memory TSDB)  │  │   (Local SSD TSDB)  │  │  (Object Storage)   │
│   Last 24–48 hours  │  │   Last 15 days      │  │  Up to 13 months    │
│   Fastest reads     │  │   5-min resolution  │  │  1-hour resolution  │
└─────────────────────┘  └─────────────────────┘  └─────────────────────┘

                      ┌──────▼──────────────────────────────┐
                      │        Query Layer                   │
                      │  Fan-out query across hot/warm/cold  │
                      │  Merges and returns unified result   │
                      └──────────────┬───────────────────────┘

          ┌──────────────────────────┼────────────────────────────┐
          │                          │                            │
┌─────────▼──────────┐   ┌──────────▼──────────┐   ┌────────────▼────────┐
│  Alerting Engine   │   │  Dashboard / UI      │   │  Alert Manager      │
│  (Evaluates rules  │   │  (Grafana / custom)  │   │  (Routes, dedupes,  │
│   on query results)│   │                      │   │   silences, pages)  │
└────────────────────┘   └──────────────────────┘   └─────────────────────┘

The Time-Series Data Model

Before designing storage, the data model needs to be precise.

Interviewer: What does a metric actually look like structurally?

Candidate: A metric is not a single number. It's a named time series with an associated set of labels and a stream of timestamped float values.

plaintext
Metric:
  name:      http_requests_total
  labels:    { service="payments", region="us-east-1", status="200" }
  samples:   [ (t=1700000000, v=142857),
               (t=1700000010, v=142901),
               (t=1700000020, v=142958), ... ]

The combination of name + label set uniquely identifies one time series. Change any label value — say, status="500" instead of status="200" — and you have a completely different time series. This is what creates the cardinality problem, which we'll come back to.

Four metric types are standard across monitoring systems:

  • Counter — monotonically increasing value (requests served, bytes sent). Never decreases. Rate queries compute the per-second rate of increase.
  • Gauge — a snapshot of a current value (CPU %, memory bytes, connection pool size). Can go up or down.
  • Histogram — samples observations into configured buckets (request latency in 0–10ms, 10–50ms, 50–200ms, 200ms+ buckets). Enables percentile computation.
  • Summary — pre-computed percentiles (p50, p95, p99) calculated on the client side before emission.

Understanding these types matters in an interview because they influence how you query and store metrics. A counter needs a rate() function applied before it's meaningful as a dashboard value. A histogram needs a histogram_quantile() to produce p99 latency — and the bucket configuration is a design decision that cannot be changed after data is collected.
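The counter-to-rate relationship can be sketched in a few lines of Python (a hypothetical helper, assuming samples arrive as `(timestamp, value)` pairs; real PromQL `rate()` additionally handles counter resets and range extrapolation):

```python
def rate(samples, window_s):
    """Per-second rate of increase of a counter over the last window_s seconds."""
    end_t = samples[-1][0]
    window = [(t, v) for t, v in samples if t >= end_t - window_s]
    if len(window) < 2:
        return 0.0
    (t0, v0), (t1, v1) = window[0], window[-1]
    return (v1 - v0) / (t1 - t0)

# The raw counter values (142857, 142901, ...) mean nothing on a dashboard;
# the per-second rate of increase is the meaningful signal.
counter = [(1700000000, 142857), (1700000010, 142901), (1700000020, 142958)]
print(rate(counter, 20))  # (142958 - 142857) / 20 = 5.05 requests/second
```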


Pull vs Push Ingestion

Interviewer: Do your collectors pull metrics from hosts, or do hosts push metrics to the collection layer?

Candidate: Both models exist in production. The choice has real architectural implications.

Pull model (Prometheus-style):

The collector actively scrapes each target on a schedule. Each scrape hits a metrics endpoint (/metrics) exposed by the target.

plaintext
Prometheus Scraper
  │ GET /metrics (every 15 seconds)

http://host-42:9090/metrics
  → 200 OK
  → metric_name{label="value"} 42.0

Advantages: the collector knows the exact health of every target — a scrape failure is immediately visible. Targets don't need to know where to send data. Firewall rules are simpler (collector initiates).

Disadvantages: doesn't work well for ephemeral targets (serverless functions, short-lived batch jobs) that may not exist long enough to be scraped. Service discovery is needed to keep the scrape target list current.
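What the scraper does with a response can be sketched as a tiny parser (hypothetical, covering only the `name{labels} value` subset shown above; the real Prometheus exposition format also has escaping, optional timestamps, and HELP/TYPE comments):

```python
def parse_exposition(text):
    """Parse a minimal subset of the Prometheus text exposition format."""
    samples = []
    for line in text.strip().splitlines():
        if line.startswith("#"):          # skip HELP/TYPE comment lines
            continue
        name_labels, value = line.rsplit(" ", 1)
        if "{" in name_labels:
            name, rest = name_labels.split("{", 1)
            labels = dict(pair.split("=", 1)
                          for pair in rest.rstrip("}").split(","))
            labels = {k: v.strip('"') for k, v in labels.items()}
        else:
            name, labels = name_labels, {}
        samples.append((name, labels, float(value)))
    return samples

body = 'http_requests_total{service="payments",status="200"} 142958'
print(parse_exposition(body))
```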

Push model (StatsD / OpenTelemetry push):

Targets emit metrics to a collection endpoint on a schedule.

plaintext
Application
  │ UDP/HTTP POST (every 10 seconds)

metrics-collector.internal:8125
  → { metric: "api.latency", value: 42, tags: [...] }

Advantages: works for any target, including serverless and batch. No need for service discovery on the collection side. Lower coupling — the target doesn't know what's consuming its metrics.

Disadvantages: if the target fails silently, the collection layer doesn't know. Push storms — many targets pushing simultaneously — can overwhelm the collection endpoint.

My recommendation: pull for long-running infrastructure (hosts, services, databases) and push for ephemeral workloads (Lambda functions, batch jobs, CI pipelines). Most large monitoring systems use both. OpenTelemetry supports both protocols, making it a natural choice for the agent layer.


The Kafka Buffer

Between the ingestion layer and storage sits a Kafka buffer. This is not optional at this scale.

Why it's needed:

Without Kafka, the ingestion layer writes directly to the TSDB. A storage slowdown or restart causes back-pressure all the way to the metrics collectors — and eventually dropped metrics. During an incident, when you most need your metrics, is exactly when your storage is most likely to be under stress.

Kafka decouples producers (collectors) from consumers (storage writers). Collectors write at their own pace. Storage writers drain the queue at their own pace. A temporary storage hiccup results in queue depth growth, not metric loss.

Partitioning strategy:

Partition by a hash of (tenant_id, metric_name). This ensures all samples for the same time series land on the same partition, in order. The storage writer for each partition can then build efficient compressed blocks without cross-partition coordination.
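The partitioning idea can be sketched as follows (illustrative only: Kafka's own default partitioner hashes the record key with murmur2; SHA-256 is used here to keep the sketch self-contained and deterministic):

```python
import hashlib

def partition_for(tenant_id: str, metric_name: str, num_partitions: int) -> int:
    """Stable partition assignment: every sample of one time series
    lands on the same partition, preserving per-series ordering."""
    key = f"{tenant_id}:{metric_name}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

# The same (tenant, metric) pair always maps to the same partition.
p1 = partition_for("team-payments", "http_requests_total", 64)
p2 = partition_for("team-payments", "http_requests_total", 64)
assert p1 == p2
```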


Storage: Hot / Warm / Cold Tiering

Interviewer: How do you store 13 months of metrics cost-effectively?

Candidate: Tiered storage with progressive downsampling. Raw data is expensive and only valuable for recent debugging. A week-old metric doesn't need 10-second resolution — nobody is debugging an incident from 30 days ago at 10-second granularity.

Hot tier (last 24–48 hours, in-memory):

The most recent data lives in a Prometheus-style in-memory TSDB. Writes are fast — samples are appended to in-memory chunks. Reads are instant — no disk I/O. This tier serves all real-time dashboards and alert evaluations.

Warm tier (2–15 days, local SSD TSDB):

Data is compacted from memory to SSD-backed storage. Still raw resolution (10-second samples), but compressed. Prometheus's TSDB does this automatically — head blocks flush to disk, then compact into larger immutable blocks. Thanos or VictoriaMetrics extend this into their own distributed storage.

Cold tier (15 days to 13 months, object storage with downsampling):

Data moves to S3 or GCS in progressively downsampled form. Thanos calls this the Compactor:

plaintext
Raw data (10s resolution, 0–15 days):
  Stored in warm tier
 
5-minute downsampling (15–90 days):
  Every 30 raw samples → 1 aggregated sample
  Stored: min, max, sum, count (not just avg — preserves spike detection)
 
1-hour downsampling (90 days–13 months):
  Every 12 five-minute samples → 1 aggregated sample
  Same aggregates: min, max, sum, count

Why store min/max/sum/count and not just the average?

Storing only the average is lossy. A spike that lasted 30 seconds in an hour window would disappear into the average. Storing min and max preserves spike information even at coarse resolution, and the Query Layer can still derive averages (sum ÷ count) and rates from the stored aggregates.
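A minimal downsampling sketch (hypothetical helper, samples as `(timestamp, value)` pairs) shows concretely why keeping max matters:

```python
def downsample(samples, bucket_s):
    """Group samples into fixed-width time buckets, keeping
    min/max/sum/count per bucket rather than a single average."""
    buckets = {}
    for t, v in samples:
        b = t - (t % bucket_s)                 # bucket start timestamp
        if b not in buckets:
            buckets[b] = {"min": v, "max": v, "sum": v, "count": 1}
        else:
            agg = buckets[b]
            agg["min"] = min(agg["min"], v)
            agg["max"] = max(agg["max"], v)
            agg["sum"] += v
            agg["count"] += 1
    return buckets

# A brief spike to 95 survives in `max`, while the average hides it.
raw = [(0, 10), (10, 12), (20, 95), (30, 11)]
agg = downsample(raw, 300)[0]
print(agg["max"], agg["sum"] / agg["count"])  # 95 vs an average of 32.0
```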


Time-Series Database Internals

Interviewer: What makes a time-series database different from a regular database for storing metrics?

Candidate: Two things: the write pattern and the compression opportunity.

The write pattern: metrics arrive in timestamp order. Unlike a general-purpose database that gets inserts at arbitrary primary keys, a TSDB always appends to the "now" end of each time series. This makes log-structured storage (LSM trees) natural — writes go to a sequential log and are compacted offline.

Compression — the Gorilla algorithm:

Facebook published their Gorilla TSDB paper in 2015. It describes two compression techniques that work together.

For timestamps:

plaintext
Timestamps typically arrive at fixed intervals (every 10 seconds).
Store the delta:
  t0 = 1700000000
  t1 = delta = 10
  t2 = delta-of-delta = 0 (same interval)
  t3 = delta-of-delta = 0
Encodes to ~1.5 bits per timestamp for regular series.

For values:

plaintext
Float64 values often change slightly between samples.
XOR consecutive values:
  v0 = 42.001  (binary: 0100000001000101 000000011000...)
  v1 = 42.003  (binary: 0100000001000101 000000111000...)
  XOR = only the differing bits (leading/trailing zeros stripped)
Encodes to roughly a byte per value on average for slowly-changing series.

Together: roughly 1.5 bytes per sample (the Gorilla paper reports an average of 1.37 bytes per data point) instead of 12 bytes (8-byte float + 4-byte int32 timestamp). That's the 8× compression that makes the storage math work. VictoriaMetrics claims even better compression, approximately 0.4–1 byte per sample, through additional optimisations on top of Gorilla.
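The two techniques can be sketched in Python (illustrative only; real Gorilla packs the results into variable-length bit strings, which this sketch omits):

```python
import struct

def delta_of_delta(timestamps):
    """Gorilla-style timestamp encoding: store the first timestamp,
    the first delta, then delta-of-deltas (0 for fixed intervals)."""
    out = [timestamps[0], timestamps[1] - timestamps[0]]
    prev_delta = out[1]
    for a, b in zip(timestamps[1:], timestamps[2:]):
        delta = b - a
        out.append(delta - prev_delta)
        prev_delta = delta
    return out

def xor_bits(v0: float, v1: float) -> int:
    """XOR the raw IEEE-754 bit patterns of two float64 values.
    Similar values differ in only a few middle bits."""
    b0 = struct.unpack(">Q", struct.pack(">d", v0))[0]
    b1 = struct.unpack(">Q", struct.pack(">d", v1))[0]
    return b0 ^ b1

ts = [1700000000, 1700000010, 1700000020, 1700000030]
print(delta_of_delta(ts))   # [1700000000, 10, 0, 0] — zeros encode to ~1 bit each
print(xor_bits(42.0, 42.0)) # 0 — an unchanged value costs a single control bit
```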

Explaining the Gorilla algorithm — delta-of-delta for timestamps, XOR for floats — is one of those moments in a Datadog or Google interview where you can signal that you've actually read the source material, not just the blog post summary. Getting that explanation fluent out loud takes a few runs. Mockingly.ai has monitoring system design simulations where compression and TSDB internals come up as follow-ups.


The Cardinality Problem

This is the most common deep-dive question in monitoring system design interviews — and the one most candidates don't know how to answer.

Interviewer: What is cardinality in the context of metrics, and why is it dangerous?

Candidate: Cardinality is the number of distinct time series in the system. It's determined by the unique combinations of all label values across all metric names.

Consider a simple metric:

plaintext
http_requests_total{service, endpoint, status_code, region}

If there are 100 services × 500 endpoints × 10 status codes × 5 regions:

plaintext
100 × 500 × 10 × 5 = 2,500,000 unique time series
  — for a single metric.

Now imagine adding a label like user_id or request_id, each with millions of possible values:

plaintext
http_requests_total{..., user_id="u-192038"}
  → millions of unique time series per metric
  → each needs its own in-memory chunk
  → RAM explodes
  → TSDB crashes

This is the cardinality explosion. At Prometheus scale, exceeding ~10 million active time series causes severe performance degradation. The TSDB index (which maps label sets to chunk locations) becomes too large for RAM.

The prevention strategy:

Never use high-cardinality values as labels. User IDs, session IDs, request IDs, and trace IDs do not belong in metric labels. They belong in logs and traces.

Good label design:

plaintext
Good:   http_requests_total{service, status_code, region}
Bad:    http_requests_total{service, status_code, region, user_id}
Bad:    http_requests_total{service, status_code, region, request_id}

The system should enforce label cardinality limits at ingestion time — reject or drop any time series that would push a metric's label combination count above a configured threshold (e.g., 100,000 unique label sets per metric name).
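An ingestion-time limiter along those lines might look like this (a hypothetical sketch; a production version would also need expiry for stale series, persistence, and per-tenant budgets):

```python
from collections import defaultdict

class CardinalityLimiter:
    """Ingestion-time guard: once a metric exceeds its series budget,
    new label combinations are rejected while existing series keep flowing."""
    def __init__(self, max_series_per_metric=100_000):
        self.max = max_series_per_metric
        self.seen = defaultdict(set)   # metric name -> set of label-set keys

    def admit(self, metric: str, labels: dict) -> bool:
        key = tuple(sorted(labels.items()))
        known = self.seen[metric]
        if key in known:
            return True                # existing series: always accepted
        if len(known) >= self.max:
            return False               # would create a new series over budget
        known.add(key)
        return True

limiter = CardinalityLimiter(max_series_per_metric=2)
assert limiter.admit("http_requests_total", {"status": "200"})
assert limiter.admit("http_requests_total", {"status": "500"})
assert not limiter.admit("http_requests_total", {"status": "404"})  # rejected
```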

This is worth proactively raising in an interview even if not asked. It signals you understand the operational reality of running these systems at scale.

That understanding — connecting a data model decision to an infrastructure cost and a production failure mode — is exactly the kind of depth that Mockingly.ai is built to help you practise articulating, with follow-ups that push you beyond the surface answer.


Query Layer

Interviewer: How does a dashboard query return results spanning both warm and cold storage?

Candidate: The Query Layer fans out across all three tiers and merges the results.

plaintext
Dashboard requests: avg(http_requests_total) over past 30 days
 
Query Layer:
  ├─ Query Hot tier (last 48h):   returns 10s resolution data
  ├─ Query Warm tier (48h–15d):   returns 10s resolution data
  └─ Query Cold tier (15d–30d):   returns 5-min resolution data
 
Merge results:
  Stitch time ranges together
  Apply PromQL aggregation (avg(), rate(), sum() etc.)
  Return unified time series to client

The Query Layer knows the resolution available at each tier. For a query spanning 30 days, the UI gets raw resolution for recent data and 5-minute resolution for older data. Most dashboards don't expose this boundary — they simply show a continuous graph.

Query optimisation with recording rules:

Common aggregations run on every dashboard load. Pre-computing them saves latency:

plaintext
Recording rule (runs every 5 minutes):
  Rule: job:http_requests:rate5m =
        sum by (service) (rate(http_requests_total[5m]))
 
Result: a new time series is stored with pre-computed values.
Dashboard queries this derived series instead of raw data.

This converts an expensive fan-out aggregation at query time into a cheap single-series lookup — critical for dashboards with many panels.
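The recording-rule idea reduces to a pre-computed aggregation (a hypothetical sketch; the input is a list of already-computed instantaneous per-series rates, each as a `(labels, value)` pair):

```python
def sum_by(series, key):
    """Collapse many series into one derived series per value of `key`,
    the way a recording rule pre-computes sum by (service) (...)."""
    out = {}
    for labels, value in series:
        k = labels[key]
        out[k] = out.get(k, 0.0) + value
    return out

# Many (service, endpoint) series collapse to one series per service,
# which is what the dashboard queries instead of the raw data.
rates = [
    ({"service": "payments", "endpoint": "/charge"}, 120.0),
    ({"service": "payments", "endpoint": "/refund"}, 30.0),
    ({"service": "search",   "endpoint": "/query"},  900.0),
]
print(sum_by(rates, "service"))  # {'payments': 150.0, 'search': 900.0}
```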


The Alerting Pipeline

Interviewer: Walk me through the full alerting pipeline, from rule definition to a page reaching an engineer.

Candidate: The pipeline has three stages: evaluation, routing, and notification.

Stage 1: Rule Evaluation

The Alerting Engine runs in a continuous loop. Every evaluation interval (typically 1 minute), it executes each alert rule as a query against the metrics store:

plaintext
Alert rule:
  name: HighAPIErrorRate
  expr: rate(http_requests_total{status="5xx"}[5m]) > 0.05
  for:  5m     ← must be true for 5 consecutive minutes before firing
  severity: critical
  annotations:
    summary: "Error rate above 5% for {{ $labels.service }}"

The for clause is critical. Without it, a single anomalous metric sample fires an alert — causing false positives from transient spikes. The for: 5m requirement means the condition must hold for 5 consecutive evaluation cycles. This filters noise without introducing much delay.
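The for semantics can be sketched as a small state machine (a hypothetical class; one evaluate() call per evaluation cycle, so with a 1-minute interval five cycles approximates for: 5m):

```python
class AlertRule:
    """Minimal sketch of the `for:` clause: the condition must hold
    for `for_cycles` consecutive evaluations before the alert fires."""
    def __init__(self, threshold: float, for_cycles: int):
        self.threshold = threshold
        self.for_cycles = for_cycles
        self.pending = 0

    def evaluate(self, value: float) -> str:
        if value > self.threshold:
            self.pending += 1
            return "firing" if self.pending >= self.for_cycles else "pending"
        self.pending = 0               # any healthy sample resets the clock
        return "inactive"

# A two-cycle transient spike never fires; a sustained breach does.
rule = AlertRule(threshold=0.05, for_cycles=5)
states = [rule.evaluate(v)
          for v in [0.08, 0.09, 0.02, 0.08, 0.08, 0.08, 0.08, 0.08]]
print(states)
```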

Stage 2: Alert Manager (Routing and Deduplication)

Fired alerts go to the Alert Manager, which handles three things:

Grouping: multiple alerts with the same labels are grouped into one notification. If 50 hosts in region=us-east-1 all fire HighCPU simultaneously, the engineer receives one notification saying "50 hosts affected" — not 50 separate pages.

Deduplication: the same alert firing from multiple Alerting Engine replicas (for HA) is deduplicated into one notification. Without this, an HA setup would page twice per alert.

Silencing and inhibition: during a known maintenance window, alerts can be silenced. Higher-severity alerts can inhibit lower-severity ones — if DatacenterDown is firing for us-east-1, silence all the downstream HighAPILatency alerts that are caused by the same incident.

plaintext
Alert Manager routing tree:
  match: severity=critical  → PagerDuty (immediate page)
  match: severity=warning   → Slack #alerts-warning (no page)
  match: severity=info      → email digest (daily)
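That routing tree reduces to a first-match lookup (a simplified sketch; real Alertmanager routes are hierarchical and also carry grouping keys and repeat intervals):

```python
# First-match routing table, mirroring the tree above.
ROUTES = [
    ("critical", "pagerduty"),
    ("warning",  "slack:#alerts-warning"),
    ("info",     "email-digest"),
]

def route(alert: dict) -> str:
    """Resolve an alert to a receiver by matching severity in order."""
    for severity, receiver in ROUTES:
        if alert.get("severity") == severity:
            return receiver
    return "default-receiver"          # catch-all for unmatched alerts

print(route({"name": "HighAPIErrorRate", "severity": "critical"}))  # pagerduty
```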

Stage 3: Notification

The Alert Manager sends to notification integrations: PagerDuty, OpsGenie, Slack, email. It retries on delivery failure. It tracks which notifications have been sent to avoid duplicate pages during repeated evaluation cycles.

The alerting pipeline — especially the interaction between the for clause, grouping, deduplication, and inhibition — is where Datadog interviewers specifically probe. Getting the chain from "alert condition is true" to "one page reaches an engineer" explained in the right order, with the reasons for each stage, is what a complete answer looks like. Mockingly.ai runs observability simulations where the alerting pipeline is a standard deep-dive.


Alert Fatigue and the On-Call Experience

Interviewer: What's alert fatigue and how does your design address it?

Candidate: Alert fatigue is what happens when an on-call engineer receives so many alerts that they start ignoring them.

It's not a monitoring problem — it's an engineering culture problem that the monitoring system can make worse or better.

A monitoring system that fires an alert on every 1-minute anomaly, before the for clause proves it's sustained, trains engineers to dismiss alerts without investigating. Eventually a real outage hides in that noise, and nobody looks until customers complain.

Design choices that reduce fatigue:

Require a sustained condition with for. A spike that lasts 30 seconds is noise. A spike that lasts 5 minutes is a problem.

Group related alerts. 50 host alerts → 1 notification. This is the Alert Manager's grouping function. An engineer who receives 50 pages for the same event stops trusting the system.

Dead man's switch alerts. Alert if the monitoring system itself stops receiving metrics from a host. This catches cases where a host's metrics agent crashed and the system is silently blind. A "watchdog" metric (a heartbeat that always fires) is checked continuously — if it stops firing, that absence is itself an alert.

Symptom-based alerts over cause-based alerts. Alert on "users cannot complete checkout" (measured by error rate on the checkout endpoint) rather than "CPU is above 80%". High CPU rarely directly impacts users. A high error rate always does. Cause-based alerts are useful for diagnostics, not for waking engineers.


Multi-Tenancy

Interviewer: How do you isolate metrics between teams without running separate infrastructure per team?

Candidate: Every metric sample gets a tenant_id prepended to its label set at ingestion time.

The ingestion API authenticates each client and stamps its tenant identifier onto every sample before it reaches the storage layer. Tenants cannot forge each other's IDs — authentication happens before data touches the pipeline.

At the storage layer, metrics are physically partitioned by tenant. In Kafka, partitions are assigned by hash(tenant_id, metric_name). In the TSDB, each tenant's data lives in a separate keyspace or namespace. Query Layer requests are scoped to a tenant — a query for tenant A never touches tenant B's data.

This is how Cortex (now Grafana Mimir) and VictoriaMetrics implement multi-tenancy. The X-Scope-OrgID header in Cortex carries the tenant identifier and is validated at every API boundary.
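The ingestion-side stamping is small but load-bearing (a sketch; `authenticated_tenant` would come from whatever auth layer fronts the ingestion API):

```python
def stamp_tenant(sample: dict, authenticated_tenant: str) -> dict:
    """Stamp the authenticated tenant onto a sample at the ingestion
    boundary. Any client-supplied tenant label is overwritten, so a
    tenant cannot forge another tenant's identity."""
    labels = dict(sample["labels"])            # copy: never mutate the input
    labels["tenant_id"] = authenticated_tenant
    return {**sample, "labels": labels}

# A client claiming to be team-b still gets its authenticated identity.
forged = {"name": "cpu_usage",
          "labels": {"host": "h1", "tenant_id": "team-b"},
          "value": 0.42}
stamped = stamp_tenant(forged, "team-a")
print(stamped["labels"]["tenant_id"])  # team-a
```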


Common Interview Follow-ups

"How do you handle a metrics spike where ingestion suddenly doubles?"

The Kafka buffer absorbs the spike without back-pressure reaching the collectors. The queue depth increases temporarily. Storage writers drain at maximum throughput. If the spike is sustained, the queue depth alert fires and triggers auto-scaling of the storage writer tier. The key is that collectors and storage are decoupled — neither is aware of the other's capacity.

"Prometheus scrapes are pull-based. What if a service goes down — does the scrape failure show up as missing data or as zeros?"

Missing data — which is more dangerous than zeros, because many queries treat missing data as zero without the caller realising. Prometheus explicitly tracks scrape health as a separate metric (up{job, instance}). An alert on up == 0 detects a failed scrape target immediately. The distinction between "metric is zero" and "metric is missing" is important and worth naming in an interview.

"How do you handle out-of-order samples?"

TSDBs receive samples roughly in timestamp order but not perfectly — network jitter and batching cause occasional out-of-order arrivals. Prometheus's TSDB supports a configurable out-of-order window (disabled by default; teams commonly enable a window on the order of an hour). Samples older than the window are rejected. VictoriaMetrics is more permissive, accepting out-of-order samples within the full retention period. For most use cases, a modest window is sufficient.

"How would you design metric anomaly detection — alerting when behaviour deviates from baseline rather than crossing a fixed threshold?"

Fixed thresholds break on metrics with natural periodicity. CPU at 80% on a Sunday might be abnormal; on a Monday morning it might be expected. Anomaly detection compares the current value against a baseline derived from historical data at the same time of day and day of week.

The implementation: a recording rule computes a rolling baseline (e.g., median of the same metric at the same hour for the past 4 weeks). The alert rule fires when the current value deviates from the baseline by more than N standard deviations. This is statistically simple and effective for regular, predictable metrics. For irregular metrics, a proper ML model (SARIMA, Prophet) can be trained offline and its predictions served as a metrics stream that alert rules compare against.
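The statistically simple version can be sketched directly (a hypothetical helper using a median baseline and sample standard deviation over same-hour history from past weeks):

```python
import statistics

def is_anomalous(current: float, history: list, n_sigma: float = 3.0) -> bool:
    """Compare the current value with a baseline built from the same
    hour-of-week in past weeks. Fires when the deviation exceeds
    n_sigma standard deviations of that history."""
    baseline = statistics.median(history)
    sigma = statistics.stdev(history)          # needs >= 2 history points
    return abs(current - baseline) > n_sigma * sigma

# CPU% at the same hour on the past four Mondays, vs. the value now.
history = [41.0, 43.5, 40.2, 42.8]
print(is_anomalous(42.0, history))  # False: within the normal band
print(is_anomalous(95.0, history))  # True: far outside the baseline
```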


Quick Interview Checklist

  • ✅ Clarified scope — infra + app metrics, 13-month retention, multi-tenant, dashboards in scope
  • ✅ Back-of-the-envelope — 100M samples/sec, 5 PB raw vs 100 TB tiered — made the case for downsampling
  • ✅ Four metric types — counter, gauge, histogram, summary — and why they matter
  • ✅ Pull vs push trade-offs — pull for long-running infra, push for ephemeral workloads
  • ✅ Kafka buffer — decouples ingestion from storage, absorbs spikes, prevents metric loss
  • ✅ Hot/warm/cold tiering — in-memory, SSD TSDB, object storage with progressive downsampling
  • ✅ Downsampling stores min/max/sum/count, not avg — preserves spike detection at coarse resolution
  • ✅ Gorilla compression — delta-of-delta for timestamps, XOR for values, ~1.5 bytes/sample
  • ✅ Cardinality explosion — defined precisely, explained the failure mode, named the prevention strategy
  • ✅ No high-cardinality values (user_id, request_id) as labels — raised proactively
  • ✅ Query fan-out across tiers — stitches hot/warm/cold into unified result
  • ✅ Recording rules — pre-compute expensive aggregations, reduce dashboard query latency
  • ✅ Alert rule for clause — sustained condition prevents false positives from transient spikes
  • ✅ Alert Manager — grouping, deduplication, silencing, inhibition, routing tree
  • ✅ Alert fatigue — for clause, symptom-based alerting, dead man's switch
  • ✅ Multi-tenancy — tenant_id stamped at ingestion, storage partitioned by tenant, query scoped at API layer

Conclusion

A metrics monitoring system looks like a data pipeline problem. It is — but the hard parts are not the pipeline mechanics.

They're the cardinality problem, which can bring a TSDB to its knees with a single bad label. The downsampling strategy, which is the only thing keeping storage costs from growing unboundedly. The for clause on alert rules, which is the difference between a useful alerting system and one that trains engineers to ignore it.

The candidates who do well at Datadog, Google, and Amazon are the ones who raise cardinality without being asked, who explain downsampling with the min/max/sum/count detail rather than just saying "aggregate," and who connect the alerting design to the human experience of being on-call at 3 AM.

The design pillars:

  1. Time-series data model — name + label set defines a unique series; labels drive cardinality; metric type determines valid operations
  2. Kafka as the ingestion buffer — decouples producers from storage; absorbs spikes; prevents metric loss during storage stress
  3. Hot/warm/cold tiering — raw resolution for recent data, progressively coarser for older data; makes 13-month retention economically viable
  4. Gorilla compression — delta-of-delta timestamps + XOR float values; 8× compression; the reason TSDB storage costs are manageable
  5. Cardinality limits at ingestion — reject high-cardinality label combinations before they reach storage; name this proactively
  6. Alert for clause + grouping — sustained conditions prevent false positives; grouping prevents alert storms from waking engineers 50 times for one event
  7. Symptom-based alerting — alert on what users experience, not on what machines report

Frequently Asked Questions

What is a time-series database and why do metrics systems use one?

A time-series database (TSDB) is a database engineered around one access pattern: data arrives in timestamp order, is queried by time range, and is never updated. General-purpose databases handle this badly at scale.

Why PostgreSQL fails for metrics at 100M samples/second:

  1. Every insert updates a B-tree index — at high throughput this creates severe write amplification
  2. Row-oriented storage means a query for "CPU values over the last hour" reads every column in every matching row, even though only timestamp and value matter
  3. The same metric name is stored as a string in every single row

How a TSDB fixes each problem:

  1. LSM-tree (log-structured merge tree) — writes are sequential appends, not random B-tree inserts. Orders of magnitude faster under sustained write load
  2. Column-oriented storage — all values for one metric are stored contiguously. A time-range query reads one block from disk, not millions of scattered rows
  3. Gorilla compression — delta-of-delta encoding for timestamps, XOR encoding for float values. ~1.5 bytes per sample instead of 12 bytes raw. The 8× compression is only possible because temporal ordering is guaranteed

How does Gorilla compression work in a time-series database?

Gorilla compression uses two encoding techniques — delta-of-delta for timestamps and XOR for float values — to achieve ~1.5 bytes per sample, down from 12 bytes raw.

Timestamp compression (delta-of-delta):

plaintext
Metrics arrive at fixed intervals — e.g. every 10 seconds.
 
t0 = 1700000000 (stored as full value)
t1 = delta = 10  (difference from t0)
t2 = delta-of-delta = 0  (delta didn't change)
t3 = delta-of-delta = 0  (same interval again)

A fixed-interval series encodes to ~1.5 bits per timestamp — the delta-of-delta is always 0.

Value compression (XOR):

plaintext
v0 = 42.001  (stored as full float64)
v1 = 42.003  → XOR with v0 = only the differing bits
              → leading and trailing zeros stripped
              → ~1.37 bytes average for slowly-changing values

The compression ratio degrades for chaotic, unpredictable metrics but holds well for the slowly-changing values that dominate infrastructure monitoring (CPU %, memory, request rates).

Facebook published the Gorilla algorithm in a 2015 paper. Prometheus's TSDB and VictoriaMetrics both use variants of it — VictoriaMetrics claims 0.4–1 byte per sample through additional optimisations.
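The two encodings can be sketched in a few lines of Python. This is a simplified illustration, not Gorilla's real on-disk format — it emits whole integers instead of the variable-length bit packing that produces the actual compression:

```python
import struct

def delta_of_delta(timestamps):
    """Encode timestamps as [first, first delta, delta-of-deltas...]."""
    out = [timestamps[0], timestamps[1] - timestamps[0]]
    prev_delta = out[1]
    for prev, cur in zip(timestamps[1:], timestamps[2:]):
        delta = cur - prev
        out.append(delta - prev_delta)  # 0 for a fixed-interval series
        prev_delta = delta
    return out

def xor_values(values):
    """XOR each float64's bit pattern with its predecessor's; near-identical
    values yield mostly-zero words, which bit-pack down to almost nothing."""
    bits = [struct.unpack('>Q', struct.pack('>d', v))[0] for v in values]
    return [bits[0]] + [b ^ p for p, b in zip(bits, bits[1:])]

ts = [1700000000, 1700000010, 1700000020, 1700000030]
print(delta_of_delta(ts))              # [1700000000, 10, 0, 0]
print(xor_values([42.001, 42.001])[1]) # 0 — identical values XOR to zero
```

The runs of zeros in both outputs are exactly what the real encoder spends one bit on — which is where the sub-2-bytes-per-sample figure comes from.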


What is cardinality in metrics monitoring and why is it dangerous?

Cardinality is the number of distinct time series in the monitoring system. Each unique combination of metric name and label values is one series. Adding a label with unbounded values multiplies the series count by millions — crashing the TSDB.

How cardinality explodes:

plaintext
http_requests_total{service, endpoint, status_code, region}
 
100 services × 500 endpoints × 10 status codes × 5 regions
= 2,500,000 unique series  →  manageable
 
Add label: user_id="u_192038"
× 1,000,000 users = 2,500,000,000,000 unique series  →  TSDB crash

Why it crashes the system:

  1. Prometheus stores its series index in RAM — one entry per active series
  2. At 10M+ active series, the index exhausts available memory
  3. Query performance degrades before RAM is fully exhausted
  4. A single developer adding user_id as a label can trigger this overnight

Prevention strategy:

  1. Block high-cardinality labels at ingestion — drop user_id, request_id, session_id, trace_id before they reach the TSDB
  2. Enforce per-metric cardinality limits — reject any series submission that would push a metric above 100,000 unique label combinations
  3. Alert on cardinality growth — a sudden spike in active series count is an instrumentation bug, not a business event
  4. High-cardinality data belongs in logs and traces — not in metrics
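The first two prevention steps can be combined into one ingestion-side guard. A minimal sketch — the deny-list, the 100,000-series budget, and the `admit` function are illustrative choices, not a real TSDB's API:

```python
from collections import defaultdict

# Illustrative deny-list of labels that must never reach the TSDB
BLOCKED_LABELS = {"user_id", "request_id", "session_id", "trace_id"}
MAX_SERIES_PER_METRIC = 100_000  # per-metric cardinality budget

seen_series = defaultdict(set)   # metric name -> set of label signatures

def admit(metric, labels):
    """Return sanitized labels, or None if the sample must be rejected."""
    clean = {k: v for k, v in labels.items() if k not in BLOCKED_LABELS}
    sig = tuple(sorted(clean.items()))
    known = seen_series[metric]
    if sig not in known and len(known) >= MAX_SERIES_PER_METRIC:
        return None  # a brand-new series would blow the budget: reject
    known.add(sig)
    return clean

labels = admit("http_requests_total",
               {"service": "checkout", "user_id": "u_192038"})
print(labels)  # {'service': 'checkout'} — user_id stripped at ingestion
```

Stripping before the signature is computed matters: otherwise every `user_id` value would still mint a new series and the budget check would reject legitimate samples.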

What is hot/warm/cold storage tiering for metrics?

Tiered storage keeps recent high-resolution data on fast, expensive hardware and progressively downsamples older data to cheaper storage. Without it, 13 months of raw 10-second metrics at 100M samples/second would cost ~$100M/year in storage alone.

| Tier | Age | Resolution | Storage | Query latency |
|------|-----|------------|---------|---------------|
| Hot | Last 24–48 hours | Full (10s) | In-memory TSDB | Sub-millisecond |
| Warm | 2–15 days | Full (10s) | Local SSD TSDB | Milliseconds |
| Cold | 15 days – 13 months | 5-min, then 1-hour | Object storage (S3/GCS) | Seconds |

The cost impact:

  • Raw 13-month storage: ~5 PB → ~$100M/year
  • Tiered with downsampling: ~100 TB → ~$2M/year

The Query Layer fans out across all three tiers transparently — a dashboard query spanning 30 days stitches hot, warm, and cold results into one continuous graph.
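That fan-out decision can be sketched as a pure function over the tier boundaries in the table above (the 48-hour and 15-day edges are taken from the table; the function name is illustrative):

```python
from datetime import datetime, timedelta

def tiers_for_range(start, end, now):
    """Decide which storage tiers a query's time range must touch."""
    hot_edge = now - timedelta(hours=48)
    warm_edge = now - timedelta(days=15)
    tiers = []
    if end > hot_edge:
        tiers.append("hot")
    if start < hot_edge and end > warm_edge:
        tiers.append("warm")
    if start < warm_edge:
        tiers.append("cold")
    return tiers

now = datetime(2024, 1, 31)
print(tiers_for_range(now - timedelta(days=30), now, now))
# ['hot', 'warm', 'cold'] — a 30-day dashboard query touches all three
```

The query layer then runs one sub-query per tier and stitches the results by timestamp before returning them to the dashboard.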


Why store min, max, sum, and count when downsampling — not just the average?

Storing only the average is lossy — it destroys the spike information that is most valuable for incident investigation. Storing sum and count also enables correct re-aggregation across windows.

Why average alone breaks:

plaintext
Window 1 (10 samples): avg = 80%, sum = 800, count = 10
Window 2 (90 samples): avg = 20%, sum = 1800, count = 90
 
Correct combined average: (800 + 1800) / (10 + 90) = 26%
Naive average-of-averages: (80 + 20) / 2 = 50%  ← wrong

What each aggregate preserves:

  1. sum + count — enables correct averages across any combination of downsampled windows
  2. min — preserves the lowest value seen in the window; catches brief dips that an average would hide
  3. max — preserves spikes; a CPU spike that lasted 30 seconds in an hour window survives the 1-hour downsample as the max value

A monitoring system that stores only avg will miss brief spikes in downsampled data — exactly when you're looking at a historical incident and trying to understand what happened.
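The arithmetic above translates directly into a merge that is closed over windows — two downsampled windows combine into another valid window with no information loss. A minimal sketch:

```python
def downsample(samples):
    """Collapse one window of raw samples into the four aggregates."""
    return {"min": min(samples), "max": max(samples),
            "sum": sum(samples), "count": len(samples)}

def merge(a, b):
    """Combine two downsampled windows without losing information."""
    return {"min": min(a["min"], b["min"]),
            "max": max(a["max"], b["max"]),
            "sum": a["sum"] + b["sum"],
            "count": a["count"] + b["count"]}

w1 = downsample([80.0] * 10)   # avg 80%, 10 samples
w2 = downsample([20.0] * 90)   # avg 20%, 90 samples
combined = merge(w1, w2)
print(combined["sum"] / combined["count"])  # 26.0 — not the naive 50.0
print(combined["max"])                      # 80.0 — the spike survives
```

Note there is no `avg` field at all: the average is always derived as `sum / count` at query time, which is what makes re-aggregation across arbitrary window combinations correct.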


How does the alerting pipeline work — from rule to page?

The alerting pipeline has three stages: rule evaluation, Alert Manager routing, and notification delivery.

Stage 1: Rule Evaluation

The Alerting Engine queries the metrics store on a schedule (every 30–60 seconds) and evaluates each rule:

plaintext
Alert rule:
  expr: rate(http_requests_total{status="5xx"}[5m]) > 0.05
  for:  5m     ← condition must hold for 5 consecutive minutes
  severity: critical

The for clause is critical — it requires the condition to be continuously true for the specified window before the alert transitions from PENDING to FIRING. Without it, every transient spike generates a page.
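The OK → PENDING → FIRING transition is a small state machine. A simplified sketch — it takes the breach result as a boolean and ignores expression evaluation, and the 60-second evaluation cadence is an assumption:

```python
OK, PENDING, FIRING = "OK", "PENDING", "FIRING"
FOR_SECONDS = 300  # the rule's `for: 5m`

def step(state, pending_since, breaching, now):
    """Advance one evaluation cycle of a single alert rule."""
    if not breaching:
        return OK, None                      # any healthy evaluation resets
    if state == OK:
        return PENDING, now                  # start the `for` clock
    if state == PENDING and now - pending_since >= FOR_SECONDS:
        return FIRING, pending_since         # held long enough: page
    return state, pending_since

state, since = OK, None
for t in range(0, 361, 60):                  # evaluate every 60s
    state, since = step(state, since, breaching=True, now=t)
print(state)  # FIRING — the condition held for the full 5 minutes
```

A single healthy evaluation anywhere in that window would have reset the state to OK and restarted the clock — which is exactly how the `for` clause filters out transient spikes.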

Stage 2: Alert Manager

Four operations happen before a notification is sent:

  1. Grouping — 50 hosts in region=us-east-1 all firing HighCPU → one notification, not 50 pages
  2. Deduplication — the same alert firing from multiple Alerting Engine replicas (HA setup) → deduplicated to one notification
  3. Silencing — during maintenance windows, matching alerts are suppressed
  4. Inhibition — DatacenterDown firing for us-east-1 → all downstream HighAPILatency alerts for that region are suppressed

Stage 3: Notification

The Alert Manager routes based on severity:

plaintext
severity=critical  → PagerDuty (immediate phone/SMS)
severity=warning   → Slack #alerts-warning
severity=info      → email digest

It retries on delivery failure and tracks sent notifications to prevent duplicate pages across evaluation cycles.
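Grouping and deduplication (stage 2, operations 1 and 2) reduce to choosing a group key and collapsing identical alerts into a set. A minimal sketch — the `("alertname", "region")` group key is an illustrative choice:

```python
from collections import defaultdict

GROUP_BY = ("alertname", "region")  # illustrative grouping labels

def group_alerts(firing):
    """Collapse firing alerts into one notification per group key,
    deduplicating identical alerts sent by HA engine replicas."""
    groups = defaultdict(set)
    for alert in firing:
        key = tuple(alert[k] for k in GROUP_BY)
        # set membership dedups byte-identical alerts from replicas
        groups[key].add(tuple(sorted(alert.items())))
    return groups

firing = [{"alertname": "HighCPU", "region": "us-east-1", "host": f"h{i}"}
          for i in range(50)]
firing += firing[:10]  # the same alerts again, from a second HA replica
groups = group_alerts(firing)
print(len(groups))                       # 1 — one page, not 50
print(len(next(iter(groups.values()))))  # 50 — replica copies collapsed
```

One notification then lists all 50 member alerts, so the on-call engineer sees the blast radius in a single page instead of a pager storm.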


What is alert fatigue and how do you prevent it?

Alert fatigue is the desensitisation on-call engineers develop when they receive too many alerts — causing them to acknowledge or ignore pages without investigating. It erodes trust in the monitoring system and creates real incident risk.

The four design decisions that prevent it:

  1. for clause on every alert rule — a spike lasting 30 seconds is noise; a sustained condition for 5 minutes is a problem worth investigating. This single change eliminates the majority of false positives
  2. Grouping related alerts — 50 hosts all breaching CPU threshold simultaneously get grouped into one notification. An engineer who gets paged 50 times for one deployment gone wrong stops trusting the system
  3. Symptom-based alerting over cause-based alerting — alert on "checkout error rate > 1%" (what users experience), not "CPU > 80%" (what machines report). High CPU rarely directly impacts users; a high error rate always does
  4. Dead man's switch alerts — a special alert that fires when a monitoring target stops sending metrics. Without this, a crashed metrics agent creates a silent blind spot during an incident — the worst possible time to lose visibility
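The dead man's switch in point 4 amounts to a check over last-received timestamps. A simplified sketch — the 120-second timeout and the function name are arbitrary assumptions:

```python
import time

def silent_targets(last_seen, timeout=120, now=None):
    """Return targets that have stopped sending metrics.
    `last_seen` maps target name -> unix time of its last sample."""
    now = time.time() if now is None else now
    return sorted(t for t, ts in last_seen.items() if now - ts > timeout)

last_seen = {"host-a": 1000.0, "host-b": 880.0, "host-c": 999.0}
print(silent_targets(last_seen, timeout=120, now=1010.0))
# ['host-b'] — host-b went quiet; page before the blind spot bites
```

The inversion is the point: this alert fires on the *absence* of data, which no threshold rule over incoming samples can ever detect.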

What is the difference between symptom-based and cause-based alerting?

Symptom-based alerting fires on user-visible impact — things users experience. Cause-based alerting fires on infrastructure metrics — things machines report. The two serve different purposes, and conflating them is the primary source of alert fatigue.

| | Symptom-Based | Cause-Based |
|---|---|---|
| Alert on | Checkout error rate, API latency p99 | CPU > 80%, memory > 90% |
| Direct user impact | Always — this is the user experience | Rarely — machines can be stressed without user impact |
| False positive rate | Low | High |
| Wake an engineer at 3 AM? | Yes | No — use for dashboards and daytime investigation |
| Example | rate(http_5xx[5m]) > 0.01 | cpu_usage > 0.8 |

The rule of thumb: only page engineers when users are being affected right now, or will be affected within minutes. CPU at 80% may resolve itself or may not matter at all. A checkout error rate above 1% is a customer-facing crisis.

Cause-based metrics are invaluable for dashboards and post-incident analysis — but they should alert to Slack, not PagerDuty.


Which companies ask the metrics monitoring and alerting system design question?

Datadog, Google, Amazon, Meta, Netflix, Cloudflare, and Uber ask variants of this question for senior software engineer and infrastructure roles.

Why it is a consistently probing interview question:

  1. Multiple hard sub-problems in one — cardinality, tiering, downsampling, and alerting correctness are each individually non-trivial
  2. Reveals operational experience — candidates who've actually run monitoring systems at scale know the cardinality problem, the for duration requirement, and dead man's switches from direct experience
  3. Distinguishes depth from breadth — a mid-level answer names Prometheus and Grafana; a senior answer explains Gorilla compression, why sum/count beats avg for downsampling, and how the Alert Manager prevents duplicate pages from HA replicas

What interviewers specifically listen for:

  1. Cardinality raised proactively — before the interviewer asks, naming it as the silent killer of TSDB systems
  2. min/max/sum/count in downsampling — the specific reason, not just "we aggregate old data"
  3. for clause named and explained — connecting it to false positive reduction and on-call trust
  4. Symptom-based over cause-based — a concrete example distinguishing the two
  5. Gorilla compression explained mechanically — delta-of-delta and XOR, not just "it compresses well"

If any of those five feel unclear when you imagine explaining them live, Mockingly.ai runs monitoring and observability system design simulations — with follow-ups on exactly these points — built for engineers targeting roles at Datadog, Google, Amazon, and Netflix.


The monitoring system design interview is one where the operational depth matters as much as the architecture. Cardinality explosion, alert fatigue, downsampling with aggregates rather than averages — these are the insights that come from having thought carefully about how these systems behave at 3 AM when something is on fire. Practising that depth out loud, with an interviewer who pushes back on your choices, is what Mockingly.ai is built for — with simulations designed for senior roles at Datadog, Google, Amazon, and Netflix.
