Design DoorDash

Hard · 22 min · Backend System Design
Key Topics
doordash, food delivery, geohashing, real time, kafka, websockets, distributed systems, logistics

How to Design DoorDash

DoorDash interviews specifically test whether you can think about physical logistics alongside software architecture — and that combination is what makes this question genuinely different from most system design problems.

Most systems just move bytes. DoorDash moves food. That creates constraints you don't see elsewhere: an order has a preparation window — dispatch a dasher too early and they wait at the restaurant while food gets cold on the counter; dispatch too late and the food sits. Delivery windows are tight. A dasher who accepts an order and then goes offline is a customer service crisis. And unlike most distributed systems where eventual consistency is a safe default, a stale "Available" status on a dasher or a missed order assignment can directly cost a customer their dinner.

DoorDash interviewers focus on your ability to navigate trade-offs between reliability, performance, and cost, and on how well you model time-sensitive state machines. Payment processing requires strong consistency and idempotent operations, while dasher location updates often tolerate eventual consistency. That single contrast covers most of what a strong candidate needs to demonstrate.

This guide covers the full design — with the conversational back-and-forth of a real interview.


Step 1: Clarify the Scope

Interviewer: Design DoorDash.

Candidate: Before I start — a few clarifying questions. DoorDash is a three-sided marketplace, so I want to confirm who we're designing for: the customer placing the order, the merchant receiving it, and the dasher delivering it — or just one of these flows? Is the dispatch system in scope, or just order placement and tracking? Do we need surge/dynamic pricing? And are we designing for the current scale — tens of millions of daily orders — or a starting point that needs to grow there?

Interviewer: Design the full system: customer ordering, merchant order management, dasher dispatch, and real-time tracking. Dynamic pricing is a good deep-dive if we have time. Assume current DoorDash scale.

Candidate: Perfect. The dispatch problem is where this question really gets interesting, so I'll make sure we get there. Let me start with requirements and numbers, then walk through the architecture with each flow.


Requirements

Functional

  • Customers can browse restaurant menus, add items to a cart, and place orders
  • Merchants receive orders in real time and update order preparation status
  • The system dispatches an available dasher to pick up and deliver each order
  • Customers and merchants can track the dasher's real-time location during delivery
  • Both customers and dashers receive push notifications at each stage
  • Dynamic delivery fees based on real-time supply/demand in a zone

Non-Functional

  • Low dispatch latency — a dasher must be assigned within seconds of an order being placed
  • Location update throughput — millions of dasher location pings per second at peak
  • Strong consistency for assignments — an order must never be assigned to two dashers simultaneously
  • Fault tolerance — if a dasher goes offline mid-delivery, the system must detect and respond
  • Geo-partitioned — dispatch must be scoped to local regions; a New York dispatcher should not evaluate dashers in Los Angeles
  • p95 latency under 300ms for all customer-facing APIs

Back-of-the-Envelope Estimates

Interviewer: Walk me through the numbers.

Candidate: Let me work through a realistic scale estimate.

plaintext
Daily orders:                 ~10 million (DoorDash's reported scale)
Peak hour multiplier:         4× (dinner rush, 6–9 PM local time)
 
Order QPS (average):          10M / 86,400s ≈ 116 orders/sec
Order QPS (peak):             ~460 orders/sec
 
Active dashers globally:      ~1 million at peak dinner rush
Dasher location update interval: every 5 seconds
 
Location update throughput:
  1M dashers / 5s = 200,000 location updates/sec
 
Customer tracking requests:
  ~1M concurrent deliveries × 1 refresh/5s = 200,000 reads/sec (cached)
 
Restaurant menu reads:
  ~50M DAU × 5 menu views/day / 86,400s ≈ 2,900 reads/sec
  Peak: ~12,000 reads/sec
 
Storage:
  Order record: ~2 KB
  10M orders/day × 2 KB = 20 GB/day of order data
  Location ping: ~50 bytes
  200K × 50 bytes = ~10 MB/sec of location data (store only latest)

The numbers surface two design decisions immediately. 200,000 location updates per second is a high-throughput write stream — it needs a dedicated pipeline, not the same database handling orders. And the orders-per-second is actually modest (460 at peak), which means the complexity in this system isn't write throughput — it's the correctness of dispatch and the real-time nature of tracking.
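These estimates can be sanity-checked in a few lines. The constants below mirror the assumed figures above, not measured data:

```python
# Back-of-the-envelope check for the estimates above.
DAILY_ORDERS = 10_000_000
SECONDS_PER_DAY = 86_400
PEAK_MULTIPLIER = 4          # dinner rush

avg_order_qps = DAILY_ORDERS / SECONDS_PER_DAY            # ~116 orders/sec
peak_order_qps = avg_order_qps * PEAK_MULTIPLIER          # ~460 orders/sec

ACTIVE_DASHERS = 1_000_000
PING_INTERVAL_S = 5
location_updates_per_sec = ACTIVE_DASHERS / PING_INTERVAL_S   # 200,000/sec

ORDER_RECORD_KB = 2
order_storage_gb_per_day = DAILY_ORDERS * ORDER_RECORD_KB / 1_000_000  # 20 GB/day
```

The gap between ~460 order writes/sec and 200,000 location writes/sec is exactly why the location path gets its own pipeline.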


High-Level Architecture

plaintext
                       ┌──────────────────────────────────────┐
                       │          API Gateway                  │
                       │  (auth, rate limiting, routing)        │
                       └────────────┬─────────────────────────┘

     ┌──────────────────────────────┼─────────────────────────────┐
     │                              │                             │
┌────▼───────┐              ┌───────▼──────────┐           ┌──────▼──────┐
│  Customer  │              │  Merchant        │           │  Dasher     │
│  Service   │              │  Service         │           │  Service    │
│  (browse,  │              │  (receive orders,│           │  (location, │
│   order)   │              │   update status) │           │   accept,   │
└────┬───────┘              └───────┬──────────┘           │   deliver)  │
     │                              │                       └──────┬──────┘
     └──────────────────────────────┼───────────────────────────────┘

                       ┌────────────▼──────────────┐
                       │        Event Bus            │
                       │         (Kafka)             │
                       │  order.placed               │
                       │  order.ready_for_pickup     │
                       │  dasher.location_updated    │
                       │  order.assigned             │
                       │  order.picked_up            │
                       │  order.delivered            │
                       └──────┬─────────────────────┘

     ┌────────────────────────┼────────────────────────────┐
     │                        │                            │
┌────▼──────────┐    ┌────────▼──────────┐     ┌──────────▼──────────┐
│  Order        │    │  Dispatch         │     │  Location           │
│  Service      │    │  Service          │     │  Service            │
│  (state       │    │  (geo-partitioned │     │  (dasher location   │
│   machine,    │    │   matching,       │     │   write path,       │
│   persistence)│    │   assignment)     │     │   read cache)       │
└────┬──────────┘    └────────┬──────────┘     └──────────┬──────────┘
     │                        │                            │
┌────▼──────────┐    ┌────────▼──────────┐     ┌──────────▼──────────┐
│  PostgreSQL   │    │  Redis (dasher     │     │  Redis (latest      │
│  (orders,     │    │  pool per geo-     │     │  location per       │
│  assignments) │    │  hash cell)        │     │  dasher)            │
└───────────────┘    └───────────────────┘     └──────────────────────┘

The Order State Machine

Before designing any service, model the order lifecycle explicitly. This is the skeleton that every other component hangs on, and interviewers expect to see it.

Interviewer: What are the states an order moves through?

Candidate: Here's the complete state machine:

plaintext
CART
  │  customer confirms and pays

PLACED
  │  merchant acknowledges

MERCHANT_ACCEPTED
  │  dispatch finds and assigns a dasher

DASHER_ASSIGNED
  │  dasher drives to restaurant

DASHER_AT_MERCHANT
  │  merchant marks order ready; dasher picks up

PICKED_UP
  │  dasher drives to customer

DELIVERED
  │  customer rates experience

COMPLETED

And the failure/exception states that can branch off at any point:

plaintext
PLACED            → CANCELLED   (customer cancels before merchant accepts,
                                 or merchant rejects — out of stock, closed)
MERCHANT_ACCEPTED → CANCELLED   (merchant cancels after accepting)
DASHER_ASSIGNED   → REASSIGNING (dasher goes offline — covered later)
Any state         → FAILED      (unrecoverable system error)

The Order Service persists every state transition to PostgreSQL and publishes an event to Kafka on each transition. Downstream services — Dispatch, Notification, Location — react to those events. This event-driven design means the Order Service doesn't need to know about dispatch logic or push notifications. It just manages state.
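The transition rules above can be captured as a small table plus a guard function. A minimal sketch (the states mirror the machine above; the code itself is illustrative, not DoorDash's):

```python
# Allowed order state transitions, mirroring the state machine above.
TRANSITIONS = {
    "CART": {"PLACED"},
    "PLACED": {"MERCHANT_ACCEPTED", "CANCELLED"},
    "MERCHANT_ACCEPTED": {"DASHER_ASSIGNED", "CANCELLED"},
    "DASHER_ASSIGNED": {"DASHER_AT_MERCHANT", "REASSIGNING"},
    "REASSIGNING": {"DASHER_ASSIGNED"},
    "DASHER_AT_MERCHANT": {"PICKED_UP"},
    "PICKED_UP": {"DELIVERED"},
    "DELIVERED": {"COMPLETED"},
}

def transition(current: str, nxt: str) -> str:
    """Validate a transition; any state may fail unrecoverably."""
    if nxt == "FAILED" or nxt in TRANSITIONS.get(current, set()):
        return nxt      # here the service would persist and publish the event
    raise ValueError(f"illegal transition {current} -> {nxt}")
```

Rejecting illegal transitions at this layer is what keeps a delayed or duplicated Kafka event from corrupting an order's lifecycle.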


Service 1: Order Service

What it does: accepts order creation requests, manages the state machine, persists order data, and publishes state-change events.

Interviewer: Walk me through what happens when a customer places an order.

Candidate: The customer submits their cart. The Order Service:

plaintext
1. Validates the cart (items still available, prices haven't changed)
2. Charges the customer via the Payment Service
   → Payment carries an idempotency_key = client-generated UUID
   → Prevents double-charge if the network times out and client retries
3. Creates an order record in PostgreSQL: status = PLACED
4. Publishes order.placed to Kafka
5. Returns order_id and estimated delivery time to the customer

The merchant's service is subscribed to order.placed. It receives the order and pushes it to the merchant's tablet app in real time via WebSocket. When the merchant acknowledges, they call the Order Service, which transitions to MERCHANT_ACCEPTED and publishes that event.
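The idempotency-key step from the flow above can be sketched as follows. An in-memory dict stands in for the Payment Service's durable key store, and the card charge itself is stubbed out:

```python
import uuid

# idempotency_key -> charge_id; stand-in for a durable store in the Payment Service
_processed: dict[str, str] = {}

def charge(idempotency_key: str, customer_id: str, amount_cents: int) -> str:
    """Charge at most once per key; a retry returns the original charge."""
    if idempotency_key in _processed:
        return _processed[idempotency_key]   # retry after timeout: no double-charge
    charge_id = str(uuid.uuid4())            # pretend the card was charged here
    _processed[idempotency_key] = charge_id
    return charge_id

key = str(uuid.uuid4())                      # client-generated, reused on retry
first = charge(key, "cust-1", 2350)
retry = charge(key, "cust-1", 2350)          # network timed out; client retried
assert first == retry
```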

Interviewer: Why publish to Kafka rather than calling the Dispatch Service directly?

Candidate: Decoupling. If Dispatch is slow or temporarily down, the order still gets created successfully — the order.placed event waits in Kafka until Dispatch is ready to consume it. If we called Dispatch synchronously, a Dispatch outage would fail order creation for the customer. With Kafka, the order is placed and the customer sees a confirmation; dispatch catches up asynchronously. The order creation latency is also lower — we're not waiting for dispatch to complete before responding to the customer.


Service 2: Location Service

200,000 dasher location updates per second. This is a write-heavy stream that needs a dedicated path.

What it is: a service that accepts high-frequency location pings from every active dasher, stores the latest known position, and makes it queryable for real-time tracking.

Interviewer: 200,000 location writes per second is a lot. How do you handle it?

Candidate: Two things. First, this is a stream of updates, not a history. We only care about a dasher's current location — not where they were 10 seconds ago. So we're doing SET dasher:{dasher_id}:location {lat, lng, heading, timestamp} in Redis, not INSERT into a relational table. Redis handles hundreds of thousands of SET operations per second comfortably.

Second, we publish every location update to a dasher.location_updated Kafka topic. This decouples the write (Location Service → Redis) from the read (Dispatch Service querying nearby dashers). The Dispatch Service reads from Redis; it doesn't hit the Location Service directly.

Interviewer: What about the tracking view on the customer's screen — they need to see the dasher moving in real time.

Candidate: The customer's app maintains a WebSocket connection to a Tracking Service. When a dasher's location is updated in Redis, the Location Service also publishes to a Redis Pub/Sub channel: tracking:order:{order_id}. The Tracking Service is subscribed to this channel and pushes the new coordinates to the customer's WebSocket connection.

This is the same cross-server push pattern used in messaging systems. The Tracking Service doesn't need to know which customer is watching which order — it subscribes to the order-specific channel, and customers subscribe via WebSocket when their order is active.

plaintext
Dasher app sends location ping every 5 seconds
  → Location Service → Redis SET (latest position)
  → Location Service → Kafka (dasher.location_updated)
  → Location Service → Redis Pub/Sub (tracking:order:{order_id})
  → Tracking Service receives Pub/Sub event
  → Pushes to customer's WebSocket connection

Service 3: Dispatch Service (The Core Problem)

This is the heart of the DoorDash interview. Everything else is supporting cast.

The problem to solve: given an active order and a pool of available dashers in the area, find the best dasher to assign — quickly, correctly, and without ever assigning the same order to two dashers.

Interviewer: How does dispatch work? Walk me through it.

Candidate: Let me start with the data model, then the algorithm, then the correctness guarantees.

Step 1: Geo-partitioning with Geohashing

The first principle of dispatch is locality. A dispatcher in New York should only consider New York dashers. Scanning 1 million global dashers for each of 460 orders per second would be catastrophically inefficient.

We use geohashing to partition the world into cells. Geohashing converts a latitude/longitude pair into a short alphanumeric string where shared prefix means shared geographic proximity. A precision-5 geohash cell is roughly 5km × 5km — about right for a delivery zone.

Each dasher's current geohash cell is stored in Redis: dashers_in_cell:{geohash} → [dasher_id_1, dasher_id_2, ...]. When a dasher moves between cells, the Location Service updates their cell membership atomically. When Dispatch needs candidates for an order at a restaurant with geohash 9q8yy, it queries the set for that cell and its 8 neighbours to capture dashers near the boundary.

Step 2: Candidate Scoring

The most important factor for finding the best dasher is geographical location — finding a dasher as close as possible to the store minimises total travel time. The second factor is ensuring the dasher will arrive at the right time: dispatch too early and they wait at the restaurant; dispatch too late and the food sits and gets cold.

The scoring function weighs:

plaintext
Score = f(
  estimated_travel_time_to_merchant,    -- primary factor
  merchant_estimated_prep_time,         -- so dasher arrives when food is ready
  dasher_current_workload,              -- is this dasher already carrying an order?
  dasher_acceptance_rate,               -- likelihood they accept this offer
  order_batching_opportunity            -- can we combine this with a nearby order?
)

The travel time estimate isn't Google Maps distance — it's a model trained on historical delivery data for this neighbourhood, time of day, weather, and traffic. A delivery in Manhattan at 7pm is not the same as Manhattan at 2pm.

Step 3: Assignment with Correctness Guarantees

Once a dasher is selected, the assignment must be atomic — no two Dispatch instances can assign the same order simultaneously.

The pattern:

sql
-- Atomic conditional update
UPDATE orders
SET dasher_id = :dasher_id,
    status = 'DASHER_ASSIGNED'
WHERE order_id = :order_id
  AND status = 'MERCHANT_ACCEPTED'   -- only assign if not already assigned
  AND dasher_id IS NULL;             -- double-check no concurrent assignment

If this UPDATE affects 0 rows, another instance beat us to it — we abort. The dasher selection and the database write are separate steps, but the actual assignment is a single atomic conditional update. This is optimistic locking applied to assignment.

The offer is then pushed to the dasher's app with a strict timeout — typically 30–45 seconds. If the dasher declines or the timer expires, the system retries with the next candidate. Retry logic prevents orders from remaining unassigned.

The dispatch section is typically where DoorDash interviewers slow down and probe hardest — they'll ask about geohashing precision, the atomicity of the assignment, and what happens when a dasher accepts and then immediately goes offline. It helps to have a crisp verbal explanation of each step before the real thing. Mockingly.ai specifically includes logistics-style dispatch prompts in its simulation library if you want to stress-test your reasoning here.

Interviewer: What if there are no available dashers nearby?

Candidate: A few escalating responses. First, expand the geohash search radius — check the next level up (a larger area). Second, surface a "no dashers nearby" state to the customer with a revised ETA. Third, the Dynamic Pricing system (covered below) can respond by increasing delivery fees in the zone to attract more dashers. If demand significantly exceeds supply, orders may be queued and the customer sees an extended wait time rather than an outright failure.

Interviewer: You mentioned batching. How does that work?

Candidate: Batching means utilising dashers as effectively as possible by looking for opportunities where a single dasher can pick up multiple orders at the same store or a set of nearby stores. When a new order arrives, the Dispatch Service checks if any currently assigned dasher is already heading to the same merchant and can carry a second order without significantly extending the first delivery's ETA. If the combined route adds less than, say, 5 minutes to the first customer's delivery, it's batched. The scoring function penalises batching by a factor representing the additional ETA impact — the optimiser finds the right balance automatically.


ETA Prediction

Interviewer: How do you compute the estimated delivery time shown to the customer?

Candidate: ETA is not a single number — it's a sum of estimates, each with its own model:

plaintext
Total ETA = 
  Order confirmation time          (usually seconds, nearly deterministic)
  + Merchant prep time             (ML model: restaurant, item type, current queue)
  + Dasher travel to merchant      (routing model: distance + traffic + time-of-day)
  + Dasher time at merchant        (ML model: parking, order handoff)
  + Dasher travel to customer      (routing model: distance + traffic)
  + Dasher time to door            (ML model: building type, floor, urban density)

Each component is a separate ML model trained on historical delivery data. The models are re-run continuously as the order progresses — not just at placement. When the dasher is confirmed en route, the ETA is recalculated with real GPS data instead of estimated travel time.

The ETA the customer sees is continuously updated, not static. If there's a traffic jam, the ETA extends. If the merchant is unusually fast, it shortens.


Fault Tolerance: Dasher Goes Offline Mid-Delivery

This is a deep-dive topic DoorDash interviewers specifically love, because it tests fault tolerance thinking on a time-critical, physically consequential failure.

Interviewer: A dasher accepts an order and is en route to the merchant. Their phone dies. What happens?

Candidate: Three layers of detection and recovery.

Detection: the Location Service expects a location ping from every active dasher every 5 seconds. If no ping arrives within 30 seconds — missed 6 consecutive pings — the Dasher Service marks the dasher as POTENTIALLY_OFFLINE and publishes a dasher.offline event to Kafka.

Grace period: a brief grace period (say, 90 seconds total from last ping) before assuming the dasher is truly gone. A tunnel, a building basement, or a brief crash can cause temporary signal loss. The order is not immediately reassigned — that would be chaotic.

Rescue dispatch: if the grace period expires with no ping resuming, the Dispatch Service triggers a "rescue dispatch" workflow:

plaintext
1. Order Service transitions order → REASSIGNING
2. Dasher Service marks offline dasher → INACTIVE (prevents re-assignment)
3. Dispatch Service runs a new dasher search for the same order
   (same geohash logic, but now the context is urgent — customer already waiting)
4. New dasher is assigned; customer is notified with updated ETA
5. Original dasher's delivery is nullified
   → If they resume connectivity, their app shows the order was reassigned

The atomic conditional UPDATE we discussed prevents any race condition between the rescue dispatch and the original dasher potentially resuming: the UPDATE checks that the current dasher_id is still the offline dasher's ID.

Interviewer: What if the dasher already picked up the food?

Candidate: Much harder. If the food is in the dasher's possession, reassignment doesn't work — there's nothing to reassign. The Dasher Service tracks the order's last known state. If the dasher went offline after PICKED_UP, the system attempts to re-establish contact first — the Notification Service sends a push notification and an SMS. If contact is re-established, delivery continues. If not, DoorDash's support team handles it manually — this is a human-intervention case, not one that can be fully automated. The system escalates by creating a support ticket automatically.

The dasher-goes-offline question is one DoorDash interviewers ask specifically because it tests fault tolerance on a physically consequential failure — food is already in someone's hands. Walking through both the pre-pickup and post-pickup cases clearly, with the grace period and rescue dispatch mechanics, is what distinguishes a thorough answer. Mockingly.ai includes logistics fault tolerance scenarios like this in its DoorDash and Uber simulations.


Dynamic Pricing

Interviewer: How does dynamic delivery pricing work?

Candidate: Dynamic pricing is a supply/demand ratio problem scoped to each geohash zone.

The Pricing Service monitors, per geohash cell:

plaintext
Supply:  count of available dashers in the cell
Demand:  count of unassigned orders in the cell
         + count of incoming orders in the last 5 minutes (predictive)
 
Surge multiplier = f(demand / supply ratio)

When the ratio exceeds a threshold — say, more than 3 pending orders per available dasher — a surge fee is applied to new orders in that cell. This simultaneously increases revenue per delivery (compensating for slower delivery times) and incentivises dashers to move into the high-demand zone.

The pricing calculation runs on a 30-second cycle per zone. It's read from Redis cache when customers see the fee breakdown before placing an order.

What the system doesn't do: change the price of an in-flight order. Once a customer confirms an order at a specific delivery fee, that fee is locked. Price changes only affect new orders.

Dynamic pricing is a topic where interviewers probe both the technical mechanism (supply/demand ratio per geohash cell) and the product reasoning (why lock in-flight prices, why 30-second cycles). Having both dimensions ready is what a complete answer looks like. Mockingly.ai runs simulations where interviewers ask exactly these product + technical follow-ups in combination.


Database Design

Interviewer: Walk me through the key tables.

Candidate: Each service owns its data. Here are the critical schemas.

Orders (PostgreSQL — strong consistency needed for state machine):

sql
CREATE TABLE orders (
    order_id         UUID PRIMARY KEY,
    customer_id      UUID NOT NULL,
    restaurant_id    UUID NOT NULL,
    dasher_id        UUID,                    -- null until assigned
    status           TEXT NOT NULL,           -- state machine value
    subtotal         DECIMAL(10,2) NOT NULL,
    delivery_fee     DECIMAL(10,2) NOT NULL,
    estimated_eta    TIMESTAMP,
    placed_at        TIMESTAMP DEFAULT NOW(),
    updated_at       TIMESTAMP DEFAULT NOW()
);
 
-- Index for dispatch queries
CREATE INDEX idx_orders_status ON orders (status, restaurant_id);

Order Items:

sql
CREATE TABLE order_items (
    order_id      UUID REFERENCES orders(order_id),
    item_id       UUID NOT NULL,
    item_name     TEXT NOT NULL,             -- denormalised — menu can change
    quantity      INT NOT NULL,
    unit_price    DECIMAL(10,2) NOT NULL,
    PRIMARY KEY   (order_id, item_id)
);

Note: item_name is denormalised. If the restaurant changes a menu item's name after the order, we want the original name on the receipt. Always snapshot price and name at order time.

Dashers (PostgreSQL for profiles; Redis for live state):

sql
CREATE TABLE dashers (
    dasher_id       UUID PRIMARY KEY,
    name            TEXT NOT NULL,
    vehicle_type    TEXT,                    -- car, bike, scooter
    rating          DECIMAL(3,2),
    lifetime_orders INT DEFAULT 0
);

Live dasher state (Redis — fast reads, volatile):

plaintext
dasher:{dasher_id}:status     → "available" | "en_route" | "at_merchant" | "delivering"
dasher:{dasher_id}:location   → { lat, lng, heading, timestamp }
dasher:{dasher_id}:order      → order_id (if currently assigned)
dashers_in_cell:{geohash}     → SET of dasher_ids

PostgreSQL holds durable dasher profile data; Redis holds live operational state that must be queryable in microseconds across millions of dashers. The two serve different access patterns, so use each where it fits.


Notification Pipeline

Interviewer: How do push notifications work at each order stage?

Candidate: Kafka-driven, with a dedicated Notification Service. Every Kafka event in the system — order.placed, order.assigned, order.picked_up, order.delivered — is consumed by the Notification Service. It decides who to notify and via which channel:

plaintext
order.placed        → Merchant: "New order received!" (app push)
order.assigned      → Customer: "Your dasher is on the way to the restaurant"
                    → Dasher: "New delivery offer" (with 30-second accept timer)
order.picked_up     → Customer: "Your food is picked up — live tracking now"
dasher.offline      → Customer: "Slight delay on your order"
order.delivered     → Customer: "Your order has arrived! Rate your experience"
                    → Restaurant: "Order completed"

For in-app notifications when the user is active, the notification goes via WebSocket. For users who backgrounded the app, it goes via FCM (Android) or APNs (iOS). The Notification Service checks the user's session state and routes accordingly — same event, different delivery channel.


Restaurant Menu and Catalog

Interviewer: How do you serve restaurant menus at 12,000 reads per second?

Candidate: Restaurant menus are read far more than they're written. A popular restaurant might have its menu fetched 10,000 times an hour and updated once a week. Cache aggressively.

The Menu Service stores canonical menu data in PostgreSQL. Redis caches the serialised menu per restaurant with a 5-minute TTL. Most of the 12,000 reads per second hit Redis — a menu read is a single key lookup, sub-millisecond.

Cache invalidation: when a restaurant updates their menu (price change, item added, item marked unavailable), the Menu Service writes to PostgreSQL and immediately deletes the Redis cache entry for that restaurant. The next read rebuilds from the database.

One important nuance: the "item unavailable" status is the most time-sensitive change. If a restaurant runs out of a popular item at 7 PM, the cache should not continue serving it as available for 5 minutes while orders come in that will be rejected. For availability changes specifically, the cache TTL should be shorter — 30 seconds — or the update should actively push to Redis rather than relying on TTL expiry.


Geo-Partitioned Architecture for Scale

Interviewer: The dinner rush in every timezone creates waves of peak load. How does the architecture handle this regionally?

Candidate: The Dispatch Service is geo-partitioned by design. Each instance of the Dispatch Service owns a set of geohash cells. Orders and dashers in those cells are handled exclusively by that instance. There's no cross-instance coordination needed for a given order — it lives entirely within one geographic partition.

At the Kafka level, the order.placed topic is partitioned by the restaurant's geohash. A consumer group of Dispatch Service instances processes messages where each instance consumes its assigned partitions. Scaling is as simple as adding Dispatch Service instances and rebalancing Kafka partition assignments.

Regional isolation also limits blast radius. A bug in the San Francisco Dispatch instance doesn't affect New York dispatch. Geo-partitioning or consistent hashing to shard location-sensitive data limits blast radius in regional spikes and keeps dispatch latency low during peak dinner rush.


Common Interview Follow-ups

"How would you handle a popular restaurant where hundreds of orders come in simultaneously?"

The restaurant creates a bottleneck between order placement and dispatch — all those orders need a dasher but a single restaurant has finite capacity. The key constraint is the restaurant's preparation capacity, not the dispatch system's throughput. Queue orders at the merchant side; the dispatch engine schedules dasher assignments with respect to the merchant's prep time model. If the restaurant is overwhelmed, orders are placed but dispatch is delayed intentionally — there's no point sending a dasher to pick up an order that won't be ready for 30 minutes. The ETA shown to the customer reflects the queue depth honestly.

"How does order batching interact with the ETA shown to the first customer?"

When the Dispatch Service considers batching a second order onto a dasher already assigned to a first order, it runs the full route calculation for the combined trip. The additional time to the first customer is the key constraint — if it exceeds a threshold (typically 3–5 minutes), the batch is rejected and a separate dasher is assigned to the second order. The first customer's ETA might extend slightly if batching is approved. DoorDash shows customers whether their order has been batched, maintaining transparency.

"How would you design the search — 'tacos near me' with filters for price, rating, and estimated delivery time'?"

Restaurant search is an Elasticsearch problem. The restaurant index contains fields for cuisine type, menu keywords, location (stored as a geo_point), average rating, price range, and estimated delivery time (a cached estimate based on current dasher availability and distance). A search for "tacos near me" executes a geo-distance query filtered to a configurable radius (typically 5km), full-text matched on cuisine type and menu keywords, and sorted by a relevance score that weighs rating, proximity, and estimated delivery time. Elasticsearch's native geo-distance filter makes proximity queries fast without a separate geohashing layer.

"What if the payment fails after the dasher is already assigned?"

This is a Saga-like failure. The order transitions to PAYMENT_FAILED. The Dispatch Service is notified and unassigns the dasher — the dasher's status returns to AVAILABLE. The customer is notified and offered a chance to retry payment within a time window. If they don't retry, the order is cancelled. The dasher was briefly assigned but never acted on it — no real-world impact. This is why the order state machine has PLACED (payment pending) separated from MERCHANT_ACCEPTED — we don't confirm to the merchant or dispatch until payment is confirmed.

"How do you handle the case where two Dispatch Service instances try to assign different dashers to the same order at the same time?"

The atomic conditional UPDATE in PostgreSQL prevents this. Both instances read the order in MERCHANT_ACCEPTED status. Both select a dasher. Both attempt UPDATE orders SET dasher_id = X, status = 'DASHER_ASSIGNED' WHERE order_id = Y AND status = 'MERCHANT_ACCEPTED'. Only one succeeds — the other gets 0 rows affected and knows it lost the race. The losing instance discards its assignment and moves to the next unassigned order in its queue. No two dashers are ever assigned to the same order.


Quick Interview Checklist

  • ✅ Clarified scope — three-sided marketplace, dispatch and tracking in scope
  • ✅ Identified the three actors: customer, merchant (restaurant), dasher
  • ✅ Back-of-the-envelope — 200K location updates/sec is the dominant write load
  • ✅ Order state machine — all states named including failure/exception transitions
  • ✅ Event-driven architecture via Kafka — Order Service publishes, everything else subscribes
  • ✅ Location Service — Redis for latest position, no history, high-throughput write path
  • ✅ WebSocket + Redis Pub/Sub for real-time customer tracking
  • ✅ Geohashing for dispatch zone partitioning — cells and 8-neighbour radius queries
  • ✅ Dasher scoring function — travel time, prep time alignment, workload, batching
  • ✅ Assignment via atomic conditional UPDATE — prevents dual-assignment
  • ✅ 30-45 second accept timeout with retry to next candidate
  • ✅ ETA as a sum of ML model estimates, recalculated continuously
  • ✅ Dasher offline detection via missed heartbeats — grace period + rescue dispatch
  • ✅ Post-pickup offline escalation to support when food is in dasher's possession
  • ✅ Dynamic pricing — supply/demand ratio per geohash cell, 30-second cycle, in-flight price locked
  • ✅ Denormalised item_name and unit_price on order_items (snapshot at order time)
  • ✅ PostgreSQL for durable state, Redis for live operational state
  • ✅ Menu caching in Redis with shorter TTL for availability changes
  • ✅ Geo-partitioned Dispatch Service — Kafka partition key = restaurant geohash
  • ✅ Notification Service consuming Kafka events — WebSocket for foreground, FCM/APNs for background

Conclusion

Designing DoorDash is really designing a real-time logistics system disguised as a food ordering app. The challenge isn't handling high QPS — it's handling the coupling between physical reality (a person on a bike, food cooling on a counter) and the distributed software system coordinating it.

Designing a system like DoorDash requires considering physical logistics alongside software architecture. You must balance real-time tracking demands with transactional integrity in the core ordering flow. That's a genuinely harder constraint than most system design problems impose.

The candidates who do well at DoorDash interviews are the ones who reason through the temporal constraints — dispatch timing relative to food prep, the dead dasher edge case, the difference between "I need strong consistency here" (assignment) and "eventual consistency is fine here" (location tracking) — and articulate why each decision was made.

The design pillars:

  1. Order state machine as the backbone — every service reacts to state transitions published on Kafka; nothing calls the Order Service directly
  2. Geohashing for dispatch locality — a dispatcher should only see local dashers; geo-partitioned Dispatch Service instances enforce this
  3. Redis for live dasher state, PostgreSQL for durable order state — different consistency needs, different tools
  4. Atomic conditional UPDATE for assignment — prevents dual-assignment without distributed locks
  5. ETA as a pipeline of ML estimates — not a static calculation, continuously recalculated as reality deviates from predictions
  6. Grace period + rescue dispatch for offline dashers — detect via heartbeat, wait, then reassign atomically
  7. Kafka as the integration layer — loose coupling between services, replayable event log, natural geo-partitioning via topic partitions


Frequently Asked Questions

What is geohashing and how does DoorDash use it for dispatch?

Geohashing is a technique that converts a latitude/longitude pair into a short alphanumeric string. Locations with a shared string prefix are geographically close to each other — making it a natural tool for locality-based partitioning.

How DoorDash uses geohashing for dispatch:

  1. The world is divided into cells. A precision-5 geohash cell covers roughly 5km × 5km — about the right size for a delivery zone
  2. Each available dasher's current cell is stored in Redis: dashers_in_cell:{geohash} → [dasher_ids]
  3. When an order arrives, the Dispatch Service queries that cell and its 8 neighbouring cells (to capture dashers near cell boundaries)
  4. Only local dashers are evaluated — a dispatcher in New York never considers dashers in Los Angeles
  5. When a dasher moves between cells, the Location Service atomically updates their cell membership in Redis

Why geohashing instead of a simple radius query:

A radius query at 460 orders/second across 1 million global dashers would require a full spatial scan on every dispatch request. Geohashing reduces it to a constant-time key lookup — a single read of dashers_in_cell:{geohash} — with no scanning.
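The encoding itself fits in a few lines of Python — this is a minimal sketch of the standard geohash bisection algorithm, paired with the `dashers_in_cell:` key convention from step 2:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat: float, lng: float, precision: int = 5) -> str:
    """Encode a lat/lng pair as a geohash string.

    Alternately bisect the longitude and latitude ranges, emitting one
    bit per bisection and one base-32 character per five bits.
    """
    lat_rng, lng_rng = [-90.0, 90.0], [-180.0, 180.0]
    bits, bit_count, even = 0, 0, True
    chars = []
    while len(chars) < precision:
        rng, val = (lng_rng, lng) if even else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        bits <<= 1
        if val >= mid:
            bits |= 1
            rng[0] = mid    # keep the upper half
        else:
            rng[1] = mid    # keep the lower half
        even = not even
        bit_count += 1
        if bit_count == 5:
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)

def dasher_cell_key(lat: float, lng: float) -> str:
    """Redis key for the dasher set covering this location."""
    return f"dashers_in_cell:{geohash_encode(lat, lng)}"
```

The prefix property falls straight out of the construction: truncating a precision-5 hash to 3 characters gives the containing coarser cell, which is what makes shared prefixes mean geographic proximity.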


What is the order state machine in a food delivery system?

The order state machine is the complete set of states an order can occupy and the valid transitions between them. It is the backbone of the entire system — every service reacts to state transitions rather than calling each other directly.

The full state machine:

```plaintext
CART → PLACED → MERCHANT_ACCEPTED → DASHER_ASSIGNED
  → DASHER_AT_MERCHANT → PICKED_UP → DELIVERED → COMPLETED
```

Exception/failure states:

  1. PLACED → CANCELLED — customer cancels before merchant acknowledges
  2. MERCHANT_ACCEPTED → CANCELLED — merchant rejects (out of stock, closed)
  3. DASHER_ASSIGNED → REASSIGNING — dasher goes offline before pickup
  4. Any state → FAILED — unrecoverable system error

Why the state machine is architecturally important:

  1. Every state transition publishes a Kafka event — downstream services (Dispatch, Notification, Location) react to events, not to direct calls
  2. The Order Service is decoupled from dispatch logic — it only manages state
  3. State transitions are persisted to PostgreSQL before the Kafka event is published — no transition is lost on a crash
  4. The state machine makes it impossible to, for example, assign a dasher to an already-assigned order — the AND status = 'MERCHANT_ACCEPTED' check in the atomic UPDATE enforces valid transitions
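A minimal sketch of the transition table follows — the states are the ones listed above; note that in production the guard lives in the conditional UPDATE against the database, not only in application memory:

```python
# Valid transitions: happy path plus the exception transitions above.
# CANCELLED, COMPLETED, and FAILED are terminal.
TRANSITIONS = {
    "CART":               {"PLACED"},
    "PLACED":             {"MERCHANT_ACCEPTED", "CANCELLED", "FAILED"},
    "MERCHANT_ACCEPTED":  {"DASHER_ASSIGNED", "CANCELLED", "FAILED"},
    "DASHER_ASSIGNED":    {"DASHER_AT_MERCHANT", "REASSIGNING", "FAILED"},
    "REASSIGNING":        {"DASHER_ASSIGNED", "FAILED"},
    "DASHER_AT_MERCHANT": {"PICKED_UP", "FAILED"},
    "PICKED_UP":          {"DELIVERED", "FAILED"},
    "DELIVERED":          {"COMPLETED", "FAILED"},
    "COMPLETED":          set(),
    "CANCELLED":          set(),
    "FAILED":             set(),
}

def transition(current: str, target: str) -> str:
    """Return the new state, or raise if the transition is not allowed."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```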

How does DoorDash prevent two dashers from being assigned to the same order?

An atomic conditional UPDATE in PostgreSQL prevents dual assignment. This is a correctness guarantee, not just an optimisation.

How it works:

```sql
UPDATE orders
SET dasher_id = :dasher_id,
    status = 'DASHER_ASSIGNED'
WHERE order_id = :order_id
  AND status = 'MERCHANT_ACCEPTED'
  AND dasher_id IS NULL;
```

The mechanism:

  1. Both Dispatch Service instances select a dasher candidate for the same order
  2. Both attempt the UPDATE simultaneously
  3. PostgreSQL's row-level locking ensures only one UPDATE succeeds — the other gets 0 rows affected
  4. The losing instance checks the affected row count, sees 0, discards its assignment, and moves to the next unassigned order
  5. No two dashers are ever assigned to the same order — even under concurrent dispatch

Why not a distributed lock?

A Redis distributed lock (SET NX EX) would also work but adds another network round-trip and a dependency. The conditional UPDATE achieves the same result using the database's own ACID guarantees — simpler, fewer moving parts, and the database is already in the write path.
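The losing-instance logic can be demonstrated with an in-memory SQLite table standing in for PostgreSQL — for this single-row case, `rowcount` after the conditional UPDATE behaves the same way:

```python
import sqlite3

# In-memory stand-in for the orders table. PostgreSQL's row-level
# locking gives the same one-winner guarantee under real concurrency.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE orders (
    order_id  TEXT PRIMARY KEY,
    status    TEXT NOT NULL,
    dasher_id TEXT)""")
db.execute("INSERT INTO orders VALUES ('o1', 'MERCHANT_ACCEPTED', NULL)")

def try_assign(conn, order_id, dasher_id):
    """Attempt the conditional assignment; True iff this caller won."""
    cur = conn.execute(
        """UPDATE orders
           SET dasher_id = ?, status = 'DASHER_ASSIGNED'
           WHERE order_id = ? AND status = 'MERCHANT_ACCEPTED'
             AND dasher_id IS NULL""",
        (dasher_id, order_id))
    return cur.rowcount == 1

first = try_assign(db, "o1", "dasher_A")   # wins: 1 row affected
second = try_assign(db, "o1", "dasher_B")  # loses: 0 rows affected
```

The second call finds no row matching `status = 'MERCHANT_ACCEPTED'`, affects zero rows, and the caller knows to discard its candidate and move on.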


How does DoorDash handle a dasher who goes offline mid-delivery?

Three layers of detection and recovery handle a dasher going offline — with different responses depending on whether the food has been picked up yet.

Detection:

  1. The Location Service expects a ping every 5 seconds from every active dasher
  2. After 30 seconds of missed pings (6 consecutive missed), the dasher is marked POTENTIALLY_OFFLINE
  3. A 90-second grace period follows — brief outages (tunnels, building basements) resolve without action

Recovery — dasher went offline before pickup:

  1. Grace period expires with no reconnection
  2. Order transitions to REASSIGNING
  3. Dispatch Service runs a rescue dispatch: finds a new dasher using the same geohash algorithm
  4. Original dasher is marked INACTIVE
  5. Customer receives "Slight delay on your order" notification with updated ETA

Recovery — dasher went offline after pickup (food in hand):

  1. Reassignment is impossible — the food is with the dasher
  2. System attempts re-contact: push notification + SMS sent to dasher's phone
  3. If contact is re-established: delivery continues normally
  4. If no contact after extended period: support ticket is automatically created; human intervention required

The post-pickup case cannot be fully automated. It's a deliberate design acknowledgement that some failure modes require human judgment.
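The detection thresholds above reduce to a small classification function — this sketch mirrors the numbers in the text; a real service would evaluate it against heartbeat timestamps kept in Redis:

```python
PING_INTERVAL = 5    # seconds between expected pings
OFFLINE_AFTER = 30   # 6 consecutive missed pings
GRACE_PERIOD = 90    # tunnels, basements, brief signal loss

def classify(last_ping_ts: float, now: float) -> str:
    """Classify a dasher's connectivity from their last heartbeat.

    The action taken on OFFLINE depends on the order state: rescue
    dispatch before pickup, re-contact plus support after pickup.
    """
    silence = now - last_ping_ts
    if silence < OFFLINE_AFTER:
        return "ONLINE"
    if silence < OFFLINE_AFTER + GRACE_PERIOD:
        return "POTENTIALLY_OFFLINE"   # inside the grace period: no action
    return "OFFLINE"                   # grace expired: trigger recovery
```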


How does ETA prediction work in a food delivery system?

ETA is a pipeline of ML model estimates, not a single calculation. Each stage of delivery has its own model, and the total ETA is continuously recalculated as the order progresses.

The six components:

  1. Order confirmation time — nearly deterministic, typically seconds
  2. Merchant prep time — ML model trained on restaurant type, specific items, current queue depth, time of day
  3. Dasher travel to merchant — routing model using distance, real-time traffic, historical data for that route at that time of day
  4. Dasher time at merchant — ML model incorporating parking availability, building access, order handoff patterns
  5. Dasher travel to customer — routing model as above
  6. Dasher time to door — ML model for building type, floor, urban density, elevator wait times

Why continuous recalculation matters:

  1. The ETA at order placement uses estimates for all six components
  2. Once a dasher is assigned, real GPS data replaces the travel time estimate
  3. Once pickup is confirmed, prep time and travel-to-merchant are removed from the calculation
  4. Traffic jams, unusually fast restaurants, and parking difficulties all update the displayed ETA in real time

A static ETA set at order placement would be wrong by the time the order is delivered. The customer always sees the current best estimate based on real data.
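A toy sketch of the recalculation — the stage names and minute values are placeholders, but the shape is the point: drop completed stages from the sum and prefer live data over model estimates:

```python
# The six pipeline stages, in delivery order. Each value would come
# from its own ML model; the numbers below are invented placeholders.
STAGES = ["confirmation", "merchant_prep", "travel_to_merchant",
          "at_merchant", "travel_to_customer", "to_door"]

def eta_minutes(estimates: dict, completed: set, actuals: dict = None) -> float:
    """Sum the remaining stage estimates, preferring live data.

    completed -- stages already finished (dropped from the sum)
    actuals   -- stages where real data (e.g. live GPS routing) has
                 replaced the original model estimate
    """
    actuals = actuals or {}
    return sum(actuals.get(s, estimates[s])
               for s in STAGES if s not in completed)

est = {"confirmation": 0.5, "merchant_prep": 12, "travel_to_merchant": 6,
       "at_merchant": 4, "travel_to_customer": 10, "to_door": 2}

at_placement = eta_minutes(est, completed=set())
# After pickup: the first four stages are done, and GPS routing now
# supplies a live travel-to-customer figure in place of the estimate.
after_pickup = eta_minutes(
    est,
    completed={"confirmation", "merchant_prep",
               "travel_to_merchant", "at_merchant"},
    actuals={"travel_to_customer": 13},
)
```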


How does real-time dasher location tracking work for customers?

Real-time tracking uses a chain of three systems: Redis for fast writes, Redis Pub/Sub for cross-service fan-out, and WebSockets for client delivery.

The full flow:

  1. Dasher app sends a GPS ping every 5 seconds
  2. Location Service receives the ping and writes: SET dasher:{id}:location {lat, lng, heading, timestamp} to Redis
  3. Location Service publishes to Redis Pub/Sub channel: tracking:order:{order_id}
  4. Tracking Service is subscribed to that channel — receives the event
  5. Tracking Service pushes the new coordinates to the customer's WebSocket connection
  6. Customer's map updates to show the dasher moving

Why Redis Pub/Sub instead of polling:

Polling doesn't scale. With roughly 2 million customers watching active deliveries, a 5-second poll interval means about 400,000 location reads per second hitting the database — for reads alone, before the 200,000 dasher writes per second. Redis Pub/Sub publishes each ping once and fans it out to all subscribers — the Tracking Service receives the update and pushes it to the relevant customers without any polling.

Why WebSockets instead of HTTP:

The customer's map needs to update every 5 seconds. HTTP polling at that frequency from millions of active deliveries would generate enormous server load. WebSockets maintain a persistent connection — the server pushes updates without the client asking.
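The fan-out pattern can be illustrated with an in-process stand-in for Redis Pub/Sub — a real deployment would use a Redis client's pubsub API, but this sketch shows the publish-once, deliver-to-all shape:

```python
from collections import defaultdict

class PubSub:
    """In-process stand-in for Redis Pub/Sub: one publish fans out to
    every subscriber, with no polling. In production the subscribers
    would be Tracking Service instances holding customer WebSockets."""

    def __init__(self):
        self._subs = defaultdict(list)

    def subscribe(self, channel, callback):
        self._subs[channel].append(callback)

    def publish(self, channel, message):
        for cb in self._subs[channel]:
            cb(message)                   # deliver to every subscriber
        return len(self._subs[channel])   # number of receivers

bus = PubSub()
received = []
# Tracking Service subscribes on behalf of the customer watching order 42.
bus.subscribe("tracking:order:42", received.append)
# Location Service publishes the dasher's latest ping.
bus.publish("tracking:order:42", {"lat": 40.74, "lng": -73.99})
```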


How does dynamic pricing work in a food delivery system?

Dynamic delivery pricing increases delivery fees in zones where demand exceeds supply, incentivising more dashers to move into high-demand areas.

The mechanism:

  1. The Pricing Service monitors each geohash cell every 30 seconds
  2. It computes a supply/demand ratio: (pending orders in cell) / (available dashers in cell)
  3. When the ratio exceeds a threshold (e.g., more than 3 pending orders per available dasher), a surge fee multiplier is applied to new orders in that cell
  4. The surge multiplier is stored in Redis: surge_fee:{geohash} → 1.4 (meaning 40% above base fee)
  5. When a customer places an order, the current surge fee for their zone is locked into the order at that moment

Three important design decisions:

  1. Locked on order confirmation — the price a customer agreed to never changes after checkout, regardless of subsequent demand changes in their zone
  2. 30-second cycle — frequent enough to respond to dinner rush spikes; infrequent enough that fees don't change while a customer is browsing menus
  3. Customer-visible — the app shows the surge fee before checkout, not hidden in the total
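The per-cell computation can be sketched as follows — `threshold`, `step`, and `cap` are hypothetical tuning knobs, chosen here so the example reproduces the 1.4 multiplier mentioned above:

```python
def surge_multiplier(pending_orders: int, available_dashers: int,
                     threshold: float = 3.0, step: float = 0.2,
                     cap: float = 2.0) -> float:
    """Compute the surge fee multiplier for one geohash cell.

    Below `threshold` orders per dasher, the base fee applies (1.0).
    Above it, each extra pending order per dasher adds `step`, capped
    at `cap`. All three knobs are illustrative, not DoorDash's values.
    """
    if available_dashers == 0:
        return cap  # no supply at all: maximum surge
    ratio = pending_orders / available_dashers
    if ratio <= threshold:
        return 1.0
    return min(cap, round(1.0 + (ratio - threshold) * step, 2))
```

The result would be written to `surge_fee:{geohash}` in Redis each 30-second cycle and read once at checkout, where it's locked into the order.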

How does the DoorDash system handle restaurant menu availability changes?

Menu availability changes are the most time-sensitive cache invalidation problem in the system. A stale "available" item generates orders that restaurants must reject — creating bad customer experience.

The two-tier approach:

  1. Canonical data — menu items and prices are stored in PostgreSQL (the Menu Service's database). PostgreSQL is the source of truth
  2. Read cache — menus are serialised to Redis per restaurant with a 5-minute TTL. At 12,000 menu reads/second, almost all traffic hits Redis — not PostgreSQL

Cache invalidation strategy by change type:

| Change type | Cache invalidation approach | Why |
| --- | --- | --- |
| Price change | Delete Redis key on write; TTL rebuilds fresh cache | 5-minute delay is acceptable |
| New item added | Delete Redis key on write | 5-minute delay is acceptable |
| Item marked unavailable | Immediate active push to Redis OR 30-second TTL | Stale "available" generates bad orders |

The "item unavailable" case gets special treatment. A restaurant running out of a dish at 7 PM cannot afford 5 minutes of stale cache — customers will place orders the restaurant must reject. Either the Menu Service actively overwrites the Redis cache entry on unavailability changes, or the TTL for availability-sensitive fields is reduced to 30 seconds.
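A dict-backed sketch of that policy, with the dict standing in for Redis — key names and TTLs mirror the numbers above:

```python
import time

MENU_TTL = 300          # 5 minutes for price changes and new items
AVAILABILITY_TTL = 30   # 30 seconds when availability has changed

cache = {}              # key -> (menu, expires_at); stand-in for Redis

def cache_menu(restaurant_id, menu, availability_changed=False):
    """Write-through with the TTL appropriate to the change type."""
    ttl = AVAILABILITY_TTL if availability_changed else MENU_TTL
    cache[f"menu:{restaurant_id}"] = (menu, time.time() + ttl)

def on_menu_write(restaurant_id, menu, item_went_unavailable=False):
    """Called by the Menu Service after the PostgreSQL write commits."""
    key = f"menu:{restaurant_id}"
    if item_went_unavailable:
        # Active push: overwrite immediately so no customer is served
        # a stale "available" flag while the old entry ages out.
        cache_menu(restaurant_id, menu, availability_changed=True)
    else:
        # Price change / new item: just drop the key; the next read
        # rebuilds from PostgreSQL and re-caches with the normal TTL.
        cache.pop(key, None)
```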


Which companies ask the DoorDash system design question in interviews?

DoorDash, Uber, Lyft, Instacart, Grubhub, Amazon, and Google ask variants of this question for senior software engineer roles. It appears at any company building logistics, ride-sharing, or real-time marketplace systems.

Why it is a particularly revealing interview question:

  1. Physical constraints create novel requirements — dispatch timing relative to food prep time, the post-pickup offline failure mode, and locked in-flight pricing are problems unique to logistics that don't appear in pure software system design
  2. Tests multiple distributed systems concepts simultaneously — geohashing, atomic conditional updates, heartbeat-based failure detection, event-driven architecture, and real-time location streaming all appear in one question
  3. Scales to seniority — a mid-level answer describes the happy path; a senior answer covers the dasher offline failure modes, the dual-assignment prevention mechanism, and why Kafka decouples the services

What interviewers specifically listen for:

  1. Modelling the order state machine before anything else — it's the skeleton everything hangs on
  2. Geohashing for dispatch locality — not just "find nearby dashers" but the specific mechanism and Redis data structure
  3. Atomic conditional UPDATE for assignment — named as a correctness requirement, not an optimisation
  4. Two distinct dasher-offline scenarios — pre-pickup (rescuable) vs post-pickup (human intervention)
  5. ETA as a pipeline of models — not a static calculation

The DoorDash interview is one where the physical domain creates constraints that don't exist in pure software systems — and the interviewers know it. Explaining why you need to dispatch a dasher relative to prep time completion (not just restaurant location) is the kind of insight that lands you an offer. If you want to rehearse that reasoning under real interview conditions before the actual interview, Mockingly.ai has system design simulations for roles at DoorDash, Uber, Instacart, and beyond — where the follow-up questions are as hard as the ones in this guide.
