Design a Distributed Task Scheduler

Difficulty: Hard · 22 min read · Backend System Design

Key topics: distributed systems, task scheduler, job scheduler, leader election, etcd, Kafka

How to Design a Distributed Task Scheduler

A distributed task scheduler is one of the most deceptively simple system design questions you can get.

Accept jobs. Assign them to workers. Track completion. Most engineers have used one — through cron jobs, background workers, or data pipelines. Designing one from scratch sounds manageable.

Then the interviewer asks: what happens when a worker crashes mid-task? What if two scheduler nodes both try to assign the same task? How do you guarantee a cron job that runs daily actually runs exactly once and not twice? How do you handle a task that was scheduled for 3 AM but the server's clock drifted by two minutes?

That's where the real depth lives. This question compresses several hard distributed systems problems into one manageable surface area: coordination, durability, leader election, failure detection, retries, execution guarantees, and clock semantics.

Interviewers use it for exactly that reason. Mid-level candidates describe queues and workers. Senior candidates talk about uncertainty, failure domains, and execution guarantees.

This guide covers the full design — in the kind of back-and-forth you'd actually have in the room.


Step 1: Clarify the Scope

Interviewer: Design a distributed task scheduler.

Candidate: A few questions before I start. Are we scheduling one-off tasks, recurring cron-style tasks, or both? Do tasks have dependencies — meaning some tasks must complete before others can start? What are the latency requirements — does a task need to run within milliseconds of its scheduled time, or is a few seconds of delay acceptable? Do we need to support task priorities? And what's the expected scale — thousands of tasks per day or millions?

Interviewer: Support both one-off and recurring tasks. Task dependencies are a nice-to-have — cover them if we have time. Latency of a few seconds is acceptable. Priority queues are in scope. Assume millions of tasks per day at peak.

Candidate: Got it. The most interesting challenges here are execution guarantees — ensuring tasks run at least once without running twice — and handling worker failures correctly. Let me walk through requirements and numbers, then build the architecture.


Requirements

Functional

  • Submit one-off tasks to run immediately or at a future time
  • Schedule recurring tasks (cron-style: "run every day at 2 AM")
  • Task priorities: high, normal, and low queues
  • Track task status: pending, running, completed, failed
  • Retry failed tasks up to a configurable number of times
  • Cancel or update a submitted task before it runs
  • Task dependencies: task B starts only after task A completes (stretch goal)

Non-Functional

  • At-least-once execution — no submitted task should be silently dropped
  • No duplicate execution under normal conditions — at-most-once is ideal, but acknowledged as hard
  • Fault-tolerant — scheduler and worker failures must not lose tasks
  • Scalable — millions of tasks per day, horizontal scaling of workers
  • Observable — task status queryable at any time; alerting on stuck or failed tasks

Back-of-the-Envelope Estimates

Interviewer: Give me the rough numbers.

Candidate:

plaintext
Tasks submitted per day:       10 million
Peak submission rate:          ~500 tasks/second
Average task duration:         30 seconds (wide range: 1s to hours)
Concurrent running tasks:      ~50,000 at peak
Workers needed:                ~5,000 (assuming 10 concurrent tasks each)
Task record size:              ~1 KB
Storage for 90-day retention:  10M/day × 90 days × 1 KB ≈ ~0.9 TB

The key insight from these numbers: 500 task submissions per second is very manageable for a database or queue to absorb. The real challenge isn't throughput — it's correctness under failure. With 50,000 concurrent tasks across 5,000 workers, at any moment some workers will be silently dying, some tasks will be stuck, and some scheduler nodes may disagree on what needs to run next. The design must handle all of those without human intervention.


High-Level Architecture

plaintext
                      ┌────────────────────────────────┐
                      │            Clients             │
                      │  (APIs, cron definitions,      │
                      │   one-off submissions)         │
                      └──────────────┬─────────────────┘
                                     │ REST API
                      ┌──────────────▼─────────────────┐
                      │       Scheduler Service        │
                      │  (Leader-elected, polls task   │
                      │   store, enqueues due tasks)   │
                      └──────────────┬─────────────────┘

          ┌──────────────────────────┼──────────────────────────┐
          │                          │                          │
┌─────────▼──────────┐   ┌──────────▼──────────┐   ┌──────────▼──────────┐
│   High-Priority    │   │  Normal-Priority    │   │  Low-Priority       │
│   Task Queue       │   │  Task Queue         │   │  Task Queue         │
│   (Kafka / SQS)    │   │  (Kafka / SQS)      │   │  (Kafka / SQS)      │
└─────────┬──────────┘   └──────────┬──────────┘   └──────────┬──────────┘
          │                          │                          │
          └──────────────────────────┼──────────────────────────┘
                                     │ pull
          ┌──────────────────────────┼──────────────────────────┐
          │                          │                          │
┌─────────▼──────────┐   ┌──────────▼──────────┐   ┌──────────▼──────────┐
│   Worker Node 1    │   │   Worker Node 2     │   │   Worker Node N     │
│   (executes tasks) │   │   (executes tasks)  │   │   (executes tasks)  │
└─────────┬──────────┘   └──────────┬──────────┘   └──────────┬──────────┘
          │                          │                          │
          └──────────────────────────┼──────────────────────────┘
                                     │ heartbeats, status updates
                      ┌──────────────▼─────────────────┐
                      │          Task Store            │
                      │  (PostgreSQL / Cassandra)      │
                      │  Source of truth for all tasks │
                      └────────────────────────────────┘
                      ┌────────────────────────────────┐
                      │          Coordinator           │
                      │  (etcd / ZooKeeper)            │
                      │  Leader election, worker       │
                      │  registry, distributed locks   │
                      └────────────────────────────────┘

Three services do the real work: the Scheduler Service decides what runs and when, the Task Queue buffers work between scheduling and execution, and the Worker Nodes run the actual logic. The Coordinator (etcd or ZooKeeper) is the glue that keeps them from stepping on each other.


Task Submission: Durability First

Interviewer: What happens when a client submits a task?

Candidate: The first priority is durability.

Once the scheduler acknowledges the request, it must never lose that job, even if the scheduler crashes immediately afterward. That means persisting the task before acknowledging it: submission is a responsibility boundary, and acknowledging before persisting risks silent loss.

The flow on submission:

plaintext
Client → POST /tasks { type, payload, scheduled_at, priority, max_retries }
  ▼
Scheduler Service writes task to Task Store (status = PENDING)
  ▼
Returns task_id and status to client
  ▼ (asynchronously)
Scheduler Service polls Task Store for due tasks
Enqueues them into the appropriate priority queue
Updates status = QUEUED

The task is durable in the Task Store before the client gets an acknowledgement. If the scheduler crashes between writing to the store and enqueuing to Kafka, the polling loop on restart will find the PENDING task and enqueue it. Nothing is lost.
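This crash-recovery property can be shown in miniature. The sketch below uses SQLite as a stand-in for the Task Store; the function names (`submit_task`, `poll_due_tasks`) are illustrative, not part of the design:

```python
import sqlite3
import time
import uuid

# SQLite stands in for the Task Store purely to illustrate the flow.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE tasks (
    task_id TEXT PRIMARY KEY,
    payload TEXT,
    status TEXT NOT NULL,
    scheduled_at REAL NOT NULL)""")

def submit_task(payload: str, scheduled_at: float) -> str:
    """Persist the task BEFORE returning an id to the client."""
    task_id = str(uuid.uuid4())
    db.execute("INSERT INTO tasks VALUES (?, ?, 'PENDING', ?)",
               (task_id, payload, scheduled_at))
    db.commit()        # durable before acknowledgement
    return task_id     # only now does the client get an ack

def poll_due_tasks(now: float) -> list[str]:
    """Runs on (re)start too, so a crash between write and enqueue loses nothing."""
    rows = db.execute(
        "SELECT task_id FROM tasks WHERE status = 'PENDING' AND scheduled_at <= ?",
        (now,)).fetchall()
    # mark QUEUED in the same transaction as the enqueue decision
    for (task_id,) in rows:
        db.execute("UPDATE tasks SET status = 'QUEUED' WHERE task_id = ?", (task_id,))
    db.commit()
    return [r[0] for r in rows]

tid = submit_task("send_email", scheduled_at=time.time())
# simulate a scheduler crash + restart: the polling loop still finds the task
assert poll_due_tasks(time.time()) == [tid]
assert poll_due_tasks(time.time()) == []   # already QUEUED, not re-enqueued
```

The second poll returning nothing is the same property that later prevents a new leader from re-enqueuing work the old leader already handled.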


Task Store Schema

Interviewer: Walk me through the task storage schema.

Candidate:

sql
CREATE TABLE tasks (
    task_id         UUID PRIMARY KEY,
    task_type       TEXT NOT NULL,          -- identifies which handler to invoke
    payload         JSONB NOT NULL,         -- arbitrary job-specific parameters
    priority        SMALLINT DEFAULT 1,     -- 0=high, 1=normal, 2=low
    status          TEXT NOT NULL,          -- PENDING, QUEUED, RUNNING, COMPLETED, FAILED
    scheduled_at    TIMESTAMPTZ NOT NULL,   -- when to run (NOW() for immediate)
    started_at      TIMESTAMPTZ,
    completed_at    TIMESTAMPTZ,
    retry_count     INT DEFAULT 0,
    max_retries     INT DEFAULT 3,
    next_retry_at   TIMESTAMPTZ,            -- exponential backoff target
    worker_id       TEXT,                   -- which worker claimed this task
    heartbeat_at    TIMESTAMPTZ,            -- last heartbeat from the worker
    result          JSONB,                  -- output or error details
    created_at      TIMESTAMPTZ DEFAULT NOW()
);
 
-- Critical indexes
CREATE INDEX idx_tasks_due ON tasks (scheduled_at, status)
    WHERE status IN ('PENDING', 'QUEUED');
 
CREATE INDEX idx_tasks_running ON tasks (heartbeat_at, status)
    WHERE status = 'RUNNING';

Key decisions to explain:

worker_id and heartbeat_at on the task row are what make failure detection possible. A running task that hasn't received a heartbeat update in 60 seconds is presumed orphaned — its worker likely died.

next_retry_at enables exponential backoff for retries. A task that fails is not immediately re-queued. It waits — 30 seconds, then 2 minutes, then 10 minutes — before being retried, giving transient problems time to resolve.
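A capped exponential gives roughly that shape. The base, growth factor, and cap below are illustrative tuning values, not numbers fixed by the design:

```python
def retry_delay_seconds(retry_count: int,
                        base: float = 30.0,
                        factor: float = 4.0,
                        cap: float = 3600.0) -> float:
    """Delay before the next attempt: 30s, 2min, 8min, ... capped at 1 hour.
    base/factor/cap are assumed values; tune per task type in practice."""
    return min(base * factor ** retry_count, cap)

assert retry_delay_seconds(0) == 30.0     # first retry after 30 seconds
assert retry_delay_seconds(1) == 120.0    # then 2 minutes
assert retry_delay_seconds(5) == 3600.0   # capped, never unbounded
```

`next_retry_at` would then be set to `NOW() + retry_delay_seconds(retry_count)` when the task is re-queued.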

The partial index WHERE status IN ('PENDING', 'QUEUED') is important. As the system processes millions of tasks, the completed and failed rows vastly outnumber the actionable rows. The partial index stays small and fast by only indexing the rows that matter for scheduling.


Push vs Pull: How Work Reaches Workers

This is one of the first real trade-off questions interviewers probe.

Interviewer: Does the scheduler push tasks to workers, or do workers pull tasks from a queue?

Candidate: Both models exist. Let me explain the trade-offs.

Push model: the Scheduler Service maintains a registry of available workers and directly sends a task to a specific worker when it's due.

plaintext
Pros:
  Fine-grained control — scheduler knows exactly who has what
  Low latency — no queue polling delay
  Easier to implement load balancing strategies
 
Cons:
  Scheduler must track worker health accurately
  If a worker is reported healthy but actually slow, it gets overloaded
  Scheduler becomes a bottleneck for all task assignments

Pull model: workers pull tasks from a shared queue when they have capacity.

plaintext
Pros:
  Natural load balancing — idle workers pull more, busy ones pull less
  Resilient to scheduler restarts — tasks sit in the queue safely
  Workers self-report capacity by pulling rate
 
Cons:
  Slightly higher latency — workers poll on a cycle
  Requires a reliable distributed queue between scheduler and workers

I'd choose the pull model for this design.

Workers are inherently variable — some tasks take 1 second, others take 10 minutes. A push model that assigns based on estimated load quickly goes wrong. In a pull model, a worker that finishes fast pulls more work. A slow worker stops pulling. The queue naturally absorbs the variance.

Pull-based models trade a small delay for greater resilience. Workers only claim jobs when ready, reducing stale assignments.

The Scheduler Service's job becomes: poll the Task Store for due tasks, enqueue them into the priority-stratified task queues, update status to QUEUED. Workers handle the rest.
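The self-balancing behaviour of the pull model can be sketched with an in-process queue (names and capacities here are illustrative):

```python
import queue

# A shared queue stands in for Kafka/SQS; workers take work only when they
# have capacity, so fast workers naturally drain more of the backlog.
task_queue: "queue.Queue[str]" = queue.Queue()
for i in range(6):
    task_queue.put(f"task-{i}")

def worker_pull(worker_id: str, capacity: int) -> list[str]:
    """Pull at most `capacity` tasks; a busy worker simply stops pulling."""
    claimed = []
    while len(claimed) < capacity:
        try:
            claimed.append(task_queue.get_nowait())
        except queue.Empty:
            break
    return claimed

fast = worker_pull("w1", capacity=4)   # idle worker pulls more
slow = worker_pull("w2", capacity=1)   # loaded worker pulls less
assert len(fast) == 4 and len(slow) == 1
assert task_queue.qsize() == 1         # remaining work waits for the next pull
```

No scheduler-side load estimation is needed: capacity is expressed entirely through each worker's pull rate.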

Push vs pull is one of those questions where interviewers expect you to argue a position, not just list trade-offs. "I'd choose pull because workers are inherently variable" — with the failure reasoning behind it — is what a senior answer sounds like. Mockingly.ai has distributed systems simulations where you practise defending exactly this kind of design decision under follow-up questioning.



Leader Election: Preventing Duplicate Scheduling

Interviewer: Multiple Scheduler Service instances run for availability. How do you prevent two of them from enqueuing the same task twice?

Candidate: This is where leader election comes in.

What leader election is: a distributed consensus mechanism where a group of nodes agrees on exactly one node being the "leader" at any time. Only the leader performs a privileged operation — in our case, polling the Task Store and enqueueing tasks. All other scheduler nodes stand by, ready to take over if the leader dies.

The tool: etcd or ZooKeeper. Both implement distributed consensus (etcd uses Raft; ZooKeeper uses ZAB). Both provide a lease-based locking primitive that schedulers can use:

plaintext
Scheduler nodes compete to acquire a lease key in etcd:
  key: /scheduler/leader
  value: { node_id, hostname }
  TTL: 30 seconds
 
Winning node = leader.
Leader renews the lease every 10 seconds (keeps it alive).
If leader dies, TTL expires, other nodes race to acquire it.
New leader elected within TTL window (≤30 seconds).

Split-brain protection: if the leader loses network connectivity to etcd but workers and clients can still reach it, the leader's lease expires. It must stop acting as leader — it checks its own lease validity before each scheduling cycle. An expired lease means: stop scheduling, enter standby mode.
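The lease protocol itself is simple enough to simulate. etcd provides this primitive for real; below it's a plain dict plus timestamps, purely to illustrate acquire, renew, expiry, and fencing of a stale leader:

```python
class LeaseStore:
    """In-memory stand-in for an etcd TTL lease (illustrative only)."""
    def __init__(self):
        self._leases = {}  # key -> (holder, expires_at)

    def try_acquire(self, key, holder, ttl, now):
        current = self._leases.get(key)
        if current is None or current[1] <= now:   # free, or TTL expired
            self._leases[key] = (holder, now + ttl)
            return True
        return False

    def renew(self, key, holder, ttl, now):
        current = self._leases.get(key)
        if current and current[0] == holder and current[1] > now:
            self._leases[key] = (holder, now + ttl)
            return True
        return False  # lease lost: holder must stop scheduling immediately

store = LeaseStore()
assert store.try_acquire("/scheduler/leader", "node-a", ttl=30, now=0)       # node-a leads
assert not store.try_acquire("/scheduler/leader", "node-b", ttl=30, now=10)  # node-b stands by
assert store.renew("/scheduler/leader", "node-a", ttl=30, now=20)            # keepalive
# node-a dies: no more renewals, so the lease expires at t=50
assert store.try_acquire("/scheduler/leader", "node-b", ttl=30, now=51)      # failover
assert not store.renew("/scheduler/leader", "node-a", ttl=30, now=52)        # old leader fenced
```

The last assertion is the split-brain protection: a leader that cannot renew must treat itself as a standby.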

Interviewer: What if the old leader enqueued a task, then dies, and the new leader polls the same task and enqueues it again?

Candidate: This is the core race condition. Two mitigations work together.

First, the scheduler updates the task's status to QUEUED in the same database transaction as the enqueueing decision. The new leader's poll query only fetches PENDING tasks — a task already marked QUEUED won't appear again.

Second, workers deduplicate on the task_id before executing. Kafka's delivery guarantees alone can't ensure a message arrives exactly once, so if a message is enqueued twice, an idempotency check keyed on the task_id ensures the second copy is discarded before a worker executes it.
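The first mitigation hinges on the status guard in the UPDATE's WHERE clause: a second scheduler's claim simply matches zero rows. With SQLite standing in for the Task Store:

```python
import sqlite3

# SQLite stand-in for the Task Store; the pattern is the same in PostgreSQL.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tasks (task_id TEXT PRIMARY KEY, status TEXT NOT NULL)")
db.execute("INSERT INTO tasks VALUES ('t1', 'PENDING')")

def claim_for_enqueue(task_id: str) -> bool:
    """Atomically flip PENDING -> QUEUED; only one caller can win."""
    cur = db.execute(
        "UPDATE tasks SET status = 'QUEUED' "
        "WHERE task_id = ? AND status = 'PENDING'", (task_id,))
    db.commit()
    return cur.rowcount == 1   # 0 rows means someone else already claimed it

assert claim_for_enqueue("t1") is True    # old leader's enqueue wins
assert claim_for_enqueue("t1") is False   # new leader's poll becomes a no-op
```

The database, not the schedulers, arbitrates the race.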


Worker Heartbeats and Failure Detection

Interviewer: A worker claims a task and starts executing it. Then it crashes. How does the system detect this and recover?

Candidate: Three mechanisms work together: heartbeats, a reaper process, and task timeouts.

Heartbeats:

While executing a task, the worker sends a heartbeat to the Task Store every 15 seconds:

sql
UPDATE tasks
SET heartbeat_at = NOW()
WHERE task_id = :task_id
  AND worker_id = :worker_id
  AND status = 'RUNNING';

This keeps heartbeat_at fresh. A task whose heartbeat_at is older than 60 seconds (4 missed heartbeats) is presumed orphaned.

The Reaper process:

A background process (running on the leader scheduler or as a separate service) periodically scans for orphaned tasks:

sql
SELECT task_id FROM tasks
WHERE status = 'RUNNING'
  AND heartbeat_at < NOW() - INTERVAL '60 seconds';

For each orphaned task:

  • If retry_count < max_retries: reset status to PENDING, increment retry_count, set next_retry_at with backoff
  • If retry_count >= max_retries: mark status FAILED, move to Dead Letter Queue
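The reaper's two branches can be sketched as a pair of guarded updates (SQLite standing in for the Task Store; thresholds are the ones from the text):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE tasks (
    task_id TEXT PRIMARY KEY, status TEXT, heartbeat_at REAL,
    retry_count INTEGER, max_retries INTEGER)""")
db.executemany("INSERT INTO tasks VALUES (?, 'RUNNING', ?, ?, 3)", [
    ("fresh",    100.0, 0),   # heartbeat 10s ago: leave alone
    ("orphan",    30.0, 0),   # heartbeat 80s ago: re-queue with retry budget
    ("exhausted", 30.0, 3),   # stale AND out of retries: fail, send to DLQ
])

def reap(now: float, orphan_after: float = 60.0) -> None:
    cutoff = now - orphan_after
    # branch 1: orphaned but retryable -> back to PENDING
    db.execute("""UPDATE tasks SET status = 'PENDING', retry_count = retry_count + 1
                  WHERE status = 'RUNNING' AND heartbeat_at < ?
                    AND retry_count < max_retries""", (cutoff,))
    # branch 2: whatever is still RUNNING and stale has no retries left
    db.execute("""UPDATE tasks SET status = 'FAILED'
                  WHERE status = 'RUNNING' AND heartbeat_at < ?""", (cutoff,))
    db.commit()

reap(now=110.0)
statuses = dict(db.execute("SELECT task_id, status FROM tasks"))
assert statuses == {"fresh": "RUNNING", "orphan": "PENDING", "exhausted": "FAILED"}
```

Ordering matters: the retryable branch runs first, so the second UPDATE only sees tasks that truly exhausted their budget.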

The key question — is the worker really dead?

Silence is ambiguous. A worker might be alive but slow, stuck in a GC pause, or cut off by a network partition. The system cannot distinguish any of these from a crash; it can only bound the damage. A distributed lock with a time-to-live reduces the chance that another worker picks up the same job while the original is still working, and the lock expires on its own if the worker really did crash. The net guarantee remains at-least-once execution.

The 60-second window is a deliberate choice. Short enough to recover failed tasks promptly. Long enough to avoid falsely reaping a worker that's alive but momentarily unresponsive. The right value depends on the SLA for task recovery.


Execution Guarantees: At-Least-Once vs Exactly-Once

Interviewer: Can you guarantee a task runs exactly once?

Candidate: Exactly-once in a distributed system is genuinely hard. Let me explain why, and what the practical answer is.

The Two Generals' Problem shows that two parties communicating over an unreliable link can never achieve certainty that a message was received. Distributed schedulers therefore aim for at-least-once execution and rely on application-level idempotency for correctness.

The failure scenario that makes exactly-once hard:

plaintext
1. Worker W1 claims task T and begins execution
2. W1 completes the task successfully
3. W1 sends "completed" status update to Task Store
4. Network drops the update — Task Store never receives it
5. Reaper detects W1's heartbeat stopped (W1 may have crashed or the update failed)
6. Reaper re-queues task T as PENDING
7. Worker W2 picks up T and executes it again

Task T ran twice. From the Task Store's perspective, T only ran once — but in reality it executed twice.

The honest answer: the scheduler guarantees at-least-once execution. The task may run more than once in failure scenarios. Making tasks safe to run multiple times — idempotent — is the application's responsibility.

How to make tasks idempotent:

  • Give each task execution a unique execution_id. If a task uses this ID as part of any external operation (database write, API call), a duplicate run with the same ID produces the same result as the first.
  • For side effects (sending an email, charging a card), store a record of the operation with the task_id. Before executing, check whether this task has already produced this side effect.
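The second bullet, recording the side effect keyed by the task, looks like this in miniature. In production the record lives in a database with a unique constraint, not a Python set, so the check-and-record step is atomic:

```python
# Illustrative idempotency guard; names (send_newsletter, performed) are ours.
performed: set[str] = set()
emails_sent: list[str] = []

def send_newsletter(task_id: str, recipient: str) -> None:
    if task_id in performed:       # this task already produced its effect
        return                     # duplicate run becomes a safe no-op
    performed.add(task_id)
    emails_sent.append(recipient)  # the real side effect (send, charge, ...)

send_newsletter("t-42", "user@example.com")
send_newsletter("t-42", "user@example.com")   # reaper-triggered re-run
assert emails_sent == ["user@example.com"]    # effect happened exactly once
```

The scheduler may deliver the task twice; the application makes the second delivery harmless.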

Interviewer: Can the scheduler do anything to minimise duplicate runs?

Candidate: Yes — use a distributed lock at execution time.

When a worker claims a task, it acquires a lock in Redis with the task_id as the key and a TTL equal to the task's expected duration plus a buffer. A second worker trying to claim the same task fails to acquire the lock and skips it.

plaintext
Worker claims task T:
  SET lock:task:{task_id} {worker_id} NX EX 120
  NX = only set if key doesn't exist
  EX 120 = expires in 120 seconds
 
  └─ If SET succeeds: claim is exclusive, proceed with execution
  └─ If SET fails: another worker holds the lock, skip this task

This eliminates most duplicate runs — but not all. If a worker crashes after completing the task but before releasing the lock, the lock's TTL will eventually expire and the task could be claimed again. At-least-once remains the honest guarantee.
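The SET NX EX semantics, including the TTL loophole just described, can be modelled with an in-memory stand-in for Redis (illustrative only; real deployments use the actual Redis command):

```python
# key -> (owner, expires_at); mimics Redis SET key value NX EX ttl
locks: dict[str, tuple[str, float]] = {}

def set_nx_ex(key: str, owner: str, ttl: float, now: float) -> bool:
    entry = locks.get(key)
    if entry is None or entry[1] <= now:  # absent or expired: NX succeeds
        locks[key] = (owner, now + ttl)
        return True
    return False                          # someone holds a live lock: skip

assert set_nx_ex("lock:task:t1", "w1", ttl=120, now=0)        # w1 claims task
assert not set_nx_ex("lock:task:t1", "w2", ttl=120, now=60)   # w2 skips it
# w1 crashes after finishing but before releasing; the TTL eventually lapses
assert set_nx_ex("lock:task:t1", "w2", ttl=120, now=121)      # reclaim possible
```

The final assertion is the residual window: the lock narrows duplicate execution but cannot close it, which is why at-least-once stays the stated guarantee.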

The at-least-once vs exactly-once section is where senior interviewers probe hardest — they'll ask about the Two Generals' Problem, why idempotency is the application's responsibility, and what specific Redis NX EX semantics prevent. Having a fluent, connected explanation of all three is the kind of thing that separates offer-level answers. Mockingly.ai is built to surface exactly where that fluency breaks down.


Recurring Tasks and Clock Drift

Interviewer: How do you handle a cron task — say "send the weekly newsletter every Monday at 8 AM"?

Candidate: Recurring tasks are stored with their cron expression alongside a next_run_at timestamp.

sql
CREATE TABLE recurring_tasks (
    task_id         UUID PRIMARY KEY,
    cron_expression TEXT NOT NULL,       -- e.g. "0 8 * * MON"
    task_type       TEXT NOT NULL,
    payload         JSONB,
    next_run_at     TIMESTAMPTZ NOT NULL,
    last_run_at     TIMESTAMPTZ,
    is_active       BOOLEAN DEFAULT TRUE
);

When the Scheduler Service polls for due tasks, it also scans recurring_tasks WHERE next_run_at <= NOW(). For each due recurring task:

  1. Enqueue a one-off task execution into the task queue
  2. Compute the next run time from the cron expression
  3. Update next_run_at in the same transaction

Updating next_run_at atomically with the enqueueing step prevents the same occurrence from being scheduled twice — even if two scheduler nodes race.
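Step 2, computing the next run, is plain calendar arithmetic for the "every Monday at 8 AM" example. This sketch uses datetime directly rather than a full cron parser (which a real system would use):

```python
from datetime import datetime, timedelta, timezone

def next_monday_8am(after: datetime) -> datetime:
    """Next occurrence of Monday 08:00 strictly after `after` (UTC)."""
    candidate = after.replace(hour=8, minute=0, second=0, microsecond=0)
    days_ahead = (0 - after.weekday()) % 7        # Monday is weekday 0
    candidate += timedelta(days=days_ahead)
    if candidate <= after:                        # this week's slot already passed
        candidate += timedelta(days=7)
    return candidate

now = datetime(2024, 1, 3, 10, 0, tzinfo=timezone.utc)   # a Wednesday
assert next_monday_8am(now) == datetime(2024, 1, 8, 8, 0, tzinfo=timezone.utc)
# exactly on the boundary rolls to the following week, never the same instant
boundary = datetime(2024, 1, 8, 8, 0, tzinfo=timezone.utc)
assert next_monday_8am(boundary) == datetime(2024, 1, 15, 8, 0, tzinfo=timezone.utc)
```

Doing everything in UTC here previews the daylight-saving pitfall discussed below.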

The clock drift problem:

Cron expressions reference wall-clock time. Server clocks in a distributed system drift. A job scheduled for 8:00 AM exactly might not be picked up until 8:00:02 AM due to polling frequency. Or a server whose clock is 5 minutes fast might schedule a task 5 minutes early.

Mitigations:

  • NTP sync — all servers sync to a common NTP source. Reduces drift to milliseconds.
  • Polling interval vs precision — if the scheduler polls every 10 seconds, the precision of task firing is ±10 seconds. Document this as the SLA, not a bug.
  • Next-run-at in UTC — always store and compare timestamps in UTC. Daylight saving transitions can silently shift cron times by an hour if you work in local time.

Task Dependencies (DAGs)

Interviewer: Task B should only run after Task A completes. How do you model this?

Candidate: Task dependencies form a Directed Acyclic Graph (DAG). Each edge represents a dependency: "A must complete before B starts."

sql
CREATE TABLE task_dependencies (
    dependent_task_id   UUID REFERENCES tasks(task_id),
    dependency_task_id  UUID REFERENCES tasks(task_id),
    PRIMARY KEY (dependent_task_id, dependency_task_id)
);

When a task completes, the scheduler checks if any other tasks have this as a dependency and whether all their dependencies are now satisfied:

sql
-- Find tasks whose all dependencies are now complete
SELECT t.task_id
FROM tasks t
WHERE t.status = 'PENDING'
  AND NOT EXISTS (
    SELECT 1 FROM task_dependencies d
    JOIN tasks dep ON d.dependency_task_id = dep.task_id
    WHERE d.dependent_task_id = t.task_id
      AND dep.status != 'COMPLETED'
  );

Any task returned by this query is now eligible to run — its entire dependency chain is satisfied.

Cycle detection: before accepting a new task with dependencies, run a cycle detection check on the submitted DAG. A dependency cycle would mean tasks waiting on each other forever. Reject submissions where the dependency graph contains cycles.
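One standard way to run that check is Kahn's algorithm: attempt a topological sort, and if it cannot consume every node, the graph has a cycle. A minimal sketch:

```python
from collections import defaultdict, deque

def has_cycle(edges: list[tuple[str, str]]) -> bool:
    """Edge (a, b) means 'a must complete before b'. Returns True on a cycle."""
    graph = defaultdict(list)
    indegree: dict[str, int] = defaultdict(int)
    nodes = set()
    for a, b in edges:
        graph[a].append(b)
        indegree[b] += 1
        nodes.update((a, b))
    ready = deque(n for n in nodes if indegree[n] == 0)  # no unmet dependencies
    visited = 0
    while ready:
        node = ready.popleft()
        visited += 1
        for nxt in graph[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    return visited != len(nodes)   # leftover nodes are stuck waiting on each other

assert not has_cycle([("A", "B"), ("B", "C")])            # valid chain
assert has_cycle([("A", "B"), ("B", "C"), ("C", "A")])    # deadlocked forever
```

Run this at submission time and reject the request if it returns True.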


Dead Letter Queue

Every production scheduler needs a place for tasks that have exhausted all retries.

Interviewer: What happens to a task that fails 5 times and hits its retry limit?

Candidate: It moves to a Dead Letter Queue (DLQ).

The DLQ is a separate persistent store (a Kafka topic, a database table, or an S3 path) where permanently failed tasks land with their full execution history: the payload, all retry timestamps, and the last error message.

plaintext
task → FAILED (max retries exhausted)
  ▼
Write to DLQ: { task_id, task_type, payload, error, retry_history }
  ▼
Alert on-call engineer (PagerDuty / Slack)
  ▼ (human or automated)
Investigate root cause, fix, manually re-enqueue if appropriate

Without a DLQ, permanently failed tasks are just lost — or worse, they keep retrying forever and consume worker capacity. The DLQ surfaces failures for remediation and provides an audit trail.


Scaling the Scheduler

Interviewer: How does the system scale to handle 10× the task volume?

Candidate: The components scale differently.

Workers scale horizontally. Add more worker nodes, subscribe them to the Kafka consumer groups. Kafka automatically redistributes partitions across new consumers. Worker capacity scales linearly with node count — this is the easy part.

The Scheduler Service has a leader constraint. Only one scheduler processes the task table at a time. For most scales this is fine — a single leader can handle millions of task rows per poll cycle. If the polling itself becomes a bottleneck, partition the task table by a hash of task_id and have each scheduler instance own a subset of partitions. The leader election now operates per-partition.

The Task Store scales via read replicas and partitioning. The Scheduler Service's polls are read-heavy; worker status updates are write-heavy. Route reads to replicas, writes to the primary. Partition by task_id hash for horizontal scale if needed.

Queue depth as the pressure valve. The task queues absorb submission spikes. If 1 million tasks arrive in a minute, the queue buffers them. Workers drain at their own pace. Monitor queue depth — sustained growth means workers are under-provisioned.


Observability

A task scheduler without observability is a black box. Things will go wrong. You need to see what.

Key metrics to track:

  • Queue depth by priority — is work accumulating faster than workers process it?
  • Task execution latency — p50, p95, p99 from scheduled_at to started_at
  • Failed task rate by task type — which task types are flaky?
  • Orphaned task count — how many tasks are being reaped per hour?
  • DLQ depth — is the failure pile growing?

Alerting rules:

  • Queue depth sustained above threshold for 5 minutes → scale workers or investigate blocked tasks
  • DLQ depth growing → page on-call with task type breakdown
  • No Scheduler Service leader for more than 60 seconds → page on-call

Common Interview Follow-ups

"What if a task takes much longer than expected and the lock expires while it's still running?"

The worker's distributed lock TTL is set to the task's expected duration plus a buffer — say, 2× the p99 execution time for that task type. If a task genuinely runs past the TTL, the lock expires. The Reaper detects the missing heartbeat and re-queues the task.

This is a case where at-least-once delivery is visible to the application: the task runs twice. The correct mitigation is job-level checkpointing for long-running tasks — the worker periodically saves its progress and, if restarted, resumes from the last checkpoint rather than restarting from scratch.
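Checkpointing in miniature: the worker persists progress after each unit of work, so a re-run resumes instead of repeating. The checkpoint store is a dict here; a real one must be durable (the Task Store, Redis, or S3):

```python
# Illustrative names; `checkpoints` stands in for a durable progress store.
checkpoints: dict[str, int] = {}
processed: list[int] = []

def process_batches(task_id: str, batches: list[int], crash_after=None):
    start = checkpoints.get(task_id, 0)         # resume from last checkpoint
    for i in range(start, len(batches)):
        if crash_after is not None and i == crash_after:
            raise RuntimeError("worker died")   # simulated mid-task crash
        processed.append(batches[i])
        checkpoints[task_id] = i + 1            # persist progress per batch

batches = [10, 20, 30, 40]
try:
    process_batches("t1", batches, crash_after=2)   # dies before the 3rd batch
except RuntimeError:
    pass
process_batches("t1", batches)                      # retry resumes at index 2
assert processed == [10, 20, 30, 40]                # no batch processed twice
```

The duplicate *task* run still happens; checkpointing just makes it cheap and side-effect-free.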

"How do you handle priority starvation — low-priority tasks never running because high-priority tasks keep arriving?"

Pure priority queues can starve low-priority work indefinitely. Two mitigations:

First, aging: tasks that have waited longer than a threshold get their effective priority boosted. A low-priority task that's been waiting for an hour gets treated as normal priority.

Second, weighted fair queuing: workers don't always pull from the highest-priority queue. They pull from high 70% of the time, normal 25%, low 5%. Low-priority tasks make progress, just slower.
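A worker's queue-selection step under weighted fair queuing might look like the sketch below. It uses a deterministic 14/5/1 cycle (the 70/25/5 ratio over 20 pulls) for clarity; a production worker would typically randomise, and all names are illustrative:

```python
from itertools import cycle

# 70% high, 25% normal, 5% low, expressed as a repeating 20-slot schedule.
schedule = cycle(["high"] * 14 + ["normal"] * 5 + ["low"] * 1)

def next_queue_to_pull(queues):
    """Try a few schedule slots, then fall back to any non-empty queue."""
    for _ in range(3):
        choice = next(schedule)
        if queues.get(choice):
            return choice
    return next((q for q, tasks in queues.items() if tasks), None)

queues = {"high": ["h1"], "normal": ["n1"], "low": ["l1"]}
pulled = []
while any(queues.values()):
    q = next_queue_to_pull(queues)
    pulled.append(queues[q].pop())
assert "l1" in pulled          # low-priority work still makes progress
```

The fallback matters: without it, a worker could idle on empty high-priority slots while low-priority work waits.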

"How does the system handle a task that should run at exactly 12:00:00 AM but the scheduler is under load?"

It doesn't — and that's an acceptable trade-off to name explicitly. A polling-based scheduler has inherent jitter equal to the polling interval. A task scheduled for 12:00:00 might actually be enqueued at 12:00:02 if the poll happens every 2 seconds.

For tasks where sub-second precision matters (financial settlement, regulatory reporting), use a separate high-precision scheduler that polls more frequently and runs its tasks on a dedicated worker pool. Accepting that polling-based schedulers have second-level precision is an important act of design honesty.

"How do you ensure the scheduler doesn't thundering-herd the task store on startup?"

When a new leader is elected after a failover, it might find a backlog of PENDING tasks and try to enqueue all of them at once. Rate-limit the enqueueing: process at most N tasks per second during startup. Use a token-bucket rate limiter to spread the load. Workers will drain the queue as fast as they can regardless — there's no benefit to enqueuing everything at once.
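A token bucket for the startup drain is a few lines; the rate and capacity below are illustrative values, and the clock is passed in explicitly to keep the sketch testable:

```python
class TokenBucket:
    """Minimal token bucket: refills continuously, spends one token per enqueue."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate              # tokens added per second
        self.capacity = capacity     # maximum burst size
        self.tokens = capacity
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                  # caller waits before enqueuing more

bucket = TokenBucket(rate=100.0, capacity=100.0)   # ~100 enqueues/second
sent = sum(1 for _ in range(1000) if bucket.allow(now=0.0))
assert sent == 100                                  # burst capped at capacity
assert bucket.allow(now=0.5)                        # refilled ~50 tokens by t=0.5
```

The new leader loops over its PENDING backlog and enqueues only when allow() returns True, sleeping briefly otherwise.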

These follow-up questions — priority starvation, lock expiry during long tasks, thundering herd on startup — are the ones that reveal real distributed systems intuition. If any of them feel uncertain, that's worth closing before the actual interview. Mockingly.ai runs system design simulations for engineers targeting roles at Amazon, Google, Microsoft, and Airbnb where exactly these follow-ups are standard.


Quick Interview Checklist

  • ✅ Clarified scope — one-off and recurring tasks, priority, dependencies, failure handling
  • ✅ Back-of-the-envelope — 500 submissions/sec is modest; 50K concurrent tasks is the challenge
  • ✅ Durability first — Task Store written before client acknowledgement
  • ✅ Task Store schema — key fields: status, worker_id, heartbeat_at, next_retry_at
  • ✅ Partial index on WHERE status IN ('PENDING', 'QUEUED') — keeps index small and fast
  • ✅ Push vs pull — chose pull with clear justification (worker variance, resilience)
  • ✅ Leader election — etcd/ZooKeeper lease, Raft-based consensus, split-brain prevention
  • ✅ Atomic status update + enqueue — prevents duplicate scheduling after leader failover
  • ✅ Heartbeat mechanism — 15-second updates, 60-second orphan window, Reaper process
  • ✅ At-least-once acknowledged honestly — exactly-once requires application idempotency
  • ✅ Distributed lock (Redis NX EX) — minimises duplicate execution under normal conditions
  • ✅ Recurring tasks — cron expression + next_run_at, atomic update on schedule
  • ✅ Clock drift — NTP sync, UTC storage, polling precision as documented SLA
  • ✅ DAGs — dependency table, completion trigger checks, cycle detection on submission
  • ✅ Dead Letter Queue — final destination after max retries, alerting, audit trail
  • ✅ Scaling — workers horizontal, scheduler per-partition, queue as pressure valve
  • ✅ Observability — queue depth, execution latency, orphaned task count, DLQ depth

Conclusion

Designing a distributed task scheduler is fundamentally about managing ambiguity.

The network will lie to you. Workers will disappear without warning. The scheduler's clock will drift from reality. Two scheduler nodes might both believe they are the leader for a brief moment. The question isn't whether these failures happen — they will — it's whether your design degrades gracefully when they do.

In distributed systems, the scheduler is less about assigning work and more about managing doubt. If you treat this as a queueing problem, you miss the signal. If you treat it as a coordination problem, you're on the right path.

The design pillars:

  1. Durability before acknowledgement — write to the Task Store before telling the client the task is accepted; never acknowledge what you haven't persisted
  2. Leader election for single scheduling authority — prevents duplicate task enqueueing; etcd or ZooKeeper provide the consensus primitive
  3. Pull-based workers — natural load balancing, resilient to scheduler restarts, worker capacity self-reports through pull rate
  4. Heartbeats and a Reaper — the only reliable failure detection mechanism in an asynchronous distributed system
  5. At-least-once with idempotency — accept the honest guarantee; demand idempotent task implementations from callers
  6. Distributed locks to minimise duplicates — Redis NX EX reduces duplicate execution without pretending to eliminate it
  7. Dead Letter Queue with alerting — permanently failed tasks need a home and a human in the loop

Frequently Asked Questions

What is a distributed task scheduler?

A distributed task scheduler accepts task submissions from clients and executes them reliably across a fleet of worker nodes — at the right time, without losing tasks when individual nodes fail.

How it differs from a single-machine cron job:

  1. Fault tolerant — if the scheduling machine dies, tasks are not lost; a new node takes over via leader election
  2. Horizontally scalable — add worker nodes to increase throughput; no single machine is the execution bottleneck
  3. Observable — task status (pending, running, completed, failed) is queryable at any time across the entire fleet
  4. Guaranteed delivery — tasks submitted to the system are persisted before acknowledgement; they cannot be silently dropped

What is the difference between a task scheduler and a message queue?

A message queue (Kafka, SQS, RabbitMQ) delivers messages to consumers as fast as possible — it has no concept of scheduled future execution. A task scheduler holds tasks until their trigger time and only then enqueues them for execution.

plaintext
                     Message Queue               Task Scheduler
When does work run?  Immediately on enqueue      At a scheduled future time
Time awareness       None                        Core feature
Recurring jobs       Not supported               First-class (cron expressions)
Retry logic          Varies by implementation    Built-in with backoff
Failure tracking     Message requeue             Status machine (PENDING → RUNNING → FAILED)

A task scheduler uses a message queue internally — the queue is the buffer between scheduling decisions and worker execution. A queue alone cannot replace a scheduler.


What execution guarantee does a distributed task scheduler provide?

Most production distributed schedulers provide at-least-once execution — a task will run at least once but may run more than once in failure scenarios.

Why exactly-once is theoretically impossible:

plaintext
1. Worker W1 claims task T and begins execution
2. W1 completes the task successfully
3. W1 sends "completed" status update to Task Store
4. Network drops the update; the Task Store still shows T as RUNNING
5. W1 stops heartbeating (it believes the task is done), so the Reaper sees a stale heartbeat
6. Reaper re-queues task T as PENDING
7. Worker W2 executes T again

Task T ran twice. The Task Store saw only one completion — but two executions happened.

The practical solution:

  1. Accept at-least-once as the honest guarantee
  2. Design task implementations to be idempotent — running the same task twice produces the same result as running it once
  3. Use a unique execution_id per run so external systems can detect and deduplicate re-runs
  4. Use a Redis distributed lock (SET NX EX) to minimise duplicates under normal conditions — but not as a guarantee
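The execution_id deduplication pattern can be sketched in a few lines of Python. This is illustrative only: `processed` stands in for a durable store (e.g. a table with a unique constraint on execution_id), and `charge_account` is a made-up task body.

```python
# Minimal sketch of an idempotent task guarded by a per-run execution_id.
# All names are illustrative; "processed" represents a durable dedup store.

processed = set()                      # execution_ids already applied
account_balance = {"acct-1": 100}

def charge_account(execution_id: str, account: str, amount: int) -> bool:
    """Apply the charge once; re-runs with the same execution_id are no-ops."""
    if execution_id in processed:
        return False                   # duplicate delivery: skip the side effect
    account_balance[account] -= amount
    processed.add(execution_id)
    return True

# The scheduler may deliver the same run twice (at-least-once):
charge_account("run-42", "acct-1", 30)   # applied
charge_account("run-42", "acct-1", 30)   # deduplicated
```

In production the dedup check and the side effect must be atomic (same transaction); a crash between them reintroduces the duplicate.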

What is leader election and why does a distributed scheduler need it?

Leader election is a distributed consensus mechanism where a group of nodes agrees on exactly one "leader" at any time. The leader performs a privileged operation — in a scheduler, polling the task store and enqueueing due tasks.

Why exactly one leader is a correctness requirement:

  1. If two scheduler nodes both poll for PENDING tasks simultaneously, they may each find and enqueue the same task
  2. Both workers receive the task and execute it — a duplicate
  3. Leader election ensures that only one scheduler node makes scheduling decisions at any moment

How it works with etcd (Raft consensus):

  1. All scheduler nodes race to acquire a lease key: /scheduler/leader with a 30-second TTL
  2. The winning node becomes leader — it renews the lease every 10 seconds to keep it alive
  3. If the leader dies, its lease expires within 30 seconds
  4. Remaining nodes race again — a new leader is elected automatically
  5. Split-brain protection: the leader checks its own lease validity before each scheduling cycle. An expired lease means stop scheduling immediately

How does worker failure detection work in a task scheduler?

Worker failure detection uses a heartbeat mechanism: running workers periodically update a timestamp on the task record; a background Reaper process scans for tasks whose heartbeat has gone stale.

The three-component mechanism:

  1. Heartbeats — the worker sends UPDATE tasks SET heartbeat_at = NOW() WHERE task_id = X AND worker_id = Y every 15 seconds while executing
  2. Orphan detection — the Reaper runs periodically and queries: SELECT task_id FROM tasks WHERE status = 'RUNNING' AND heartbeat_at < NOW() - INTERVAL '60 seconds'
  3. Recovery — for each orphaned task:
    • If retry_count < max_retries: reset to PENDING, increment retry_count, set exponential next_retry_at
    • If retry_count >= max_retries: mark FAILED, move to Dead Letter Queue
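The Reaper's scan-and-recover pass can be sketched against SQLite (standing in for the production database). The schema is trimmed to the columns the queries need, and the exponential `next_retry_at` step is omitted for brevity.

```python
import sqlite3
import time

# Sketch of the Reaper's orphan scan. Table/column names follow the
# queries in the text; SQLite stands in for Postgres.

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE tasks (
    task_id TEXT PRIMARY KEY, status TEXT,
    heartbeat_at REAL, retry_count INTEGER, max_retries INTEGER)""")

now = time.time()
# t1: fresh heartbeat; t2: stale with retries left; t3: stale and exhausted
db.executemany("INSERT INTO tasks VALUES (?,?,?,?,?)", [
    ("t1", "RUNNING", now - 10, 0, 3),
    ("t2", "RUNNING", now - 120, 1, 3),
    ("t3", "RUNNING", now - 120, 3, 3),
])

def reap(orphan_window: float = 60.0) -> None:
    """Requeue orphans with retries left; fail the exhausted ones (DLQ)."""
    cutoff = time.time() - orphan_window
    db.execute("""UPDATE tasks SET status='PENDING', retry_count=retry_count+1
                  WHERE status='RUNNING' AND heartbeat_at < ?
                    AND retry_count < max_retries""", (cutoff,))
    db.execute("""UPDATE tasks SET status='FAILED'
                  WHERE status='RUNNING' AND heartbeat_at < ?
                    AND retry_count >= max_retries""", (cutoff,))

reap()
```

After `reap()`, t1 is untouched (fresh heartbeat), t2 is back to PENDING with its retry count incremented, and t3 is FAILED and bound for the Dead Letter Queue.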

Why 60 seconds for the orphan window:

Too short — workers in GC pauses or brief network blips get falsely reaped, causing duplicate execution. Too long — genuinely dead workers hold tasks for too long, delaying recovery. 60 seconds (4 missed 15-second heartbeats) is a common production default.


How do you prevent the same task from running twice?

Two mechanisms work in combination to minimise duplicate task execution — though truly eliminating duplicates is impossible in a distributed system.

Mechanism 1 — Atomic status update at scheduling time:

  1. The Scheduler Service marks the task QUEUED in the same database transaction as the enqueue decision
  2. A second scheduler node polling for PENDING tasks will not find this task — it's already QUEUED
  3. This prevents the most common duplicate: two scheduler leaders racing to enqueue the same task

Mechanism 2 — Distributed lock at execution time:

  1. When a worker claims a task, it runs: SET lock:task:{task_id} {worker_id} NX EX 120
  2. NX = only set if the key doesn't exist
  3. EX 120 = expires in 120 seconds (task duration + buffer)
  4. A second worker attempting the same task gets nil — it skips the task
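The NX/EX semantics can be modelled in memory to show why the second worker skips the task. This is a stand-in for Redis, not a client library; with redis-py the real call would be roughly `r.set(key, worker_id, nx=True, ex=120)`.

```python
# In-memory stand-in for Redis SET key value NX EX ttl. Illustrative only.

locks = {}  # key -> (owner, expires_at)

def set_nx_ex(key: str, owner: str, ttl: float, now: float):
    """Return the owner on success, None if the lock is held (Redis nil)."""
    held = locks.get(key)
    if held is not None and now < held[1]:
        return None            # key exists and hasn't expired: caller skips
    locks[key] = (owner, now + ttl)
    return owner

assert set_nx_ex("lock:task:42", "worker-1", ttl=120, now=0) == "worker-1"
assert set_nx_ex("lock:task:42", "worker-2", ttl=120, now=5) is None   # skipped
# After the TTL expires the task can be re-claimed (at-least-once in action):
assert set_nx_ex("lock:task:42", "worker-2", ttl=120, now=130) == "worker-2"
```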

What these mechanisms don't cover:

If a worker completes a task but crashes before the lock is released, the TTL eventually expires and another worker can re-claim it. At-least-once remains the honest guarantee. Tasks must be designed to be idempotent.


How do recurring cron tasks work in a distributed scheduler?

Recurring tasks are stored with their cron expression and a next_run_at timestamp. The Scheduler Service polls for tasks where next_run_at <= NOW(), enqueues a one-off execution, and atomically updates next_run_at to the next scheduled time.

The schema:

sql
CREATE TABLE recurring_tasks (
    task_id         UUID PRIMARY KEY,
    cron_expression TEXT NOT NULL,       -- e.g. "0 8 * * MON"
    task_type       TEXT NOT NULL,
    payload         JSONB,
    next_run_at     TIMESTAMPTZ NOT NULL,
    last_run_at     TIMESTAMPTZ,
    is_active       BOOLEAN DEFAULT TRUE
);

The atomic update is critical — both the enqueue and the next_run_at update must happen in the same transaction. If two scheduler nodes race, only one succeeds; the other finds the task already updated and skips it.
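The claim-and-advance transaction can be sketched with SQLite, using a fixed interval in place of cron parsing (an illustrative simplification). The compare-and-set in the WHERE clause is what makes a racing node's update a no-op.

```python
import sqlite3

# Sketch of atomic claim-and-advance for recurring tasks. SQLite stands in
# for the production database; interval_s replaces cron-expression parsing.

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE recurring_tasks (
    task_id TEXT PRIMARY KEY, next_run_at REAL, interval_s REAL)""")
db.execute("INSERT INTO recurring_tasks VALUES ('daily-report', 100.0, 86400.0)")

def claim_due(now: float) -> list:
    """Enqueue due tasks. The UPDATE only succeeds if next_run_at still holds
    its old value; a node that lost the race sees rowcount 0 and skips."""
    enqueued = []
    due = db.execute(
        "SELECT task_id, next_run_at, interval_s FROM recurring_tasks"
        " WHERE next_run_at <= ?", (now,)).fetchall()
    for task_id, next_run, interval in due:
        cur = db.execute(
            "UPDATE recurring_tasks SET next_run_at = ?"
            " WHERE task_id = ? AND next_run_at = ?",    # compare-and-set
            (next_run + interval, task_id, next_run))
        if cur.rowcount == 1:
            enqueued.append(task_id)  # production: insert one-off task here, same txn
    return enqueued

first = claim_due(now=200.0)    # fires: next_run_at 100 <= 200
second = claim_due(now=200.0)   # no-op: next_run_at already advanced
```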

Clock drift mitigations:

  1. NTP sync — all servers sync to a common NTP source; reduces drift to milliseconds
  2. UTC storage — always store and compare timestamps in UTC; avoids daylight saving time shifts
  3. Polling precision as SLA — a scheduler polling every 10 seconds fires tasks up to 10 seconds after their scheduled time; document this as the SLA, not a bug

What is a Dead Letter Queue in a task scheduler?

A Dead Letter Queue (DLQ) is the final destination for tasks that have exhausted all retry attempts without successfully completing. It preserves permanently failed tasks for investigation rather than discarding them silently.

What the DLQ stores per failed task:

  1. The original task payload
  2. All retry timestamps and retry counts
  3. The error message and stack trace from the last failure
  4. The full execution history (which workers attempted it, when)

Why a DLQ matters:

  1. Without it: permanently failed tasks disappear silently — engineers never know a class of tasks is broken
  2. With it: a growing DLQ triggers an alert; on-call engineers can investigate the root cause and re-enqueue fixed tasks manually
  3. Audit trail: for compliance and debugging, the full history of what was attempted is preserved

The alerting rule: if DLQ depth is growing, page on-call with a breakdown by task_type. A sudden DLQ spike usually means a downstream dependency is broken — not the tasks themselves.


How do task dependencies (DAGs) work in a distributed scheduler?

Task dependencies form a Directed Acyclic Graph (DAG) where each edge means "task A must complete before task B starts." A dependency table stores these edges, and a completion trigger checks whether all dependencies for a task are satisfied.

The dependency schema:

sql
CREATE TABLE task_dependencies (
    dependent_task_id   UUID REFERENCES tasks(task_id),
    dependency_task_id  UUID REFERENCES tasks(task_id),
    PRIMARY KEY (dependent_task_id, dependency_task_id)
);

Eligibility check on task completion:

sql
SELECT t.task_id FROM tasks t
WHERE t.status = 'PENDING'
  AND NOT EXISTS (
    SELECT 1 FROM task_dependencies d
    JOIN tasks dep ON d.dependency_task_id = dep.task_id
    WHERE d.dependent_task_id = t.task_id
      AND dep.status != 'COMPLETED'
  );

Any task this query returns has all its dependencies satisfied — it is now eligible to run.

Cycle detection on submission: a dependency cycle means tasks wait on each other forever. Before accepting a task submission with dependencies, run a depth-first search on the proposed graph and reject any submission that introduces a cycle.
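A minimal three-colour DFS cycle check, assuming the proposed graph is passed as a task-to-dependencies mapping (names illustrative):

```python
# Cycle check for a proposed dependency graph via depth-first search.
# edges maps each task to the tasks it depends on.

def has_cycle(edges: dict) -> bool:
    WHITE, GREY, BLACK = 0, 1, 2      # unvisited / on DFS stack / finished
    colour = {}

    def visit(node: str) -> bool:
        colour[node] = GREY
        for dep in edges.get(node, []):
            c = colour.get(dep, WHITE)
            if c == GREY:             # back edge: dep is on our own stack
                return True
            if c == WHITE and visit(dep):
                return True
        colour[node] = BLACK
        return False

    return any(colour.get(n, WHITE) == WHITE and visit(n) for n in edges)

# A valid DAG is accepted; a submission that closes a loop is rejected:
assert not has_cycle({"B": ["A"], "C": ["A", "B"]})
assert has_cycle({"A": ["B"], "B": ["C"], "C": ["A"]})
```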


Which companies ask the distributed task scheduler system design question?

Amazon, Google, Microsoft, Pinterest, Airbnb, Instacart, Uber, and Lyft ask variants of this question for senior software engineer and principal engineer roles.

Why it is a consistently popular question:

  1. Compresses core distributed systems concepts — coordination, durability, failure detection, retries, and execution guarantees all appear in a single, bounded problem
  2. Reveals real experience — candidates who've operated task schedulers in production know the heartbeat window trade-off, the DLQ necessity, and the clock drift problem from direct experience
  3. Scales clearly to seniority — a mid-level answer describes queues and workers; a senior answer explains leader election, the Two Generals' Problem, and why at-least-once is the honest guarantee

What interviewers specifically listen for:

  1. Durability before acknowledgement — writing to the Task Store before responding to the client
  2. Pull over push — with the specific reasoning that worker capacity is inherently variable
  3. At-least-once named honestly — not claiming exactly-once, and explaining why
  4. Heartbeat window reasoning — explaining why 60 seconds, not 10 seconds or 5 minutes
  5. DLQ raised proactively — not just when the interviewer asks about failed tasks

The distributed task scheduler interview is one where senior-level depth shows up quickly. Describing heartbeats and connecting them to the at-least-once guarantee, explaining why exactly-once is theoretically hard, justifying pull over push with failure reasoning — these are the signals interviewers at Amazon and Google are listening for. If you want to practise articulating that reasoning under real interview conditions, Mockingly.ai has system design simulations built for engineers preparing for senior roles at Amazon, Google, Microsoft, and beyond.
