Sep 06, 2025

Serverless cron jobs: stop overlaps and detect silent failures

Make serverless cron jobs reliable: choose a scheduler, block concurrent runs with locks, and add a last-ran heartbeat with alerts.

The problem: overlaps and silent failures

Most serverless cron jobs fail in two predictable ways: they run twice at the same time, or they stop running and nobody notices.

An overlap is when the next scheduled run starts before the previous one finishes. In real systems, that becomes duplicate invoices, repeated emails, double payouts, or imports that write the same rows twice. Even if your code is mostly idempotent, overlaps still hurt: they waste rate limits, double-charge cards, and hold locks longer than expected.

Silent failures are worse because they look like nothing happened. A job can stop because a deploy removed the schedule, a permission change blocked database access, a secret expired, quotas were hit, or a platform update disabled a trigger. Old logs may still be there, so it feels fine until a customer reports missing data.

"It worked once" isn't a reliability plan. A job that ran yesterday isn't proof it will run tomorrow, especially when config, permissions, secrets, and runtimes change without touching code.

What you want is simple and measurable:

  • No concurrent runs (one run owns the work, everyone else backs off)
  • Fast detection when runs stop (a clear "last ran at" signal, plus an alert when it's stale)

Treat scheduling like production infrastructure and you stop chasing weird, intermittent bugs later.

Choose a scheduler that fits the job

Not all schedulers behave the same, and that matters once people depend on your jobs.

Event-based schedulers fire a trigger at (roughly) a specific time and hand off work to a function or endpoint. They're simple and cheap, but delivery is often "best effort" unless you add retries, dead-letter handling, and monitoring.

Queue-based schedulers enqueue a "run this job" message. That extra hop is useful because queues usually give you better control over retries, backpressure, and visibility. If your job is heavy, slow, or spiky, a queue tends to make failures easier to see and recover from.

Common options include AWS EventBridge Scheduler (or CloudWatch schedules), GCP Cloud Scheduler (often paired with Pub/Sub or Cloud Tasks), Azure Functions Timer Trigger (or Logic Apps), and CI schedulers like GitHub Actions for lightweight maintenance tasks.

A practical way to choose is to answer a few questions:

  • How often does it run, and does the exact minute matter?
  • How long can a run take at peak (seconds vs hours)?
  • What should happen on failure: retry, alert, or both?
  • What permissions does it need (database, secrets, third-party APIs)?
  • Do you need to catch up if a run is missed?

If you need strong guarantees, avoid "fire and forget" setups with no retries, no dead-letter path, and no alert when nothing runs. That's how jobs quietly stop for days.

Define your run model before you code

Many reliability problems start before the scheduler. They start with a fuzzy definition of what a "run" even means.

Decide what counts as one run and write it down. Is it "everything since the last run," "all records for yesterday," or "a batch id created at 02:00"? This one choice affects how you lock, retry, and recover.

Make "running twice" safe where you can

Even with good locks, assume a run can happen twice due to retries, timeouts, or manual replays. Aim for idempotent work: the same input should lead to the same final result.

A simple pattern is to store a run key (like 2026-01-20) and record which items were processed under that key. If the same key runs again, you skip completed items instead of repeating side effects.
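A minimal sketch of that pattern (the names `processBatch` and `sideEffect` are hypothetical, and the Map is an in-memory stand-in for a shared, durable table keyed by run key and item id):

```javascript
// Sketch: idempotent processing under a run key.
// `processedStore` is an in-memory stand-in for a shared table;
// in production this record must be durable and shared.
const processedStore = new Map(); // runKey -> Set of processed item ids

function processBatch(runKey, items, sideEffect) {
  if (!processedStore.has(runKey)) processedStore.set(runKey, new Set());
  const done = processedStore.get(runKey);
  let performed = 0;
  for (const item of items) {
    if (done.has(item.id)) continue; // same key ran before: skip the side effect
    sideEffect(item);                // e.g. send the email, write the invoice row
    done.add(item.id);               // record completion under the run key
    performed += 1;
  }
  return performed;
}
```

Replaying the same run key then performs zero new side effects, which is exactly what you want from a retry.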

Separate trigger from worker

Treat the schedule trigger as a thin starter and put real work in a worker function. The trigger should only compute the run key, attempt to claim the run, and hand off.

This keeps business logic separate from reliability guardrails, and it makes it easier to switch schedulers later.
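A sketch of that shape, where `claimRun` and `enqueueWorker` are hypothetical helpers injected into the trigger so it stays free of business logic:

```javascript
// Sketch: a thin trigger that only computes the run key, claims the
// run, and hands off. `claimRun` and `enqueueWorker` are assumptions
// standing in for your lock store and your worker invocation.
function makeTrigger(claimRun, enqueueWorker) {
  return function trigger(scheduledTime) {
    // Run key = job name + scheduled slot, so a retry of the same
    // slot maps to the same run instead of a new one.
    const runKey = `billing-sync:${scheduledTime}`;
    if (!claimRun(runKey)) return { status: "skipped", runKey };
    enqueueWorker(runKey); // the real work happens elsewhere
    return { status: "claimed", runKey };
  };
}
```

Swapping schedulers later then only means rewiring this thin function, not the worker.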

Before coding, define the outcomes:

  • Success: what data is definitely correct, and where is it recorded?
  • Failure: what must be rolled back, and what can be retried?
  • Partial success: what is safe to keep, and how do you resume?
  • Timeout: what state might be left behind?

Plan your concurrency guard (locking strategy)

If you run serverless cron jobs, assume two bad days will happen: a job runs long and the next schedule fires anyway, or a function retries after a timeout and you get two copies. A concurrency guard is the small piece that makes those days boring.

Start by choosing where the lock lives. Pick something your job can read and write quickly, with strong "only one wins" behavior.

Pick your lock store

Common choices:

  • A single database row (great if you already use Postgres/MySQL and can do an atomic update)
  • Redis (fast and convenient for short locks, but make sure it's highly available)
  • Object storage lease (a blob/file created with "if not exists"; simple, but can be slower)

Next, decide what the lock key represents. A practical key often includes the job name and the scheduled time window, like billing-sync:2026-01-20T02:00Z. That blocks duplicates for the same slot without preventing the next day's run.

Always set a TTL (expiry). TTL protects you when a run crashes mid-way or the platform kills the process. Set it slightly longer than your worst-case runtime, not your average.

Finally, decide what happens on conflict:

  • Skip (safe for idempotent work, but you may miss a run)
  • Reschedule (better coverage, but can create traffic spikes)
  • Fail loudly (best for critical jobs where missing a run is worse than noise)

Step-by-step: prevent concurrent runs with a lock

Overlaps happen when your scheduler fires twice, or a run takes longer than expected. In serverless, the simplest fix is a shared lock stored outside the function (a database row, Redis key, or a cloud KV store). One run wins the lock; the others exit.

1) Use a clear lock flow

Keep the flow predictable:

  • Acquire lock (atomic create or conditional update)
  • If lock is taken, exit quickly
  • Run the job
  • Release lock, but only if you still own it

2) Add an owner token and always release

An owner token prevents Run B from releasing Run A's lock. Always release in a finally block so errors don't leave a permanent lock.

import crypto from "crypto";

export async function handler() {
  const lockKey = "nightly-report";
  const owner = crypto.randomUUID();
  const ttlSeconds = 15 * 60; // lock safety window

  const acquired = await acquireLock({ lockKey, owner, ttlSeconds });
  if (!acquired) return { status: "skipped", reason: "lock_taken" };

  try {
    await doWork();
    return { status: "ok" };
  } finally {
    await releaseLock({ lockKey, owner }); // only release if owner matches
  }
}

A good acquireLock is atomic and sets an expiry (TTL) so a dead run doesn't block forever.
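One way to sketch those two helpers. This synchronous, in-memory version is only for illustration of the TTL and owner semantics; a real lock must live in a shared store (Redis `SET ... NX EX`, a DB row with a conditional update, or managed KV), where the check-and-claim is a single atomic operation:

```javascript
// Sketch of acquireLock/releaseLock with a TTL and an owner check.
// `store` is an in-memory stand-in; in serverless, per-instance
// memory is NOT a real lock -- use a shared store in production.
const store = new Map(); // lockKey -> { owner, expiresAt }

function acquireLock({ lockKey, owner, ttlSeconds }, now = Date.now()) {
  const existing = store.get(lockKey);
  if (existing && existing.expiresAt > now) return false; // live lock held
  // Expired or absent: claim it. In a real store, this read-then-write
  // must be one atomic operation (e.g. Redis SET NX EX).
  store.set(lockKey, { owner, expiresAt: now + ttlSeconds * 1000 });
  return true;
}

function releaseLock({ lockKey, owner }) {
  const existing = store.get(lockKey);
  // Only the owner may release, so a slow Run A can't delete
  // the lock a newer Run B legitimately holds.
  if (existing && existing.owner === owner) store.delete(lockKey);
}
```

Note how a release with the wrong owner token is a no-op: that is the property that makes the `finally` block safe.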

3) Test with forced overlap

Trigger two runs at the same time (manual invoke twice, or reduce the schedule temporarily). One should run; the other should log "skipped: lock_taken". If both run, your lock write isn't truly atomic, or your owner check is missing.

Step-by-step: add a last-ran heartbeat check

A heartbeat is a tiny "I ran" record your job writes every time it finishes (success or failure). It turns silent failures into alerts, which matters in serverless where there's no always-on process to notice a stall.

1) Choose where to store "last ran"

Pick a place that's easy to write, fast to read, and unlikely to be down at the same time as your scheduler:

  • A database table
  • A key-value store entry (simple "job_name -> last_run")
  • A metrics system / time series gauge

2) Record the right fields

Don't store only a timestamp. Store enough to debug without digging through logs first.

job_name, run_id, started_at, finished_at, status, duration_ms, error_snippet

A practical rule is: write once at start (status=running), then update at the end (status=success or failed). That also lets you detect "stuck running".
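A sketch of that write-then-update lifecycle, with a Map standing in for the queryable table or KV entry (the function names are hypothetical):

```javascript
// Sketch: status=running at start, then update at the end.
// `heartbeats` stands in for a queryable table or KV store.
const heartbeats = new Map(); // job_name -> heartbeat record

function startHeartbeat(jobName, runId, now = Date.now()) {
  heartbeats.set(jobName, {
    job_name: jobName, run_id: runId,
    started_at: now, finished_at: null,
    status: "running", duration_ms: null, error_snippet: null,
  });
}

function finishHeartbeat(jobName, status, errorSnippet = null, now = Date.now()) {
  const rec = heartbeats.get(jobName);
  rec.finished_at = now;
  rec.status = status; // "success" or "failed"
  rec.duration_ms = now - rec.started_at;
  rec.error_snippet = errorSnippet;
}
```

A record stuck at `status=running` long past the job's worst-case duration is your "stuck running" detector.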

3) Set a threshold and alert rules

Set the "missing heartbeat" threshold to about 2x your expected interval. If a job runs every 15 minutes, alert if there's no successful heartbeat in 30 minutes.

Separate alert types:

  • Missing heartbeat: no success within threshold (likely not running)
  • Repeated failures: last N runs failed (job runs, but work is broken)

Example: a nightly billing sync should run at 2:00 AM and finish in 5 minutes. Alert if there's no success by 2:15 AM. Use a different alert if it ran but failed three nights in a row.
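The two alert types above can be sketched as one classifier over the heartbeat data (the field names and the "last 3 failed" cutoff are illustrative assumptions):

```javascript
// Sketch: classify a job from its heartbeat history.
// Threshold is ~2x the expected interval, per the rule above.
function checkHeartbeat({ lastSuccessAt, recentStatuses }, intervalMs, now = Date.now()) {
  if (now - lastSuccessAt > 2 * intervalMs) return "missing_heartbeat"; // likely not running
  const lastThree = recentStatuses.slice(-3);
  if (lastThree.length === 3 && lastThree.every(s => s === "failed"))
    return "repeated_failures"; // job runs, but the work is broken
  return "ok";
}
```

Run a check like this on its own schedule, separate from the job itself, so the monitor doesn't die with the thing it monitors.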

Logging and alerts that are actually useful

When serverless cron jobs misbehave, the first problem is usually not the scheduler. It's that nobody can answer three questions quickly: did it start, did it finish, and why did it skip?

Give every run a consistent run id (for example: timestamp + short random suffix). Log it at the start and include it in every log line so you can follow one run end-to-end.
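A minimal sketch of that run id format (using `Math.random` as a stand-in; prefer crypto-grade randomness in production):

```javascript
// Sketch: a run id from the trigger time plus a short random suffix,
// so two runs in the same slot are still distinguishable in logs.
function makeRunId(date = new Date()) {
  const stamp = date.toISOString().replace(/[-:]/g, "").slice(0, 15); // 20260120T020000
  // Pad so the suffix is always 6 chars, even on unlucky randoms.
  const suffix = (Math.random().toString(36).slice(2) + "000000").slice(0, 6);
  return `${stamp}-${suffix}`;
}
```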

Also log when a run doesn't happen on purpose. A skipped run isn't the same as a failed run, but it's still an important signal. If the job didn't run because it couldn't get the lock, say that clearly and include the lock key (and owner if you have it).

Keep logs consistent:

  • Start: run id, job name, trigger time, version/commit, important inputs
  • Skip: run id, job name, skip reason (lock conflict, scheduler disabled, feature flag off)
  • Finish: run id, status (ok/failed), duration, counts (items processed, errors)
  • Failure: run id, error type, safe context, and what was already done

Alerts should watch for patterns, not just single errors. A duration spike can mean an upstream API is slow. Too many skips can mean your lock is stuck. No runs at all often points to scheduler permission drift or a deployment that removed the trigger.

Make every alert actionable. Include the last successful run time, the expected next run time, and the first thing to check (lock record, scheduler status, recent deploy).

Retries, timeouts, and catching up safely

Retries help, but they also create overlaps. Many schedulers retry automatically if your function returns an error or times out. Without a distributed lock (or if you release it too early), a retry can start while the original run is still working.

Timeouts make this worse. In serverless, the platform can stop your code mid-task when you hit a time limit. You may not get a chance to clean up, and you may not know which steps finished. If the scheduler retries, you can double-send emails, double-charge, or write duplicates.

A safer approach is to make each run resumable and idempotent. Think in checkpoints, not one big "do everything" function. For example, a nightly invoicing job can store a progress marker like "processed up to invoice_id 18420" and continue from there.
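A sketch of that checkpoint idea, assuming items are processed in ascending id order and `progress` stands in for a durable checkpoint row:

```javascript
// Sketch: resume from a progress marker instead of restarting.
// Assumes `ids` is sorted ascending; `progress` must be durable
// and shared in production, not in-memory.
const progress = new Map(); // runKey -> last processed id

function runResumable(runKey, ids, handleOne) {
  const last = progress.get(runKey) ?? 0;
  let handled = 0;
  for (const id of ids) {
    if (id <= last) continue;  // already done in an earlier attempt
    handleOne(id);
    progress.set(runKey, id);  // checkpoint after each item
    handled += 1;
  }
  return handled;
}
```

If the platform kills the run after item 18420, the retry starts at 18421 instead of re-sending everything.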

Guardrails that prevent retries and catch-ups from doing damage:

  • Hold the lock for the whole run. Release it only when you're truly done.
  • Record a run_id and progress markers so a retry can continue instead of restarting.
  • Split work into small batches with a per-item "already processed" check.
  • Add a controlled backfill mode that processes missed windows one at a time.

Backfills matter because schedules slip. If yesterday's run failed, today's run shouldn't automatically process two days in one go unless your system can handle it. A simple rule is "process the oldest missing window first, then stop".

Common mistakes and easy fixes

Most production failures come from a few choices that feel fine in a prototype.

  • TTL shorter than the job. Your "single run" promise breaks the moment one run is slow. Fix: set TTL to worst-case runtime plus a buffer, and refresh it while the job is alive.
  • TTL too long. One crashed run can block the schedule for hours. Fix: keep TTL reasonable, release in finally, and use an owner token so another instance can't unlock by accident.
  • In-memory locks. In serverless, each run can land on a different instance, so memory flags do nothing. Fix: use a shared store (DB row, Redis, or managed KV).
  • Assuming "exactly once" without idempotency. Retries and at-least-once delivery will bite you. Fix: write with unique keys, upserts, or a run-id check before side effects.
  • Using logs as your heartbeat. Logs are great for debugging, but painful to alert on. Fix: write a last-ran record to a queryable place (DB/KV/metrics).

One easy-to-miss cause of silent failure is permission drift. The scheduler still fires, but the worker can no longer read secrets, write to storage, or call an API after a change.

Quick checklist before you ship

Before you trust serverless cron jobs in production, do one pass focused on boring failures: a scheduler that's off, a lock that expires too early, or a heartbeat nobody checks.

  • Confirm the schedule is enabled in the right environment, and the runtime identity can read secrets, write to your DB/queue, and emit logs.
  • Write down your concurrency guard: lock key format, where it's stored, TTL, and what happens on conflict (skip, reschedule, or fail).
  • Validate TTL with real timings. If the job sometimes takes 12 minutes, a 10-minute lock will create overlaps.
  • Store a "last successful run" heartbeat somewhere you can query quickly during an incident, and include status (not just a timestamp).
  • Run two intentional tests: (1) force a failure to confirm alerts reach a human, and (2) force an overlap to confirm the second run is blocked and clearly logged.

A simple overlap test: start one run with an intentional sleep/wait in the middle, then trigger a second run. If you don't see a clean "lock held, exiting" message, your guard still isn't reliable.

Example: a nightly job that must never run twice

A painful (and common) case: a nightly "report export" runs at 02:00, generates PDFs, and emails them to customers. After a deploy, the scheduler fires twice (or a retry kicks in) and some customers get duplicate emails. Nothing is "down," but trust drops fast.

The fix is two small pieces: a lock to prevent overlap, and a heartbeat to catch silent stops.

First, the job grabs a distributed lock (for example, a row in a database or a key in a managed store) with a TTL longer than the expected run time. If the lock is already held, the second invocation exits before sending anything.

A practical flow:

  • Try to acquire lock "nightly-export" with TTL 45 minutes
  • If lock exists, log "skipped: already running" and stop
  • If lock acquired, generate the export and send emails
  • Release lock (TTL is a safety net, not the plan)

Second, write a heartbeat like last_success_at after emails are sent. Then run a separate check every 15 minutes that alerts if now - last_success_at is greater than 24 hours + one interval. That catches the "job stopped after deploy" problem quickly.

For a non-technical owner, the best logs and alerts are plain-language:

02:00:01 lock_acquired job=nightly-export run_id=abc123
02:07:44 completed job=nightly-export emails_sent=418 last_success_at=2026-01-20T02:07:44Z
02:00:02 skipped job=nightly-export reason=lock_held current_owner=abc123
ALERT: Nightly export has not succeeded in 25h. Last success: 2026-01-19 02:06 UTC. Check scheduler + secrets.

Next steps if your scheduled jobs are still unreliable

If your jobs still overlap or "just stop" after you added a lock and a heartbeat, the scheduler might be fine but the job logic is fragile.

A common sign is when the cron logic came from an AI-generated prototype. You often see a lock that isn't truly shared, secrets leaking into logs, and retry behavior that looks helpful but causes duplicate side effects.

Signs you're in "stop patching" territory:

  • Fixes work until the next deploy, then failures change shape
  • Nobody can explain exactly when a run is considered "done"
  • Retries create duplicate emails, charges, or writes
  • Auth breaks unpredictably (expired tokens, missing refresh, bad roles)
  • Logs don't let you reconstruct one run end-to-end

At that point, a short remediation pass usually beats more tweaks. The goal isn't a rewrite. It's to make the job predictable: one clear entry point, one lock strategy, one set of timeouts, and one place where success is recorded.

If you're dealing with a broken AI-generated codebase (especially from tools like Lovable, Bolt, v0, Cursor, or Replit), FixMyMess at fixmymess.ai focuses on diagnosing and repairing issues like overlapping runs, exposed secrets, and fragile retry logic so the job behaves reliably in production.