Stuck job detection: heartbeats, timeouts, and alerts
Learn stuck job detection with simple heartbeats, sensible timeouts, and clear alerts so you can tell if work is slow, failed, or truly stuck.

Why "it is not running" is so hard to debug
When someone says, "it is not running," they usually mean, "I clicked the button and nothing happened." The problem is that most apps do a lot of work after the click, out of sight. Without clear signals, you can't tell whether the work never started, started and is just slow, or started and got stuck halfway.
A background job is simply a task your app does later, without making the user wait on the screen. Common examples are generating a PDF, importing a CSV, sending emails, building a report, or syncing data. The user triggers it, then the app hands it off to a worker that runs in the background.
From the outside, three different outcomes can look exactly the same:
- Slow: it's running, but taking longer than expected.
- Failed: it stopped with an error and won't finish on its own.
- Stuck: it started, but isn't making progress anymore.
All three feel like "nothing is happening." Your dashboard might still say "queued" or "processing" for hours. Support messages pile up. Founders end up guessing: is the server down, will it finish overnight, did the user do something wrong?
Stuck job detection is how you turn that guesswork into a clear answer. You want an easy way to see:
- Did the job start?
- Is it still alive right now?
- When did it last make progress?
- When do we declare it timed out?
- Who gets alerted, and what should they do next?
Once those questions are visible, "it is not running" becomes a diagnosable issue, not a mystery.
Slow vs failed vs stuck: a simple definition
When someone says "the job isn't running," they usually mean one of three different things. If you don't name which one it is, teams tend to guess, retry, and create duplicate work (including double-charging a customer or sending the same email twice).
Slow
A slow job is still moving forward. It might be uploading files, calling an API, or processing records, just more slowly than you expected. The key signal is that something is changing: timestamps update, progress numbers increase, or new log lines appear.
Failed
A failed job stopped. It won't finish on its own. It might have crashed, hit a hard error, or lost access to something it needs (like a permission or a missing file). The key point is that it's done, and the only way to complete the work is to fix the cause and retry.
Stuck
A stuck job is the tricky one. The job "exists" (it's in the queue or marked as running), but it's no longer making progress. It's alive enough to block other work, but not healthy enough to finish.
In practice:
- Slow: progress keeps changing, even if it's small.
- Failed: it ends with an error and stops.
- Stuck: it looks like it's running, but nothing changes for too long.
Example: your daily report email didn't go out. If you retry blindly, you might send it twice later. If you know it failed, you can rerun safely. If you know it's stuck, you can restart it on purpose instead of waiting for hours.
The three building blocks: heartbeats, timeouts, alerts
If you want stuck job detection to feel simple, think of it like tracking a delivery. You need proof it's moving, a point where you decide it's not, and a way to tell someone.
1) Heartbeats: proof of life (or progress)
A heartbeat is a small signal a job sends while it runs. It can be as basic as "still working" every minute, or better, "processed 3 of 20 items." The goal isn't detail. It's confidence that the job is moving.
Good heartbeats are consistent and cheap. A file export job can heartbeat after each chunk is written. A data sync can heartbeat after each customer is updated. If a job can't report meaningful progress, it should at least report "I'm alive" on a schedule.
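A minimal sketch of that pattern in Python. The `write_chunk` and `record_heartbeat` callables are hypothetical stand-ins for your own storage calls; the point is that the heartbeat fires only after a chunk of real work completes.

```python
import time

def export_rows(rows, write_chunk, record_heartbeat, chunk_size=500):
    """Process rows in chunks, heartbeating after each real unit of work.

    write_chunk and record_heartbeat are placeholders for your own
    database or queue calls.
    """
    processed = 0
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        write_chunk(chunk)  # the real work
        processed += len(chunk)
        # Heartbeat only after the chunk finished, so it reflects progress
        record_heartbeat(processed, time.time())
    return processed
```

Because the heartbeat carries `processed`, a reader of the status view sees "12,430 rows done", not just "alive".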
2) Timeouts: when "maybe slow" becomes "stuck"
A timeout is the rule that turns silence into a clear state. It answers: how long are you willing to wait with no heartbeat before you treat the job as stuck?
A simple setup looks like this:
- Pick an expected heartbeat interval (for example, every 60 seconds).
- Allow a grace window (for example, 5 missed heartbeats).
- Mark the job as stuck when that window is exceeded.
- Record the last known progress so you know where it stopped.
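The four rules above fit in a few lines. This is a sketch, assuming a 60-second heartbeat interval and a grace window of 5 missed beats; tune both to your jobs.

```python
from datetime import datetime, timedelta

HEARTBEAT_INTERVAL = timedelta(seconds=60)  # expected heartbeat rhythm
GRACE_MISSED_BEATS = 5                      # missed beats tolerated before "stuck"

def is_stuck(last_heartbeat_at: datetime, now: datetime) -> bool:
    """Silence longer than interval * grace window means stuck, not just slow."""
    return now - last_heartbeat_at > HEARTBEAT_INTERVAL * GRACE_MISSED_BEATS
```

A monitor process can run this check on every "running" job once a minute and flip the status to stuck when it returns true, recording the last known progress alongside.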
3) Alerts: a message a human can act on
An alert should reach a real person and include enough context to respond without guessing: which job, which customer, when it last progressed, and what the system did next (retry, paused, or failed).
You don't need a fancy dashboard. A simple status view is enough: latest run, last heartbeat time, and whether a timeout triggered. That's what turns "it's not running" into an observable problem instead of a debate.
How to add heartbeats without overengineering
A heartbeat is just a small "still alive" signal your job updates while it runs. It turns stuck job detection from guessing into one checkable fact: when did the job last report progress?
Start with the smallest progress signal you can update safely. For most teams, a single row in a database table is enough. Keep it boring and durable so a restart doesn't wipe it.
Pick a heartbeat that's hard to lie about
Choose one field that matches what "progress" means for the job:
- last_heartbeat_at (timestamp)
- current_step (for example: "fetching data", "writing file", "sending email")
- processed_count (how many items are done)
Update the heartbeat only after you finish a real unit of work (one page fetched, one batch written, one invoice processed). That way it reflects real movement, not a loop that spins forever.
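As a sketch of "one boring, durable row per job", here is an upsert using Python's built-in sqlite3 (an in-memory database here for illustration; your real database would do the same thing with its own upsert syntax). Table and column names are assumptions matching the fields above.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")  # swap for your real database connection
conn.execute("""CREATE TABLE IF NOT EXISTS job_heartbeats (
    job_id TEXT PRIMARY KEY,
    last_heartbeat_at TEXT,
    current_step TEXT,
    processed_count INTEGER)""")

def heartbeat(job_id, step, processed):
    """Upsert the heartbeat row after a real unit of work completes."""
    conn.execute(
        """INSERT INTO job_heartbeats
             (job_id, last_heartbeat_at, current_step, processed_count)
           VALUES (?, ?, ?, ?)
           ON CONFLICT(job_id) DO UPDATE SET
             last_heartbeat_at = excluded.last_heartbeat_at,
             current_step = excluded.current_step,
             processed_count = excluded.processed_count""",
        (job_id, datetime.now(timezone.utc).isoformat(), step, processed))
    conn.commit()
```

One row per job run keeps the admin view trivial: select everything, sort by last_heartbeat_at.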
Set an update rhythm that fits the job
A heartbeat that fires too often creates noise. Too rarely, and you find out late.
As a rough guide: update per step for short jobs, every 1 to 5 minutes for long jobs, and after each external call completes for anything that depends on an API, upload, or third-party service.
Make it visible. A tiny admin view is enough: job ID, status, current step, and last heartbeat time. Then "it's not running" becomes "it stopped heartbeating at 2:14 PM while writing step 3."
Choosing job timeouts that match real life
A timeout shouldn't be a random number. It's a promise: "If this job hasn't made progress by X, something is probably wrong, and we will respond in a predictable way."
Start with two timeouts, not one:
- A soft timeout: "this is taking longer than normal"
- A hard timeout: "stop waiting, something is wrong"
Base defaults on real runs, not guesses. Look at your slowest normal run (not the best day, not a full outage) and use that as a starting point. If reports usually finish in 2 to 4 minutes but occasionally take 8, a soft timeout at 6 and a hard timeout at 15 is more useful than a hard 5 that fires all the time.
Some jobs are "fast work plus long waiting." Waiting on an external API, a file upload, or a payment provider can pause progress without meaning the job is broken. Handle this by giving each step its own limit, so you can see where time is being spent.
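One way to sketch per-step limits in Python is to run each step in a worker thread and bound how long you wait for it. The step names and limits below are made up for illustration; note that a truly hung step keeps its thread alive until it returns, so this pattern turns silence into a clear error rather than killing the work.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as StepTimeout

STEP_TIMEOUTS = {          # seconds; base these on each service's normal behavior
    "fetch_data": 120,
    "write_file": 300,
    "send_email": 60,
}

def run_step(name, fn, *args):
    """Run one step under its own time limit; raise loudly if it is exceeded.

    The timed-out thread is not killed; this only makes the wait visible.
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn, *args)
        try:
            return future.result(timeout=STEP_TIMEOUTS[name])
        except StepTimeout:
            raise RuntimeError(
                f"step {name!r} exceeded its {STEP_TIMEOUTS[name]}s limit")
```

The payoff is diagnostic: instead of "the job is stuck", the error says which step blew its budget, which is usually enough to know whether it's the API, the upload, or your own code.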
A simple way to set timeouts
Use this as a starting point, then adjust based on real data:
- Soft timeout: 1.5x to 2x your slowest normal run
- Hard timeout: 3x to 5x your slowest normal run
- Step timeouts: set per external call or wait, based on the service's normal behavior
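The two rules of thumb above reduce to a tiny function. This sketch takes the generous end of each range; tighten once you have real run data.

```python
def suggest_timeouts(slowest_normal_minutes):
    """Starting-point soft/hard timeouts from the rules of thumb above."""
    soft = slowest_normal_minutes * 2   # 1.5x to 2x slowest normal run
    hard = slowest_normal_minutes * 4   # 3x to 5x; middle of the range
    return soft, hard
```

So a job whose slowest normal run is 8 minutes would start with a soft timeout around 16 minutes and a hard timeout around 32, then get adjusted as you observe real behavior.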
When a timeout hits, decide the behavior in advance. Options are simple: mark as stuck and alert someone, retry if it's safe, or stop and require review.
Common reasons jobs get stuck (no code jargon)
Most stuck jobs aren't mysterious. They are usually waiting on something, blocked by something, or repeating the same step without making progress.
Four causes show up again and again:
- The worker stops halfway through. The machine running the job can crash, restart, run out of memory, or lose network. The job already started, so it looks "in progress," but nobody is doing the work anymore.
- It waits on another service that never answers. Email providers, payment gateways, AI APIs, file storage, partner systems. If the service hangs (not a clean error), your job can sit there forever unless you set a limit.
- Something blocks it in the database. A job might try to update a record while another process holds a lock. From the outside it looks idle. Inside it's stuck in a traffic jam.
- The job loops or redoes the same work. A small bug can cause endless retries, duplicate processing, or a loop that never reaches "done." It's technically running, but it isn't moving forward.
A quick sanity check is to ask: did it stop making progress, or did it stop completely? If a job is marked running but "rows processed" hasn't changed in 20 minutes, you're looking at waiting, blocking, or looping.
Alerts that help someone take action
An alert is only useful if it tells a real person what to do next. If it just says "worker error" at 2 a.m., people will mute it and you'll lose the point of monitoring.
Start with one channel you already watch, like email or team chat. Add escalation later once you trust the alerts.
Alert only on states someone can act on: a job is stuck (no heartbeat for X minutes), a job failed several times in a row, or the queue is backing up (for example, the oldest job is over 15 minutes old). Avoid "FYI" alerts like "job started" unless you're debugging.
Every alert should include enough context to decide quickly:
- Job name and which feature it belongs to
- User or account affected (if applicable)
- Start time and last heartbeat time
- Last known step (for example: "Generating PDF")
- Owner and first action
Make the first action explicit and boring. Example: "Owner: Support. First action: retry once. If it fails again, escalate to Engineering with this job ID."
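Putting the checklist above into one message can be as plain as string formatting. All field names here are illustrative; the point is that every item on the list lands in the alert.

```python
def format_alert(job, feature, account, started_at,
                 last_heartbeat_at, last_step, owner, first_action):
    """Render an alert a human can act on without guessing."""
    return (
        f"[STUCK] {job} ({feature})\n"
        f"Account: {account}\n"
        f"Started: {started_at} | Last heartbeat: {last_heartbeat_at}\n"
        f"Last step: {last_step}\n"
        f"Owner: {owner}. First action: {first_action}"
    )
```

Send that string to the channel you already watch; a webhook to team chat or a plain email is enough to start.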
Step by step: make stuck jobs diagnosable in a day
You can make stuck job detection real in a day if you keep it simple: define what "healthy" looks like, then measure it.
Start by listing the handful of background jobs that would hurt if they silently stopped (billing sync, nightly reports, email sends, file imports). For each one, define what "done" means in plain terms: a row saved, a report delivered, an email batch completed, a file marked processed.
Next, add a heartbeat. This is just a timestamp your job updates while it works. Update it when the job starts, occasionally during progress, and once more at the end. Now "it's not running" becomes "the last heartbeat was 23 minutes ago while processing step 3."
Then set one timeout rule per job. Base it on missing heartbeats for long work or maximum runtime for short work. Pick something realistic based on normal behavior, not optimistic behavior.
Finally, give yourself one place to look. Four states are enough:
- Running (recent heartbeat)
- Done (finished marker recorded)
- Failed (error recorded)
- Stuck (timeout triggered)
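Those four states can be derived from fields you likely already have (or just added). This sketch assumes a job record with a finish marker, an error field, and a heartbeat timestamp; the names are illustrative.

```python
from datetime import datetime, timedelta

def job_state(finished_at, error, last_heartbeat_at, timeout, now):
    """Map a job record onto the four states above.

    finished_at/error/last_heartbeat_at are assumptions about what
    your own job table might store; timeout is a timedelta.
    """
    if finished_at is not None:
        return "done"
    if error is not None:
        return "failed"
    if now - last_heartbeat_at > timeout:
        return "stuck"
    return "running"
```

One function, evaluated for every job on your status page, is the whole "one place to look".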
Prove it works
Test it by forcing a failure on purpose: stop the worker mid-job or simulate a crash. Confirm two things: the job shows as stuck, and the alert includes what job it is, when it started, and the last heartbeat time.
Once this is in place, you're no longer guessing. You're observing.
Common mistakes and traps to avoid
The goal is simple: when something stops moving, you learn it quickly and you know what to do next. Teams usually miss that goal because signals look "green" right up until a customer complains.
A common trap is treating a single "started" signal as a heartbeat. If your worker reports only at the beginning, a mid-job freeze can look healthy for hours.
Timeouts also backfire when they're set by guesswork. If they're tighter than normal slow runs (end-of-month reports, big imports, peak traffic), you'll get alerts for work that would've finished fine. People learn to ignore alerts, which defeats the point.
Retries are another quiet source of damage. If a job can run twice and cause a second charge, a duplicate email, or a double refund, auto-retry turns one failure into a support mess.
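The standard guard against that damage is an idempotency key: record that a side effect already happened, and skip it on retry. A minimal sketch, using an in-memory set where production code would use a database table:

```python
_completed = set()  # in practice: a database table keyed by idempotency key

def run_once(idempotency_key, action):
    """Make a retry safe: if this key already completed, skip the side effect
    (a charge, an email, a refund) instead of repeating it."""
    if idempotency_key in _completed:
        return "already_done"
    result = action()
    _completed.add(idempotency_key)
    return result
```

With a key like "invoice-42-charge", a job that runs twice charges once; the retry becomes boring instead of dangerous.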
Alerts are hardest to act on when they lack basics: which customer is affected, what step it reached, who owns it, and whether someone already acknowledged it.
A useful reality check: if an alert wakes someone up, they should be able to answer "what broke, who is affected, and what's the next safe action" in under two minutes.
Quick checklist before you ship monitoring
Before you roll this out, aim for one outcome: when someone says "it's not running," you can answer what stopped, when, and what to do next.
- For any job that is "running," can you quickly see when it last checked in?
- Do you have one plain rule for "stuck," written in a single sentence (for example: "no check-in for 10 minutes means stuck")?
- When an alert fires, does it say which job, which customer or workspace, when it last progressed, and the last completed step?
- If you hit retry, are you protected from duplicates (double-charging, sending the same email twice, creating two reports)?
- In an emergency, can someone pause or rerun the job without touching code, and is it clear who can do that?
A practical test: ask a non-technical teammate to read one alert and explain what they'd do next. If they can't, the alert isn't done.
Example scenario: the missing report that was actually stuck
A founder clicks "Generate report" for a customer, closes the tab, and waits. Ten minutes later, nothing shows up in email or in the app. The support message is short: "It's not running."
With stuck job detection in place, the dashboard tells a clearer story:
- Job ID #18422 started at 10:03
- Current step: "Export to PDF"
- Last heartbeat: 18 minutes ago
- Status: Stuck (expected heartbeat every 60 seconds)
A useful alert lands in the right place and says what matters:
- Report job is stuck in "Export to PDF"
- Last progress: 12,430 rows processed
- Affected account: Acme Co
- Safe action: retry from the export step (no double billing, no duplicate emails)
Now the person on call has a clean path. First, they restart the job safely, using a retry that reuses the same output record instead of creating a second report. The customer gets their report.
Then they fix the cause. In this case, it's often either a request inside the export step that never returns, or a database lock that freezes the query until the worker gives up.
Next time is calmer because the job is split into clearer steps, and each step gets its own timeout.
Next steps: make your jobs reliable and easy to support
Start small. Pick the one background job that causes the most pain (missed reports, failed invoices, delayed emails). Add heartbeats and a single timeout rule to that job first. You'll learn more from one well-instrumented job than from light monitoring everywhere.
Write down what "healthy" looks like in plain English so anyone can judge it at a glance. Example: "This job should update its heartbeat at least once per minute and finish within 8 minutes. If it times out, it should alert." That one sentence becomes your shared definition of normal.
If your codebase started as an AI-generated prototype, assume there are hidden edge cases even if things mostly work. The failures often show up as stuck jobs because a worker waits forever, fails silently, or retries in a way that makes data messy.
If you're dealing with an AI-generated app that hangs in the background and you want a fast, practical path to production readiness, FixMyMess (fixmymess.ai) focuses on diagnosing and repairing issues like missing heartbeats, unsafe retries, and broken worker logic so your jobs become observable and supportable.
Once the basics are stable, your next upgrade is usually: safer retries (safe to run twice), a simple status page for job health, and deployment checks that catch configuration problems before they hit production.
FAQ
What exactly is a background job?
A background job is work your app does after the user clicks something, without keeping them on a loading screen. Common examples are report generation, CSV imports, PDF exports, and sending emails.
Why does “it’s not running” feel so hard to debug?
Because three different situations can look identical from the outside: the job is slow, it failed, or it’s stuck. Without signals like progress updates or timestamps, you can’t tell which one you’re dealing with.
How can I tell if a job is slow rather than stuck?
Slow means it’s still moving forward, just taking longer than expected. You’ll see evidence like updated timestamps, increasing counts, or new log/progress updates.
What’s the difference between a failed job and a stuck job?
A failed job has stopped and won’t complete on its own: it ends with an error or hits a hard problem, and the only way to finish is to fix the cause and retry safely. A stuck job, by contrast, still shows as running but has stopped making progress, so it won’t error out and won’t finish either.
What is a heartbeat and why do I need it?
A heartbeat is a small “proof of life” update the job writes while it runs, like a timestamp or a processed count. If heartbeats stop updating, you have a clear signal the job isn’t making progress.
What’s the simplest way to add heartbeats without overengineering?
Track one simple record per job run, like last_heartbeat_at, current_step, and a basic progress number. Update it only after finishing a real unit of work so it reflects real progress, not a loop that’s spinning.
How do I choose timeouts that won’t spam me with false alerts?
Start with a soft timeout to flag “unusually slow,” and a hard timeout to declare “stuck.” A practical default is to set the soft timeout around 1.5–2× your slowest normal run, and the hard timeout around 3–5×.
What are the most common real reasons jobs get stuck?
Common causes include a worker crash mid-run, a third-party service that hangs without returning, database locking that blocks updates, or a bug that loops and repeats the same step. Heartbeats plus a “last step” field usually tell you which category it is.
What should a good stuck-job alert include?
Include the job name, affected user/account, start time, last heartbeat time, last known step, and the first safe action to take (retry once, pause, or escalate). If the alert doesn’t tell someone what to do next, it’ll get ignored.
What’s a one-day plan to make stuck jobs diagnosable?
Start with the one job that causes the most pain and define what “done” means in plain terms. Add a heartbeat, set one timeout rule, and surface four states—Running, Done, Failed, Stuck—so anyone can see what’s happening quickly.