Webhook secret rotation without downtime: dual signatures done right
Learn how to do webhook secret rotation without downtime using dual verification, clear logging, safe cutover steps, rollback, and cleanup checks.

What goes wrong when you rotate a webhook secret
Rotating a webhook secret sounds simple: change the secret in the sender, change it in your receiver, done. In practice, one timing mismatch can break verification and turn a normal day into a flood of failed deliveries.
The most common failure looks like this: the provider starts signing requests with the new secret before your server knows about it (or your server switches first while the provider still uses the old one). Every request looks “tampered with,” so your receiver rejects it.
For webhook secret rotation without downtime, the goal isn’t a perfect one-second switch. It’s a short overlap window where both secrets are accepted.
“Downtime” for webhooks usually shows up as:
- missed events that never get processed (or arrive too late)
- retries piling up and hitting rate limits
- duplicates when the provider retries and your handler isn’t idempotent
- support tickets because payments, emails, or sync jobs fall out of step
The fix is boring but reliable: accept signatures created with either the old secret or the new secret during cutover, and watch signature failures closely. Once nearly all traffic validates with the new secret (and retries have drained), remove the old secret.
If your webhook handler is already fragile (inconsistent body parsing, flaky signature checks, mixed concerns inside one giant handler), rotation tends to expose it fast.
Webhook signatures in plain terms (and why rotation gets tricky)
A webhook is one system (the sender) calling your URL (the receiver) when something happens, like a payment or a signup. Because anyone can hit your endpoint, most providers sign each request with a shared secret and send the result in a signature header so you can tell the request is real.
With an HMAC signature, the sender takes the exact request body, mixes it with the secret, and produces a short fingerprint (the signature). Your server does the same calculation with its copy of the secret. If the fingerprints match, the sender proved it knows the secret without sending the secret over the wire.
The catch: tiny differences change the fingerprint. Many signature failures during rotation aren’t “bad secrets.” They’re mismatches in what’s being signed.
Common gotchas include:
- signing parsed JSON instead of the raw request body bytes
- whitespace or key-order changes introduced by middleware
- wrong encoding (string vs bytes, UTF-8 vs something else)
- header differences (some providers use different header names or include a timestamp)
- multiple signatures in one header (during rotation or for different algorithms)
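The first gotcha is easy to reproduce. The sketch below assumes a hex-encoded HMAC-SHA256 scheme, which is common but not universal; the secret and payload are made-up values. It shows that parsing and re-serializing a body yields equivalent JSON but different bytes, and therefore a different signature.

```python
import hashlib
import hmac
import json


def sign(secret: bytes, body: bytes) -> str:
    """Return the hex HMAC-SHA256 of body under secret (assumed scheme)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


secret = b"example-secret"                    # hypothetical value
raw_body = b'{"id": 1,  "type": "payment"}'   # note the double space

# Signature over the exact bytes the sender transmitted.
sent_signature = sign(secret, raw_body)

# Parsing and re-serializing keeps the data but normalizes the whitespace.
reserialized = json.dumps(json.loads(raw_body)).encode("utf-8")

raw_matches = hmac.compare_digest(sign(secret, raw_body), sent_signature)
reserialized_matches = hmac.compare_digest(sign(secret, reserialized), sent_signature)
```

Here `raw_matches` is true and `reserialized_matches` is false, even though the secret never changed.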
So why does “just update the secret” break things? Because updates don’t happen everywhere at once. The provider may roll changes gradually, your deploy may roll out across instances over minutes, and retries can arrive later signed with the previous secret. If you accept only the new secret too early, you’ll reject real events.
That’s why rotation needs an overlap window where you verify with both secrets, plus monitoring that tells you when the old signature has effectively disappeared.
Plan the cutover: overlap window and success signals
A safe rotation starts with one decision: how long you’ll accept both secrets. Your overlap window should be longer than the worst-case time a webhook can arrive late. That includes provider retries (sometimes hours or days), your own queue delays, and any manual replays your team might trigger.
Before you touch code, confirm you can store two secrets at once and keep them out of logs and error messages. Treat one as “current” and one as “previous.” Make it possible to flip which is current without redeploying (a config change or secret manager update).
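One way to hold a “current” and a “previous” secret is a pair of environment variables read into an ordered list. This is a minimal sketch: the variable names `WEBHOOK_SECRET_CURRENT` and `WEBHOOK_SECRET_PREVIOUS` are hypothetical, and a secret manager lookup could replace the environment read.

```python
import os
from typing import List, Mapping


def load_secrets(env: Mapping[str, str] = os.environ) -> List[bytes]:
    """Return [current, previous], skipping any secret that is unset or empty."""
    names = ("WEBHOOK_SECRET_CURRENT", "WEBHOOK_SECRET_PREVIOUS")  # hypothetical names
    values = (env.get(name, "").strip() for name in names)
    return [value.encode("utf-8") for value in values if value]
```

Calling `load_secrets()` per request (or on a short cache) means a config change takes effect without a redeploy. Once the previous secret is unset, the list shrinks to one entry.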
During the overlap, you typically verify in one of two ways:
- try new, then fall back to old
- verify both and record which one would have passed
Define success signals before the change so you don’t guess later. Track:
- signature pass rate (overall, and by endpoint if you have multiple)
- 4xx/5xx error rate on the receiver
- delivery latency (provider timestamp vs processed timestamp)
- retry volume (spikes often mean verification failures)
Pick an exit rule and stick to it, for example: 99%+ signatures passing on the new secret for 24 hours, no increase in retries, and stable latency. Then schedule removal of the old secret.
Step-by-step: implement dual verification on the receiver
To rotate a webhook secret without downtime, your receiver needs to accept signatures made with either of two secrets for a short window: the new secret and the old one.
Put both secrets in config (env vars or a secret manager), and load them as an ordered list. Keep the verification function small so you can unit test it without booting your whole app.
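import hashlib
import hmac

# New secret first; the old secret is optional and may be empty.
secrets = [NEW_SECRET, OLD_SECRET]

def verify(raw_body: bytes, headers: dict) -> bool:
    sig = headers.get("X-Signature", "")
    for secret in secrets:
        if not secret:
            continue
        # Assumes hex-encoded HMAC-SHA256; match your provider's scheme.
        expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
        # Constant-time comparison avoids leaking timing information.
        if hmac.compare_digest(sig.encode(), expected.encode()):
            return True
    return False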
Details that prevent pain later:
- try the new secret first, then fall back to the old
- use constant-time comparison (or a safe helper from your crypto library)
- return the same error response for any signature failure (don’t reveal which check failed)
- keep the function pure: input is raw body + headers, output is true/false
- add focused tests: valid with new, valid with old, invalid, missing header
One practical rule that avoids a lot of mystery failures: compute the HMAC over the exact raw payload bytes you received. Parsing JSON and re-serializing it often changes whitespace or key order.
If you inherited AI-generated webhook code that mixes parsing, verification, and business logic in one handler, split verification into its own small function first. That one change makes dual verification much safer to ship.
Observability during rotation: what to log and alert on
Secret rotation fails quietly when you can’t see which secret validated a request, or why validation failed. Treat signature validation like an auth system: clear logs, simple metrics, and alerts that catch real problems without constant noise.
Log signature failures using a small set of reason buckets so you can group and act:
- missing signature header
- timestamp missing or out of range
- body read/parse error
- canonical string mismatch
- HMAC mismatch (new)
- HMAC mismatch (old)
Also track which secret validated successful requests. A counter like webhook_validated_total{secret="new"} vs ...{secret="old"} tells you whether partners are still using the old secret and whether dual verification is working.
A compact checklist that stays safe:
- Log: request ID, provider event ID, reason bucket, and which secret validated (new/old)
- Metric: total requests, total failures, validated-by-new vs validated-by-old
- Alert: sustained spike in failures (rate and absolute count)
- Alert: old-secret validations staying high past the planned overlap
- Safety: never log raw secrets; avoid full payloads if they contain PII, tokens, or payment details
Request IDs and event IDs matter because retries and duplicates look like random failures without them. If you see the same event ID failing repeatedly, it often points to a canonicalization bug rather than an attacker.
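As a rough illustration of the checklist, the sketch below records which secret validated each request and logs a reason bucket on failure. It assumes hex HMAC-SHA256 signatures and uses a `collections.Counter` as a stand-in for a real metrics client; the function and log field names are hypothetical.

```python
import hashlib
import hmac
import logging
from collections import Counter
from typing import Dict, Optional

log = logging.getLogger("webhooks")
validated_total = Counter()  # stand-in for a real metrics client


def _reject(reason: str, request_id: str) -> None:
    """Count the failure and log its reason bucket, never the secret or signature."""
    validated_total["invalid"] += 1
    log.warning("webhook_rejected reason=%s request_id=%s", reason, request_id)
    return None


def verify_and_record(raw_body: bytes, signature: Optional[str],
                      secrets: Dict[str, bytes], request_id: str) -> Optional[str]:
    """Return the label ("new" or "old") of the secret that validated, else None."""
    if not signature:
        return _reject("missing_signature", request_id)
    for label, secret in secrets.items():  # dicts preserve insertion order
        expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
        if hmac.compare_digest(signature.encode(), expected.encode()):
            validated_total[label] += 1
            log.info("webhook_validated secret=%s request_id=%s", label, request_id)
            return label
    return _reject("hmac_mismatch", request_id)
```

Pass the secrets as `{"new": ..., "old": ...}` so the new secret is tried first. The counter then gives the new-versus-old split directly.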
Cutover playbook: deploy order, monitoring, and rollback
A clean cutover is mostly about order. Start by making the receiver more tolerant, then switch the sender, then tighten again.
Deploy order (safe by default)
- Stage 1: Deploy the receiver with dual verification (accept old OR new). Don’t change the sender yet.
- Stage 2: Update the sender/provider to sign with the new secret.
- Stage 3: Watch validation results until most traffic validates with the new secret.
During Stage 1, monitoring should show a baseline: almost all requests validate with the old secret, and new-secret validations are near zero. After Stage 2, you should see a steady shift from old to new.
What to monitor and what “good” looks like
Track counters, not just logs: total webhooks received, valid-new, valid-old, invalid. Alert on a rise in invalid signatures, and also on valid-old staying high longer than expected (it can mean the sender didn’t actually switch).
To end the overlap, use a clear condition so dual verification doesn’t become permanent:
- a minimum overlap time (often 24-72 hours, depending on retry behavior)
- plus: zero old-secret validations for a full window (for example, 6-12 hours)
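The exit condition above can be written down as a small check so that nobody decides it by feel. This is only a sketch: the 48-hour and 12-hour defaults are illustrative picks, and the function name and parameters are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def can_remove_old_secret(
    cutover_at: datetime,
    last_old_validation_at: Optional[datetime],
    now: datetime,
    min_overlap: timedelta = timedelta(hours=48),   # illustrative default
    quiet_window: timedelta = timedelta(hours=12),  # illustrative default
) -> bool:
    """True once the minimum overlap has passed AND the old secret has been quiet."""
    if now - cutover_at < min_overlap:
        return False
    if last_old_validation_at is None:
        # No request has validated with the old secret since cutover.
        return True
    return now - last_old_validation_at >= quiet_window
```

`last_old_validation_at` would come from your metrics or logs, for example the timestamp of the most recent request that validated with the old secret.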
Rollback plan
If invalid signatures spike after switching the sender, revert the sender secret first. Keep dual verification on the receiver throughout the incident. That keeps rollback to one change while you investigate payload formatting, timestamp drift, or the wrong secret being deployed.
Edge cases that cause false signature failures
Most “bad signature” errors during rotation aren’t actually bad secrets. They’re mismatches between what the sender signed and what your receiver verified.
First, confirm you’re using the right secret for the right environment. Teams often have multiple endpoints or environments, and secrets get crossed. It’s common to verify a production event with a staging secret because a worker, queue, or config file points at the wrong place.
If the provider uses timestamped signatures, clock skew can look like a signature failure. Allow a reasonable window (for example 5 minutes) and make sure your servers have accurate time. Don’t accept a huge window unless you’re comfortable with the replay risk.
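A tolerance check might look like the sketch below. It assumes the provider sends a Unix timestamp in seconds as a string; the 300-second default mirrors the 5-minute example, and the function name is hypothetical.

```python
import time
from typing import Optional


def timestamp_in_range(header_ts: Optional[str], now: Optional[float] = None,
                       tolerance_s: int = 300) -> bool:
    """Return True if header_ts is a Unix timestamp within tolerance_s of now."""
    try:
        ts = int(header_ts)
    except (TypeError, ValueError):
        return False  # missing or malformed timestamp: fail closed
    current = time.time() if now is None else now
    return abs(current - ts) <= tolerance_s
```

Logging a failure here under its own reason bucket keeps clock-skew problems from looking like HMAC mismatches.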
Retries and out-of-order delivery also confuse debugging: an older retry can arrive after you’ve flipped secrets. During the overlap, treat the event as valid if either signature verifies, and rely on idempotency to prevent double-processing.
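Idempotency can be as simple as remembering which event IDs were already handled. The sketch below uses an in-memory set purely for illustration; in a real service it would usually be a durable store, such as a database table with a unique constraint on the event ID. The helper name is hypothetical.

```python
from typing import Callable, Set


def process_once(event_id: str, seen: Set[str], handler: Callable[[], None]) -> bool:
    """Run handler only the first time event_id is seen; return True if it ran."""
    if event_id in seen:
        return False
    handler()
    seen.add(event_id)  # mark after success so a failed attempt can be retried
    return True
```

With this in place, a late retry signed with the old secret passes verification but does not trigger the business logic a second time.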
Two quick checks that catch a lot of “mystery failures”:
- verify against the raw request body bytes, not a re-serialized JSON object
- make sure body parsing doesn’t alter whitespace, encoding, or line endings before verification
Finally, be aware that proxies and middleware can transform the body (decompression, charset changes, newline normalization). Even if the payload looks the same in logs, the bytes may not be the same bytes the provider signed.
Common mistakes (and the simple fixes)
Most failed rotations aren’t about crypto. They’re about handling details that change what gets signed, or failures that stay hidden until customers complain.
Parsing JSON before verifying the signature is the classic mistake. Many frameworks re-encode JSON (spacing, key order, Unicode), so the bytes you verify aren’t the bytes the sender signed. Fix: capture the raw request body first, verify on those exact bytes, then parse JSON.
Another common bug is reading the request stream twice. Middleware reads the body for logging, then your handler reads it again for verification, but the second read is empty. Fix: buffer the body once and pass that buffer to both logging and verification.
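The double-read problem is easy to see with a plain byte stream. In this sketch `io.BytesIO` stands in for a framework's request body stream, and `handle`, `log_size`, and `verify` are hypothetical names.

```python
import io

stream = io.BytesIO(b'{"id": 1}')  # stands in for a request body stream
first_read = stream.read()         # the full body
second_read = stream.read()        # empty: the stream is already consumed


def handle(stream, log_size, verify) -> bool:
    """Read the body once and hand the same buffer to logging and verification."""
    raw_body = stream.read()       # the only read
    log_size(len(raw_body))        # log the size, not the payload
    return verify(raw_body)
```

How you capture the raw body depends on your framework. The point is that exactly one component reads the stream and everything else receives the buffer.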
Signature header handling trips people up too. Some providers include prefixes like sha256= or send multiple signatures. Fix: parse the header deliberately, select the right value, and match the provider’s algorithm (sha1 vs sha256).
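Header formats vary by provider, so treat the following as a sketch for one assumed format: a comma-separated list of `scheme=value` pairs, such as `sha1=..., sha256=...`. The function name is hypothetical.

```python
from typing import List


def extract_signatures(header_value: str, scheme: str = "sha256") -> List[str]:
    """Return every signature value in the header that uses the given scheme."""
    candidates = []
    for part in header_value.split(","):
        name, sep, value = part.strip().partition("=")
        if sep and name.strip().lower() == scheme and value.strip():
            candidates.append(value.strip())
    return candidates
```

A request is then valid if any returned candidate matches the expected HMAC. An empty list is treated as a missing signature rather than passed through.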
One security footgun: treating verification errors as “probably fine.” Timeouts, malformed headers, decode errors, and missing fields should be hard failures, not soft passes. Fix: fail closed, return a clear 4xx, and log a reason bucket.
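Failing closed can be sketched as a thin wrapper around the verification call. The status code 401 is an assumption (some teams prefer 400), and `handle_webhook` and the log fields are hypothetical names.

```python
import logging

log = logging.getLogger("webhooks")


def handle_webhook(raw_body: bytes, headers: dict, verify) -> int:
    """Return an HTTP status: 200 only when verification explicitly succeeds."""
    try:
        ok = bool(verify(raw_body, headers))
    except Exception as exc:
        # Any error during verification counts as a failure, never a pass.
        log.warning("webhook_rejected reason=verification_error error_type=%s",
                    type(exc).__name__)
        ok = False
    return 200 if ok else 401  # 401 is an assumed choice of 4xx
```

The caller sees the same response whether the signature mismatched or verification crashed. The log line still tells you which of the two happened.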
Safely removing the old secret and tightening security
Once your overlap window is done and the new secret is consistently passing verification, remove the old secret from config. Leaving it “just in case” quietly increases your attack surface and makes it harder to know what you’re actually validating.
Before you delete anything, confirm you have a clean success signal: a full business cycle with zero unexpected signature failures and no unexplained fallbacks to the old secret.
A safe sequence:
- stop accepting the old secret (remove it from dual verification, or disable it via a feature flag)
- remove the old secret from your secret store and runtime config
- review where the secret might have leaked (old CI logs, debug dumps, shared vault tokens)
- lock down permissions so only a small set of owners can read or change webhook secrets
- document the runbook: owner, exact steps, success criteria, rollback steps, and where to look in logs
If you suspect the secret was exposed (repo history, screenshots, vendor support tickets), rotate immediately even if you’re mid-project.
Also document where signature verification lives in the codebase: the exact module/function, how the raw body is captured (a common failure point), and which headers are used.
Quick checklist for a no-drama rotation
Treat rotation like a small migration: overlap, measure, then remove.
Before you switch anything
- Deploy receiver code that accepts both signatures (old and new).
- Add dashboards for pass rate, fail rate, and validations split by secret version.
- Confirm you can quickly change the sender secret and that you have a rollback toggle.
During the cutover
- Deploy receiver dual verification first, then switch the sender.
- Watch invalid signatures during the first few minutes and again after your normal retry window.
- Keep logs safe: include event type, timestamp, sender ID, and which secret validated. Don’t log raw payloads, raw signatures, or secrets.
After the switch
- Wait long enough for retries and delayed deliveries to finish (often at least one full retry window, sometimes 24 hours).
- When charts show zero old-secret validations for the full window you chose, delete the old secret.
- Write a short audit note: when you rotated, who approved it, what you monitored, and when the old secret was removed.
Realistic example: rotating a payment webhook secret
A small SaaS app takes card payments and receives payment.succeeded events from its payment provider. The team plans a short overlap window where the receiver accepts signatures from both the old and new secrets.
On Monday morning they deploy receiver v2 with dual signature verification. Nothing changes at the provider yet. For the first hour, almost every request validates with the old secret, and the new secret counter stays near zero (expected).
After lunch they update the provider to start signing with the new secret. Within minutes the graphs flip: valid_new climbs, valid_old slowly drops (from retries still in flight), and invalid_both stays flat. That’s the key success signal.
They keep logs and counters that answer one question fast: what happened to this event?
webhook_received event=payment.succeeded valid=old request_id=8f2...
webhook_received event=payment.succeeded valid=new request_id=912...
webhook_received event=payment.succeeded valid=none reason=signature_mismatch request_id=aa1...
metrics: valid_old=120 valid_new=118 invalid_both=0
Then a bug appears: invalid_both spikes right after a framework update. Both secrets failing at the same time is a strong hint the app is verifying the wrong bytes (body parsing or encoding changes). They fix the code to validate against the raw payload, redeploy, and the spike disappears.
The next day, after a quiet period, they remove the old secret and keep alerting on signature failures.
Next steps if your webhook code is unreliable
If you’re attempting webhook secret rotation without downtime and the receiver keeps rejecting real requests, don’t treat it as a rotation problem. Treat it as a verification problem.
Start by hardening the raw-body path. Most signature bugs happen because the payload gets changed before you compute HMAC (JSON parsing and re-serialization, whitespace changes, character encoding). Verify the signature against the exact bytes that arrived, then parse only after it passes.
Add a small set of automated tests that match real production failures:
- valid signature with the exact raw request body (should pass)
- one byte changed in the body (should fail)
- wrong secret (should fail)
- missing signature header (should fail with a clear log)
- multiple signature values (should pick the right one or fail predictably)
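The first four cases translate into short tests like the ones below. This sketch assumes hex HMAC-SHA256 and an `X-Signature` header, and it uses a simplified single-secret `verify`. The functions are plain asserts that a runner such as pytest would pick up. The multiple-signature case is left out because it depends on the provider's header format.

```python
import hashlib
import hmac

SECRET = b"test-secret"    # hypothetical test value
BODY = b'{"amount": 100}'


def sign(secret: bytes, body: bytes) -> str:
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(raw_body: bytes, headers: dict, secret: bytes = SECRET) -> bool:
    sig = headers.get("X-Signature")
    if not sig:
        return False
    return hmac.compare_digest(sig.encode(), sign(secret, raw_body).encode())


def test_valid_signature_passes():
    assert verify(BODY, {"X-Signature": sign(SECRET, BODY)})


def test_one_byte_changed_fails():
    assert not verify(b'{"amount": 101}', {"X-Signature": sign(SECRET, BODY)})


def test_wrong_secret_fails():
    assert not verify(BODY, {"X-Signature": sign(b"other-secret", BODY)})


def test_missing_header_fails():
    assert not verify(BODY, {})
```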
Before production, do a staging dry run using the same steps you’ll use for real: enable dual verification, send webhooks signed with the old secret and the new secret, and confirm logs and alerts behave the way you expect.
If your webhook handler was generated by tools like Lovable, Bolt, v0, Cursor, or Replit and it’s behaving oddly under retries or rotations, a focused review can save you a long incident. FixMyMess (fixmymess.ai) does codebase diagnosis and repair for AI-generated apps, including webhook signature validation, safe logging, and deployment prep.
FAQ
How do I rotate a webhook secret without breaking deliveries?
Use an overlap window where your receiver accepts signatures created with either the old secret or the new secret. Deploy dual verification first, switch the provider second, and only remove the old secret after retries have drained and you see almost all validations using the new secret.
Why does “just changing the secret” cause webhook failures?
Because the sender and receiver rarely switch at the exact same moment. Providers may roll changes gradually, your servers may deploy across instances over minutes, and delayed retries can arrive later signed with the previous secret, so a single-secret “flip” causes real events to fail verification.
What does webhook “downtime” look like during secret rotation?
It usually means verification is failing, so your handler rejects real requests as if they had been tampered with. The provider will usually retry, but you can still get delayed processing, retry storms, rate limiting, and duplicates if your handler isn’t idempotent, so it can look like downtime even though your server is up.
Should I verify the signature on parsed JSON or the raw request body?
Verify the HMAC against the exact raw request body bytes you received, before any JSON parsing or re-serialization. Parsing first often changes whitespace, key order, or encoding, which changes the signature result even when the secret is correct.
During the overlap, should I try the new secret first or the old one?
Default to new first, then old, and record which one succeeded. That keeps you aligned with the intended direction of migration while still accepting late retries signed with the old secret.
How long should I accept both secrets during rotation?
Keep it longer than your worst-case delay, including provider retries, your internal queue delays, and any manual replays. A common baseline is 24–72 hours, but the practical rule is: don’t remove the old secret until old-secret validations drop to zero for a full window you trust.
What should I monitor while rotating webhook secrets?
Track total webhooks received, total signature failures, and the split of successful validations by secret (new vs old). Also watch receiver 4xx/5xx rates, delivery latency, and retry volume so you can spot verification failures before they become customer-visible issues.
What’s safe and useful to log for signature verification issues?
Log a request or event identifier, a small reason bucket (missing header, timestamp out of range, body read error, HMAC mismatch), and whether validation succeeded via new or old. Avoid logging raw secrets, raw signatures, or full payloads if they may contain sensitive data.
What’s the safest rollback plan if signature failures spike?
Put dual verification on the receiver first and leave it in place. If failures spike after switching the provider, roll back the provider/sender secret first, because that’s usually the fastest single change, while you investigate body parsing, header parsing, timestamps, or the wrong secret being deployed.
What are the most common bugs that cause “invalid signature” during rotation?
The classic ones are reading the request body twice (second read is empty), verifying after middleware has altered the body, mishandling signature headers with prefixes or multiple values, and using the wrong environment’s secret. Fix by buffering the raw body once, parsing headers deliberately, verifying before parsing, and failing closed with consistent 4xx responses. Use constant-time comparison as well: skipping it doesn’t cause invalid signatures, but it weakens the check against timing attacks.