Webhook Retry Policy: Exponential Backoff & Schema

Design a webhook retry policy: exponential backoff timing, what to retry, max attempts, dead-lettering, and the database schema to track deliveries.

8 min read · 2026-06-19
#webhooks #architecture #reliability

TL;DR: A webhook retry policy answers three questions — which failures to retry, how long to wait between attempts, and when to give up.

  • Retry only what can succeed later: 5xx and 429 yes; 4xx (except 429) and 3xx no — fail fast.
  • Use exponential backoff with a ceiling and jitter, e.g. 30s, 1m, 2m, 4m, 8m across about five attempts, with a 1-hour cap on any single delay.
  • Track every attempt in a deliveries plus attempts schema so you can replay, dead-letter, and report a real success rate.

/ The policy

What is a good webhook retry policy?

A retry policy is the set of rules that decide what happens after a delivery fails. A sane default has four moving parts: retry on 5xx, 429, and network or timeout errors; back off exponentially with a ceiling so you stop hammering a struggling endpoint; cap the total at around five attempts (with a 30-second base and doubling delays that spans roughly a quarter of an hour, so raise the base or the count if receivers need hours to recover); then move the delivery to a dead-letter state for manual replay instead of retrying forever.

Everything else is tuning. The numbers below are a starting point, not a law — a payment provider and an internal analytics sink deserve different ceilings — but the shape is the same across almost every production webhook system, from Stripe's webhook delivery down to a single WordPress site.

/ Backoff

Why exponential backoff instead of fixed intervals?

Because the failure is usually the receiver being briefly overloaded or down. A fixed-interval retry hammers a struggling endpoint at a constant rate and can keep it down; exponential backoff gives the receiver geometrically more time to recover on each attempt while still retrying quickly for a one-second blip.

Add jitter — a small random offset on each delay — so a fleet of senders that all failed at the same moment does not re-fire in lockstep and create a thundering herd at every interval boundary. Even on a single site, jitter spreads retries off the exact minute mark and away from the rest of your cron load.

FIG 01: Delivery retry state machine. A queued delivery moves to sending. A 2xx response marks it delivered. A 5xx, 429, or timeout schedules a backoff retry (wait backoff(attempt)) while attempts remain, and dead-letters the delivery for manual replay once they are exhausted. A 4xx other than 429, or a 3xx, marks the delivery permanently_failed.
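The transitions in the figure can be sketched as one pure function. This is a minimal illustration, not the plugin's code: the function name, the outcome labels ('success', 'transient', 'permanent'), and the default of five attempts are assumptions chosen to mirror the figure.

```php
<?php
/**
 * Hypothetical helper: decide the next status after one HTTP attempt.
 *
 * @param string $outcome       'success' (2xx), 'transient' (5xx, 429, timeout)
 *                              or 'permanent' (4xx other than 429, or 3xx).
 * @param int    $attempts_made Attempts made so far, including this one.
 * @param int    $max_attempts  Retry budget (assumed default of 5).
 */
function next_delivery_status( string $outcome, int $attempts_made, int $max_attempts = 5 ): string {
    switch ( $outcome ) {
        case 'success':
            return 'delivered';
        case 'permanent':
            return 'permanently_failed';
        case 'transient':
            // Back to the queue while budget remains, otherwise dead-letter.
            return $attempts_made < $max_attempts ? 'queued' : 'dead';
        default:
            throw new InvalidArgumentException( "Unknown outcome: {$outcome}" );
    }
}
```

Keeping the decision free of I/O makes it easy to unit test every edge of the diagram without sending a single request.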

/ The schedule

What backoff schedule should you use?

Double the delay each attempt from a small base, cap it, and stop after a fixed count. A widely used shape is base 30 seconds, factor 2, ceiling 1 hour: roughly 30s, 1m, 2m, 4m, 8m, with any further attempts continuing to double (16m, 32m) until they reach the cap on the eighth. The formula is delay = min(cap, base * 2^(attempt - 1)), plus a jitter term.

This is exactly the schedule the Webhook Actions plugin ships: 5xx and 429 responses retry with delays of about 30s, 60s, 120s, 240s, and 480s, capped at one hour, with a default of five attempts that you can override with the fswa_max_attempts filter.

PHP — exponential backoff with jitter

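/**
 * Seconds to wait before retry number $attempt (1-based).
 */
function next_retry_delay( int $attempt ): int {
    $base = 30;     // seconds
    $cap  = 3600;   // 1 hour ceiling

    // Clamp the exponent: attempts below 1 behave like the first retry,
    // and very large attempt counts cannot overflow into a float.
    $exponent = min( 30, max( 0, $attempt - 1 ) );
    $delay    = (int) min( $cap, $base * ( 2 ** $exponent ) );

    // Add up to 20% jitter so retries do not synchronize.
    // Jitter is applied after the cap, so the result can exceed $cap by up to 20%.
    return $delay + random_int( 0, (int) ( $delay * 0.2 ) );
}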
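To see where the ceiling actually takes effect, it can help to print the raw schedule without jitter. The sketch below is a jitter-free variant of the function above; the names base_retry_delay and retry_schedule are hypothetical and exist only for this illustration.

```php
<?php
/**
 * Jitter-free delay for a 1-based attempt number (illustrative helper).
 */
function base_retry_delay( int $attempt, int $base = 30, int $cap = 3600 ): int {
    // Clamp the exponent so 2 ** n stays a safe integer.
    $exponent = min( 30, max( 0, $attempt - 1 ) );
    return (int) min( $cap, $base * ( 2 ** $exponent ) );
}

/**
 * Delays in seconds for attempts 1..$attempts.
 */
function retry_schedule( int $attempts ): array {
    $schedule = [];
    for ( $i = 1; $i <= $attempts; $i++ ) {
        $schedule[] = base_retry_delay( $i );
    }
    return $schedule;
}

echo implode( ', ', retry_schedule( 8 ) ), "\n";
// prints: 30, 60, 120, 240, 480, 960, 1920, 3600
```

The first five delays add up to 930 seconds, about 15.5 minutes before jitter, and the one-hour ceiling is first reached on the eighth delay. If receivers need hours rather than minutes to recover, raise the base or the attempt count accordingly.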

/ The schema

What database schema tracks retries?

Two tables. A deliveries row represents one logical webhook send and carries the state machine; an attempts row records each physical HTTP try so you keep the full request and response history. Splitting them keeps the hot path (find due deliveries) on a small, well-indexed table while the verbose request and response bodies live separately.

The columns that matter on deliveries: an event_id as the idempotency key, the endpoint and payload, a status enum (queued, sending, delivered, failed, dead), an attempt_count, and a next_attempt timestamp the runner queries against. Index on (status, next_attempt) so "what is due now" stays fast as the table grows.

SQL — deliveries table (the hot path)

CREATE TABLE wp_webhook_deliveries (
  id            BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  event_id      CHAR(36)     NOT NULL,            -- idempotency key
  endpoint_url  VARCHAR(512) NOT NULL,
  payload       LONGTEXT     NOT NULL,
  status        VARCHAR(20)  NOT NULL DEFAULT 'queued',
  attempt_count SMALLINT     NOT NULL DEFAULT 0,
  next_attempt  DATETIME     NULL,
  created_at    DATETIME     NOT NULL,
  UNIQUE KEY uq_event (event_id),
  KEY due (status, next_attempt)
);
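The (status, next_attempt) index exists to serve one query: which queued deliveries are due now. The sketch below expresses that predicate over in-memory rows so it runs without a database; the function name, the batch limit of 50, and the SQL in the comment are assumptions, not code from any plugin. It also assumes that enqueueing sets next_attempt to the current time, since a NULL value would never match.

```php
<?php
/**
 * Assumed SQL equivalent:
 *   SELECT * FROM wp_webhook_deliveries
 *   WHERE status = 'queued' AND next_attempt <= :now
 *   ORDER BY next_attempt
 *   LIMIT :limit
 *
 * @param array  $rows  Rows shaped like the deliveries table.
 * @param string $now   Current time as 'Y-m-d H:i:s' (sorts chronologically as a string).
 * @param int    $limit Maximum batch size (hypothetical default).
 */
function select_due_deliveries( array $rows, string $now, int $limit = 50 ): array {
    $due = array_filter(
        $rows,
        static function ( array $row ) use ( $now ): bool {
            return 'queued' === $row['status']
                && null !== $row['next_attempt']
                && strcmp( $row['next_attempt'], $now ) <= 0;
        }
    );

    // Oldest due delivery first, matching ORDER BY next_attempt.
    usort(
        $due,
        static function ( array $a, array $b ): int {
            return strcmp( $a['next_attempt'], $b['next_attempt'] );
        }
    );

    return array_slice( $due, 0, $limit );
}
```

In a real runner the same predicate would run in the database, ideally with the selected rows moved to sending in the same step so two workers do not pick up the same delivery.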

/ What to retry

Which responses should trigger a retry?

Retry 5xx and 429; never retry 4xx (except 429) or 3xx. A 5xx means the receiver broke on a request that may be fine next time. A 429 means slow down — back off and try again. A 4xx like 400 or 422 means the payload itself is wrong, so retrying just fails again more slowly; mark it permanently failed and surface it. A 3xx redirect is a configuration problem, not a transient one.

Treat network-level failures (DNS, connection refused, timeouts) like 5xx: retry them. They are the most common transient failure of all, and a request that timed out may already have been processed by the receiver, which is exactly why the idempotency key in the next section matters.
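The rules above reduce to a small classifier. This is a sketch under stated assumptions: the function name and the three labels are hypothetical, and a null status code stands in for any failure where no HTTP response arrived at all.

```php
<?php
/**
 * Classify one delivery attempt (illustrative helper).
 *
 * @param int|null $status_code HTTP status, or null when no response arrived
 *                              (DNS failure, connection refused, timeout).
 * @return string 'success', 'transient' (retry) or 'permanent' (fail fast).
 */
function classify_delivery_outcome( ?int $status_code ): string {
    if ( null === $status_code ) {
        return 'transient'; // network-level failure: treat like 5xx
    }
    if ( $status_code >= 200 && $status_code < 300 ) {
        return 'success';
    }
    if ( 429 === $status_code || $status_code >= 500 ) {
        return 'transient'; // receiver overloaded or broken: time may fix it
    }
    return 'permanent'; // 3xx, other 4xx, or anything unexpected
}
```

Unexpected codes fall through to 'permanent' on purpose: failing fast on something unknown is cheaper than retrying a request that can never succeed.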

| Response              | Hand-rolled wp_remote_post                 | Webhook Actions                                |
|-----------------------|--------------------------------------------|------------------------------------------------|
| 5xx / timeout         | Retried only if you wrote the loop yourself | Retried with exponential backoff automatically |
| 429 Too Many Requests | Usually ignored and treated as success     | Retried, respecting the backoff schedule       |
| 4xx / 3xx             | Often blindly retried, wasting cycles      | Marked permanently_failed immediately          |

/ When to stop

How do you stop retrying — dead-letter and idempotency?

Cap the attempts. When a delivery exhausts its budget, set its status to dead and surface it for manual replay rather than retrying forever; an endpoint that has failed five times in a row, with the gap doubling each time, is down, not flaky. A dead-letter state turns a silent failure into a queue someone can act on.

Pair retries with an idempotency key — the event_id column above — sent as a header so the receiver can dedupe. Any retry policy is at-least-once delivery by definition: a request that timed out may have succeeded on the receiver before your timeout fired, so a retry delivers it twice. The key lets the receiver collapse those duplicates, the same pattern Stripe documents for its own webhooks. For a WordPress-specific build of this whole loop, see the retry and replay system walkthrough.
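On the receiving side, deduplication can be sketched as follows. Everything here is illustrative: the header name X-Webhook-Event-Id, the SeenEvents class, and the handler are assumptions, and the in-memory set stands in for durable storage such as a table with a unique key on the event ID.

```php
<?php
/**
 * Hypothetical in-memory record of processed event IDs.
 */
final class SeenEvents {
    private array $seen = [];

    /** Returns true only the first time a given event ID is offered. */
    public function firstTime( string $event_id ): bool {
        if ( isset( $this->seen[ $event_id ] ) ) {
            return false;
        }
        $this->seen[ $event_id ] = true;
        return true;
    }
}

/**
 * Hypothetical receiver: returns the HTTP status to send back.
 * The header name is an assumption, not a documented standard.
 */
function handle_webhook( array $headers, string $body, SeenEvents $seen, callable $process ): int {
    $event_id = $headers['X-Webhook-Event-Id'] ?? '';
    if ( '' === $event_id ) {
        return 400; // no key, so duplicates cannot be detected
    }
    if ( ! $seen->firstTime( $event_id ) ) {
        return 200; // duplicate from a retry: acknowledge, do not reprocess
    }
    $process( $body );
    return 200;
}
```

Acknowledging a duplicate with a 2xx matters: answering with an error would make the sender retry an event the receiver has already handled. A production receiver would also need to decide what happens when processing fails after the ID has been recorded, which this sketch leaves out.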

Retrying a 400 is just a slower way to fail. Backoff is for problems that time can fix. — Retry design, in one line

Webhook Actions implements this policy out of the box: exponential-backoff retries on 5xx and 429, a per-attempt delivery log, dead-letter visibility, and one-click replay — no schema or backoff code to maintain.

/ Footnotes

¹ Webhook delivery, retries, and idempotency guidance: docs.stripe.com/webhooks.
² Retry schedule and fswa_max_attempts filter: Webhook Actions on WordPress.org.
³ WordPress HTTP request function used to deliver webhooks: wp_remote_post().

FAQ

Common questions people always ask.

Don't see yours? Open an issue on GitHub or check the full reference in the API docs.

What is exponential backoff for webhooks?
Exponential backoff doubles the wait between retry attempts from a small base up to a ceiling, for example 30s, 1m, 2m, 4m, 8m with a 1-hour cap on any single delay. It gives a failing receiver geometrically more time to recover on each attempt while still retrying quickly for brief blips. Add jitter so senders do not retry in lockstep.

Which HTTP responses should trigger a webhook retry?
Retry 5xx server errors, 429 Too Many Requests, and network or timeout failures, because these can succeed later. Do not retry other 4xx responses (like 400 or 422) or 3xx redirects: those mean the request or its configuration is wrong, so retrying just fails again. Mark them permanently failed and surface them.

How many times should you retry a webhook?
A common default is about five attempts with exponentially growing gaps. With a 30-second base that covers roughly a quarter of an hour; raise the base or the count to stretch it over hours. After the budget is exhausted, move the delivery to a dead-letter state for manual replay rather than retrying forever: an endpoint that has failed five times in a row is down, not flaky.

What database schema should I use to track webhook deliveries?
Use two tables: a deliveries row per logical send (event_id, endpoint, payload, status, attempt_count, next_attempt) and an attempts row per physical HTTP try (response code, body, duration, error). Index deliveries on (status, next_attempt) so finding due deliveries stays fast.

Why do I need an idempotency key with retries?
Any retry policy is at-least-once delivery: a request that timed out may have already succeeded on the receiver, so a retry delivers it twice. Sending a stable idempotency key (an event_id) as a header lets the receiver detect and collapse duplicates safely.

Ready

Stop losing webhooks.
Start logging them.

$ wp plugin install flowsystems-webhook-actions --activate