TL;DR: A webhook retry policy answers three questions — which failures to retry, how long to wait between attempts, and when to give up.
- Retry only what can succeed later: 5xx and 429 yes; 4xx (except 429) and 3xx no — fail fast.
- Use exponential backoff with a ceiling and jitter, e.g. 30s, 1m, 2m, 4m, 8m capped at 1 hour over ~5 attempts.
- Track every attempt in a deliveries plus attempts schema so you can replay, dead-letter, and report a real success rate.
/ The policy
What is a good webhook retry policy?
A retry policy is the set of rules that decides what happens after a delivery fails. A sane default has four moving parts: retry on 5xx, 429, and network or timeout errors; back off exponentially with a ceiling so you stop hammering a struggling endpoint; cap the total at around five attempts (well under an hour end to end with the schedule below; raise the count if you need coverage over hours); then move the delivery to a dead-letter state for manual replay instead of retrying forever.
Everything else is tuning. The numbers below are a starting point, not a law — a payment provider and an internal analytics sink deserve different ceilings — but the shape is the same across almost every production webhook system, from Stripe's webhook delivery down to a single WordPress site.
/ Backoff
Why exponential backoff instead of fixed intervals?
Because the failure is usually the receiver being briefly overloaded or down. A fixed-interval retry hammers a struggling endpoint at a constant rate and can keep it down; exponential backoff gives the receiver geometrically more time to recover on each attempt while still retrying quickly for a one-second blip.
Add jitter — a small random offset on each delay — so a fleet of senders that all failed at the same moment does not re-fire in lockstep and create a thundering herd at every interval boundary. Even on a single site, jitter spreads retries off the exact minute mark and away from the rest of your cron load.
/ The schedule
What backoff schedule should you use?
Double the delay each attempt from a small base, cap it, and stop after a fixed count. A widely used shape is base 30 seconds, factor 2, ceiling 1 hour: 30s, 1m, 2m, 4m, 8m for the first five attempts, continuing through 16m and 32m if you allow more, with the eighth attempt onward pinned at the 1-hour cap. The formula is delay = min(cap, base * 2^(attempt - 1)), plus a jitter term.
This is exactly the schedule the Webhook Actions plugin ships: 5xx and 429 responses retry with delays of about 30s, 60s, 120s, 240s, and 480s, capped at one hour, with a default of five attempts that you can override with the fswa_max_attempts filter.
PHP — exponential backoff with jitter
```php
function next_retry_delay( int $attempt ): int {
    $base  = 30;   // seconds
    $cap   = 3600; // 1 hour ceiling
    $delay = min( $cap, $base * ( 2 ** ( $attempt - 1 ) ) );

    // add up to 20% jitter so retries do not synchronize
    return $delay + random_int( 0, (int) ( $delay * 0.2 ) );
}
```
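To see where the ceiling actually applies, it can help to tabulate the formula without the jitter term. The sketch below is illustrative only: `base_retry_delay` and `retry_schedule` are hypothetical names, and the 30-second base and 1-hour cap are the example values from this article, not settings read from any plugin.

```php
<?php
// Jitter-free version of delay = min(cap, base * 2^(attempt - 1)).
// Function names and defaults are assumptions made for this illustration.
function base_retry_delay( int $attempt, int $base = 30, int $cap = 3600 ): int {
    // Clamp the exponent so attempt < 1 behaves like 1 and large values cannot overflow.
    $exponent = max( 0, min( $attempt - 1, 30 ) );
    return (int) min( $cap, $base * ( 2 ** $exponent ) );
}

// Returns the delay in seconds for each attempt, keyed by attempt number.
function retry_schedule( int $attempts ): array {
    $schedule = array();
    for ( $attempt = 1; $attempt <= $attempts; $attempt++ ) {
        $schedule[ $attempt ] = base_retry_delay( $attempt );
    }
    return $schedule;
}

echo implode( ', ', retry_schedule( 9 ) ), "\n";
// 30, 60, 120, 240, 480, 960, 1920, 3600, 3600
```

The first five delays add up to 930 seconds, about 15.5 minutes, and the ceiling first applies at the eighth attempt, so the cap only matters once the attempt count is raised above the default.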
/ The schema
What database schema tracks retries?
Two tables. A deliveries row represents one logical webhook send and carries the state machine; an attempts row records each physical HTTP try so you keep the full request and response history. Splitting them keeps the hot path (find due deliveries) on a small, well-indexed table while the verbose request and response bodies live separately.
The columns that matter on deliveries: an event_id as the idempotency key, the endpoint and payload, a status enum (queued, sending, delivered, failed, dead), an attempt_count, and a next_attempt timestamp the runner queries against. Index on (status, next_attempt) so "what is due now" stays fast as the table grows.
SQL — deliveries table (the hot path)
```sql
CREATE TABLE wp_webhook_deliveries (
    id            BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    event_id      CHAR(36)     NOT NULL, -- idempotency key
    endpoint_url  VARCHAR(512) NOT NULL,
    payload       LONGTEXT     NOT NULL,
    status        VARCHAR(20)  NOT NULL DEFAULT 'queued',
    attempt_count SMALLINT     NOT NULL DEFAULT 0,
    next_attempt  DATETIME     NULL,
    created_at    DATETIME     NOT NULL,
    UNIQUE KEY uq_event (event_id),
    KEY due (status, next_attempt)
);
```
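The article shows only the deliveries table, so the shape of an attempts row is left open. As a rough sketch of what one such row might carry, the helper below builds the record for a single HTTP try. Every column name, the `build_attempt_row` function, and the body-size limit are assumptions made for illustration, not a schema taken from the plugin.

```php
<?php
// Hypothetical attempts row: one record per physical HTTP try.
// All field names here are assumptions; adapt them to your own schema.
function build_attempt_row(
    int $delivery_id,
    int $attempt_number,
    ?int $http_status,
    ?string $error,
    string $response_body,
    int $duration_ms,
    int $max_body_bytes = 65536
): array {
    return array(
        'delivery_id'    => $delivery_id,     // points at the deliveries row
        'attempt_number' => $attempt_number,  // 1 for the first try
        'http_status'    => $http_status,     // null when no response arrived
        'error'          => $error,           // transport error message, if any
        // Truncate the stored body so one verbose receiver cannot bloat the table.
        'response_body'  => substr( $response_body, 0, $max_body_bytes ),
        'duration_ms'    => $duration_ms,
        'attempted_at'   => gmdate( 'Y-m-d H:i:s' ), // UTC, fits a DATETIME column
    );
}
```

Keeping the verbose fields here, rather than on the deliveries row, is what lets the due-delivery lookup stay on the small table.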
/ What to retry
Which responses should trigger a retry?
Retry 5xx and 429; never retry 4xx (except 429) or 3xx. A 5xx means the receiver broke on a request that may be fine next time. A 429 means slow down — back off and try again. A 4xx like 400 or 422 means the payload itself is wrong, so retrying just fails again more slowly; mark it permanently failed and surface it. A 3xx redirect is a configuration problem, not a transient one.
Treat network-level failures (DNS errors, connection refused, timeouts) like 5xx: retry them. They are the most common transient failure of all, and a request that timed out may already have been processed by the receiver, which is exactly why the idempotency key in the final section matters.
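The rules above reduce to a small classifier. This is a minimal sketch, assuming the caller passes the HTTP status code or `null` when no response arrived at all; `classify_response` and its three return values are hypothetical names chosen for this example.

```php
<?php
// Maps one attempt's result to 'delivered', 'retry' or 'failed'.
// $status is the HTTP status code, or null for a network-level failure
// (DNS error, connection refused, timeout). Names are illustrative.
function classify_response( ?int $status ): string {
    if ( null === $status ) {
        return 'retry'; // no response: treat like a 5xx
    }
    if ( $status >= 200 && $status <= 299 ) {
        return 'delivered';
    }
    if ( 429 === $status ) {
        return 'retry'; // rate limited: back off and try again
    }
    if ( $status >= 500 && $status <= 599 ) {
        return 'retry'; // receiver-side failure that may clear up
    }
    return 'failed'; // 3xx and other 4xx: retrying will not help
}
```

In a WordPress sender, one plausible mapping is to pass `null` when `is_wp_error()` is true for the result of `wp_remote_post()`, and otherwise pass the value of `wp_remote_retrieve_response_code()`.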
| Response | Hand-rolled wp_remote_post | Webhook Actions |
|---|---|---|
| 5xx / timeout | Retried only if you wrote the loop yourself | Retried with exponential backoff automatically |
| 429 Too Many Requests | Usually ignored and treated as success | Retried, respecting the backoff schedule |
| 4xx (except 429) / 3xx | Often blindly retried, wasting cycles | Marked permanently_failed immediately |
/ When to stop
How do you stop retrying — dead-letter and idempotency?
Cap the attempts. When a delivery exhausts its budget, set its status to dead and surface it for manual replay rather than retrying forever; an endpoint that has failed every attempt across the whole backoff window is down, not flaky. A dead-letter state turns a silent failure into a queue someone can act on.
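The stopping rule can be sketched as a state transition applied after each attempt. The status values match the enum described earlier (queued, delivered, failed, dead); the function name, the `$outcome` strings, and the default of five attempts are assumptions for illustration, and jitter is left out so the result is deterministic.

```php
<?php
// Decides a delivery's next state after an attempt has finished.
// $outcome: 'delivered', 'retry' or 'failed' (hypothetical labels).
// $attempt_count already includes the attempt that just ran.
// $now is a Unix timestamp; next_attempt is returned as a Unix timestamp too.
function next_delivery_state( string $outcome, int $attempt_count, int $now, int $max_attempts = 5 ): array {
    if ( ! in_array( $outcome, array( 'delivered', 'retry', 'failed' ), true ) ) {
        throw new InvalidArgumentException( 'Unknown outcome: ' . $outcome );
    }
    if ( 'delivered' === $outcome ) {
        return array( 'status' => 'delivered', 'next_attempt' => null );
    }
    if ( 'failed' === $outcome ) {
        // Permanent failure (for example a 400): do not schedule anything.
        return array( 'status' => 'failed', 'next_attempt' => null );
    }
    if ( $attempt_count >= $max_attempts ) {
        // Budget exhausted: park it for manual replay.
        return array( 'status' => 'dead', 'next_attempt' => null );
    }
    // Base 30 s, factor 2, cap 1 hour; jitter omitted in this sketch.
    $exponent = max( 0, min( $attempt_count - 1, 30 ) );
    $delay    = (int) min( 3600, 30 * ( 2 ** $exponent ) );
    return array( 'status' => 'queued', 'next_attempt' => $now + $delay );
}
```

Returning the new state instead of writing it keeps the rule easy to test; the runner would persist `status` and `next_attempt` on the deliveries row.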
Pair retries with an idempotency key — the event_id column above — sent as a header so the receiver can dedupe. Any retry policy is at-least-once delivery by definition: a request that timed out may have succeeded on the receiver before your timeout fired, so a retry delivers it twice. The key lets the receiver collapse those duplicates, the same pattern Stripe documents for its own webhooks. For a WordPress-specific build of this whole loop, see the retry and replay system walkthrough.
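On the receiving side, deduplication can be as simple as claiming each event ID before processing it. The sketch below is a rough illustration under stated assumptions: `SeenEvents` and `handle_webhook` are hypothetical names, the store is in memory only, and the event ID is assumed to have been read from whatever header the sender uses (something like `X-Webhook-Event-Id`, which is an invented name here).

```php
<?php
// In-memory record of event IDs already handled. A real receiver would persist
// these, for example in a table with a unique key on the event ID.
final class SeenEvents {
    /** @var array<string, bool> */
    private $seen = array();

    // Returns true the first time an ID is claimed, false for a duplicate.
    public function claim( string $event_id ): bool {
        if ( isset( $this->seen[ $event_id ] ) ) {
            return false;
        }
        $this->seen[ $event_id ] = true;
        return true;
    }

    // Gives the ID back so a later retry can be processed.
    public function release( string $event_id ): void {
        unset( $this->seen[ $event_id ] );
    }
}

// Returns the HTTP status the receiver should answer with.
function handle_webhook( SeenEvents $seen, string $event_id, callable $process ): int {
    if ( ! $seen->claim( $event_id ) ) {
        return 200; // duplicate: acknowledge without reprocessing
    }
    try {
        $process();
    } catch ( Throwable $e ) {
        $seen->release( $event_id ); // let the sender's retry through
        return 500;
    }
    return 200;
}
```

Answering a duplicate with a 2xx matters: it tells the sender the delivery is done, so the retry loop stops instead of burning the remaining attempts.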
Retrying a 400 is just a slower way to fail. Backoff is for problems that time can fix. — Retry design, in one line
Webhook Actions implements this policy out of the box: exponential-backoff retries on 5xx and 429, a per-attempt delivery log, dead-letter visibility, and one-click replay — no schema or backoff code to maintain.
fswa_max_attempts filter: Webhook Actions on WordPress.org.