/ Article — Architecture

Inside My Webhook Replay System: Reliable WordPress Webhooks with Retry and Replay

Most WordPress webhooks fire once. If the receiving API fails, the event disappears. No retry. No visibility. No recovery.

/ The Problem

The Webhook Problem

WordPress fires a webhook, the receiving endpoint is down for 30 seconds, and the event is gone forever. No retry. No alert. The order completed, the form submitted, the customer updated — but the downstream system never heard about it.

This is the default behavior of every fire-and-forget webhook implementation. The HTTP call is made once, inline, during the PHP request. If it fails, nothing records the failure. If the endpoint returns a 500, nothing reschedules delivery. If PHP crashes mid-request, the attempt disappears entirely.

The problem compounds over time. Integration drift accumulates silently — your CRM, ERP, or automation platform slowly diverges from WordPress reality. Users report inconsistencies days later, when there is no forensic record of what happened and no mechanism to recover the lost events.

Reliable webhook delivery requires two distinct mechanisms: automatic retry for failed deliveries, and manual replay for events that delivered successfully but need to be resent. Most implementations provide one or neither. This article explains how to build both, and why both are necessary for a production-grade system.

See also: why WordPress webhooks silently fail in production — a detailed breakdown of every structural failure mode in the default WordPress webhook model.

/ Reliability

Why Retries Alone Are Not Enough

Automatic retry covers the most common failure scenario: a transient network error, a temporary endpoint outage, a brief API rate limit. You send the webhook, it fails, you wait and try again. After a few attempts it succeeds and the problem resolves itself.

But retry logic cannot solve every delivery problem. Consider what happens after a successfully delivered event:

Your receiving system had a bug. It accepted the webhook and returned 200, but processed the payload incorrectly — wrote wrong data to the database, triggered the wrong automation, or silently dropped the record due to a validation error. By the time you discover the problem, the retry window has long passed. The event has status "delivered." No retry will ever fire.

Or your integration was rebuilt. You migrated to a new CRM, redeployed your automation platform, rewrote the receiving endpoint. Now you need to resend six months of order events to populate the new system. Retries only cover the recent past — and only failures at that.

Or a downstream service had an outage. Events delivered successfully, but the service did not process them correctly during the outage window. Their support team tells you to resend the events from that period. You have no mechanism to do that.

Replay is the mechanism that covers these cases. It lets you resend any previously delivered event — on demand, to the original endpoint or a new one — regardless of how long ago it was delivered. Retry handles failures automatically. Replay handles the rest manually.

/ Mechanisms

Retry vs Replay

The two mechanisms are distinct in trigger, process, and use case. Conflating them leads to systems that do one poorly while lacking the other entirely.

Mechanism   Trigger                                                    Use case
Retry       Delivery failure (5xx, 429, timeout, network error)        Temporary network/API issues; system recovers automatically
Replay      Manual, after delivery succeeded or retries are exhausted  Bug fixes, system rebuilds, integration resync, data recovery

Retry is automatic and reactive — it fires when something goes wrong, without human intervention. Replay is manual and intentional — a developer or operator decides to resend a specific event or batch of events.

Both mechanisms require the same prerequisite: the original payload must be stored. You cannot retry or replay an event you have not persisted. This is the foundational design decision that everything else depends on.

/ Architecture

Architecture of the System

The system has three components: payload storage, a retry queue, and a replay mechanism. Every webhook event is stored first; the retry queue and the replay mechanism then operate on that stored record as needed.

[Figure: webhook retry and replay architecture diagram]

When a WordPress event fires — an order completed, a form submitted, a post published — the first action is always to store the payload. Not to send it. Storage happens synchronously, before any HTTP call is made. Once persisted, the event cannot be lost, regardless of what happens next.

The delivery attempt is then made asynchronously, outside the WordPress request cycle. Success writes to the event log and marks the event as replay-eligible. Failure enqueues the event for automatic retry with exponential backoff. After exhausting retries, the event reaches dead-letter status and waits for manual intervention.

Replay-eligible events remain in the log indefinitely. Any of them can be manually triggered for resend — individually or in bulk — without creating a new event or modifying the original payload.
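The lifecycle described above can be sketched as a small state function. This is an illustrative Python sketch, not the plugin's PHP code: the `Status` names and `status_after_attempt` are hypothetical, and the budget of five attempts is taken from the backoff schedule in the Retry Logic section.

```python
from enum import Enum


class Status(Enum):
    # Hypothetical status names; the article only names "delivered" and "dead letter".
    PENDING = "pending"
    DELIVERED = "delivered"
    RETRYING = "retrying"
    DEAD_LETTER = "dead_letter"


# Assumption: five attempts in total, as in the article's backoff schedule.
MAX_ATTEMPTS = 5


def status_after_attempt(attempt_number: int, succeeded: bool) -> Status:
    """Return the event status after the given 1-based attempt has finished."""
    if succeeded:
        return Status.DELIVERED
    if attempt_number >= MAX_ATTEMPTS:
        # Retry budget exhausted: wait for manual intervention.
        return Status.DEAD_LETTER
    return Status.RETRYING
```

The sketch deliberately leaves out permanent failures (non-retryable 4xx responses); how those are routed is a separate decision.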

/ Storage

Payload Storage

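Every webhook event is stored before any delivery is attempted. The stored record includes everything needed to reconstruct the original delivery and any subsequent retry or replay. Judging by the fields the rest of this article relies on, that means at least the event UUID, the event type, the original event timestamp, the target URL, and the payload itself.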

The stored payload is immutable after creation. Retries and replays read from it; they do not modify it. This guarantees that every delivery attempt — across all retries and replays — sends exactly the same payload the original event generated.

Storage happens in the same database transaction as the WordPress event. If storage fails, the WordPress transaction rolls back cleanly, so WordPress never records an event that the webhook log is missing. If storage succeeds and the subsequent delivery attempt fails, the payload is already persisted and the event can be queued for retry.
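As a rough illustration of store-before-send, the sketch below persists an event with Python's standard `sqlite3` module and makes no HTTP call at all. It is not the plugin's code: the table name `webhook_events`, its columns, and the helper names are assumptions made for this example.

```python
import json
import sqlite3
import uuid
from datetime import datetime, timezone


def open_store(path=":memory:"):
    """Open the event store. Table and column names are hypothetical."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS webhook_events (
               event_id   TEXT PRIMARY KEY,
               event_type TEXT NOT NULL,
               target_url TEXT NOT NULL,
               payload    TEXT NOT NULL,
               created_at TEXT NOT NULL
           )"""
    )
    return db


def store_event(db, event_type, target_url, payload):
    """Persist the event and return its ID. Delivery happens later, elsewhere."""
    event_id = str(uuid.uuid4())
    with db:  # commits on success, rolls back if the insert raises
        db.execute(
            "INSERT INTO webhook_events VALUES (?, ?, ?, ?, ?)",
            (
                event_id,
                event_type,
                target_url,
                json.dumps(payload, sort_keys=True),  # serialized once, never rewritten
                datetime.now(timezone.utc).isoformat(),
            ),
        )
    return event_id


def load_payload(db, event_id):
    """Read the stored payload verbatim; retries and replays only ever read it."""
    row = db.execute(
        "SELECT payload FROM webhook_events WHERE event_id = ?", (event_id,)
    ).fetchone()
    return row[0] if row else None
```

Serializing the payload once at storage time is what keeps later attempts byte-identical: every retry or replay sends the stored string rather than re-encoding the original data.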

/ Retry

Retry Logic

Retry fires automatically on any delivery failure that has a realistic chance of succeeding on a subsequent attempt. The key distinction is between transient failures — where retry makes sense — and permanent failures, where it does not.

Retry on: 5xx responses (server errors), connection timeouts, network errors, 429 rate limit responses.

Do not retry on: 4xx responses (except 429). A 400 means the payload is malformed. A 401 means authentication failed. A 404 means the URL no longer exists. None of these resolve themselves through retrying — they require a code fix, a configuration change, or human review. Retrying 4xx responses burns the entire attempt budget on an unrecoverable failure and masks the real problem.
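The classification above fits in a few lines. The following Python sketch is illustrative only; `is_retryable` is a hypothetical helper, and passing `None` to represent a timeout or network error (no HTTP response at all) is a convention assumed for this example.

```python
from typing import Optional


def is_retryable(status_code: Optional[int]) -> bool:
    """Decide whether a failed delivery is worth another attempt.

    status_code is None when no HTTP response was received
    (connection timeout or network error).
    """
    if status_code is None:
        return True  # timeout or network error: transient
    if status_code == 429:
        return True  # rate limited: the one retryable 4xx
    # 5xx is retryable; 2xx needs no retry; other 4xx is a permanent failure.
    return 500 <= status_code <= 599
```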

Retry attempts use exponential backoff to avoid hammering a struggling endpoint:

Attempt 1  →  immediate
Attempt 2  →  +30s
Attempt 3  →  +2m
Attempt 4  →  +10m
Attempt 5  →  +1h
→  Dead letter (manual review required)
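One way to encode this schedule is a lookup table of delays. The Python sketch below is illustrative: `RETRY_DELAYS` and `next_attempt_at` are hypothetical names, and it assumes each delay is measured from the moment the previous attempt failed, which the schedule does not state explicitly.

```python
from datetime import datetime, timedelta
from typing import Optional

# Delay before attempts 2 to 5, copied from the schedule above.
# Assumption: each delay counts from the previous attempt's failure time.
RETRY_DELAYS = [
    timedelta(seconds=30),
    timedelta(minutes=2),
    timedelta(minutes=10),
    timedelta(hours=1),
]


def next_attempt_at(failed_attempt: int, failed_at: datetime) -> Optional[datetime]:
    """Return when to try again after the given 1-based attempt failed.

    Returns None once the budget is exhausted, meaning dead letter.
    """
    if failed_attempt < 1:
        raise ValueError("attempt numbers start at 1")
    if failed_attempt > len(RETRY_DELAYS):
        return None
    return failed_at + RETRY_DELAYS[failed_attempt - 1]
```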

Each retry reads the stored payload and sends an identical request. The X-Event-ID header carries the same UUID as the original attempt, allowing the receiving endpoint to deduplicate if needed. The X-Event-Timestamp always carries the original event time, not the retry time.
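A minimal Python sketch of that rule: the request is built only from the stored record, so no attempt can differ from another. `build_request` and the record's field names are hypothetical, and the `Content-Type` header and ISO 8601 timestamp format are assumptions; only `X-Event-ID` and `X-Event-Timestamp` come from the text above.

```python
def build_request(event: dict) -> dict:
    """Build the HTTP request for any attempt from the stored event record.

    Nothing here depends on the attempt number or the current time,
    so every retry sends an identical request.
    """
    return {
        "url": event["target_url"],
        "headers": {
            "Content-Type": "application/json",  # assumption: JSON payloads
            "X-Event-ID": event["event_id"],  # same UUID on every attempt
            "X-Event-Timestamp": event["created_at"],  # original event time
        },
        "body": event["payload"],  # stored string, sent verbatim
    }
```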

After exhausting all retry attempts, the event moves to dead-letter status. It remains in the log, visible, with a complete attempt history. From dead-letter status it can be manually replayed once the underlying problem — a misconfigured endpoint, a broken authentication key, a malformed payload — has been fixed.

See also: async webhooks in WordPress — the architectural approach to moving webhook dispatch outside the PHP request cycle entirely.

/ Replay

Replay Logic

Replay resends a previously stored event. It works on any event in the log — delivered, failed, dead-letter — regardless of age. The trigger is always manual: a developer or operator selects an event and initiates replay.

The replay process reads the stored payload verbatim and sends it as a new HTTP request. By default it uses the original target URL, but an override can redirect delivery to a new endpoint. The X-Event-ID header intentionally carries the same UUID as the original event. The receiving endpoint should recognize the UUID and treat the replay as a redelivery of a known event, updating its existing record rather than creating a new entry.
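On the receiving side, that behavior amounts to an upsert keyed on the event ID. The Python sketch below is a hypothetical in-memory receiver, shown only to illustrate the idea; a real endpoint would key a database row on the same header.

```python
class Receiver:
    """Hypothetical receiving endpoint that keys its records on X-Event-ID."""

    def __init__(self):
        self.records = {}  # event_id -> latest payload seen for that event

    def handle(self, headers: dict, payload: str) -> str:
        event_id = headers["X-Event-ID"]
        is_new = event_id not in self.records
        # Upsert: a replay overwrites the existing record instead of adding one.
        self.records[event_id] = payload
        return "created" if is_new else "updated"
```

With this shape, replaying an event any number of times leaves exactly one record per event ID.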

Replay does not modify the original stored event. It creates a new delivery attempt record linked to the same event. The event log shows the full history: original delivery, any retry attempts, and all replay attempts — each with timestamp, HTTP status, and response body.
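The relationship between an event and its attempts might be modeled as below. This is an illustrative Python sketch with hypothetical class and field names; the recorded fields (timestamp, HTTP status, response body) follow the paragraph above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class Attempt:
    kind: str  # "original", "retry", or "replay"
    at: datetime
    http_status: Optional[int]  # None when no response was received
    response_body: str


@dataclass
class EventLog:
    event_id: str
    payload: str  # never changed after creation
    attempts: List[Attempt] = field(default_factory=list)

    def record(self, kind, http_status, response_body, at=None):
        """Append a new attempt; existing attempts and the payload stay untouched."""
        attempt = Attempt(kind, at or datetime.now(timezone.utc), http_status, response_body)
        self.attempts.append(attempt)
        return attempt
```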

Bulk Replay

Individual replay handles one-off cases. Bulk replay handles migrations, outages, and integration resyncs. You select a date range, an event type, or a set of target endpoints, and the system re-queues all matching events for delivery.

Bulk replay runs through the same delivery queue as normal dispatches and retries — it does not bypass the rate limiting or concurrency controls. This prevents a large replay batch from overwhelming a receiving endpoint that is processing at its normal capacity.
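A bulk replay then reduces to two steps: select matching events, and append them to the queue that normal deliveries already use. The Python sketch below is illustrative; the function names, the record fields, and the half-open date range are assumptions made for this example.

```python
from collections import deque
from datetime import datetime


def select_for_replay(events, start: datetime, end: datetime,
                      event_type=None, target_url=None):
    """Return IDs of events created in [start, end) that match the optional filters."""
    return [
        e["event_id"]
        for e in events
        if start <= e["created_at"] < end
        and (event_type is None or e["event_type"] == event_type)
        and (target_url is None or e["target_url"] == target_url)
    ]


def enqueue_replays(queue: deque, event_ids) -> int:
    """Append replay jobs to the shared delivery queue and return how many were added.

    Using the same queue means replays inherit its rate limiting and
    concurrency controls instead of bypassing them.
    """
    for event_id in event_ids:
        queue.append(("replay", event_id))
    return len(event_ids)
```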

When to Use Replay

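Replay is the right tool when retry has already completed, either successfully or by exhausting its attempts, and you need to resend. The common scenarios are the ones described earlier: a bug in the receiving system that mishandled events it had accepted, a rebuilt or migrated integration that needs to be repopulated, a downstream outage window that has to be resent, and dead-letter events whose underlying problem has since been fixed.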

/ Lessons

Lessons Learned

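The principles that shaped the design of this system apply to any webhook reliability implementation, and each appears in the sections above: store the payload before sending, retry only transient failures, keep every stored event replayable, and log every delivery attempt.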

The systems that cause the most integration problems are not those with aggressive retry logic or complex replay mechanisms. They are the ones with no persistence at all — where every failed delivery is simply forgotten and every successful delivery is unrecoverable.

/ Production Alternative

If you'd rather not build this yourself

Flow Systems Webhook Actions is the plugin this architecture describes. It stores every payload before dispatch, retries failed deliveries with exponential backoff, classifies 4xx as permanent failures so retries are never wasted, and makes every successfully delivered event available for manual replay — individually or in bulk. The full delivery history, including attempt timestamps and HTTP status codes, is visible directly in the WordPress admin and queryable programmatically via the REST API.

If your team prefers configuration over maintaining custom queue infrastructure, see the full plugin details. Open source, distributed via WordPress.org and GitHub.

/ Final Thoughts

Final Thoughts

WordPress ships with no webhook persistence, no retry logic, and no replay mechanism. Each of these capabilities has to be built or supplied by a plugin. The gap between the default behavior and production-grade reliability is not small.

The architecture described here — store before send, automatic retry with exponential backoff, manual replay on demand — covers the full range of delivery failure scenarios. Transient failures recover automatically. Permanent failures surface for human review. Successful deliveries remain replayable indefinitely.

The observability layer is what makes the rest of it trustworthy. Without a complete log of every delivery attempt, you cannot verify that the system is working — you are back to hoping the events arrived.
