Webhooks that work perfectly in local development routinely drop events in production. The reasons are structural: WordPress was not built for reliable background event delivery. This article maps every common failure mode and the infrastructure required to eliminate it.
The symptom pattern is consistent across WordPress and WooCommerce sites of every size. Some webhook events never arrive at the receiving endpoint — no error surfaces in the WordPress admin, no log entry, no alert. The order completed, the form submitted, the status changed — but the downstream system never heard about it.
Retries never happen because there is nothing tracking whether the delivery succeeded or failed. The attempt was made inline, during the PHP request, and the result was never persisted anywhere. Once the request ended, the event was gone.
Users report inconsistent behavior: "sometimes the CRM updates, sometimes it doesn't." Support tickets arrive days after the fact, when someone notices a discrepancy between the WordPress order history and the connected platform. By then the delivery window has long closed, the PHP logs have rotated, and there is no forensic record of what happened.
The logs that do exist are incomplete. WordPress does not log outbound HTTP requests by default. Unless you have explicitly wired up per-attempt logging, a failed wp_remote_post call produces nothing observable. The silence is the failure mode.
These symptoms share a single root cause: the WordPress request lifecycle is not designed for reliable background event delivery. Understanding why requires understanding how PHP executes code.
PHP executes synchronously within a single HTTP request. When a browser or API client hits a WordPress page, PHP boots, runs the request handlers, and terminates. Every line of code in that request — including any outbound HTTP calls — must complete before the response is returned to the caller.
This model is entirely appropriate for rendering pages. It becomes a structural liability the moment you try to use it for reliable event delivery. A webhook call attached to woocommerce_order_status_completed runs inline, inside the request that triggered the order completion. If that outbound call is slow, the user's page is slow. If it fails, the event is gone. If PHP crashes mid-request, nothing was recorded.
The most common webhook implementation in WordPress looks like this — and this is exactly the pattern that fails silently in production:
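A minimal sketch of that pattern (the endpoint URL is illustrative):

```php
<?php
// The fragile pattern: dispatch inline, inside the triggering request,
// and discard the result. Do NOT ship this to production.
add_action( 'woocommerce_order_status_completed', function ( $order_id ) {
    wp_remote_post( 'https://example-crm.test/webhook', array(
        'timeout' => 5,
        'headers' => array( 'Content-Type' => 'application/json' ),
        'body'    => wp_json_encode( array( 'order_id' => $order_id ) ),
    ) );
    // Return value ignored: a timeout, a 500, or a WP_Error vanishes silently.
} );
```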
If this call times out or PHP crashes mid-request, the event is lost. There is no retry, no log entry, no signal that delivery failed. The order shows as completed in WooCommerce — but your CRM, ERP, or automation platform never received the event.
The fix is not to add error checking to this pattern. The fix is to move webhook dispatch out of the request cycle entirely — into a persistent queue that survives PHP crashes, retries on failure, and logs every attempt regardless of outcome.
When developers reach for a background processing solution in WordPress, WP-Cron is the natural starting point. It ships with core, requires no server configuration, and appears to offer scheduled execution. In production, it falls short in ways that directly cause webhook delivery failures — see the full breakdown of why WP-Cron is not enough for reliable automation.
WP-Cron does not run on a time-based schedule. It fires on page load. When a request hits WordPress, PHP checks whether any scheduled cron events are overdue and runs them as part of that request. This means on a site with zero traffic — at 3am, over a weekend, during a server maintenance window — WP-Cron does not fire. Jobs queue in the database, the worker never runs, and webhook events pile up undelivered for the entire zero-traffic window.
Shared hosting compounds the problem. Many hosts impose execution time limits and terminate long-running PHP processes. A WP-Cron batch that processes fifty queued webhooks may be killed partway through, leaving some jobs in an inconsistent state — marked as processing but with no delivery attempted.
The reliable alternative is to disable WP-Cron's page-load trigger and run it from a real system cron job instead. Add define( 'DISABLE_WP_CRON', true ); to wp-config.php, then configure a system crontab entry that hits the WordPress cron URL every minute:
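The two pieces together look like this (replace example.com with your site URL):

```shell
# In wp-config.php: stop WP-Cron from piggybacking on page loads
#   define( 'DISABLE_WP_CRON', true );

# In the system crontab: trigger the WordPress cron runner every minute
* * * * * curl -s 'https://example.com/wp-cron.php?doing_wp_cron' > /dev/null 2>&1
```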
With a system cron entry in place, the webhook worker runs on a guaranteed schedule. A low-traffic site at 3am gets the same delivery timeliness as the same site during peak hours. For step-by-step setup — including the crontab entry, WP-CLI alternative, and Action Scheduler — see Cron Job for WordPress: WP-Cron Limits and Real Fixes.
The gap between a fire-and-forget wp_remote_post call and a production-grade webhook system is not about writing better PHP. It is about the infrastructure that surrounds the HTTP call. Every cell in this table represents a deliberate design decision — and each missing feature in the fire-and-forget column is a category of silent failure.
| Aspect | Fire-and-forget | Production-grade |
|---|---|---|
| Execution model | Inline, blocks PHP | Background worker |
| Persistence | None — lost on crash | Queue table (survives restarts) |
| Event identity | No UUID | UUID + timestamp headers |
| Retry on failure | Never | Smart retry (5xx, 429 only) |
| Backoff strategy | None | Exponential with jitter |
| 4xx handling | Retried (wastes attempts) | Immediate permanent failure |
| Permanent failure state | None | Dead-letter with history |
| Attempt history | None | Per-attempt log record |
| Queue monitoring | None | Depth, age, stuck detection |
| Manual retry | Not possible | UI + bulk retry tools |
| Payload versioning | None | Version field + schema stability |
Each of these features addresses a specific failure mode. Removing any one of them reintroduces that failure mode. The comparison table is also a checklist: a reliable webhook delivery system needs all eleven properties.
Every webhook event needs a stable, globally unique identifier generated at enqueue time — not at dispatch time, and not regenerated on retries. The UUID travels with the event across every delivery attempt, including all retries. The receiving endpoint uses this UUID to deduplicate: if it has already processed event uuid-abc-123, it discards subsequent deliveries with the same ID.
This matters because retry logic and idempotency are inseparable. A retry-capable system will, by definition, sometimes deliver the same event more than once — network errors can occur after the endpoint has processed the request but before it returned a 2xx response. Without a stable event UUID, the receiver has no way to distinguish a legitimate new event from a duplicate retry.
Three standard headers carry the event identity on every request:
- X-Webhook-ID — the stable UUID generated at enqueue time. Same value on every attempt.
- X-Webhook-Timestamp — Unix epoch timestamp of the original event, not the retry time.
- X-Webhook-Version — payload schema version. Allows the receiver to route to the correct parser as your payload structure evolves.
A version field in the payload body itself reinforces the schema contract and allows schema evolution without breaking existing consumers who are pinned to an older version.
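A dispatch sketch with these headers attached, assuming a hypothetical $event queue row holding uuid, created_at, payload_version, endpoint, and payload:

```php
<?php
// Identity headers travel with every delivery attempt, including retries.
// $event is a queue-table row fetched by the worker (names assumed).
$headers = array(
    'Content-Type'        => 'application/json',
    'X-Webhook-ID'        => $event->uuid,                 // same UUID on every attempt
    'X-Webhook-Timestamp' => (string) $event->created_at,  // original event time, not retry time
    'X-Webhook-Version'   => $event->payload_version,      // schema version for the receiver's parser
);
$response = wp_remote_post( $event->endpoint, array(
    'timeout' => 10,
    'headers' => $headers,
    'body'    => $event->payload,
) );
```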
Not all failures are equal, and treating them equally is one of the most common mistakes in webhook retry logic. The HTTP status code the endpoint returns carries precise information about what went wrong — and that information should directly determine whether retrying makes sense.
Retryable failures are transient. The endpoint was unavailable or overloaded, and the same request will likely succeed once the condition clears:
- 5xx errors — server-side errors (500, 502, 503, 504). The endpoint was reached but encountered an internal problem. Retry with backoff.
- 429 Too Many Requests — the endpoint is rate-limiting. Retry after the backoff interval, honoring any Retry-After header if present.
- Network-level WP_Error — DNS failure, connection timeout, SSL handshake error. The endpoint was not reached at all. Retry.
Non-retryable failures are permanent. The endpoint understood the request and rejected it. No amount of retries will fix a structural problem:
- 400 Bad Request — the payload is malformed from the endpoint's perspective.
- 401 Unauthorized / 403 Forbidden — authentication or authorization failure. A configuration problem, not a transient outage.
- 404 Not Found / 410 Gone — the endpoint URL no longer exists.
- 422 Unprocessable Entity — the payload structure is valid JSON but fails schema validation.
Retrying 4xx responses wastes the entire retry budget on an unrecoverable failure. Mark these as permanently failed after the first attempt and surface them for human review immediately.
The is_wp_error() check is critical: it catches network failures that never produce an HTTP status code at all — DNS resolution failure, connection refused, SSL handshake error. These are distinct from HTTP errors and must be handled separately. See the wp_remote_post() documentation for the full return value specification.
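The routing rules above condense into a small pure function. This is a hypothetical helper, not a WordPress API; in a real worker, $is_network_error would come from is_wp_error( $response ) and $code from wp_remote_retrieve_response_code( $response ):

```php
<?php
// Map a delivery outcome to a queue action: 'complete', 'retry', or 'failed'.
// $is_network_error: true when the transport failed (DNS, timeout, SSL);
// $code: HTTP status code, 0 when no response was received.
function classify_delivery( bool $is_network_error, int $code ): string {
    if ( $is_network_error ) {
        return 'retry';      // endpoint never reached: transient by definition
    }
    if ( $code >= 200 && $code < 300 ) {
        return 'complete';   // delivered successfully
    }
    if ( 429 === $code || $code >= 500 ) {
        return 'retry';      // rate limit or server error: transient
    }
    return 'failed';         // other 4xx: permanent, do not burn retries
}
```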
Retrying immediately after a failure is rarely the right choice. An endpoint that just returned a 503 is under stress. Hammering it with immediate retries makes the situation worse — for the endpoint and for every other client hitting it. Exponential backoff spaces retries progressively further apart, giving the endpoint time to recover.
The formula is simple: each retry waits twice as long as the one before it.

`delay = base_delay * 2 ^ (attempt - 1)`

With a base delay of 60 seconds (1 minute), five attempts produce the following schedule: 1 minute, 2 minutes, 4 minutes, 8 minutes, and 16 minutes between successive attempts.

Five attempts at base delay 60 seconds cover a 31-minute total retry window. This is long enough to survive transient outages and short-lived infrastructure incidents, without keeping a failed event in the active queue indefinitely.
For high-volume systems where many events may fail simultaneously — for example, during an endpoint outage — add jitter: a small random offset applied to each retry delay. Jitter prevents all failed events from retrying at exactly the same second, which would produce a thundering herd that re-stresses the recovering endpoint instead of allowing it to stabilize.
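A minimal backoff helper along those lines (the function name is an assumption; jitter is applied as a random offset of up to the given fraction of the delay):

```php
<?php
// Exponential backoff with optional jitter. Returns the delay in seconds
// to wait before the next attempt. Hypothetical helper, not a core API.
function webhook_backoff_delay( int $attempt, int $base = 60, float $jitter = 0.0 ): int {
    $delay = $base * ( 2 ** ( $attempt - 1 ) );  // 60s, 120s, 240s, 480s, 960s...
    if ( $jitter > 0 ) {
        $offset = (int) round( $delay * $jitter );
        // Random offset spreads simultaneous retries so a recovering
        // endpoint is not hit by a thundering herd at the same second.
        $delay += random_int( -$offset, $offset );
    }
    return $delay;
}
```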
The queue table is what makes every other reliability property possible. Without it, there is nothing to retry, nothing to log against, and nothing to monitor. The queue is the single source of truth for the state of every webhook event from enqueue to delivery or permanent failure.
The minimal schema requires these columns: id, uuid, endpoint, payload (JSON), payload_version, status, attempt, created_at, scheduled_at, last_attempt_at, last_status_code, last_error.
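One way to express that schema in MySQL (table name, prefix, and index choices are assumptions):

```sql
-- Possible queue schema; column names follow the article.
CREATE TABLE wp_webhook_queue (
    id               BIGINT UNSIGNED  NOT NULL AUTO_INCREMENT,
    uuid             CHAR(36)         NOT NULL,
    endpoint         VARCHAR(2048)    NOT NULL,
    payload          LONGTEXT         NOT NULL,              -- JSON body
    payload_version  VARCHAR(16)      NOT NULL DEFAULT '1',
    status           ENUM('pending','processing','complete','failed')
                                      NOT NULL DEFAULT 'pending',
    attempt          TINYINT UNSIGNED NOT NULL DEFAULT 0,
    created_at       DATETIME         NOT NULL,
    scheduled_at     DATETIME         NOT NULL,
    last_attempt_at  DATETIME         NULL,
    last_status_code SMALLINT         NULL,
    last_error       TEXT             NULL,
    PRIMARY KEY (id),
    UNIQUE KEY uuid (uuid),
    KEY status_scheduled (status, scheduled_at)  -- what the worker queries on
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
```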
Status values drive the worker's query logic: pending (waiting to be dispatched), processing (currently being dispatched by a worker), complete (delivered successfully), failed (exhausted retries or received a permanent 4xx).
The processing status acts as a dispatch lock. Before attempting delivery, the worker sets the row to processing. This prevents two concurrent worker processes from dispatching the same event simultaneously — a real risk if your cron interval is shorter than your delivery timeout. After the attempt completes (success or failure), the worker updates the status to the appropriate final or pending state.
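The claim step can be made atomic with a conditional UPDATE. A sketch using the standard $wpdb object (the table name is an assumption):

```php
<?php
// Claim a row for dispatch: the UPDATE only succeeds if the row is still
// 'pending', so two overlapping workers cannot both claim the same event.
$claimed = $wpdb->query( $wpdb->prepare(
    "UPDATE {$wpdb->prefix}webhook_queue
        SET status = 'processing', last_attempt_at = NOW()
      WHERE id = %d AND status = 'pending'",
    $event_id
) );
if ( 1 !== $claimed ) {
    return; // another worker claimed this row first; skip it
}
```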
```
WordPress Action Hook
        │
        ▼
Queue Table (MySQL)
  { uuid, endpoint, payload, version, status=pending, attempt=0, scheduled_at }
        │
        ▼
Response sent to user  ◄── request ends here
        │
   (background)
        ▼
Cron Worker (system cron preferred)
        │
        ├─ 2xx → status=complete
        ├─ 5xx / timeout → reschedule (exponential backoff)
        ├─ 4xx → status=failed (permanent)
        └─ attempt >= max → status=failed (dead-letter)
```
The critical property of this architecture: the user-facing request ends before any delivery is attempted. The webhook payload is captured reliably in the database, and the worker runs independently. A PHP crash during the user request loses nothing — the queue row was written before the crash. A PHP crash during the worker run leaves rows in processing state, which a stuck-detection job can reset to pending for re-processing.
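An enqueue sketch under this architecture (endpoint URL and table name are illustrative; wp_generate_uuid4() and current_time() are core WordPress functions):

```php
<?php
// The hook handler writes one row and returns immediately.
// Delivery happens later, in the worker — never in this request.
add_action( 'woocommerce_order_status_completed', function ( $order_id ) {
    global $wpdb;
    $now = current_time( 'mysql', true ); // GMT datetime
    $wpdb->insert( $wpdb->prefix . 'webhook_queue', array(
        'uuid'            => wp_generate_uuid4(),   // stable identity, set once
        'endpoint'        => 'https://example-crm.test/webhook',
        'payload'         => wp_json_encode( array( 'order_id' => $order_id ) ),
        'payload_version' => '1',
        'status'          => 'pending',
        'attempt'         => 0,
        'created_at'      => $now,
        'scheduled_at'    => $now,                  -- eligible immediately
    ) );
} );
```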
The queue table tracks the current state of each event. A separate attempt history table tracks every delivery attempt — including those that led to the current state. This distinction matters: the queue row tells you where the event is now; the history tells you how it got there.
Every attempt — success, transient failure, and permanent failure — should produce a log record. At minimum, each record carries the event UUID, the attempt number, the attempt timestamp, the HTTP status code (or the error class, for network-level failures), a hash of the payload, and the error message, if any.
Store the payload hash rather than the raw payload in the log record. The full payload already exists in the queue table against the event row — the log record needs only enough to correlate with that row and establish delivery context. Storing raw payloads in a separate log table doubles the PII surface area and complicates GDPR compliance.
With per-attempt history in place, diagnosing a delivery failure requires no reproduction and no live debugging. You query the history table for the event UUID, read the sequence of status codes, and immediately know whether the failure was a sustained 503 (endpoint outage), a 422 (payload schema mismatch), or a network-level timeout. Root cause analysis without guesswork.
A webhook queue without monitoring is a queue that fails silently — which is where we started. Four metrics cover the operational health of the entire delivery system:

- Queue depth — the count of rows with status = 'pending'. Growing without bound means the worker is not running or cannot keep pace with the enqueue rate. This is the first signal of a cron failure.
- Oldest pending age — the age of the oldest pending row, measured in minutes. A healthy queue drains within one cron interval. Events older than five minutes on a one-minute cron are a signal worth alerting on.
- Failure count — the count of rows with status = 'failed'. A sudden spike indicates a systematic endpoint problem — the URL changed, authentication broke, or the schema was updated without versioning. Slow growth over time indicates a payload issue affecting a consistent subset of events.
- Stuck events — rows with status = 'processing' and last_attempt_at < NOW() - INTERVAL 10 MINUTE. These are worker processes that crashed mid-batch. A stuck-detection job should run periodically and reset these rows to pending.
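A sketch of that reset query, assuming the queue table described above:

```sql
-- Stuck detection: reclaim rows abandoned by a crashed worker so they
-- become eligible for the next worker run.
UPDATE wp_webhook_queue
   SET status = 'pending'
 WHERE status = 'processing'
   AND last_attempt_at < NOW() - INTERVAL 10 MINUTE;
```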
Expose these metrics via a WP-Admin panel for operational visibility, and optionally via a REST endpoint returning JSON for integration with external monitoring tools. A ten-line query wrapped in a REST route is enough for uptime tools, alerting systems, and dashboards to poll on a schedule. The plugin ships this built in — see the REST API reference for the queue status and delivery log endpoints.
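A sketch of such a read-only health endpoint (namespace, route, and table name are assumptions; register_rest_route() is the core WordPress API):

```php
<?php
// Expose queue-health counts as JSON for external monitoring to poll.
add_action( 'rest_api_init', function () {
    register_rest_route( 'webhook-queue/v1', '/health', array(
        'methods'             => 'GET',
        'permission_callback' => function () {
            return current_user_can( 'manage_options' );
        },
        'callback'            => function () {
            global $wpdb;
            $table = $wpdb->prefix . 'webhook_queue';
            return array(
                'pending' => (int) $wpdb->get_var(
                    "SELECT COUNT(*) FROM {$table} WHERE status = 'pending'" ),
                'failed'  => (int) $wpdb->get_var(
                    "SELECT COUNT(*) FROM {$table} WHERE status = 'failed'" ),
                'stuck'   => (int) $wpdb->get_var(
                    "SELECT COUNT(*) FROM {$table}
                      WHERE status = 'processing'
                        AND last_attempt_at < NOW() - INTERVAL 10 MINUTE" ),
            );
        },
    ) );
} );
```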
Dead-letter events need a path back to active delivery. Automated retry schedules handle transient failures, but systematic failures — an endpoint outage that lasted longer than the retry window, a temporary authentication misconfiguration — require human intervention and a reliable manual retry mechanism.
Single-event retry is the base case: reset the event's status to pending, clear scheduled_at to the current time so the worker picks it up immediately on the next run, and preserve the original UUID. The UUID must not be regenerated on manual retry — the receiver must see the same event identifier it would have seen on automated delivery, so its deduplication logic functions correctly.
Bulk retry is essential for recovering from endpoint outages. When an endpoint comes back online after a two-hour incident, you may have dozens or hundreds of failed events that need requeuing simultaneously. A bulk retry UI that accepts an endpoint URL, a date range, and a status filter — and requeues all matching rows in a single operation — turns a potentially hour-long manual task into a thirty-second operation. When you need that recovery triggered automatically — by a monitoring system, a CI pipeline, or an on-call script — the REST API exposes bulk retry and the full delivery log as authenticated HTTP endpoints.
Note that bulk retry only covers failed events. For events that delivered successfully but need to be resent — after a bug in the receiving system, a migration, or a downstream outage — that is a separate mechanism: replay. See how retry and replay differ and how to build both.
Date filtering combined with the attempt history log enables delivery window investigation. If a client reports missing events between 14:00 and 16:00 on a specific date, you can query the history table for that window, identify the failure pattern, and produce a specific list of event UUIDs that need redelivery — without guessing or manual cross-referencing.
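A sketch of that window query, assuming a hypothetical wp_webhook_log history table and an illustrative date:

```sql
-- Delivery-window investigation: every attempt in the reported window,
-- ready to be grouped by status code or event UUID.
SELECT event_uuid, attempted_at, status_code, error
  FROM wp_webhook_log
 WHERE attempted_at BETWEEN '2024-06-01 14:00:00' AND '2024-06-01 16:00:00'
 ORDER BY attempted_at;
```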
Flow Systems Webhook Actions implements this architecture — persistent queue, smart retry routing, exponential backoff, event UUID headers, per-attempt history, and queue observability — without requiring custom infrastructure maintenance. All of it is also accessible programmatically via the REST API. If your team prefers configuration over building and owning a webhook delivery system from scratch, explore the full details at production-grade WordPress webhook plugin.
All implementation patterns described here use WordPress-native APIs. The primary references are the WordPress developer documentation for wp_remote_post(), is_wp_error(), the WP-Cron scheduling functions, and register_rest_route().