/ Article — Reliability

Why WordPress Webhooks Silently Fail in Production

Webhooks that work perfectly in local development routinely drop events in production. The reasons are structural: WordPress was not built for reliable background event delivery. This article maps every common failure mode and the infrastructure required to eliminate it.

/ The Problem

What silent failure looks like

The symptom pattern is consistent across WordPress and WooCommerce sites of every size. Some webhook events never arrive at the receiving endpoint — no error surfaces in the WordPress admin, no log entry, no alert. The order completed, the form submitted, the status changed — but the downstream system never heard about it.

Retries never happen because there is nothing tracking whether the delivery succeeded or failed. The attempt was made inline, during the PHP request, and the result was never persisted anywhere. Once the request ended, the event was gone.

Users report inconsistent behavior: "sometimes the CRM updates, sometimes it doesn't." Support tickets arrive days after the fact, when someone notices a discrepancy between the WordPress order history and the connected platform. By then the delivery window has long closed, the PHP logs have rotated, and there is no forensic record of what happened.

The logs that do exist are incomplete. WordPress does not log outbound HTTP requests by default. Unless you have explicitly wired up per-attempt logging, a failed wp_remote_post call produces nothing observable. The silence is the failure mode.

These symptoms share a single root cause: the WordPress request lifecycle is not designed for reliable background event delivery. Understanding why requires understanding how PHP executes code.

/ Root Cause

WordPress is request-based, not event-infrastructure

PHP executes synchronously within a single HTTP request. When a browser or API client hits a WordPress page, PHP boots, runs the request handlers, and terminates. Every line of code in that request — including any outbound HTTP calls — must complete before the response is returned to the caller.

This model is entirely appropriate for rendering pages. It becomes a structural liability the moment you try to use it for reliable event delivery. A webhook call attached to woocommerce_order_status_completed runs inline, inside the request that triggered the order completion. If that outbound call is slow, the user's page is slow. If it fails, the event is gone. If PHP crashes mid-request, nothing was recorded.

The most common webhook implementation in WordPress looks like this — and this is exactly the pattern that fails silently in production:

fire-and-forget.php — the fragile pattern
// Fragile: fire-and-forget inline webhook call.
// If this times out or PHP crashes mid-request, the event is lost.
// There is no retry, no log entry, no signal that delivery failed.
add_action( 'woocommerce_order_status_completed', function( $order_id ) {
    wp_remote_post( 'https://your-endpoint.example.com/webhook', [
        'headers' => [ 'Content-Type' => 'application/json' ],
        'body'    => wp_json_encode( [ 'order_id' => $order_id ] ),
        'timeout' => 5,
    ] );
    // Return value never checked.
    // No retry on failure. No log on success or failure.
} );

If this call times out or PHP crashes mid-request, the event is lost. There is no retry, no log entry, no signal that delivery failed. The order shows as completed in WooCommerce — but your CRM, ERP, or automation platform never received the event.

The fix is not to add error checking to this pattern. The fix is to move webhook dispatch out of the request cycle entirely — into a persistent queue that survives PHP crashes, retries on failure, and logs every attempt regardless of outcome.
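As a sketch of that shift — reusing the my_enqueue_webhook() helper defined later in this article, which writes to a custom queue table — the hook handler shrinks to a single database write:

```php
// Sketch: the hook only records the event; delivery happens later in a worker.
// my_enqueue_webhook() is a queue-backed helper (defined in the Idempotency section).
add_action( 'woocommerce_order_status_completed', function( $order_id ) {
    my_enqueue_webhook(
        'https://your-endpoint.example.com/webhook',
        [ 'order_id' => $order_id ]
    );
    // The request returns immediately; a cron worker dispatches from the queue.
} );
```

The user-facing request now does one fast, durable INSERT. Everything slow or failure-prone happens out of band.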

/ WP-Cron

WP-Cron is traffic-dependent, not a real scheduler

When developers reach for a background processing solution in WordPress, WP-Cron is the natural starting point. It ships with core, requires no server configuration, and appears to offer scheduled execution. In production, it falls short in ways that directly cause webhook delivery failures — see the full breakdown of why WP-Cron is not enough for reliable automation.

WP-Cron does not run on a time-based schedule. It fires on page load. When a request hits WordPress, PHP checks whether any scheduled cron events are overdue and runs them as part of that request. This means on a site with zero traffic — at 3am, over a weekend, during a server maintenance window — WP-Cron does not fire. Jobs queue in the database, the worker never runs, and webhook events pile up undelivered for the entire zero-traffic window.

Shared hosting compounds the problem. Many hosts impose execution time limits and terminate long-running PHP processes. A WP-Cron batch that processes fifty queued webhooks may be killed partway through, leaving some jobs in an inconsistent state — marked as processing but with no delivery attempted.

The reliable alternative is to disable WP-Cron's page-load trigger and run it from a real system cron job instead. Add define( 'DISABLE_WP_CRON', true ); to wp-config.php, then configure a system crontab entry that hits the WordPress cron URL every minute:

System crontab — reliable one-minute scheduling
# /etc/cron.d/wordpress — runs every minute regardless of site traffic
* * * * *  www-data  curl -s "https://your-site.com/wp-cron.php?doing_wp_cron" > /dev/null 2>&1

With a system cron entry in place, the webhook worker runs on a guaranteed schedule. A low-traffic site at 3am gets the same delivery timeliness as the same site during peak hours. For step-by-step setup — including the crontab entry, WP-CLI alternative, and Action Scheduler — see Cron Job for WordPress: WP-Cron Limits and Real Fixes.

/ Comparison

Synchronous vs Production-Grade Webhooks

The gap between a fire-and-forget wp_remote_post call and a production-grade webhook system is not about writing better PHP. It is about the infrastructure that surrounds the HTTP call. Every cell in this table represents a deliberate design decision — and each missing feature in the fire-and-forget column is a category of silent failure.

Aspect                   | Fire-and-forget            | Production-grade
-------------------------+----------------------------+---------------------------------
Execution model          | Inline, blocks PHP         | Background worker
Persistence              | None — lost on crash       | Queue table (survives restarts)
Event identity           | No UUID                    | UUID + timestamp headers
Retry on failure         | Never                      | Smart retry (5xx, 429 only)
Backoff strategy         | None                       | Exponential with jitter
4xx handling             | Retried (wastes attempts)  | Immediate permanent failure
Permanent failure state  | None                       | Dead-letter with history
Attempt history          | None                       | Per-attempt log record
Queue monitoring         | None                       | Depth, age, stuck detection
Manual retry             | Not possible               | UI + bulk retry tools
Payload versioning       | None                       | Version field + schema stability

Each of these features addresses a specific failure mode. Removing any one of them reintroduces that failure mode. The comparison table is also a checklist: a reliable webhook delivery system needs all eleven properties.

/ Idempotency

Event identity: UUID, versioning, and timestamp headers

Every webhook event needs a stable, globally unique identifier generated at enqueue time — not at dispatch time, and not regenerated on retries. The UUID travels with the event across every delivery attempt, including all retries. The receiving endpoint uses this UUID to deduplicate: if it has already processed event uuid-abc-123, it discards subsequent deliveries with the same ID.

This matters because retry logic and idempotency are inseparable. A retry-capable system will, by definition, sometimes deliver the same event more than once — network errors can occur after the endpoint has processed the request but before it returned a 2xx response. Without a stable event UUID, the receiver has no way to distinguish a legitimate new event from a duplicate retry.

Three standard headers carry the event identity on every request:

X-Webhook-ID — the stable UUID generated at enqueue time. Same value on every attempt.
X-Webhook-Timestamp — Unix epoch timestamp of the original event, not the retry time.
X-Webhook-Version — payload schema version. Allows the receiver to route to the correct parser as your payload structure evolves.

A version field in the payload body itself reinforces the schema contract and allows schema evolution without breaking existing consumers who are pinned to an older version.

webhook-headers.php — UUID generation and standard headers
// UUID is generated at enqueue time — not at dispatch time.
// The same UUID is used on every retry attempt.
function my_enqueue_webhook( $endpoint, $payload, $version = '1' ) {
    global $wpdb;

    $uuid       = wp_generate_uuid4();
    $created_at = current_time( 'mysql', true );

    $wpdb->insert( $wpdb->prefix . 'webhook_queue', [
        'uuid'            => $uuid,
        'endpoint'        => $endpoint,
        'payload'         => wp_json_encode( $payload ),
        'payload_version' => $version,
        'status'          => 'pending',
        'attempt'         => 0,
        'created_at'      => $created_at,
        'scheduled_at'    => $created_at,
    ] );
}

// At dispatch time: attach UUID and timestamp as standard headers.
function my_build_request_args( $job ) {
    return [
        'headers' => [
            'Content-Type'        => 'application/json',
            'X-Webhook-ID'        => $job->uuid,                    // stable across all retries
            'X-Webhook-Timestamp' => strtotime( $job->created_at ), // original event time
            'X-Webhook-Version'   => $job->payload_version,
        ],
        'body'    => $job->payload,
        'timeout' => 10,
    ];
}
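On the receiving side, the UUID is what makes deduplication possible. A minimal sketch, assuming the receiver is also PHP and that my_event_already_processed(), my_process_event(), and my_mark_event_processed() are hypothetical helpers backed by a table of processed event IDs:

```php
// Receiver sketch: discard any delivery whose X-Webhook-ID was already processed.
$event_id = $_SERVER['HTTP_X_WEBHOOK_ID'] ?? '';

if ( $event_id === '' || my_event_already_processed( $event_id ) ) {
    http_response_code( 200 ); // acknowledge duplicates so the sender stops retrying
    exit;
}

my_process_event( json_decode( file_get_contents( 'php://input' ), true ) );
my_mark_event_processed( $event_id );
http_response_code( 200 );
```

Note that duplicates are acknowledged with a 2xx, not rejected — a 4xx or 5xx would only trigger further retries of an event the receiver has already handled.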
/ Retry Logic

Retrying the right failures

Not all failures are equal, and treating them equally is one of the most common mistakes in webhook retry logic. The HTTP status code the endpoint returns carries precise information about what went wrong — and that information should directly determine whether retrying makes sense.

Retryable failures are transient. The endpoint was unavailable or overloaded, and the same request will likely succeed once the condition clears:

5xx errors — server-side errors (500, 502, 503, 504). The endpoint was reached but encountered an internal problem. Retry with backoff.
429 Too Many Requests — the endpoint is rate-limiting. Retry after the backoff interval, honoring any Retry-After header if present.
Network-level WP_Error — DNS failure, connection timeout, SSL handshake error. The endpoint was not reached at all. Retry.

Non-retryable failures are permanent. The endpoint understood the request and rejected it. No amount of retries will fix a structural problem:

400 Bad Request — the payload is malformed from the endpoint's perspective.
401 Unauthorized / 403 Forbidden — authentication or authorization failure. A configuration problem, not a transient outage.
404 Not Found / 410 Gone — the endpoint URL no longer exists.
422 Unprocessable Entity — the payload structure is valid JSON but fails schema validation.

Retrying 4xx responses wastes the entire retry budget on an unrecoverable failure. Mark these as permanently failed after the first attempt and surface them for human review immediately.

webhook-dispatch.php — status-based retry routing
$response    = wp_remote_post( $job->endpoint, my_build_request_args( $job ) );
$status_code = wp_remote_retrieve_response_code( $response );
$attempt     = (int) $job->attempt + 1;

if ( is_wp_error( $response ) ) {
    // Network-level failure: DNS, timeout, SSL. Retryable.
    my_handle_retryable_failure( $job, $attempt, null, $response->get_error_message() );
} elseif ( $status_code >= 200 && $status_code < 300 ) {
    // Success.
    my_queue_update( $job->id, 'complete', $attempt, $status_code );
} elseif ( $status_code >= 500 || $status_code === 429 ) {
    // Transient server error or rate limit. Retryable.
    my_handle_retryable_failure( $job, $attempt, $status_code, null );
} else {
    // 4xx — payload or config problem. Permanent failure. Do not retry.
    my_queue_update( $job->id, 'failed', $attempt, $status_code );
}

// Log every attempt regardless of outcome.
my_log_attempt( $job, $attempt, $status_code, $response );

The is_wp_error() check is critical: it catches network failures that never produce an HTTP status code at all — DNS resolution failure, connection refused, SSL handshake error. These are distinct from HTTP errors and must be handled separately. See the wp_remote_post() documentation for the full return value specification.

/ Backoff

Exponential backoff: spacing retries correctly

Retrying immediately after a failure is rarely the right choice. An endpoint that just returned a 503 is under stress. Hammering it with immediate retries makes the situation worse — for the endpoint and for every other client hitting it. Exponential backoff spaces retries progressively further apart, giving the endpoint time to recover.

The formula is:

delay = base_delay × 2^(attempt − 1)

With a base delay of 60 seconds (1 minute), five attempts produce the following schedule:

Attempt 1 → 1 min → Attempt 2 → 2 min → Attempt 3 → 4 min → Attempt 4 → 8 min → Attempt 5 → 16 min → Dead-letter

Five attempts at a base delay of 60 seconds cover a 31-minute total retry window. This is long enough to survive transient outages and short-lived infrastructure incidents, without keeping a failed event in the active queue indefinitely.

For high-volume systems where many events may fail simultaneously — for example, during an endpoint outage — add jitter: a small random offset applied to each retry delay. Jitter prevents all failed events from retrying at exactly the same second, which would produce a thundering herd that re-stresses the recovering endpoint instead of allowing it to stabilize.
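A sketch of the delay calculation, implementing the formula above with a random jitter offset (function name and jitter range are assumptions):

```php
// Sketch: next retry delay in seconds — exponential backoff plus jitter.
// $attempt is the attempt number that just failed (1-based).
function my_next_retry_delay( $attempt, $base = 60, $max_jitter = 30 ) {
    $delay = $base * ( 2 ** ( $attempt - 1 ) );   // 60, 120, 240, 480, 960 seconds
    return $delay + random_int( 0, $max_jitter ); // jitter spreads simultaneous retries
}

// The worker then reschedules the queue row:
// scheduled_at = now + my_next_retry_delay( $attempt )
```

Because the jitter is per-event, a batch of events that all failed against the same outage retries over a spread of seconds rather than in one synchronized burst.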

/ Queue

Persistent queue: the foundation of reliable delivery

The queue table is what makes every other reliability property possible. Without it, there is nothing to retry, nothing to log against, and nothing to monitor. The queue is the single source of truth for the state of every webhook event from enqueue to delivery or permanent failure.

The minimal schema requires these columns: id, uuid, endpoint, payload (JSON), payload_version, status, attempt, created_at, scheduled_at, last_attempt_at, last_status_code, last_error.

Status values drive the worker's query logic: pending (waiting to be dispatched), processing (currently being dispatched by a worker), complete (delivered successfully), failed (exhausted retries or received a permanent 4xx).
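A schema sketch with those columns and status values — types, index names, and the table prefix are assumptions to adapt to your installation:

```sql
-- Sketch of the minimal queue schema described above.
CREATE TABLE wp_webhook_queue (
  id               BIGINT UNSIGNED  NOT NULL AUTO_INCREMENT,
  uuid             CHAR(36)         NOT NULL,
  endpoint         VARCHAR(2048)    NOT NULL,
  payload          LONGTEXT         NOT NULL,              -- JSON
  payload_version  VARCHAR(16)      NOT NULL DEFAULT '1',
  status           ENUM('pending','processing','complete','failed')
                                    NOT NULL DEFAULT 'pending',
  attempt          TINYINT UNSIGNED NOT NULL DEFAULT 0,
  created_at       DATETIME         NOT NULL,
  scheduled_at     DATETIME         NOT NULL,
  last_attempt_at  DATETIME         NULL,
  last_status_code SMALLINT         NULL,
  last_error       TEXT             NULL,
  PRIMARY KEY (id),
  UNIQUE KEY uuid (uuid),
  KEY status_scheduled (status, scheduled_at)              -- serves the worker's dispatch query
) DEFAULT CHARSET = utf8mb4;
```

The composite index on (status, scheduled_at) matters: the worker's hot query is "pending rows whose scheduled_at has passed," and without the index that query degrades as the table grows.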

The processing status acts as a dispatch lock. Before attempting delivery, the worker sets the row to processing. This prevents two concurrent worker processes from dispatching the same event simultaneously — a real risk if your cron interval is shorter than your delivery timeout. After the attempt completes (success or failure), the worker updates the status to the appropriate final or pending state.
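The claim can be made atomic with a conditional UPDATE — a sketch, assuming the queue table from this section:

```php
// Sketch: atomic claim. Only one worker can move a given row pending → processing,
// because the WHERE clause re-checks the status inside a single UPDATE statement.
global $wpdb;

$claimed = $wpdb->query( $wpdb->prepare(
    "UPDATE {$wpdb->prefix}webhook_queue
        SET status = 'processing', last_attempt_at = UTC_TIMESTAMP()
      WHERE id = %d AND status = 'pending'",
    $job->id
) );

if ( ! $claimed ) {
    return; // another worker already owns this row — skip it
}
```

If two workers race, MySQL serializes the two UPDATEs; the second one matches zero rows and backs off, so no event is ever dispatched twice by concurrent workers.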

  WordPress Action Hook
       │
       ▼
  Queue Table (MySQL)
  { uuid, endpoint, payload, version, status=pending, attempt=0, scheduled_at }
       │
       ▼
  Response sent to user  ◄── request ends here
       │
       (background)
       ▼
  Cron Worker (system cron preferred)
       │
       ├─ 2xx              → status=complete
       ├─ 5xx / timeout    → reschedule (exponential backoff)
       ├─ 4xx              → status=failed (permanent)
       └─ attempt >= max   → status=failed (dead-letter)

The critical property of this architecture: the user-facing request ends before any delivery is attempted. The webhook payload is captured reliably in the database, and the worker runs independently. A PHP crash during the user request loses nothing — the queue row was written before the crash. A PHP crash during the worker run leaves rows in processing state, which a stuck-detection job can reset to pending for re-processing.
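A sketch of that stuck-detection reset, run as its own scheduled task (the 10-minute threshold is an assumption — set it comfortably above your delivery timeout):

```php
// Sketch: rows left in 'processing' by a crashed worker are reset to 'pending'
// so the next worker run re-attempts them.
global $wpdb;

$wpdb->query(
    "UPDATE {$wpdb->prefix}webhook_queue
        SET status = 'pending', scheduled_at = UTC_TIMESTAMP()
      WHERE status = 'processing'
        AND last_attempt_at < UTC_TIMESTAMP() - INTERVAL 10 MINUTE"
);
```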

/ History

Attempt history: forensic debugging

The queue table tracks the current state of each event. A separate attempt history table tracks every delivery attempt — including those that led to the current state. This distinction matters: the queue row tells you where the event is now; the history tells you how it got there.

Every attempt — success, transient failure, and permanent failure — should produce a log record. The structure of each record:

Per-attempt log record
{
  "event_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "attempt": 3,
  "endpoint": "https://hooks.example.com/order-complete",
  "status_code": 503,
  "wp_error": null,
  "duration_ms": 9812,
  "payload_hash": "sha256:a3f9c2...",  // hash, not raw payload — avoids PII in logs
  "timestamp": "2026-02-28T09:11:04Z",
  "next_retry_at": "2026-02-28T09:19:04Z"
}

Store the payload hash rather than the raw payload in the log record. The full payload already exists in the queue table against the event row — the log record needs only enough to correlate with that row and establish delivery context. Storing raw payloads in a separate log table doubles the PII surface area and complicates GDPR compliance.

With per-attempt history in place, diagnosing a delivery failure requires no reproduction and no live debugging. You query the history table for the event UUID, read the sequence of status codes, and immediately know whether the failure was a sustained 503 (endpoint outage), a 422 (payload schema mismatch), or a network-level timeout. Root cause analysis without guesswork.
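A sketch of the my_log_attempt() writer called from the dispatch code earlier — the log table name and column names mirror the record structure above but are assumptions:

```php
// Sketch: per-attempt log record. Called after every attempt, success or failure.
function my_log_attempt( $job, $attempt, $status_code, $response ) {
    global $wpdb;

    $wpdb->insert( $wpdb->prefix . 'webhook_log', [
        'event_id'     => $job->uuid,
        'attempt'      => $attempt,
        'endpoint'     => $job->endpoint,
        'status_code'  => $status_code ? $status_code : null,
        'wp_error'     => is_wp_error( $response ) ? $response->get_error_message() : null,
        'payload_hash' => 'sha256:' . hash( 'sha256', $job->payload ), // hash, not raw payload
        'logged_at'    => current_time( 'mysql', true ),
    ] );
}
```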

/ Observability

Queue monitoring: metrics that matter

A webhook queue without monitoring is a queue that fails silently — which is where we started. Five metrics cover the operational health of the entire delivery system:

Queue depth — count of pending rows. A steadily growing depth means the worker is not running or not keeping up.
Oldest pending age — time since the oldest pending event was enqueued. Catches a stalled worker even when depth looks small.
Failure rate — share of attempts ending in non-2xx over a recent window. A sudden spike points at an endpoint problem.
Stuck count — rows sitting in processing longer than the delivery timeout. Indicates crashed workers.
Dead-letter count — permanently failed events awaiting human review.

Expose these metrics via a WP-Admin panel for operational visibility, and optionally via a REST endpoint returning JSON for integration with external monitoring tools. A ten-line query wrapped in a REST route is enough for uptime tools, alerting systems, and dashboards to poll on a schedule. The plugin ships this built in — see the REST API reference for the queue status and delivery log endpoints.
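If you are building the endpoint yourself, a minimal sketch of such a route — the namespace, route path, and response keys are assumptions:

```php
// Sketch: read-only queue metrics behind an authenticated REST route.
add_action( 'rest_api_init', function () {
    register_rest_route( 'my-webhooks/v1', '/queue-status', [
        'methods'             => 'GET',
        'permission_callback' => function () {
            return current_user_can( 'manage_options' );
        },
        'callback'            => function () {
            global $wpdb;
            $table = $wpdb->prefix . 'webhook_queue';
            return [
                'pending'        => (int) $wpdb->get_var( "SELECT COUNT(*) FROM {$table} WHERE status = 'pending'" ),
                'failed'         => (int) $wpdb->get_var( "SELECT COUNT(*) FROM {$table} WHERE status = 'failed'" ),
                'oldest_pending' => $wpdb->get_var( "SELECT MIN(created_at) FROM {$table} WHERE status = 'pending'" ),
            ];
        },
    ] );
} );
```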

/ Manual Retry

Manual and bulk retry workflows

Dead-letter events need a path back to active delivery. Automated retry schedules handle transient failures, but systematic failures — an endpoint outage that lasted longer than the retry window, a temporary authentication misconfiguration — require human intervention and a reliable manual retry mechanism.

Single-event retry is the base case: reset the event's status to pending, clear scheduled_at to the current time so the worker picks it up immediately on the next run, and preserve the original UUID. The UUID must not be regenerated on manual retry — the receiver must see the same event identifier it would have seen on automated delivery, so its deduplication logic functions correctly.
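As a sketch — my_retry_event() is a hypothetical helper operating on the queue table described earlier:

```php
// Sketch: manual retry. Status goes back to pending, scheduled_at moves to now,
// and the uuid column is deliberately untouched — the receiver sees the same
// X-Webhook-ID, so its deduplication logic keeps working.
function my_retry_event( $row_id ) {
    global $wpdb;

    $wpdb->update(
        $wpdb->prefix . 'webhook_queue',
        [
            'status'       => 'pending',
            'scheduled_at' => current_time( 'mysql', true ),
        ],
        [ 'id' => $row_id ]
    );
}
```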

Bulk retry is essential for recovering from endpoint outages. When an endpoint comes back online after a two-hour incident, you may have dozens or hundreds of failed events that need requeuing simultaneously. A bulk retry UI that accepts an endpoint URL, a date range, and a status filter — and requeues all matching rows in a single operation — turns a potentially hour-long manual task into a thirty-second operation. When you need that recovery triggered automatically — by a monitoring system, a CI pipeline, or an on-call script — the REST API exposes bulk retry and the full delivery log as authenticated HTTP endpoints.

Note that bulk retry only covers failed events. For events that delivered successfully but need to be resent — after a bug in the receiving system, a migration, or a downstream outage — that is a separate mechanism: replay. See how retry and replay differ and how to build both.

Date filtering combined with the attempt history log enables delivery window investigation. If a client reports missing events between 14:00 and 16:00 on a specific date, you can query the history table for that window, identify the failure pattern, and produce a specific list of event UUIDs that need redelivery — without guessing or manual cross-referencing.
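A sketch of such an investigation query against the attempt history table — table and column names are assumptions based on the log record shown earlier:

```sql
-- Sketch: all failed delivery attempts to one endpoint in a reported window.
SELECT event_id, attempt, status_code, logged_at
  FROM wp_webhook_log
 WHERE endpoint = 'https://hooks.example.com/order-complete'
   AND logged_at BETWEEN '2026-02-28 14:00:00' AND '2026-02-28 16:00:00'
   AND (status_code IS NULL OR status_code NOT BETWEEN 200 AND 299)
 ORDER BY logged_at;
```

The resulting event_id list is exactly the input a bulk retry operation needs.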

/ Production Alternative

If you'd rather not maintain this yourself

Flow Systems Webhook Actions implements this architecture — persistent queue, smart retry routing, exponential backoff, event UUID headers, per-attempt history, and queue observability — without requiring custom infrastructure maintenance. All of it is also accessible programmatically via the REST API. If your team prefers configuration over building and owning a webhook delivery system from scratch, explore the full details at production-grade WordPress webhook plugin.

/ References

Official documentation

All implementation patterns described here use WordPress-native APIs. The primary references are the developer documentation for wp_remote_post() and wp_remote_retrieve_response_code(), is_wp_error() and the WP_Error class, wp_generate_uuid4(), wp_json_encode(), the $wpdb class, and the WP-Cron scheduling system (including the DISABLE_WP_CRON constant).

/ FAQ

Common questions

Why do webhooks work in local development but fail in production?

Local development has no traffic dependency, low-latency endpoints, and a controlled environment. Production introduces real endpoint latency, WP-Cron unreliability — it only fires on page load, not on a time-based schedule — transient endpoint failures, and PHP timeouts under load.

Without a persistent queue and retry logic, any of these conditions silently drops the event. The local environment masks all of them because it has no traffic gaps, no shared hosting restrictions, and usually hits fast local or nearby endpoints.
Why should 4xx responses not be retried?

A 4xx status means the endpoint understood the request and rejected it deliberately. A 400 means the payload is malformed. A 401 or 403 means authentication failed. A 404 means the URL no longer exists. None of these conditions will resolve themselves through retrying — they require a code change, a configuration fix, or an endpoint correction.

Retrying 4xx responses burns the entire retry budget on an unrecoverable failure. After five retries, the event reaches dead-letter status — but the real problem (wrong URL, broken auth, bad payload) still exists and now needs to be diagnosed under time pressure. Mark 4xx failures as permanent immediately and surface them for human review.
What headers should every production webhook send?

Every production webhook should include four standard headers:

X-Webhook-ID — a stable UUID generated at enqueue time, not at dispatch time. The same value on every retry attempt, allowing receivers to deduplicate.

X-Webhook-Timestamp — Unix epoch timestamp of the original event, not the retry time. Allows receivers to detect replays and enforce recency windows.

X-Webhook-Version — payload schema version. Allows schema evolution without breaking existing consumers.

Content-Type: application/json — explicit MIME type so the receiver does not need to guess the encoding.
How do you detect stuck webhook jobs?

Query for events where status = 'processing' and last_attempt_at < NOW() - INTERVAL 10 MINUTE. These are workers that crashed mid-batch without updating the row status back to pending or failed.

A stuck detection job should run as a separate scheduled task — every five minutes is sufficient. When it finds stuck rows, it resets their status to pending and clears scheduled_at to now, making them immediately eligible for the next worker run. Alert on any reset so the underlying crash can be investigated.
Is WP-Cron reliable enough for production webhook delivery?

No. WP-Cron fires on page load, not on a real time schedule. A low-traffic site at 3am may not receive a page load for hours, meaning queued webhook jobs sit unprocessed for that entire window. The same applies to maintenance windows, cache warmup periods, and any time your site receives no organic traffic.

The reliable solution is to add define( 'DISABLE_WP_CRON', true ); to wp-config.php and configure a system cron entry that hits /wp-cron.php?doing_wp_cron every minute. This decouples the scheduler from traffic and guarantees time-based execution regardless of site activity.