
Why Your Webhooks Keep Failing in Production (And How to Fix Them)
Most developers think webhooks are simple—just an HTTP POST from one service to another. If that were true, you wouldn't be here reading about why your integrations randomly drop events, duplicate notifications, or bring down your servers during traffic spikes. Building webhook receivers that actually survive real-world conditions requires more than a basic endpoint. You need retry logic with exponential backoff, idempotency checks, circuit breakers, and signature verification. This post walks through a battle-tested architecture for webhook handling that won't wake you up at 3 AM.
Why Do Webhooks Fail Even When Everything Looks Correct?
The root problem isn't your code—it's assuming the network is reliable. Webhooks cross organizational boundaries, traverse the public internet, and depend on infrastructure you don't control. Your provider might retry a failed delivery (creating duplicates), their servers might hiccup (causing delays), or your own app might briefly go offline during a deploy. Without proper handling, these hiccups cascade into data inconsistencies, missed events, and angry customers wondering why their invoice never processed.
Most webhook implementations I see in the wild follow a dangerous pattern: receive the payload, process it synchronously, return a 200 OK. This works beautifully in development—local servers, minimal latency, no load. In production? A slow database query during peak traffic causes timeouts. The provider retries. Your server, already struggling, gets hammered again. Now you've got a positive feedback loop that ends in an outage. The fix is treating webhooks as asynchronous jobs from day one—not afterthoughts.
How Should I Structure My Webhook Endpoint for Reliability?
Separate receiving from processing. Your webhook endpoint should do exactly three things: validate the signature, persist the raw payload, and return a 200 OK. That's it. No business logic, no database lookups, no external API calls. The actual work happens later—picked up by a background job processor that can retry, back off, and handle failures without affecting the webhook response time.
// Anti-pattern: synchronous processing
app.post('/webhooks/stripe', async (req, res) => {
  const event = req.body;
  await updateUserSubscription(event); // slow DB write blocks the response
  await sendConfirmationEmail(event);  // external API call can hang or fail
  res.sendStatus(200);
});
// Better: queue for async processing
// Mount with express.raw({ type: 'application/json' }) so req.body is the
// raw byte buffer; signatures must be verified before parsing.
app.post('/webhooks/stripe', async (req, res) => {
  const signature = req.headers['stripe-signature'];

  // 1. Validate the signature against the raw payload
  if (!verifySignature(req.body, signature)) {
    return res.sendStatus(400);
  }

  // 2. Persist the raw payload for the background worker
  await WebhookEvent.create({
    provider: 'stripe',
    payload: req.body.toString('utf8'),
    status: 'pending'
  });

  // 3. Acknowledge immediately
  res.sendStatus(200);
});
The queue worker—whether you're using Bull, Celery, Sidekiq, or plain SQS—handles the actual processing with full retry capability. If your database is slow, the job waits and retries. If an external API is down, the job backs off and tries again. Your webhook endpoint stays fast and responsive, which matters because most providers (Stripe, GitHub, Slack) will disable endpoints that return too many non-200 responses. A responsive 200 with background processing beats a correct 500 every time.
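To make the separation concrete, here's a minimal in-memory sketch of the worker side. It's an illustration only: createQueue and its options are invented names, and a real deployment would hand this to Bull, Sidekiq, or SQS rather than an in-process array.

```javascript
// Minimal in-memory sketch of receive/process separation. Illustration only;
// use a real job queue (Bull, Sidekiq, SQS) in production.
function createQueue(handler, { maxAttempts = 3 } = {}) {
  const jobs = [];
  const deadLetter = []; // permanently failed events land here
  return {
    add(payload) {
      jobs.push({ payload, attempts: 0 });
    },
    async drain() {
      while (jobs.length > 0) {
        const job = jobs.shift();
        try {
          await handler(job.payload);
        } catch (err) {
          job.attempts += 1;
          if (job.attempts < maxAttempts) {
            jobs.push(job); // transient failure: retry (with backoff in real life)
          } else {
            deadLetter.push(job); // give up after maxAttempts tries
          }
        }
      }
    },
    deadLetter,
  };
}
```

The webhook endpoint only ever calls add; drain runs in the background, so a slow handler never delays the 200 response.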
What's the Right Way to Handle Webhook Retries and Duplicates?
Idempotency is non-negotiable. Providers will retry failed deliveries—sometimes aggressively—and there's no guarantee they'll tell you it's a retry. Stripe events carry a unique id in the payload. GitHub sends a unique X-GitHub-Delivery UUID. Others don't give you anything reliable, forcing you to fingerprint the payload itself. Regardless of the mechanism, your processing logic must handle the same event arriving multiple times without side effects.
Store processed event IDs in a fast lookup store—Redis, DynamoDB, or even your primary database with a unique constraint. Before processing, check if you've seen this ID before. If yes, return early. The check-and-set operation should be atomic to prevent race conditions during concurrent retries. A simple Redis SET idempotency-key "1" NX EX 86400 (set if not exists, expire after 24 hours) covers most use cases without database bloat.
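As a single-process stand-in for that Redis command, the atomic claim looks roughly like this. The names (createIdempotencyStore, claim) are hypothetical; in production the Map would be Redis or a unique database constraint.

```javascript
// In-memory stand-in for Redis `SET <event-id> 1 NX EX <ttl>`.
// Single-process only; use Redis or a DB unique constraint for real deployments.
function createIdempotencyStore(ttlMs = 24 * 60 * 60 * 1000) {
  const seen = new Map(); // eventId -> expiry timestamp (ms)
  return {
    // Returns true if this is the first time we've seen the id (claim won).
    claim(eventId, now = Date.now()) {
      const expiry = seen.get(eventId);
      if (expiry !== undefined && expiry > now) return false; // duplicate
      seen.set(eventId, now + ttlMs);
      return true;
    },
  };
}
```

The worker calls claim(event.id) before doing any work and returns early on false, so concurrent retries can't double-process.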
"Exactly-once delivery is a distributed systems myth. Build for at-least-once delivery with idempotent consumers." — Martin Kleppmann, Designing Data-Intensive Applications
Exponential backoff isn't just for sending—it's for receiving too. When your processor encounters a transient error (database timeout, rate-limited external API), don't retry immediately. Back off: 1 second, 2 seconds, 4 seconds, 8 seconds, capping at maybe 5 minutes. Most job queues support this natively. The key is distinguishing transient failures (retry) from permanent failures (dead letter queue). A 500 from your database is transient. A 400 Bad Request from a malformed payload is permanent—retrying won't fix it.
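The schedule above reduces to a one-line formula, and the transient/permanent split can be a simple predicate. The helper names and the HTTP-style status classification are assumptions for illustration; most job queues let you plug a function like this into their retry options.

```javascript
// Exponential backoff with a cap: 1s, 2s, 4s, 8s, ... up to 5 minutes.
// Real systems often add random jitter to avoid thundering herds.
function backoffMs(attempt, baseMs = 1000, capMs = 5 * 60 * 1000) {
  return Math.min(baseMs * 2 ** attempt, capMs);
}

// Rough transient/permanent split, assuming HTTP-style status codes:
// retry 429s and 5xx; send 4xx straight to the dead letter queue.
function isTransient(status) {
  return status === 429 || status >= 500;
}
```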
How Do I Protect My Webhook Endpoint From Abuse?
Signature verification is your first line of defense, but most implementations get it subtly wrong. Don't parse the request body before verifying—use the raw payload. JSON parsers reorder keys, strip whitespace, or normalize unicode, breaking the signature check. Store the raw body as bytes, verify the HMAC signature against it, then parse. Libraries like stripe-node handle this correctly, but custom implementations often trip on this detail.
Rate limiting and circuit breakers protect against legitimate traffic spikes and malicious actors. A misconfigured integration firing thousands of events per second shouldn't take down your app. Implement per-provider rate limits—maybe 100 requests per minute for most services, higher for known high-volume providers. When limits are exceeded, return 429 Too Many Requests. Good providers will back off; bad ones you probably want to block anyway.
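A fixed-window, per-provider limiter is enough to start with. This single-process sketch uses invented names (createRateLimiter, allow); across a fleet you'd back the counters with Redis instead of a local Map.

```javascript
// Fixed-window rate limiter, one window per provider.
// Single process only; back with Redis when running multiple instances.
function createRateLimiter({ limit = 100, windowMs = 60 * 1000 } = {}) {
  const windows = new Map(); // provider -> { start, count }
  return {
    // Returns true if the request is allowed, false if it should get a 429.
    allow(provider, now = Date.now()) {
      const w = windows.get(provider);
      if (!w || now - w.start >= windowMs) {
        windows.set(provider, { start: now, count: 1 }); // fresh window
        return true;
      }
      w.count += 1;
      return w.count <= limit;
    },
  };
}
```

In the Express handler, a false return maps directly to res.sendStatus(429).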
Circuit breakers prevent cascade failures. If your database goes down and every webhook job starts failing, you don't want to keep hammering the database with retries. After N consecutive failures, trip the circuit breaker—all webhook processing pauses for a cooldown period. This gives your infrastructure time to recover instead of amplifying the overload. Libraries like Netflix's Hystrix (now in maintenance mode) or modern alternatives like opossum for Node.js make this straightforward to implement.
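The core of the pattern fits in a few lines. This sketch (createBreaker is a hypothetical helper, not opossum's actual API) trips after a run of consecutive failures and rejects calls outright during the cooldown:

```javascript
// Minimal circuit breaker: trips after `threshold` consecutive failures,
// rejects calls during `cooldownMs`, then lets a trial call through.
function createBreaker(fn, { threshold = 5, cooldownMs = 30 * 1000 } = {}) {
  let failures = 0;
  let openedAt = null;
  return async function guarded(...args) {
    if (openedAt !== null && Date.now() - openedAt < cooldownMs) {
      throw new Error('circuit open'); // fail fast; don't hit the backend
    }
    try {
      const result = await fn(...args);
      failures = 0;     // success closes the circuit again
      openedAt = null;
      return result;
    } catch (err) {
      failures += 1;
      if (failures >= threshold) openedAt = Date.now(); // trip the breaker
      throw err;
    }
  };
}
```

Wrapping the job handler in createBreaker means a dead database produces fast "circuit open" errors instead of a pile-up of slow retries.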
Monitoring: The Forgotten Requirement
You can't fix what you can't see. Webhook failures often manifest as subtle data drift—orders that never shipped, users stuck in trial status, notifications that vanished. Set up alerts for webhook endpoint response codes (watch for 4xx/5xx spikes), queue depth (jobs backing up), and processing latency. A dashboard showing events received vs. processed per hour will catch problems before customers do.
Structured logging transforms debugging from guesswork into science. Log every webhook receipt with its ID, timestamp, and delivery attempt number. Log processing starts and completions. When something breaks, you can trace the exact path an event took through your system. Without this audit trail, you're staring at production data wondering "did we even receive this webhook?"
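One way to get that audit trail is a single JSON line per lifecycle step. The field names below are assumptions for illustration; match them to whatever schema your log aggregator indexes.

```javascript
// Emit one JSON log line per lifecycle step so any log aggregator
// can filter by eventId and reconstruct an event's full path.
function logEvent(stage, event, extra = {}) {
  const line = JSON.stringify({
    ts: new Date().toISOString(),
    stage,                 // 'received' | 'processing' | 'done' | 'failed'
    eventId: event.id,
    provider: event.provider,
    attempt: event.attempt,
    ...extra,
  });
  console.log(line);
  return line;
}
```

Calling logEvent('received', ...) in the endpoint and logEvent('done', ...) in the worker answers "did we even receive this webhook?" with a single query.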
Architecture Checklist for Production Webhooks
- Async processing queue—never process synchronously
- Idempotency storage with atomic check-and-set
- Exponential backoff for retries (transient vs. permanent failure distinction)
- Signature verification on raw payload before parsing
- Per-provider rate limiting (429 responses)
- Circuit breaker for downstream failures
- Structured logging with event tracing
- Alerting on error rates and queue depth
- Dead letter queue for permanently failed events
Building reliable webhook infrastructure isn't glamorous work—it won't demo well in a product showcase. But it will save you from those brutal debugging sessions where you're piecing together scattered logs at midnight, trying to figure out why a customer's payment went through but their account never upgraded. The patterns here—async queues, idempotency, circuit breakers—are the same ones used by companies processing billions of webhooks daily. Start with the queue separation and add the rest incrementally. Your future self (and your on-call rotation) will appreciate it.
One last thing: test your failure modes. Use httpbin.org endpoints that return 500s, simulate slow network conditions with tc (Linux traffic control), and deliberately fill your queue to see how backpressure propagates. The only way to know your webhook handling works is to watch it fail—and recover.
