ObjectSource GmbH · Tutorials · 4 min read
The Thundering Herd - Surviving Mass Reconnection in Azure IoT
Onboarding a large fleet of devices at once on Azure IoT can quickly overwhelm available system resources. We provide guidelines to help you prevent this.

If you are designing an onboarding flow for ESP32 devices on Azure, here is how to prevent the “Thundering Herd” problem:
The Problem: A “Thundering Herd” occurs when a large fleet of devices attempts to reconnect to the Azure Device Provisioning Service (DPS) or Azure IoT Hub simultaneously (e.g., after a power outage), causing service throttling and cascading connection failures.
The Solution: Jittered Exponential Backoff.
Exponential Backoff: Devices double the wait time between retry attempts (e.g., 2 s, 4 s, 8 s).
Jitter: Add a random delay to each backoff so that devices don’t retry in synchronized “waves.”
The Benefit: Keeps the system available during recovery and keeps your fleet clear of “429 Too Many Requests” throttling, which can block legitimate traffic.
Introduction: The Morning the Grid Woke Up
Imagine it’s 3:00 AM. A minor firmware bug or a regional power flicker causes 10,000 ESP32-powered smart meters to reboot at the exact same millisecond.
As the power returns, every single device executes setup() and calls dps_register_device(). In the Azure portal, the Cloud Architect sees a vertical spike in “Provisioning Attempts.” Because the devices are all running identical code, they all fail at once, wait exactly 5 seconds, and retry at once.
This is the Thundering Herd. Within minutes, the Azure IoT Hub hits its connection throttling limits, the Device Provisioning Service starts rejecting requests, and the entire fleet is effectively “denial-of-serviced” by its own programming.
By the time the sun comes up, the Architect has a broken fleet and a massive headache. Here is how to build a “polite” fleet that knows how to wait its turn.
Mastering the Onboarding Storm
Why does the ESP32 struggle with “Thundering Herds” more than other devices?
Unlike a PC or a server, an ESP32 is a “bare-metal” or RTOS-driven device with no long, variable operating system startup. When power is restored, it boots in a fraction of a second. If 10,000 devices run the same code, they act with perfect, accidental synchronization. Without a randomized delay, they behave like a botnet attacking your own Azure infrastructure.
What is the “Jittered Exponential Backoff” pattern?
It is a strategy where the time a device waits to retry a failed connection increases with each failure, but with a “random” twist.
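- Exponential: the wait time doubles after each failed attempt (e.g., 2 s, 4 s, 8 s), up to a longest backoff state.
- Jitter: a random delay is added to each wait so that devices don’t retry in synchronized “waves.”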
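As a rough illustration, the pattern fits in two small C helpers. This is a minimal sketch, not a reference implementation: the 2-second base follows the example above, while the 5-minute cap, the function names, and the choice of adding up to 100% extra delay as jitter are assumptions made for this example. The random value is passed in so the logic can be checked on a PC; on ESP-IDF it could come from esp_random().

```c
#include <assert.h>
#include <stdint.h>

#define BACKOFF_BASE_MS 2000u    /* first wait: 2 s, as in the 2 s, 4 s, 8 s example */
#define BACKOFF_CAP_MS  300000u  /* assumed longest wait: 5 minutes */

/* Exponential part: the base wait doubled once per retry, clamped to the cap.
 * retry is 0 for the first retry, 1 for the second, and so on. */
uint32_t backoff_base_ms(uint32_t retry)
{
    uint32_t wait = BACKOFF_BASE_MS;
    while (retry-- > 0 && wait < BACKOFF_CAP_MS) {
        wait *= 2u;
    }
    return wait > BACKOFF_CAP_MS ? BACKOFF_CAP_MS : wait;
}

/* Jittered wait: the exponential part plus a random extra in [0, exponential part].
 * random_value must differ from device to device, otherwise the herd stays in step. */
uint32_t backoff_delay_ms(uint32_t retry, uint32_t random_value)
{
    uint32_t base = backoff_base_ms(retry);
    assert(base <= BACKOFF_CAP_MS);  /* so base + jitter cannot overflow */
    return base + random_value % (base + 1u);
}
```

With these numbers, the second retry (retry = 1) waits somewhere between 4 and 8 seconds, depending on the random value each device draws.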
How should I handle Azure DPS “429 Too Many Requests” errors?
If the Azure DPS returns a 429 error, it is telling your ESP32 to “slow down.” Your firmware must be written to parse the Retry-After header if available, or immediately fall back into its longest backoff state. Continuing to “hammer” the service will only extend the duration of the throttle.
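The decision described here can be sketched as a pure function. The sketch assumes the Retry-After value has already been extracted from the response as a trimmed string and only handles the plain "number of seconds" form; the function names and the 5-minute longest backoff are hypothetical choices for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define LONGEST_BACKOFF_MS 300000u  /* assumed longest backoff state: 5 minutes */

/* Parses a Retry-After value given as whole seconds. Returns 1 on success. */
static int parse_retry_after_s(const char *value, uint32_t *seconds_out)
{
    assert(seconds_out != NULL);
    if (value == NULL || *value < '0' || *value > '9') {
        return 0;  /* missing, empty, signed, or not a number */
    }
    char *end = NULL;
    errno = 0;
    unsigned long s = strtoul(value, &end, 10);
    if (errno != 0 || *end != '\0' || s > UINT32_MAX / 1000u) {
        return 0;  /* trailing garbage or too large to express in milliseconds */
    }
    *seconds_out = (uint32_t)s;
    return 1;
}

/* Picks the next wait: honor Retry-After on a 429, fall back to the longest
 * backoff when the hint is unusable, and keep the normal backoff otherwise. */
uint32_t delay_after_response_ms(int status, const char *retry_after,
                                 uint32_t normal_backoff_ms)
{
    if (status != 429) {
        return normal_backoff_ms;
    }
    uint32_t seconds = 0;
    if (parse_retry_after_s(retry_after, &seconds)) {
        return seconds * 1000u;
    }
    return LONGEST_BACKOFF_MS;
}
```

Adding jitter on top of the returned value is still worthwhile, since every throttled device may receive the same Retry-After hint.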
Should I cache the assigned IoT Hub address on the ESP32?
Yes. This is the best way to avoid a Thundering Herd at the DPS level. Instead of calling the Device Provisioning Service every time the ESP32 reboots, store the assigned IoT Hub hostname in the ESP32’s Non-Volatile Storage (NVS). On boot, the device should try to connect directly to the Hub first. Only if that connection fails should it fall back to the DPS to see if it has been reassigned, and then update the cached hostname with whatever the DPS returns.
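A minimal sketch of the caching decision follows. To keep it runnable on a PC, the cached hostname lives in a plain struct; on a real device that record would be persisted in NVS, for example with ESP-IDF's nvs_get_str() and nvs_set_str(). The type names, function names, and the 128-byte buffer are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define HUB_HOST_MAX 128  /* assumed buffer size for a hub hostname */

/* Stand-in for the persisted record; on a device this would live in NVS. */
typedef struct {
    char cached_hub[HUB_HOST_MAX];  /* empty string means nothing is cached */
} device_store_t;

typedef enum { TARGET_IOT_HUB, TARGET_DPS } boot_target_t;

/* On boot: go straight to the cached hub if there is one, otherwise ask DPS. */
boot_target_t first_target(const device_store_t *store)
{
    assert(store != NULL);
    return store->cached_hub[0] != '\0' ? TARGET_IOT_HUB : TARGET_DPS;
}

/* After a successful DPS registration: remember the assigned hub for next boot.
 * Returns false and leaves the cache untouched if the hostname is unusable. */
bool remember_assignment(device_store_t *store, const char *assigned_hub)
{
    assert(store != NULL && assigned_hub != NULL);
    size_t len = strlen(assigned_hub);
    if (len == 0 || len >= HUB_HOST_MAX) {
        return false;
    }
    memcpy(store->cached_hub, assigned_hub, len + 1);  /* copy includes the '\0' */
    return true;
}
```

A freshly flashed device starts with an empty record and goes to DPS once; every later reboot takes the direct path until the hub connection fails.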
How do I test if my “Herd” logic actually works?
Run a load test. You can simulate a Thundering Herd using a script that triggers a reset on a subset of your devices or by using a tool like Azure IoT Device Telemetry Simulator to mimic thousands of simultaneous connection requests. Monitor your “Throttling Errors” in Azure Monitor to see if your jitter logic successfully flattens the spike into a manageable “hill.”
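Before running a real load test, a quick model on your PC can show what the jitter is expected to do. The sketch below is not a substitute for testing against Azure: it only counts how many simulated devices would land in the same one-second bucket when each waits a random 0 to max_delay_s seconds. The function names and the small xorshift generator (used so results are reproducible) are assumptions of this example.

```c
#include <assert.h>
#include <stdint.h>

#define WINDOW_S 31u  /* one bucket per second for delays of 0..30 s */

/* Small deterministic PRNG so that runs are reproducible on any machine. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Returns the largest number of connection attempts that fall into any single
 * second when each device waits a random 0..max_delay_s seconds after boot. */
uint32_t peak_attempts_per_second(uint32_t devices, uint32_t max_delay_s, uint32_t seed)
{
    assert(max_delay_s < WINDOW_S);
    uint32_t buckets[WINDOW_S] = {0};
    uint32_t state = seed ? seed : 1u;  /* xorshift must not start at 0 */
    uint32_t peak = 0;

    for (uint32_t i = 0; i < devices; i++) {
        uint32_t slot = max_delay_s ? xorshift32(&state) % (max_delay_s + 1u) : 0u;
        if (++buckets[slot] > peak) {
            peak = buckets[slot];
        }
    }
    return peak;
}
```

With max_delay_s set to 0, all 10,000 devices land in the same second. With a 30-second window the same fleet spreads over 31 buckets, so the peak should drop to a few hundred attempts per second, which is the “hill” you want to see confirmed in Azure Monitor.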
Implementation Cheat Sheet: The “Polite” ESP32 Boot Logic
| Step | Action | Why? |
|---|---|---|
| 1. Initial Boot | Wait a random 0–30 seconds. | Prevents the very first wave from hitting DPS. |
| 2. Check NVS | Look for cached “Assigned Hub” URI. | Bypasses DPS entirely for 99% of reboots. |
| 3. Connect | Attempt IoT Hub connection. | Fast path to data transmission. |
| 4. Failure? | Enter Exponential Backoff + Jitter. | Gives Azure services room to breathe. |
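The four steps can also be read as a small state machine. The sketch below is one possible reading of the table, not the only one: the state and event names are hypothetical, and the choice to go to DPS after a backoff (to check for reassignment) is an assumption based on the caching advice above.

```c
#include <assert.h>

typedef enum {
    ST_INITIAL_WAIT,   /* step 1: random 0-30 s delay after boot */
    ST_CHECK_CACHE,    /* step 2: look for a cached hub in NVS */
    ST_CONNECT_HUB,    /* step 3: attempt the IoT Hub connection */
    ST_PROVISION_DPS,  /* no cached hub, or the cached hub was unreachable */
    ST_BACKOFF,        /* step 4: exponential backoff + jitter */
    ST_ONLINE
} boot_state_t;

typedef enum {
    EV_TIMER_DONE, EV_CACHE_HIT, EV_CACHE_MISS, EV_SUCCESS, EV_FAILURE
} boot_event_t;

/* Returns the next state; events that make no sense in a state are ignored. */
boot_state_t next_state(boot_state_t state, boot_event_t event)
{
    switch (state) {
    case ST_INITIAL_WAIT:
        return event == EV_TIMER_DONE ? ST_CHECK_CACHE : state;
    case ST_CHECK_CACHE:
        if (event == EV_CACHE_HIT)  return ST_CONNECT_HUB;
        if (event == EV_CACHE_MISS) return ST_PROVISION_DPS;
        return state;
    case ST_CONNECT_HUB:
        if (event == EV_SUCCESS) return ST_ONLINE;
        if (event == EV_FAILURE) return ST_BACKOFF;
        return state;
    case ST_PROVISION_DPS:
        if (event == EV_SUCCESS) return ST_CONNECT_HUB;  /* hub was (re)assigned */
        if (event == EV_FAILURE) return ST_BACKOFF;
        return state;
    case ST_BACKOFF:
        /* Assumption: after waiting, ask DPS whether the device was reassigned. */
        return event == EV_TIMER_DONE ? ST_PROVISION_DPS : state;
    case ST_ONLINE:
        return state;
    }
    assert(0 && "unknown boot state");
    return state;
}
```

Keeping the transitions in one pure function makes the boot logic easy to test on a PC before it ever runs on an ESP32.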