Beat Tone

Several months ago we witnessed some very peculiar behavior in a fairly small wireless sensor network. Several sensors were reporting data periodically, each sending a message on average every minute. Every now and then a pair of sensors appeared to stop working: no messages were received from either of them. When we rolled out traffic-sniffing equipment to analyze what was happening, we found nothing wrong. The data was arriving periodically from both. We packed up and went home, only to be notified that reports were missing from another pair of sensors. We came back to the site and found nothing wrong again. And the story repeated. It was peculiar.

Finally, when we diagnosed the installation, we found that the root of the problem was slowly drifting clocks. The sensors were reporting at the same cadence, but each had a randomized delay on start, so the radio messages were spread fairly randomly in time to avoid collisions. Apart from that randomized startup delay, though, the code responsible for periodic wake-ups was precise: send a message, go to sleep for the configured period, wake up, send a message, go to sleep... repeat.

Because of the slowly drifting clocks, every now and then a pair of sensors fired at exactly the same time. Initially all of them were unsynchronized, but as the clocks drifted, after some time a random pair started sending messages at exactly the same moment. The messages collided and the reports were lost. This is when the support team got alerted. By the time they got on site, the clocks had drifted apart and the collisions had stopped, so they could not find any problem when analyzing the network. Then, after several hours, another random pair of sensors started sending their messages at exactly the same time. This lasted long enough to be noticed and for support to be called, and by the time they arrived everything had drifted out of sync again and the network was back to normal.
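To put rough numbers on this, here is a back-of-the-envelope sketch in Python. The 50 ms message airtime and the 20 ppm relative drift between two clocks are illustrative assumptions, not measurements from our installation; only the one-minute period comes from the setup described above. The function names are made up for this example.

```python
def collision_window_s(airtime_s, relative_drift_ppm):
    """Seconds during which two equal-period senders keep overlapping.

    Two messages of equal length overlap while the offset between their
    start times is smaller than the airtime, i.e. while the offset sweeps
    from -airtime to +airtime at the relative drift rate.
    """
    drift = relative_drift_ppm * 1e-6  # seconds of offset change per second
    return 2.0 * airtime_s / drift


def consecutive_reports_lost(period_s, airtime_s, relative_drift_ppm):
    """Approximate number of back-to-back reports lost by the colliding pair."""
    return collision_window_s(airtime_s, relative_drift_ppm) / period_s


# Assumed values: 50 ms airtime, 20 ppm relative drift, 60 s period.
window = collision_window_s(0.05, 20.0)              # about 5000 s
lost = consecutive_reports_lost(60.0, 0.05, 20.0)    # about 83 reports
print(f"collision window: {window / 60:.0f} min, reports lost in a row: {lost:.0f}")
```

Under these assumptions a pair stays in collision for well over an hour, which is long enough to trigger an alert and short enough to be gone by the time anyone arrives.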

Of course, the bug was the lack of a small randomization when sending each periodic message. There was randomization at power-up to spread the traffic evenly, and there was randomization when sending multiple messages one after another. But the first message after a long sleep always went out immediately, to save energy. The sleep timer kept ticking and waking the code, which transmitted immediately and then hibernated the device until the next wake-up. And because the sleep clocks were drifting, every couple of hours a random pair ended up waking repeatedly at exactly the same time and colliding perfectly.
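The effect of the missing randomization can be illustrated with a small simulation. This is only a sketch under assumed numbers (50 ms airtime, 20 ppm relative drift, up to 1 s of random delay before each transmission), not our firmware, and the function name is hypothetical. It models two sensors whose wake-up times slowly slide past each other and reports the longest run of consecutive collisions.

```python
import random


def longest_collision_run(jitter_s, cycles=1000, period_s=60.0,
                          airtime_s=0.05, drift_ppm=20.0,
                          start_offset_s=-0.5, seed=1):
    """Longest streak of consecutive colliding reports between two sensors.

    Sensor B's clock runs slightly slow, so its wake-ups slide past
    sensor A's. Each sensor waits a random delay in [0, jitter_s]
    after waking before it transmits.
    """
    rng = random.Random(seed)  # seeded so the run is repeatable
    period_b = period_s * (1.0 + drift_ppm * 1e-6)
    longest = current = 0
    for k in range(cycles):
        tx_a = k * period_s + rng.uniform(0.0, jitter_s)
        tx_b = start_offset_s + k * period_b + rng.uniform(0.0, jitter_s)
        if abs(tx_a - tx_b) < airtime_s:  # equal-length messages overlap
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


print("no jitter: ", longest_collision_run(0.0))
print("1 s jitter:", longest_collision_run(1.0))
```

Without jitter the pair loses on the order of 80 reports in a row while the clocks slide past each other. With jitter the two sensors still collide occasionally, but the collisions are scattered, so no sensor goes silent long enough to look dead. The price is the extra time spent awake before each transmission, which is the energy cost the original code was trying to avoid.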

Lesson learned, of course. It was something that was not that obvious and not that easy to catch when testing only a handful of devices for short periods.
