Reboot Era

Related to last week's Multi-decade Stability post is the concept of reboots. Last week's conclusion was that software does not age (hardware does), but getting stable software in the first place is difficult.

Typically there are two conditions in software that may result in it failing after some time of operation.

The first one is a memory leak. If the leak is slow, it may take a long time (days, weeks, even months), but eventually the system will run out of memory and fail. If it is protected by a watchdog, it will reboot itself and continue for another week or month. This actually happens quite frequently, but software today is written in a way that you may not notice such a reboot. Of the devices I own, the one that does this most often is my Garmin watch. The reason I notice is that it takes a good while to boot (it does some filesystem checks and reads many GBs of topographic map data from its storage). If the whole process took 5 seconds, I probably would not have noticed at all.
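To put rough numbers on the leak-then-reboot cycle, here is a minimal sketch. Everything in it is an assumption made for illustration: the function names, the heap size, the leak rate and the watchdog timeout are hypothetical and do not describe any particular device.

```python
SECONDS_PER_DAY = 86_400


def days_until_exhaustion(free_bytes: int, leak_bytes_per_second: float) -> float:
    """Days until a constant-rate leak consumes all free memory."""
    return free_bytes / leak_bytes_per_second / SECONDS_PER_DAY


def simulate_watchdog(heap_bytes: int, leak_per_cycle: int, cycles: int,
                      watchdog_timeout: int = 3) -> list[int]:
    """Return the cycle numbers at which a toy watchdog rebooted the device.

    Each cycle the task leaks `leak_per_cycle` bytes and kicks the watchdog.
    Once the allocation no longer fits, the task is stuck and stops kicking;
    after `watchdog_timeout` missed kicks the watchdog resets the device,
    which hands all memory back.
    """
    used = 0
    missed_kicks = 0
    reboots = []
    for cycle in range(cycles):
        if used + leak_per_cycle <= heap_bytes:
            used += leak_per_cycle      # the leak: allocated, never freed
            missed_kicks = 0            # healthy task kicks the watchdog
        else:
            missed_kicks += 1           # allocation failed, task is stuck
            if missed_kicks >= watchdog_timeout:
                reboots.append(cycle)   # watchdog fires
                used = 0                # reboot clears the leaked memory
                missed_kicks = 0
    return reboots


# Hypothetical device: 256 MiB free, leaking 64 bytes every second.
days = days_until_exhaustion(256 * 1024 * 1024, 64)

# Toy run: 100-byte heap, 10 bytes leaked per cycle, 30 cycles.
reboot_cycles = simulate_watchdog(heap_bytes=100, leak_per_cycle=10, cycles=30)
```

With these made-up numbers the device survives about 48.5 days before running out of memory, and the toy simulation reboots at a fixed period (cycles 12 and 25). That regularity is what makes such failures look like "it just restarts once in a while".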

A watch that executes a random reboot is not a huge problem. But imagine a flight computer on an aircraft going dark for a couple of minutes during an approach to an airport. That is a disaster and probably means an immediate go-around, but if the cockpit is dark, even going around blind may be a huge issue. And this does happen. Garmin makes avionics too :) But it also happens on big passenger aircraft, of any brand. The FAA is mandating that operators of Boeing’s 787 Dreamliner periodically reset the power on the airplane to avoid a glitch that could cause all three computer modules that manage the jet’s flight control surfaces to briefly stop working while in flight.

Reboots are also considered the default fix for all sorts of weird problems, like the one that left a driver of a connected car stranded in a remote part of California: "Driver stranded after connected rental car can’t call home".

The second problem, also papered over by watchdogs and reboots, is variable overflow. There may be a counter that is incremented once in a while. At some point it reaches its maximum value and, when incremented blindly, wraps around to the minimum value. An application may test whether the counter is bigger than x, and suddenly after the last increment it becomes smaller than x, triggering an unexpected avalanche of actions, which usually ends in a reboot. Getting to such a condition may take, again, days or weeks or months. It will manifest itself as a small blink. But again, it should never happen.
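The wrap-around described above can be sketched in a few lines. Python integers do not overflow, so the sketch simulates a 32-bit unsigned counter by masking. The counter width, the 1 kHz tick rate and all function names are assumptions chosen for illustration.

```python
MASK32 = 0xFFFFFFFF  # assumed counter width: 32 bits, unsigned


def tick(counter: int, step: int = 1) -> int:
    """Increment like a fixed-width register: wraps to 0 past the maximum."""
    return (counter + step) & MASK32


def naive_expired(now: int, deadline: int) -> bool:
    """Fragile: compares raw counter values, so it breaks around the wrap."""
    return now >= deadline


def wrap_safe_expired(now: int, start: int, timeout: int) -> bool:
    """Compares elapsed ticks modulo 2**32; valid while timeout < 2**32."""
    return ((now - start) & MASK32) >= timeout


# A counter ticking at an assumed 1 kHz wraps after this many days.
days_to_wrap = 2**32 / 1000 / 86_400

start = MASK32 - 5              # timer armed 5 ticks before the maximum
timeout = 10
deadline = tick(start, timeout)  # wraps around to 4

one_tick_later = tick(start, 1)
ten_ticks_later = tick(start, 10)

# The naive check fires after a single tick, because the wrapped deadline (4)
# is smaller than the current counter value.
fired_too_early = naive_expired(one_tick_later, deadline)

# The modular check stays correct on both sides of the wrap.
still_waiting = not wrap_safe_expired(one_tick_later, start, timeout)
fired_on_time = wrap_safe_expired(ten_ticks_later, start, timeout)
```

At the assumed 1 kHz tick rate the counter wraps after roughly 49.7 days, which is exactly the kind of delay that a few hours of testing never reaches. Comparing elapsed ticks instead of absolute values is one common way to make such a check survive the wrap.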

Unfortunately, testing for these conditions is difficult and may take (a lot of) time. Which is why we have watchdogs and are experiencing reboots. From tiny watches to big airliners. We live in a reboot era. And (unfortunately) this status quo has been universally accepted, and there is no economic or business incentive to change it. Sad :(
