Late Innocent CRs: a Recipe for Disaster

After last week's emotional high, today we're back down to earth. Which - honestly - I really love, since the devil is always in the details.

We're on the last lap before releasing a major upgrade to our system. This is the stabilization phase, during which - obviously - we should not introduce any new functionality, just fix bugs. But there is always the temptation to squeeze in something extra. And it is usually a recipe for disaster.

But even knowing all this, we needed one more piece of evidence...

This time it was the status page. You know, the one that is displayed when you point your browser to the device's IP address. The support team had asked for the serial numbers to be easily accessible. On his last day before vacation, the program manager entered a simple enhancement ticket in our Trac system: "Implement a status page". Thinking it could not be easier to complete, he left.

The web developer prepared a template. It probably seemed too simple and empty to him, so he added several innocent extras to fill the space: IP / MAC addresses, status of the services, and indicators of error entries in log files. Then he asked our sysadmin for a script to fill in the data.

Not knowing what the script was for (just what content it should return), the sysadmin implemented a bash script. The script ran at system startup and wrote its results to a file on a flash partition. It was slow, and the data it returned was redundant and stale (collected once and stored in flash). Every execution wore the flash. I would not blame the sysadmin for this. Rather, I blame myself or the PM for not specifying the requirements and not discussing the use case among the team members.

Then I got it for testing. I concentrated on the layout, thinking the implementation was so simple it could not be a ticking bomb. Which, apparently, it was. Only several days later did I look into the script and realize that, among other things, it was grepping the logs to find errors. "But grep is fast, especially as we grep /tmp, which resides in RAM." "Really?" - I asked. "What if we use debug logging to a flash drive? I have a 16GB flash drive for logs." "Then it will be slow, very slow!" - he replied. "So get rid of that!"
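For illustration only: one way to keep such an error check cheap regardless of how big the log grows is to scan a bounded tail of the file instead of grepping all of it. This is a minimal Python sketch, not what our script did. The function name `count_recent_errors`, the 64 KB limit and the `ERROR` marker are all assumptions made up for the example.

```python
import os


def count_recent_errors(path, max_bytes=64 * 1024, marker=b"ERROR"):
    """Count lines containing `marker` in at most the last `max_bytes` of a file.

    Returns None when the file cannot be read, so the caller can tell
    "unknown" apart from "no errors".
    """
    try:
        with open(path, "rb") as f:
            f.seek(0, os.SEEK_END)
            size = f.tell()
            start = max(0, size - max_bytes)
            f.seek(start)
            data = f.read(max_bytes)
    except OSError:
        return None

    lines = data.split(b"\n")
    if start > 0:
        # We began reading mid-file, so the first line is likely partial.
        lines = lines[1:]
    return sum(1 for line in lines if marker in line)
```

The cost is then bounded by `max_bytes`, whether the log lives in RAM or on a 16GB flash drive. The trade-off is that older errors outside the window are no longer counted.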

Then our QA team came into action. Speaking of the learning curve, I have to say I'm extremely happy with how they've been approaching things recently. Using the status page, they quickly implemented a DoS (Denial of Service) attack on the entire system, proving we should go back to the drawing board.
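To make the failure mode concrete: if every page request triggers an expensive data collection, anyone reloading the page fast enough can starve the device. One common mitigation is to collect on demand but cache the result in memory for a few seconds. The sketch below is an assumption for illustration, not a description of our final implementation. The class name `CachedStatus`, the `collect` callback and the 5-second TTL are hypothetical.

```python
import threading
import time


class CachedStatus:
    """Serve status data from RAM, refreshing it at most once per TTL."""

    def __init__(self, collect, ttl_seconds=5.0, clock=time.monotonic):
        self._collect = collect      # hypothetical callback that gathers the data
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the cache can be tested
        self._lock = threading.Lock()
        self._value = None
        self._expires_at = None

    def get(self):
        # The lock is held during collection, so concurrent requests wait
        # for one refresh instead of each starting their own.
        with self._lock:
            now = self._clock()
            if self._expires_at is None or now >= self._expires_at:
                self._value = self._collect()
                self._expires_at = now + self._ttl
            return self._value
```

With this in place, a flood of requests costs at most one collection per TTL window, nothing is written to flash, and the data is never older than the TTL.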

The final implementation is correct. But it took us almost three weeks to do and re-do something as simple as an HTML status page. We did everything wrong that we could in the process. The main fault was the lack of clear communication within the team. We were misled by the apparent simplicity and innocence of the task. We should have followed the rules we have in place and treated this simple CR the same way we handle the most complicated ones. But we did not, and it cost us dearly.

I have to say I'm happy this happened. Because ultimately nobody was hurt and we, as a team, have learned a lot. Should such a mistake happen elsewhere, especially after we've reached V1, we would crash and burst into flames. Luckily this incident is just another contributor to our learning curve. Which I consider the most important process within our organization.
