Six Sigma and 24x7 Reliability in Large Scale Web Design
by Gianna Giavelli
Practitioners of Six Sigma or Gemba Kaizen often push process and review as the way to reach near-perfect system uptime. But defect prevention is complicated, and in systems engineering the cycle of hitting a defect and then running a Six Sigma resolution process may be too slow. By too slow I mean that the cycle may iterate while slightly different errors keep occurring, resulting in terrible uptime scores long before all the issues are figured out, if they ever are.
So what to do? What is an appropriate technology strategy when the boss says we have 99.999% uptime contracts? The answer isn't going to please people. The real solution is a Six Sigma defect policy, but only alongside a technology review policy.
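To make the size of that commitment concrete, here is a minimal sketch of the arithmetic behind an availability target. The function name is made up for illustration, and the calculation assumes a 365-day year.

```python
# Allowed downtime per year for a given availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_budget_minutes(availability_percent):
    """Return the minutes of downtime per year that a target permits."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        budget = downtime_budget_minutes(target)
        print(f"{target}% uptime allows about {budget:.1f} minutes per year")
```

At 99.999% the whole year's budget is a little over five minutes, which is why a slow defect-by-defect resolution cycle cannot be the only line of defense.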
Forming a technology policy:
One of the secrets to uptime is having clear, optimized code built on appropriately selected technologies, and knowing those technologies' limits. After I led the re-architecting at Blueturn, the analytics platform went from complex, slow, and (rarely) crashing to stable and performing. You need someone who can really think through edge and black swan conditions, anticipate as many as possible, and defend against them.
Perform a database audit to see how it will survive a hypergrowth situation. Solutions to consider include a distributed multi-node technology like Cassandra, or disk-level measures such as striping plus separation of index and data, and confirming drive space. I once got hit when the number of users went from 400 to over a million very quickly. And if it isn't the db layer itself, it might be the caching technology or app server that can't keep up. Test this case specifically!
Front web services with message queues. Web services are thread bound for performance; message queues are CPU bound but offer greater recoverability. Even for a process which is a synchronous request, routing the call through a message queue ensures that the work item is not lost and will eventually be processed.
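The queue-fronted pattern can be sketched roughly as follows. This is an illustration only: it uses Python's in-memory `queue.Queue`, which is not durable, where a production system would use a persistent message broker. The names `handle_request`, `process`, and `MAX_ATTEMPTS` are hypothetical.

```python
import queue
import threading
import uuid

work_queue = queue.Queue()  # in-memory stand-in for a durable broker
results = {}                # ticket -> outcome
MAX_ATTEMPTS = 3            # assumed retry limit, for illustration only


def handle_request(payload):
    """Hypothetical web-service entry point: enqueue the work, return a ticket."""
    ticket = str(uuid.uuid4())
    work_queue.put({"ticket": ticket, "payload": payload, "attempts": 0})
    return ticket


def process(payload):
    """Stand-in for the real business logic."""
    return payload.upper()


def worker():
    """Drain the queue; a failed item is requeued instead of being dropped."""
    while True:
        item = work_queue.get()
        try:
            if item is None:  # sentinel: shut the worker down
                return
            results[item["ticket"]] = process(item["payload"])
        except Exception:
            item["attempts"] += 1
            if item["attempts"] < MAX_ATTEMPTS:
                work_queue.put(item)  # try again later
            else:
                results[item["ticket"]] = "FAILED"
        finally:
            work_queue.task_done()


if __name__ == "__main__":
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    ticket = handle_request("order-123")
    work_queue.join()       # wait until the item has been processed
    print(results[ticket])
    work_queue.put(None)
    thread.join()
```

The request handler does nothing but enqueue, so a crash in the processing code leaves the work item in the queue (or returns it there) rather than losing it with the request thread.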
Define clear contracts between layers, and performance and load test each layer. This ensures that each path can be separately qualified. The alternative is one path with many, many possible routes through the code, where it is inherently easier to hit an edge case that simply hadn't turned up before.
Review edge cases and protection. Commonly this means testing null value and negative value cases, and large data cases. This will be one of the most painful and tedious things to put into automated testing. A clearly layered architecture, using facade patterns if necessary, will make this level of isolation possible. Beware of missing a technology or code section when reverting to layer testing. Also hold "try to break it" crazy data entry sessions to see if people can randomly come up with something that causes a break. You can do this directly against a service using a tool like SoapUI if need be.
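As a rough sketch of what automated null, negative, and large-data checks can look like, consider the example below. The function `total_quantity` is a made-up stand-in for a service-layer call, and its rule of rejecting bad input with `ValueError` is an assumption for illustration.

```python
def total_quantity(quantities):
    """Hypothetical service-layer function: sum the item quantities of an order."""
    if quantities is None:
        raise ValueError("quantities must not be None")
    total = 0
    for q in quantities:
        if q is None:
            raise ValueError("quantity must not be None")
        if q < 0:
            raise ValueError("quantity must not be negative")
        total += q
    return total


def run_edge_case_checks():
    """Exercise the null, negative, empty, and large-data cases."""
    # Null and negative inputs must be rejected, not silently accepted.
    for bad_input in (None, [None], [1, None, 2], [-1], [5, -3]):
        try:
            total_quantity(bad_input)
        except ValueError:
            pass
        else:
            raise AssertionError(f"accepted bad input: {bad_input!r}")

    # Empty input is a legitimate edge case with a defined answer.
    assert total_quantity([]) == 0

    # Large data case: one million line items.
    assert total_quantity([1] * 1_000_000) == 1_000_000
    return True


if __name__ == "__main__":
    run_edge_case_checks()
    print("all edge case checks passed")
```

Because the function sits behind a single, narrow entry point, the same checks can be pointed at each layer separately.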
Get professionally load tested and verify the usage pattern: you need load testing which is done from multiple sites and multiple computers. Usually doing it in house is not enough if you are serious about 24x7 9999 uptime. Mirroring the hardware is also key. You need to invest in fully identical mirror setups for load testing of this nature, or the results might be useless.
Have extensive logging. Record the method and place in the code with each log entry; don't just save the error. Number the log entries, giving each one a unique code, so you can search for it very quickly; don't count on a stack trace. This is key for the 3am phone call. If you don't have enough logging to investigate an edge case that never should have happened, and to minimize the code that needs to be reviewed and checked, then you need more.
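One way to attach a unique, searchable code and the method location to every entry is sketched below with Python's standard `logging` module. The logger name, the `ORD-` code scheme, and the `reserve_stock` function are all invented for illustration.

```python
import logging

logger = logging.getLogger("orders")  # hypothetical logger name
_handler = logging.StreamHandler()
_handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s [%(event_code)s] "
    "%(funcName)s:%(lineno)d %(message)s"
))
logger.addHandler(_handler)
logger.setLevel(logging.INFO)


def log_event(level, event_code, message, *args):
    """Log with a unique event code; every call site gets its own code."""
    # stacklevel=2 (Python 3.8+) makes funcName/lineno refer to the caller.
    logger.log(level, message, *args,
               extra={"event_code": event_code}, stacklevel=2)


def reserve_stock(sku, quantity):
    """Made-up business method showing entry, rejection, and success codes."""
    log_event(logging.INFO, "ORD-1001",
              "reserve_stock entered sku=%s quantity=%s", sku, quantity)
    if quantity <= 0:
        log_event(logging.ERROR, "ORD-1002",
                  "rejected non-positive quantity=%s for sku=%s", quantity, sku)
        return False
    log_event(logging.INFO, "ORD-1003", "reserved sku=%s", sku)
    return True


if __name__ == "__main__":
    reserve_stock("A1", 2)
    reserve_stock("A1", 0)
```

Searching the code base for `ORD-1002` leads straight to the one statement that produced the entry, with no stack trace required.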
Beware of anything with pointers: sorry, but the big advantage of modern languages is not having pointers, and we gladly accept the tiny performance hit in exchange for giving them up.
Specify development, system and regression test, production, and production mirror environments RIGOROUSLY, and have a strict process for change or deviation. Many times a new rev of a 3rd party tool will inject a bug or unexpected behaviour, so this has to be guarded against throughout the process.
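A strict environment specification is easier to enforce when deviations are detected automatically. The sketch below compares a pinned manifest against what an environment actually reports. The component names and version numbers are invented, and how the actual versions are collected is left out.

```python
def find_drift(pinned, actual):
    """Return {component: (expected, found)} for every deviation.

    A found value of None means the component is missing; an expected
    value of None means the component is present but was never specified.
    """
    drift = {}
    for name, expected in pinned.items():
        found = actual.get(name)
        if found != expected:
            drift[name] = (expected, found)
    for name in actual.keys() - pinned.keys():
        drift[name] = (None, actual[name])
    return drift


if __name__ == "__main__":
    # Invented example data: the specification and a production mirror.
    pinned = {"appserver": "9.0.1", "dbdriver": "4.2.0", "cache": "6.2.5"}
    mirror = {"appserver": "9.0.1", "dbdriver": "4.3.0", "tracer": "1.0.0"}
    for name, (expected, found) in sorted(find_drift(pinned, mirror).items()):
        print(f"{name}: expected {expected}, found {found}")
```

Running a check like this against every environment before each release turns an unannounced 3rd party upgrade into a failed gate instead of a production surprise.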
Set up a review process for every failure and make sure that there is a serious attitude with all necessary parties involved. It should include:
- Results of the Investigation
- Proposal to prevent this AND SIMILAR CLASSES OF PROBLEMS
- REVIEW of PRIOR defects and whether the company is succeeding in preventing them
One key thing for this kind of review session is that a lackadaisical "the bug is fixed" attitude can lead to continued problems. Why did the bug happen? What about the nature of the code, or the approach to it, allowed it to happen? What can we look for in code reviews? If it's a system or 3rd party related issue, then review whether there are other options, or how you can work with the vendor to ensure not just a bug fix but a review of the whole class of problem.
I hope this helps you begin to think through all the issues involved in seeking 24x7 9999 uptimes with modern technologies. I did not go into the process side because much has already been written on the Six Sigma and Gemba Kaizen methodologies. In the end, clean, well-organized code and architecture is MORE important to Six Sigma success with technology than review process and formal process definition. Stressing code elegance and software as an ART, not a commodity, is key for management, which also means you cannot treat your engineers like commodities with a basket of skills either. The art of good code is taught by good senior engineers on down, never in school, and almost never in development companies. Keep that in mind when choosing your senior team.