Whether it’s a blowout at an oil rig or the crash of an e-commerce site, the failure of big, complex systems is usually caused not by one catastrophic component failure or human error, but by a series of small breakdowns in equipment and processes that combine in unpredictable ways.

The 2010 explosion of the Deepwater Horizon drilling rig, for example, has been linked to eight separate mechanical and human failures. These include improperly formulated cement that failed to seal the bottom of the borehole, the misinterpretation of pressure tests that might have signaled imminent disaster, and a dead battery and defective switch that kept preventive equipment from functioning.

In early 2017, a similar series of failures led to a nearly five-hour outage of Amazon’s Simple Storage Service (S3). The problem began when an Amazon employee debugging a billing system inadvertently took a number of critical servers offline. Those systems, in turn, took longer to recover than expected because their recovery processes had not been exercised during years of rapid growth in the S3 service. Finally, because the AWS Service Health Dashboard itself depended on S3, Amazon was initially unable to update it with information about the outage.

The Need for Quality Engineering

The risk of such cascading and interrelated failures is why quality engineering (QE) has become so important.  Quality engineering focuses on building quality into software and services from the very start of the development lifecycle, rather than leaving it to a final testing phase.

While quality engineering is not new, ongoing advances in artificial intelligence (AI) and the availability of massive cloud-based processing power allow quality engineering to deliver far more reliable, flexible, resilient and high-performance applications and architectures more quickly than ever before. Using AI, quality engineers in multiple industries are predicting and preventing equipment failures, and developing physical robots to inspect and even repair hard-to-reach areas in equipment such as aircraft engines. Digital enterprises are using software robots (sometimes called “monkeys”) to automatically find and fix problems before they can crash critical systems.   

AI enables the analysis of years of information about the performance of various software modules to predict defects when certain combinations of code interact. This helps organizations prevent problems, which is far less expensive than fixing them, and better predict whether projects will be completed on time and meet customer needs.
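
To make this concrete, below is a minimal, illustrative sketch of defect prediction from historical change data. The features, synthetic data and model choice are assumptions for demonstration only; a real pipeline would draw on an organization’s own version-control and defect-tracking history.

```python
# Illustrative sketch only: a simple defect-prediction model trained on
# hypothetical historical data about code changes. Feature names and data
# are invented for demonstration.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
n = 2000

# Hypothetical per-change features: lines changed, modules touched,
# historical defect density of those modules, and whether two "risky"
# modules are modified together in the same change.
X = np.column_stack([
    rng.poisson(80, n),          # lines changed
    rng.integers(1, 6, n),       # modules touched
    rng.random(n),               # avg. historical defect density
    rng.integers(0, 2, n),       # risky module-combination flag
])

# Synthetic label: defects are more likely for large, multi-module changes
# that touch historically defect-prone code together.
risk = 0.01 * X[:, 0] + 0.5 * X[:, 1] + 2.0 * X[:, 2] + 1.5 * X[:, 3]
y = (risk + rng.normal(0, 1, n) > 4.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = GradientBoostingClassifier().fit(X_train, y_train)

print("Holdout AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
# Changes scoring above a chosen threshold would be routed for extra review
# and targeted testing before release.
```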

The resulting delivery of “quality at speed” means lower development and support costs, faster time to market, greater customer satisfaction, and opportunities to lead new industry segments with innovative digital solutions.

Two Paths to Quality Engineering   

Today’s digital services tap data from a wide range of sources and must deliver a delightful experience on an array of platforms. This makes for massive complexity and the potential for thousands of failure scenarios. What’s more, the “moving parts” that must work together flawlessly are continuously changing as new devices, protocols and information sources (such as the Internet of Things) enter the picture.

How do we embed a QE-first approach into the development, deployment and enhancement of such digital services?  There are two distinct, yet complementary, requirements: 

  • The first is taking a systems-thinking approach to how all components in the application and services ecosystem work together. We use quality engineering to assess the components and their interactions early in the development cycle, and strengthen them to create a scalable, resilient ecosystem.
  • The second is to constantly monitor the quality of the live system and collect feedback in every area, from performance and security to customer satisfaction. This feedback is amplified and analyzed to drive enhancements, and to verify that the improvements themselves cause no adverse consequences (a simple monitoring sketch follows this list).
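
As a rough illustration of that second requirement, the sketch below compares live quality signals against agreed baselines and flags regressions for follow-up. The metric names, baselines and tolerances are illustrative assumptions, not any particular product’s telemetry.

```python
# Minimal sketch: continuously compare live quality signals against baselines
# and flag regressions. All signals and thresholds below are hypothetical.
from dataclasses import dataclass

@dataclass
class QualitySignal:
    name: str
    baseline: float
    tolerance: float       # allowed relative degradation, e.g. 0.10 = 10%
    higher_is_better: bool

SIGNALS = [
    QualitySignal("p95_latency_ms", baseline=250.0, tolerance=0.10, higher_is_better=False),
    QualitySignal("error_rate", baseline=0.002, tolerance=0.25, higher_is_better=False),
    QualitySignal("csat_score", baseline=4.4, tolerance=0.05, higher_is_better=True),
]

def check_quality(current: dict) -> list:
    """Return regression alerts for any signal outside its tolerance."""
    alerts = []
    for s in SIGNALS:
        value = current.get(s.name)
        if value is None:
            continue
        if s.higher_is_better:
            breached = value < s.baseline * (1 - s.tolerance)
        else:
            breached = value > s.baseline * (1 + s.tolerance)
        if breached:
            alerts.append(f"{s.name}: {value} vs baseline {s.baseline}")
    return alerts

# Example with made-up values sampled from production monitoring.
print(check_quality({"p95_latency_ms": 310.0, "error_rate": 0.0019, "csat_score": 4.1}))
```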

Monkey Armies  

Quality engineering is made much more thorough, rapid and cost-effective through the use of autonomous software agents, or bots. Think of the bots as two types of monkeys running loose in the cloud infrastructure:

  • Monkeys that break: This type of bot injects faults that expose potential failure points, helping create a more resilient ecosystem. Netflix pioneered this approach with Chaos Monkey, which randomly introduces problems to make sure an application can withstand failures without harming the customer experience. We helped a large communications service provider use the open-source Chaos Monkey to run various failure scenarios, such as CPU usage spikes and network latency. We simulated and analyzed both isolated and complex failures, and used the insights to upgrade the infrastructure for a 25% reduction in platform outages. A minimal fault-injection sketch follows this list.
  • Monkeys that build: This type of bot detects failures in live systems and automatically builds fixes through self-learning and self-healing. We apply these bots to analyze how customers use fast-changing applications or services to perform common activities, and to find and fix problems to assure quality and a superior customer experience. At a leading CSP, we used build monkeys to identify and remotely fix common infrastructure problems, resulting in a significant reduction in on-site repair visits.
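
For the “monkeys that break,” here is a minimal, self-contained sketch of controlled fault injection in the spirit of Chaos Monkey. The fault types, durations and thread counts are placeholders; a real exercise would target non-production infrastructure with guardrails and observability in place, using tools such as Chaos Monkey or tc/netem to inject the actual faults.

```python
# Illustrative "monkey that breaks": randomly pick a fault and apply it to a
# test environment for a short, bounded period.
import random
import threading
import time

def cpu_spike(seconds: float) -> None:
    """Burn CPU on a few threads for a bounded period."""
    stop = time.monotonic() + seconds
    def burn():
        while time.monotonic() < stop:
            _ = sum(i * i for i in range(10_000))
    threads = [threading.Thread(target=burn) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def simulated_latency(seconds: float) -> None:
    """Stand-in for injecting network delay (in practice, e.g., tc/netem)."""
    print(f"Injecting ~{seconds}s of artificial latency into test traffic")
    time.sleep(seconds)

FAULTS = [("cpu_spike", cpu_spike), ("network_latency", simulated_latency)]

def run_chaos_round(duration: float = 2.0) -> None:
    name, fault = random.choice(FAULTS)
    print(f"Chaos round: injecting {name} for {duration}s")
    fault(duration)
    print("Chaos round complete: verify dashboards, alerts and recovery behavior")

if __name__ == "__main__":
    run_chaos_round()
```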

Prep for Usage Spikes Now

To help retailers prepare for peak sales periods, we designed a “holiday-readiness” quality assurance package that includes:  

  • AI algorithms that analyze historical and projected trends in both system and business performance to identify potential performance and security issues. We center our analysis on the customer journey to ensure robustness across usage scenarios and graceful degradation of the user experience in the event of failure. A simple trend-projection sketch follows this list.
  • Continual monitoring of the live system to assess potential performance, scalability or security issues and share recommended fixes with the core team. We combine this with a crowdsourced testing exercise to uncover any blind spots. We also use “monkeys that build” to automatically run test and self-healing processes, remediate issues before the user experience is harmed, and keep the service team informed. All of this is enabled by a cloud-based quality engineering framework and a real-time dashboard accessible to all stakeholders.
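
As a rough illustration of the trend analysis behind holiday readiness, the sketch below projects peak traffic from hypothetical historical figures and checks whether tested capacity leaves enough headroom. The numbers, growth model and headroom margin are all assumptions for demonstration.

```python
# Illustrative sketch: project the next holiday peak from prior-year peaks
# and flag a capacity risk if tested capacity lacks sufficient margin.
import numpy as np

# Hypothetical peak requests-per-second observed on the busiest day of each year.
years = np.array([2013, 2014, 2015, 2016])
peak_rps = np.array([4200, 5100, 6300, 7800])

# Fit a simple exponential growth model: log(peak) is roughly linear in time.
slope, intercept = np.polyfit(years, np.log(peak_rps), 1)
projected_peak = float(np.exp(slope * 2017 + intercept))

tested_capacity_rps = 9000   # assumed result of the latest load test
required_headroom = 1.3      # assume a 30% margin over the projection is required

print(f"Projected holiday peak: ~{projected_peak:.0f} rps")
if tested_capacity_rps < projected_peak * required_headroom:
    print("Capacity risk: schedule additional load testing and scaling work")
else:
    print("Projected peak fits within tested capacity with margin")
```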

This is only one example of how a quality engineering-first mindset ensures that quality is engineered into all aspects of the application and systems lifecycle. With today’s low-cost open source frameworks, cloud-based AI platforms, crowd-sourced testing and easily accessible best practices, quality engineering isn’t only for rocket scientists. It’s for anyone who needs to deliver quality, at speed and at scale, for digital success.   

N. Subramanian, a Cognizant AVP and Process and Quality Consulting Leader Offshore, contributed to this blog.

Anbu Muppidathi

Anbu Muppidathi is a member of Cognizant’s executive leadership team and is a senior leader in the company’s Digital Systems and Technology...