Businesses have long thrived on applications that are flexible, cost-effective, scalable and resilient. The measure of a system’s resilience is its ability to withstand the impact of planned or unplanned disruptions, and recover quickly from a failure.

With increased application complexity and a sharp rise in the number and variety of network attacks, however, businesses face a higher risk of costly system outages than ever before, even as these failures become more difficult to predict. It wasn’t long ago that British Airways incurred £80 million in added costs because of a small IT failure that stranded 75,000 passengers. Such incidents make it crucial to create resilient applications that can withstand pressure and avert failure. The goal is a self-sustaining application that tolerates production-level faults while maintaining its performance.

Complexity Leads to a New Approach to Resilience

But just as application design has changed, so must our approach to designing for resiliency. In the traditional world, designers and architects relied on a comprehensive design and prototyping phase to ensure resilience in highly industrialized applications. Testing happened during the prototype phase to assess resilience and suggest changes based on the feedback received.

Modern distributed applications, however, are large and complex, and are deployed on hybrid cloud infrastructures. Traditional heuristics- and analytics-based techniques have proved insufficient in addressing the resiliency challenges of these distributed systems. The mainstream move to Agile software development has put a spotlight on the weaknesses of the traditional approach to performance engineering and testing. With the additional challenge of ensuring performance-driven design considerations, businesses now have a significant charter to address.

Time for a New Model

An effective way to address this challenge is to adopt a model-based system validation technique, deployed during the architecture and design of the logical and physical elements of the platform. We advise using a model that identifies the input parameters affecting real-world application behavior and maps them to the application’s performance, mitigating risk and improving resilience (see figure below). We believe this approach can uncover the “rare” faults that cause significant reputational damage through outages.
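To make the idea concrete, here is a minimal, illustrative sketch of model-based validation: sample the input parameters that drive real-world behavior, run them through a simple performance model, and flag the rare parameter combinations that breach a service-level target. The parameter names, the toy latency model and all thresholds below are assumptions for illustration, not a real client model.

```python
import random

def predicted_latency_ms(request_rate, cache_hit_ratio, node_failures):
    """Toy performance model: latency grows with load and with node failures.
    Coefficients are illustrative assumptions, not measured values."""
    base = 50.0
    load_penalty = (request_rate / 100.0) * (1.0 - cache_hit_ratio) * 40.0
    failure_penalty = node_failures * 120.0
    return base + load_penalty + failure_penalty

def validate(trials=10_000, sla_ms=500.0, seed=42):
    """Sample many input-parameter scenarios and collect SLA breaches."""
    rng = random.Random(seed)
    breaches = []
    for _ in range(trials):
        params = {
            "request_rate": rng.uniform(10, 1000),      # requests/sec
            "cache_hit_ratio": rng.uniform(0.0, 1.0),   # fraction of cached hits
            # Node failures are rare, which is exactly why sampling helps:
            "node_failures": rng.choices([0, 1, 2], weights=[90, 9, 1])[0],
        }
        latency = predicted_latency_ms(**params)
        if latency > sla_ms:
            breaches.append((latency, params))
    return breaches

breaches = validate()
print(f"{len(breaches)} of 10000 sampled scenarios breach the SLA")
```

Because the breaching scenarios combine rare node failures with unlucky load conditions, they are precisely the cases a manual design review tends to miss — the model surfaces them cheaply before any runtime environment exists.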




Major advantages of this approach include:

  • Prevention of technical failures that are otherwise difficult to identify.
  • Reduced maintenance costs.
  • Avoidance of revenue loss due to application unavailability and a degraded user experience.
  • Increased availability of services for customers.
  • Greater stakeholder confidence in production applications.

Applying Chaos Engineering

The resilience of distributed applications needs to be monitored continuously, throughout the application lifecycle. When a runtime environment is not available, resilience can be validated on a representative model of the real application. Chaos engineering – a set of processes and techniques for performing controlled experiments on large-scale distributed applications – can be used in staging, quality assurance, user acceptance testing and even production environments. This approach increases resilience by testing the application against failure conditions that might occur in production. By pairing chaos engineering with a resilience validation model, businesses can cost-effectively unearth rare resilience failures that would otherwise be very difficult to reproduce in a real deployment environment.
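The chaos experiment loop described above can be sketched in miniature: define a steady-state hypothesis, inject faults into a call path, and check whether the steady state holds. The service call, failure rate and success threshold below are all hypothetical; real chaos tooling injects faults at the network or infrastructure layer of a live system, which this in-process sketch only approximates.

```python
import random
import time

def fetch_order(order_id):
    """Hypothetical service call under test."""
    return {"order_id": order_id, "status": "shipped"}

def with_chaos(call, failure_rate=0.2, max_delay_s=0.005, rng=None):
    """Wrap a call so it randomly raises an error or adds latency."""
    rng = rng or random.Random()
    def chaotic(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault")
        time.sleep(rng.uniform(0, max_delay_s))  # injected latency
        return call(*args, **kwargs)
    return chaotic

def run_experiment(trials=200, seed=7):
    """Steady-state hypothesis: calls keep succeeding under injected faults."""
    chaotic_fetch = with_chaos(fetch_order, rng=random.Random(seed))
    successes = 0
    for i in range(trials):
        try:
            chaotic_fetch(i)
            successes += 1
        except ConnectionError:
            pass  # a resilient client would retry or fall back here
    return successes / trials

rate = run_experiment()
print(f"success rate under chaos: {rate:.2%}")
```

The value of the exercise is the comparison: if the observed success rate falls below the steady-state threshold the team has chosen, the experiment has surfaced a resilience gap (a missing retry, fallback or circuit breaker) before production traffic does.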

We’ve used this approach for several clients, including one of the largest cable television providers in the U.S. The goal was to combat resiliency issues related to single points of failure and a lack of failure alert mechanisms. We detected serious storage and CPU utilization issues, triggered by scenarios not covered during design reviews, that would have led to system unavailability.

We also used this approach to aid the performance testing/industrialization phase of a critical release for one of the largest retailers in the world. We detected a message loss scenario in the company’s big data pipeline that would have been difficult to simulate in a production environment, helping the company avoid real-world system outages.

A Continual Journey

Since user behavior is unpredictable, ensuring foolproof resilience is a challenging task. Using a reliable and tested resilience validation framework can help prevent cascading failures, mitigate reputational risk and substantially reduce disaster recovery costs. The aim is to create an efficient, agile and resilient system that ensures businesses avoid the costs and detrimental effects of unexpected system failure.

Prashant Achanta

Prashant is the Research Partner for Cognizant’s Global Technology Office in the India/APAC region, leading the research relationships and project execution across...