How to Make IT Systems Resilient

Expert Architecture Techniques

Free image from Freepiks courtesy MacroVector

No human-made system is immune to failure. However robust we make a service, it will go down sometime. What we architects need to do is deliberately design services to recover quickly. Here’s how, from my experience in architecting and improving resilience in IT systems.

Let’s first sort out the terms. Resilience is different from Availability. Availability focuses on not going down, whereas Resilience aims to recover quickly from a fall. Resilience contributes to availability but has other metrics and specific design techniques. Whereas availability is defined as the percentage of time a service is functional within its required working hours, resilience is defined by NIST as follows:

The ability of an information system to continue to: (i) operate under adverse conditions or stress, even if in a degraded or debilitated state, while maintaining essential operational capabilities; and (ii) recover to an effective operational posture in a time frame consistent with mission needs.

The first part falls under availability design and contributes to it. Please see my article, How to Calculate and Design IT Service Availability, on this aspect.

Here we will focus on the second aspect, recovery from failure.

Let’s examine resilience in four steps:

A. Understanding Failure

B. Resilience Requirements and Measurement

C. Resilience Solution Patterns & Capabilities

D. Resilience Solution Method

A. Understanding Failure

Solutions are easy once we understand the problem well. Let’s look at aspects of failure that affect the service experience and downtime cost.

Failure Contexts

  1. Normal and peak-usage time failure — The probability of failure during normal service usage hours (aka working hours) is higher as the systems are under load and more use cases traverse more systems. The fundamental resilience design is for this context. For example, an eCommerce site could have the most demanding resilience requirements in regular working hours, 6 am to 12 am, with peaks from 10 am to 1 pm and from 7 pm to 10 pm.
  2. Low or no-usage time failure — Some failures will statistically fall into periods of low or no use, and others may happen only in this period as a pattern. Neither type should be neglected, and they’ll reveal interesting gaps in the design that must be rectified. For example, a simple batch transfer job fails every morning at 1 am and doesn’t recover, even though the system load is very low.

Failure Sources

  1. Independent Failure — The component within which something goes wrong in its hardware or software and causes it to fail is called the Origin Point. Every type of failure for all potential origin points should be considered, starting with well-known sources and causes. For example, the storage system of the eCommerce database fails due to a failure in its power supply.
  2. Dependent Failure — It would be an outstanding resilience design if Origin Point failure does not cascade into the failure of other components. The usual case is dependent systems failing in proportion to their dependence on the Origin Point component. For example, the eCommerce portal fails as its database fails as its storage fails, as its power supply fails.

Top ten causes of IT system failure

From independent studies and my experience over three decades, here are the most common causes of hardware and software failure. Cater for these at a minimum in the resilience solution analysis that follows. (Each of these causes has its own micro-causes, e.g. dust accumulation or fan failure for the first one. Addressing that level is beyond the scope of this article.)

Hardware (servers, storage systems, network appliances, etc.)

  1. Overheating
  2. Power supply failure
  3. Motherboard or sub-system board failure
  4. Storage System failure
  5. Driver/firmware incompatibility

Software (applications, http/web servers, database software, integration software, virtual functions, web services, etc.)

  1. Overutilisation
  2. Human error (especially during any change)
  3. Bugs
  4. Untested conditions
  5. Malware or attack

Net Recovery Time

For the business service to recover, the entire dependent hierarchy of systems from the top down to the Origin Point has to —  

a. Recover all individually

b. Resume their working together

The former is usually additive, and the latter is more complicated to understand and design for. For example, the power supply is swapped automatically to a backup in 0.05 s, and the storage system returns online in 0.1 s. Still, the database takes 0.5 s to start using it again. The application layer takes another 0.5 s to begin communicating with the database, plus an additional 0.5 s to clear the queue in its in-memory database. The net recovery time is then 0.05 + 0.1 + 0.5 + 0.5 + 0.5 = 1.65 s. (The method section of this article guides such analysis.)

Net Recovery Time (NRT) will generally be highest for compute, storage and network hardware failures and lowest for the failure of a single application or network software component.
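
The additive part of the worked example in this section can be sketched in a few lines of Python. This is only an illustration under the assumption that the recovery steps run strictly one after another with no overlap; the step labels are mine, and the durations are the ones quoted in the example.

```python
# Recovery steps from the example in this section, in seconds.
# The labels are illustrative; the durations come from the example.
recovery_steps = [
    ("power supply failover", 0.05),
    ("storage system back online", 0.10),
    ("database resumes using storage", 0.50),
    ("application reconnects to database", 0.50),
    ("application clears in-memory queue", 0.50),
]


def net_recovery_time(steps):
    """Sum sequential recovery durations (assumes the steps do not overlap)."""
    return sum(seconds for _, seconds in steps)


nrt = net_recovery_time(recovery_steps)
print(f"Net Recovery Time: {nrt:.2f} s")  # prints "Net Recovery Time: 1.65 s"
```

If some steps can run in parallel in your landscape, a plain sum will overestimate the NRT, so treat this as an upper bound.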

B. Resilience Requirements and Measurement

Resilience Requirements

The trick is to recover just right to get the benefit of user satisfaction and avoid the cost of over-engineering. For resilience, this starts with understanding the impact of downtime in terms of duration and frequency.

Let’s take the example of home internet wifi. Say the service has an availability requirement of 99% between 9 am and 5 pm, which means it can be down for at most 4.8 minutes in that period (1% of 480 minutes). Let’s say that on a bad day, it goes down thrice for about a minute each between 9 and 5. It has met its availability service level, but the disruption to work video calls or movie streaming is unacceptable.
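
The downtime budget behind an availability figure is simply the length of the service window multiplied by the fraction of time the service is allowed to be down. A minimal sketch follows; the function name is my own, and the 8-hour window corresponds to 9 am to 5 pm.

```python
def allowed_downtime_minutes(availability_pct, window_hours):
    """Maximum downtime, in minutes, permitted within one service window."""
    window_minutes = window_hours * 60
    return window_minutes * (1 - availability_pct / 100)


# An 8-hour window (9 am to 5 pm) at two different availability targets.
for target in (99.0, 99.99):
    budget = allowed_downtime_minutes(target, 8)
    print(f"{target}% -> {budget:.3f} minutes per window")
```

Note that the budget says nothing about how the downtime is distributed. Three separate one-minute outages and one three-minute outage consume the same budget but feel very different to the user, which is why resilience needs its own requirement.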

Assuming we accept the availability figure, what resilience would we expect? The large variety of motivations falls into two general areas: Experience and Cost. Depending on the service type, user type, user numbers, and working hours, they may or may not be related.

A. Failure Experience — All IT systems are ultimately used by humans through interaction initiated by them, received by them, or both. The users will be customers, citizens, staff, and partners. Their experience of failures will affect their mood, happiness, health, and productivity. The failure experience may be:

  • Unfelt — If we make the NRT low enough that the user does not realise there was a failure somewhere behind the service. For example, the failed power supply of the storage system is replaced so quickly that eCommerce purchases continue normally.
  • Felt — When failures cannot be prevented from being sensed by the user, we should minimise the mental disruption. For example, the product page (one of the most common eCommerce activities) is loaded in under 2 seconds for all active users during a failure-recovery incident instead of the normal 1 s and doesn’t leave a negative feeling.

B. Failure Cost — Every failure has direct and indirect costs that must be minimised. Direct cost examples are loss of orders, payments, etc. Indirect cost examples are loss of customers, higher operations automation and support costs, etc. Estimate both types of costs for working out the resilience requirements.

To take Failure Experience and Failure Cost into account and arrive at the resilience requirement, use the map below with the service’s stakeholders. (Please use it for the approach rather than reference values. You can download and use it with attribution from here →NRT Template.)

Image made by the author

Resilience Measurement

Let’s say Required Net Recovery Time is RNRT, and Actual Net Recovery Time is ANRT. Then Service Resilience is defined as:

SR = RNRT / Monthly Average (ANRT)

The higher the value of SR, the better the resilience. It can range from close to zero (no resilience) to a very large number (expensive to achieve). For non-life-critical services, our architecture should aim for 1.2 as a reasonable trade-off. For life-critical services, it should be close to 2 for a margin of safety.

  • Example 1: Say RNRT = 2 s and there are three outages in a month with ANRT1 = 0.8 s, ANRT2 = 2.1 s and ANRT3 = 1.5 s. Then SR = 2/[(0.8+2.1+1.5)/3] = 1.36
  • Example 2: Say RNRT = 1 s and there are five outages in a month with ANRT1 = 2 s, ANRT2 = 3 s, ANRT3 = 7 s, ANRT4 = 1 s and ANRT5 = 9 s. Then SR = 1/[(2+3+7+1+9)/5] = 0.23
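
The two examples can be reproduced with a short Python sketch. It follows the arithmetic used in the examples, that is, the required NRT divided by the mean of the month's actual NRTs. The function name is my own.

```python
from statistics import mean


def service_resilience(rnrt, anrt_values):
    """SR as computed in the examples: required NRT over the mean actual NRT."""
    if not anrt_values:
        raise ValueError("need at least one outage to compute SR")
    return rnrt / mean(anrt_values)


print(round(service_resilience(2, [0.8, 2.1, 1.5]), 2))  # prints 1.36
print(round(service_resilience(1, [2, 3, 7, 1, 9]), 2))  # prints 0.23
```

A month with no outages has no ANRT values, so the sketch raises an error rather than inventing a score; how to report such a month is a choice for the service's stakeholders.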

C. Resilience Solution Patterns & Capabilities

Recovery Patterns

  1. Self-healing — IT systems should, by default, be designed to recover by themselves from both independent and dependent failures. This will be the fastest recovery, typically in the millisecond (ms) range. For example, a database that chokes as its allocated memory becomes full acquires more memory from the operating system manager, or if it is unable to read or write to storage, it keeps transactions in memory and retries storage access until it succeeds.
  2. External — Automation — The next best recovery pattern is to monitor an IT component’s health externally and do what it should have done or take more drastic measures. This is slower than self-healing and typically in the sub-second range. For example, an automaton monitors the memory used by the database and allocates more if it gets full, or for storage failure, it detects its recovery and calls a management API to tell the database to start using it again.
  3. External — Human Intervention — If neither self-healing nor automation is available, human intervention is the last resort to recover from failure. This is the slowest, typically in the seconds to minutes range, and may not meet the service resilience requirement in most cases. For example, the database remains choked, and there is no automation to allocate it more memory. The high memory utilisation alarm is seen by an operator, who then allocates more memory manually.
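
The priority order of the three recovery patterns can be sketched as a simple escalation chain. Everything in this sketch is hypothetical: the handler functions, the dictionary keys, and the assumption that a human operator eventually succeeds are mine, purely to illustrate the order in which the patterns are tried.

```python
def recover(component, handlers):
    """Try each (pattern, handler) in priority order; return the first that succeeds."""
    for pattern, handler in handlers:
        if handler(component):
            return pattern
    return None


# Hypothetical handlers; each returns True when the component is healthy again.
def self_heal(component):
    return component.get("can_self_heal", False)


def automated_fix(component):
    return component.get("has_automation", False)


def page_operator(component):
    # Assumption: a human is the last resort and eventually succeeds.
    return True


HANDLERS = [
    ("self-healing", self_heal),
    ("external automation", automated_fix),
    ("human intervention", page_operator),
]

print(recover({"name": "database", "has_automation": True}, HANDLERS))
# prints "external automation"
```

In a real landscape each handler would also carry a time budget, since the slower patterns may not meet the required NRT.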

Failure Source Visibility

All resilience solutions must cater for two situations:

  1. When the cause of failure is known — As the IT of an enterprise matures through large and small quality processes, the causes will be known in most failures. The source of the problem will be addressed quickly and effectively by one of the recovery patterns in the preceding section, especially the first, self-healing, and the second, externally automated rectification.
  2. When the cause is unknown — The cause of the failure may not be known at the time it happens, especially in immature IT landscapes but even in mature ones. Yet, we have to restore the component to a working state. The recovery patterns above apply to this situation, although there will be more of the second, automation, and even more of the third, manual rectification.

Resilience Solution Capabilities

The following enabling capabilities are vital for resilient systems.

  1. Currency — Software and hardware continuously improve, including self-healing and failure avoidance. Ensure that all the components are updated to the latest version or next-to-latest (also known as N-1, N being the current version number), and none of them is beyond End of Support for patching and updates.
  2. Health Monitoring — Monitor the Key Health Indicators (KHIs), especially those that lead to failure. The notifications and alarms can feed into the component, an automation system, or a manual operator. For example, the utilisation of the allocated file system for an application, the temperature of the CPU of a server, etc.
  3. Recovery Automation — One of the most essential applications of automation is failure recovery. Automation assists the component to self-heal, or it takes external actions to fix it or keep the service going in other ways to buy time. Automation needs open, wide, lateral, and deep thinking, and its particulars are beyond the scope of this article. Please study automation as a subject before and while applying it for resilience. For example, automata monitor the memory utilisation of a database and tell it when it’s full, increase it, or switch the function to a standby database.
  4. Timeout Hierarchy — This is a little understood but vital aspect in the design of a resilient service. Timeout is the length of time a component tries to communicate with a provider about something it needs before it gives up. For a service and its resilience to work well, the timeouts in a hierarchy of dependent systems should taper down from being largest at the top to smallest at the bottom. If this is not followed, the behaviour of the service will be error-prone and difficult to control. This article cannot accommodate a complete treatment of this topic, but you can contact me if you need help. For example, say an eCommerce system’s storage fails. Users time out after waiting for a product page to load for 10 s → the browser times out after 8 s → the app server times out after 5 s → the database times out after 3 s.
  5. Backup & Restore — We have reached the three non-real-time capabilities required for recovery. Every component has software, data, configuration, and in many instances, a ‘state’ that should be backed up so that if online recovery fails, the service can be brought back as per the restore time and restore point requirements. Backup composition, delay and frequency should match these requirements and not be picked arbitrarily, as backing up is expensive. For example, an eCommerce site’s orders and payment data may be backed up (almost) continuously, whereas the search history may be backed up every night.
  6. Analytics — Storing information on key health parameters, outages and recoveries allows the detection of failure patterns, dependency chains and root causes. This has two benefits: (a) near-real-time intervention and recovery and (b) proactive improvement in component robustness, automation and human intervention processes. For example, analytics reveals a bug in a network router that brings it down whenever a particular external endpoint accesses a specific internal server port; this allows a temporary automaton to activate an alternative network path and the bug to be fixed.
  7. Resilience Competency — Process, Skills, Artefacts — Everything cannot self-heal or be automated. Human-designed systems will have flaws; we must dedicate humans to preventing and correcting them. This means a resilience process, artefacts, and a dedicated team of architects, designers, and automation and operations experts for medium to large enterprises. Its complete definition is beyond this article, but you can contact me if you need help defining it.
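
The Timeout Hierarchy rule in item 4 above lends itself to a mechanical check. The sketch below assumes the chain is listed from the top (user) to the bottom (database) and flags any adjacent pair where the lower layer's timeout is not strictly smaller than the one above it. The function name and the strictness of the comparison are my assumptions.

```python
def timeout_violations(chain):
    """Return adjacent (upper, lower) pairs where the lower layer's timeout
    is not strictly smaller than the upper layer's."""
    return [
        (upper, lower)
        for (upper, t_up), (lower, t_low) in zip(chain, chain[1:])
        if t_low >= t_up
    ]


# Timeouts in seconds from the eCommerce example, listed top to bottom.
ecommerce_chain = [("user", 10), ("browser", 8), ("app server", 5), ("database", 3)]
print(timeout_violations(ecommerce_chain))  # prints []

# A misconfigured chain: the database waits longer than the app server.
broken_chain = [("app server", 5), ("database", 8)]
print(timeout_violations(broken_chain))  # prints [('app server', 'database')]
```

Running a check like this against the timeout column of your analysis sheets is a cheap way to catch inverted timeouts before they cause erratic behaviour in production.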

Availability Capabilities

The following capabilities give time for components to recover by maintaining the service availability. Strictly speaking, they do not make the failed component recover faster. Still, as this area of architectural thinking is around the user’s service experience and the cost of downtime, we cannot make a gulf between availability and resilience. When designing for availability, follow this article and design for resilience too. And vice versa.

The seven factors of resilience we saw above are used for availability also. In addition, availability needs the following — 

  1. Clustering and load balancing — Compared to having a single component for a function, which becomes a single point of failure, using multiple systems to share the functional load remarkably increases a service’s availability. See my article here →How to Calculate and Design IT Service Availability.
  2. Multipathing — In addition to multiple instances of hardware and software for distributed transactions, we need to provide multiple network (local, wide-area, national, international, etc.) and SAN (Storage Area Network) Fabric paths so that if one path is inoperative another can be used by components.
  3. Capacity Management — One of the frequent causes of service failure is the overloading of one or more components. Unfortunately, many components don’t just top out at a specific capacity and elegantly ask new users and their events to wait but fail catastrophically, and the service goes down for everyone. Predicting the load pattern over hours, days, months, and years and designing for transient and permanent capacity increases is vital for service stability.
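
To get a feel for why clustering helps so much, here is a rough sketch using the standard parallel-redundancy formula. It rests on two simplifying assumptions that rarely hold perfectly in practice: the instances fail independently of each other, and any single surviving instance can carry the full load. The 99% per-instance figure is purely illustrative.

```python
def cluster_availability(instance_availability, instances):
    """Availability of N redundant instances, assuming independent failures
    and that one surviving instance is enough to keep the function running."""
    return 1 - (1 - instance_availability) ** instances


for n in (1, 2, 3):
    print(n, f"{cluster_availability(0.99, n):.6f}")
```

Shared dependencies such as a common power feed or storage array break the independence assumption, which is one reason multipathing and capacity management matter alongside clustering.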

D. Resilience Solution Method

Now that we understand the architectural thinking around failure and recovery, let’s see how to methodically create the solution to achieve the required NRT and Service Resilience.

1. Gather resilience-related information

1.1 Use Case Sequence Systems and Layers Diagram

List the top ten to twenty critical business services and prioritise the ones that must be assuredly resilient. Each service will have more than one use case.

The first essential step is to know what lies beneath each critical use case. Visualise and draw its end-to-end sequence through the various systems it traverses, from the user equipment and its software and hardware to the company’s systems running in private or cloud data centres, and partner systems.

For each system in the path, visualise and draw the layers that comprise it.

The drawing of this Systems and Layers Diagram (SALD) can be summarised below, at a minimum. If more layers are involved, capture them.

Business Service →Use Case →Leg →Functional System →[Layers: App →Middleware →Compute →Storage →Network ]→Integration System →[Same Layers] →Next Functional System →[Same Layers] →Integration System →[Same Layers] →Next Functional System →[Same Layers]….repeat until last system.

Draw one diagram per use case and put the sequence on the diagram.
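
If you want to keep the diagram's content in a machine-readable form alongside the drawing, the sequence above can be captured in a small data structure. This is only a sketch: the class names, the system names, and the choice of Python dataclasses are my assumptions, and the five default layers are the minimum set listed above.

```python
from dataclasses import dataclass, field

# The minimum set of layers named in the sequence above.
STANDARD_LAYERS = ["App", "Middleware", "Compute", "Storage", "Network"]


@dataclass
class System:
    name: str
    kind: str  # "functional" or "integration"
    layers: list = field(default_factory=lambda: list(STANDARD_LAYERS))


@dataclass
class UseCaseSequence:
    business_service: str
    use_case: str
    systems: list  # ordered as the use case traverses them

    def flatten(self):
        """Yield (step, system name, layer) for every layer the use case touches."""
        for step, system in enumerate(self.systems, start=1):
            for layer in system.layers:
                yield step, system.name, layer


# Hypothetical use case with three systems in its path.
place_order = UseCaseSequence(
    "Order",
    "Place order",
    [
        System("Web portal", "functional"),
        System("API gateway", "integration"),
        System("Order management", "functional"),
    ],
)

for row in place_order.flatten():
    print(row)
```

The flattened rows give you a checklist of every system and layer that must appear in the detailed analysis that follows.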

Once you gather timeout information and identify the gaps in the next two steps, add this information to the SALD for ease of understanding and communication. Also, put the relevant Layer-by-Layer Analysis (LBLA) sheet names on it for reference.

The diagram below illustrates this for one use case of one business critical service.

Make diagrams like this for your business service use cases in Visio or a similar diagramming tool. (You can download the sample here and use it with attribution →Sample SALD.)

Image made by the author

1.2 Layer-by-Layer Analysis Spreadsheet

For each unique system across the SALDs, make a spreadsheet using the standard template described below to capture the details of the technology of each component and its layers. 

This will allow us to analyse how to make them resilient. Use the SALDs to ensure every system or layer is included. (While gathering the detailed information, you will likely discover more systems and layers than you had visualised. Please include them in the SALD and spreadsheet.)

For each layer, capture the following information and more if it’s relevant to resilience (and availability, as you can solve for both together).

  1. On sheet 1, put the architecture diagram of the system with all software and hardware components and instances picturised.
  2. On sheet 2, put one row for each standard layer of the system, such as — 
  • HTTP Server
  • Web Server
  • Business Logic Component
  • UI Component
  • Database Component.

Then for each layer above, put these standard layers as sub-rows — 

  • Software 
  • Hypervisor/Kubernetes and OS 
  • Physical server 
  • Network cards 
  • Storage cards 
  • Racks 
  • Storage
  • Network switches, routers

For each row, have the following columns of detailed information and fill them in as applicable — 

  1. Make, name, version, latest version
  2. Number of instances with location, individual IDs, and IPs as relevant
  3. LB instances and rules configured
  4. Failover setup
  5. Timeout value
  6. Capacity utilised
  7. Max connection pool size
  8. Monitoring and alerts configuration
  9. Transaction monitoring configuration
  10. Self-healing configuration
  11. Automated recovery configuration
  12. Manual recovery procedure
  13. Operations team competency
  14. Backup and restore configuration
  15. Logging configuration
  16. Analytics configuration
  17. Experiential impact of failure
  18. The cost impact of failure
  19. Gaps found and actions required
  20. Estimated Recovery Time
  21. Information and analysis provider names
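
If you prefer to generate the blank sheet rather than build it by hand, the rows and columns listed above can be turned into a CSV skeleton. This is a sketch only: the two leading column names ("Component" and "Layer") and the function name are my assumptions, while the sub-rows and the 21 columns are taken from the lists above.

```python
import csv
import io

SUB_ROWS = [
    "Software",
    "Hypervisor/Kubernetes and OS",
    "Physical server",
    "Network cards",
    "Storage cards",
    "Racks",
    "Storage",
    "Network switches, routers",
]

COLUMNS = [
    "Make, name, version, latest version",
    "Number of instances with location, individual IDs, and IPs",
    "LB instances and rules configured",
    "Failover setup",
    "Timeout value",
    "Capacity utilised",
    "Max connection pool size",
    "Monitoring and alerts configuration",
    "Transaction monitoring configuration",
    "Self-healing configuration",
    "Automated recovery configuration",
    "Manual recovery procedure",
    "Operations team competency",
    "Backup and restore configuration",
    "Logging configuration",
    "Analytics configuration",
    "Experiential impact of failure",
    "Cost impact of failure",
    "Gaps found and actions required",
    "Estimated Recovery Time",
    "Information and analysis provider names",
]


def lbla_template(components):
    """Return CSV text with one blank row per (component, sub-row) pair."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["Component", "Layer"] + COLUMNS)
    for component in components:
        for layer in SUB_ROWS:
            writer.writerow([component, layer] + [""] * len(COLUMNS))
    return buffer.getvalue()


sheet = lbla_template(["HTTP Server", "Database Component"])
print(len(sheet.splitlines()), "lines including the header")
```

The resulting text can be saved to a file and opened in any spreadsheet tool, where the architecture diagram can then be added on a separate sheet.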

The screengrab below shows some of the columns of the sample spreadsheet you can download and use with attribution → LBLA Template. (The colour coding is explained in the next section.) 

You can take it as the illustration of one system in the example SALD of the previous section that comprises two components.

Image made by the author

1.3 Big picture on the number of SALDs and LBLAs

Let’s estimate how many diagrams and spreadsheets you’ll need to create. Suppose you identify ten business-critical services with three critical use cases each on average (for example, order — with place order, cancel order, track order; payment — with make payment, collect payment, refund payment).

You would then end up with 10 x 3 = 30 Systems and Layers Diagrams, or SALDs.

Now suppose you find that these thirty use cases traverse fifteen unique systems. Then make one Layer-by-Layer Analysis sheet for each to end up with 15 LBLA sheets.

You can make a mapping table for Use Case vs LBLA sheets or make a folder for each case and store its SALDs and LBLAs together (recommended).
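
The mapping table can also be kept as data, which makes the two counts fall out automatically. In the sketch below the use cases and system names are hypothetical and deliberately fewer than the thirty and fifteen of the example above; the point is only that the number of SALDs is the number of use cases and the number of LBLA sheets is the number of unique systems.

```python
# Hypothetical mapping of (service, use case) to the systems it traverses.
use_case_systems = {
    ("Order", "Place order"): ["Web portal", "Order management", "Inventory"],
    ("Order", "Track order"): ["Web portal", "Order management"],
    ("Payment", "Make payment"): ["Web portal", "Payment gateway"],
}

# One SALD per use case.
sald_count = len(use_case_systems)

# One LBLA sheet per unique system across all use cases.
unique_systems = sorted(
    {system for systems in use_case_systems.values() for system in systems}
)
lbla_count = len(unique_systems)

print(sald_count, lbla_count)  # prints "3 4"
```

Keeping the mapping in one place also tells you which use cases to retest when a given system's resilience solution changes.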

2. Identify the Gaps

Check what recovery solution exists for each layer’s components in the LBLA. It will rarely be a blank slate as most modern systems have some in-built self-healing capabilities. When a new use case is added over existing systems, or you are ensuring resilience for an existing solution, you’ll probably find a patchwork quilt of automation and human intervention with gaps. If it is a new service and IT solution, you’ll find automation and human intervention need to be added. 

Mark the missing, partial, and complete resilience solutions in red, amber, and green, respectively, then fill in the ‘Gaps found and actions required’ column of the LBLA sheet.

3. Detail the Solution

The solution is usually obvious once we know the problem and its root cause well. All we need to do is follow the solution principles and guidelines. In our case, we will look for one of the three solution patterns in this order of priority — self-healing, automation, and manual intervention. (Sometimes, for example, in life-critical services, we may add external automation and manual intervention even if there is a self-healing capability in a component.)

3.1 NRT Estimation and Design

With all the resilience solutions filled in, we will determine whether and how to meet the required Net Recovery Time for the use case of the business service. Follow these steps:

  1. Estimate the recovery time of each layer of every component and system in the LBLA sheets.
  2. Take each layer as a potential Origin Point for failure and add up the NRT as its own recovery time plus that of layers that depend on it.
  3. Identify the Maximum NRT value for the use case. If it exceeds the required NRT, change the recovery solutions until it is equal to or less than the required NRT.
  4. Put the Maximum NRT info on the Systems and Layers Diagram of the use case for broader communication.
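
Steps 1 to 3 above can be sketched as a small calculation over a dependency map. The layer names, recovery times and dependency structure below are hypothetical, and the sketch assumes the simple additive model from the Net Recovery Time section, where the NRT for an Origin Point is its own recovery time plus that of every layer that depends on it, directly or indirectly.

```python
# Hypothetical recovery times (seconds) and a "layer depends on providers" map.
recovery_time = {
    "power supply": 0.05,
    "storage": 0.10,
    "database": 0.50,
    "application": 1.00,
}
depends_on = {
    "storage": ["power supply"],
    "database": ["storage"],
    "application": ["database"],
}


def dependents_of(origin, depends_on):
    """All layers that directly or transitively depend on the origin layer."""
    found = set()
    frontier = [origin]
    while frontier:
        current = frontier.pop()
        for layer, providers in depends_on.items():
            if current in providers and layer not in found:
                found.add(layer)
                frontier.append(layer)
    return found


def nrt_for_origin(origin, recovery_time, depends_on):
    """Own recovery time plus that of every dependent layer (additive model)."""
    affected = {origin} | dependents_of(origin, depends_on)
    return sum(recovery_time[layer] for layer in affected)


def max_nrt(recovery_time, depends_on):
    """Return (origin, NRT) for the worst-case Origin Point."""
    return max(
        ((layer, nrt_for_origin(layer, recovery_time, depends_on))
         for layer in recovery_time),
        key=lambda pair: pair[1],
    )


origin, worst = max_nrt(recovery_time, depends_on)
print(origin, round(worst, 2))  # prints "power supply 1.65"
```

Compare the worst-case value with the required NRT; if it is too high, revisit the recovery solution of the layers that contribute most and run the calculation again.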

3.2 HLD and LLD

Describe in a separate document or the LBLA sheet the High-Level Design (HLD) of the recommended solution for each gap found in step 2.

Brief the infrastructure and software architects on the resilience solutions so they can make the Low-Level Design (LLD) for them.

Brief the Testing SMEs, Project Managers, Operations SMEs, and others as required with the SALD, LBLA with HLD, and LLD so all the usual parts of a typical Software Delivery Life Cycle (SDLC) can be planned and financed.

4. Build and Test

Once the required self-healing, automation and manual solution has been designed, the necessary infrastructure, software, configuration, integration, etc., is built per the low-level design documents and the SDLC plan.

Testing for resilience needs particular diligence in recreating the failure scenarios to verify the recovery solutions. Lay out the test cases, runs and results in standard templates. Investing time, effort and tools for resilience testing will pay off handsomely.

5. Deploy and Operate

The overall resilience solution must be deployed and operated like a business service. Document the operations team, artefacts and processes, including roles and responsibilities, dashboards, standard operating procedures, Root Cause Analysis documents, incident reports, logs, workflows, emergency contacts, support arrangements, escalation matrix, etc. The EA team should conduct a bi-annual or annual review of resilience operations to improve the solutions.

End Note

Resilient systems make life pleasant for people and keep the wheels of commerce running smoothly. It’s your responsibility, my dear architect, to deliver resilience methodically. Do it for all your solutions, and take my help if you need it.
