How to Calculate and Design IT Service Availability

Image by the author from Canva Pro

A common problem that rarely gets the attention it deserves is the proper specification, design and estimation of the availability of IT services and systems. The result is a shockingly high number of outages of web portals, mobile applications, and many business support systems that the ordinary person never sees.

This article lays down a systematic method to think about the availability of a business service and what underlies it.

Three foundations of service availability

A. The inherent availability of the physical IT system

  1. The formula below expresses the availability of a physical system or software application:
    A = MTBF / (MTBF + MTTR)
    Where MTBF = Mean Time Between Failures, MTTR = Mean Time To Repair. E.g., A = 0.999 when MTBF = 3 years and MTTR = 1 day.
  2. Availability values come out as fractions under 1, e.g., 0.999, and we commonly refer to them by multiplying by 100, e.g., as 99.9% availability.
  3. We use the number of 9s in the value as a reference, and we have three 9s (99.9%), four 9s (99.99%), five 9s (99.999%), etc., although other values such as 99.5% or 99.95% are also common.
  4. Although commonly applied to hardware, MTBF and MTTR also apply to software, which fails through memory leaks, bugs, incompatibilities, etc., that are far from uncommon.
  5. Typical MTBF for hardware is in years and MTTR in hours to days. Typical MTBF for software is in days to decades, and MTTR is in hours to weeks, depending on maturity.
  6. MTBF and MTTR exclude planned downtime for patching, upgrading, changes, etc.
  7. Availability is not the same as working hours. A system could have 24×7 or Mon-Fri 9–6 as working hours, but the system’s availability within the working hours is vital to customer satisfaction and is the aspect that requires designing.
  8. Several systems in series realise a business service. These systems comprise subsystems and components, to use the formal architectural terms.
  9. The availability of the business service is a combination of the availability of the series of systems it traverses.
  10. We express the availability of systems in series by the formula: A = Ax * Ay. Multiplying numbers less than one pushes the overall result further below one. Intuitively, if one thing depends on another that is not 100% available, its own availability reduces for those it serves. E.g., A = 0.998 if Ax and Ay are 0.999 each.
  11. But the wonderful thing is that using a set of duplicate systems in parallel remarkably increases the set’s availability. We express the availability of groups or clusters with n nodes of availability Ax each as:
    A = 1-(1-Ax)^n. E.g., A = 0.999999 for two parallel nodes of Ax = 0.999. So three nines become six nines in this simple case of a two-node cluster.
  12. To get a service’s overall availability, first calculate the availability of each system as a cluster of its parallel nodes (anywhere from 1 to n nodes). Then multiply the availabilities of the systems in series to get the total availability of the service.
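
The formulas in points 1, 10 and 11 can be sketched in a few lines of Python. This is only a minimal illustration: the function names `availability`, `series` and `parallel` are hypothetical, and MTBF and MTTR are assumed to be given in the same unit (hours here).

```python
import math

def availability(mtbf: float, mttr: float) -> float:
    """Inherent availability: A = MTBF / (MTBF + MTTR), both in the same unit."""
    return mtbf / (mtbf + mttr)

def parallel(node_availability: float, nodes: int) -> float:
    """Cluster of n identical nodes in parallel: A = 1 - (1 - Ax)^n."""
    return 1.0 - (1.0 - node_availability) ** nodes

def series(*availabilities: float) -> float:
    """Systems in series: multiply the individual availabilities."""
    return math.prod(availabilities)

# The three examples from the list above.
print(round(availability(3 * 365 * 24, 24), 3))  # MTBF = 3 years, MTTR = 1 day -> 0.999
print(round(series(0.999, 0.999), 6))            # two systems in series -> 0.998001
print(round(parallel(0.999, 2), 6))              # two-node cluster -> 0.999999
```

Point 12 then amounts to calling `parallel` once per system and passing the results to `series`.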

B. The organisation and process for proactive management of IT system availability

Prevention is better than cure. Proactively keeping systems healthy is vital to their availability and hence that of the business service.

  1. Create an L2 support team with skills for each system and technology in the chain, including applications, middleware and hardware.
  2. Apply a process for the L2 support team to keep the systems patched and upgraded to the latest or second-latest version.
  3. The process should include monitoring capacity and augmenting it well in time if required.

C. The organisation and process for reactive management of IT system availability

Despite proactive care and high-quality physical systems, there will be failures. That’s why MTBF and MTTR exist. Fixing problems quickly ensures MTTR remains low and can keep the MTBF high if we do the Root Cause Analysis well.

  1. Create an L3 operations team with skills for each system and technology in the chain, including applications, middleware and hardware.
  2. Apply a process for the L3 operations team to use notifications and alerts to prevent issues.
  3. When an issue occurs, the process should include rapid responses in two ways:
  • Begin Root Cause Analysis (RCA). Immediately collect any forensic data that will not be available after systems restart.
  • Implement workarounds, short-term fixes, and long-term fixes.

Example availability calculation

Business Service

Search for a product on an eCommerce site from a web browser.

Systems in the path

Browser->Wifi Modem->ISP Fibre/Copper->Internet GW->Service Provider NW->Border GW->External Firewall->Reverse Proxy->Internal Firewall->Web Server->Search Engine->Database
(Take each software application with its underlying operating system and hardware as one system)

Cluster Sizes

One->One->One->Two->Many Paths->Two->Two->Two->Two->Three->Two->Two

Diagram with availabilities marked

[Diagram of the systems in the path with their availabilities marked, split across two images. Images by the author.]

(Availabilities are as typically observed in India)

A. Calculated net availability of the physical system

A = 0.995 x 0.999 x 0.995 x [1-(1-0.999)²] x 0.9999 x [1-(1-0.9999)²] x [1-(1-0.9999)²] x [1-(1-0.999)²] x [1-(1-0.9999)²] x [1-(1-0.9995)³] x [1-(1-0.995)²] x [1-(1-0.9995)²]

= 0.9889 or 98.89%

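This availability figure means about 8 hours of unplanned unavailability per month on average (1.11% of roughly 720 hours). If planned maintenance takes 2 hours every month, that comes on top, because MTBF and MTTR exclude planned downtime. This level of unavailability is quite typical of web application services in India.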
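
The calculation above can be reproduced with a short Python sketch. The node counts follow the "Cluster Sizes" list and the per-node availabilities follow the order of the terms in the formula; since the diagram is not reproduced here, this pairing is an assumption. The names `PATH` and `cluster_availability` are hypothetical, the Service Provider NW figure is taken as the net availability of its many paths, and a 30-day month is assumed.

```python
import math

# (system, availability of one node, nodes in the cluster)
PATH = [
    ("Browser",             0.995,  1),
    ("Wifi Modem",          0.999,  1),
    ("ISP Fibre/Copper",    0.995,  1),
    ("Internet GW",         0.999,  2),
    ("Service Provider NW", 0.9999, 1),  # "many paths": treated as one net figure
    ("Border GW",           0.9999, 2),
    ("External Firewall",   0.9999, 2),
    ("Reverse Proxy",       0.999,  2),
    ("Internal Firewall",   0.9999, 2),
    ("Web Server",          0.9995, 3),
    ("Search Engine",       0.995,  2),
    ("Database",            0.9995, 2),
]

def cluster_availability(node_availability: float, nodes: int) -> float:
    """Availability of n identical nodes in parallel: 1 - (1 - Ax)^n."""
    return 1.0 - (1.0 - node_availability) ** nodes

# Systems in series: multiply the cluster availabilities along the path.
service_availability = math.prod(
    cluster_availability(a, n) for _, a, n in PATH
)

HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month
unplanned_downtime_hours = (1.0 - service_availability) * HOURS_PER_MONTH

print(f"Service availability: {service_availability:.4f}")  # 0.9889
print(f"Unplanned downtime: {unplanned_downtime_hours:.1f} hours per month")  # 8.0
```

Changing a single node count or availability in `PATH` shows at once which systems dominate the result. Here the three single-node systems at the user's end account for almost all of the unavailability.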

B. The organisation and process for proactive management of IT system availability

A team of 10 L2 personnel and a process (defining the process is beyond the scope of this article).

C. The organisation and process for reactive management of IT system availability

A team of 15 L3 SMEs and a process (defining the process is beyond the scope of this article).

Conclusion

Availability needs a methodical selection of physical systems and their cluster sizes and the thoughtful deployment of the required personnel and processes.

Thinking about availability numerically, knowing how to estimate it and designing for it is not difficult if you know how. Now you do. All you have to do is apply it. Don’t try to critique it and argue about it. Just apply it. When you’ve done that at least twenty times for real solutions, then we can quibble.

Enjoy, dear architect.


Shashi on LinkedIn, Medium, FB, Twitter
