It is very common in IT to see “Service Level Agreements” specifying a certain amount of uptime. This is usually expressed in “nines”: when someone talks about five nines, they’re referring to 99.999% uptime.
Very few services actually attain it, or even come close. All it takes is one bad day and the “downtime budget” for the next century is cooked.
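To put rough numbers on that, here is a minimal sketch of the arithmetic. The function names are made up for illustration, and it assumes a 365.25-day year and that the SLA is measured per calendar year; real contracts often use a monthly window instead.

```python
# Rough arithmetic for an uptime SLA expressed in "nines".
# Assumption: the budget is measured over a 365.25-day year.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def downtime_budget_seconds(nines: int) -> float:
    """Downtime allowed per year for an availability of `nines` nines."""
    unavailability = 10 ** -nines  # five nines -> 0.00001
    return SECONDS_PER_YEAR * unavailability


def years_of_budget_consumed(outage_seconds: float, nines: int) -> float:
    """How many years of downtime budget a single outage uses up."""
    return outage_seconds / downtime_budget_seconds(nines)


if __name__ == "__main__":
    for n in range(2, 6):
        minutes = downtime_budget_seconds(n) / 60
        print(f"{n} nines: {minutes:.2f} minutes of downtime per year")

    one_day = 24 * 60 * 60
    print(f"A 24-hour outage uses {years_of_budget_consumed(one_day, 5):.0f} "
          f"years of a five-nines budget")
```

Under those assumptions, five nines allows a little over five minutes of downtime a year, and a single 24-hour outage eats well over two centuries of budget.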
But what’s the real problem? Is it the mean time between failures? Or the total amount of time offline? Or is it how long business is disrupted for? I believe it’s the last of these.
So here’s my thought bubble: SLAs could be specified as Maximum Expected Downtime. We could even do it with nines, if we liked, with some help from the actuaries. “30 seconds, five nines” would mean that in 99.999% of downtime events, the system is available again in less than 30 seconds.
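One way to read “30 seconds, five nines” is as a check against a log of past outage durations. The sketch below is only an illustration of that reading: the function name and the input format (a plain list of durations in seconds) are assumptions, not part of any existing SLA tooling.

```python
from typing import Sequence


def meets_recovery_sla(outage_seconds: Sequence[float],
                       max_seconds: float,
                       nines: int) -> bool:
    """Check a "max_seconds, N nines" recovery SLA against observed outages.

    The SLA holds if at most 1 in 10**nines downtime events took
    max_seconds or longer to recover. Integer arithmetic avoids
    floating-point trouble when comparing against 0.99999.
    """
    total = len(outage_seconds)
    if total == 0:
        return True  # no downtime events, so nothing violated the SLA
    slow = sum(1 for duration in outage_seconds if duration >= max_seconds)
    # slow / total <= 10**-nines, rearranged to stay in integers
    return slow * 10 ** nines <= total


if __name__ == "__main__":
    # Hypothetical history: 200,000 quick failovers and two slow recoveries.
    history = [4.0] * 199_998 + [45.0, 120.0]
    print(meets_recovery_sla(history, max_seconds=30, nines=5))
```

Note what this implies: a single slow recovery can only be tolerated once there are at least 100,000 downtime events on record, which is where the actuaries would come in, estimating the tail from a model rather than waiting for that much history.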
For those who are worried about interruptions to service and who want to keep the old measure: does it really work for you? And how seriously do you need that old-style uptime SLA? If you really mean it, consider buying a NonStop or z-Series.