A thought bubble about service-level agreements.

It is very common in IT to see “Service Level Agreements” specifying a certain amount of uptime. This is usually considered in “nines”: when someone talks about five nines, they’re referring to 99.999% uptime.

Very few services actually attain it, or even come close. All it takes is one bad day and the “downtime budget” for the next century is cooked.
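To put rough numbers on that "downtime budget", here is a minimal sketch of the arithmetic. It assumes availability is measured over a 365.25-day year; the function names are mine, for illustration only.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # assumption: average year length


def downtime_budget_seconds(nines: int, period_seconds: float = SECONDS_PER_YEAR) -> float:
    """Allowed downtime in a period for an uptime target of `nines` nines."""
    unavailability = 10 ** -nines  # five nines -> 0.00001
    return period_seconds * unavailability


def years_of_budget_used(outage_seconds: float, nines: int) -> float:
    """How many years' worth of budget a single outage consumes."""
    return outage_seconds / downtime_budget_seconds(nines)


if __name__ == "__main__":
    budget = downtime_budget_seconds(5)
    print(f"Five nines allows about {budget / 60:.2f} minutes of downtime per year")
    one_day = 24 * 60 * 60
    print(f"A one-day outage uses about {years_of_budget_used(one_day, 5):.0f} years of budget")
```

Under these assumptions five nines works out to a little over five minutes a year, so a single day offline uses up well over two centuries of budget.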

But what’s the real problem? Is it the mean time between failures? Or the total amount of time offline? Or is it how long business is disrupted for? I believe it is the last.

So here’s my thought bubble: SLAs could be specified as Maximum Expected Downtime. We could even do it with nines, if we liked, with some help from the actuaries. “30 seconds, five nines” would mean that in 99.999% of downtime events, the system is available again in less than 30 seconds.
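As a sketch of how such an SLA might be checked against an outage log: the function below is hypothetical, and it assumes you have a list of recovery times in seconds, one per downtime event. It uses integer arithmetic to avoid floating-point surprises at the boundary.

```python
def meets_recovery_sla(durations_seconds, max_seconds, nines):
    """True if at least the `nines` fraction of outages ended in under `max_seconds`.

    Hypothetical check for a "30 seconds, five nines" style SLA:
    durations_seconds is one recovery time per downtime event.
    """
    if not durations_seconds:
        raise ValueError("no downtime events to evaluate")
    too_slow = sum(1 for d in durations_seconds if d >= max_seconds)
    # too_slow / total <= 10**-nines, rearranged to stay in integers
    return too_slow * 10 ** nines <= len(durations_seconds)


if __name__ == "__main__":
    # Made-up data: 99,999 five-second blips and one two-minute outage.
    log = [5.0] * 99_999 + [120.0]
    print(meets_recovery_sla(log, max_seconds=30, nines=5))
```

Note that a direct count like this only says something once you have on the order of 100,000 events; with fewer, a single slow recovery fails the test outright, which is where the actuaries would come in.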

For those who are worried about interruptions to service and who want to keep the old measure: does it really work for you? And how seriously do you need that old-style uptime SLA? If you really mean it, consider buying a NonStop or z-Series.


2 Responses to A thought bubble about service-level agreements.

  1. billie says:

    Service level agreements should also include a statement of corporate social responsibility along with maximum expected downtime.

    Should organisations be able to offshore jobs, leaving educated, experienced and skilled Australian workers on the unemployment scrap heap? Retrenched workers are generally told to retrain for hospitality or computer programming. Hospitality wants young, good-looking workers with pleasant personalities.
    Computer jobs are being offshored on a daily basis.
    It makes you question the value of a technical education if you are going to end up in back-room operations. Back-room operations can easily be offshored; the secure jobs are the face-to-face marketing positions.
    God help us – we are becoming a nation of marketers, all BS no substance!

  2. John Quiggin says:

    I guess it depends on whether failures imply data loss. If so (as with a power loss), the duration isn’t that important; it’s the frequency that matters.
