Overview
Availability Management is the practice of identifying levels of IT Service availability for use in Service Level Reviews with Customers.
All areas of a service must be measurable and defined within the Service Level Agreement (SLA).
To measure service availability, the following areas are usually included in the SLA:
- Agreement statistics: such as what is included within the agreed service.
- Availability: agreed service times, response times, etc.
- Help Desk Calls: number of incidents raised, response times, resolution times.
- Contingency: agreed contingency details, location of documentation, contingency site, 3rd party involvement, etc.
- Capacity: performance timings for online transactions, report production, numbers of users, etc.
- Costing Details: charges for the service, and any penalties should service levels not be met.
Availability is usually calculated based on a model involving the Availability Ratio and techniques such as Fault Tree Analysis, and includes the following elements:
- Serviceability: where a service is provided by a 3rd party organisation, this is the expected availability of a component.
- Reliability: the time for which a component can be expected to perform under specific conditions without failure.
- Recoverability: the time it should take to restore a component back to its operational state after a failure.
- Maintainability: the ease with which a component can be maintained, whether the maintenance is remedial or preventative.
- Resilience: the ability to withstand failure.
- Security: the ability of components to withstand breaches of security.
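As a minimal sketch of the Availability Ratio mentioned above, the commonly used form expresses availability as the share of agreed service time during which the service was actually up. The function names, parameters, and sample figures below are illustrative assumptions, not part of any SLA. The second function relates Reliability and Recoverability to availability through mean time between failures (MTBF) and mean time to restore (MTTR). These two terms are not used in the text above and are introduced here only as an assumption about how those elements are typically quantified.

```python
def availability_ratio(agreed_service_hours: float, downtime_hours: float) -> float:
    """Return availability as a percentage of the agreed service time."""
    if agreed_service_hours <= 0:
        raise ValueError("agreed_service_hours must be positive")
    if not 0 <= downtime_hours <= agreed_service_hours:
        raise ValueError("downtime_hours must lie between 0 and agreed_service_hours")
    return (agreed_service_hours - downtime_hours) / agreed_service_hours * 100


def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Return availability as a fraction, from mean uptime between failures
    (MTBF, an assumed measure of Reliability) and mean time to restore
    (MTTR, an assumed measure of Recoverability)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be positive and mttr_hours non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical month: 200 agreed service hours with 4 hours of downtime.
    print(f"{availability_ratio(200, 4):.2f}%")
    # Hypothetical component: 990 hours between failures, 10 hours to restore.
    print(f"{steady_state_availability(990, 10):.4f}")
```

Measuring against agreed service time rather than calendar time keeps the figure aligned with what the SLA actually promises: downtime outside the agreed hours does not count against the ratio.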
Availability Management and IT Security
IT Security is an integral part of Availability Management, whose primary focus is ensuring that IT infrastructure continues to be available for the provision of IT Services.
Some of the elements above are the outcome of a risk analysis, which identifies the resilience measures to put in place, how reliable each element is, and how many problems system failures have caused.
The risk analysis also recommends controls to improve the availability of IT infrastructure, such as development standards, testing, physical security, and having the right skills in the right place at the right time.
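Fault Tree Analysis, named earlier as one of the availability techniques, is one way to make such a risk analysis quantitative. The sketch below assumes that component failures are independent and combines them with AND and OR gates to compare a single-server design against one with a redundant server. The component names and failure probabilities are hypothetical and chosen only for illustration.

```python
from math import prod


def and_gate(probabilities):
    """Probability that ALL independent input failures occur together."""
    return prod(probabilities)


def or_gate(probabilities):
    """Probability that AT LEAST ONE independent input failure occurs."""
    return 1 - prod(1 - p for p in probabilities)


# Hypothetical failure probabilities over the same reporting period.
SERVER_FAIL = 0.02
NETWORK_FAIL = 0.01

# Without resilience: the service fails if the server OR the network fails.
single_server = or_gate([SERVER_FAIL, NETWORK_FAIL])

# With a redundant server: the server tier fails only if BOTH servers fail.
server_tier = and_gate([SERVER_FAIL, SERVER_FAIL])
redundant_servers = or_gate([server_tier, NETWORK_FAIL])

if __name__ == "__main__":
    print(f"single server:     {single_server:.6f}")
    print(f"redundant servers: {redundant_servers:.6f}")
```

In this hypothetical tree, adding the second server leaves the network as the dominant remaining risk, which is the kind of finding a risk analysis would use to decide where the next resilience measure belongs.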