Business Impact Analysis
Business impact analysis (BIA) is the process of determining the potential impacts resulting from the interruption of time-sensitive or critical business processes. IT risk assessment, as well as planning for both disaster recovery and operational continuity, relies on conducting a BIA as part of the overall plan to ensure continued operations and the capability to recover from disaster. The BIA focuses on the relative impact of the loss of operational capability on critical business functions. Conducting a business impact analysis involves identifying critical business functions and the services and technologies required for them, along with determining the associated costs and the maximum acceptable outage period.
For hardware-related outages, the assessment should also include the current age of existing solutions, along with the expected average time between failures, based on vendor data or accepted industry standards. Planning strategies are intended to minimize the cost of an outage by arranging recovery actions to restore critical functions in the most effective manner based on cost, legal or statutory mandates, and calculations of the mean time to restore.
A business impact analysis is a key component in ensuring continued operations. For that reason, it is a major part of a business continuity plan (BCP) or continuity of operations plan (COOP) as well. The focus is on ensuring the continued operation of key mission and business processes. U.S. government organizations commonly use the term mission-essential functions to refer to functions that need to be immediately functional at an alternate site until normal operations can be restored. Essential functions for any organization require resiliency. Organizations also must identify the dependent systems for both the functions and the processes that are critical to the mission or business.
A BCP must identify critical systems and components. If a disaster is widespread or targets an Internet service provider (ISP) or key routing hardware point, an organization’s continuity plan should detail options for alternate network access. This should include dedicated administrative connections that might be required for recovery. Continuity planning should include considerations for recovery in case existing hardware and facilities are rendered inaccessible or unrecoverable. It should also consider the hardware configuration details, network requirements, and utilities agreements for alternate sites.
RTO and RPO
Recovery point objective (RPO) and recovery time objective (RTO) are important concepts of the BCP and form part of the broader risk management strategy. RPO, which specifically refers to data backup capabilities, is the amount of time that can elapse during a disruption before the quantity of data lost during that period exceeds the BCP’s maximum allowable threshold. Simply put, RPO specifies the allowable data loss. It determines how far back in time recovery can reach before the loss of data disrupts the business. For example, if an organization does a backup at 10:00 p.m. every day and an incident happens at 7:00 p.m. the following day, everything that changed in the 21 hours since the last backup would be lost. The recovery point in this case is the backup from the previous day. If the organization set its RPO at 24 hours, this loss would be within the threshold because 21 hours is less than 24 hours.
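The backup example above can be checked with a few lines of Python. This is only a minimal sketch: the function names and the calendar date are illustrative assumptions, and only the 10:00 p.m. backup, the 7:00 p.m. incident, and the 24-hour threshold come from the example.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, incident: datetime) -> timedelta:
    """Return the span of time whose changes would be lost (hypothetical helper)."""
    if incident < last_backup:
        raise ValueError("incident must not precede the last backup")
    return incident - last_backup


def within_rpo(last_backup: datetime, incident: datetime, rpo: timedelta) -> bool:
    """Return True when the data loss window does not exceed the RPO."""
    return data_loss_window(last_backup, incident) <= rpo


# The calendar date is arbitrary; only the times of day matter here.
last_backup = datetime(2024, 1, 1, 22, 0)  # 10:00 p.m. daily backup
incident = datetime(2024, 1, 2, 19, 0)     # 7:00 p.m. the following day

loss = data_loss_window(last_backup, incident)
print(loss)                                                    # 21:00:00
print(within_rpo(last_backup, incident, timedelta(hours=24)))  # True
```

With a stricter RPO, such as 12 hours, the same incident would exceed the threshold, which suggests that backups would need to run more often.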
The RTO is the amount of time within which a process must be restored after a disaster to meet business continuity requirements. The RTO is how long the organization can go without a specific application; it defines the maximum time allowed for recovery after notification of a process disruption.
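A recovery can be compared against the RTO in the same way. The following sketch assumes a hypothetical four-hour RTO and made-up timestamps; none of these values come from a real plan.

```python
from datetime import datetime, timedelta


def outage_duration(disruption_reported: datetime, restored: datetime) -> timedelta:
    """Return the time from notification of the disruption to restoration."""
    if restored < disruption_reported:
        raise ValueError("restoration must not precede the disruption")
    return restored - disruption_reported


def meets_rto(disruption_reported: datetime, restored: datetime, rto: timedelta) -> bool:
    """Return True when the process was restored within the RTO."""
    return outage_duration(disruption_reported, restored) <= rto


# Hypothetical values chosen only for illustration.
reported = datetime(2024, 1, 2, 19, 0)
restored = datetime(2024, 1, 2, 23, 30)
rto = timedelta(hours=4)

print(outage_duration(reported, restored))  # 4:30:00
print(meets_rto(reported, restored, rto))   # False
```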
MTTF, MTBF, and MTTR
When systems fail, one of the first questions asked is, “How long will it take to get things back up?” It is better to know the answer to such a question before disaster strikes than to try to find the answer afterward. Fortunately, established mechanisms can help you determine this answer. Understanding these mechanisms is a big part of the overall analysis of business impact.
Mean time to failure (MTTF) is the length of time a device or product is expected to last in operation. It represents how long a product can reasonably be expected to perform, based on specific testing. MTTF metrics supplied by vendors about their products or components might not have been collected by running one unit continuously until failure. Instead, MTTF data is often collected by running many units for a specific number of hours and then calculating an average based on when the components fail.
MTTF is one of many ways to evaluate the reliability of hardware or other technology and is extremely important when evaluating mission-critical systems hardware. Knowing the general reliability of hardware is vital, especially when it is part of a larger system. MTTF is used for nonrepairable products. When MTTF is used as a measure, repair is not an option.
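One common way to turn a many-unit test into an MTTF estimate is to divide the total accumulated unit-hours by the number of failures observed. The sketch below assumes that approach; vendors may use other methods, and the function name and test figures are hypothetical.

```python
def estimate_mttf(failure_hours: list[float], units_tested: int, test_hours: float) -> float:
    """Estimate MTTF as total unit-hours divided by the number of failures.

    failure_hours holds the hour at which each failed unit stopped working.
    Units that survive the test contribute the full test_hours.
    """
    failures = len(failure_hours)
    if failures == 0:
        raise ValueError("at least one failure is needed for an estimate")
    if failures > units_tested:
        raise ValueError("more failures than units tested")
    if any(h < 0 or h > test_hours for h in failure_hours):
        raise ValueError("failure times must fall within the test period")

    survivors = units_tested - failures
    total_unit_hours = sum(failure_hours) + survivors * test_hours
    return total_unit_hours / failures


# Hypothetical test: 100 units run for 1,000 hours, and 4 of them fail.
print(estimate_mttf([200, 500, 800, 900], units_tested=100, test_hours=1000))  # 24600.0
```

Note that the estimate (24,600 hours) is far longer than the test itself, which is why an MTTF figure should be read as a statistical average rather than a guaranteed life span for any single unit.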
Mean time between failures (MTBF) is the average amount of time that passes between hardware component failures, excluding time spent repairing components or waiting for repairs. MTBF is intended to measure only the time a component is available and operating. MTBF is similar to MTTF, but it is important to understand the difference. MTBF is used for products that can be repaired and returned to use. MTTF is used for nonrepairable products. MTBF is calculated as a ratio of the cumulative operating time to the number of failures for that item.
MTBF ratings can be predicted based on product experience or data supplied by the manufacturer. MTBF ratings are measured in hours and are often used to determine the durability of hard drives and printers. For example, typical hard drives for personal computers have MTBF ratings of about 500,000 hours.
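The MTBF ratio can be sketched as follows. The operating hours and failure count are hypothetical. The second function is an interpretation aid that assumes units fail independently at a constant rate, an assumption that real hardware only approximates.

```python
def mtbf(cumulative_operating_hours: float, failures: int) -> float:
    """Return MTBF as cumulative operating time divided by the number of failures.

    Operating time excludes time spent repairing or waiting for repairs.
    """
    if failures <= 0:
        raise ValueError("at least one failure is needed to compute MTBF")
    return cumulative_operating_hours / failures


def hours_between_fleet_failures(mtbf_hours: float, units: int) -> float:
    """Rough expected time between failures across a fleet of identical units.

    Assumes independent units with a constant failure rate (an assumption).
    """
    if units <= 0:
        raise ValueError("fleet must contain at least one unit")
    return mtbf_hours / units


# Hypothetical server: 8,700 operating hours with 3 failures.
print(mtbf(8700, 3))                               # 2900.0

# A 500,000-hour rating across a hypothetical fleet of 1,000 drives.
print(hours_between_fleet_failures(500_000, 1000))  # 500.0
```

The second result shows why a high MTBF rating does not mean failures are rare in a large installation: under these assumptions, a fleet of 1,000 such drives would see a failure roughly every 500 hours.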
These risk calculations help an organization determine the life spans and failure rates of components and measure the reliability of a product.
One final calculation assists with understanding approximately how long a repair will take on a component that can be repaired. The mean time to repair (MTTR; also called mean time to recovery) is the average time required to fix a failed component or device and return it to production status. MTTR is a measure of corrective maintenance. The calculation includes preparation time, active maintenance time, and delay time. Because of the uncertainty of these factors, MTTR is often difficult to calculate. To reduce the MTTR, some systems have redundancy built in so that when one subsystem fails, another takes its place and keeps the whole system running.
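As a minimal sketch of the MTTR calculation, each repair can be recorded with its three time components and then averaged. The class name, field names, and repair figures below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Repair:
    """Time components of one corrective maintenance event, in hours."""
    preparation_hours: float
    active_maintenance_hours: float
    delay_hours: float

    @property
    def total_hours(self) -> float:
        return self.preparation_hours + self.active_maintenance_hours + self.delay_hours


def mttr(repairs: list[Repair]) -> float:
    """Return the average total repair time across all recorded repairs."""
    if not repairs:
        raise ValueError("at least one repair is needed to compute MTTR")
    return sum(r.total_hours for r in repairs) / len(repairs)


# Hypothetical repair history for one component.
history = [
    Repair(preparation_hours=0.5, active_maintenance_hours=2.0, delay_hours=1.5),
    Repair(preparation_hours=1.0, active_maintenance_hours=3.0, delay_hours=4.0),
    Repair(preparation_hours=0.5, active_maintenance_hours=1.0, delay_hours=1.5),
]

print(mttr(history))  # 5.0
```

Keeping the components separate makes it easier to see where time goes. In this made-up history, delay time accounts for nearly half of the total, so faster parts delivery would lower the MTTR more than faster hands-on work.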