Disaster Recovery and Business Continuity Management
There are many different approaches to BCP and DRP. Some companies address these processes separately, whereas others focus on a continuous process that interweaves the plans. The National Institute of Standards and Technology (NIST) (http://www.csrc.nist.gov) offers a good example of the contingency process in Special Publication 800-34: Continuity Planning Guide for Information Technology Systems (http://tinyurl.com/yb3lcw). In NIST SP 800-34, the BCP/DRP process is defined as
- Develop the contingency planning policy statement.
- Conduct the BIA (business impact analysis).
- Identify preventive controls.
- Develop recovery strategies.
- Develop an IT contingency plan.
- Test the plan, train employees, and hold exercises.
- Maintain the plan.
Before we go further, let's define the terms disaster and business continuity. A disaster is any sudden, unplanned calamitous event that brings about great damage or loss. Entire communities have concerns following a disaster; however, businesses face special challenges because they have responsibilities to protect the lives and livelihoods of their employees, and to guard company assets on behalf of shareholders. In the business realm, a disaster can be seen as any event that prevents the continuance of critical business functions for a predetermined period of time. In other words, the estimated outage might force the declaration of a disaster.
Business continuity is the process of sustaining operation of critical systems. The goal of business continuity is to reduce or prevent outage time and optimize operations. The Business Continuity Institute (http://www.thebci.org), a professional body for business continuity management, defines business continuity management in the following terms:
- Business Continuity Management is a holistic management process that identifies potential impacts that threaten an organization, provides a framework for building resilience, ensures an effective response, and safeguards the interests of its key stakeholders, reputation, brand, and value.
Although there are competing methodologies that can be used to complete the BCP/DRP process, this chapter will follow steps that most closely align with reference documentation recommended by ISC2. Figure 7.1 illustrates an overview of the process, the steps for which are as follows:
Figure 7.1 BCP/DRP process.
- Project initiation
- Business impact assessment
- Recovery strategy
- Plan design and development
- Monitoring and maintenance
We will discuss each of these steps individually.
Project Management and Initiation
Before the BCP process can begin, it is essential to have the support of management. You might need to educate management about the need for a BCP. One way to accomplish this is to prepare and present a seminar for management that gives an overview of the risks the organization faces, identifies basic threats, and documents the costs of potential outages. This is a good time to remind management that, ultimately, they are legally responsible. Customers, shareholders, or anyone else could bring civil suits against senior management if they feel the company has not practiced due care. Without management support, you will not have the funds to complete the project, and resulting efforts will be marginally successful, if at all. Management is responsible for
- Setting the budget
- Determining the team leader
- Starting the BCP process
Management must choose a team leader. This individual must have enough credibility with senior management to influence them in regard to BCP results and recommendations. After the team leader is appointed, an action plan can be established and the team can be assembled. Members of the team should include representatives from management, legal staff, recovery team leaders, the information security department, various business units, networking, and physical security. It is important to include asset owners and the individuals who would be responsible for executing the plan.
Next, determine the scope. A properly defined scope is of tremendous help in maximizing the effectiveness of the BCP. Be sensitive to interoffice politics, which, if out of control, can derail the planning process. Another problem to avoid is project creep, which occurs when more and more items that were not part of the original project plan are added to the plan. This can delay completion of the project or cause it to run over budget.
The BCP benefits from adherence to traditional project plan phases. Issues such as resources (personnel, financial), time schedules, budget estimates, and any critical success factors must be managed. Schedule an initial meeting to kick off the process.
Finally, the team is ready to get to work. The team can expect a host of duties and responsibilities:
- Identifying regulatory and legal requirements that must be complied with
- Identifying all possible threats and risks
- Estimating the probability of these threats and correctly identifying their loss potential
- Performing a BIA
- Outlining the priority in which departments, systems, and processes must be up and running before any others
- Developing the procedures and steps to resume business functions following a disaster
- Assigning tasks to the employee roles, or individuals, that will complete those tasks during a crisis situation
- Documenting plans, communicating plans to employees, and performing necessary training and drills
It's important for everyone on the team to realize that the BCP is the most important corrective control the organization will have, and to use the planning period as an opportunity to shape it. The BCP is more than just corrective controls; the BCP is also about preventive and detective controls. These three elements are described here:
- Preventive—Including controls to identify critical assets and prevent outages
- Detective—Including controls to alert the organization quickly in case of outages or problems
- Corrective—Including controls to restore normal operations as quickly as possible
Business Impact Analysis
The next task is to create the BIA, the role of which is to measure the impact each type of disaster could have on critical business functions. The BIA is an important step in the process because it considers all threats and the implications of those threats. As an example, the city of Galveston, Texas is an island known to be prone to hurricanes. Although it might be winter in Galveston and the possibility of a hurricane is extremely low, it doesn't mean that planning can't take place to reduce the potential negative impact if and when a hurricane arrives. The steps for accomplishing this require trying to think through all possible disasters, assess the risk of those disasters, quantify the impact, determine the loss, and identify and prioritize operations that would require disaster recovery planning in the event of those disasters. The BIA is tasked with answering three vital questions:
- What is most critical?—The prioritization must be developed to address what processes are most critical to the organization.
- How long of an outage can the company endure?—The downtime estimation is performed to determine which processes must resume first, second, third, and so on, and to determine which systems must be kept up and running.
- What resources are required?—Resource requirements must be identified and require correlation of system assets to business processes. As an example, a generator can provide backup power, but requires fuel to operate.
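The three questions above can be captured in a small data structure. The following is a minimal sketch, not a prescribed BIA format: the `BusinessProcess` fields, the 1-to-3 criticality scale, and the sample processes and hour values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class BusinessProcess:
    name: str
    criticality: int            # 1 = most critical (hypothetical scale)
    max_outage_hours: float     # longest outage the company can endure
    resources: list = field(default_factory=list)


def recovery_order(processes):
    """Order processes so the most critical, least outage-tolerant come first."""
    return sorted(processes, key=lambda p: (p.criticality, p.max_outage_hours))


def required_resources(processes):
    """Collect every resource needed, in recovery order, without duplicates."""
    seen = []
    for p in recovery_order(processes):
        for r in p.resources:
            if r not in seen:
                seen.append(r)
    return seen


# Hypothetical sample data.
processes = [
    BusinessProcess("Payroll", 2, 72, ["HR database", "generator", "generator fuel"]),
    BusinessProcess("Order entry", 1, 4, ["web servers", "generator", "generator fuel"]),
    BusinessProcess("Research archive", 3, 336, ["file server"]),
]

for p in recovery_order(processes):
    print(p.name, p.max_outage_hours)
```

Listing the generator and its fuel as separate resources mirrors the point made above: a backup asset is only useful if everything it depends on is also identified.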
The development of multiple scenarios should provide a clear picture of what is needed to continue operations in the event of a disaster. The team creating the BIA will need to look at the organization from many different angles and use information from a variety of sources. Different tools can be used to help gather data. Strohl Systems BIA Professional and SunGard's Paragon software can automate portions of the data input and collection process. Although the CISSP exam will not require that you know the names of various tools, it is important to understand how the BIA process works, and it helps to know tools that are available.
Whether the BIA process is completed manually or with the assistance of tools, its completion will take some time. Anytime individuals are studying processes, techniques, and procedures they are not familiar with, a learning curve will be involved.
As you might be starting to realize, creation of a BIA is no easy task. It requires not only the knowledge of business processes but also a thorough understanding of the organization itself, including IT resources, individual business units, and the interrelationships of each. This task will require the support of senior management and the cooperation of IT personnel, business unit managers, and end users. The general steps within the BIA include
- Determine data-gathering techniques
- Gather business impact analysis data
- Identify critical business functions and resources
- Verify completeness of data
- Establish recovery time for operations
- Define recovery alternatives and costs
Assessing Potential Loss
There are different approaches to assessing potential loss. One of the most popular methods is the use of a questionnaire. This approach requires the development of a questionnaire distributed to senior management and end users. The objective of the questionnaire is to maximize the identification of real loss from the people completing business processes jeopardized by the disaster. This questionnaire might be distributed and independently completed or filled out during an interactive interview process. Figure 7.2 shows a sample questionnaire.
Figure 7.2 BIA questionnaire.
The questionnaire can also be completed in a round table setting. In fact, this sort of group completion can add synergy to the process, provided the dynamics of the group allow for open communication and the required key individuals can all meet to discuss what impact specific types of disruptions would have on the organization. The inclusion of all key individuals must be emphasized because management might not be aware of critical tasks for which they do not have direct oversight.
A questionnaire is a qualitative technique for assessing risk. Qualitative assessments are scenario-driven and do not attempt to assign dollar values to anticipated loss. A qualitative assessment ranks the seriousness of an impact using grades or classes, such as low, medium, high, or critical. This sort of grading process enables quicker progress in the identification of risks, and provides a means of classifying processes that might not easily equate to a dollar value. As an example:
- Low—Minor inconvenience that customers might not notice. Outages could last for up to 30 days without any real inconvenience.
- Medium—Loss of service would impact the organization after a few days to a week. Longer outages could affect the company's bottom line or result in the loss of customers.
- High—Only short term outages of a few minutes to hours could be endured. Longer outages would have a severe financial impact. Negative press might also reduce outlook for future products and services.
- Critical—Outage of any significance cannot be endured. Systems and controls must be in place or be developed to ensure redundancy so that no outage occurs.
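As a rough illustration of how such grades might be applied consistently, the sketch below maps the longest tolerable outage to one of the four classes. The hour thresholds are assumptions chosen to approximate the example descriptions above; an organization would set its own.

```python
def grade_impact(tolerable_outage_hours):
    """Map the longest tolerable outage to a qualitative grade.

    Thresholds are illustrative assumptions, not values from any standard.
    """
    if tolerable_outage_hours <= 0:
        return "Critical"      # no outage of any significance can be endured
    if tolerable_outage_hours <= 24:
        return "High"          # minutes to hours
    if tolerable_outage_hours <= 7 * 24:
        return "Medium"        # a few days to a week
    return "Low"               # longer outages, up to about 30 days


for hours in (0, 2, 96, 720):
    print(hours, grade_impact(hours))
```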
The BIA can also be undertaken using a quantitative approach. This method of analysis attempts to assign a monetary value to all assets, exposures, and processes identified during the risk assessment. These values are used to calculate the material impact of a potential disaster, including both loss of income and expenses. A quantitative approach requires
- Estimation of potential losses and determination of single loss expectancy (SLE)
- Completion of a threat frequency analysis and calculation of the annual rate of occurrence (ARO)
- Determination of the annual loss expectancy (ALE)
The process of performing a quantitative assessment is covered in much more detail in Chapter 10. It is important that a quantitative study include all associated costs resulting from a disaster, such as
- Lost productivity
- Delayed or canceled orders
- Cost of repair
- The value of the damaged equipment or lost data
- The cost of rental equipment
- The cost of emergency services
- The cost to replace equipment or reload data
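The quantitative steps can be sketched in a few lines. The standard relationship is ALE = SLE x ARO. In the sketch below the dollar figures and the ARO are hypothetical, and summing the cost categories into a single loss figure is a simplifying assumption made for illustration.

```python
def single_loss_expectancy(costs):
    """Total loss from one occurrence, summed over the cost categories."""
    return sum(costs.values())


def annual_loss_expectancy(sle, aro):
    """ALE = SLE x ARO, where ARO is the expected occurrences per year."""
    return sle * aro


# Hypothetical figures for a single outage, in dollars.
outage_costs = {
    "lost productivity": 40_000,
    "delayed or canceled orders": 25_000,
    "repair": 10_000,
    "rental equipment": 5_000,
    "emergency services": 2_000,
    "equipment replacement and data reload": 18_000,
}

sle = single_loss_expectancy(outage_costs)
ale = annual_loss_expectancy(sle, aro=0.1)  # one event expected every 10 years
print(sle, ale)
```

A figure like the ALE gives the team a ceiling for what is reasonable to spend each year on controls against that particular event.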
Both quantitative and qualitative assessment techniques require the BIA team to examine how the loss of service or data would affect the company. Each method is seeking to reduce risk and plan for contingencies, as shown in Figure 7.3.
Figure 7.3 Risk reduction process.
The severity of an outage is generally measured by considering the maximum tolerable downtime (MTD), the longest period the organization can survive without that function or service. Will there be a loss of revenue or operational capital, or will the organization be held liable? Although the team might be focused on the immediate effect of an outage, costs can be immediate or delayed. Many organizations are under regulatory requirements, so the result of an outage could be a legal penalty or fine. The organization's reputation could even be tarnished.
Recovery Strategy
Recovery strategies are the predefined actions that management has approved in the event that normal operations are interrupted. To judge the best strategy to recover from a given interruption, the team must evaluate and complete:
- Detailed documentation of all costs associated with each possible alternative
- Quoted cost estimates for any outside services that might be needed
- Written agreements with chosen vendors for all outside services
- Possible resumption strategies in case there is a complete loss of the facility
- Complete documentation of findings and conclusions, delivered as a report to management on the chosen recovery strategy for feedback and approval
This information is used to determine the best course of action based on the analysis of data from the BIA. With so much to consider, it is helpful to divide the organization's recovery into specific areas, functions, or categories:
- Business process recovery
- Facility and supply recovery
- User recovery
- Operations recovery
- Data and information recovery
Business Process Recovery
Business processes can be interrupted due to the loss of personnel, critical equipment, supplies, or office space; or from uprisings, such as strikes. As an example, in 2005 after Katrina, New Orleans had a huge influx of workers in the city rebuilding homes, offices, and damaged buildings. Fast food restaurants were eager to meet the demand these workers had for burgers, fries, tacos, and fried chicken. However, there was insufficient low-cost housing for the fast food industry's employees. The resulting shortage forced fast food restaurants to pay bonuses of up to $6,000 to entice potential employees to the area. It is worth noting that even if the facility is intact after a disaster, people are still required and are an important part of the business process recovery.
Workflow diagrams and documents can assist business process recovery by mapping relationships between critical functions. Let's process an order for a widget to illustrate a sample flow:
- Is the widget in stock?
- Which warehouse has the widget?
- When can the widget be shipped?
- Confirm capability to fulfill order with customer and provide total.
- Process credit card information.
- Verify funds were deposited in the bank.
- Ship item to customer.
- Restock widget for subsequent sales.
A more detailed listing would be appropriate for industrial use, but you get the idea. Building these types of flowcharts allows organizations to examine what resources are required for each step and what functions are critical for continued business operations.
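One way to put such a flowchart to work is to record which resources each step depends on and then ask what stops when a resource is lost. The sketch below does this for the widget order flow; the step labels are shortened and the resource names are hypothetical placeholders.

```python
# Each step of the order flow mapped to the resources it depends on.
# Resource names are hypothetical placeholders.
ORDER_FLOW = [
    ("Check stock", {"inventory system"}),
    ("Locate warehouse", {"inventory system"}),
    ("Schedule shipment", {"shipping system"}),
    ("Confirm order and total", {"order entry", "phone or web"}),
    ("Process credit card", {"payment gateway"}),
    ("Verify deposit", {"bank link"}),
    ("Ship item", {"shipping system", "warehouse staff"}),
    ("Restock", {"inventory system", "supplier link"}),
]


def blocked_steps(flow, lost_resources):
    """Return the steps that cannot run when the given resources are lost."""
    lost = set(lost_resources)
    return [step for step, needs in flow if needs & lost]


def resource_usage(flow):
    """Count how many steps depend on each resource."""
    counts = {}
    for _, needs in flow:
        for r in needs:
            counts[r] = counts.get(r, 0) + 1
    return counts


print(blocked_steps(ORDER_FLOW, ["inventory system"]))
print(resource_usage(ORDER_FLOW))
```

Resources that many steps depend on are natural candidates for the top of the recovery priority list.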
Facility and Supply Recovery
Facility and supply interruptions can be caused by fire, loss of inventory, transportation problems, telecommunications outages, or heating, ventilating, and air conditioning (HVAC) problems. It is too late to start discussions on alternative sites when a disaster is striking your facility. Redundant services enable rapid recovery from these interruptions. Many options are available, from a dedicated offsite facility, to agreements with other organizations for shared space, to the option of building a prefab building and leaving it empty as a type of cold backup site. The following sections examine some of these options.
Organizations might opt to contract their facility needs to a subscription service. The CISSP exam considers hot, warm, and cold sites to be subscription services. Data-processing facilities are expensive. The organization might decide to dedicate the funds for a hot, warm, or cold site. A hot site facility is ready to be brought online quickly. A hot site is fully configured and is equipped with the same system as the production network. It can be made operational within just a few hours. A hot site will need staff, data files, and procedural documentation. Hot sites are a high-cost recovery option, but can be justified when a short recovery time is required. Because hot sites are typically a subscription service, a range of associated fees exist, including monthly cost, subscription fees, testing costs, and usage or activation fees. Contracts for hot sites need to be closely examined because some charge extremely high activation fees to prevent users from utilizing the facility for anything less than a true disaster. To get an idea of the types of costs involved, http://www.drj.com reports that subscriptions for hot sites average 52 months in length and costs can be as high as $120,000 per month. Compare this to cold site subscriptions, which can also run 5 to 6 years in length and can average anywhere between $500 and $2,000 per month.
Regardless of what fees are involved, the hot site needs to be periodically tested. These tests should evaluate processing abilities as well as security. The physical security of the hot site should be at the same level as or greater than that of the primary site. Finally, it is important to remember that the hot site is intended for short-term usage only. As a subscriber-based service, there might be others competing for the same resource. The organization should have a plan to recover primary services quickly or move to a secondary location.
For those companies lacking the funds to spend on a hot site or in situations where a short-term outage is acceptable, a warm site might be a suitable choice. A warm site has data equipment and cables, and is partially configured. It could be made operational in anywhere from a few hours to a few days. The assumption with a warm site is that computer equipment and software can be procured as required after a disaster. Although the warm site might have some computer equipment installed, it is typically of lower processing power than the primary site. The costs associated with a warm site are similar to those of a hot site but slightly lower. The warm site is a popular subscription alternative.
In situations where even longer outages are acceptable, a cold site might be the right choice. A cold site is basically an empty room with only rudimentary electrical power and computing capability. Although it might have a raised floor and some racks, it is nowhere near ready for use. It might take several weeks to a month to get the site operational. Cold sites offer the least preparedness when compared to hot and warm subscription services discussed.
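The trade-off among the three subscription options can be expressed as a simple selection rule: choose the least prepared, and therefore least costly, site that can still be operational within the tolerable outage. The sketch below assumes worst-case activation times of 6 hours, 3 days, and 30 days; these are illustrative stand-ins for "a few hours," "a few days," and "several weeks to a month," not published figures.

```python
# Worst-case activation times in hours. These figures are assumptions chosen
# to match the rough ranges in the text (hours, days, weeks to a month).
SITE_ACTIVATION_HOURS = [
    ("cold", 30 * 24),   # several weeks to a month
    ("warm", 3 * 24),    # a few hours to a few days
    ("hot", 6),          # a few hours
]


def cheapest_site(tolerable_outage_hours):
    """Pick the least prepared (and so least costly) site that is fast enough.

    Returns None when even a hot site cannot meet the requirement.
    """
    for site, activation in SITE_ACTIVATION_HOURS:  # ordered cheapest first
        if activation <= tolerable_outage_hours:
            return site
    return None


for hours in (1000, 100, 12, 1):
    print(hours, cheapest_site(hours))
```

A result of `None` suggests that no subscription site is fast enough and that a redundant site or other fault-tolerant design would need to be considered instead.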
The CISSP exam considers redundant sites to be sites owned by the company. Although these might be either partially or totally configured, the CISSP exam does not typically expect you to know that level of detail. A redundant site is capable of handling all operations if another site fails. Although there is an increased cost, it offers the company fault tolerance. If the redundant sites are geographically dispersed, the possibility of more than one being damaged is reduced. For low to medium priority services, a distance of 10 to 20 miles from the primary site is considered acceptable. If the loss of services, for even a very short time, could cost the organization millions of dollars, the redundant site should be farther away. Therefore, redundant sites that are to support highly critical services should not be in the same geographical region or subject to the same types of natural disasters as the primary site.
For organizations that have multiple sites dispersed in different regions of the world, multiple processing centers might be an option. Multiple processing centers allow a branch in one area to act as backup for a branch in another area. Table 7.1 shows some sample functions and their recovery times.
Table 7.1. Organization Functions and Example Recovery Times
Minutes to hours
Database shadowing (covered in the later section, "Other Data Backup Methods")
7 to 14 days
Research and development
Several weeks to a month
1 to 2 days
1 to 5 days
Mobile sites are another processing alternative. Mobile sites are usually tractor-trailer rigs that have been converted into data-processing centers. These sites contain all the necessary equipment and are mobile, permitting transport to any business location quickly. Rigs can also be chained together to provide space for data processing and provide communication capabilities. Mobile units are a good choice for areas where no recovery facilities exist and are commonly used by the military, large insurance agencies, and others.
Whatever recovery method is chosen, regular testing is important to verify that the redundant site meets the organization's needs, and that the plan can handle the workload to meet minimum processing requirements.
The reciprocal agreement option requires two organizations to pledge assistance to one another in case of disaster. The support requires sharing space, computer facilities, and technology resources. On paper, this appears to be a cost-effective approach, but it has its drawbacks. The parties to this agreement must place their trust in the other organization to provide aid in case of a disaster. However, a nonvictim might become hesitant to follow through when a disaster actually occurs. Also, confidentiality requires special consideration, because the damaged organization is placed in a vulnerable position while needing to trust the sponsoring party that is housing the victim's confidential information. Legal liability can also be a concern: for example, one company agrees to host the other while it is down and, as a result, is hacked. Finally, if the parties to the agreement are in close physical proximity, there is always the danger that disaster could strike both, thereby rendering the agreement useless.
User Recovery
User recovery is primarily about what employees must have to accomplish their jobs. Requirements include
- Procedures, documents, and manuals
- Communication system
- Means of mobility and transportation
- User workspace and equipment
- Alternative site facilities
At issue here is the fact that a company might be able to get employees to a backup facility after a disaster, but if there are no phones, desks, or computers, the employees' ability to work will be severely limited.
User recovery can even include food. As an example, my brother-in-law works for a large chemical company on the Texas Gulf Coast. During storms, hurricanes, or other disasters, he is required to stay at work as part of the emergency operations team. His job is to stay at the facility regardless of time; the disaster might last two days or two weeks. During a simulation test several years ago, it was discovered that someone had forgotten to order food for the facility where the employees were to remain for the duration of the drill. Luckily, the 40 or so hungry employees were not really in a disaster, and were able to order pizza and have it delivered. Had it been a real disaster, no takeout would have been available.
Operations Recovery
Operations recovery addresses interruptions caused by the loss of capability due to equipment failure. Redundancy addresses this potential loss of availability through measures such as redundant equipment, Redundant Array of Inexpensive Disks (RAID), backup power supplies (BPS), and other redundant services.
Hardware failures are one of the most common disruptions that can occur. Preventing the disruptions is critical to operations. The best place to start planning redundancy is when equipment is purchased. At purchase time, there are two important numbers that the buyer must investigate:
- Mean time between failure (MTBF)—Used to calculate the expected lifetime of a device. A higher MTBF means the equipment should last longer.
- Mean time to repair (MTTR)—Used to estimate how long it would take to repair the equipment and get it back into production. Lower MTTR numbers mean the equipment requires less repair time and can be returned to service sooner.
A formula for calculating availability is
MTBF / (MTBF + MTTR) = Availability
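As a quick worked example of the formula, the sketch below computes availability for a device with hypothetical figures of 10,000 hours between failures and 8 hours to repair. Both values must be in the same unit.

```python
def availability(mtbf_hours, mttr_hours):
    """Fraction of time the equipment is expected to be in service."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures: 10,000 hours between failures, 8 hours to repair.
a = availability(10_000, 8)
print(f"{a:.4%}")
```

Because MTTR sits in the denominator, shortening repair time (for example, through a vendor agreement with a guaranteed response time) raises availability even when the MTBF is unchanged.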
To maximize availability of critical equipment, an organization can consider obtaining a service level agreement (SLA). There are all kinds of SLAs. In this situation the SLA is a contract between a company and a hardware vendor, in which the vendor promises to provide a certain level of protection and support. For a fee, the vendor agrees to repair or replace the covered equipment within the contracted time.
Fault tolerance can be used at the server or drive level. For servers, there is clustering, which is technology that allows you to group several servers together, where those servers are viewed logically as a single server. Users see the cluster as one unit. The advantage is that if one server in the cluster fails, the remaining active servers pick up the load and continue operation.
Fault tolerance on the drive level is achieved primarily with RAID, which provides hardware fault tolerance and/or performance improvements. This is achieved by breaking up the data and writing it to multiple disks. To applications and other devices, RAID appears as a single drive. Most RAID systems have hot-swappable disks. This means that faulty drives can be removed and replaced without shutting down the entire computer system. If the RAID system uses parity and is fault tolerant, the parity data can be used to reconstruct the contents of the newly replaced drive. The technique for writing the data across multiple drives is called striping. Although write performance remains almost constant, read performance is drastically increased. RAID has humble beginnings that date back to the 1980s at the University of California. RAID is discussed in depth in Chapter 11, "Operations Security."
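The way parity allows a replaced drive to be rebuilt can be shown with a toy example. The sketch below stripes data across three simulated disks, stores the XOR of those disks as parity, and then reconstructs a lost disk from the survivors. It is a deliberately simplified illustration with a dedicated parity disk and a tiny block size (both assumptions); real RAID controllers differ in layout and detail.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe(data, n_data_disks, block_size=4):
    """Split data into blocks dealt round-robin across the data disks."""
    padded = data + b"\x00" * (-len(data) % (block_size * n_data_disks))
    blocks = [padded[i:i + block_size] for i in range(0, len(padded), block_size)]
    return [b"".join(blocks[d::n_data_disks]) for d in range(n_data_disks)]


data = b"critical business records"
disks = stripe(data, n_data_disks=3)
parity = xor_blocks(disks)            # stored on a separate parity disk

# Simulate losing disk 1, then rebuild it from the survivors plus parity.
survivors = [disks[0], disks[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == disks[1])
```

The rebuild works because XOR is its own inverse: combining the parity with every surviving disk cancels their contributions and leaves only the missing disk's data.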
Although operations can be disrupted because of the failure of equipment, the loss of communications can also disrupt critical processes. Protecting communication with fault tolerance can be achieved through redundant WAN links, diverse routing, and alternate routing. Whatever method is chosen, the organization should verify capacity requirements and acceptable outage times. The primary methods for network protection include the following:
- Diverse routing—This is the practice of routing traffic through different cable facilities. Organizations can obtain both diverse routing and alternate routing, but the cost is not cheap. Most of these systems use buried facilities. These systems usually enter a facility through the basement and can sometimes share space with other mechanical equipment. Recognize that this sharing adds to the risk of potential failure. Also, many cities have aging infrastructures, which is another potential point of failure.
- Alternate routing—Redundant routing provides use of another transmission line if the regular line is busy or unavailable. This can include using a dialup connection in place of a dedicated connection, cell phone instead of a land line, or microwave communication in place of a fiber connection.
- Last mile protection—This is a good choice for recovery facilities; it provides a second local loop connection, and is even more redundantly capable if an alternative carrier is used.
- Voice communication recovery—Many organizations are highly dependent on voice communications. Others have started making the switch to Voice over IP (VoIP) for both voice and fax communication because of the cost savings. Some number of land lines should always be maintained to provide backup capability.
Networks are susceptible to the same types of outages as equipment. If operational recovery concerns are not addressed, these outages can be a real problem for companies that rely heavily on networks to deliver data when needed.
Data and Information Recovery
The focus here is on recovering the data. Solutions to data interruptions include backups, offsite storage, and/or remote journaling. Because data processing is essential to most organizations, the data and information recovery plan is critical. The objective of the plan is to back up critical software and data in a way that permits quick restores with minimum loss of content. Policy should dictate when backups are performed, where the media is stored, who has access to the media, and what the reuse or rotation policy will be. Types of backup media include tape reels, tape cartridges, removable hard drives, disks, and cassettes.
Tape and optical systems still have the majority of market share for backup systems. Common types of media include
- 8mm tape
- CDR/W media (recommended for temporary storage only)
- Digital Audio Tape (DAT)
- Digital Linear Tape (DLT)
- Quarter-Inch Cartridge (QIC)
- Write Once Read Many (WORM)
Another technology worth mentioning is MAID (Massive Array of Idle Disks). MAID offers a distributed hardware option for the storage of data and applications. It was designed to reduce the operational costs and improve the long-term reliability of disk-based archives and backups. MAID is similar to RAID except that it provides power management and advanced disk monitoring. MAID might or might not stripe data and/or supply redundancy. By powering down inactive drives, a MAID system reduces heat output and electrical consumption and increases the drives' life expectancy.
In addition to defining the media type, the organization must determine how often backups should be performed and what type of backup should be performed. Answers will vary depending on the cost of the media, the speed of the restoration needed, and the time allocated for backups. Backup methods include
- Full backup—During a full backup, all data is backed up. No data files are skipped or bypassed. All items are copied to one tape, set of tapes, or backup media. If a restoration is required, only one tape or set of tapes is needed. Full backups take the most time to create, and the most space for storage media, but they also take the least time for restoration. A full backup resets the archive bit on all files.
- Differential backup—A differential backup is a partial backup performed in conjunction with a full backup. Typically, a full backup is done once a week, and a differential backup is done daily thereafter to back up only those files that have changed since the last full backup. Any restoration requires the last full backup and the most recent differential backup. Each differential backup takes less time than a full backup, but restoration time increases because both the full and differential backups will be needed. A differential backup does not reset the archive bit on files.
- Incremental backup—An incremental backup is faster yet to perform. It backs up only those files that have been modified since the previous incremental (or full) backup. Although fast to create, incremental backups involve the most backup sets at restoration time and take the longest to recover from. A restoration requires the last full backup and all incremental backups since the last full backup. An incremental backup resets the archive bit on files.
- Continuous backup—Some backup applications perform continuous backups, and keep a database of backup information. These systems are useful when a restoration is needed because the application can provide a full restore, point-in-time restore, or restore based on a selected list of files.
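The restore requirements of the full, differential, and incremental methods can be compared with a short sketch. The `Backup` record and the `restore_chain` function are hypothetical names introduced only for illustration, and the sketch assumes a backup cycle uses one partial backup type at a time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    day: int   # day on which the backup ran (0 = day of the weekly full backup)
    kind: str  # "full", "differential", or "incremental"


def restore_chain(history):
    """Return the backups needed for a restore, oldest first.

    `history` is a list of Backup records in the order they were taken.
    """
    full_positions = [i for i, b in enumerate(history) if b.kind == "full"]
    if not full_positions:
        raise ValueError("no full backup available to restore from")
    start = full_positions[-1]
    later = history[start + 1:]
    differentials = [b for b in later if b.kind == "differential"]
    incrementals = [b for b in later if b.kind == "incremental"]
    if differentials and incrementals:
        raise ValueError("this sketch assumes one partial backup type per cycle")

    chain = [history[start]]
    if differentials:
        # Each differential holds every change since the full backup,
        # so only the most recent one is needed.
        chain.append(differentials[-1])
    # Each incremental holds only the changes since the previous backup,
    # so all of them must be applied in order.
    chain.extend(incrementals)
    return chain


week_diff = [Backup(0, "full")] + [Backup(d, "differential") for d in range(1, 5)]
week_incr = [Backup(0, "full")] + [Backup(d, "incremental") for d in range(1, 5)]
print(len(restore_chain(week_diff)))  # 2 backup sets
print(len(restore_chain(week_incr)))  # 5 backup sets
```

After four days of partial backups, the differential schedule restores from two backup sets, whereas the incremental schedule needs five.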
Backup and Restoration
Backups need to be stored somewhere, and backups are needed quickly when it's time to restore. Where the backup media is stored can have a real impact on how quickly data can be restored and brought back online. The media should be stored in more than one physical location so that the possibility of loss is reduced. These remote sites should be managed by a tape librarian. It is this individual's job to maintain the site, control access, rotate media, and protect this valuable asset. Unauthorized access to the media is a huge risk because it could impact the organization's capability to provide uninterrupted service. Transportation to and from the remote site is also an important concern. Important backup and restoration considerations include
- Maintenance of secure transportation to and from the site
- Use of bonded delivery vehicles
- Appropriate handling, loading, and unloading of backup media
- Use of drivers trained on proper procedures to pick up, handle, and deliver backup media
- Legal obligations for data, such as encrypting media and separating sensitive data sets such as credit card numbers and CVCs
- 24/7 access to the backup facility in case of an emergency
It is recommended that companies contract their offsite storage needs with a known firm that demonstrates control of their facility and is responsible for its maintenance. Physical and environmental controls at offsite storage locations should be equal to or better than the organization's own facility. A letter of agreement should specify who has access to the media and who is authorized to drop off or pick up media. There should also be agreement on response times that will be met in times of disaster. Onsite storage should maintain copies of recent backups to ensure the capability to recover critical files quickly.
Backup media should be securely maintained in an environmentally controlled facility with physical control appropriate for critical assets. The area should be fireproof, and anyone depositing or removing media should have a record of their access logged.
Software itself can be vulnerable, even when good backup policies are followed, because sometimes software vendors go out of business or no longer support needed applications. In these instances, escrow agreements can help.
Although most backup media is rather robust, no backup media can last forever; it will fail over time. This means that tape rotation is another important part of backup and restoration. Additionally, backup media needs to be periodically tested. Backups will be of little use if you find out during a disaster that they have malfunctioned and no longer work.
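One common way to test backups periodically is to record a checksum when each backup file is written and to recompute it later. The following is a minimal sketch of that idea, not a complete verification procedure; the function names and the manifest format (a dictionary mapping a file path to its recorded SHA-256 digest) are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backups(manifest):
    """Return the paths that are missing or no longer match their checksum.

    `manifest` maps each backup file path to the digest recorded when
    the backup was written (a hypothetical format for this sketch).
    """
    failures = []
    for path, expected in manifest.items():
        if not Path(path).is_file() or sha256_of(path) != expected:
            failures.append(path)
    return failures
```

A matching checksum shows only that the file is unchanged since it was written; a periodic trial restoration is still needed to confirm that the data can actually be recovered.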
Tape-rotation strategies can range from simple to complex.
- Simple—A simple tape-rotation scheme uses one tape for every day of the week and then repeats the pattern the following week. One tape can be for Monday, one for Tuesday, and so on. You add a set of new tapes each month and then archive the previous month's set. After a predetermined number of months, you put the oldest tapes back into use.
- Grandfather-father-son (GFS)—This scheme includes four tapes for weekly backups, one tape for monthly backups, and four tapes for daily backups (assuming you are using a five-day work week). It is called grandfather-father-son because the scheme establishes a kind of hierarchy. Grandfathers are the one monthly backup, fathers are the four weekly backups, and sons are the four daily backups.
- Tower of Hanoi—This tape-rotation scheme is named after a mathematical puzzle. It involves using five sets of tapes, each set labeled A through E. Set A is used every other day; set B is used on the first non-A backup day and is used every 4th day; set C is used on the first non-A or non-B backup day and is used every 8th day; set D is used on the first non-A, non-B, or non-C day and is used every 16th day; and set E alternates with set D.
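The Tower of Hanoi pattern described above can be generated mechanically. In this sketch, which assumes backup sessions are numbered from 1, the set to use is chosen by counting how many times the session number divides evenly by 2, capped at the last set; `hanoi_set` is a hypothetical helper name.

```python
def hanoi_set(session, sets="ABCDE"):
    """Return the tape set to use for a 1-based backup session number."""
    if session < 1:
        raise ValueError("session numbers start at 1")
    # Isolate the lowest set bit, then count the trailing zero bits.
    trailing_zeros = (session & -session).bit_length() - 1
    # Sessions that divide by a higher power of two than the scheme
    # has sets for fall back to the last set.
    return sets[min(trailing_zeros, len(sets) - 1)]


schedule = "".join(hanoi_set(n) for n in range(1, 17))
print(schedule)  # ABACABADABACABAE
```

In the printed schedule, set A appears every other session, B every 4th, C every 8th, and D and E every 16th, offset by eight sessions so that they alternate.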
Other Data Backup Methods
Other alternatives that exist for further enhancing a company's resiliency and redundancy are listed in the following list. Some organizations use these techniques by themselves; others combine these techniques with other backup methods.
- Database shadowing—Databases are a high-value asset for most organizations. File-based incremental backups can read only entire database tables and are considered too slow. A database shadowing system writes the data to two physical disks. It creates good redundancy by duplicating the database sets to mirrored servers, mirroring changes to the database as they occur. This makes it an excellent way to provide fault tolerance and redundancy.
- Electronic vaulting—Electronic vaulting makes a copy of database changes to a secure backup location. This is a batch-process operation copying all current records, transactions, and/or files to the offsite location. To implement vaulting, an organization typically loads a software agent onto the systems to be backed up, and then, periodically, the vaulting service accesses the software agent on these systems to copy changed data.
- Remote journaling—Remote journaling is similar to electronic vaulting, except that information is duplicated to the remote site as it is committed on the primary system. By performing live data transfers, this mechanism allows alternative sites to be fully synchronized and fault tolerant at all times. Depending on configuration, it is possible to configure remote journaling to record only the occurrence of transactions and not the actual content of the transactions. Remote journaling can provide a very high level of redundancy.
- Storage area network (SAN)—An alternative to tape backup, a SAN supports disk mirroring, backup and restore, archiving, and retrieval of archived data in addition to data migration from one storage device to another. A SAN can be implemented locally or use storage at a redundant facility.
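The practical difference between electronic vaulting (batch copies) and remote journaling (copies made at commit time) shows up in how much data is missing offsite when the primary system fails. The toy model below is only an illustration under simplifying assumptions; the class and method names are hypothetical and do not correspond to any product.

```python
class PrimarySystem:
    """Toy model of a primary system with two offsite copies."""

    def __init__(self):
        self.log = []            # transactions committed on the primary
        self.journal_copy = []   # remote journal: updated on every commit
        self.vault_copy = []     # electronic vault: updated only by batch runs

    def commit(self, txn):
        self.log.append(txn)
        # Remote journaling duplicates the transaction as it is committed.
        self.journal_copy.append(txn)

    def run_vault_batch(self):
        # Electronic vaulting periodically copies whatever has changed
        # since the previous batch run.
        self.vault_copy.extend(self.log[len(self.vault_copy):])

    def missing_offsite(self):
        """Count committed transactions absent from each offsite copy."""
        total = len(self.log)
        return {
            "vault": total - len(self.vault_copy),
            "journal": total - len(self.journal_copy),
        }


primary = PrimarySystem()
for txn in ("t1", "t2", "t3"):
    primary.commit(txn)
primary.run_vault_batch()
primary.commit("t4")
primary.commit("t5")
# If the primary failed now, the vault would lack the two transactions
# committed after the last batch run; the journal would lack none.
print(primary.missing_offsite())  # {'vault': 2, 'journal': 0}
```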
Choosing the Right Backup Method
It is not easy to choose the right backup method. To start the process, the team must consider how long an outage the organization can endure and how current the restored information must be. These two recovery requirements are technically called
- Recovery point objective (RPO)—Defines how current the data must be or how much data an organization can afford to lose. The greater the RPO, the more data loss the process can tolerate.
- Recovery time objective (RTO)—Specifies the maximum elapsed time required to recover an application at an alternative site. The greater the RTO, the longer the process can take to be restored and the more tolerant the organization is to interruption. Figure 7.4 illustrates how the RTO can be used to determine acceptable downtime.
Figure 7.4 RPO and RTO.
What you should realize about both RPO and RTO is that the lower the time requirements are, the higher the cost will be to provide the faster restoration capabilities. For example, most banks have a very small RPO because they cannot afford to lose any processed information.
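A candidate backup schedule can be checked against both objectives with simple arithmetic. This sketch assumes that the worst-case data loss equals the interval between backups (a failure just before the next backup runs) and that the restore time has been estimated separately; the function name and the figures are illustrative only.

```python
from datetime import timedelta


def meets_objectives(backup_interval, estimated_restore_time, rpo, rto):
    """Compare a backup schedule against the RPO and RTO.

    Worst case, a failure just before the next backup loses everything
    written since the previous one, so the interval bounds the data loss.
    """
    return {
        "rpo_met": backup_interval <= rpo,
        "rto_met": estimated_restore_time <= rto,
    }


# Nightly backups with a six-hour restore, measured against a
# four-hour RPO and an eight-hour RTO (illustrative numbers).
result = meets_objectives(
    backup_interval=timedelta(hours=24),
    estimated_restore_time=timedelta(hours=6),
    rpo=timedelta(hours=4),
    rto=timedelta(hours=8),
)
print(result)  # {'rpo_met': False, 'rto_met': True}
```

Here the restore is fast enough, but nightly backups could lose up to a day of data, so a method that copies changes more often would be needed to meet the RPO.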
Plan Design and Development
The BCP process is now ready for its next phase: plan design and development. In this phase, the team designs and develops a detailed plan for the recovery of critical business systems. The plan should be directed toward major catastrophes. Worst-case scenarios, in which the entire facility has been destroyed, are planned for because an organization that can handle these events can easily deal with less severe disasters that render the facility unusable only for a time. The plan should be a guide for implementation. The plan should include information on both long-term and short-term goals and objectives:
- Identify critical functions and priorities for restoration.
- Identify support systems needed by critical functions.
- Estimate potential outages and calculate the minimum resources needed to recover from the catastrophe.
- Select recovery strategies and determine what vital personnel, systems, and equipment will be needed to accomplish the recovery.
- Determine who will manage the restoration and testing process.
- Calculate what type of funding and fiscal management is needed to accomplish these goals.
The plan should also detail how the organization will contact and mobilize employees, provide for ongoing communication between employees, interface with external groups and the media, and provide employee services. Each of these items is discussed next.
The process for contacting employees in case of an emergency needs to be worked out before a disaster. The process chosen depends on the nature and frequency of the emergency. Call trees and outbound dialing systems are widely used. An outbound dialing system stores the numbers to be called in an emergency. These systems can provide various services such as
- Call rollover—If one number gets no response, the next is called.
- Leave a recorded message—If an answering machine answers, a message can be left for the individual.
- Request a call back—Even if a message is left, the system will continue to call back until the user calls in to the predefined phone number.
A call tree is a communication system in which the person in charge of the tree calls a lead person on every branch, who in turn calls all the leaves on that branch. If call trees are used, the team will want to verify that there is a feedback mechanism built in. As an example, the last person on any branch of the tree calls and confirms that he or she got the message. This can help ensure that everyone has been contacted. Call trees can be automated with VoIP, public switched telephone networks (PSTNs), and online services. Personnel mobilization can also be triggered by emails to PDAs, BlackBerrys, and so on. Such systems require the email server to be functioning.
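The call tree and its feedback mechanism can be modeled in a few lines. This is a simplified sketch, assuming one lead per branch and a known set of people who answer their phones; `run_call_tree` and the sample names are hypothetical.

```python
def run_call_tree(branches, reachable):
    """Simulate one pass through a call tree.

    `branches` maps each branch lead to the list of people that lead calls.
    `reachable` is the set of people who actually answer.
    Returns the set of people notified and, per branch, whether the
    last person on the branch called back to confirm.
    """
    notified = set()
    confirmed = {}
    for lead, members in branches.items():
        if lead not in reachable:
            # The lead never got the message, so nobody on the branch is called.
            confirmed[lead] = False
            continue
        notified.add(lead)
        notified.update(m for m in members if m in reachable)
        # Feedback mechanism: the last person on the branch confirms receipt.
        last_person = members[-1] if members else lead
        confirmed[lead] = last_person in reachable
    return notified, confirmed


branches = {
    "ops_lead": ["ana", "raj"],
    "it_lead": ["kim", "lee"],
}
reachable = {"ops_lead", "ana", "raj", "it_lead", "kim"}
notified, confirmed = run_call_tree(branches, reachable)
print(sorted(notified))  # ['ana', 'it_lead', 'kim', 'ops_lead', 'raj']
print(confirmed)         # {'ops_lead': True, 'it_lead': False}
```

A branch with no confirmation, such as the second one here, tells the person in charge which branch to follow up on by other means.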
Interface with External Groups
Deciding how to interface with external groups is another important aspect of business continuity. Damaging rumors can easily start and it is important to have protocols in place for dealing with these incidents, accidents, and catastrophes. The organization must decide how to deal with response teams, the fire department, the police department, ambulance, and other emergency response personnel.
Someone should be identified to deal with the media. Negative public opinion can be costly. It is important to have a properly trained spokesperson to speak and represent the organization. The media spokesperson must be in the communication path to have the facts before speaking or meeting with the press. The appointed spokesperson should interface with senior management and legal counsel prior to making any public statement. Meeting with the media during a crisis is not something that should be done without preparation.
The corporate plan should include generic communiqués that address each possible incident. The spokesperson will also need to know how to handle tough questions. Liability should never be assumed; the spokesperson should simply state that an investigation has begun. Tackling these tough issues up front will allow the company to have a preapproved framework to work with should a real disaster occur.
Companies have an inherent responsibility to employees and to their families. This means that paychecks must continue and that employees need to be taken care of. Employees must be trained on what to do in case of emergencies and on what they can expect from the company. Insurance and other necessary services must continue.
During a disaster, employees must know what is expected of them and who is in charge. Someone must have the authority to allocate emergency funding as needed. As an example, after Hurricane Katrina, the limit on FEMA-issued credit cards was raised to $250,000 (48 C.F.R. § 13.201(b) (2005)). The idea was to allow government employees to acquire needed items quickly and without delay. Although funding is important, controls must also be in place to ensure that funds are not misappropriated.
Insurance is one option that companies can consider to remove a portion of the risk the team has uncovered during the BIA. Just as protection insurance can be purchased by individuals for a host of reasons, companies can purchase protection insurance for each of the following items:
- Data centers
- Hacker insurance
- Software recovery
- Business interruption
- Documents, records, and important papers
- Errors and omissions
- Media transportation
Insurance is not without its drawbacks, such as high premiums, delayed claim payout, denied claims, and problems proving real financial loss. Also, most insurance policies pay for only a percentage of any actual loss and do not pay for lost income, increased operating expenses, or consequential loss.
The BCP team is now nearing the end of the plan's development process, and is ready to submit a completed plan for implementation. The plan is the result of all information gathered during the project initiation, the BIA, and the recovery strategies phase. A final checklist for completeness ensures the plan addresses all relevant factors, such as
- Calculates what type of funding and fiscal management is needed to accomplish the stated goals
- Determines the procedures for declaring a disaster and under what circumstances this will occur
- Evaluates potential disasters and calculates the minimum resources needed to recover from the catastrophe
- Determines critical functions and priorities for restoration
- Identifies what recovery strategy and equipment will be needed to accomplish the recovery
- Identifies individuals that are responsible for each function in the plan
- Determines who will manage the restoration and testing process
The completed plan should be presented to senior management for approval. References for the plan should be cited in all related documents so that the plan is maintained and updated whenever there is a change or update to the infrastructure. When management approves the plan, it must be released and disseminated to employees. Awareness training will help make sure that everyone understands what their tasks and responsibilities are when an emergency occurs.
Awareness and Training
The goal of awareness and training is to make sure all employees know what to do in case of an emergency. If employees are untrained, they might simply stop what they're doing and run for the door anytime there's an emergency. Even worse, they might not leave when an alarm has sounded, even though the plan required they leave because of possible danger. Instructions should be written in plain language that uses common terminology everyone will understand. The organization should design and develop training programs to make sure each employee knows what to do and how to do it. Employees assigned to specific tasks should be trained to carry out needed procedures. If possible, plan for cross-training of teams so that team members are familiar with a variety of recovery roles and responsibilities.
The final phase of the process is to test and maintain the BCP. Training and awareness programs are also developed during this phase. The test of the disaster-recovery plan is critical. Without performing a test, there is no way to know whether the plan will work. Testing transforms theoretical plans into reality. Testing should be repeated at least once a year. Tests should start with the easiest parts of the plan and then build to more complex items. The initial tests should focus on items that support core processing, and they should be scheduled during a time that causes minimal disruption to normal business operations. As a CISSP candidate, you should be aware of the five different types of BCP tests:
- Checklist—Although this is not considered a replacement for a live test, a checklist is a good first test. A checklist test is performed by sending copies of the plan to different department managers and business unit managers for review. Each recipient reviews the plan to make sure nothing was overlooked.
- Structured walkthrough—This test is performed by having the members of the emergency management team and business unit managers meet in a conference to discuss the plan. The plan then is "walked through" line by line. This gives all attendees a chance to see how an actual emergency would be handled and to discover discrepancies. By reviewing the plan in this way, errors and omissions might become apparent.
- Simulation—This is an actual simulation of a real disaster. This drill involves members of the response team acting in the same way they would if there had been an actual emergency. This test proceeds to the point of recovery or to relocation to the alternative site. The primary purpose of this test is to verify that members of the response team can perform the required duties with only the tools they would have available in a real disaster.
- Parallel—A parallel test is similar to a structured walkthrough but actually invokes operations at the alternative site. Operations at the new and old sites are run in parallel.
- Full interruption—This test is the most detailed, time-consuming, and thorough. A full interruption test mimics a real disaster, and all steps are performed to complete backup operations. It includes all the individuals who would be involved in a real emergency, both internal and external to the organization. Although a full interruption test is the most thorough, it is also the scariest because it can create its own disaster.
The final step of the BCP process is to combine all this information into the BCP and cross-reference it with the organization's other emergency plans. Although the organization will want to keep a copy of the plan onsite, there should be another copy offsite. If a disaster occurs, rapid access to the plan will be critical.
Monitoring and Maintenance
When the testing process is complete, a few additional items still need to be considered. This is important because some might falsely believe that the plan is completed once tested. That's not true. All the hard work that has gone into developing the plan can be lost if controls are not put into place to maintain the current level of business continuity and disaster recovery. Life is not static, and neither should the organization's BCP be. The BCP should be a living document, subject to constant change.
To ensure the plan is maintained, first build in responsibility for the plan. This can be done by
- Job descriptions—Individuals responsible for the plan should have this responsibility detailed in their job description. Management should work with HR to have this information added to the appropriate documents.
- Performance reviews—The accomplishment (or lack of accomplishment) of appropriate plan maintenance tasks should be discussed in the responsible individual's annual or biannual evaluations.
- Audits—The audit team should review the plan and make sure that it is current and appropriate. The audit team will also want to inspect the offsite storage facility and review its security, policies, and configuration.
Also, disaster recovery implications for monitoring, maintaining, and recovery should be made a part of any discussions for procuring new equipment, modifying current equipment, or making changes to the infrastructure. The best method to accomplish this is to add BCP review into all change management procedures. If changes are required to the approved plans, they must also be documented and structured using change management. A centralized command and control structure eases this burden. Table 7.2 lists the individuals responsible for specific parts of the BCP process.
Table 7.2. BCP Process Responsibilities
Person or Department | Responsibility
Senior management | Project initiation, ultimate responsibility, overall approval, and support
Middle management or business unit managers | Identification and prioritization of critical systems
BCP committee and team members | Planning, day-to-day management, implementation, and testing of the plan
Functional business units | Plan implementation, incorporation, and testing