In the first part of this series we looked at the definition of business continuity, then added words such as “planning” and “management” to the vocabulary until we had a robust perspective on what BCP means: the intelligence work we put into motion, along with the tools, techniques and processes we deploy, to ensure that services continue operating even in the worst of scenarios. We then looked at SharePoint as a tool that can be used in a BCP, and stated that our objective was to treat SharePoint not as a BCP tool but as the object of the BCP itself. Namely, what do we do if SharePoint goes down and is unable to render services? We came up with salient points such as:
- Clearly identify and designate contacts
- Clearly identify and document procedures
- Make onsite and offsite storage of required resources, storage, records, etc., accessible
- Provide onsite and offsite recovery mechanisms
- Provide ongoing staff training, drills and practice
As well as:
- Identify and define your SharePoint business requirements
- Identify and choose what to protect and recover
- Identify and choose the tools to be used
- Identify and define the strategies to be used
- Test-run your entire plan at least once a year
And, in my final notes, I promised we would look at disaster recovery in detail, as well as high availability, tools and techniques. We will also analyze models that you can use to define your own exhaustive BCP strategy. If we are ready, let’s kick off with business requirements, meaning the level of interruption to your SharePoint service offering that you consider acceptable.
IDENTIFY AND DEFINE YOUR SHAREPOINT BUSINESS REQUIREMENTS
As an organization, what do you use your SharePoint farm for? If what you have is a single-box setup with an average user base of 20 people, you are likely using it for just about anything from custom development to storage and collaboration. Perhaps you are a consulting firm, or maybe you are a huge multinational with a large 24-server SharePoint farm and a user base running into tens of thousands. The question still holds: what is important to you about your SharePoint environment? You have probably converged your disparate systems into SharePoint, including your intranet, line-of-business applications, content management systems, reporting systems, email system, office applications, etc. This could mean that SharePoint has become your single point of failure. If it goes down for up to four hours (God forbid!), you could suffer anything from lost goodwill to decreased productivity, contractual damage, information loss, and halted decisions due to lack of access to business information. All of these can have legal and financial implications, depending on the organization’s size and industry. To mitigate them, we categorize our business requirements under the following objectives:
RECOVERY TIME OBJECTIVE (RTO)
This is how quickly we can get back in business: the minimum and maximum amount of time it takes to get back online, with new database transactions continuing in the background systems. The emphasis is on “recovery” and “time,” with the time to recover being of greater importance, because a full recovery that takes 10 days to complete is of little use. It varies from organization to organization, but the best case should range from a few seconds or minutes to a few hours, and when I say a few hours I don’t mean a few hours lasting into the next day. To this end, the business requirements would include:
- High availability
- Degraded availability
- Downtime avoidance and/or reduction
- Opportunity cost and return on investment
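To make the RTO conversation concrete, here is a minimal Python sketch that converts an availability target into a yearly downtime budget and checks a measured outage against an agreed RTO. The function names, the 24x7 service window and the sample figures are illustrative assumptions, not values taken from any SharePoint guidance.

```python
from datetime import timedelta

HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, service expected 24x7


def allowed_downtime(availability_pct: float) -> timedelta:
    """Return the yearly downtime budget implied by an availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100")
    down_fraction = 1 - availability_pct / 100
    return timedelta(hours=HOURS_PER_YEAR * down_fraction)


def meets_rto(outage: timedelta, rto: timedelta) -> bool:
    """True when service was restored within the agreed RTO."""
    return outage <= rto


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        print(f"{target}% availability -> {allowed_downtime(target)} per year")
    # Hypothetical case: a four-hour outage measured against a two-hour RTO.
    print(meets_rto(timedelta(hours=4), timedelta(hours=2)))
```

A 99% target, for example, leaves roughly 87.6 hours of downtime per year, which is a useful number to put in front of the business when discussing what high or degraded availability is worth paying for.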
RECOVERY POINT OBJECTIVE (RPO)
The emphasis is on “recovery point,” the “point” being the most recent data we can recover once the system is back online, measured against the last transaction committed just before the failure occurred. The gap between the two is our acceptable range of data loss, and that loss can result from anything from the type of failure to the workload on the system. Remember, the users don’t really care about your workload story. To this end, the business requirements would include:
- Scheduled maintenance such as backup frequency
- Planned and unplanned outages
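The link between backup frequency and RPO can be sketched in a few lines of Python. This is a simplified model under the assumption that data is only recoverable up to the last completed backup; the function names and sample schedules are hypothetical.

```python
from datetime import datetime, timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """A failure just before the next backup loses one full interval of work."""
    return backup_interval


def actual_data_loss(last_backup: datetime, failure: datetime) -> timedelta:
    """Data written between the last good backup and the failure is lost."""
    if failure < last_backup:
        raise ValueError("failure cannot precede the last backup")
    return failure - last_backup


def schedule_meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """The schedule is acceptable only if its worst case stays within the RPO."""
    return worst_case_data_loss(backup_interval) <= rpo


if __name__ == "__main__":
    rpo = timedelta(hours=1)  # hypothetical business requirement
    print(schedule_meets_rpo(timedelta(hours=24), rpo))    # nightly backups
    print(schedule_meets_rpo(timedelta(minutes=15), rpo))  # 15-minute backups
```

Under this model a nightly backup cannot satisfy a one-hour RPO, while a 15-minute schedule can, which is why backup frequency belongs in the business requirements rather than being left as an operational detail.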
Now this is SharePoint we are talking about, not just any application, so I have to throw in another category of objective closely related to RPO, and it is:
RECOVERY LEVEL OBJECTIVE (RLO)
This emphasizes the level of recovery. SharePoint is one of the most granular application service tools around. The RLO answers the question: to what level do you want to recover data? Do you want to recover only the farm and leave the rest, or do you want the farm and its configurations? Recovery level can go as far as taking back the web applications, site collections, sites, lists, libraries, or even specific items. The business requirements at this level will include:
- Granular backup/granular recovery down to the individual item
- Multiple backup strategies
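One way to reason about RLO is to treat the SharePoint containers named above as an ordered hierarchy and ask whether a given backup strategy reaches deep enough. The Python sketch below is an illustrative model only: the enum, its numbering and the `satisfies_rlo` helper are assumptions made for this example, not part of any SharePoint API.

```python
from enum import IntEnum


class RecoveryLevel(IntEnum):
    """SharePoint containers ordered from coarsest (1) to finest (6)."""
    FARM = 1
    WEB_APPLICATION = 2
    SITE_COLLECTION = 3
    SITE = 4
    LIST_OR_LIBRARY = 5
    ITEM = 6


def satisfies_rlo(finest_supported: RecoveryLevel, rlo: RecoveryLevel) -> bool:
    """A strategy meets the RLO if it can restore at least as finely as required."""
    return finest_supported >= rlo


if __name__ == "__main__":
    # Hypothetical strategy that restores whole site collections at best,
    # checked against a business requirement of item-level recovery.
    print(satisfies_rlo(RecoveryLevel.SITE_COLLECTION, RecoveryLevel.ITEM))
```

If the check fails, that is the signal to add a second, more granular backup strategy alongside the first, which is where the "multiple backup strategies" requirement comes from.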
Finally, it is good to throw in Service Level Agreements (SLAs) here. Actually, I should have categorized RTO, RPO and RLO under SLAs, as this is usually the case with SharePoint deployments: organizations like to contract this aspect of the technology to third-party consulting firms for reasons best known to them. In the place of SLAs, let’s use a governance plan. A typical governance plan contains a record of your decisions about the very things SLAs cover, such as service delivery requirements for business and IT, among a host of other things. SLAs are too business-like and come with legal implications, whereas governance plans tend to be more detailed, and we can draw procedures and documentation from them, which is more beneficial for our BCP.
IDENTIFY AND CHOOSE WHAT TO PROTECT AND RECOVER
I am not going to dwell on this here because the greater part of it has been covered above, so I will recap just in case you didn’t notice it. In any SharePoint infrastructure BCP, the things to protect are:
The physical hardware. In the days of WSS 2.0 and 3.0 this didn’t cut much from the budget, but with Server 2008 and 2012, Visual Studio 2010, SharePoint 2010 and 2013, Exchange Server 2010, SQL Server 2008 and 2012 and all the x64-level hardware and configurations, a good BCP for SharePoint takes a huge chunk out of any budget. If you are big in terms of farm size, and you have to consider the option of a cold, warm, or hot standby at another site, pray you are in the oil and gas industry so that you don’t feel that cost. We have to protect the hardware, and we have to think in terms of two things:
- If we are talking continuity of operation and minimal disruption and downtime, we need to invest in fault tolerance, redundant infrastructure, workload distribution that spans likely points of failure, and preventive maintenance. I feel I am talking gibberish, so this is what I mean: you have four web front end servers (WFEs), two application servers, two search servers, and two clustered SQL servers. Diagram below:
Typically the WFEs share a load-balanced URL. This ensures the requests coming from the users are distributed among the WFEs based on load. This means that if WFE 1 thinks it can’t handle the request, it will forward it to the next WFE. Theoretically this is true, but in practice it rarely happens. This leaves WFE 2 only moderately busy, and WFE 3 and WFE 4 idle. You only need to open Task Manager to see it. The bulk of the work is handled by the first WFE. That server will not push the rest of the workload on until it is saturated, and we don’t have to wait for that. The solution is to implement network load balancing (NLB) schemes at:
- The hardware level by purchasing a hardware NLB device
- The software level by using the operating system’s inbuilt NLB feature and obtaining virtual IP addresses from your IT infrastructure team
- The DNS level by configuring DNS round-robin
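The difference between the "first WFE takes everything until it is saturated" pattern and an even rotation can be shown with a toy simulation. The Python sketch below is only a model of the behavior described above, not a measurement of how SharePoint or any particular load balancer distributes traffic; the server names, request count and capacity are made-up values.

```python
from collections import Counter
from itertools import cycle


def round_robin(servers, request_count):
    """Assign requests to servers in strict rotation and count each share."""
    if not servers:
        raise ValueError("at least one server is required")
    rotation = cycle(servers)
    return Counter(next(rotation) for _ in range(request_count))


def fill_first(servers, request_count, capacity):
    """Send requests to the first server with spare capacity, spilling over in order."""
    load = {name: 0 for name in servers}
    for _ in range(request_count):
        for name in servers:
            if load[name] < capacity:
                load[name] += 1
                break
        else:
            raise RuntimeError("all servers are saturated")
    return load


if __name__ == "__main__":
    wfes = ["WFE1", "WFE2", "WFE3", "WFE4"]
    print(fill_first(wfes, 100, capacity=80))
    print(dict(round_robin(wfes, 100)))
```

With 100 requests and a per-server capacity of 80, the fill-first model puts 80 requests on WFE1, 20 on WFE2 and none on the other two, while the round-robin model gives each WFE 25. Real NLB devices use richer policies than strict rotation, but the contrast illustrates why an explicit balancing scheme is worth setting up.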
This is not my destination right now. Remember what brought us to this point: I was trying to identify and choose what to protect in this BCP, and I have said we need to protect the physical hardware. Under this I explained how to minimize disruption of service and downtime by ensuring service continuation through fault tolerance, redundant infrastructure, and workload distribution that spans likely points of failure. So the diagram above was created, but it doesn’t end there. That diagram actually needs to become this:
In the above, we have solved our hardware component. This solution can be another set of physical or redundant failovers, or it could be virtual. I am tempted to mention at this point that going the way of virtual failover may not give you the same high level of availability in terms of throttling, but it will give you the degraded availability listed under RTO earlier. Note that degraded availability does not come from virtualization alone. It is a factor of other components such as network, disk and other infrastructure factors.
- If we are talking absolute outage and disaster we need to invest in secondary/standby infrastructure. This means we need an onsite or offsite standby, in cold, warm or hot state. I don’t think I need to put a diagram to this. You just have to picture that whatever you have now is available elsewhere and ready to take over production either by being manually kick-started or automatically kick-started by IP routing rules and monitoring from a network supervision centre.
- Farm configuration needs to be protected. This can only be done by a backup strategy. There are two ways to go about this farm configuration protection. You can use the internal SharePoint backup mechanism. It is available in Central Administration, and it works by starting a Microsoft SQL Server backup of the content databases and service applications. It takes the configuration content to files, goes to the search index files and synchronizes them with the search database backups. It keeps them all as files in the designated backup location of the farm for later pickup by some alternative means. Those means include manual pickup, or pickup by a third-party tool for onsite storage and later offsite archiving. Note also that the IT infrastructure team is backing up your drives as part of their own system state backups, so you can fall back on that as well. Also, if your database team has its own backup strategy in place, your individual databases, including your SharePoint Config_DB, have probably been backed up as well. Regardless of all these backups, you should still verify this is true in your case, because you don’t want to make a wrong assumption.
- Web applications need to be protected. These web applications are linked to content databases where all your data resides. Additionally, each web application holds the application pool name, application pool account name, authentication settings, general web application settings such as alerts and managed paths, IIS binding information (protocol type, host header, port number), and the web.config file with any changes made to it via the object model or Central Administration.
- SQL databases need to be protected. In each of those databases are the actual data, the solution definitions, the triggers, constraints, permissions, relationships, and supporting data, as well as the transaction logs, which equally need backing up for scenarios where data might need to be rebuilt. This can be done through the inbuilt SQL backup methods, with the backups shipped off for storage and archiving.
- The file system needs to be protected. In the file system are the original setup files, event logs, health analytic data, solution setups, web part definitions, CSS definitions, custom pages, and forms.
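Since the advice above is to verify rather than assume that each of these components is covered, a small checklist routine can help. The Python sketch below assumes you record, by whatever means you have, when the protection for each component was last verified; the component labels, the `stale_components` helper and the dates are hypothetical and stand in for your own inventory.

```python
from datetime import datetime, timedelta

# Components named in the list above; the labels are illustrative only.
PROTECTED_COMPONENTS = (
    "hardware_standby",
    "farm_configuration",
    "web_applications",
    "sql_databases",
    "file_system",
)


def stale_components(last_verified, now, max_age):
    """Return components never verified, or verified longer ago than max_age."""
    stale = []
    for name in PROTECTED_COMPONENTS:
        checked = last_verified.get(name)
        if checked is None or now - checked > max_age:
            stale.append(name)
    return stale


if __name__ == "__main__":
    # Hypothetical verification log: only two components have been checked.
    log = {
        "farm_configuration": datetime(2024, 1, 9),
        "sql_databases": datetime(2024, 1, 1),
    }
    print(stale_components(log, datetime(2024, 1, 10), timedelta(days=2)))
```

Anything the routine returns is a component whose protection you are currently assuming rather than knowing, which is exactly the wrong assumption the list warns about.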
In the next part of this series, we will dive into PowerShell scripts, SQL statements and Central Administration methods for getting some of our BCP plans into motion, especially when it comes to preparing for the actual disaster everyone seems to be longing and waiting for.