Here we are, and staring us in the face is a big phrase with a big meaning. So what really is business continuity planning? In this article, I am going to refer to business continuity planning as just “BCP.” This way I gain more space for more words. I am going to start off by saying that there is another term out there that is closely related to business continuity planning (BCP), and that term is disaster recovery (DR). For most people, there seems to be no difference. But there is a difference.
Before I go shooting off by saying, “First off, let’s look around for a very good definition of BCP,” which I will still say anyway, I want to say that the more I wrote about this topic, the more information I was discarding, throwing away, and deleting as I typed. I realized that as much as I would like to make this a single topic treatise, I really wouldn’t be doing justice to it. So, after a bit of painful and harrowing thought and a little grudge, I am going to have to make this at least a two-part article, with this being the first part.
So, first off, let’s look around for a very good definition of BCP.
Business Continuity is the activity performed by an organization to ensure that critical business functions will be available to customers, suppliers, regulators, and other entities that must have access to those functions. This is what Microsoft says.
I am quite in agreement here; that is what continuity of business is all about: the ability to keep service offerings and functions continually available. Let’s throw in another word: “planning.” So now, how do we plan for business continuity? To plan means to put a strategy in place for doing something.
Business continuity planning therefore focuses on preparing for the recovery of critical business functions and processes after two kinds of loss: first, the loss of the normal office environment, which is the business work area; and second, the loss of the technology, tools, and services typically used in daily business operations.
This is where BCP gets confused with disaster recovery, because we think that, if we cannot continue business operations, it spells disaster. Well, that is true, it does spell disaster. The set of activities we carry out during disaster recovery is a form of BCP, and more precisely it is a subset of BCP. Business continuity planning in itself is big; it is not limited to getting out of crises, but it does involve getting out of crises at some point in its implementation. Planning is the actual strategy, intelligence work, and foresight needed to prepare for crises, to avoid them, or both, depending on the peculiarity of your circumstances or the volatility of your locale.
There is still another side to this story, and that is to introduce the word “management” into the vocabulary and make it business continuity management. This consists of the business decisions, processes, and tools you put in place in advance to handle crises. A crisis might affect your business only, or it could be part of a local, regional, or national event. Microsoft said that, too, and I am in agreement with it.
What do all these add up to tell us about SharePoint? Two things are evident. The first is that we can use SharePoint as part of a business continuity plan and/or business continuity management. That is good and the way to go, many organizations around the world do it, and I have been right in the middle of that a couple of times. But that is not what I want to look at here. The second is the scenario I am looking at, in which SharePoint itself, as an asset, a service, and a function, is made the object of a BCP. In other words, how do we keep SharePoint service offerings always available?
Let’s take a look at the mentality of an effective SharePoint BCP methodology. When a SharePoint service or asset goes offline and becomes unavailable, the very first questions we need to ask are WHAT happened, WHEN did it happen, HOW fast can it be fixed, and WHO can fix it? All these questions are expected to be answered by the following elements, in no particular order:
- Clearly identified and designated contacts
- Clearly identified and documented procedures
- Accessible onsite and offsite storage of required resources, records, etc.
- Onsite and offsite recovery mechanisms
- Ongoing staff training, drills and practice
On an average day, I anticipate some form of business continuity call or support request. So the methodology listed above is not limited to major disasters, whether man-made or natural. No, the issue could come in any form, such as:
- A database goes offline and the web service is unable to access data for display in any site collection or page, or an entire tool is inaccessible. This can result from a fault in the disk infrastructure of a SAN (storage area network), where the server drives are not hosted locally on the physical servers but logically assigned to them; such a fault has absolutely nothing to do with SharePoint. At other times, a database file could become corrupt or unreadable.
- A user fiddling with a site collection: not a rogue user but a designated one, who deletes a page, library, store, or even a simple Notepad or XML file in the site hierarchy that contains referential data required on other pages.
- Generally poor, degraded network bandwidth, resulting in pages taking forever to load.
- An impromptu, unscheduled server reboot, which the infrastructure team deems necessary and for some reason decides on without your consent. They know, of course, that you wouldn’t immediately agree to it, because you don’t know what user in “what-where” is modifying critical data. You don’t want to give the green light only to discover that your MD/CEO was that user, and that one call from him to your top echelon would create an avalanche rolling all the way down from your EGM, GM, DGM, manager, and head of section until it hits you down there, and you become Snowden before he was self-emancipatingly popular. You create bad attention. The IT infrastructure team sometimes thinks that, because a server’s drive is offline, it spells doom and gloom, but the truth is that I really won’t care much if those drives are missing, as long as the servers affected are web front ends (WFEs). I will not allow a reboot even if it will last only 60 seconds, except at lunch hour or close of work. And, oh yes, even then I am careful, because I know the DMD works as late as 10 pm. So, you aren’t rebooting on my watch!
- At other times, something goes wrong with a supposedly tested Windows Server update. It has messed up your custom-built site and broken a few workflows, which are now piling up unexecuted with their notifications undelivered. Throngs of users are calling to tell you nothing is working, clients are getting irritated, and the queues are getting longer because your line of business (LOB) apps are unable to consume the data required to produce other outputs. The worst part of this is that you will not know for a long time that an update is causing these problems. In other cases, the real culprit is your very own SharePoint cumulative update. Microsoft warned you not to modify core SharePoint files, because the next update will reset them to their original state; somehow you thought you would remember that before applying the next update, and then you forgot.
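Whether any of these interruptions is tolerable depends on the service level you have agreed to. As a rough sketch, here is the arithmetic that turns an availability target into a yearly downtime budget. The targets shown are hypothetical examples and not figures from any real agreement, and the function names are my own.

```python
# Hypothetical availability targets; substitute the figures from your own SLA.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def allowed_downtime_seconds(availability_percent: float) -> float:
    """Return the yearly downtime budget, in seconds, for an availability target."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


def fits_budget(outage_seconds: float, used_seconds: float,
                availability_percent: float) -> bool:
    """True if a new outage still fits in what is left of the yearly budget."""
    remaining = allowed_downtime_seconds(availability_percent) - used_seconds
    return outage_seconds <= remaining


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_seconds(target) / 60
        print(f"{target}% availability allows about {minutes:.0f} minutes a year")
```

Under a 99.99% target the whole year’s budget is under an hour, which is one way to see why even a 60-second reboot deserves a conversation before anyone presses the button.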
The scenarios can be numerous. This is why I say that a good SharePoint BCP plan starts with everyday work; a mindset to stay true to the agreed-upon and approved service level agreements of what is an acceptable or unacceptable level of interruption. So, to ensure daily normal operation continuity, there is always a need to:
- Identify and define your SharePoint business requirements.
- Identify and choose what to protect and recover.
- Identify and choose the tools to be used.
- Identify and define the strategies to be used.
- Test-run your entire plan at least once a year.
Everything revolves around these. But in the sub-routine part of it, you would have to:
- Do a regular daily backup of not just the farm, but also the site collections. This will allow you to easily restore single site collections back to their users. This sounds quite simple, but it is simple ONLY if you have a well-defined strategy and principles for site creation and solution creation. For example, suppose you are fond of combining different solutions from different departments into one single site collection. It won’t be that simple if one of the departments needs a restore of an exceedingly mission-critical file that has been deleted not just from the site, but also from the user’s site recycle bin and from the site collection recycle bin, so that the only way forward is a restore. How would you implement your restore without disrupting the data of other departments sitting in the same collection? Your options would be to use a third-party granular backup solution, if you have one, or to restore the previous day’s backup to a test environment and exhume the single required file. As you can see, that is a very stressful option, and it is avoidable if you have a per-department site collection strategy. In the same spirit, you should keep sites that operate as storage repositories in a separate web application from those that operate as workflow and business-process-driven tools.
- Do a separate scheduled SQL Server backup that runs every day, based on a backup strategy of your choice, namely: full backup, incremental, differential, or daily. Don’t forget to plan for log files in that backup set, and that means both SharePoint log files and SQL transaction log files. Both have an extreme tendency to grow large, which always irritates any DBA who isn’t SharePoint savvy and, yes, it does irritate the storage and backup guys, too. Expect their calls!
- File system backup is unavoidable. You may not be responsible for it, but the infrastructure people in your organization are most likely taking these backups already, perhaps without your even being aware of it. Simply ask them; if they aren’t doing it, act fast and tell them to start. Does this duplicate data and storage space in the offsite backup location? Absolutely yes, but it’s all just tapes marked with a size capacity, and they really don’t keep all your data forever; hopefully they are recycled based on some storage policy of your organization. It has its place.
- This next point may sound as if it is off-limits and overkill, but it has its own place too: a SharePoint farm mirror. What is that, you ask? For example, here you have a production environment of let’s say 10 servers, aside from your development environment of say five servers; a mirror would be a server farm that is an exact replica of the production SharePoint farm. It will be in a warm state, and it will be in a different geographical location, safer, and ready to take over in the event of a major disaster like an earthquake, fire, or, worse, man-made anti-human activities. Not every organization can pull this off; that depends on your infrastructure and the number of people. It can be as simple as a virtualized farm, or as terrible as a physical farm. I do not envy anyone who will combine this job as part of his/her job description, aside from what they already do in SharePoint. But it is indeed a possibility, and very feasible. In fact it is better as part of an integrated BCP plan, especially a disaster recovery plan that tries to solve the loss of primary work space area or data center and continue in a secondary work space area or data center.
- Keep the SharePoint guy from running under a speeding truck. Keep him from DUI (driving under the influence of alcohol). Keep the SharePoint guy alive. Get him an extra heart, an extra brain, an extra set of limbs, and all the stuff. Do I mean this literally? Absolutely not! It would be crazy to do that, unless you work for the CIA, or NSA, or something like that. What I am saying here is that it will be really bad for everyone if, while making all these BCP plans, we don’t plan for what happens if the SharePoint person dies, because he goes away with all he knows. Yes, he leaves documentation behind, but someone needs to implement it. So, don’t be like the company with 10,000 users that loves high availability and is converging all of itself into SharePoint and has only one SharePoint person! Nothing could be dumber! He can’t go on leave, he can’t take some days off, he can’t travel, he can’t close early like other people, and the stress is so much he has a protruding mid-section. That isn’t good for our SharePoint BCP plan, because staring you right in the eyes is the classic single point of failure—your SharePoint personnel. You have to take good care of him. How? Simple. Get him more hands to work with, more brains, and more heart, and that means you let him have a power user in his team, you let him have a developer in his team if he isn’t one, you let him have an architect in his team if he isn’t one, you let him have a business intelligence expert in his team if he isn’t one. It is too risky to leave your SharePoint BCP plan to fate; many times it doesn’t go well. So hope for the best, but plan for the worst.
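Going back to the first item in this list, the daily per-site-collection backup can be sketched as a small script that plans one dated backup file per site collection and prints the matching `Backup-SPSite` line for each. This is only a sketch under assumptions: the site URLs, the backup folder, and the helper names are hypothetical, and the Python merely generates the commands, which you would still review and run in the SharePoint Management Shell.

```python
from datetime import date
from pathlib import PureWindowsPath
from urllib.parse import urlparse

# Hypothetical values: replace with your own site collections and backup folder.
SITE_COLLECTIONS = [
    "http://intranet/sites/finance",
    "http://intranet/sites/hr",
]
BACKUP_ROOT = PureWindowsPath(r"D:\Backups\SharePoint")


def backup_file_for(site_url: str, day: date) -> PureWindowsPath:
    """One dated .bak file per site collection, named after the URL's last segment."""
    name = urlparse(site_url).path.strip("/").split("/")[-1] or "root"
    return BACKUP_ROOT / f"{name}-{day:%Y%m%d}.bak"


def backup_command(site_url: str, day: date) -> str:
    """Build the Backup-SPSite line for one site collection."""
    return f'Backup-SPSite -Identity "{site_url}" -Path "{backup_file_for(site_url, day)}"'


if __name__ == "__main__":
    # Print today's commands so they can be reviewed before anything is run.
    for url in SITE_COLLECTIONS:
        print(backup_command(url, date.today()))
```

With one site collection per department, each department gets its own dated file, so a restore for finance never has to touch the data belonging to HR. Note that naming files after the last URL segment assumes those segments are unique across your site collections.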
In the second part, we will talk about disaster recovery in detail, along with high availability, tools, and techniques; we will analyze models that you will be able to use to define your own exhaustive BCP strategy.