Hello everyone and welcome to this new series on network reliability. In this series, we will explore various design considerations for reliability in a Cisco network and how we can build a network that is highly available.

In this first article of the series, we will set the tone by discussing Network Reliability. We will explore the relationship between redundancy and reliability, and then we will explore the concept of single points of failure. Finally, we will examine the different kinds of redundancy and how we can enhance the reliability of a network system by enhancing the various individual elements in the system.

The concept of redundancy is closely related to reliability, so let’s first explore network reliability. The reliability of a network is the degree to which the network is available to perform its defined uses. For a network to be reliable, all the components of the network must be individually reliable. Simply put, the reliability of a network depends on the reliability of all its individual elements. Let us consider the network below:

diagram1

The basic components of the network here include computers, servers, switches, routers, the ISP and the cables connecting them. All these components, both active and passive (cables), jointly contribute to the reliability of the network.

From the perspective of the user, the reliability of the network is the combined reliability of their PC, the server, the switch, the router, the ISP and all the cables in between. Now let’s do some mathematics here.

To keep things simple, we will assume that the cables have 100 percent reliability (they can never fail). Let us assume that the reliability of each active component is as follows:

Server: 95%
Switch: 99%
ISP: 95%
Router: 99%

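Because the user needs every one of these components to work at the same time, the total reliability of the network is the product of the individual reliabilities (the probabilistic equivalent of a logical AND, assuming the components fail independently). So we would have:

Total network system reliability = 95% x 99% x 95% x 99% ≈ 88%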

This shows that when every functional unit is required, the reliability of the system is LESS than the reliability of any single FUNCTIONAL UNIT that supports the system.
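
The arithmetic above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `series_reliability` is made up for illustration, and multiplying the figures assumes that the components fail independently of each other.

```python
from math import prod

# Reliability figures from the example above (cables assumed to be perfect).
component_reliability = {
    "server": 0.95,
    "switch": 0.99,
    "isp": 0.95,
    "router": 0.99,
}

def series_reliability(reliabilities):
    """Reliability of a chain in which every component must be working."""
    return prod(reliabilities)

total = series_reliability(component_reliability.values())
print(f"Total system reliability: {total:.2%}")  # prints 88.45%
```

Adding another required component to the dictionary can only lower the result, which is the point made above.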

So how do we enhance the reliability of the system? Well, we could get more of everything: more PCs, more servers, more cables, more routers, more switches. But even if we had the budget to actually buy multiple devices, we would still need to figure out how to keep them ready to take over so that we can minimize the mean time to repair (MTTR). And as you know, the more critical the network, the more important it is to reduce the time it takes to recover from a fault.

Assuming we added one extra device for each functional unit to increase the reliability of the network in case of a fault, the new network would look like this:

Topology 2

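A redundant pair only fails when both devices fail at the same time, so a pair of devices that each have a reliability of R has a combined reliability of 1 - (1 - R)^2. The new reliability of each unit would become: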

2 Servers @ 95% each: 99.75%
2 Switches @ 99% each: 99.99%
2 ISPs @ 95% each: 99.75%
2 Routers @ 99% each: 99.99%

The total reliability of the system would be:

99.75% x 99.99% x 99.75% x 99.99% ≈ 99.5%

So, by adding one redundant device for each functional unit, we have increased the reliability of the system from 88 percent to 99.5 percent. However, we have also effectively doubled the cost of the network, and we still have to figure out how to reduce the MTTR when faults occur. The goal of this series is to explore different technologies that can make redundant networks efficient. So hang in there, it’s gonna be a fun ride.
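
The redundant figures can be checked the same way. The sketch below assumes that the two devices in each pair fail independently and that the pair works as long as at least one device is up; the function names are made up for illustration.

```python
from math import prod

def parallel_reliability(r, copies=2):
    """Reliability of a unit built from `copies` redundant devices.

    The unit is down only when every copy is down at the same time.
    """
    return 1 - (1 - r) ** copies

def series_reliability(reliabilities):
    """Reliability of a chain in which every unit must be working."""
    return prod(reliabilities)

# Single-device figures from the example above.
single = {"server": 0.95, "switch": 0.99, "isp": 0.95, "router": 0.99}

# Replace every device with a redundant pair.
redundant = {name: parallel_reliability(r) for name, r in single.items()}

for name, r in redundant.items():
    print(f"2 x {name}: {r:.2%}")

total = series_reliability(redundant.values())
print(f"Total: {total:.2%}")  # prints 99.48%, roughly the 99.5% quoted above
```

Passing `copies=3` shows how quickly the gains flatten out, which is worth weighing against the cost of a third device.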

Single Points of Failure

The example that we explored in the previous section is a fairly simple one. In that network, all the devices are equally important and a failure in any of the devices can cause disruption of the entire network. In more complex networks, some devices can have a higher impact than other devices. Let’s take the network below for example:

Topology 3

In this network, the aggregation switch, server switch, firewalls, router and ISP connection are the most critical bits of the network. Users connect directly to the access switches, but the failure of one of those switches would only impact the users that are directly connected to it, so its impact is relatively small compared to that of the other critical parts. In cases like this, the critical parts of the network are referred to as SINGLE POINTS OF FAILURE (SPOFs). Basically, the failure of a SPOF can affect a huge part of the entire network and directly impact services that are being run on the network.
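
One rough way to reason about this is to take each device out of the topology in turn and count how many users lose access to the services they need. The sketch below does that for a hypothetical topology: the node names, links and user counts are assumptions made for illustration and are not read from the diagram.

```python
from collections import deque

# Hypothetical links, loosely modelled on the description above.
links = [
    ("access1", "agg"), ("access2", "agg"),
    ("agg", "server_switch"), ("server_switch", "server"),
    ("agg", "firewall"), ("firewall", "router"), ("router", "isp"),
]
users = {"access1": 20, "access2": 30}  # users per access switch (made up)
targets = {"server", "isp"}             # what every user needs to reach

# Build an adjacency map from the link list.
neighbours = {}
for a, b in links:
    neighbours.setdefault(a, set()).add(b)
    neighbours.setdefault(b, set()).add(a)

def reachable(start, failed):
    """Nodes reachable from `start` while the node `failed` is down."""
    if start == failed:
        return set()
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours[node]:
            if nxt != failed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def blast_radius(failed):
    """Number of users who lose at least one target when `failed` goes down."""
    return sum(
        count for switch, count in users.items()
        if not targets <= reachable(switch, failed)
    )

for node in sorted(neighbours, key=blast_radius, reverse=True):
    print(f"{node:14} affects {blast_radius(node)} users")
```

In this made-up topology, each access switch only affects its own users, while every device on the shared path affects all of them.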

One way to think of redundancy when we are designing networks with SPOFs is to think of the devices that would have the greatest “blast radius” in the event that they actually fail. To enhance this network, we can make provisions for multiple aggregation switches, firewalls, routers and internet connectivity links, which gives us the network below:

Topology 4

For emphasis, I’d like to mention again that merely installing more equipment in the network does not mean that its reliability is automatically enhanced. In fact, just adding more equipment to the network can have the exact opposite effect (trust me, L2 and L3 loops can be a nightmare). A network engineer has to make specific configuration changes to the devices so that more than one device can operate together as a single functional unit. In the next section, we will explore common kinds of redundancy.

Kinds of Redundancy

  1. Active/Standby Redundancy: In this kind of redundancy, one or more devices are dedicated as backups. This means that the backup resources are completely idle while the primary or active resource is working. The backup resource is only used when the primary device becomes unavailable. Common examples of this kind of redundancy exist in various aspects of the network, as we will see in subsequent posts. There are two kinds of standby redundancy:
    1. Cold Standby Redundancy: In cold standby redundancy, the backup resource is only commissioned when the active one goes down. For instance, you might have a backup router that is cabled, configured, and powered off in a datacenter; when the primary router goes down, the backup is turned on. The issue with cold standby is that it usually involves some form of human intervention, and this increases the MTTR in the case of a fault.
    2. Hot Standby Redundancy: In hot standby redundancy, both the backup and primary devices are running simultaneously, but the backup is not performing any actual function other than monitoring the primary device. Once a fault is detected, the backup takes over automatically. This usually requires little or no human interaction and as such, the MTTR is lower than with cold standby.
  2. Active/Active Redundancy: In active/active redundancy, the devices functioning in a redundancy group are all active at the same time. In the event of a failure, once the other member(s) of the redundancy group detect the failure, they take over the function of the failed device.

    Going by these definitions, everyone should prefer active/active redundancy to active/standby redundancy. However, there are technology and implementation constraints that make it difficult to always have active/active redundancy everywhere on the network, and we will explore these constraints as we proceed in the series.

    In some cases, multiple active/standby redundancy groups can be used to create pseudo active/active redundancy. We will explore some of these scenarios in subsequent articles.

    It is really easy to mix up load balancing and active/active redundancy. While the concepts are similar, they are not the same.

    Load balancing just focuses on (evenly) distributing a particular network function across multiple resources. When there is also a mechanism to detect the failure of a resource and enough capacity on the other resources to take over the functions of the failed resource, then you have active/active redundancy.

    In some sense, active/active redundancy involves some form of load balancing/sharing, but not all forms of load balancing are necessarily redundant.
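
The capacity half of that distinction is easy to check on paper. The sketch below is a simplified illustration with made-up device names and numbers: it only asks whether the surviving members of a group could carry the full load after any single failure, and it assumes that failure detection is already handled by some other mechanism.

```python
def survives_single_failure(loads, capacities):
    """Return True if the group can lose any one member without overload.

    `loads` and `capacities` map a device name to a value in the same
    unit (for example Mbps). When a member fails, the whole load has to
    fit on the members that remain.
    """
    total_load = sum(loads.values())
    for failed in loads:
        remaining_capacity = sum(
            cap for name, cap in capacities.items() if name != failed
        )
        if total_load > remaining_capacity:
            return False
    return True

capacities = {"router_a": 1000, "router_b": 1000}

# Both routers are busy: traffic is balanced, but there is no headroom.
print(survives_single_failure({"router_a": 700, "router_b": 700}, capacities))  # False

# Both routers are lightly loaded: either one could carry everything.
print(survives_single_failure({"router_a": 400, "router_b": 400}, capacities))  # True
```

The first case is load balancing without redundancy; the second has the headroom that active/active redundancy needs.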

To wrap up this post, I have created a table of common redundancy technologies that we will be discussing in the next articles.

| Network Component | Active/Standby Redundancy | Active/Active Redundancy |
| --- | --- | --- |
| L2 Link and Switch Redundancy | Mono/Common Spanning Tree Protocol | Per VLAN STP, Multiple STP, EtherChannels, Switch Stacking |
| Gateway Redundancy | Hot Standby Router Protocol, Virtual Router Redundancy Protocol | Gateway Load Balancing Protocol |
| Route Redundancy | Floating Static Routes, Tracking, Routing Protocols | Equal cost load balancing, Unequal cost load balancing |
| Firewall Redundancy | Cisco Active/Standby Firewalls | Cisco Active/Active Redundancy |
| ISP Redundancy | Floating Static Routes | BGP |
| Server Redundancy | Policy based NAT | Server Load balancing |

I know there are a lot of technologies listed above, but thankfully some of them have been covered in previous posts and I will only have to refer to them. There is still a lot to learn in this series, and I am quite excited to get started on some of these technologies. In the next post we will get right to it and discuss L2 link redundancy.

Finally, the list above is not exhaustive, so feel free to drop your ideas about redundancy and reliability technologies in the comment box and I will find a way to fit them into the series. Thank you very much for reading, and I look forward to the next article in this series. See you soon!