Hello everyone and welcome to the fourth article in this series on Network Reliability. In the last article, we concluded our discussion on Layer 2 redundancy technologies by examining switch stacking and link aggregation technologies. You can view the post here.

In this article, we will start exploring how to make our networks reliable beyond a single broadcast domain. Just as we did for Layer 2 redundancy, let’s start by revisiting some basics of Layer 3 routing.

Devices use their subnet mask to determine which IP addresses are in their own subnet (broadcast domain). When a device wants to reach another IP address in the same subnet, it sends an ARP request (which is a broadcast) asking for the MAC address that belongs to that IP address. If everything has been set up correctly, the destination device responds with its MAC address, the ARP table is populated on both devices and communication can begin.
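To make the subnet check concrete, here is a minimal sketch using Python's standard ipaddress module. The function name same_subnet and the addresses are illustrative assumptions, not part of any real host's network stack.

```python
import ipaddress

def same_subnet(own_ip: str, prefix_len: int, dest_ip: str) -> bool:
    """Return True if dest_ip falls inside the subnet defined by own_ip and its mask."""
    own_network = ipaddress.ip_interface(f"{own_ip}/{prefix_len}").network
    return ipaddress.ip_address(dest_ip) in own_network

# Illustrative addresses: the first destination is on-link, the second is not.
print(same_subnet("192.168.1.10", 24, "192.168.1.20"))
print(same_subnet("192.168.1.10", 24, "192.168.2.20"))
```

When the check succeeds, the device ARPs for the destination directly; when it fails, a gateway is needed.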

So what happens when devices need to communicate with devices outside their subnet? They use a gateway. The process is the same as the one outlined above, except that instead of ARP-ing for the MAC address of the destination, they ARP for the MAC address of the gateway (if they don’t already have it) and send the packet to that MAC address. The gateway, in turn, forwards the packet to a next hop by repeating the process. This forwarding continues until the packet reaches a router that is in the same broadcast domain as the destination IP address, and that router delivers it to the destination. This simplified summary of packet forwarding, based on independent hop-by-hop decisions, is the core of IP routing. You can read more about IP routing here.
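The hop-by-hop behaviour can be sketched as a small simulation. This is a deliberately simplified model: the router names, networks and the single next hop per router are all hypothetical, and a real router looks up the next hop per destination in its routing table.

```python
import ipaddress

# Hypothetical topology: each router lists its directly connected networks
# and a single next hop (None means it knows no further hop).
ROUTERS = {
    "R1": {"connected": ["192.168.1.0/24", "10.0.12.0/30"], "next_hop": "R2"},
    "R2": {"connected": ["10.0.12.0/30", "10.0.23.0/30"], "next_hop": "R3"},
    "R3": {"connected": ["10.0.23.0/30", "172.16.5.0/24"], "next_hop": None},
}

def is_connected(router: str, dest_ip: str) -> bool:
    """True if dest_ip is in one of the router's directly connected networks."""
    dest = ipaddress.ip_address(dest_ip)
    return any(dest in ipaddress.ip_network(net) for net in ROUTERS[router]["connected"])

def trace_path(gateway: str, dest_ip: str, max_hops: int = 16) -> list:
    """Follow next hops from the gateway until a router is on the destination's network."""
    path = [gateway]
    current = gateway
    while not is_connected(current, dest_ip):
        current = ROUTERS[current]["next_hop"]
        if current is None or len(path) >= max_hops:
            raise LookupError(f"no route to {dest_ip}")
        path.append(current)
    return path

print(trace_path("R1", "172.16.5.20"))
```

Each router makes its own decision with only local knowledge, which is the point of the sketch: no device on the path needs to know the whole route.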

So how does this affect the reliability of the network? Let’s take the perspective of the devices in a simple subnet, say 192.168.1.0/24. If the devices only want to communicate with other devices within that subnet, a default gateway is not needed. However, any communication outside that subnet needs a gateway. This means all internet communication, partner communication and so on requires a default gateway. Clearly, the gateway is a single point of failure here.

So how do we improve the reliability of the default gateway in the network? If you read the first article in this series, then you know where this article is headed. We can add multiple redundant gateways to improve the reliability of the network. But there is a catch: devices generally use only one default gateway at a time. So what’s the point of having multiple gateways when you can only use one? Well, that is where First Hop Redundancy Protocols come in!

Note: I know it is possible to tweak the operating system (Windows or Linux) of a device to support multiple gateways by using static routes, but that is not a scalable solution for any moderately sized network. For a more scalable solution, we will consider First Hop Redundancy Protocols for this task.

First Hop Redundancy Protocols

First Hop Redundancy Protocols (FHRPs) are designed to provide redundancy to clients by representing multiple default gateways in a group with a single IP address. There are different first hop redundancy protocols with slightly different implementations, but the overall concept is the same. As usual, we will illustrate with a simple network diagram:

[Figure: Layer3part1-1]

The diagram above shows a simple network with two default gateways in the same FHRP group. The goal of FHRPs is to make the network above appear to the client as if it looked like this:

[Figure: Layer3part1-2]

So the client is totally oblivious to any change in the network because it uses the virtual IP address (in this case 192.168.2.1) as its default gateway.

Since there have been detailed posts about FHRPs on this site, I will not go into the details of the configuration and implementation of the common FHRPs (Hot Standby Router Protocol, Virtual Router Redundancy Protocol and Gateway Load Balancing Protocol) in this article. If you want to learn more about the actual configuration of these FHRPs, you can review this post and this post.

Let’s focus on the redundancy effect of FHRPs. For active/standby redundancy, HSRP and VRRP use a single virtual MAC address per group. When a client ARPs for the MAC address of its default gateway, the reply always carries the same virtual MAC address, no matter which router in the group answers (it is the one that is primary at the time). Since only one router can be primary at a time, every other router in the group acts as a standby until the primary router goes offline.
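The active/standby behaviour can be sketched as follows. This is a toy model under stated assumptions: the Router class, the priority values and the virtual MAC are illustrative, the highest-priority live router is always treated as primary (as if preemption were enabled), and hello timers and tie-breaking are not modelled.

```python
from dataclasses import dataclass

VIRTUAL_MAC = "0000.0c07.ac01"  # illustrative virtual MAC shared by the group

@dataclass
class Router:
    name: str
    priority: int
    alive: bool = True

def primary_router(group):
    """Pick the live router with the highest priority, or None if all are down."""
    candidates = [r for r in group if r.alive]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r.priority)

def answer_arp(group):
    """Return (virtual MAC, name of the router that answers) for the virtual IP."""
    owner = primary_router(group)
    if owner is None:
        return None
    return VIRTUAL_MAC, owner.name

group = [Router("R1", priority=110), Router("R2", priority=100)]
print(answer_arp(group))   # R1 answers with the virtual MAC
group[0].alive = False     # simulate the primary going offline
print(answer_arp(group))   # R2 takes over, same virtual MAC
```

The client's ARP entry never changes across the failover, which is why it does not notice that a different physical router is now forwarding its traffic.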

Both HSRP and VRRP are active/standby redundancy protocols. The major difference between the two is that HSRP is Cisco proprietary while VRRP is an IETF standard. Some other differences are highlighted in the table below:

Feature | Hot Standby Router Protocol (HSRP) | Virtual Router Redundancy Protocol (VRRP)
Standardization | Cisco proprietary | Industry standard
Multicast address | 224.0.0.2 (version 1) or 224.0.0.102 (version 2) | 224.0.0.18
Virtual MAC address | 0000.0c07.acXX (version 1) | 0000.5e00.01XX
Specification | RFC 2281 | RFC 5798
Preemption | Disabled by default | Enabled by default
Hello timer | 3 seconds | 1 second
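
The XX placeholder in the virtual MAC addresses in the table is the group number written as two hexadecimal digits. A minimal sketch of that mapping, with hypothetical function names and assuming HSRP version 1 groups (0 to 255) and VRRP virtual router IDs (1 to 255):

```python
def hsrp_v1_virtual_mac(group: int) -> str:
    """Build the HSRP version 1 virtual MAC for a group number (0-255)."""
    if not 0 <= group <= 255:
        raise ValueError("HSRP version 1 group must be between 0 and 255")
    return f"0000.0c07.ac{group:02x}"

def vrrp_virtual_mac(vrid: int) -> str:
    """Build the VRRP (IPv4) virtual MAC for a virtual router ID (1-255)."""
    if not 1 <= vrid <= 255:
        raise ValueError("VRRP virtual router ID must be between 1 and 255")
    return f"0000.5e00.01{vrid:02x}"

print(hsrp_v1_virtual_mac(10))
print(vrrp_virtual_mac(1))
```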

Active/Active First Hop Redundancy

Just like with spanning tree, we can achieve some form of active/active redundancy in HSRP and VRRP by running multiple instances (in this case, multiple redundancy groups) on the same set of routers. In that case, one router assumes the primary role for the first redundancy group while the second router acts as the primary router for the second group. This scenario achieves load balancing, but we are now forced to have two gateway IP addresses which we have to split between the users in the subnet, and that is not so neat to implement. (Think about DHCP and other related services that need to be reconfigured to support that.) So how do we achieve ‘seamless’ active/active redundancy for our first hop gateways? That is where Gateway Load Balancing Protocol (GLBP) comes in.

With GLBP, the routers in the redundancy group still use the same virtual IP address, but this time, multiple virtual MAC addresses are used. This works by handing out the different virtual MAC addresses during the ARP process, when clients request the MAC address that corresponds to the virtual IP address. There are detailed nuances about the algorithms used for load balancing and the roles of the routers in a GLBP scenario. But the key thing to understand as far as active/active redundancy is concerned is that by using one virtual IP address and multiple MAC addresses, GLBP can share traffic among members of the redundancy group without the knowledge of the client devices. This makes things a lot easier from an implementation perspective. To learn more about the intricate workings of GLBP, you can review this post here.
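The idea of one virtual IP address answered with different virtual MAC addresses can be sketched like this. It is a toy model, not GLBP itself: the class name and the MAC labels are hypothetical, and simple round-robin is assumed as the load-balancing algorithm.

```python
from itertools import cycle

class GlbpGateway:
    """Toy model: one virtual IP, several virtual MACs handed out in turn."""

    def __init__(self, virtual_ip, forwarder_macs):
        self.virtual_ip = virtual_ip
        self._next_mac = cycle(forwarder_macs)  # round-robin over the forwarders

    def arp_reply(self, requested_ip):
        """Answer an ARP request for the virtual IP with the next virtual MAC."""
        if requested_ip != self.virtual_ip:
            return None
        return next(self._next_mac)

# Hypothetical labels stand in for the real virtual MAC addresses.
gateway = GlbpGateway("192.168.2.1", ["vmac-router-a", "vmac-router-b"])
for client in ("PC1", "PC2", "PC3"):
    print(client, "->", gateway.arp_reply("192.168.2.1"))
```

Every client is configured with the same default gateway, yet successive clients end up sending their traffic to different physical routers.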

Beyond the First Hop – Route Redundancy

So far, this article has focused on how we can achieve redundancy in gateway devices (also called first hops). Although these are quite important, we also need to consider what happens beyond the first hop if we are to improve the overall reliability of a routed network. So how do we ensure reliability across multiple hops? To do this, we need to understand one key concept of routing: the order of preference of routes.

On Cisco devices, a route contains a prefix (a subnet address and a subnet mask), a next hop address, an administrative distance and a metric. Of these attributes, the administrative distance and the metric determine whether a route is preferred over other routes to the same prefix. The general rule of thumb is that lower is better. A route with a lower administrative distance is always preferred regardless of the metric; only when the administrative distance is the same does the router break the tie with the metric.
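This preference order can be expressed as a sort on the pair (administrative distance, metric). The sketch below is illustrative only: the Route type and next hop addresses are made up, and the distances used are the Cisco defaults for OSPF (110) and RIP (120).

```python
from typing import NamedTuple

class Route(NamedTuple):
    prefix: str
    next_hop: str
    admin_distance: int
    metric: int

def best_route(routes, prefix):
    """Among routes to the same prefix, prefer lowest admin distance, then lowest metric."""
    candidates = [r for r in routes if r.prefix == prefix]
    if not candidates:
        return None
    return min(candidates, key=lambda r: (r.admin_distance, r.metric))

routes = [
    Route("10.1.0.0/16", "10.0.0.6", admin_distance=120, metric=2),    # learned via RIP
    Route("10.1.0.0/16", "10.0.0.10", admin_distance=110, metric=30),  # learned via OSPF
    Route("10.1.0.0/16", "10.0.0.2", admin_distance=110, metric=20),   # learned via OSPF
]
print(best_route(routes, "10.1.0.0/16"))
```

The RIP route has the lowest metric but loses on administrative distance; the two OSPF routes tie on distance, so the metric of 20 wins.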

Since reliability is improved by adding redundancy, route reliability is also achieved by having multiple routes, together with a mechanism that detects the failure of a route and removes it from the routing table. As usual, the first step is to ensure that the router learns about multiple routes, and this is done via routing (whether static or dynamic). This series is not designed to explain the concepts of routing, so we will skip a lot of low-level details and focus on how routing (whether static or dynamic) can be used to improve reliability or resilience in a network. In preparation for the next article in the series, I would encourage you to revisit the posts on IP routing to understand how routing works in detail. I have created a list with quick links at the end of the page that will be useful for this.
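As a rough sketch of that idea, the function below drops routes whose next hop has been detected as failed and then picks the best of what remains. How the failure is detected is left out entirely; the set of failed next hops, the addresses and the distance values are assumptions made for illustration.

```python
def usable_best(routes, failed_next_hops):
    """Ignore routes through failed next hops, then prefer lowest (distance, metric)."""
    alive = [r for r in routes if r["next_hop"] not in failed_next_hops]
    return min(alive, key=lambda r: (r["ad"], r["metric"]), default=None)

# Two hypothetical routes to the same prefix: a preferred one and a backup.
routes = [
    {"prefix": "0.0.0.0/0", "next_hop": "10.0.0.2", "ad": 1, "metric": 0},
    {"prefix": "0.0.0.0/0", "next_hop": "10.0.1.2", "ad": 200, "metric": 0},
]

print(usable_best(routes, failed_next_hops=set()))         # preferred route
print(usable_best(routes, failed_next_hops={"10.0.0.2"}))  # backup takes over
```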

In this post, we started out by revisiting routing basics and how a next hop is selected. We then explored the redundancy features of First Hop Redundancy Protocols. Finally, we introduced route redundancy by examining the attributes of a route that influence the selection of best routes in routers. In the next post, we will pick up right where we left off and start exploring active/standby and active/active redundancy, both with static routing and with dynamic routing protocols.

Thank you for reading and I look forward to sharing the next article with you soon. Until then, keep building reliable networks!

Further Reading: