Troubleshooting switching in a LAN environment is one of the most important skills a network engineer needs to have. To understand why, you will need to first learn some key differences in how a Layer 2 frame is forwarded versus how a Layer 3 packet is forwarded.
IP packets come with built-in protection against never-ending loops. You have most likely seen this feature already; you just may not have realized it. If you’ve ever pulled up the Windows command prompt and tried to ping something, then received a message about “TTL expired in transit”, you were seeing that protection at work.
Whenever a router forwards an IP packet, it decrements a special field in the IP header, referred to as TTL or time to live, by 1. So when a router receives a packet with a TTL of 60, it changes it to 59, and then sends it on. When this field, which is 8 bits wide and so can start no higher than 255, reaches 0, the device is not allowed to forward the packet and discards it instead. The discarding device then sends a special message (an ICMP Time Exceeded message) to the sender of the packet, telling them that their packet expired in transit.
The most common cause of a TTL expiring in transit is two directly connected devices whose routes to the packet’s destination address point at each other. In other words, Router 1 has a route saying you reach the destination by sending to Router 2, and Router 2 has a route saying you reach the destination by sending to Router 1. If you want to see this happening, you can use the Windows tracert command.
The two devices will send the IP packet back and forth to one another, decrementing the TTL by one each time until it reaches zero. It usually takes no more than a few milliseconds to loop a packet between two devices over 100 times.
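The back-and-forth described above can be sketched in a few lines of Python. This is only an illustrative model, not router code: the function name `simulate_routing_loop` and the two-router list are assumptions made for the example.

```python
def simulate_routing_loop(initial_ttl):
    """Bounce a packet between two routers whose routes point at each other.

    Returns (times_forwarded, router_that_dropped_the_packet).
    """
    routers = ["Router 1", "Router 2"]
    ttl = initial_ttl
    forwards = 0
    current = 0  # index of the router currently holding the packet
    while True:
        ttl -= 1  # every router decrements the TTL before forwarding
        if ttl <= 0:
            # TTL exhausted: this router discards the packet and would
            # report "expired in transit" back to the sender.
            return forwards, routers[current]
        forwards += 1
        current = 1 - current  # hand the packet to the other router


if __name__ == "__main__":
    print(simulate_routing_loop(60))
```

With a starting TTL of 60, the packet is forwarded 59 times before Router 2 discards it. The loop always ends, which is exactly the guarantee a Layer 2 frame does not have.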
Unfortunately, traditional LANs have no such protection mechanism. If two devices are sending a frame back and forth between one another, they’ll do so as fast as they’re able, until they’re stopped! A single looping frame on a LAN can take up the entire bandwidth of a link and close to 100 percent of the CPU resources of a switch. When all the bandwidth and CPU time is consumed by the looped traffic, the switch can’t forward legitimate traffic.
Even though LANs do not have a mechanism as robust for stopping runaway traffic as an IP-based WAN, they do employ many features designed to prevent looping traffic from occurring in the first place. When you’re troubleshooting, you have to be able to determine which of these preventative mechanisms failed, why it failed, and how to fix it.
We’ll be going through some of the most common things you can look at when troubleshooting LANs.
1. CPU Utilization Of The Spanning-Tree Process
The first thing you can check is the CPU utilization of the spanning tree process on the switch. Typically, when you look at the processes, you should see something similar to the numbers below – a relatively small number for spanning-tree.
Don’t worry if the CPU usage for all the individual processes doesn’t add up to the total CPU utilization. The total CPU value includes time taken to process traffic. Cisco doesn’t consider that an individual process, so it’s not included in the output.
The small number for spanning tree means that CPU cycles aren’t being spent continually calculating a new loop-free path based on changing information received in BPDUs. In case you are not aware, BPDU stands for “bridge protocol data unit”. BPDUs are sent out of the ports running spanning tree on a switch, and they carry the information that the spanning-tree algorithm uses to determine the best forwarding configuration, which in turn decides how traffic flows between any two given nodes on the network.
2. Spanning-Tree Topology Changes
Another troubleshooting tip is to check when the last topology change for a given spanning-tree instance occurred.
The ‘show spanning-tree detail’ command gives you a lot of helpful information. By default, you will see spanning-tree information for each VLAN defined on the switch that is running spanning tree, since the default mode on Cisco switches is a per-VLAN one (PVST+, Per-VLAN Spanning Tree Plus), which means one spanning-tree instance per VLAN. If you are using MST (Multiple Spanning Tree), you will instead see one spanning-tree instance for each group of VLANs mapped to it.
A topology change is fairly self-explanatory: it just means that something in the spanning-tree topology changed. Usually, this change is a link bouncing, a switch failing, or a port transitioning into the forwarding state. Regardless of the exact cause, excessive changes indicate instability in the Layer 2 domain. In classic spanning tree, a switch that detects one of these changes sends a TCN (topology change notification) toward the root bridge, and the root then flags the change in the BPDUs it sends to the rest of the domain. That flag tells the other switches to age out their MAC address tables quickly, so that stale entries do not keep sending traffic down a path that may no longer exist.
In the output below, you can see that the last topology change occurred 0 seconds ago from Fa0/1.
If you are seeing that large numbers of changes have occurred, or that the “last change occurred” timer resets every couple of seconds, you most likely should look at what’s attached to the port the topology change notification was received on. Keep in mind that you may need to “trace” the problem down.
For example, if another switch is connected to the port receiving the TCNs, you will want to move to that switch and repeat the process. This way, you can find out exactly where the TCNs are originating from. Often, once you’ve traced the problem as far downstream as you can, you can shut that port to resolve the problem. As always, use extreme caution when shutting down ports! The only thing worse than troubleshooting a LAN problem is trying to troubleshoot a LAN problem on switches you have just lost access to.
Another thing to evaluate from this output is the number of BPDUs sent and received. When a loop-free path is calculated, one port will always be “Designated” on the segment. This port will send BPDUs and the other side will receive them. If spanning tree timers are left to their default values, a BPDU will be sent every 2 seconds. When you see incrementing BPDUs in a direction that is not normal, you will want to look at what might have caused that. A healthy LAN is one that doesn’t have a lot of changes in how data flows. The amount of data flowing can and usually will fluctuate quite a bit, but the path that data flows over should not change often.
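To make the 2-second default concrete, here is a minimal Python sketch of the arithmetic, along with a helper that labels which way BPDUs moved between two counter readings. The function names and the labels are assumptions made for illustration; they are not fields from any switch output.

```python
DEFAULT_HELLO_SECONDS = 2  # default spanning-tree hello time, per the text


def expected_bpdus(elapsed_seconds, hello_seconds=DEFAULT_HELLO_SECONDS):
    """Roughly how many BPDUs a designated port sends in the given time."""
    return elapsed_seconds // hello_seconds


def bpdu_direction(sent_delta, received_delta):
    """Describe which way BPDUs moved between two counter samples."""
    if sent_delta > 0 and received_delta > 0:
        return "both"       # worth a closer look on a stable segment
    if sent_delta > 0:
        return "sending"    # typical of the designated port
    if received_delta > 0:
        return "receiving"  # typical of the port facing the designated port
    return "none"


if __name__ == "__main__":
    print(expected_bpdus(60))     # BPDUs expected in one minute
    print(bpdu_direction(30, 0))  # counters sampled one minute apart
```

Over one minute, a designated port with default timers should send about 30 BPDUs. If the received counter is climbing on a port you expected to be designated, that is the "not normal" direction worth investigating.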
You should NEVER be seeing BPDUs received on a host port. If you do, this means an end user has plugged their own spanning-tree capable device into an end-user port. This can be very dangerous, since you now have an unmanaged, untrusted device participating in the spanning-tree domain and/or potentially the VTP (VLAN Trunking Protocol) domain, which can cause all sorts of problems.
That spanning-tree capable device has the potential to become the spanning-tree root (forcing a large amount of traffic to flow through it) or to erase all your VLANs with a rogue VTP update. The best practice is to use a protective feature such as spanning-tree BPDU Guard or spanning-tree BPDU Filter to prevent this from occurring. When you enable BPDU Guard on a port, that port will place itself in an err-disabled state if it receives a BPDU. When you enable BPDU Filter on a port, the BPDU will be ignored, but the port will not be err-disabled.
3. Types Of Traffic Being Received
Another thing you can look at to help identify the cause of a problem in the LAN is the types of frames being received on a port.
The output from Cisco’s “show interface” command is helpful in this instance. Among other things, this shows the number of several different kinds of traffic that have been received on the port.
Keep in mind that often, the counters for the types of traffic may not have been cleared in a while. When troubleshooting a LAN issue, you’re going to want to pay attention to how quickly that counter is incrementing, and not just how many packets might have been received since it was last cleared. One of the first things many administrators do when looking at LAN problems is to clear the counters, so they can get a fresh view of what is actually occurring at the time.
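Since the rate of change matters more than the raw total, the comparison boils down to simple arithmetic on two readings of the same counter. The sketch below is a hypothetical helper (the name `counter_rate` and the sample numbers are made up for illustration); you would read the two values from the interface output yourself.

```python
def counter_rate(first_count, second_count, seconds_between):
    """Frames per second between two readings of the same counter."""
    if seconds_between <= 0:
        raise ValueError("seconds_between must be positive")
    return (second_count - first_count) / seconds_between


if __name__ == "__main__":
    # Two readings of a broadcast counter taken 60 seconds apart.
    print(counter_rate(1_200_000, 1_200_300, 60))  # a quiet port
    print(counter_rate(1_200_000, 1_800_000, 60))  # a port worth a closer look
```

The first pair works out to 5 broadcasts per second and the second to 10,000 per second, even though both counters start from the same large total.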
The value you’re going to want to pay the most attention to is broadcasts. Broadcast frames are typically the ones that bring to light any existing loops. When a switch receives a broadcast frame, the FF:FF:FF:FF:FF:FF destination MAC address is essentially an instruction to the switch to make copies of the frame and forward those copies out of every other port in the same VLAN the broadcast was received on.
Making the copies and forwarding takes up CPU cycles, and also bandwidth on multiple ports. What might be a small stream of broadcast traffic inbound can become a very large stream of output traffic going out to multiple ports. And as you’ll recall from earlier in the article, there is no built-in mechanism that tells the switch to not re-forward the same broadcast over and over again – even if it saw it just a few microseconds earlier. A copy of every broadcast frame received is always going to be forwarded to every port in that VLAN. While this behavior is fine as long as there are no loops in the infrastructure, once a loop exists, that broadcast frame gets forwarded over and over again.
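The effect of flooding with and without a loop can be shown with a small simulation. This is a simplified model under stated assumptions: only inter-switch links are modelled, no spanning tree is running, every frame takes one "round" to cross a link, and the topology names (`line`, `triangle`, `full_mesh`) are invented for the example.

```python
def flood_rounds(links, origin, rounds):
    """Count frames in flight per round when every switch floods a broadcast.

    links maps each switch name to the list of its neighbouring switches.
    """
    # Each in-flight frame is recorded as (sender, receiver).
    in_flight = [(origin, nbr) for nbr in links[origin]]
    counts = [len(in_flight)]
    for _ in range(rounds - 1):
        next_round = []
        for sender, receiver in in_flight:
            # Flood out of every port except the one the frame arrived on.
            for nbr in links[receiver]:
                if nbr != sender:
                    next_round.append((receiver, nbr))
        in_flight = next_round
        counts.append(len(in_flight))
    return counts


# Three hypothetical topologies: no loop, one loop, and many loops.
line = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
triangle = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]}
full_mesh = {
    "A": ["B", "C", "D"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "D"],
    "D": ["A", "B", "C"],
}

if __name__ == "__main__":
    print(flood_rounds(line, "A", 4))       # [1, 1, 0, 0]
    print(flood_rounds(triangle, "A", 5))   # [2, 2, 2, 2, 2]
    print(flood_rounds(full_mesh, "A", 4))  # [3, 6, 12, 24]
```

In the loop-free line, the broadcast dies out once it reaches the last switch. In the triangle, two copies circulate forever, and in the full mesh the number of copies doubles every round. That growth is what consumes the bandwidth and CPU.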
In addition to counters for certain types of traffic, the “show interface” output will show you how much traffic is being sent and received on the port at the current time. You can look for “input rate” and “output rate” to see this. In this case, “5 minute” means that the value is the average amount of traffic over the last 5 minutes. If you want a less averaged value, you can change the load interval on the interface to as low as 30 seconds. Whatever load interval you use, only you can determine what amount of traffic is normal and not normal for your infrastructure, but values showing the link at 99% utilization almost always indicate a serious problem of some kind.
Just as with topology change notifications in spanning tree, if you see a port with very high utilization, you should track it back and try to determine the path the loop is following. With broadcast storms, you usually can’t find a source, since the traffic is being looped. You will just end up tracing around in circles. However, finding which devices and interfaces are part of the loop is helpful in determining how to address it (shutting down an interface, removing a VLAN from a trunk, etc.).
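Turning an input or output rate into a utilization figure is a one-line calculation. The sketch below assumes you already have the rate in bits per second and know the link speed; the function names and the 90 percent alert threshold are arbitrary choices for illustration, not recommendations.

```python
def link_utilization_percent(rate_bps, bandwidth_bps):
    """Utilization of a link as a percentage of its bandwidth."""
    if bandwidth_bps <= 0:
        raise ValueError("bandwidth_bps must be positive")
    return 100.0 * rate_bps / bandwidth_bps


def looks_saturated(rate_bps, bandwidth_bps, threshold_percent=90.0):
    """True when utilization is at or above a threshold you pick."""
    return link_utilization_percent(rate_bps, bandwidth_bps) >= threshold_percent


if __name__ == "__main__":
    # 99 Mbit/s of input on a 100 Mbit/s port.
    print(link_utilization_percent(99_000_000, 100_000_000))
    print(looks_saturated(99_000_000, 100_000_000))
```

What counts as too high is still a judgment you make against your own baseline; the threshold parameter exists so you can set it to whatever is abnormal for your network.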
4. Device Logs
The final place you can check is the device’s logs. The “show log” command is used to display this information. A hallmark of any loop is a message indicating that the port on which a single MAC address is being learned keeps changing. It will likely look something like this, and will probably appear multiple times:
Host 00:55:56:43:9A:84 in vlan 1 is flapping between port Gi1/2 and port Gi1/3
Under normal conditions, this would never occur. Remember, spanning tree was designed to create a single, stable, loop-free path between any two given network points. If there are two simultaneously active paths between two hosts in a classic spanning-tree design, there is most likely some kind of loop.
This single syslog message tells you a lot about the potential loop you’re dealing with. First, it tells you what VLAN the loop is on. If you know where that VLAN is spanned to, this can help you understand what might be causing the problem. It also tells you which ports are involved in the loop. Depending on how desperate you are, shutting down one of these ports can often break the loop, giving you time to figure out what caused it.
You usually don’t need to pay too much attention to the specific MAC address in the syslog message. Any device in the VLAN experiencing the loop can have its traffic get “caught” in the loop, but that doesn’t mean that specific device is what’s causing the problem in the first place.
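If you have many of these messages to sift through, a short script can pull out the VLAN and port pair for you. The sketch below matches only the message wording shown above; real log lines usually carry extra prefixes such as timestamps, which the pattern simply skips over. The function names `parse_flap` and `count_flaps` are made up for this example.

```python
import re
from collections import Counter

# Matches the wording of the flapping message shown above.
FLAP_PATTERN = re.compile(
    r"Host (?P<mac>\S+) in vlan (?P<vlan>\d+) "
    r"is flapping between port (?P<port_a>\S+) and port (?P<port_b>\S+)"
)


def parse_flap(line):
    """Return the MAC, VLAN and ports from a flapping message, or None."""
    match = FLAP_PATTERN.search(line)
    if match is None:
        return None
    return {
        "mac": match.group("mac"),
        "vlan": int(match.group("vlan")),
        "ports": (match.group("port_a"), match.group("port_b")),
    }


def count_flaps(lines):
    """Count flapping messages per (VLAN, port pair), ignoring port order."""
    counts = Counter()
    for line in lines:
        parsed = parse_flap(line)
        if parsed is not None:
            counts[(parsed["vlan"], tuple(sorted(parsed["ports"])))] += 1
    return counts


if __name__ == "__main__":
    sample = ("Host 00:55:56:43:9A:84 in vlan 1 is flapping "
              "between port Gi1/2 and port Gi1/3")
    print(parse_flap(sample))
```

Grouping by VLAN and port pair rather than by MAC address fits the advice above: the ports and the VLAN point at the loop, while the MAC address is usually just a bystander.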
While by no means a comprehensive discussion of every problem you might find on a LAN, the suggestions above hopefully provide a few more things you can check to help you identify and resolve LAN problems.