Hello everybody, and welcome to the second article in the Network Engineer Series. In the first article, we explored building a single-site network (switching, routing, DHCP, NAT, access-lists and other interesting technologies); you can check out that post here. In this article, we will explore another key responsibility of a network engineer: resolving network issues.
Fault detection and fault resolution are major responsibilities of Network Engineers. In fact, if networks could fix their own issues automatically, many network engineers would be out of a job. Most organizations keep full-time Network Engineers on staff because they want to ensure that the network is available to support the organization’s processes effectively. So, as a Network Engineer, you should be able to identify network faults easily and resolve them as quickly as possible, with minimal disruption to other network operations.
In the rest of this post, we will explore some of the network troubleshooting techniques available to a Network Engineer, as well as recommended approaches to network troubleshooting that can aid network management.
Remember our network from the previous article? Users and Admin are connected in different VLANs, and the servers are connected in another VLAN. The switch has a trunk connection to the router, which performs routing between the subnets (using sub-interfaces) and provides connectivity to the internet.
Physically, all the devices (PCs, servers and router) are connected to the same switch, but from a logical perspective, they are on different networks.
The physical and logical connections are shown in Figures 1 and 2 below:
Suppose you receive a complaint from a user who is not able to access an application on a server. For example, user 192.168.2.9 cannot access the web server 192.168.2.33. How would you go about solving the problem?
I’d ask that you take a few minutes to think about the problem before you continue reading this article.
Okay, so you’ve had some time to think about it, and you might even have a guess as to what the issue(s) might be. But the information the user has provided is not nearly enough to diagnose the problem. Unfortunately, nine times out of ten, this is all the information the user can provide; it is left to the network engineer to figure out what the user is really trying to say.
Now, there are many ways to approach this issue based on your own experience and style, but it is important to have a systematic approach so that you can diagnose network issues quickly and efficiently.
A simple model for approaching networking issues is to determine which layer of the OSI model has the problem. Remember the OSI model? The seven layers of the Open Systems Interconnection (OSI) model are the Physical, Datalink, Network, Transport, Session, Presentation and Application layers.
By now, you should be familiar with the OSI model; if you are not, you can review it here. When a message is sent, it travels down from the 7th layer to the 1st layer and is encapsulated along the way. It is then transferred across the physical medium (wireless, cable, etc.) and de-encapsulated as it travels back up through the seven layers on the receiving device.
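To make the encapsulation idea concrete, here is a toy Python sketch. It is purely illustrative: every layer simply prepends a made-up text header such as `[Transport]`, whereas real protocols use binary headers (and the Datalink layer also adds a trailer). The names `encapsulate` and `decapsulate` are hypothetical, chosen for this example.

```python
# Layers from 7 (top) down to 1 (bottom); headers here are made-up text tags.
LAYERS = [
    "Application", "Presentation", "Session",
    "Transport", "Network", "Datalink", "Physical",
]

def encapsulate(payload: str) -> str:
    """Sender side: walk down from layer 7 to layer 1, wrapping the data each time."""
    data = payload
    for layer in LAYERS:
        data = f"[{layer}]{data}"
    return data

def decapsulate(frame: str) -> str:
    """Receiver side: walk up from layer 1 to layer 7, removing the outermost header each time."""
    data = frame
    for layer in reversed(LAYERS):
        header = f"[{layer}]"
        if not data.startswith(header):
            raise ValueError(f"expected {header} at layer {layer}")
        data = data[len(header):]
    return data

wire = encapsulate("GET /")
print(wire)
# [Physical][Datalink][Network][Transport][Session][Presentation][Application]GET /
print(decapsulate(wire))
# GET /
```

The outermost tag belongs to the lowest layer, which is why the receiver must strip headers in the reverse order from the one the sender added them.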
For proper network communication, all the layers must be fully operational. Usually, Network Engineers deal with the first 4 layers of the OSI model (Physical, Datalink, Network and Transport).
Looking back at our example, when user 192.168.2.9 cannot access the web server 192.168.2.33, the issue might be at any of the layers:
Physical Layer: The physical layer deals with the physical media used to transmit data. In this scenario, a Layer 1 issue might be that a cable is disconnected on one of the links involved, or that there is a duplex or speed mismatch between the two ends of a link.
Datalink Layer: Some Layer 2 issues in our scenario might include a problem with the Address Resolution Protocol (ARP) or ports assigned to the wrong VLANs. A virus could also flood the switch with bogus MAC addresses until its Content Addressable Memory (CAM) table is full. Other security-based Layer 2 issues include MAC address spoofing and ARP spoofing.
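One rough way to spot the ARP spoofing symptom mentioned above is to look for a single MAC address that answers for several IP addresses in a host’s ARP table (for example, the output of `arp -a`). The sketch below assumes you have already turned that table into (IP, MAC) pairs; the helper name `find_shared_macs` and the sample entries are hypothetical. A shared MAC is only a hint, since a device with several IP addresses or proxy ARP can produce the same pattern legitimately.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def find_shared_macs(arp_entries: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Return every MAC address that more than one IP address resolves to."""
    ips_by_mac = defaultdict(set)
    for ip, mac in arp_entries:
        # Normalize Windows (00-1a-...) and Unix (00:1a:...) notation to one form.
        ips_by_mac[mac.lower().replace("-", ":")].add(ip)
    return {mac: sorted(ips) for mac, ips in ips_by_mac.items() if len(ips) > 1}

# Hypothetical ARP table from the user's PC: the gateway (.1) and host .7 share a MAC.
entries = [
    ("192.168.2.1", "00-1A-2B-3C-4D-5E"),
    ("192.168.2.5", "00-AA-BB-CC-DD-EE"),
    ("192.168.2.7", "00-1a-2b-3c-4d-5e"),
]
print(find_shared_macs(entries))
# {'00:1a:2b:3c:4d:5e': ['192.168.2.1', '192.168.2.7']}
```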
Network Layer: The third layer deals with IP addressing and routing. In our scenario, the IP address might be misconfigured on either device, or the router might not be routing traffic between the two subnets. Another Layer 3 problem might be that the default gateway has not been set on one of the devices.
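A quick way to reason about these Layer 3 checks is Python’s standard `ipaddress` module. The sketch assumes a /27 mask on both subnets and a hypothetical default gateway of 192.168.2.1 for the user VLAN; neither value is stated in this article, so substitute your own addressing plan. Under those assumptions, 192.168.2.9 and 192.168.2.33 land in different subnets, so their traffic must go through the router, which is why a missing or wrong default gateway breaks communication.

```python
import ipaddress

# Assumption: both VLAN subnets use a /27 mask (not stated in this article).
PREFIX_LEN = 27

def same_subnet(ip_a: str, ip_b: str, prefix_len: int = PREFIX_LEN) -> bool:
    """True if both addresses fall in the same subnet for the given prefix length."""
    net_a = ipaddress.ip_interface(f"{ip_a}/{prefix_len}").network
    net_b = ipaddress.ip_interface(f"{ip_b}/{prefix_len}").network
    return net_a == net_b

def gateway_is_plausible(host_ip: str, gateway_ip: str, prefix_len: int = PREFIX_LEN) -> bool:
    """A default gateway must be a usable address inside the host's own subnet."""
    iface = ipaddress.ip_interface(f"{host_ip}/{prefix_len}")
    net = iface.network
    gw = ipaddress.ip_address(gateway_ip)
    # Exclude the network address, the broadcast address and the host itself.
    return gw in net and gw not in (net.network_address, net.broadcast_address, iface.ip)

print(same_subnet("192.168.2.9", "192.168.2.33"))            # False: traffic must be routed
print(gateway_is_plausible("192.168.2.9", "192.168.2.1"))    # True (hypothetical gateway)
print(gateway_is_plausible("192.168.2.9", "192.168.2.33"))   # False: outside the user's subnet
```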
Transport Layer: When trying to isolate issues at the transport layer, think about the two main transport layer protocols: TCP and UDP. In our scenario, an access-list on the router might be blocking TCP port 80 traffic destined for the web server. Other TCP-based issues might include unexpected TCP resets or a small TCP window size.
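At the transport layer, a simple test is whether a TCP connection to the web server’s port 80 can be established at all. The sketch below uses only Python’s standard `socket` module; the function name `tcp_port_check` and its status strings are our own. A "refused" result means something answered with a TCP reset (for example, nothing is listening on the port), while a "timeout" often points to packets being silently dropped somewhere along the path, which is worth checking against any access-lists.

```python
import socket

def tcp_port_check(host: str, port: int, timeout: float = 3.0) -> str:
    """Try a TCP handshake to host:port and classify the result."""
    try:
        # create_connection performs the full three-way handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The far end (or a device in the path) answered with a TCP reset.
        return "refused"
    except socket.timeout:
        # No answer at all: packets may be silently dropped along the path.
        return "timeout"
    except OSError as err:
        # Anything else, e.g. "no route to host" or "network unreachable".
        return f"error: {err}"

# Example (run from the user's PC): tcp_port_check("192.168.2.33", 80)
```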
Session Layer: A session layer issue might involve trying to access a secure (HTTPS) site using plain HTTP.
Presentation Layer: In our case, a Layer 6 issue might be trying to access a website that Internet Explorer does not support, or requesting a .php file instead of an .html file.
Application Layer: The application layer deals with the application itself. An example of an application layer problem might be that the web server software (for example, IIS or Apache) is not running on 192.168.2.33.
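To check the application layer directly, you can ask the web server for a page and look at the HTTP status code. This minimal sketch uses Python’s standard `urllib.request`; the URL and the helper name `http_status` are illustrative. Getting any status code back (even 404 or 500) shows that a web server process is answering, while `None` means no HTTP response was received at all.

```python
import urllib.error
import urllib.request
from typing import Optional

def http_status(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code the server replies with, or None if no HTTP reply arrives."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The web server is running but answered with an error (404, 500, ...).
        return err.code
    except OSError:
        # URLError is a subclass of OSError: refused connection, timeout, DNS failure, etc.
        return None

# Example (run from the user's PC): http_status("http://192.168.2.33/")
```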
By breaking the problem down using the OSI model, we have identified about 16 possible causes, and this list is not exhaustive! The goal is to isolate the source of the problem as quickly as possible so we can start fixing it. So how do we figure out which of these is the real cause?
The first recommendation is to use a process of elimination. Again, a solid understanding of network concepts is required to isolate faults. Looking at our scenario, some considerations for elimination might include:
* If the user can access other servers on the network, then the issue is NOT likely to be with the user’s physical cable or default gateway.
* If other users (in the same VLAN) can access the web server, then the issue is also unlikely to be with the web server’s physical connection or even the routing.
* If the user can ping the web server, then ICMP traffic is getting through in both directions, which validates that Layers 1 to 3 are working between the two devices.
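The elimination logic in the bullets above can be written down as a small rule set. This is a hypothetical sketch: the candidate list is a simplified subset of the causes discussed earlier, the function name `remaining_suspects` is our own, and the three boolean inputs correspond to the three checks in the bullets.

```python
# Hypothetical, simplified candidate list: cause -> OSI layer number.
CANDIDATES = {
    "user cable / NIC": 1,
    "server cable / NIC": 1,
    "user default gateway": 3,
    "inter-VLAN routing": 3,
    "ACL blocking TCP 80": 4,
    "web service not running": 7,
}

def remaining_suspects(user_reaches_other_servers: bool,
                       peers_reach_web_server: bool,
                       user_can_ping_server: bool) -> list:
    """Apply the three elimination rules and return the causes still in play."""
    suspects = set(CANDIDATES)
    if user_reaches_other_servers:
        # The user's own cable and default gateway must be working.
        suspects -= {"user cable / NIC", "user default gateway"}
    if peers_reach_web_server:
        # The server's connection and routing between the VLANs must be working.
        suspects -= {"server cable / NIC", "inter-VLAN routing"}
    if user_can_ping_server:
        # ICMP gets through, so Layers 1 to 3 are fine between the two hosts.
        suspects -= {cause for cause, layer in CANDIDATES.items() if layer <= 3}
    return sorted(suspects)

print(remaining_suspects(True, True, True))
# ['ACL blocking TCP 80', 'web service not running']
```

With all three checks passing, only the Layer 4 and Layer 7 candidates remain, which tells you where to look next.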
By using the elimination method, a network engineer can narrow the problem down to a few possible causes and then try to resolve them. However, for proper network diagnostics, you should also be familiar with some common network troubleshooting tools. Some of the basic diagnostic tools available to you as a network engineer include:
The operating system command line: Whether you are troubleshooting on a Windows PC, a Mac or a Linux box, you should be familiar with the basic commands that can help you gather more information about the device you are working on. On a Windows PC, commands like ipconfig, ping, tracert (traceroute on Linux and macOS), netstat, arp and nslookup are very useful for diagnosing network issues from the command line. For a comprehensive list of network diagnostic commands on Windows, you can refer to the reference on Microsoft’s website, which can be found here. A more compact list can also be found here.
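If you want to script these checks, note that the same tools take different flags on different systems: Windows uses `ping -n <count>` and `tracert`, while Linux and macOS use `ping -c <count>` and `traceroute`. The sketch below (helper names are our own) builds the right command for the current platform and runs ping through `subprocess`. Treat the exit code as a first hint only; exit-code behavior varies between platforms (on Windows, for instance, a "Destination host unreachable" reply from a router may still produce exit code 0), so read the output as well.

```python
import platform
import subprocess
from typing import List, Optional

def ping_command(host: str, count: int = 4, system: Optional[str] = None) -> List[str]:
    """Build a ping command line; Windows uses -n for the count, Linux/macOS use -c."""
    system = system or platform.system()
    flag = "-n" if system == "Windows" else "-c"
    return ["ping", flag, str(count), host]

def traceroute_command(host: str, system: Optional[str] = None) -> List[str]:
    """Windows ships tracert; Linux/macOS use traceroute (it may need installing on Linux)."""
    system = system or platform.system()
    return ["tracert", host] if system == "Windows" else ["traceroute", host]

def ping_succeeds(host: str, count: int = 4) -> bool:
    """Run ping and treat exit code 0 as success (a first hint only; read the output too)."""
    result = subprocess.run(ping_command(host, count), capture_output=True, text=True)
    return result.returncode == 0

# Example (run from the user's PC): ping_succeeds("192.168.2.33")
```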
You should also be familiar with remote access options for managing network devices. You can access the VTY lines of a network device remotely via Telnet or SSH, depending on what you have configured. Windows includes a Telnet client, although in Windows Vista and later you might need to enable it under Programs and Features. For SSH, you would need an external terminal program like PuTTY.
The show commands on the Cisco CLI are extremely important, so practice them a lot and learn to interpret their output. Debugs are also a great tool for troubleshooting network issues; however, you should manage them carefully so that their output does not overwhelm the command line. For more information on debugs, please check out this Intense School article.
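As an example of interpreting show command output, `show ip interface brief` lists each interface with a Status column (Layer 1, plus whether the interface has been shut down) and a Protocol column (the Layer 2 line protocol). Anything other than up/up deserves a closer look: down/down usually points to a physical problem, up/down to a Layer 2 problem, and "administratively down" to a `shutdown` command. The sketch below parses that output in Python; the sample output and interface addresses are hypothetical, and the parser assumes the standard six-column IOS layout.

```python
# Hypothetical output; the column layout follows Cisco IOS "show ip interface brief".
SAMPLE_OUTPUT = """\
Interface              IP-Address      OK? Method Status                Protocol
FastEthernet0/0        unassigned      YES unset  up                    up
FastEthernet0/0.10     192.168.2.1     YES manual up                    up
FastEthernet0/1        203.0.113.2     YES manual administratively down down
"""

def parse_ip_int_brief(output: str) -> list:
    """Turn each interface row into a dict with interface, ip, status and protocol."""
    rows = []
    for line in output.splitlines():
        parts = line.split()
        # Skip blank lines, the header row and anything too short to be a data row.
        if len(parts) < 6 or parts[0] == "Interface":
            continue
        rows.append({
            "interface": parts[0],
            "ip": parts[1],
            # Status can be two words ("administratively down"), so join the middle.
            "status": " ".join(parts[4:-1]),
            "protocol": parts[-1],
        })
    return rows

def problem_interfaces(rows: list) -> list:
    """List every interface that is not up/up."""
    return [r["interface"] for r in rows if (r["status"], r["protocol"]) != ("up", "up")]

print(problem_interfaces(parse_ip_int_brief(SAMPLE_OUTPUT)))
# ['FastEthernet0/1']
```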
Network analysis software: Although most network issues can be diagnosed and resolved using command-line utilities, in some cases the traffic needs to be analyzed with network analysis software. Wireshark is a useful tool for this. I will not explore how to use Wireshark in this post because there is another post on this site dedicated to using it.
One last thing to mention about resolving network issues: sometimes a problem is the result of an underlying technology that is not functioning properly. For example, IP addresses might not be assigned because DHCP is not working, or a client might not be able to reach an internet address because the DNS server is not working. Again, when troubleshooting, it is important to be conscious of how the different technologies in a fully functional network depend on each other.
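Two quick checks for these underlying services can be sketched with Python’s standard library. A Windows client that fails to get a DHCP lease typically assigns itself an APIPA address from 169.254.0.0/16, so seeing one is a strong hint that DHCP is the real problem. Likewise, if a hostname does not resolve, the issue may be DNS rather than routing. The helper names `looks_like_dhcp_failure` and `resolve` are our own.

```python
import ipaddress
import socket
from typing import List

# IPv4 link-local range; Windows clients self-assign from it (APIPA) when DHCP fails.
APIPA_NETWORK = ipaddress.ip_network("169.254.0.0/16")

def looks_like_dhcp_failure(ip: str) -> bool:
    """True if the address came from APIPA rather than a DHCP lease."""
    return ipaddress.ip_address(ip) in APIPA_NETWORK

def resolve(hostname: str) -> List[str]:
    """Return the addresses the system resolver finds for hostname, or [] on failure."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    # Each entry's sockaddr starts with the IP address string; dedupe across socket types.
    return sorted({info[4][0] for info in infos})

# Example (run from the user's PC): looks_like_dhcp_failure("169.254.10.20"), resolve("example.com")
```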
This article tried to provide a guide to resolving networking issues in a systematic manner. Note that the suggestions made in this post are recommendations based on best practices and my personal experience. As you go along, you will discover a method that works for you. However, you do not want to be the network engineer who jumps into a network issue without a plan of attack. Know your options, use your diagnostic tools and, most importantly, have fun fixing the network issues you will encounter on your journey as a network engineer.