Tuesday, December 14, 2010

Switch Flooding 101 - Troubleshooting Case Study

Remember the first time you learned the basics of bridging? Dig deep in your memory and think back to the basics. With helpful verification from my co-workers and Aaron Conaway (on Twitter as @aconaway), I verified that some "crazy" behavior I saw today on our network was, in fact, "normal," albeit undesired.

I've been troubleshooting some very strange behaviors on our network lately. I suspect some (all?) of them have to do with our fairly old Cisco Catalyst 6500s with Sup2's and Sup1a's in our data center, as well as the dinosaur Catalyst 2948 access switches in our closets. There are times when our monitoring system throws alerts saying it can't ping certain devices. But minutes later, things return to normal. (Don't you just love intermittent problems?)  One tool that any good network engineer will consider when dealing with such a problem is a packet capture product such as the ever-popular Wireshark.

When I fired up Wireshark on my desktop computer, I had to filter through the muck to see what was going on. By "muck" I'm referring to the traffic I don't care about, such as the traffic my box is generating, as well as broadcast and multicast. I slowly added more and more exceptions to my capture filter (see below) to narrow the scope of my capture.

My Wireshark Capture Filter: not host [my IP address] and not host [directed broadcast for my subnet] and not broadcast and not host and not host and not host and not host and not host and not ether proto 0x0806 [for CDP] and not ether host 01:00:0c:cc:cc:cc [for HSRP] and not host and not host and not host and not host and not host and not stp and not host and not host

Once I filtered out enough to see more clearly, I noticed a TON of syslog (UDP 514) traffic destined for another host on my subnet. After scratching my head and consulting with co-workers, I started looking at the mac-address tables (or CAM tables). My upstream switch didn't have a CAM table entry for the mac address of the syslog server. Neither did it's upstream switch. In fact, the Cat 6500 directly connect to the syslog server didn't have a CAM table entry for it.

Checking the timeouts for the CAM table on one of the CatOS switches gave us this:
CatOS-Switch> (enable) sh cam agingtime

VLAN    1 aging time = 300 sec
VLAN    2 aging time = 300 sec
VLAN    9 aging time = 300 sec
VLAN   17 aging time = 300 sec
VLAN   18 aging time = 300 sec
VLAN   20 aging time = 300 sec
VLAN   21 aging time = 300 sec
VLAN   25 aging time = 300 sec

Similarly, the Cat6500 running Native IOS showed this:
NativeIOS-Switch#sh mac-address-table aging-time 
Vlan    Aging Time
----    ----------
Global  300
no vlan age other than global age configured

Apparently, this syslog server is so quiet, so stealthy, that it doesn't transmit ANY traffic for more than 5 minutes (300 sec) at a time. After 5 minutes, the CAM table entries timeout, and all traffic destined for that server gets flooded to every port in the VLAN throughout our trunked network.

One way to prevent the flooding would be to put static CAM table entries in all the affected switches. Perhaps an easier solution is to configure the syslog server to generate some traffic at least every 5 minutes or less.

I'm not sure if the flooding is causing the other strange behaviors we're seeing on our network, but this has been a good learning experience and reminder for me about the basics of Layer-2 networking.

Any other troubleshooting ideas you would use for a situation like this? Comment here and/or hit me up on Twitter (@swackhap).

1 comment:

  1. This is a typical scenario from the old CCIE written exam ;) Have to admit I always struggled to understand it completely, seeing it in a live network probably makes more sense :D

    This document might help: