I've been troubleshooting some very strange behaviors on our network lately. I suspect some (all?) of them have to do with our fairly old Cisco Catalyst 6500s with Sup2's and Sup1a's in our data center, as well as the dinosaur Catalyst 2948 access switches in our closets. There are times when our monitoring system throws alerts saying it can't ping certain devices. But minutes later, things return to normal. (Don't you just love intermittent problems?) One tool that any good network engineer will consider when dealing with such a problem is a packet capture product such as the ever-popular Wireshark.
When I fired up Wireshark on my desktop computer, I had to filter through the muck to see what was going on. By "muck" I'm referring to the traffic I don't care about, such as the traffic my box is generating, as well as broadcast and multicast. I slowly added more and more exceptions to my capture filter (see below) to narrow the scope of my capture.
My Wireshark Capture Filter (text in square brackets is a placeholder or a note, not part of the filter syntax): not host [my IP address] and not host [directed broadcast for my subnet] and not broadcast and not host 239.255.255.250 and not host 224.0.0.2 [HSRP] and not host 224.0.0.251 and not host 230.0.0.4 and not host 224.1.0.38 and not ether proto 0x0806 [ARP] and not ether host 01:00:0c:cc:cc:cc [CDP] and not host 228.7.6.9 and not host 224.0.1.60 and not host 224.0.0.1 and not host 224.0.0.252 and not stp and not host 224.0.0.13 and not host 224.0.0.22
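A filter this long gets hard to maintain by hand (it's easy to list the same address twice). Here is a minimal Python sketch of how the exclusions could be assembled from lists instead. The function name `build_capture_filter` is made up for illustration, and the sample list is only a subset of the addresses above, not my real filter.

```python
def build_capture_filter(excluded_hosts, excluded_ether_hosts=(), extra_terms=()):
    """Join exclusions into one libpcap-style capture filter string.

    excluded_hosts:       IP addresses to drop ("not host X")
    excluded_ether_hosts: MAC addresses to drop ("not ether host X")
    extra_terms:          bare primitives to negate, e.g. "broadcast", "stp"
    """
    # dict.fromkeys() removes duplicates while keeping the original order.
    hosts = list(dict.fromkeys(excluded_hosts))
    macs = list(dict.fromkeys(excluded_ether_hosts))

    terms = [f"not host {h}" for h in hosts]
    terms += [f"not ether host {m}" for m in macs]
    terms += [f"not {t}" for t in extra_terms]
    return " and ".join(terms)


# Illustrative subset of the exclusions from the filter above.
example_filter = build_capture_filter(
    ["224.0.0.2", "224.0.0.252", "224.0.0.252"],  # duplicate on purpose
    ["01:00:0c:cc:cc:cc"],
    ["broadcast", "stp"],
)
```

The resulting string can be pasted into Wireshark's capture filter box.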
Once I filtered out enough to see more clearly, I noticed a TON of syslog (UDP 514) traffic destined for another host on my subnet. After scratching my head and consulting with co-workers, I started looking at the MAC address tables (or CAM tables). My upstream switch didn't have a CAM table entry for the MAC address of the syslog server. Neither did its upstream switch. In fact, even the Cat 6500 directly connected to the syslog server didn't have a CAM table entry for it.
Checking the timeouts for the CAM table on one of the CatOS switches gave us this:
CatOS-Switch> (enable) sh cam agingtime
VLAN 1 aging time = 300 sec
VLAN 2 aging time = 300 sec
VLAN 9 aging time = 300 sec
VLAN 17 aging time = 300 sec
VLAN 18 aging time = 300 sec
VLAN 20 aging time = 300 sec
VLAN 21 aging time = 300 sec
VLAN 25 aging time = 300 sec
[snip]
Similarly, the Cat6500 running Native IOS showed this:
NativeIOS-Switch#sh mac-address-table aging-time
Vlan Aging Time
---- ----------
Global 300
no vlan age other than global age configured
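Apparently, this syslog server is so quiet, so stealthy, that it doesn't transmit ANY traffic for more than 5 minutes (300 sec) at a time. A switch only learns (and refreshes) a CAM table entry from the source MAC address of frames it receives, so a host that only listens never refreshes its own entry. After 5 minutes, the CAM table entries time out, and all traffic destined for that server gets flooded to every port in the VLAN throughout our trunked network.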
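To convince myself of the mechanism, here is a toy Python model of a learning switch with an aging CAM table. It is a simplified sketch, not how CatOS or IOS actually implement it: the class name `LearningSwitch`, the port numbers, and the host labels are all invented for illustration, and aging is checked only when a frame arrives.

```python
class LearningSwitch:
    """Toy Layer-2 switch: learns source MACs, ages them out, floods unknowns."""

    def __init__(self, ports, aging_time=300):
        self.ports = list(ports)
        self.aging_time = aging_time  # seconds, like the 300 sec default above
        self.cam = {}                 # MAC -> (port, time last seen as a source)

    def receive(self, src_mac, dst_mac, in_port, now):
        """Process one frame and return the list of ports it is sent out of."""
        # Drop entries that have not been refreshed within the aging time.
        self.cam = {
            mac: (port, seen)
            for mac, (port, seen) in self.cam.items()
            if now - seen < self.aging_time
        }
        # Learning happens on the SOURCE address only.
        self.cam[src_mac] = (in_port, now)

        entry = self.cam.get(dst_mac)
        if entry is None:
            # Unknown unicast: flood to every port except the one it came in on.
            return [p for p in self.ports if p != in_port]
        out_port = entry[0]
        # A frame whose destination lives on the ingress port is filtered.
        return [] if out_port == in_port else [out_port]


sw = LearningSwitch(ports=[1, 2, 3, 4])
first = sw.receive("server", "sender", in_port=2, now=0)    # server speaks once
fresh = sw.receive("sender", "server", in_port=1, now=10)   # entry still fresh
stale = sw.receive("sender", "server", in_port=1, now=400)  # entry has aged out
```

In this model `fresh` goes out port 2 only, while `stale` is flooded out ports 2, 3, and 4, because the server sent nothing between t=0 and t=400 to refresh its entry.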
One way to prevent the flooding would be to put static CAM table entries in all the affected switches. Perhaps an easier solution is to configure the syslog server to generate some traffic at least once every 5 minutes, and preferably more often, so its CAM entries never age out.
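For the "generate some traffic" option, something like the following Python sketch could run on the syslog server. This is an assumption-heavy illustration, not what we deployed: the broadcast destination, the discard port (UDP 9), the payload, and the 60-second interval are all my own illustrative choices. I picked a broadcast so that every switch in the VLAN sees the server's source MAC; a unicast to the default gateway would only refresh the switches along that path.

```python
import socket
import time

AGING_TIME = 300                 # seconds, matching the CAM aging time above
INTERVAL = AGING_TIME // 5       # send well inside the aging window (assumed margin)


def make_socket():
    """Create a UDP socket that is allowed to send broadcast datagrams."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    return sock


def send_heartbeat(sock, dest=("255.255.255.255", 9), payload=b"keepalive"):
    """Send one small datagram; returns the number of bytes sent."""
    return sock.sendto(payload, dest)


def run(interval=INTERVAL, iterations=None):
    """Send heartbeats forever, or a fixed number of times if iterations is set."""
    sock = make_socket()
    try:
        sent = 0
        while iterations is None or sent < iterations:
            send_heartbeat(sock)
            sent += 1
            # Only sleep if another heartbeat is still to come.
            if iterations is None or sent < iterations:
                time.sleep(interval)
    finally:
        sock.close()
```

Calling `run()` from a startup script or service would keep the server's MAC address fresh in the CAM tables at the cost of one tiny broadcast per minute.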
I'm not sure if the flooding is causing the other strange behaviors we're seeing on our network, but this has been a good learning experience and reminder for me about the basics of Layer-2 networking.
Any other troubleshooting ideas you would use for a situation like this? Comment here and/or hit me up on Twitter (@swackhap).
This is a typical scenario from the old CCIE written exam ;) Have to admit I always struggled to understand it completely, seeing it in a live network probably makes more sense :D
This document might help:
http://www.cisco.com/en/US/products/hw/switches/ps708/products_tech_note09186a00807347ab.shtml