Tuesday, December 14, 2010

Switch Flooding 101 - Troubleshooting Case Study

Remember the first time you learned the basics of bridging? Dig deep in your memory and think back to the basics. With helpful verification from my co-workers and Aaron Conaway (on Twitter as @aconaway), I verified that some "crazy" behavior I saw today on our network was, in fact, "normal," albeit undesired.

I've been troubleshooting some very strange behaviors on our network lately. I suspect some (all?) of them have to do with our fairly old Cisco Catalyst 6500s with Sup2's and Sup1a's in our data center, as well as the dinosaur Catalyst 2948 access switches in our closets. There are times when our monitoring system throws alerts saying it can't ping certain devices. But minutes later, things return to normal. (Don't you just love intermittent problems?)  One tool that any good network engineer will consider when dealing with such a problem is a packet capture product such as the ever-popular Wireshark.

When I fired up Wireshark on my desktop computer, I had to filter through the muck to see what was going on. By "muck" I'm referring to the traffic I don't care about, such as the traffic my box is generating, as well as broadcast and multicast. I slowly added more and more exceptions to my capture filter (see below) to narrow the scope of my capture.

My Wireshark Capture Filter: not host [my IP address] and not host [directed broadcast for my subnet] and not broadcast and not host and not host and not host and not host and not host and not ether proto 0x0806 [for CDP] and not ether host 01:00:0c:cc:cc:cc [for HSRP] and not host and not host and not host and not host and not host and not stp and not host and not host

Once I filtered out enough to see more clearly, I noticed a TON of syslog (UDP 514) traffic destined for another host on my subnet. After scratching my head and consulting with co-workers, I started looking at the mac-address tables (or CAM tables). My upstream switch didn't have a CAM table entry for the mac address of the syslog server. Neither did it's upstream switch. In fact, the Cat 6500 directly connect to the syslog server didn't have a CAM table entry for it.

Checking the timeouts for the CAM table on one of the CatOS switches gave us this:
CatOS-Switch> (enable) sh cam agingtime

VLAN    1 aging time = 300 sec
VLAN    2 aging time = 300 sec
VLAN    9 aging time = 300 sec
VLAN   17 aging time = 300 sec
VLAN   18 aging time = 300 sec
VLAN   20 aging time = 300 sec
VLAN   21 aging time = 300 sec
VLAN   25 aging time = 300 sec

Similarly, the Cat6500 running Native IOS showed this:
NativeIOS-Switch#sh mac-address-table aging-time 
Vlan    Aging Time
----    ----------
Global  300
no vlan age other than global age configured

Apparently, this syslog server is so quiet, so stealthy, that it doesn't transmit ANY traffic for more than 5 minutes (300 sec) at a time. After 5 minutes, the CAM table entries timeout, and all traffic destined for that server gets flooded to every port in the VLAN throughout our trunked network.

One way to prevent the flooding would be to put static CAM table entries in all the affected switches. Perhaps an easier solution is to configure the syslog server to generate some traffic at least every 5 minutes or less.

I'm not sure if the flooding is causing the other strange behaviors we're seeing on our network, but this has been a good learning experience and reminder for me about the basics of Layer-2 networking.

Any other troubleshooting ideas you would use for a situation like this? Comment here and/or hit me up on Twitter (@swackhap).

Friday, December 3, 2010

Splunk "host" Field Enhancement For Syslog-ng

We are very fortunate where I work to have Splunk. It's an incredibly powerful indexing tool that can "eat all your IT data" and report on it in many different ways. We mostly use it to do simple searches for troubleshooting, but we're always building more expertise as time permits.

Splunk is set up to index syslog messages very nicely by default. It takes each syslog message and intelligently recognizes the date/time stamp, then "extracts" all the fields and names them things like "host", "eventtype", "event_desc", "error_code", "log_level", and so on.  This post focuses on the "host" field, which is the IP address of the end device (router, switch, firewall, etc).

In our environment, we send all our syslogs to a Linux server running a free open-source tool called syslog-ng. With it, we do two things: (1) save a copy of each syslog message on the local server in a flat text file named for the source IP address where it came from, and (2) forward a copy to our Splunk indexing server using TCP port 9998.

For a while I’ve noticed that our Splunk lists all syslog messages with a “host” field that is the IP of the syslog-ng server. I was able to do some research this morning and “fixed” this so now all the syslog-ng forwarded messages have their host field set to the source IP address of their original sending device (router/switch/firewall).

Here’s how I did it:
1. Created props.conf file in /san/splunk/etc/system/local with the following contents
TRANSFORMS = syslog-header-stripper-ts-host syslog-host

2. Then restarted splunk with this command:
service splunk restart

Information sources I used:

Happy Splunking!