Tuesday, December 14, 2010

Switch Flooding 101 - Troubleshooting Case Study

Remember the first time you learned the basics of bridging? Dig deep in your memory and think back to the basics. With help from my co-workers and Aaron Conaway (on Twitter as @aconaway), I confirmed that some "crazy" behavior I saw today on our network was, in fact, "normal," albeit undesired.

I've been troubleshooting some very strange behaviors on our network lately. I suspect some (all?) of them have to do with our fairly old Cisco Catalyst 6500s with Sup2's and Sup1a's in our data center, as well as the dinosaur Catalyst 2948 access switches in our closets. There are times when our monitoring system throws alerts saying it can't ping certain devices. But minutes later, things return to normal. (Don't you just love intermittent problems?)  One tool that any good network engineer will consider when dealing with such a problem is a packet capture product such as the ever-popular Wireshark.

When I fired up Wireshark on my desktop computer, I had to filter through the muck to see what was going on. By "muck" I'm referring to the traffic I don't care about, such as the traffic my box is generating, as well as broadcast and multicast. I slowly added more and more exceptions to my capture filter (see below) to narrow the scope of my capture.


My Wireshark Capture Filter: not host [my IP address] and not host [directed broadcast for my subnet] and not broadcast and not host 239.255.255.250 and not host 224.0.0.2 [for HSRP] and not host 224.0.0.251 and not host 230.0.0.4 and not host 224.1.0.38 and not ether proto 0x0806 [for ARP] and not ether host 01:00:0c:cc:cc:cc [for CDP] and not host 224.0.0.252 and not host 228.7.6.9 and not host 224.0.1.60 and not host 224.0.0.1 and not stp and not host 224.0.0.13 and not host 224.0.0.22
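A filter this long is easy to get wrong by hand, so here is a minimal sketch of how it could be assembled from lists instead. The function name and the example addresses (192.0.2.x) are made up for illustration; the only real thing here is the capture-filter syntax already used above.

```python
def build_capture_filter(my_ip, subnet_broadcast, noisy_hosts, extra_terms):
    """Join a set of exclusions into one capture-filter string."""
    terms = [
        f"not host {my_ip}",
        f"not host {subnet_broadcast}",
        "not broadcast",
    ]
    # dict.fromkeys() drops repeated addresses while keeping their order.
    terms += [f"not host {host}" for host in dict.fromkeys(noisy_hosts)]
    # Anything that is not a plain host match (EtherTypes, MACs, stp, ...).
    terms += [f"not {term}" for term in extra_terms]
    return " and ".join(terms)


if __name__ == "__main__":
    # Hypothetical addresses; substitute your own IP and subnet broadcast.
    print(build_capture_filter(
        "192.0.2.10",
        "192.0.2.255",
        ["224.0.0.2", "224.0.0.251", "224.0.0.252", "224.0.0.252"],
        ["ether proto 0x0806", "ether host 01:00:0c:cc:cc:cc", "stp"],
    ))
```

Keeping the noisy addresses in a list makes it simple to add one more exclusion each time something new shows up in the capture.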

Once I filtered out enough to see more clearly, I noticed a TON of syslog (UDP 514) traffic destined for another host on my subnet. After scratching my head and consulting with co-workers, I started looking at the mac-address tables (or CAM tables). My upstream switch didn't have a CAM table entry for the MAC address of the syslog server. Neither did its upstream switch. In fact, the Cat 6500 directly connected to the syslog server didn't have a CAM table entry for it.

Checking the timeouts for the CAM table on one of the CatOS switches gave us this:
CatOS-Switch> (enable) sh cam agingtime

VLAN    1 aging time = 300 sec
VLAN    2 aging time = 300 sec
VLAN    9 aging time = 300 sec
VLAN   17 aging time = 300 sec
VLAN   18 aging time = 300 sec
VLAN   20 aging time = 300 sec
VLAN   21 aging time = 300 sec
VLAN   25 aging time = 300 sec
[snip]

Similarly, the Cat6500 running Native IOS showed this:
NativeIOS-Switch#sh mac-address-table aging-time 
Vlan    Aging Time
----    ----------
Global  300
no vlan age other than global age configured

Apparently, this syslog server is so quiet, so stealthy, that it doesn't transmit ANY traffic for more than 5 minutes (300 sec) at a time. After 5 minutes, the CAM table entries time out, and all traffic destined for that server gets flooded to every port in the VLAN throughout our trunked network.
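To make the mechanism concrete, here is a toy model of a learning switch with an aging timer. It is only a sketch: the class, port numbers, and MAC names are invented for the example, and a real switch does far more than this.

```python
class LearningSwitch:
    """Toy Layer-2 switch: learns source MACs and ages them out."""

    def __init__(self, ports, aging_time=300):
        self.ports = list(ports)
        self.aging_time = aging_time
        self.cam = {}  # mac -> (port, time the MAC was last seen as a source)

    def receive(self, now, in_port, src_mac, dst_mac):
        """Handle one frame and return the list of ports it is sent out of."""
        # Learn (or refresh) the sender's location.
        self.cam[src_mac] = (in_port, now)
        # Drop entries that have not been refreshed within the aging time.
        self.cam = {
            mac: (port, seen)
            for mac, (port, seen) in self.cam.items()
            if now - seen < self.aging_time
        }
        entry = self.cam.get(dst_mac)
        if entry is None:
            # Unknown unicast: flood out of every port except the ingress port.
            return [p for p in self.ports if p != in_port]
        out_port = entry[0]
        # A frame whose destination lives on the ingress port is filtered.
        return [] if out_port == in_port else [out_port]


if __name__ == "__main__":
    sw = LearningSwitch(ports=[1, 2, 3, 4])
    sw.receive(0, 1, "syslog-server", "router")            # server talks once
    print(sw.receive(10, 2, "router", "syslog-server"))    # entry still fresh
    print(sw.receive(400, 2, "router", "syslog-server"))   # entry aged out
```

In this model the server only receives after t=0, so its entry is never refreshed and every frame sent to it after the 300-second mark is flooded.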

One way to prevent the flooding would be to put static CAM table entries in all the affected switches. Perhaps an easier solution is to configure the syslog server to generate some traffic more often than once every 5 minutes, so its CAM entries never age out.
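The "generate some traffic" option could look something like the sketch below, run on the syslog server itself. Everything specific here is an assumption: the destination address, the use of UDP port 9 (discard), and the 50% safety factor. Note also that a switch only refreshes its entry if the frame actually passes through it, so the destination has to be chosen with the trunk topology in mind.

```python
import socket
import time


def keepalive_interval(aging_time_s, safety_factor=0.5):
    """Pick a send interval comfortably shorter than the CAM aging time."""
    if not 0 < safety_factor < 1:
        raise ValueError("safety_factor must be between 0 and 1")
    return aging_time_s * safety_factor


def send_keepalive(dest_ip, dest_port=9):
    """Send one small UDP datagram so switches on the path relearn our MAC."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(b"cam-keepalive", (dest_ip, dest_port))


def run(dest_ip, aging_time_s=300):
    """Loop forever, sending a keepalive well inside each aging window."""
    interval = keepalive_interval(aging_time_s)
    while True:
        send_keepalive(dest_ip)
        time.sleep(interval)


if __name__ == "__main__":
    # Hypothetical destination; pick a host reached through the affected switches.
    run("192.0.2.1")
```

With the default 300-second aging time this sends one datagram every 150 seconds.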

I'm not sure if the flooding is causing the other strange behaviors we're seeing on our network, but this has been a good learning experience and reminder for me about the basics of Layer-2 networking.

Any other troubleshooting ideas you would use for a situation like this? Comment here and/or hit me up on Twitter (@swackhap).

Friday, December 3, 2010

Splunk "host" Field Enhancement For Syslog-ng

We are very fortunate where I work to have Splunk. It's an incredibly powerful indexing tool that can "eat all your IT data" and report on it in many different ways. We mostly use it to do simple searches for troubleshooting, but we're always building more expertise as time permits.

Splunk is set up to index syslog messages very nicely by default. It takes each syslog message and intelligently recognizes the date/time stamp, then "extracts" all the fields and names them things like "host", "eventtype", "event_desc", "error_code", "log_level", and so on.  This post focuses on the "host" field, which is the IP address of the end device (router, switch, firewall, etc).

In our environment, we send all our syslogs to a Linux server running a free open-source tool called syslog-ng. With it, we do two things: (1) save a copy of each syslog message on the local server in a flat text file named for the source IP address where it came from, and (2) forward a copy to our Splunk indexing server using TCP port 9998.

For a while I’ve noticed that our Splunk lists all syslog messages with a “host” field that is the IP of the syslog-ng server. I was able to do some research this morning and “fixed” this so now all the syslog-ng forwarded messages have their host field set to the source IP address of their original sending device (router/switch/firewall).

Here’s how I did it:
1. Created props.conf file in /san/splunk/etc/system/local with the following contents
[source::tcp:9998]
TRANSFORMS = syslog-header-stripper-ts-host syslog-host

2. Then restarted splunk with this command:
service splunk restart
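Conceptually, a host-extraction step like this pulls the original sender out of the syslog header instead of trusting the address of the box that forwarded the message. The sketch below shows the idea in Python. The regular expression is my own simplification, not the pattern Splunk ships, and it assumes each forwarded line starts with a timestamp followed by the original device's address. The sample line is made up.

```python
import re

# Simplified syslog header: "Mmm dd hh:mm:ss <host> <message>".
SYSLOG_HEADER = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2}\s+\d{1,2}\s\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+(?P<msg>.*)$"
)


def split_syslog_header(line):
    """Return (host, message); host is None if the header is not recognized."""
    match = SYSLOG_HEADER.match(line)
    if match is None:
        return None, line
    return match.group("host"), match.group("msg")


if __name__ == "__main__":
    sample = "Dec  3 09:15:02 192.0.2.10 %LINK-3-UPDOWN: Interface Gi0/1, changed state to down"
    host, msg = split_syslog_header(sample)
    print(host)
    print(msg)
```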

Happy Splunking!

Thursday, November 11, 2010

Solarwinds Orion Network Performance Monitor Bug

I am *scary* good at finding bugs in software. Just ask the Cisco TAC. Or in today's case, ask Solarwinds support. This is a duplicate posting that I've also added to Solarwinds' Thwack.com user community site. If you use Orion NPM and send SNMP traps to another network management tool, READ AND HEED.

Thwack Post Title: NPM 10.0.0 SP1 Bug: Alert Action To Send SNMP Traps Actually BROADCASTS On Local Network

Many thanks to Mariusz from the Support team for helping me pin this down. I wanted to share with all since this might be happening under your nose!

We have Orion NPM 10.0.0 SP1 and have the "Alert me when a node goes down" alert configured with two trigger actions:

  1. Log Alert to NetPerfMon Event Log
  2. Send SNMP Trap to two hosts (Microsoft Operations Manager and Orion NCM).
A DBA told me earlier today that he noticed a server was receiving traps from our Orion poller. He noticed this in that server's Event Viewer Application Log.

With help from Mariusz and Wireshark, we found that the Orion NPM poller was actually broadcasting SNMP traps to 255.255.255.255! It seems that the workaround is to create a different trigger action for each SNMP Trap destination.  In other words, we changed our trigger actions to this:
  1. Log Alert to NetPerfMon Event Log
  2. Send SNMP Trap to Microsoft Operations Manager
  3. Send SNMP Trap to Orion NCM
As a matter of fact, for each additional valid IP destination we added to the trigger action, it appears that the Orion poller actually generated duplicate broadcasts for each SNMP trap.

If you use this feature of Orion, I recommend you check your settings and maybe run Wireshark on your poller to be sure you're not spewing broadcasts out to your entire server subnet.
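If you export the destination address and port of each captured packet, a few lines of Python can flag the offenders. This is a rough sketch: the input format (a list of address/port pairs) and the example subnet are assumptions, and UDP 162 is the standard SNMP trap port.

```python
import ipaddress

SNMP_TRAP_PORT = 162
LIMITED_BROADCAST = ipaddress.ip_address("255.255.255.255")


def find_broadcast_traps(packets, local_network):
    """Return the (dst_ip, dst_port) pairs that are traps sent to a broadcast address."""
    network = ipaddress.ip_network(local_network)
    suspicious = []
    for dst_ip, dst_port in packets:
        addr = ipaddress.ip_address(dst_ip)
        # Catch both the limited broadcast and the subnet's directed broadcast.
        is_broadcast = addr == LIMITED_BROADCAST or addr == network.broadcast_address
        if dst_port == SNMP_TRAP_PORT and is_broadcast:
            suspicious.append((dst_ip, dst_port))
    return suspicious


if __name__ == "__main__":
    # Hypothetical capture summary from the poller's subnet.
    capture = [
        ("192.0.2.20", 162),
        ("255.255.255.255", 162),
        ("255.255.255.255", 514),
    ]
    print(find_broadcast_traps(capture, "192.0.2.0/24"))
```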

Mariusz is filing this as a bug, and I'm not sure which versions of Orion are affected. Feel free to add your comments to this thread.


http://thwack.com/forums/48/orion-family/9/network-performance-monitor/28193/npm-1000-sp1-bug-alert-acti/#118327

Friday, October 15, 2010

The Case of the Mysterious Disappearing VPN

Many of us in the networking world use IPSEC VPNs over the Internet. The ISP connection is, or at least can be, cheaper than alternatives like MPLS, and of course we all need to connect our networks to the Internet (unless you're the DoD, CIA, or some other secretive organization with a classified network). This mystery begins with a VPN outage.

Refer to the reference network shown below.  For these two sites, the primary connectivity is the IPSEC VPN over the Internet. The MPLS VPN is a secondary connection.

Problem: IPSEC VPN Down
At 2:44am CT the primary 10Mbps IPSEC VPN went down, but the 3Mbps MPLS worked flawlessly after route reconvergence.  As the day progressed, the level of traffic between the two sites increased and began causing performance problems for users at Site B.

As we continued to troubleshoot what had happened, we found this syslog entry in Splunk that came from FW A:

Oct 14 02:44:33 fw.fw.fw.21 Oct 14 2010 02:44:33: %ASA-4-106023: Deny protocol 47 src inside:a.a.a.1 dst outside:b.b.b.254 by access-group "inside_access_in"
(Note: IP addresses have been changed here for security reasons.)

Nobody had made any changes at 2:44am. So what changed? After digging some more into our change management system, we found this change to FW A that was made back on 9/23:

Last Month - 9/23/2010 12:00:18 AM
ADDS 0, DELETES 0, CHANGES 1
BEFORE: access-list inside_access_in extended permit gre host a.a.a.1 host b.b.b.254
AFTER:  access-list inside_access_in extended permit gre host a.a.a.254 host b.b.b.254

This change was logged during a nightly config backup/compare, hence the midnight timestamp. It turns out that day we added another VPN that connects from another site (we'll call it Site C) back to Site A.  For that VPN, we chose to use a.a.a.254 as the GRE endpoint on RTR A. We prefer to use .1 addresses to manage routers, and with .1 as a GRE endpoint we can't ping it.  Unfortunately, we didn't realize the other VPN to Site B was still using a.a.a.1 and was active.  Apparently, the IPSEC security association (SA) remained active, as did the stateful firewall connection in FW A, until 2:44am CT.  So we ask ourselves again: What changed at that time?

Splunk to the Rescue
Diving more into the logs that we index with Splunk, we found visually when the problem started--it's where the histogram suddenly goes from 17 events per hour to over 1500.
Clicking on the 2AM timeframe brings up many iterations of the "Deny protocol 47" message that was shown above. Immediately prior to that stream of messages we see these three events:
  • Oct 14 02:44:26 fw.fw.fw.21 Oct 14 2010 02:44:26: %ASA-3-713123: Group = [FW B InternetIP], IP = [FW B InternetIP], IKE lost contact with remote peer, deleting connection (keepalive type: DPD)
  • Oct 14 02:44:26 fw.fw.fw.21 Oct 14 2010 02:44:26: %ASA-5-713259: Group = [FW B InternetIP], IP = [FW B InternetIP], Session is being torn down. Reason: Lost Service
  • Oct 14 02:44:26 fw.fw.fw.21 Oct 14 2010 02:44:26: %ASA-4-113019: Group = [FW B InternetIP], Username = [FW B InternetIP], IP = [FW B InternetIP], Session disconnected. Session Type: IPsec, Duration: 21d 15h:00m:15s, Bytes xmt: 181785169, Bytes rcv: 3049561298, Reason: Lost Service
Correct me if I'm wrong, but it appears there may have been some connectivity problem on the Internet that happened just long enough for dead-peer-detection (DPD) to take effect and tear down the existing session. When that happened, a new IPSEC SA was created, still using the GRE endpoint of a.a.a.1. Since the firewall was previously changed to allow a.a.a.254 instead of a.a.a.1, this traffic got denied on the inside interface of FW A and prevented the GRE tunnel from coming up.
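The Duration field in that last log message makes it possible to work back to when the IPSEC session began. Here is a small sketch of the arithmetic. The duration format is inferred from that single log line, so treat the regular expression as an assumption.

```python
import re
from datetime import datetime, timedelta

# Matches e.g. "Duration: 21d 15h:00m:15s"; the day part is treated as optional.
DURATION_RE = re.compile(r"Duration: (?:(\d+)d )?(\d+)h:(\d+)m:(\d+)s")


def parse_asa_duration(text):
    """Extract the session duration from a log line as a timedelta."""
    match = DURATION_RE.search(text)
    if match is None:
        raise ValueError("no Duration field found")
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def session_start(end_time, log_line):
    """Subtract the logged duration from the disconnect time."""
    return end_time - parse_asa_duration(log_line)


if __name__ == "__main__":
    line = "Session Type: IPsec, Duration: 21d 15h:00m:15s, Bytes xmt: 181785169"
    print(session_start(datetime(2010, 10, 14, 2, 44, 26), line))
```

For the disconnect at 02:44:26 on Oct 14, this puts the session start at 11:44:11 on Sep 22, the day before the nightly backup recorded the ACL change.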

To fix, we added a rule to FW A allowing GRE from a.a.a.1 to b.b.b.254.
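A check like the one below might have caught the mismatch before the tunnel dropped. It is a minimal sketch that only understands host-to-host "permit gre" lines in the form shown earlier; the function names and the tunnel list are made up for the example.

```python
import re

# Only matches lines shaped like:
#   access-list <name> extended permit gre host <src> host <dst>
ACL_RE = re.compile(r"^access-list (\S+) extended permit gre host (\S+) host (\S+)$")


def permitted_gre_pairs(config_lines, acl_name):
    """Collect the (src, dst) host pairs the named ACL permits for GRE."""
    pairs = set()
    for line in config_lines:
        match = ACL_RE.match(line.strip())
        if match and match.group(1) == acl_name:
            pairs.add((match.group(2), match.group(3)))
    return pairs


def missing_gre_rules(config_lines, acl_name, tunnels):
    """Return the tunnel endpoint pairs that have no matching permit line."""
    allowed = permitted_gre_pairs(config_lines, acl_name)
    return [pair for pair in tunnels if pair not in allowed]


if __name__ == "__main__":
    config = [
        "access-list inside_access_in extended permit gre host a.a.a.254 host b.b.b.254",
    ]
    # Placeholder addresses, matching the sanitized ones used in this post.
    tunnels = [("a.a.a.1", "b.b.b.254")]
    print(missing_gre_rules(config, "inside_access_in", tunnels))
```

Run against the post-change config, the Site B tunnel's endpoints come back as missing, which is exactly the rule we ended up adding.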

Mystery solved!

Thursday, October 14, 2010

Contacts Consolidation

I don't know about you, but I have contacts everywhere. I've got Exchange with Outlook at work, Google Contacts (to go along with Gmail and Google Voice), Facebook, Twitter, and LinkedIn.  There may be others, but last night I spent about 30 minutes pulling together all my current contacts from these sources. Here's how I did it:

  1. Outlook: Exported all contacts as a CSV file. Cleaned it up and imported into Google Contacts.
  2. Facebook: I found a post that explained how to use a Yahoo account to import Facebook contacts. I then exported as a CSV and, again, imported into Google Contacts.
  3. LinkedIn: Under the Contacts listing, there's an easy-to-use "Export Connections" link. Exported to CSV and, you guessed it, imported into Google Contacts.
  4. Twitter: Found a nice service called MyTweeple.com that has a handy tool to export all contacts to a CSV file. Imported into Google Contacts.
By now you see a pattern developing.  Since I use Gmail and Google Voice so heavily, Google Contacts is a natural repository for all my contacts.  It also allowed me to import custom column fields, like "TwitterName", so I have all my tweeps listed in my Google Contacts with their "twittername" as a Note attached to their details. 

Another great thing about Google Contacts is how well it finds and merges duplicate contacts. As you might guess, there are many people I follow on multiple social networks, so merging duplicates is a must for me.
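If you would rather merge the CSV exports yourself before importing, the idea is simple enough to sketch. This is not how Google Contacts does it; it is a minimal example that treats the e-mail address as the key, and the column names ("Name", "Email", "TwitterName") are assumptions about what your exports contain.

```python
import csv
import io


def merge_contacts(csv_texts):
    """Merge several CSV exports, combining rows that share an e-mail address."""
    merged = {}
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            key = (row.get("Email") or "").strip().lower()
            if not key:
                continue  # no e-mail address, nothing to match on
            entry = merged.setdefault(key, {})
            for field, value in row.items():
                if field is None or not value:
                    continue  # skip overflow columns and empty cells
                # The first non-empty value seen for a field wins.
                if not entry.get(field):
                    entry[field] = value.strip()
    return list(merged.values())


if __name__ == "__main__":
    outlook = "Name,Email\nAlice,alice@example.com\n"
    twitter = "Email,TwitterName\nALICE@example.com,@alice\n"
    print(merge_contacts([outlook, twitter]))
```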

How do you keep your contacts organized?

Find me on Twitter at @swackhap.

Tuesday, October 12, 2010

Who Said Catholics Don't Have A Sense Of Humor?

CATHOLIC GOLF

Catholic or not, you have to laugh at this one.

A Catholic priest and a nun were taking a rare afternoon off and enjoying a round of golf.

The priest stepped up to the first tee and took a mighty swing. He missed the ball entirely and said "Shit, I missed."

The good Sister told him to watch his language.

On his next swing, he missed again. "Shit, I missed."

"Father, I'm not going to play with you if you keep swearing," the nun said tartly.

The priest promised to do better and the round continued.

On the 4th tee, he misses again. The usual comment followed.

Sister is really mad now and says, "Father John, God is going to strike you dead if you keep swearing like that."

On the next tee, Father John swings and misses again. "Shit, I missed."

A terrible rumble is heard and a gigantic bolt of lightning comes out of the sky and strikes Sister Marie dead in her tracks.

read on

And from the sky comes a booming voice...

"Shit, I missed."

Monday, September 27, 2010

Google Is Great For More Than Just Searching

I've recently been discovering (or in some cases re-discovering) some of the awesome free stuff that Google has to offer. My Google Dashboard lights up like a Christmas tree now that I'm using so many of their tools. Here are a few that I've started (re)using lately.

Gmail - After looking at the web-based interface on and off for a while, I decided to take the leap. My primary e-mail address, which uses my own domain (swackhammer.net), automatically forwards all e-mail to my Gmail account. Advantages I love include speed, the ability to quickly search all e-mails for what I need, and integration with all my contacts.

Google Voice - I give out one number to everyone, then can customize what phone will ring and when based on who is calling me. Annoying call from recruiter or telemarketer? Just tell Google Voice to send them to voicemail. Or better yet, play a message that indicates your number is no longer in service. :-) And when you do get a voicemail, you can read a transcript of it via SMS or in your e-mail so you don't even have to listen to it. (Although some people's accents make for some very interesting transcripts.)

Google Contacts - Integration with Gmail and Google Voice--all your important contacts in one place, all easily reachable from any web browser.

Google Reader - RSS (Really Simple Syndication) feed-reader allows me to sign up for all the news and blogs I care about and read them at my leisure. I also use the NewsRack app on my iPhone which syncs with Google Reader. Any article I read on my iPhone gets marked as "read" so I won't waste time reading it a second time if I'm using Google Reader in a web browser.

Blogger - I've heard many people say they like WordPress better, but until I need features that WordPress offers, this works great for me.

Best of all, these services are FREE. I know, I know--you may be one of those people who hate Google and don't want them tracking your every move. I'm aware of my online footprint, and as a techie I fully understand that if someone really wants to find out more about me, they will anyway.

How do you use Google? What non-Google services do you love in place of these and why?