Monday, February 28, 2011

Scripting On-Demand Network Changes with SolarWinds Orion NCM

Getting called at 2am is never fun, even if you are the Network On-Call person. Any chance I get to prevent a call like that, I'll take it! In this case, there's a "failover pair" of servers, one in each data center (DC). Each server has a locally unique admin/replication IP address on one interface that is always active, and a second interface that shares the same IP address as the server in the other DC. Whichever server is active enables its highly-available (HA) interface while the other server's HA interface is disabled. We can then make network changes to routers and switches to "switch" the server from one DC to the other. And instead of my having to manually make those changes at 2am, we can script them with a configuration management tool. Our tool of choice is SolarWinds Orion Network Configuration Manager (NCM).

In this particular use of NCM, there are 5 individual NCM jobs, one for each device that must be touched. The changes include enabling/disabling switch ports and adding/removing route advertisements in EIGRP and BGP. Assume the names of the 5 jobs are AutoJob1a, AutoJob2a, ..., AutoJob5a. In addition, there are 5 jobs for the reverse direction named AutoJob1b, AutoJob2b, ..., AutoJob5b. Each of these jobs has an NCM Job ID associated with it, visible under the "Job ID" column when viewing Scheduled Jobs in the NCM GUI.
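
To give a flavor of what one of those jobs pushes, the "activate DC A" change on one of the switches might look something like this (a sketch only; the interface, EIGRP AS number, and network statement are made up for illustration):

     interface GigabitEthernet1/0/10
      description HA interface of failover server
      no shutdown
     !
     router eigrp 100
      network 10.50.1.0 0.0.0.255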

At this point, we've saved ourselves from having to log in to each of the devices individually to make the required changes. But we can take it a step further by combining all the jobs and launching them from a Windows Batch (.bat) file. On the NCM server we created the file d:\RemoteJobs\AutoJob-A.bat, which contains these 5 lines, one per NCM job:

"D:\Program Files\SolarWinds\Configuration Management\configmgmtjob.exe" "D:\Program Files\SolarWinds\Configuration Management\Jobs\Job-318696.ConfigMgmtJob"
"D:\Program Files\SolarWinds\Configuration Management\configmgmtjob.exe" "D:\Program Files\SolarWinds\Configuration Management\Jobs\Job-631858.ConfigMgmtJob"
"D:\Program Files\SolarWinds\Configuration Management\configmgmtjob.exe" "D:\Program Files\SolarWinds\Configuration Management\Jobs\Job-713828.ConfigMgmtJob"
"D:\Program Files\SolarWinds\Configuration Management\configmgmtjob.exe" "D:\Program Files\SolarWinds\Configuration Management\Jobs\Job-272305.ConfigMgmtJob"
"D:\Program Files\SolarWinds\Configuration Management\configmgmtjob.exe" "D:\Program Files\SolarWinds\Configuration Management\Jobs\Job-777458.ConfigMgmtJob"


Note that the Job ID for each job shows up in the name of the .ConfigMgmtJob file that is called in each line of the .bat file.
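
Incidentally, since the five lines differ only by Job ID, the batch file can be tightened up with a for loop (a sketch, assuming the same paths and Job IDs as above):

@echo off
rem AutoJob-A.bat: run the five "A"-direction NCM jobs in sequence
set NCM="D:\Program Files\SolarWinds\Configuration Management\configmgmtjob.exe"
set JOBDIR=D:\Program Files\SolarWinds\Configuration Management\Jobs
for %%J in (318696 631858 713828 272305 777458) do (
    %NCM% "%JOBDIR%\Job-%%J.ConfigMgmtJob"
)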

At this point, any monkey with a login to the NCM server could just double-click the .bat file to kick off those five NCM jobs. But there's a better way, at least in our environment: Tidal Scheduler. With a Tidal agent on the NCM server, Tidal can be configured to launch d:\RemoteJobs\AutoJob-A.bat, or the reverse d:\RemoteJobs\AutoJob-B.bat, on demand at the request of the Operator-On-Duty. This allows the event to be properly audited and standardizes the action required by the Operator.

In addition, we can configure each NCM job to generate an e-mail notification when it completes, so when all 5 have finished we get 5 e-mails showing exactly what commands were entered and the corresponding output from the router or switch that was modified. The e-mails can be sent to the Network team as well as the Operations team, giving them a clearer indication of success than simply a "completed job" message from Tidal.

In the end, instead of getting a wake-up call at 2am, the server admin team can now simply call the Operator-On-Duty and ask them to run the "AutoJob-A" or "AutoJob-B" job. The admins can then use a simple traceroute to confirm whether the network "thinks" the server is in DC A or DC B.
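
For example, a quick check from a workstation might look like this (hostnames and addresses invented for illustration; the giveaway is which DC's router shows up as the hop just before the HA address):

     C:\>tracert 10.50.1.25

     Tracing route to 10.50.1.25 over a maximum of 30 hops

       1    <1 ms    <1 ms    <1 ms  dca-core-1 [10.1.1.1]
       2     1 ms    <1 ms     1 ms  10.50.1.25

     Trace complete.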

Ahh, now I can go back to sleep. 

Zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz.

Wednesday, February 23, 2011

Juniper Launches QFabric To Compete In The Data Center

In case you missed it, Juniper held an official launch event at noon Central time (US) today for their new QFabric platform. What intrigued me the most was the claimed maximum end-to-end latency of 5 microseconds. Here are a couple of links that show the pump-up-the-adrenaline style of marketing they led the event with, as well as a 2:24 video with a brief explanation of the QFabric architecture. Also, here's an Infographic: The 7 Defining Characteristics of QFabric.

Based on what I know of the Brocade VCS/VDX platform, this sounds similar in many ways. But I'm certainly not a Brocade expert!

Did you watch the launch event? What was your take on it?

Here are some screenshots I took of the more technical slides (which I think most engineers agree are more interesting than the marketing hype).

Thursday, February 17, 2011

Life Without Caffeine

Title catch your attention? I thought so. Try to imagine it for a minute. I've been living it. Well, not quite, but I've been living on a limited caffeine intake since January 1, 2011, as part of my 100-day challenge. Before January 1, I would drink 6-8 Diet Cokes per day. Since then, just 1. What's worth giving up so much caffeine for, you ask? This:
The Apple iPad. Better yet, by the time my 100-day challenge is done on April 10 (but who's counting), there should be an iPad 2 available.

In this world of self-indulgence it's very rewarding, albeit quite challenging, to replace instant self-gratification with self-denial. It makes the prize that much sweeter in the end.

What will you give up for 100 days, and what will your reward be? Join me in the challenge!

Friday, February 4, 2011

Splunk Field Extraction and Report for Cisco AnyConnect VPN Failures

At the peak of Snowmageddon and Icemageddon this week, our remote-access VPN resources were getting some major exercise. Our office was even closed for a day, something that doesn't happen often. Our 100 simultaneous AnyConnect SSL VPN licenses on our Cisco ASA were used up by 9am three days in a row, preventing many people from getting connected. I mentioned in a previous post our secondary process, where we have users download and install the IPsec VPN client. But for those who know the products, that's not as convenient as AnyConnect.

After the fact, I was discussing options for increasing our remote-access VPN capacity, all of which require money. To justify the cost to the money holders, it's always useful to have data to back you up. So we started asking questions:
  • How many people had problems connecting to the VPN?
  • How many times were individual users failing to connect due to our license limit?
After some digging I was able to find the perfect ASA log entry:

     %ASA-4-716023: Group name User user Session could not be established: session limit of maximum_sessions reached.

In our case it looks more like this:

     %ASA-4-716023: Group <SSLVPNUsers> User <swackhap> IP <24.107.10.23> Session could not be established: session limit of 100 reached.

With our Splunk log analysis tool we were able to dig even deeper to analyze the data and get some good statistics to justify our request for added VPN capacity. Within Splunk, I first ran a search for the above log entry:
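
The search itself is nothing fancy; quoting the syslog message ID is enough (a minimal form, assuming the ASA is already sending its syslog to Splunk):

     "%ASA-4-716023"
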
So in this case you can see we had 1071 occurrences of that log entry. But how many people were affected? Splunk normally does a great job of extracting fields it considers useful, but in this case we want the actual userIDs, such as ea900503 and nbf shown above, and Splunk hasn't extracted them for us.

To extract a new field in Splunk, simply click on the small gray box with the downward-facing triangle to the left of the event, then select "Extract Fields" as shown below.
In the "Example values" box I typed the two sample userIDs and clicked Generate, but in this particular case Splunk failed to generate a regex. So I was forced to come up with one on my own.
After messing around with a free tool called RegExr, and after much wailing and gnashing of teeth, I was able to come up with a regular expression to extract the proper field:

     (?:Group <SSLVPNUsers> User <)(?P<AnyConnectUser>[^>]*)

In Splunk, I clicked the gray Edit button and entered my own regex, then saved the new field extraction.  Now we're able to see "AnyConnectUser" as an interesting field on the left side of the search screen. (You may have noticed it in earlier screenshots, since I had already created the field extraction before writing this blog post.)
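
As an aside, if you'd rather not save a permanent extraction, the same field can be pulled ad hoc with Splunk's rex command, reusing the regex above (a sketch):

     "%ASA-4-716023" | rex "Group <SSLVPNUsers> User <(?<AnyConnectUser>[^>]*)>"
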
Clicking on the "AnyConnectUser" field shows a list of the top 10 hits, including the number of occurrences for each. (Note that I've obfuscated many of the usernames for security.) But at this point we still don't know how many users had problems connecting (we just know it's more than 100). So we use some more Splunk magic: generate a report based on the search.
Clicking on "top values overall" brings up the report generation wizard.
After creating and saving the report, we can now get to it anytime from the main Search screen under the "Searches & Reports" drop-down menu:
Here's the finished product:
After scrolling down we can see a table of the raw data:
We can then go to the last page of the table, scroll to the bottom, and see the total number of users that had at least one failure connecting to the VPN:
We had 194 users experience VPN connection problems due to our existing license limit.
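
Incidentally, the same total can be pulled straight from the search bar without building a report, using Splunk's stats command to take a distinct count of the extracted field (the field name assumes the extraction above):

     "%ASA-4-716023" | stats dc(AnyConnectUser) AS affected_users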

Hit me up on Twitter (@swackhap) if you have questions or ideas on how to do this better.  Or leave a comment below.