Many of us in the networking world use IPSEC VPNs over the Internet. The ISP connection is, or at least can be, cheaper than alternatives like MPLS, and of course we all need to connect our networks to the Internet (unless you're the DoD, CIA, or some other secretive organization with a classified network). This mystery begins with a VPN outage.
Refer to the reference network shown below. For these two sites, the primary connectivity is the IPSEC VPN over the Internet. The MPLS VPN is a secondary connection.
Problem: IPSEC VPN Down
At 2:44am CT the primary 10Mbps IPSEC VPN went down, but the 3Mbps MPLS worked flawlessly after route reconvergence. As the day progressed, the level of traffic between the two sites increased and began causing performance problems for users at Site B.
As we continued to troubleshoot what had happened, we found this syslog entry in Splunk that came from FW A:
Oct 14 02:44:33 fw.fw.fw.21 Oct 14 2010 02:44:33: %ASA-4-106023: Deny protocol 47 src inside:a.a.a.1 dst outside:b.b.b.254 by access-group "inside_access_in"
(Note: IP addresses have been changed here for security reasons.)
Nobody had made any changes at 2:44am. So what changed? After digging some more into our change management system, we found this change to FW A that was made back on 9/23:
Last Month - 9/23/2010 12:00:18 AM
ADDS 0, DELETES 0, CHANGES 1
access-list inside_access_in extended permit gre host a.a.a.1 host b.b.b.254
access-list inside_access_in extended permit gre host a.a.a.254 host b.b.b.254
This change was logged during a nightly config backup/compare, thus the Midnight time listing. It turns out that day we added another VPN that connects from another site (we'll call it Site C) back to Site A. For that VPN, we chose to use a.a.a.254 as the GRE endpoint on RTR A. We prefer to use .1 addresses to manage routers, and with .1 as a GRE endpoint we can't ping it. Unfortunately, we didn't realize the other VPN to Site B was active. Apparently, the IPSEC security association (SA) remained active, as did the stateful firewall connection in FW A, until 2:44am CT. So we ask ourselves again: What changed at that time?
Splunk to the Rescue
Diving more into the logs that we index with Splunk, we found visually when the problem started--it's where the histogram suddenly goes from 17 events per hour to over 1500.
Clicking on the 2AM timeframe brings up many iterations of the "Deny protocol 47" message that was shown above. Immediately prior to that stream of messages we see these three events:
- Oct 14 02:44:26 fw.fw.fw.21 Oct 14 2010 02:44:26: %ASA-3-713123: Group = [FW B InternetIP], IP = [FW B InternetIP], IKE lost contact with remote peer, deleting connection (keepalive type: DPD)
- Oct 14 02:44:26 fw.fw.fw.21 Oct 14 2010 02:44:26: %ASA-5-713259: Group = [FW B InternetIP], IP = [FW B InternetIP], Session is being torn down. Reason: Lost Service
- Oct 14 02:44:26 fw.fw.fw.21 Oct 14 2010 02:44:26: %ASA-4-113019: Group = [FW B InternetIP], Username = [FW B InternetIP], IP = [FW B InternetIP], Session disconnected. Session Type: IPsec, Duration: 21d 15h:00m:15s, Bytes xmt: 181785169, Bytes rcv: 3049561298, Reason: Lost Service
Correct me if I'm wrong, but it appears there may have been some connectivity problem on the Internet that happened just long enough for dead-peer-detection (DPD) to take effect and tear down the existing session. When that happened, a new IPSEC SA was created, still using the GRE endpoint of a.a.a.1. Since the firewall was previously changed to allow a.a.a.254 instead of a.a.a.1, this traffic got denied on the inside interface of FW A and prevented the GRE tunnel from coming up.
To fix, we added a rule to FW A allowing GRE from a.a.a.1 to b.b.b.254.