I really don’t know how well I’ll be able to organize this so that someone other than me can follow. Hopefully you won’t be too lost by the end. There’s a lot I intend to cover in a relatively short amount of text. Or maybe not so short. It involves an issue I ran into this week, and my thoughts on the overall weakness of any attempt to test or teach troubleshooting.
Anyway, let’s start with the problem.
“Hey Eric, I’m Seeing Freezes On [Certain Servers].”
It was an interesting start to the week, having run into an issue with a change I had made over the weekend. Unfortunately, when something like that happens, you tend to assume that the next problem has the same core cause. In this case, not so much.
Having previously run into issues with the subtle differences between IPv4 and IPv6 when accessing the servers in question myself, I at least managed to narrow the issue down to the IPv6 side. IPv4 connectivity in both directions seemed solid, but a ping6 from the server I was focusing on back to my workstation showed varying levels of packet loss.
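As a sketch of what that looked like (addresses invented for illustration, output trimmed), the contrast was hard to miss: plain IPv4 ping was clean, while ping6 to the same box dropped packets:

```
server$ ping6 -c 10 2001:db8:100::25
...
10 packets transmitted, 6 received, 40% packet loss
```

Not a subtle symptom once you look at the right protocol.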
First, a quick review of our current network layout: it’s a 2/3/2 core/distribution/access design, with a “core” VLAN (switched by the core routers) where all of the distribution and core routers exchange routes. There are also a fair number of VLANs crossing the core (something we’re in the process of changing), which can cause certain issues … not the least of which is “where the heck is this VLAN routed out of?!” We use OSPF as well, which may have in some small way contributed to the strangeness of the issue.
Anyway, it (sadly) took three hours of packet captures to finally figure out where the missing ping6 packets were going: on a particular Nexus 7000, they would all show up on the VLAN the servers were on, but not all of them would show up on the “core VLAN” that the 7000 routed them onto.
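The captures themselves were nothing exotic. If you’re pulling SPAN traffic off to a Linux box (one approach; your tooling may differ), a tcpdump filter along these lines narrows things down to just the ICMPv6 echoes in question (address hypothetical):

```
$ tcpdump -n -i eth1 'icmp6 and ip6 host 2001:db8:100::25'
```

Run it against the capture points on either side of the suspect hop and compare what makes it through.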
“Um. Where Are They Going?”
Doing a “show ipv6 route” indicated that there were two load-balanced routes on the 7000 to the destination: the two HSRP’d routers responsible for the destination VLAN. For those of you not familiar with how OSPFv3 works (specifically with IPv6, because, yes Virginia, you can theoretically use OSPFv3 with IPv4 if it’s supported by the hardware), the next-hop addresses for an IPv6 route are link-local addresses. We had some administrative sanity here: we manually set the link-local addresses on all routers so that they are more easily identifiable.
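A rough sketch of what that looked like (prefixes and link-locals invented, output trimmed to the shape NX-OS gives you): two equal-cost next hops, both link-local, pointing at the HSRP pair.

```
switch# show ipv6 route 2001:db8:50::/64
2001:db8:50::/64, ubest/mbest: 2/0
    *via fe80::50:1, Vlan999, [110/41], 2d11h, ospfv3-1, intra
    *via fe80::50:2, Vlan999, [110/41], 2d11h, ospfv3-1, intra
```

With manually assigned link-locals like these, you can actually tell at a glance which physical router each next hop is.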
Don’t Make Assumptions.
I made one big one here: if the router can see the route (a Layer 3 thing), it should have a path to that route’s next hop (a Layer 2 thing). Not having any clue as to why in the hell packets would just not show up on the core VLAN, I gave up and cried to my local CCIE.
I Shouldn’t Have Stopped Digging
A simple “show ipv6 neighbors” on the core VLAN would have indicated the issue: one of the destination routers did not have a neighbor entry. For those who don’t speak IPv6, this is akin to having a route in the routing table but no ARP entry for the next hop’s IP address.
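Something like this (addresses invented, output format approximate): only one of the two next hops from the route table has an entry.

```
switch# show ipv6 neighbor vlan 999
IPv6 Adjacency Table for VRF default
Total number of entries: 1
Address        Age    MAC Address      Pref  Source  Interface
fe80::50:1     00:02  0026.9807.95c2   50    icmpv6  Vlan999
```

fe80::50:2 is nowhere to be seen, so half the load-balanced packets had a perfectly good route and no Layer 2 address to send them to.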
We’re still not sure why this happened. In an ideal universe, the darn router should have sent a Neighbor Solicitation for the address as soon as it decided it should send a packet to that address. It would also see multicasts from the address … but since those multicasts were destined for the DR/BDR of the OSPF routers, it ignored them rather than adding the neighbor information to its IPv6 neighbor table. Yet it would happily take an update from the DR saying “hey, there’s a connection to that IPv6 link-local address that you can send packets to,” even though there was no entry in its neighbor table.
We managed to put a band-aid on it before fully diagnosing the issue: if you used ping6 on the errant router to ping the link-local address of the missing destination router, it would go through the proper Neighbor Solicitation process and (apparently) keep the neighbor in its table properly. We also did a supervisor switchover later on, as we’ve run into other IPv6 issues on the Nexus line and that tends to fix those as well.
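The band-aid itself was a one-liner. Since link-local addresses are ambiguous without a scope, you have to tell the router which interface to solicit on; exact syntax varies by platform and software version, but it was roughly:

```
switch# ping6 fe80::50:2 interface vlan 999
```

Then re-check the neighbor table to confirm the entry is there and stays there.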
And Now, Why Troubleshooting Is Hard To Teach/Test
For the CCNP and CCIE, “troubleshooting” (and not wrongly so) focuses on something relatively predictable: misconfiguration. That will help with many of the real-world issues that you’ll come across, because often your biggest enemy is whoever put the network together in the first place (even if it’s you). A finger flub here, a missed or extra command there, or even someone who thought they knew what they were doing can mess up your day.
But the two big things that you’ll also run into that are hard to teach and reproduce are “code errors” and “hardware issues” (I say “issues” because “failures” in my mind tend to be more obvious). And it can hamstring you a little bit if you go into every troubleshooting issue assuming that the code is rock solid, and that the hardware is perfectly functional. We’ve run into quite a few issues with packet loss that have been fixed by replacing a cable or an optic.
The only thing I can hope is that the instructors out there who are teaching the future network engineers are saying things like “on the test, it’s probably this. But in the real world, things might seem a little more chaotic, so don’t discount that the hardware might be failing or the OS revision might have a bug in it.”