Eric Stewart: Running Off At The Mouth

When Packet Captures Lie: vPC Settings To Watch Out For

by Eric Stewart on Apr.05, 2018, under Networking, Technology

Two posts in as many days – amazing!

One of my worst weaknesses as a network admin is that (mostly due to a weird conflux of laziness and time restrictions) I tend to not read up on a topic as much as I should before I implement; or at the very least, I don’t retain what I do read and put certain settings in without considering that there might be serious ramifications. Such is the case with vPC; I don’t know it as well as I should, and I trust it even less, but it’s being used in our data center (DC) networks heavily. So it was that during the load balancer migration I’m in the middle of, I came across a case where vPC made a packet capture lie.

Now, generally, it’s a rule: Packet captures do not lie. They always tell the truth. They tell you so much about the data flowing. The right packet capture (with accompanying indications of a lack of errors on ports) can show that “slowness” isn’t a network problem, but rather a client or server issue.

So it was that I found myself in the following situation:

Our new load balancers are vPC connected to two Cisco 7700s. There are several VLANs involved in the trunk between the 7700s and the two load balancers (LBs). While most of them are bridged VLANs for Server Load Balancing, one of them is a single VLAN used for both route injection (providing AnyCast addresses for certain services, such as load balanced DNS caches) and Global Server Load Balancing (GSLB, or load balancing across DCs using DNS). Originally planned as a group of point to point connections, that VLAN instead turned into (in each DC) a /29: an IP for each of the LBs, two for the 7700s, a floating IP for GSLB sync between DCs, and another floating IP (which can’t be the same as the sync IP) for the actual DNS services. In the end, the /29 actually worked out better for us.
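For illustration only (made-up subnet and made-up assignments; the real addressing is different), that /29 lays out something like this, using all six usable addresses:

    192.0.2.0/29 (example subnet)
      192.0.2.1  7700#1
      192.0.2.2  7700#2
      192.0.2.3  LB#1
      192.0.2.4  LB#2
      192.0.2.5  floating IP for GSLB sync between DCs
      192.0.2.6  floating IP for the load balanced DNS service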

As I was reviewing our existing GSLB setups, I came across a few that don’t really cross DCs. It turns out there are servers that could use load balancing (or failover services) but weren’t in our “regular” (read: behind the firewall and load balancers) load balancing setup. So, even though all of the servers involved were in our “local” DC (GSLB has the concept of “local” vs “remote” DCs), we were using GSLB to provide load balancing services behind a Fully Qualified Domain Name (FQDN) rather than a Virtual IP (VIP).

The servers in question are run by an IT group that’s fairly independent: they run their own equipment and networking (including making budget-driven choices that cause headaches for us). Their equipment hangs (mostly) off a couple of vPC connections, but the 7700s provide the routing both for the LBs’ /29 (the default route for each LB is the corresponding 7700’s IP) and for the subnet that the IT group uses.

So I get the FQDN configured on the new LBs just like the old LBs had. I run a few digs against it, and notice that it’s not giving me a rotating list of addresses. It’s only giving me the backup address. I poke around a bit, and it turns out the new LBs (actually just the “local” ones – the ones in the “remote” DC were fine) think all the servers for the FQDN are down.
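(For the record, the check itself is nothing fancy. With a made-up name standing in for the real FQDN, it’s basically:

    dig +short service.example.com

run several times in a row, watching whether the answers rotate through the pool members or keep coming back as just the backup address.)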

I start asking (let’s just call him) John, the guy responsible for the servers, if he’s got a local firewall. “No.” I check a packet capture and see packets going down the vPC port channel to his equipment. Do you see the returns coming out of your equipment? “No.”

Well, to some extent (mainly due to the communication methods used – Slack and email) John wasn’t all that clear in what he was seeing in his captures. Eventually it became apparent that John wasn’t even seeing the echo requests on his boxes.

I, possibly a little too flippantly, use the phrase “smoking crack” when I think someone is, for whatever reason, not providing accurate information (usually due to laziness, unfamiliarity with the tools, or a general lack of understanding of how networking works; the irony of that statement struck me pretty hard as I reread this during an edit review). It didn’t help that in other cases John had been known to, well, be “smoking crack.” So the fact that he was saying he wasn’t seeing the requests, when I was clearly seeing them go down the port channel to his equipment, made me think he was doing it again.

As the history of any given situation is relevant to why things are the way they are “now” … a little history:

So when those nice new 7700s came in and my boss made the second dumbest mistake he’s ever made by handing them to me (the first was hiring me in the first place), I went through the process of configuring them to replace some very old Cat 6500s. I became somewhat familiar with vPC, as well as how the old LBs did their route injection. Back then, I read up on “peer-gateway”, a setting Cisco says is used to allow “a vPC switch to act as the active gateway for packets that are addressed to the router MAC address of the vPC peer.” Not really knowing any better (as with HSRP you shouldn’t need this functionality, though it just occurred to me that this case will prove I do need it), I included the peer-gateway option. Turns out, if you have a VLAN where a device is doing HSRP and OSPF with, say, a couple of old load balancers, you need to “peer-gateway exclude-vlan” that VLAN so that a given 7700 doesn’t interfere with a given load balancer’s attempt to form an OSPF neighborship with the other 7700. (You also need to set up an ACL to stop the two load balancers from becoming neighbors with each other, because that complicates things.)
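For the curious, the relevant knob is a single line under the vPC domain. This is a sketch of the idea, not a paste of our config; the domain ID and VLAN number are made up, with 100 standing in for the old HSRP/OSPF VLAN shared with the old LBs:

    vpc domain 10
      ! enables peer-gateway, but skips the listed VLAN(s) so each 7700
      ! won't answer for its peer's MAC there and break the old LBs' OSPF
      ! neighborships with the "other" 7700
      peer-gateway exclude-vlan 100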

So it was with little thought that I added the VLAN that the new LBs and 7700s use for their BGP peering (which will let us do route injection more cleanly) to the list of peer-gateway excluded VLANs. But remember: each LB peers with only one of the 7700s, not both, and there’s no need for HSRP, since we configure the new LBs with a default route pointing at a specific 7700’s IP (and besides, with all of the /29’s IPs now in use for one reason or another, there isn’t a spare IP to use as an HSRP “gateway” anyway).
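On the 7700s, that left the exclude list looking something like this (same made-up numbers as above, with 200 standing in for the new LB BGP peering VLAN):

    vpc domain 10
      ! 100 = old HSRP/OSPF VLAN (genuinely needs the exclusion)
      ! 200 = new LB BGP peering VLAN (the thoughtless addition)
      peer-gateway exclude-vlan 100,200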

So when John seemed to be smoking crack and I couldn’t figure out what the heck was going on, I went to my boss, who looked at it and promptly agreed that John was probably smoking crack. But he also took the time to go up to John’s Brocade (not sure who owns them now) switches and verify that, indeed, the echo requests were actually not being received by those switches. Alas, John was not hallucinating. Not this time, anyway.

So what was going on?

Somewhere between the vPC port channel on the 7700s and the Brocades at the other ends of some strands of fiber, the packets were being lost.

Now, there’s very little between the vPC port channel and the Brocades; in this particular DC the port channel is two physical ports, one on each 7700, connected by those strands of fiber to John’s Brocade. So it was quite puzzling. It took me looking at a config and taking a shot in the dark, and my boss walking away from the computer to get some food, for him to realize what the problem was and for me to get lucky with that shot and fix it.

With peer gateway configured (and the relevant VLAN not being “excluded”), either 7700 in a vPC pair will act as the gateway for frames directed to the MAC address of its partner. This prevents needless crossing of the vPC peer link, because when things cross the vPC peer link, they become subject to some “not always clear” loop prevention mechanisms.

Without peer gateway, or with the VLAN peer-gateway excluded, a vPC switch will not forward packets out a vPC member interface if those packets arrived over the vPC peer link. And purely by chance, it would seem, LB#1’s port channel (LB#1 has a default route pointing at 7700#1) happened to hash the echo requests out the member link toward 7700#2. This not being an HSRP setup (if the traffic were directed at a shared HSRP MAC address, as classic default gateway traffic would be, either the standby would go ahead and route it or vPC would allow the traffic to flow … I’m not 100% sure of this), and the VLAN being peer-gateway excluded, 7700#2 sends the request on to 7700#1 over the vPC peer link. 7700#1 moves the packet from the source VLAN to the destination VLAN and, since it has a nice vPC port to the downstream device, says it’s going to send it down. But according to the vPC rules it’s really not supposed to. The captures clearly show the packets headed out of the exit port-channel interface, but they never show up on the entrance interface of the downstream device.
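To lay out my understanding of the path step by step (same labels as above; the last step is the part I’m inferring):

    1. LB#1 sends the echo request into its port channel; the hash happens to pick the member link toward 7700#2.
    2. The VLAN is peer-gateway excluded, so 7700#2 won't route a frame addressed to 7700#1's MAC itself; it forwards it across the vPC peer link to 7700#1.
    3. 7700#1 routes the packet into the destination VLAN and picks its vPC member port toward the Brocades as the exit.
    4. The vPC loop prevention rule (traffic that arrived over the peer link doesn't get sent out a vPC member port while the peer's corresponding link is up) kicks in, and the packet is silently dropped somewhere past the port-channel interface.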

Between $BOSSBIGBRAIN and me, the only thing we can figure is that a port channel is essentially a virtual interface, and you can’t (as far as I know, though I didn’t actually try) do packet captures on member ports of a port channel, only on the port channel itself (I should try this later to verify what I’ve been told). So the vPC loop prevention mechanism would appear (at least on the NX-OS version these 7700s are running) to kick in after the port channel’s virtual interface gets the traffic, but before the actual physical exit interface gets it.
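If I ever get around to testing that, the obvious thing to try is a SPAN session sourced from one of the physical member ports instead of the port channel. Something like the following sketch; the interface names are placeholders, and I haven’t verified that these 7700s will even accept a port-channel member as a SPAN source:

    monitor session 1
      ! Ethernet1/1 = physical member of the vPC port channel (placeholder)
      source interface ethernet 1/1 tx
      ! Ethernet1/2 = capture port, configured with "switchport monitor" (placeholder)
      destination interface ethernet 1/2
      no shut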

Aggravatingly inconvenient, because this essentially makes a packet capture lie.

I’m sure given infinite time or a little more practice, one might be able to find some logging or debug setting that will tell you when vPC decides to drop packets.  But, see the beginning of this post for my excuse.

Removing the VLAN for the load balancers from the “peer-gateway exclude-vlan” list fixed the issue.
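In config terms (same made-up numbers as before), that just means the exclude list goes back to the one VLAN that actually needs it:

    vpc domain 10
      ! VLAN 200 (the LB BGP peering VLAN) removed from the exclusion;
      ! only the old HSRP/OSPF VLAN stays excluded
      peer-gateway exclude-vlan 100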


Hi! Did you get all the way down here and not find an answer to your question? The two preferred options for contacting me are:
  • Twitter: Just start your Twitter message with @BotFodder and I'll respond to it when I see it.
  • Reply to the post: Register (if you haven't already) on the site, submit your question as a comment to the blog post, and I'll reply as a comment.
