Eric Stewart: Running Off At The Mouth

Where You Capture Is As Important As What You Capture: Devil In The Details

by Eric Stewart on Apr.13, 2018, under Networking, Technology

So yes – spending a lot of time with the new load balancers and finding out all sorts of things about their operation I didn’t know about, that (at least in this case) were fairly easy to fix.

The thing is carrying actual, important, production traffic now.  Most of it is short-lived web-based stuff; short enough that the transfer completes well before a failed session establishment on the firewall is noticed.  But someone was noticing, and it took me doing many packet captures to figure out exactly what was going wrong (and realize it was actually my fault).

The configuration of our load balancer, if you don’t remember from previous posts, has bridged VLANs for a vast majority of the VIPs – a couple of pairs, depending on what security tier the servers/VIPs belong to.  But some of the VIPs are going to use route health injection to provide AnyCast based services.  For those (and also used by the GSLB services), there’s another, not-bridged, all by itself VLAN.  The default route for the load balancer is also set to the routers on the other side of this VLAN.  Mind you, all VLANs go through the same LACP (vPC) trunk to the load balancer.

After moving a few VIPs, I get reports from one of the VIP owners that they’re encountering issues with the services sometimes working, sometimes not.  I poke around a bit, see nothing wrong, and make the assumption that it’s got to be the member servers or the client.  This went on for a lot longer than it should have – I should have started doing the captures well before I eventually did.  It took an “issue group” being formed to look at the problem before I started to do any captures at all.  And I see exactly what I expect to see when doing the captures at the interfaces on the 7000s that lead to the active load balancer:

  • Traffic going in, being translated to a member server, and sent out to said member server
  • Member server responses coming out and being translated and sent out to the client
  • At some point (when a failure is reported), the member server sending a few retransmits, which the load balancer dutifully translates and sends out

From that capture, it all looks like the client stops responding.  It didn’t help that a particular cloud provider was the location of said client.

Now, a note about the captures to this point:

I was doing simple scrolling text captures, watching the traffic, not in a GUI.  This means I could see the content type, some of the content, and IPs.  Note: not VLANs or MACs.

I sent my conclusion to the group, and headed to lunch.  While out at lunch, I realized that our firewall in the other DC was the firewall that provided the tunnel to the cloud provider in question, so I Slacked the group that before we sent a complaint to the cloud provider, I should probably do my capture there, as that was the location essentially closest to the cloud provider, and ensure that the behavior I saw in the DC where the servers were was the same behavior I saw there.

Well, they had already sent the complaint in.  Oops.

Oops, because what I saw at the tunnel endpoint was the client sending retransmits, and none of the server retransmits reaching the tunnel endpoint.

Freaked out by the difference in traffic (some of which I still can’t 100% explain), I then did a traffic capture at the firewall interface in the original DC, trying to match it up with the traffic I’d seen at the tunnel endpoint.  I was essentially trying to figure out where the “disconnect” was … where the traffic differed.

Well, faithful readers, let me tell you … what I saw at the firewall of the VIP/member servers freaked me out a little.

See, the firewall is essentially a firewall on a stick – all traffic going in (assuming an allowed and proper flow) should also come out.  I should see all traffic twice if I configure the SPAN session liberally (i.e., with “both” instead of either “rx” or “tx”).  And since the firewall is also the default gateway for the servers and VIP, you’d think there’d be no reason for the traffic to magically go a different route.

So it was when I did the traffic capture on the firewall that I saw all the traffic from the client twice (as it went to the firewall and then as the firewall sent it back out to the VIP/member servers).  But I saw none of the servers’/VIP’s responses.

Since the firewall wasn’t seeing the return traffic, the session would be terminated, causing dropped connections and an interruption in flow.
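The "I should see everything twice, in both directions" check is easy to automate once a capture has been boiled down to flow tuples.  Here is a minimal sketch of that idea; the function name `one_way_flows`, the tuple layout, and the sample addresses (all from the documentation ranges) are made up for illustration and are not from my actual captures.

```python
from collections import Counter

def one_way_flows(packets):
    """Return flows that were seen in one direction only.

    packets: iterable of (src_ip, src_port, dst_ip, dst_port) tuples,
    one per captured packet (hypothetical layout for this sketch).
    """
    seen = Counter(packets)
    missing = []
    for src, sport, dst, dport in seen:
        reverse = (dst, dport, src, sport)
        # On a firewall on a stick, a healthy flow shows up in both directions.
        if reverse not in seen:
            missing.append((src, sport, dst, dport))
    return sorted(missing)

# Hypothetical capture: the first flow never gets a reply past the firewall,
# the second one is healthy.
broken = ("203.0.113.10", 40000, "192.0.2.80", 443)
healthy = ("198.51.100.7", 50000, "192.0.2.80", 443)
healthy_reply = ("192.0.2.80", 443, "198.51.100.7", 50000)
sample_capture = [broken, broken, healthy, healthy, healthy_reply, healthy_reply]

for flow in one_way_flows(sample_capture):
    print("no return traffic seen for", flow)
```

A flow that shows up here is not proof of a problem by itself (a UDP one-way stream would look the same), but for TCP through a stateful firewall it is a strong hint that the return path is going somewhere else.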

“How could that happen?”

By this point I informed the group that there was something I spotted that was wrong and that the issue was probably our error.  I just needed to figure it out.

Still thinking in terms of Layer 2 at this point, I reset my captures to be of the load balancer connection, but this time writing the captures out to a file for later dissection in full-blown Wireshark.  Luckily the vPC connection in this case sent all of the traffic over the same leg, so I only had to look at one file.  The first thing I looked at was the MAC addresses.  See, what I would have expected was that traffic going to the VIP (and possibly the members, but that’s less important at this point) would be sourced from the MAC of the firewall.  As such, the responses should have a destination MAC of the firewall.  But if I wasn’t seeing the frames at the firewall, either something was colossally wrong or the destination MAC was going to be some other MAC.

And it was.  It was the MAC of the Nexus 7000 the load balancer was using as its default router.  And the VLAN?  The “exiting” traffic was going out the VLAN that was to be used for route health injection and GSLB.  The load balancer, not really knowing for sure where it should send the traffic, just used its default route to route the traffic.
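The fields that gave the game away live in the first 14 to 18 bytes of every frame.  As a rough sketch of what Wireshark was showing me, here is how the destination MAC, source MAC, and VLAN ID could be pulled out of a raw Ethernet frame, assuming the capture preserves a single 802.1Q tag.  The helper names and the sample frame (locally administered MACs, VLAN 100) are invented for the example.

```python
import struct

def format_mac(raw: bytes) -> str:
    """Render six raw bytes as a colon-separated MAC address."""
    return ":".join(f"{octet:02x}" for octet in raw)

def parse_l2(frame: bytes) -> dict:
    """Extract destination MAC, source MAC, VLAN ID, and EtherType."""
    if len(frame) < 14:
        raise ValueError("frame too short for an Ethernet header")
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
    vlan = None
    if ethertype == 0x8100:  # 802.1Q tag present
        if len(frame) < 18:
            raise ValueError("frame too short for an 802.1Q tag")
        tci, ethertype = struct.unpack("!HH", frame[14:18])
        vlan = tci & 0x0FFF  # low 12 bits of the tag are the VLAN ID
    return {
        "dst_mac": format_mac(dst),
        "src_mac": format_mac(src),
        "vlan": vlan,
        "ethertype": ethertype,
    }

# Hypothetical tagged IPv4 frame on VLAN 100.
sample_frame = (
    bytes.fromhex("0200000000aa")                # destination MAC
    + bytes.fromhex("0200000000bb")              # source MAC
    + struct.pack("!HHH", 0x8100, 100, 0x0800)   # tag type, TCI, EtherType
    + b"payload"
)
print(parse_l2(sample_frame))
```

Comparing `dst_mac` and `vlan` on the responses against what the firewall should own is the whole trick; a scrolling text capture that only prints IPs hides both.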

Again, Doing Layer 2 In A Layer 3 Configuration

Not necessarily the most ideal config, as I’ve admitted in previous posts.  Still supported, and luckily, there’s a configuration option, described as:

Use receive hop for response to client

Our sales engineer, who was very quick in getting back to me after my panicked email to him, told me about it.  I first described this option as retaining the MAC address the member server uses and bridging the traffic properly to the “outside” VLAN, which was wrong (I’m stupid).  Near as I can tell, it sends the traffic back out the direction the initial connection was received on (the load balancer tracks connectivity in great detail, including what MAC/IP/port the connection was received on, just in case it has to do even more NAT than it already does as part of its load balancing).  Issue resolved … which had several (actually quite welcome) effects:

  • We had inklings of issues with rising session counts on our firewall.  Now with the traffic properly flowing, the firewall was seeing the tear down properly and wasn’t holding sessions open longer than needed.
  • It also appears as if a similar session issue was happening on the load balancer itself (which makes sense).  The number of active connections dropped significantly for the VIPs that have noticeable session counts.
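To make the difference between "follow the default route" and "use receive hop for response to client" concrete, here is a toy model of the behavior as I understand it.  This is not the vendor's implementation; the class, the `use_receive_hop` flag name, and the VLAN numbers and MACs are all assumptions made up for the sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hop:
    """Where a frame came from or goes to at Layer 2 (hypothetical model)."""
    vlan: int
    mac: str

class ToyBalancer:
    def __init__(self, default_hop: Hop, use_receive_hop: bool):
        self.default_hop = default_hop
        self.use_receive_hop = use_receive_hop
        self.connections = {}  # flow tuple -> Hop the first packet arrived from

    def on_client_packet(self, flow, arrived_via: Hop) -> None:
        # Remember only the hop that opened the connection.
        self.connections.setdefault(flow, arrived_via)

    def reply_hop(self, flow) -> Hop:
        # With the option on, answer via the hop the connection came in on;
        # otherwise fall back to the default route.
        if self.use_receive_hop and flow in self.connections:
            return self.connections[flow]
        return self.default_hop

firewall = Hop(vlan=10, mac="02:00:00:00:00:01")  # gateway for the VIPs
router = Hop(vlan=99, mac="02:00:00:00:00:02")    # default route, RHI/GSLB VLAN
flow = ("203.0.113.10", 40000, "192.0.2.80", 443)

for enabled in (False, True):
    lb = ToyBalancer(default_hop=router, use_receive_hop=enabled)
    lb.on_client_packet(flow, arrived_via=firewall)
    print("use_receive_hop =", enabled, "-> reply via", lb.reply_hop(flow))
```

With the flag off, the reply leaves via the router on the other VLAN and the firewall never sees it, which is the asymmetric path the captures showed.  With it on, the reply goes back to the firewall, so the firewall sees both halves of the session and can tear it down properly.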

The lessons?

  • Do your captures everywhere that’s relevant: close to the network exit point, close to the client, but also – close to the gateway location of at least one of the subjects of the capture.
  • Don’t always look at just the scrolling IPs.  Sometimes the Devil is hiding in the MAC address or VLAN.
  • Sometimes, it actually is the network.

I was thanked heavily for fixing this issue, but I had to eat crow – this was something I personally feel I should have spotted …

Like, last year when I was doing the load balancer evaluation.


