Eric Stewart: Running Off At The Mouth

Perl’s Net::SNMP and get_table() – Bulk vs Next Request

by Eric Stewart on Sep.27, 2013, under Networking, Technology

We use Nagios with a whole mess of custom scripts to monitor our network equipment.  Some of the custom Perl scripts get very detailed.  We have scripts that probe via SSH, but the vast majority of our scripts use SNMP.  Thing is, we make a large number of SNMP queries (many using “get_table” to return a table of SNMP information) against any given device, and if something goes wrong with any of those queries, the scripts sometimes “bomb out” with error messages that initially make little sense and offer no practical solutions.

So today I had a switch (new, I think; our Operations group has a handy way of adding equipment that usually doesn’t require my intervention) pop up but immediately report:

UNKNOWN: SNMP error: The message size exceeded the buffer maxMsgSize of 1452
  (ciscoMemoryPoolFree)
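
(That message is our script relaying a Net::SNMP error string in the usual Nagios plugin fashion.  This is a guess at the general shape of the check, not a verbatim excerpt; $session is an established Net::SNMP session, and $ciscoMemoryPoolFree would hold the base OID, enterprises.9.9.48.1.1.1.6:)

    # If the table query fails, report UNKNOWN and bail out.
    my $result = $session->get_table(-baseoid => $ciscoMemoryPoolFree);
    if (!defined $result) {
       print 'UNKNOWN: SNMP error: ' . $session->error() . " (ciscoMemoryPoolFree)\n";
       exit 3;   # Nagios exit status for UNKNOWN
    }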

All things being equal, when I see an “UNKNOWN” my initial assumption is that the SNMP request failed, usually due to something being wrong in the device’s Nagios configuration (host address, community string, something).  But here’s the thing: “ciscoMemoryPoolFree” isn’t the first query using “get_table” we make of a device with this particular Nagios check.  Not only that, but an snmpwalk of the equipment in question returned:

SNMPv2-SMI::enterprises.9.9.48.1.1.1.6.1 = Gauge32: 392404512
SNMPv2-SMI::enterprises.9.9.48.1.1.1.6.7 = Gauge32: 7276396

Not a lot of data there that would necessarily explain why it would overrun the buffer in question.  This was using IPv6, with a fairly new Cisco Catalyst 3850.  Not the first 3850 we’ve installed, but the first one using IPv6.  None of our other v6 switches (Brocade or Cisco) displayed any odd behavior, so this may very well be a bug in Cisco’s v6 implementation on the 3850 (not the first time we’ve seen this happen; the issues with v6 management addresses on 3750 switches are another story for another time).

So I fire up a very simple Wireshark capture and watch the packets fly while the check runs, and I see (sanitized for my protection, of course):

... (other stuff much like the next couple of lines)
 1.253271 (NAGIOSv6) -> (SWITCHv6) SNMP getBulkRequest
     SNMPv2-SMI::enterprises.9.9.48.1.1.1.5
 1.256986 (SWITCHv6) -> (NAGIOSv6) SNMP get-response
     SNMPv2-SMI::enterprises.9.9.48.1.1.1.5.1
     SNMPv2-SMI::enterprises.9.9.48.1.1.1.5.7
     SNMPv2-SMI::enterprises.9.9.48.1.1.1.6.1
     SNMPv2-SMI::enterprises.9.9.48.1.1.1.6.7
     SNMPv2-SMI::enterprises.9.9.48.1.1.1.7.1
     SNMPv2-SMI::enterprises.9.9.48.1.1.1.7.7
     SNMPv2-SMI::enterprises.9.9.61.1.1.1.1.28
     SNMPv2-SMI::enterprises.9.9.61.1.1.1.1.166
     SNMPv2-SMI::enterprises.9.9.61.1.1.1.1.172
     SNMPv2-SMI::enterprises.9.9.61.1.1.1.1.224
 1.258028 (NAGIOSv6) -> (SWITCHv6) SNMP getBulkRequest
     SNMPv2-SMI::enterprises.9.9.48.1.1.1.6
 1.285458 (SWITCHv6) -> (NAGIOSv6) IPv6 IPv6 fragment (nxt=UDP (0x11) off=0 id=0x19)
 1.285478 (SWITCHv6) -> (NAGIOSv6) SNMP get-response
     SNMPv2-SMI::enterprises.9.9.48.1.1.1.6.1
     SNMPv2-SMI::enterprises.9.9.48.1.1.1.6.7
     SNMPv2-SMI::enterprises.9.9.48.1.1.1.7.1
     SNMPv2-SMI::enterprises.9.9.48.1.1.1.7.7
     SNMPv2-SMI::enterprises.9.9.61.1.1.1.1.28
     SNMPv2-SMI::enterprises.9.9.61.1.1.1.1.166
     SNMPv2-SMI::enterprises.9.9.61.1.1.1.1.172
     SNMPv2-SMI::enterprises.9.9.61.1.1.1.1.224
     SNMPv2-SMI::enterprises.9.9.61.1.1.1.1.324
     SNMPv2-SMI::enterprises.9.9.61.1.1.1.1.376
     SNMPv2-SMI::enterprises.9.9.68.1.2.1.1.2.1
     SNMPv2-SMI::enterprises.9.9.68.1.2.1.1.2.196
     SNMPv2-SMI::enterprises.9.9.68.1.2.1.1.2.1002
     SNMPv2-SMI::enterprises.9.9.68.1.2.1.1.2.1003
     SNMPv2-SMI::enterprises.9.9.68.1.2.1.1.2.1004
     SNMPv2-SMI::enterprises.9.9.68.1.2.1.1.2.1005
     SNMPv2-SMI::enterprises.9.9.68.1.2.1.1.2.3202
     SNMPv2-SMI::enterprises.9.9.68.1.2.1.1.2.3230

And this is where the capture ends and the check reports its error.

The “IPv6 fragment” line is disconcerting and may be involved in the overall error, but what was more striking to me in the packet capture was that the bulk request was returning not just what I asked for, but a heck of a lot more.

The “Aha!” moment for me though (at least the thing that clued me in to the idea that this should be fixable) was when I ran the capture on an snmpwalk for the OID in question:

0.000000 (NAGIOSv6) -> (SWITCHv6) SNMP get-next-request
     SNMPv2-SMI::enterprises.9.9.48.1.1.1.6
 0.003726 (SWITCHv6) -> (NAGIOSv6) SNMP get-response
     SNMPv2-SMI::enterprises.9.9.48.1.1.1.6.1
 0.003834 (NAGIOSv6) -> (SWITCHv6) SNMP get-next-request
     SNMPv2-SMI::enterprises.9.9.48.1.1.1.6.1
 0.013041 (SWITCHv6) -> (NAGIOSv6) SNMP get-response
     SNMPv2-SMI::enterprises.9.9.48.1.1.1.6.7
 0.013101 (NAGIOSv6) -> (SWITCHv6) SNMP get-next-request
     SNMPv2-SMI::enterprises.9.9.48.1.1.1.6.7
 0.016590 (SWITCHv6) -> (NAGIOSv6) SNMP get-response
     SNMPv2-SMI::enterprises.9.9.48.1.1.1.7.1

So wait!  An snmpwalk actually “walks” the table, starting at the OID requested and issuing “get-next-request”s until it gets back an OID that has nothing to do with what it’s looking for.  A bulk query (“snmpbulkget”, or what get_table does with get-bulk-requests), however, returns something similar to the first capture above.  That’s possibly more efficient in the “give me a table of OIDs under this value” sense, but apparently somewhere between Cisco’s 3850 and IPv6, an issue pops up (at least for the OID we query at this point in the script) where it’s possible to get more data back than we want by default.

The next step at this point was to start Googling (say, for “perl get_table snmpbulkrequest”).  That brought me to CPAN’s Net::SNMP page.  Here’s the relevant bit from the get_table documentation:

This method performs repeated SNMP get-next-request or get-bulk-request (when using SNMPv2c or SNMPv3) queries to gather data from the remote agent on the host associated with the Net::SNMP object.

The -maxrepetitions argument can be used to specify the max-repetitions value that is passed to the get-bulk-requests when using SNMPv2c or SNMPv3. If this argument is not present, a value is calculated based on the maximum message size for the Net::SNMP object. If the value is set to 1 or less, get-next-requests will be used for the queries instead of get-bulk-requests.

And yes, we’re “using SNMPv2c or SNMPv3”, so that tracks.  Somewhere along the line there’s an issue with the maximum message size when get-bulk-requests are used, so our solution was to specify

-maxrepetitions => 1

for the get_table function at that point in our script.  Good thing, too; I was not looking forward to figuring out how to mimic the behavior using some other functions in a loop.
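
Concretely, the working call ended up shaped something like this (a minimal sketch; the hostname and community string are stand-ins, not our actual configuration):

    use Net::SNMP;

    my ($session, $error) = Net::SNMP->session(
       -hostname  => 'switch.example.edu',   # stand-in device name
       -version   => 'snmpv2c',
       -community => 'public',               # stand-in community string
    );
    die "Session error: $error" unless defined $session;

    # ciscoMemoryPoolFree lives under enterprises.9.9.48.1.1.1.6.
    my $result = $session->get_table(
       -baseoid        => '1.3.6.1.4.1.9.9.48.1.1.1.6',
       -maxrepetitions => 1,   # 1 or less forces get-next-requests
    );
    die 'Query error: ' . $session->error() unless defined $result;

(For the record, the do-it-yourself version probably wouldn’t have been too awful; here’s a rough, untested sketch using get_next_request and the module’s oid_base_match helper, with error handling omitted:)

    use Net::SNMP qw(oid_base_match);

    my $base = '1.3.6.1.4.1.9.9.48.1.1.1.6';
    my $oid  = $base;
    my %table;
    while (defined(my $reply = $session->get_next_request(-varbindlist => [$oid]))) {
       ($oid) = keys %{$reply};
       # Stop as soon as the agent hands back an OID outside the subtree.
       last unless oid_base_match($base, $oid);
       $table{$oid} = $reply->{$oid};
    }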

Some additional notes regarding SNMP and Perl coding:

  • All this should explain why, when going through the result of an SNMP query, you should use some kind of conditional to verify that the OID you’re working on is actually related to the OID you queried.  We frequently use regular expression matching, something like:
    foreach my $oid (keys %{$result}) {
       # Anchor the match and escape the dots, or unrelated OIDs can sneak past.
       if ($oid =~ /^\Q$target_oid\E\./) {
          my $key = $';   # $' holds the instance index left over after the match
       }
    }

    which lets us correlate values from other SNMP queries with the one in question.

  • It also makes me think that you might be able to make use of that “extra” data, rather than sending another query for information you’ve already pulled.  Thing is, it also seems like get-bulk-requests might be unreliable for that purpose … especially if the number of entries you’re looking for when using a “get_table” fluctuates from case to case, or device to device.
  • If you could be assured that the “keys” of a “$result” were in proper OID order (Perl hash keys come back unordered on their own, so you’d have to sort them yourself), you could shave off a few microseconds of execution time by putting an “else { last; }” case in the foreach example above, stopping the loop from churning through the $result entries that are of no interest to you at the time (a sketch follows this list).  Thing is, I’m not sure I trust all of our equipment quite that much, I’m headed out the door for a short vacation, and it’s also Friday.  You never change anything you don’t have to change on a Friday.
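
A minimal sketch of that early-exit idea, untested and in keeping with the Friday rule: Net::SNMP exports an oid_lex_sort helper that puts OIDs in proper numeric order client-side, which conveniently sidesteps the question of trusting the equipment.

    use Net::SNMP qw(oid_lex_sort);

    my $in_subtree = 0;
    foreach my $oid (oid_lex_sort(keys %{$result})) {
       if ($oid =~ /^\Q$target_oid\E\./) {
          $in_subtree = 1;
          my $key = $';
          # ... work with $result->{$oid} here ...
       }
       elsif ($in_subtree) {
          last;   # Sorted order means we're past the subtree; nothing left to match.
       }
    }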

If you run into something seemingly related, I’d be interested in the particulars of your case.  Feel free to contact me (leave a comment here, tweet at me, hop on the Reddit thread about this issue, etc.); I’d like to know whether other hardware or Cisco IOS versions behave similarly.

