Eric Stewart: Running Off At The Mouth

Proving It’s Not The Network: Catchpoint and Kentik

by Eric Stewart on Oct.05, 2022, under Networking, Technology

It’s often been said that “50% of a network engineer’s job is proving it’s not the network”; I personally follow that up with “25% of your job is then determining what it is and explaining how to fix it to people who should know better.”  Catchpoint and Kentik presented to Network Field Day their solutions for making that first 50% easier to do (or, in the case where it might just be the network, finding out what’s going on and then fixing it).  Catchpoint’s presentation was an hour and a half; Kentik presented for two hours.

Network Observability – What Is It?

One NFD29 attendee called it “the new word for monitoring.”  Tony Ferrelli of Catchpoint suggested that it was “measuring lots of things … measure more things from different perspectives.” Justin Ryburn of Kentik said “Data is the life blood of observability,” and a later portion of the Kentik presentation framed observability as knowing why something is happening as opposed to just knowing what is happening. For me, it seems a lot like “mass monitoring with a little bit of AI/ML thrown in.” I will admit that traditional monitoring is typically a “ping/query SNMP” thing, but observability seems to incorporate things like NetFlow information as well as more involved testing.

Catchpoint: An end-to-end focus

Catchpoint’s solution includes user-based tools to provide an end-to-end, multi-layered view of connectivity.  Catchpoint has over 1024 public nodes across the Internet to provide information about reachability from site to site – you can’t necessarily run iperf (at least, not on the public nodes), but there are a myriad of other tests you can run in an effort to determine where an issue might actually be.  Catchpoint also collects BGP route table data from multiple locations to provide information about pathing, including changes in pathing over time.  The intention is to use multiple perspectives to examine a problem to figure out what might actually be going wrong, even when it’s outside of your network.  Public nodes allow for visibility from across the Internet to things that you usually can’t see from within your own network, at least not at the level of detail that you can with Catchpoint.  Catchpoint provides tools that can be used by multiple IT teams to determine what might be the source of an issue – not just the networking guys.  Private nodes can be used to run tests that gather performance statistics.  Catchpoint tests hit all layers of the process – not just the end-to-end test, but the DNS requests, API-specific tests, website tests – including determining what portion of a page took the longest to render.  While web tests are a major focus of Catchpoint’s tools, you can script tests of other protocols and Catchpoint will use the response data as a metric of the test.

Catchpoint isn’t a replacement for your regular monitoring; they don’t necessarily monitor your internal routing protocol (such as OSPF) or your equipment.  Catchpoint’s focus is essentially the user experience and the data network engineers don’t usually have – their edge to the other end, or perhaps remote user to your edge.  They do provide user experience diagnostic tools that will look at things from end to end, and in that sense can be used to determine that the problem might just be an internal one.

Catchpoint would be happy to accept your BGP table for routing diagnosis if you’d like to share it; they collect BGP data not only from public looking glass routers, but also from customers and others.

Kentik: Focus on end-to-edge

Kentik’s method of operation is to collect data from multiple sources and be able to use it intelligently and gain “actionable insights.”  While they do have BGP data and through testing can potentially indicate where issues are external to your network, a majority of Kentik’s solution focuses on information that can be gathered from your equipment.  They do have “synthetic tests” (transaction testing, page load testing, etc), and can gather information from them, but they heavily utilize “passive telemetry” – data that’s already there (think things like interface statistics, flow information, etc).  This kind of information is what a network engineer would rely on to diagnose local problems effectively, and in that sense, may be more immediately useful for fixing your own stuff.  Kentik’s solution provides the ability to get an overview as well as drill down to specifics, all the way down to a specific interface.

Kentik Synthetics includes containers that you can deploy (potentially, onto switches) to get statistics from a point closer to the user.

Their BGP data is mostly what is publicly available or what is available from customers (especially those they have BGP peerings with).

Conclusion

When Catchpoint’s presentation was immediately followed by Kentik, I’ll admit that there was some … confusion? Catchpoint presented an “end-to-end” system that provided visibility into areas outside of the customer network. This would seem to make it a lot easier to prove “it’s not our network,” especially since you might be able to collect data that could clearly indicate that the issue was someone else’s network or service. One of the Kentik representatives clearly uttered the phrase “we focus on end-to-edge.” It took watching the videos as a refresher to spot the misunderstanding I walked away with: Catchpoint, for the most part, was more of an “edge-to-end” solution, focused on those external data sources rather than internal ones. As a clearer example, if your campus network has an uplink that’s dropping packets or has a high error rate, Kentik would seem to be the solution that’s more likely to spot that issue than Catchpoint.  But if you’re experiencing issues reaching an external/cloud-based service, Catchpoint is likely to be able to tell you things like “the BGP path between you and that site went through a strange shift recently” or “the page you’re attempting to reach references a third party server that is not responding.”  In that sense, these two solutions seem to be actually somewhat complementary, but there is definitely some crossover, and one would expect that in the long run the companies will be (if they aren’t already) competitors.

Catchpoint’s solution seemed to be useful enough for the tools to be passed to other IT groups (such as Desktop Support).  Kentik’s solution contained useful visualizations that could be used to explain issues to, say, leadership, but a network engineer would definitely need to provide context for them – you wouldn’t necessarily want leadership looking at it and making their own conclusions.

Both solutions provide data regarding web page loading and what might be causing delays in page loading, and both included BGP route information, but it appeared as if Catchpoint had more complete BGP information, and could provide data as to what paths to you might look like from around the world – though the question was asked “Can you provide information as to whether BGP is routing symmetrically?” and Catchpoint did answer, basically, “Not really.”  Catchpoint had a tool that could be installed on client computers to gather analytical data about web performance from the client’s view (as well as their data collector’s point of view); Kentik didn’t specifically mention an equivalent, but it would appear that their tests originate only from their data collector.

Both solutions also seem to have their main interface for visualizing what they’re collecting in the cloud, but had data collection/test hosts that you would install on prem to collect the local data/run local tests, then send it on to their databases.

Finally – realize this is weeks after the presentations and I’ve been using the videos available on the event site to refresh my memory.  Obviously, I’ve had to draw some conclusions since there were potential questions that, had I thought about it then, I would have asked … and had I been a little better prepared, might have put together to ask both Catchpoint and Kentik. Which one do I think I’ll be looking into in the future for $JOB? That’s a good question.  Prior to researching this post, I would have reflexively said Catchpoint, and it’s not like I’m excluding it yet. However, there are things we are doing locally that could heavily use some modernization, and Kentik looks more like a solution for that problem.

Questions for the future

This section is likely going to be edited as things occur to me, but at the moment:

Kentik specific:

  • Can I implement specific equipment polling for things like UPS runtime or environmental values, optical light levels, or other SNMP values in order to, say, replace my current graphing solution?
  • Is there a way to script polled data to obtain it via SSH? (some platforms have issues collecting certain data that, while available via SNMP, seems to hit polling limits or drive up the management CPU enough to trigger alerts)
  • How much data (even if it’s subject to summation over time) is retained? (some graphing systems would cover a year, but keeping up to five years’ worth of data I could see on a graph might be useful)

