In this post, Nina Bargisen explores how peering coordinators can use combined NetFlow and BGP analysis tools to work around different capacity upgrades when money or delivery times pose a challenge.
Peering is great for quality and cost savings, but how do you ensure your network is up to the task?
Capacity planning is a classic task that should be very familiar to network operators. Providing the right capacity at the right time is crucial to finding a balance between cost and quality in your network.
The obvious capacity constraint on an edge router is bandwidth, but handling the ever-increasing number of prefixes can push CPU and memory to their limits when the BGP process is crunching the data. So in a normal operational cadence, you will be monitoring your CPU and memory usage as well as the link loads on your equipment.
But in the wake of the global pandemic, and now the war in Europe in 2022, delivery times for equipment have started to climb. So what can be done to manage capacity when upgrades may be delayed?
The truth is in the traffic
Let’s first look at the situation with external interfaces running full.
When external interfaces run full, and we cannot upgrade the capacity, the only solution is to try to move network traffic elsewhere. It can be acceptable for some types of traffic to just let the interface run full and drop packets, but most of us want to avoid that solution as much as possible.
To do this efficiently, we want to chase a set of prefixes that will move precisely the right amount of traffic. We also need to know where it is best to move that traffic, from both a quality and an economic perspective.
So, where do we move the traffic?
We already discussed the importance of keeping track of our connectivity cost when making peering decisions. More than ever, it is vital to balance cost against the optimal route and, to some extent, quality.
Alternative paths worth checking include:
- Another direct connection to the ASN
- A peering ASN that is upstream to the ASN
- A transit connection with available capacity within the commit
- The cheapest transit connection if we do not have any room within the commit
We can always debate whether moving traffic to a peer or to a transit connection with room within the commit is the best option. Price-wise, these are similar, so choosing the best path becomes the more important decision.
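The cost side of this comparison can be sketched in a few lines. The example below is a minimal illustration, not a billing model: the `CandidatePath` fields, the flat overage price per Mbps, and the 80% utilization ceiling are all assumptions made for the example, and real transit contracts are usually billed on the 95th percentile.

```python
from dataclasses import dataclass

@dataclass
class CandidatePath:
    name: str
    kind: str                      # "direct", "peer", or "transit"
    capacity_mbps: float           # usable port capacity
    current_mbps: float            # current peak load
    commit_mbps: float = 0.0       # transit commit (assumed 0 for peering)
    overage_per_mbps: float = 0.0  # assumed flat price above the commit

def added_cost(path, moved_mbps):
    """Extra cost of adding moved_mbps to this path (0 while within the commit)."""
    if path.kind != "transit":
        return 0.0
    before = max(0.0, path.current_mbps - path.commit_mbps)
    after = max(0.0, path.current_mbps + moved_mbps - path.commit_mbps)
    return (after - before) * path.overage_per_mbps

def rank_paths(paths, moved_mbps, max_utilization=0.8):
    """Keep paths that can absorb the traffic and sort them cheapest first."""
    fits = [p for p in paths
            if p.current_mbps + moved_mbps <= p.capacity_mbps * max_utilization]
    return sorted(fits, key=lambda p: added_cost(p, moved_mbps))
```

Paths with equal added cost keep their input order, which reflects the point above: when peering and in-commit transit cost the same, path quality has to decide.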
In the examples below, you can learn more about how to validate that the traffic will move to the selected path.
What to move and how?
The foundation of this analysis is knowing our traffic per prefix and per ASN. We will need a good NetFlow tool that gives us enough insight in both directions.
For inbound traffic:
Select prefixes (or ASNs) whose combined traffic adds up to the amount we need to move, measured over a long enough period. Then remove these prefixes from the announcements on the eBGP sessions on the particular interface.
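The selection step can be sketched as a simple greedy pass over per-prefix traffic exported from a NetFlow tool. This is only an illustration: the `pick_prefixes` helper, its tolerance parameter, and the input mapping are hypothetical, and per-prefix peaks do not always add up linearly, so the total is an estimate.

```python
def pick_prefixes(traffic_mbps, target_mbps, tolerance_mbps=50.0):
    """Greedily pick prefixes whose combined traffic approaches target_mbps.

    traffic_mbps maps prefix -> traffic in Mbps, measured over a long
    enough period (for example a 95th percentile over several weeks).
    """
    chosen, total = [], 0.0
    # Largest contributors first, so the list of prefixes stays short.
    for prefix, mbps in sorted(traffic_mbps.items(),
                               key=lambda kv: kv[1], reverse=True):
        # Skip prefixes that would overshoot the target.
        if total + mbps <= target_mbps + tolerance_mbps:
            chosen.append(prefix)
            total += mbps
        # Stop as soon as we are close enough to the target.
        if total >= target_mbps - tolerance_mbps:
            break
    return chosen, total
```

A usage example with documentation prefixes:

```python
traffic = {
    "203.0.113.0/24": 900.0,
    "198.51.100.0/24": 400.0,
    "192.0.2.0/25": 250.0,
    "192.0.2.128/25": 60.0,
}
prefixes, moved = pick_prefixes(traffic, target_mbps=700.0)
```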
Now, where will the traffic move to?
If we have other direct connections to the ASN, traffic will move to the best route from the source of the traffic unless the ASN in question is a CDN. In this case, the traffic will most likely move to the connection closest to this CDN for the destinations in the network. If the addresses in the prefixes are spread all over the network, it might be challenging to predict how much traffic is moving where. If they are used in more well-defined areas of the network, for example, in a metro, we can better predict where the traffic will move.
If we do not have any other connections to the ASN or ASNs, then we need to understand how the traffic sources we remove will reach the network. If the source ASN has a looking glass, that is the best way to find out. If not (which is unfortunately quite common these days), we can do some qualified guessing by using tools like Kentik Market Intelligence or BGP analysis tools. With this, we can determine which providers the ASN uses and work out the potential paths to the network. Inbound traffic is complex, so be prepared for surprises and keep a good eye on the quality of the traffic after the move.
For outbound traffic:
Select destination prefixes whose combined traffic gives us the right amount. Filter those on the eBGP session on the particular interface, so we do not accept them from our peers or transits over the interface.
It is much easier to predict where traffic moves for outbound traffic since we are in control, or at least have full knowledge, of the routing policies out of the network. If we do not have access to an internal looking glass, we can check what BGP routes to the filtered prefixes exist on the other edge routers. Recall that the BGP tables will contain all the BGP paths known to the router and show alternatives if the best is running via the interface we are trying to offload. We may need to tweak the BGP policies to ensure the best route after filtering.
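As a rough sketch of that check, the snippet below takes the BGP paths known for one filtered prefix across the edge routers, drops the paths via the interface being offloaded, and picks the likely new exit. The input layout is hypothetical, and the comparison covers only local preference and AS path length, a simplified subset of the BGP decision process that ignores origin, MED, IGP cost, and other tie-breakers.

```python
def predict_exit(paths, offloaded):
    """Return the likely best remaining path for a prefix, or None.

    paths: list of dicts with keys "router", "interface", "local_pref"
           and "as_path" (list of ASNs), collected from all edge routers.
    offloaded: (router, interface) tuple we are filtering the prefix on.
    """
    remaining = [p for p in paths
                 if (p["router"], p["interface"]) != offloaded]
    if not remaining:
        return None  # the prefix would become unreachable via BGP
    # Simplified best-path choice: highest local-pref, then shortest AS path.
    return max(remaining,
               key=lambda p: (p["local_pref"], -len(p["as_path"])))
```

If the result is not the path we want, that is the signal to tweak local preference or other policy before applying the filter.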
To see how Kentik can help with your traffic engineering, check out the Traffic Engineering workflow.
Overloaded CPU and memory due to BGP processes
In this case, we will only consider the outbound traffic. We will look for ASNs that announce large numbers of prefixes but to which we send only a little traffic. If we have a traffic volume issue at the same time, this may also solve that issue.
The goal is to reduce the number of prefixes the router needs to process. If we do not have volume issues, we want to preserve as much outbound traffic as possible on the interfaces.
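One way to make this trade-off concrete is to rank ASNs by outbound traffic per received prefix. The sketch below assumes two inputs we would have to build ourselves: a prefix count per ASN and outbound traffic per ASN from NetFlow. The function name and the `min_prefixes` threshold are illustrative.

```python
def filter_candidates(prefix_count, traffic_mbps, min_prefixes=1000):
    """Rank ASNs by outbound traffic per prefix, lowest first.

    prefix_count: ASN -> number of prefixes received from that ASN.
    traffic_mbps: ASN -> outbound traffic in Mbps (missing means none seen).
    Returns tuples of (asn, prefixes, mbps, mbps_per_prefix).
    """
    rows = []
    for asn, count in prefix_count.items():
        if count < min_prefixes:
            continue  # too few prefixes to relieve the router much
        mbps = traffic_mbps.get(asn, 0.0)
        rows.append((asn, count, mbps, mbps / count))
    # Least traffic per prefix first; ties go to the larger prefix count.
    return sorted(rows, key=lambda r: (r[3], -r[1]))
```

ASNs at the top of the list free the most routing table space for the least traffic moved.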
Examples of the kind of ASNs we can chase:
- ASNs with a large number of prefixes with little to no traffic
- Provider networks and their customers where you peer with both parties
- Provider networks, where only one or a few downstreams are significant destinations
ASNs with minimal traffic and a large number of prefixes
The NetFlow analysis lets us know which networks are candidates to be filtered, but getting the number of received prefixes can sometimes take time and effort. This is straightforward if we have a peering tool available with received routes per peer or have CLI access and can just look it up. But if we don’t have either, we will need help from the same tools we used when estimating how the inbound traffic would move around. In that case, you can look up both transited and originated routes per ASN, and the sum is a reasonable estimate of what each ASN is announcing to you.
Note that if you are peering on a route-server at an internet exchange, this estimate is the best guess unless you want to parse through all the received routes and count prefixes per ASN with a script.
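Such a counting script could look like the sketch below. It assumes the received routes have already been parsed into `(prefix, as_path)` pairs, since CLI and API output formats differ by vendor, and it assumes the route server is transparent, so the first ASN in the path is the announcing member.

```python
from collections import Counter

def prefixes_per_peer(routes):
    """Count unique prefixes per announcing ASN.

    routes: iterable of (prefix, as_path), where as_path is a
            space-separated string such as "64500 64510".
    """
    seen = set()
    counts = Counter()
    for prefix, as_path in routes:
        hops = as_path.split()
        # Skip empty paths and anything that does not start with a plain ASN.
        if not hops or not hops[0].isdigit():
            continue
        peer = int(hops[0])
        if (peer, prefix) not in seen:
            seen.add((peer, prefix))
            counts[peer] += 1
    return counts
```

Calling `prefixes_per_peer(routes).most_common(20)` then gives the members announcing the most prefixes.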
Peering with provider networks and their customers
How do we identify those?
Breaking our NetFlow traffic into AS paths will not cut it, since our router has already picked the best route to install into the FIB. No traffic is sent over the other usable paths, so the flow data cannot reveal them and this approach is not viable.
The most precise way, however, would be to analyze the AS paths from all the received routes on your router. Some good scripting will give us what we need here.
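A minimal sketch of such a script is shown below. It assumes the received routes from all eBGP sessions are available as `(prefix, as_path)` pairs with the AS path as a list of integers, and it treats any direct peer that appears further down another peer's AS path as a downstream of that peer. Whether that downstream is a paying customer cannot be read from the path alone.

```python
from collections import defaultdict

def overlapping_peers(routes):
    """Find direct peers that are also reachable behind another direct peer.

    routes: iterable of (prefix, as_path) from all eBGP sessions,
            with as_path as a list of ASNs, neighbor first.
    Returns {(upstream_peer, downstream_peer): set of prefixes}.
    """
    routes = list(routes)
    # Every ASN that appears first in a path is a direct neighbor.
    direct_peers = {path[0] for _, path in routes if path}
    pairs = defaultdict(set)
    for prefix, path in routes:
        if not path:
            continue
        neighbor = path[0]
        for asn in path[1:]:
            # Ignore prepending; keep ASNs we also peer with directly.
            if asn != neighbor and asn in direct_peers:
                pairs[(neighbor, asn)].add(prefix)
    return dict(pairs)
```

The size of each prefix set shows how many routes we receive twice, once directly and once via the upstream peer.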
Another way to get started is to identify the ASNs announcing the largest numbers of prefixes. This shortens the candidate list enough that a manual check, using the BGP analysis tools, of whether we also have a relationship with one of their providers will suffice.
Once you know which provider/customer pairs we peer with, you can evaluate which to keep and which to either filter or de-peer to offload the router. Maybe it’s best to keep a provider who aggregates several other ASNs. Maybe it is best to keep the customers and de-peer the provider. Or maybe we can filter most of the prefixes and only use a provider to reach a small subset of their customers.
While evaluating, we want to consider the backup path for the ASNs we want to reach and the alternative paths to those we want to filter, just as we did when moving traffic to offload an interface.
Monitoring the quality of the traffic
From this point onwards, the main concern is maintaining high-quality traffic. When we do not directly connect to an ASN that we have traffic to or from, monitoring the quality becomes critical.
Monitoring for latency, jitter, and packet loss is a good indication of traffic quality. We can set up continuous testing from agents deployed at the peering locations in the network and create alerts when degradation happens.
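To illustrate what such an alert evaluates, the sketch below summarizes a batch of probe results and compares them against thresholds. The thresholds, the function names, and the definition of jitter as the mean absolute difference between consecutive round-trip times are assumptions for the example; a monitoring platform may compute these differently.

```python
from statistics import mean

def summarize(samples_ms):
    """Summarize probe results; samples_ms holds RTTs in ms, None for a lost probe."""
    if not samples_ms:
        raise ValueError("no samples")
    received = [s for s in samples_ms if s is not None]
    loss_pct = 100.0 * (len(samples_ms) - len(received)) / len(samples_ms)
    latency = mean(received) if received else None
    # Jitter here: mean absolute change between consecutive received RTTs.
    diffs = [abs(b - a) for a, b in zip(received, received[1:])]
    jitter = mean(diffs) if diffs else 0.0
    return {"latency_ms": latency, "jitter_ms": jitter, "loss_pct": loss_pct}

def degraded(current, baseline_latency_ms, latency_factor=1.5,
             max_jitter_ms=10.0, max_loss_pct=1.0):
    """Return the list of metrics that breach their (assumed) thresholds."""
    if current["latency_ms"] is None:
        return ["unreachable"]
    reasons = []
    if current["latency_ms"] > baseline_latency_ms * latency_factor:
        reasons.append("latency")
    if current["jitter_ms"] > max_jitter_ms:
        reasons.append("jitter")
    if current["loss_pct"] > max_loss_pct:
        reasons.append("loss")
    return reasons
```

Comparing latency against a baseline taken before the traffic move, rather than a fixed number, makes it easier to see whether the new path itself caused the change.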
We can also dig into the traceroutes when alerted and see if traffic uses the paths we were planning.
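That comparison can be sketched at the AS level, assuming each traceroute hop has already been mapped to an ASN by some lookup of our choice (the mapping itself is outside this example, and hops on exchange LANs or unresponsive hops may map to nothing).

```python
def as_level_path(hop_asns):
    """Collapse per-hop ASNs into an AS-level path, skipping unknown hops (None)."""
    path = []
    for asn in hop_asns:
        if asn is None:
            continue
        if not path or path[-1] != asn:
            path.append(asn)
    return path

def follows_plan(hop_asns, planned_path):
    """True if the observed AS-level path matches the path we planned for."""
    return as_level_path(hop_asns) == list(planned_path)
```

A mismatch does not always mean a problem, but it is a good prompt to check the routing before quality degrades further.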
We have covered some suggestions on what can be done when necessary upgrades are impossible. Still, there are several more things to consider when going down this path, like the [risk of blackholing](https://www.kentik.com/blog/why-you-need-to-monitor-bgp/ "Learn more about blackholing in 'Why You Need to Monitor BGP'") and the relationship to our peers.
Join us soon as we dive into these issues and more in the next post in our ongoing peering series.
Don’t want to wait? Sign up for a Kentik demo today.