Every organization has an automation goal, and there’s no doubt that network automation is not only essential to avoid costly outages but also helps organizations scale without putting people in the work path. This is how Google is able to manage millions of servers running billions of containers each day, and how cloud-native companies have constructed their applications on top of new infrastructure underpinned by Kubernetes.
The problem is that every organization has a storied history of automation tools, meaning we already have at least a dozen of them in our organizations across various silos and stacks, some of which are commercial and some are open source.
Within the network domain, these older tools are often NCCM (network configuration and change management) tools that automate repetitive tasks. Examples include ManageEngine, Micro Focus, and SolarWinds, along with vendor-specific tools such as those from Cisco, Arista, and Juniper. There have also been some new entrants, including great open-source options such as NetBox, contributed by DigitalOcean. Aside from these NCCM tools, many organizations are also adopting network orchestration tools that promote infrastructure as code and DevOps cultures and methodologies.
In Gartner’s recent Market Guide for Network Automation, 2018 (if you’re a Gartner subscriber you can get it here), a survey of 205 network professionals shows adoption of Linux tools (such as Chef/Puppet/Ansible) for network automation as the most common approach (at about a third of respondents).
These DevOps tools can typically manage multiple types of infrastructure. Ansible is the one most commonly seen on the network side, but these implementations are often augmented by NAPALM or generic NETCONF/YANG in Python. Ansible automation is essentially custom code or scripting (known as playbooks); those who want to avoid the larger time investment of learning Ansible can instead write purpose-built automation with Python libraries such as netmiko or nornir. Either way, Python is the clear winner across the board.
You may ask a few typical questions when looking at these investments of time and/or money:
This is a big change: network engineers will have to make a significant skill jump, from thinking in terms of packets, routing, devices, and terminals to checking in code.
But the advantage is that once you adopt these practices, you can implement continuous validation within a release or testing pipeline. This means verifying that configurations are accurate, secure, compliant with policy, and generally of higher quality.
The result is fewer network outages caused by misconfigurations, whether from poor syntax or basic semantic errors; misconfigurations are the most common cause of outages. Open source tools like Batfish can be used for both basic and more advanced validation, and other alternatives exist as well.
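To make the idea of continuous validation concrete, here is a minimal sketch of the kind of check a pipeline could run before a configuration change is merged. It is not Batfish and does not model network behavior; it only lints configuration text. The rule names, the `Finding` type, and the sample configuration are all assumptions made for illustration.

```python
import re
from typing import Callable, NamedTuple


class Finding(NamedTuple):
    rule: str
    message: str


def no_telnet(config: str) -> list[Finding]:
    """Flag 'transport input' lines that explicitly name telnet."""
    findings = []
    for line in config.splitlines():
        if re.match(r"\s*transport input\b.*\btelnet\b", line):
            findings.append(Finding("no_telnet", f"telnet enabled: {line.strip()}"))
    return findings


def interfaces_have_descriptions(config: str) -> list[Finding]:
    """Flag interface stanzas that contain no description line."""
    findings = []
    current = None      # name of the interface stanza being read, if any
    described = False
    # A trailing "!" sentinel closes the final stanza.
    for line in config.splitlines() + ["!"]:
        if line.startswith(" ") and current is not None:
            if line.strip().startswith("description "):
                described = True
            continue
        # Any non-indented line ends the current stanza.
        if current is not None and not described:
            findings.append(
                Finding("interface_description", f"{current} has no description")
            )
        current, described = None, False
        if line.startswith("interface "):
            current = line.split(maxsplit=1)[1]
    return findings


# Hypothetical rule set; a real pipeline would load its own policies.
RULES: list[Callable[[str], list[Finding]]] = [no_telnet, interfaces_have_descriptions]


def validate(config: str) -> list[Finding]:
    """Run every rule and collect the findings in rule order."""
    return [finding for rule in RULES for finding in rule(config)]


SAMPLE_CONFIG = """\
hostname edge1
interface Ethernet1
 description uplink to core
 no shutdown
interface Ethernet2
 no shutdown
line vty 0 4
 transport input telnet ssh
"""

findings = validate(SAMPLE_CONFIG)
for finding in findings:
    print(f"{finding.rule}: {finding.message}")
```

In a CI job, a non-empty list of findings would typically fail the build, so the change never reaches a device until the configuration passes.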
The right question to answer is: “Why aren’t your infrastructure engineers learning some coding skills?”
In today’s environment, this is no longer optional. People who do not have development skills are not future-proofing themselves, since they cannot scale the organization effectively over time. If the network team doesn’t have these skills, a good step is to offer training as a professional development activity. They will thank you, and the team will scale better to meet demands.
The challenge with DevOps tools is that they are often great solutions for a greenfield network within a cloud or data center environment. Most organizations, however, have existing technical debt in the form of mixed legacy and modern equipment.
The net result is islands of automation as highlighted by this EMA research poll:
There are too many automation tools, each of which is used for specific gear, environments, or use cases. If you believe you don’t need another automation tool, the real answer is that you probably do, if you want to enable advanced use cases around CI and CD.
Ultimately, we at Kentik believe that automation is a key area for the integration of telemetry to drive several use cases. The automated network management capabilities promised by large vendors, again and again, focus on small problems. Examples include automating the fix for WiFi issues (rebooting or resetting ports) or pushing out an ACL.
While these are valid use cases, they’re issues that our legacy NCCM tools can already solve. These are not the problems that network operators spend time on, since we’ve already automated them for the most part. The concept and goal of a fully closed-loop system sounds wonderful, but in today’s heterogeneous networks it’s not yet feasible, especially as we augment our infrastructures with the public cloud. We can drive more efficient operations by providing network professionals with better information, more easily accessible, from within their existing workflows or goals.
No matter how far along you are in your transformation or optimization, there will always be other organizations that are more advanced, and plenty that are far behind. Everyone is grappling with multiple stacks and the complexity behind them. You are not alone, nor are you behind everyone. The industry is evolving, and as more operators have more provisioning, operations, debugging, and remediation workflows automated and orchestrated, we’ll move as an industry closer to being able to deliver on the promise of closed-loop automation.
In terms of real-world action, many of your peers’ workflows are routed through the tools they already use to interact within the networking group and across their enterprise.
For real-time collaboration, they are often using Slack, Microsoft Teams, or other common chat platforms to augment email and other messaging platforms. Every organization has adopted something at this point to increase collaboration and teamwork. Many have extended these platforms, and the more advanced products in this space have many bots and integrations.
One such integration is with network tooling.
We have hooked Kentik up to these systems to answer questions about network devices, paths, and networks, and even to pull information about telemetry and alarms. This is extremely useful when coupled with integrations to automation systems like Ansible and NetBox. Kentik has been working with a consulting partner, Network to Code, who has built this exact type of system.
As this integration matures, we will release more details on how you can get this in your environment or get help implementing it for your specific uses. You can get a preview at AnsibleFest. Please reach out to schedule a meeting with me, as I will be attending.
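As a rough illustration of the ChatOps pattern, the sketch below routes a chat message to a handler and returns a text reply. It is not the Kentik or Network to Code integration. The `!net` prefix, the command names, and the in-memory `INVENTORY` and `ACTIVE_ALARMS` data are hypothetical stand-ins for a real chat platform, inventory system, and alerting backend.

```python
from typing import Callable

# Hypothetical stand-ins for an inventory source and an alerting backend.
INVENTORY = {
    "edge1": {"site": "nyc", "role": "edge", "mgmt_ip": "192.0.2.10"},
}
ACTIVE_ALARMS = [
    {"device": "edge1", "summary": "traffic spike on Ethernet1"},
]

USAGE = "usage: !net device <name> | !net alarms [device]"


def cmd_device(args: list[str]) -> str:
    """Describe one device from the inventory."""
    if len(args) != 1:
        return USAGE
    record = INVENTORY.get(args[0])
    if record is None:
        return f"unknown device: {args[0]}"
    return f"{args[0]}: site={record['site']} role={record['role']} mgmt={record['mgmt_ip']}"


def cmd_alarms(args: list[str]) -> str:
    """List active alarms, optionally filtered by device name."""
    alarms = [a for a in ACTIVE_ALARMS if not args or a["device"] == args[0]]
    if not alarms:
        return "no active alarms"
    return "\n".join(f"{a['device']}: {a['summary']}" for a in alarms)


COMMANDS: dict[str, Callable[[list[str]], str]] = {
    "device": cmd_device,
    "alarms": cmd_alarms,
}


def handle_message(text: str) -> str:
    """Return a reply for messages addressed to the bot, or '' to stay silent."""
    parts = text.strip().split()
    if not parts or parts[0] != "!net":
        return ""
    if len(parts) < 2 or parts[1] not in COMMANDS:
        return USAGE
    return COMMANDS[parts[1]](parts[2:])


print(handle_message("!net device edge1"))
print(handle_message("!net alarms"))
```

Keeping the handlers as plain functions that take arguments and return text means the same logic can sit behind Slack, Microsoft Teams, or any other chat platform with only a thin adapter.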
For organizations that haven’t made the full leap into ChatOps, or that require work integrated across groups, we see automation-related workflows integrating with ticketing systems, notification systems, and other such tools, so that teams are alerted when things begin to go wrong or, in the worst case, once there is a major problem.
This type of automation is essential for any network operator, and these tools, in turn, can drive or at least trigger automation, even if it is partly human-driven once initiated. This is precisely why ServiceNow offers these capabilities in a single platform, which can be very beneficial to teams. Kentik integrates with ServiceNow for this reason as a standard notification channel and, for many of our customers, ServiceNow is a critical hub of response to technical issues and their remediation.
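The alert-to-ticket step can be sketched as a simple mapping from an alert to an incident record. This is an illustrative sketch only, not Kentik's ServiceNow integration. The alert fields (`severity`, `policy`, `device`, `detail`) and the severity mapping are assumptions; the output field names follow common ServiceNow incident fields, but a real instance may require different or additional ones. No network call is made here.

```python
import json

# Assumed mapping from alert severity to (urgency, impact); "1" is highest.
SEVERITY_TO_PRIORITY = {
    "critical": ("1", "1"),
    "major": ("2", "2"),
    "minor": ("3", "3"),
}


def build_incident(alert: dict) -> dict:
    """Translate a (hypothetical) alert dict into an incident payload."""
    severity = alert.get("severity", "minor").lower()
    # Unknown severities fall back to the lowest priority.
    urgency, impact = SEVERITY_TO_PRIORITY.get(severity, ("3", "3"))
    return {
        "short_description": f"[{severity.upper()}] {alert['policy']} on {alert['device']}",
        "description": alert.get("detail", ""),
        "urgency": urgency,
        "impact": impact,
        "category": "network",
    }


sample_alert = {
    "severity": "critical",
    "policy": "Traffic spike",
    "device": "edge1",
    "detail": "Inbound traffic on Ethernet1 exceeded the baseline.",
}

incident = build_incident(sample_alert)
print(json.dumps(incident, indent=2))
```

Building the payload in one small function keeps the mapping easy to review and test separately from whatever transport eventually delivers it to the ticketing system.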
Advanced users of Kentik integrate with our flexible APIs to access any data within Kentik to drive their own custom solutions. This could include verifying connectivity, traffic patterns, performance, and usage.
As an organization, we are driving towards making these types of integrations easier and enabling best practices, using expertise from customers and team members who run the most demanding and complex networks, augmented by machine learning to provide automated analysis. As these methods advance, we will be creating a new closed-loop system to make teams more efficient.
Customers today drive automation from Kentik to make operational decisions such as changing routing, deflecting threats, or scrubbing DDoS traffic. These advanced use cases are only possible with our real-time view of network traffic and anomalies within that traffic.
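A traffic-driven decision of this kind can be sketched as a small policy function. This is a hypothetical illustration, not how Kentik or its customers implement mitigation. The `TrafficSample` fields, the thresholds, and the action names are all assumptions chosen to show the shape of the logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrafficSample:
    dst_prefix: str        # destination under observation
    mbps: float            # current traffic rate
    baseline_mbps: float   # expected rate for this destination


def choose_action(
    sample: TrafficSample,
    scrub_ratio: float = 5.0,          # assumed: scrub at 5x baseline
    blackhole_mbps: float = 10_000.0,  # assumed: absolute ceiling
) -> str:
    """Pick a mitigation for one sample: 'blackhole', 'scrub', or 'none'."""
    if sample.mbps >= blackhole_mbps:
        return "blackhole"
    if sample.baseline_mbps > 0 and sample.mbps / sample.baseline_mbps >= scrub_ratio:
        return "scrub"
    return "none"


samples = [
    TrafficSample("203.0.113.0/24", mbps=12_000, baseline_mbps=400),
    TrafficSample("198.51.100.0/24", mbps=2_500, baseline_mbps=400),
    TrafficSample("192.0.2.0/24", mbps=450, baseline_mbps=400),
]

for sample in samples:
    print(sample.dst_prefix, choose_action(sample))
```

In practice the returned action would usually go to a human for approval or to an orchestration system, rather than being applied directly from a function like this.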