Hello,
I worked pretty extensively with ELK version 6.x a couple of years ago. I tried out version 7.1x yesterday and was blown away by the improvements that have been added to the platform. Truly great stuff.
The one area I am very excited about is the alerting (alarm) functionality. The GUI for alarms in the Observability section looks very promising. This got me thinking about the intent of the ELK stack and its role in network management.
I wanted to share some thoughts and ask for guidance, or really for validation, that the way ELK has evolved over the years is in line with the functionality needed to perform network service assurance.
Network Management
I have years of experience with tools like CA eHealth, CA Performance Center, Zabbix, SolarWinds, etc. The typical use cases of network fault management and performance management are staples of the industry.
Fault:
- Receive SNMP trap, syslog or poll SNMP agent on the end device.
- Evaluate the message against a known list of messages or counter values to decide if this is an alarm
- Evaluate the rate and frequency of messages over time so as not to deal with signals that are transitory in nature and do not need immediate action.
- Typical fault management functions like dampening/smoothing and X&Y are staples
- Open a ticket representing the trouble in the incident management platform of choice (Big IT, Jira, Slack, whatever...)
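To make the dampening step a bit more concrete, here is a rough Python sketch of one common form of it: only raise an alarm when the same message shows up at least N times inside a time window. The class name `Dampener` and the parameters `min_events` and `window_s` are made up for illustration; this is not how any particular product implements it.

```python
from collections import deque


class Dampener:
    """Hypothetical helper: alarm only when at least `min_events`
    messages arrive within the last `window_s` seconds."""

    def __init__(self, min_events: int, window_s: float):
        self.min_events = min_events
        self.window_s = window_s
        self._times = deque()  # timestamps of recent messages

    def observe(self, timestamp: float) -> bool:
        """Record one message; return True if an alarm should be raised."""
        self._times.append(timestamp)
        # Drop messages that have fallen out of the window.
        while self._times and timestamp - self._times[0] > self.window_s:
            self._times.popleft()
        return len(self._times) >= self.min_events


# Two early messages are transitory; three within 60 s trigger the alarm.
dampener = Dampener(min_events=3, window_s=60)
results = [dampener.observe(t) for t in (0, 10, 200, 210, 220)]
print(results)  # [False, False, False, False, True]
```

The same idea could in principle be expressed as a count-over-time rule once the messages are indexed, but the sketch shows the logic independent of any storage backend.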
Performance Management
- Creating reports and triggers that initiate a physical network build-out. For instance, some core link is running at 80%+ average bandwidth. This link would not allow for very much burst traffic, so we would need to add some capacity.
- Sometimes business customers stipulate that they need to receive a PDF report showing their link utilization, as billing is sometimes involved.
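As a small illustration of the capacity trigger, the sketch below turns two polls of an interface octet counter into a utilization percentage and checks the average against the 80% figure mentioned above. The function names are hypothetical, and it assumes a 64-bit counter that does not wrap or reset between polls.

```python
def utilization_pct(octets_start: int, octets_end: int,
                    interval_s: float, speed_bps: float) -> float:
    """Link utilization in percent between two counter polls.

    Assumes the octet counter did not wrap or reset during the interval.
    """
    bits = (octets_end - octets_start) * 8
    return 100.0 * bits / (interval_s * speed_bps)


def needs_capacity(samples_pct: list[float], threshold_pct: float = 80.0) -> bool:
    """True when the average utilization reaches the threshold."""
    return sum(samples_pct) / len(samples_pct) >= threshold_pct


# 1 Gbps link polled every 5 minutes, 30 GB transferred in the interval.
one_poll = utilization_pct(0, 30_000_000_000, interval_s=300, speed_bps=1_000_000_000)
print(one_poll)                             # 80.0
print(needs_capacity([78.0, 82.0, 85.0]))   # True
print(needs_capacity([50.0, 60.0, 95.0]))   # False
```

Averaging hides short bursts, which is exactly why a link sitting at 80% average is already a concern.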
Observations:
It really looks like ELK in its new incarnation could provide both fault and performance management functions, making the legacy tools less attractive given their high cost of ownership. Of course, ELK needs a robust collection layer using standard network protocols like SNMP, gRPC/telemetry, syslog, etc. for this to work. The vendors also have secret sauce in terms of SNMP device discovery profiles/support out of the box, so the customer does not have to build this himself, and they have some fancy correlation use cases (example: when a WAN router goes down and the devices behind it are isolated from the rest of the world).
With that said, for the simple use cases once the data is in the DB it feels like the platform is now fully featured in this regard. Elastic has always been able to scale so I know we could achieve large data throughput with required resiliency.
The big Questions/ELK Intent:
- Should people use ELK in this way?
- Is it well adapted for use cases where I collect 10k devices every 5 minutes which translates to 500k data points being driven to the DB every 5 min?
- Would I be able to run 50 different simple threshold alarms on my metrics? (Example: simple interface discards/errors % threshold crossing for the last 1 hour means we need to open a ticket)
- Is Elasticsearch really the correct DB choice to be used almost exclusively as a time-series store for operations metrics and faults? Some of my colleagues have told me to look at InfluxDB over ELK for this purpose. Does anyone have an opinion?
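To put rough numbers on the questions above, here is a back-of-the-envelope Python sketch of the ingest rate implied by 500k data points every 5 minutes, plus the kind of simple discard threshold check I have in mind. The function names and the 1% threshold are placeholders for illustration, and the percentage is computed as discards over discards plus delivered packets, which is only one of several definitions in use.

```python
DEVICES = 10_000
POINTS_PER_CYCLE = 500_000
CYCLE_SECONDS = 5 * 60

points_per_device = POINTS_PER_CYCLE / DEVICES        # 50.0 per device per poll
points_per_second = POINTS_PER_CYCLE / CYCLE_SECONDS  # about 1,667 per second
cycles_per_day = 24 * 60 * 60 // CYCLE_SECONDS        # 288 polls per day
points_per_day = POINTS_PER_CYCLE * cycles_per_day    # 144,000,000 per day


def discard_pct(discards: int, packets: int) -> float:
    """Discards as a percentage of all packets seen (assumed definition)."""
    total = discards + packets
    return 100.0 * discards / total if total else 0.0


def should_open_ticket(discards: int, packets: int,
                       threshold_pct: float = 1.0) -> bool:
    """True when the discard percentage over the window crosses the threshold."""
    return discard_pct(discards, packets) > threshold_pct


print(points_per_device, round(points_per_second), points_per_day)
print(should_open_ticket(discards=150, packets=9_850))  # 1.5% -> True
print(should_open_ticket(discards=5, packets=9_995))    # 0.05% -> False
```

So the steady-state load is on the order of a couple of thousand documents per second, before replicas and any retention or rollup policy are taken into account.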
Conclusion
Have any of you run ELK at scale for the purpose of network management? What has your experience been?
Thanks for reading, and I would be happy to talk about "classic network management" use cases if anyone is interested.