Hi all! I am interning at an organisation and have been asked to research automation in the ELK stack. I don't really know what I'm doing, so any help would be appreciated!
A growing challenge my organisation faces with automation is assessing whether it is actually working, given the scale we now operate at.
Our users used to be on top of our processes, and as we automated steps within them, they were there to check the outcome and fix problems. That model has moved on: users no longer look at individual processes and instead just sit across the top, so they cannot spot problems as easily. If something errors in a software sense, it is caught by normal monitoring and dealt with by a support team. If, however, something we rely on in the real world has changed, we may not notice.
To combat this, we have started feeding system MI into an ELK stack and have created some dashboards to analyse this data. For example, one dashboard shows how many automation events of a certain type happen per day. If this drops off a cliff, we can then dig in to find the reason.
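To make this concrete, here is roughly the kind of check I imagine, as a minimal Python sketch. It assumes the daily counts have already been pulled out of Elasticsearch into a plain list (for instance from a date histogram aggregation). The function name, the 7-day window and the 50% threshold are all made up for illustration and are not something we run today.

```python
from statistics import median


def find_drops(daily_counts, window=7, threshold=0.5):
    """Return the indices of days whose count fell below
    threshold * (median of the preceding `window` days).

    All parameter values here are illustrative assumptions.
    """
    flagged = []
    for i in range(window, len(daily_counts)):
        # Baseline is the median of the previous `window` days, so a
        # single odd day in the past does not skew the comparison much.
        baseline = median(daily_counts[i - window:i])
        if baseline > 0 and daily_counts[i] < threshold * baseline:
            flagged.append(i)
    return flagged


# Made-up daily event counts; the last day "drops off a cliff".
counts = [120, 118, 125, 130, 122, 119, 127, 124, 121, 40]
print(find_drops(counts))  # prints [9]
```

A median baseline is only one possible choice; I picked it here because it is easy to reason about, not because I know it is the right approach at our scale.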
My question is: could we automate the analysis of the data, so that nobody has to look at a graph each day? Are people monitoring automation at scale in other ways, and if so, what are they?
Thanks so much for any help or pointers you might be able to offer!