Creating a status-aware system using Elasticsearch

Hi
I am new to ELK so be nice :)
I am trying to use ELK to keep track of the alarm status of our network. Our current system is so old and user-unfriendly that we are looking into open-source tools instead.
So far I have tried this: alarm traps (SNMP) are aggregated by a SIEM system, which writes them to a log file. Logstash sends the logs to Elasticsearch with each field nicely separated, and I use a time-based index. So far, so good. The problem is that I cannot easily see whether an alarm has been cleared (that is, whether a clear trap has been received).
Each alarm trap has an alarm ID field, and a field which indicates if it is a "Trigger", "Update" or "Clear" trap, and an alarm can be active for minutes, days or even weeks. What I would like is to somehow be able to query for alarms where there is no "Clear" trap received yet, and that should give me all active alarms.

What would be the best way to accomplish this?
Is there a way to write a query for this?
Should I use other indices (one index per alarm ID, although that would create a LOT of indices)? Or should I create a new document only for a "Trigger" trap and update that document when an "Update" or "Clear" trap arrives? I am afraid the latter could be heavy on the system if I have to search through all indices from day one.

Thanks in advance for any hints or tips you may provide.

How many unique alarmIDs do you have? Thousands is probably manageable in one query but millions may require multiple requests.
Essentially you want a terms aggregation on the alarmID field as the root-level grouping, and under it a top_hits aggregation with size=1, sorted by date in descending order. That should give you the last observed status for every alarm ID. A pipeline aggregation could then strip out the alarm IDs whose last status is "Clear", leaving only the active alarms.
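The terms plus top_hits approach could look roughly like the sketch below. It is only an illustration under assumptions: the field names `alarmID`, `trapType` and `@timestamp` are guesses at the mapping (with `alarmID` assumed to be a keyword field), and the function names are made up for this example. For simplicity, the last step drops cleared alarms on the client side rather than with a pipeline aggregation.

```python
import json


def last_status_query(max_alarms=10000):
    """Build a search body that returns the newest trap for each alarm ID.

    Field names (alarmID, trapType, @timestamp) are assumptions about the mapping.
    """
    return {
        "size": 0,  # only the aggregation result is needed, not the raw hits
        "aggs": {
            "by_alarm": {
                "terms": {"field": "alarmID", "size": max_alarms},
                "aggs": {
                    "last_trap": {
                        "top_hits": {
                            "size": 1,
                            "sort": [{"@timestamp": {"order": "desc"}}],
                            "_source": ["alarmID", "trapType", "@timestamp"],
                        }
                    }
                },
            }
        },
    }


def active_alarms(response):
    """Keep only alarms whose most recent trap is not a "Clear"."""
    active = []
    for bucket in response["aggregations"]["by_alarm"]["buckets"]:
        last = bucket["last_trap"]["hits"]["hits"][0]["_source"]
        if last["trapType"] != "Clear":
            active.append(last)
    return active


# Hand-written stand-in for a search response, trimmed to the parts used above.
sample_response = {
    "aggregations": {"by_alarm": {"buckets": [
        {"key": "A1", "last_trap": {"hits": {"hits": [
            {"_source": {"alarmID": "A1", "trapType": "Clear",
                         "@timestamp": "2024-01-01T10:05:00"}}]}}},
        {"key": "B2", "last_trap": {"hits": {"hits": [
            {"_source": {"alarmID": "B2", "trapType": "Update",
                         "@timestamp": "2024-01-01T10:07:00"}}]}}},
    ]}}
}

if __name__ == "__main__":
    print(json.dumps(last_status_query(), indent=2))
    print(active_alarms(sample_response))
```

The body from `last_status_query()` would be sent to a search over the time-based indices. With millions of alarm IDs, a single terms aggregation of this size is unlikely to be practical, which is where splitting the work into multiple requests comes in.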

This is a rudimentary form of behavioural analysis ("lastKnownStatus = ?"). For more advanced analysis of behaviour over time, it may be worth considering an entity-centric index for alarms that records the last status and other attributes of interest, e.g. rates of change, time to clear, etc.
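To make the entity-centric idea concrete, here is a minimal sketch in plain Python of folding a stream of traps into one summary record per alarm. The field names (`alarmID`, `trapType`, `timestamp`) and the derived attributes (`lastStatus`, `trapCount`, `secondsToClear`) are illustrative assumptions, not an existing schema. In Elasticsearch, each resulting record would presumably be stored as one document keyed by the alarm ID and updated as new traps arrive.

```python
from datetime import datetime


def build_alarm_entities(traps):
    """Fold trap events into one summary record per alarm ID."""
    entities = {}
    ordered = sorted(traps, key=lambda t: datetime.fromisoformat(t["timestamp"]))
    for trap in ordered:
        seen_at = datetime.fromisoformat(trap["timestamp"])
        # Create the entity the first time this alarm ID is seen.
        entity = entities.setdefault(trap["alarmID"], {
            "alarmID": trap["alarmID"],
            "firstSeen": seen_at,
            "trapCount": 0,
            "secondsToClear": None,
        })
        entity["lastStatus"] = trap["trapType"]
        entity["lastSeen"] = seen_at
        entity["trapCount"] += 1
        if trap["trapType"] == "Clear":
            # Time from the first trap seen for this alarm until it was cleared.
            entity["secondsToClear"] = (seen_at - entity["firstSeen"]).total_seconds()
    return entities


# Made-up sample events for illustration.
sample_traps = [
    {"alarmID": "A1", "trapType": "Trigger", "timestamp": "2024-01-01T10:00:00"},
    {"alarmID": "B2", "trapType": "Trigger", "timestamp": "2024-01-01T10:01:00"},
    {"alarmID": "A1", "trapType": "Clear", "timestamp": "2024-01-01T10:05:00"},
    {"alarmID": "B2", "trapType": "Update", "timestamp": "2024-01-01T10:07:00"},
]

if __name__ == "__main__":
    for alarm in build_alarm_entities(sample_traps).values():
        print(alarm["alarmID"], alarm["lastStatus"], alarm["secondsToClear"])
```

With one record per alarm, finding the active alarms becomes a simple filter on `lastStatus` instead of an aggregation over every trap ever received.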
