_DataId Health Monitoring

I'm looking to build logic that alerts whenever a log stoppage is detected in Elastic, tracked through _dataId. I'd appreciate any suggestions here.

Hello @saratsekhar

Welcome to the Community!!

Can you please share more details about the data / alert needed?

If you have a datastream/index pattern, we can create a data view, then create a rule that counts the number of records with _dataId; if the count is < 1 in the last 15 minutes, an alert should be triggered that there is no data in Elastic.
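The 15-minute count check behind such a rule can be sketched as a Dev Tools query (a sketch; `my-datastream` and the `@timestamp` field are assumed names, adjust to your data):

```json
GET my-datastream/_count
{
  "query": {
    "range": {
      "@timestamp": { "gte": "now-15m" }
    }
  }
}
```

If the returned `count` is 0, the rule's condition (count < 1 in the last 15 minutes) would fire.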

Thanks!!

Thank you,

I need to create an alert that detects complete log stoppage from critical servers. Rather than monitoring specific hosts, I want to track by dataId.

I attempted to use an index-based threshold rule with the condition set to trigger when document count falls below 1 within a specified time interval. However, this approach has two problems:

  1. High alert volume - The rule generates excessive alerts

  2. Lack of specificity - When an alert fires, I cannot identify which specific _dataId caused the stoppage

How can I configure this to properly track log stoppage per dataId and include that information in the alert?

Hello @saratsekhar

So if I understand correctly, _dataId is your server name? If yes, what is the unique count of critical servers for which data is received and monitoring is in place?

Thanks!!

Hello,

_dataId is unique per log type, e.g. windows, linux, firewall, database, proxy, etc. So I need an alert whenever any of these dataIds stops sending logs.

@Tortoise Is an index-based threshold rule recommended here?

Hello @saratsekhar

An index-based threshold rule can be used, but it will not serve your need, if I understand your use case correctly:

Index => ABC
In this index we continuously receive data.
There is one field, _dataId, which can have various values, for example windows/mac/os/android.

Now the rule should run and tell us that in the last 15 minutes we have not received data for windows/mac, i.e. when the count of those records is 0?

Actually, if the set of sources is fixed (windows/mac/os/android), then we will have to go with a Watcher, as shared here:

because a rule will alert that there has been no data in the last 15 minutes, but it will not be able to output which source has no data.
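For example, to see which _dataId values did send data in the last 15 minutes, a terms aggregation can be used (a sketch; the index name `ABC` and the `@timestamp` field are assumptions, and `_dataId` is assumed to be a keyword field):

```json
GET ABC/_search
{
  "size": 0,
  "query": {
    "range": { "@timestamp": { "gte": "now-15m" } }
  },
  "aggs": {
    "per_dataId": {
      "terms": { "field": "_dataId", "size": 50 }
    }
  }
}
```

Any expected _dataId missing from the returned buckets has sent no data in that window — comparing the buckets against a fixed list of expected sources is exactly what the Watcher would do.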

Thanks!!

Thanks for providing the information. Is the last statement applicable to the threshold rule or to a Watcher?

because a rule will alert that there has been no data in the last 15 minutes, but it will not be able to output which source has no data.

Hello @saratsekhar
The below statement was about the rule:

because a rule will alert that there has been no data in the last 15 minutes, but it will not be able to output which source has no data

As for the Watcher example, there is a fixed list of sources from which you expect to receive data, and if any of those sources has no data, the Watcher will say which source has no data.
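A minimal sketch of such a Watcher, assuming the index is `ABC`, the timestamp field is `@timestamp`, `_dataId` is a keyword field, and the fixed source list is windows/linux/firewall/database/proxy (adjust all of these to your data):

```json
PUT _watcher/watch/dataid_stoppage
{
  "trigger": { "schedule": { "interval": "15m" } },
  "input": {
    "search": {
      "request": {
        "indices": [ "ABC" ],
        "body": {
          "size": 0,
          "query": { "range": { "@timestamp": { "gte": "now-15m" } } },
          "aggs": {
            "per_dataId": { "terms": { "field": "_dataId", "size": 50 } }
          }
        }
      }
    }
  },
  "condition": {
    "script": {
      "source": """
        def expected = ['windows', 'linux', 'firewall', 'database', 'proxy'];
        def seen = [];
        for (b in ctx.payload.aggregations.per_dataId.buckets) { seen.add(b.key); }
        for (e in expected) { if (!seen.contains(e)) { return true; } }
        return false;
      """
    }
  },
  "transform": {
    "script": {
      "source": """
        def expected = ['windows', 'linux', 'firewall', 'database', 'proxy'];
        def seen = [];
        for (b in ctx.payload.aggregations.per_dataId.buckets) { seen.add(b.key); }
        def missing = [];
        for (e in expected) { if (!seen.contains(e)) { missing.add(e); } }
        return [ 'missing': missing ];
      """
    }
  },
  "actions": {
    "log_missing": {
      "logging": {
        "text": "No logs in the last 15 minutes for _dataId: {{ctx.payload.missing}}"
      }
    }
  }
}
```

The condition fires only when at least one expected source is absent from the aggregation buckets, and the transform builds the `missing` list so the action can report exactly which sources stopped sending logs.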

Thanks!!

Thanks @Tortoise, could you please advise on the steps to create this in Watcher?

Hello @saratsekhar

Please review the Watcher shared in the below link; it has a similar use case but different values:

Thanks!!

Alerting on missing data can be a little tricky sometimes. What worked for me was creating a custom ES|QL rule to calculate the lag between the current time (the rule execution time) and the time of the last event.

FROM index
| STATS last_timestamp = MAX(event.ingested) BY event.dataset
| EVAL lag = DATE_DIFF("minute", last_timestamp, NOW())
| WHERE lag >= 15
| LIMIT 100

In this case, this calculates the lag between the time of the rule execution and the time of the last indexed event (here I'm using the event.ingested field). It also groups by a particular field, in this case event.dataset, and returns every event.dataset where the lag is equal to or higher than 15 minutes.

I need to run this rule with a look-back window of at least twice the lag time, in this case 30 minutes, so I will get at least one alert.

You may try to adapt this query to your data to see if it works.
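To test the query by hand before putting it in a rule, it can be run through the ES|QL query API in Dev Tools (a sketch; the index name, field names, and 15-minute threshold are taken from the example above and should be adapted to your data):

```json
POST /_query
{
  "query": """
    FROM index
    | STATS last_timestamp = MAX(event.ingested) BY event.dataset
    | EVAL lag = DATE_DIFF("minute", last_timestamp, NOW())
    | WHERE lag >= 15
    | LIMIT 100
  """
}
```

Each returned row is an event.dataset whose most recent event is at least 15 minutes old.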

Thanks for sharing the information :), I will give it a try.