Hi All,
I'm attempting to debug a somewhat strange issue. I have a query which runs every 60 seconds to check a set of logs, and if there are no logs for 90 seconds it triggers an alert (this is done via a Kibana rule).
These logs get generated (and in theory indexed) on a regular interval, every ~20 seconds, so the only time this alert should fire is when logs are not getting generated for some reason. However, I have now had a number of false positive alerts where the rule thinks there are no logs, but when I check, the logs do exist for the alerting window (I am looking at both the @timestamp and event.ingested times).
My hypothesis is that these false positives happen because the Elasticsearch node holding the index becomes overloaded (CPU maxed out) for a period of time, which causes a shard refresh to take longer than expected, so the query does not see the logs even though they technically exist. What I haven't found is a way to prove or disprove this hypothesis, because by the time I actually get to look at the logs in question, they exist. Does anyone know of a way to have Elasticsearch log when a shard refresh takes longer than a specific duration? (Or have another idea for trying to debug this issue?)
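In case it helps, one thing I was considering is periodically polling the refresh stats and the refresh thread pool to try to catch a slowdown as it happens (the index name below is just a placeholder for my actual index):

```
GET my-logs-index/_stats/refresh

GET _cat/thread_pool/refresh?v&h=node_name,active,queue,rejected,completed
```

My thinking is that if I snapshot `refresh.total_time_in_millis` and `refresh.total` on an interval, a spike in the per-refresh average (delta of time divided by delta of count) around the time of a false positive would support the theory, as would a growing `queue` or non-zero `rejected` count on the refresh thread pool. But this feels like a workaround, so I'd still prefer something built in that logs slow refreshes directly, if such a thing exists.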
For reference, I'm running Elastic Stack 8.5.3.