Is there a slow log (or something similar) for shard refresh durations?

Hi All,

I'm attempting to debug a somewhat strange issue. I have a query, set up as a Kibana rule, which runs every 60 seconds to check a set of logs and triggers an alert if there have been no logs for 90 seconds.
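Roughly speaking, the check the rule performs is equivalent to something like this (a minimal sketch using the Python client; the index pattern, connection details, and threshold here are placeholders, not my actual setup):

```python
from datetime import datetime, timezone

from elasticsearch import Elasticsearch  # official 8.x Python client

es = Elasticsearch("http://localhost:9200")  # placeholder connection details

# Count documents seen in the last 90 seconds; the rule alerts when this is 0.
resp = es.count(
    index="my-logs-*",  # placeholder index pattern
    query={
        "range": {
            "@timestamp": {
                "gte": "now-90s",
                "lte": "now",
            }
        }
    },
)

now = datetime.now(timezone.utc).isoformat()
if resp["count"] == 0:
    print(f"{now} - no logs in the last 90s, alert would fire")
else:
    print(f"{now} - {resp['count']} logs in the last 90s")
```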

These logs get generated (and in theory indexed) at a regular interval, every ~20 seconds, so the only time this alert should fire is when logs are not being generated for some reason. However, I have now had, on a number of occasions, false positive alerts where the rule thinks there are no logs, but when I go back and check, the logs do exist for the alerting window (I am looking at both the @timestamp and event.ingested times).

I have a hypothesis that these false positives happen because the Elasticsearch node which holds the index becomes overloaded (CPU maxed out) for a period of time, causing shard refreshes to take longer than expected, which in turn causes the query to not see the logs even though they technically exist. What I haven't found is a way to prove or disprove this hypothesis, as by the time I actually get to look at the logs in question they do exist. Does anyone know if there is a way to have Elasticsearch log when a shard refresh takes longer than a specific duration? (Or have another idea for trying to debug this issue?)
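The closest workaround I've come up with so far is to periodically poll the per-index refresh stats and the refresh thread pool and watch for spikes, something along these lines (just a sketch of what I mean, not something I have running; the index pattern, connection details, and poll interval are placeholders):

```python
import time

from elasticsearch import Elasticsearch  # official 8.x Python client

es = Elasticsearch("http://localhost:9200")  # placeholder connection details
INDEX = "my-logs-*"  # placeholder index pattern

prev_total = prev_millis = None

while True:
    # Per-index refresh stats: cumulative refresh count and time spent refreshing.
    stats = es.indices.stats(index=INDEX, metric="refresh")
    refresh = stats["_all"]["total"]["refresh"]
    total, millis = refresh["total"], refresh["total_time_in_millis"]

    if prev_total is not None and total > prev_total:
        avg_ms = (millis - prev_millis) / (total - prev_total)
        print(f"avg refresh time since last poll: {avg_ms:.0f} ms")

    prev_total, prev_millis = total, millis

    # Refresh thread pool per node: a growing queue would also suggest refreshes falling behind.
    nodes = es.nodes.stats(metric="thread_pool")
    for node in nodes["nodes"].values():
        tp = node["thread_pool"]["refresh"]
        print(f"{node['name']}: refresh queue={tp['queue']} active={tp['active']}")

    time.sleep(30)  # placeholder poll interval
```

But that only gives me averages between polls, not a log of the specific refreshes that ran long, which is why I'm hoping there is something built in.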

For reference, I'm running Elastic Stack 8.5.3.
