My cluster stopped ingesting due to low disk space but beats did not seem to recover all missed events

Newbie here. I had an incident the other day because I wasn't minding my index lifecycle policies. The cluster went yellow with unallocated shards (it looked like shards had moved completely off the first node to hit the high disk watermark), and indexing/ingestion stopped completely. I restarted the entire cluster, deleted some old time-based indices, and fixed my ILM policies.

After everything rebalanced and the cluster was green, I noticed that not all events from winlogbeat seem to have been recovered. I spot-checked a local beat log, and the beat had paused publishing during the low disk space condition. The logs were spammed with "failed to publish events: temporary bulk send failure". After I cleared the issue, the beats started pushing events again, and the index rate made it look like they had caught up on the missed events. However, when I graph event.created vs @timestamp per hour, it looks like many events from this period are still missing. The only events entering my cluster at this point are from beats.
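For anyone who wants to reproduce the hourly comparison outside of Kibana, here is a rough sketch of the idea in Python. It assumes the events have already been exported as flat dictionaries with ISO 8601 strings under the keys "@timestamp" (when the event occurred) and "event.created" (when the beat first read it); the function names and the sample data are made up for illustration and are not part of any Elastic tooling. If the backlog had been fully recovered, I would expect the outage hours to show normal counts by @timestamp, with the catch-up showing as a spike by event.created afterwards.

```python
from collections import Counter
from datetime import datetime, timezone


def hour_bucket(iso_ts: str) -> str:
    """Return the UTC hour bucket (e.g. '2023-01-01T10:00') for an ISO 8601 timestamp."""
    # Replace a trailing 'Z' so datetime.fromisoformat also works on Python < 3.11.
    dt = datetime.fromisoformat(iso_ts.replace("Z", "+00:00"))
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:00")


def hourly_counts(events, field):
    """Count events per hour using the given timestamp field (hypothetical flat key)."""
    counts = Counter()
    for event in events:
        value = event.get(field)
        if value is not None:
            counts[hour_bucket(value)] += 1
    return counts


def gap_report(events):
    """Return (hour, count by @timestamp, count by event.created) rows, sorted by hour."""
    by_occurred = hourly_counts(events, "@timestamp")
    by_created = hourly_counts(events, "event.created")
    hours = sorted(set(by_occurred) | set(by_created))
    return [(h, by_occurred.get(h, 0), by_created.get(h, 0)) for h in hours]


# Made-up sample: two events occurred during the outage and were shipped late.
sample_events = [
    {"@timestamp": "2023-01-01T10:15:00Z", "event.created": "2023-01-01T10:15:02Z"},
    {"@timestamp": "2023-01-01T10:45:00Z", "event.created": "2023-01-01T12:01:00Z"},
    {"@timestamp": "2023-01-01T11:05:00Z", "event.created": "2023-01-01T12:01:30Z"},
]

for hour, occurred, created in gap_report(sample_events):
    print(hour, "occurred:", occurred, "created:", created)
```

Hours with an unusually low count by @timestamp are the ones where events appear to be missing from the cluster.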

Here is event.created vs @timestamp:

Here is my indexing rate over time:

And here is a closeup showing the spike, then lowering of the index rate back down to normal for us:

I was under the impression that the beats (only winlogbeat and auditbeat on Windows at this point) would be able to resume from where they left off. Is there anything I should know about how they are expected to behave in a situation like this?


Bump. Any ideas?
