Excessively high bulk work_queue counts right before OOM

Going over the Prometheus data that my ELK cluster spewed out right before all of my coordinators died from running out of heap space, I saw a huge excursion in the number of tasks in the bulk work_queue. I haven't fiddled with any of the settings for this queue, so the queue size limit is still 200, but according to Prometheus, after bumping up against that limit for quite some time, the queue size then jumped to more than 1,300 right before the nodes crashed.
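To illustrate why that reading surprised me (this is a toy model, not Elasticsearch internals): a bounded queue with a capacity of 200 should reject further submissions once it's full, not keep growing past its limit.

```python
import queue

# Toy bounded queue with the same capacity as the bulk queue_size limit.
q = queue.Queue(maxsize=200)

accepted = rejected = 0
for task in range(1300):
    try:
        q.put_nowait(task)   # analogous to enqueueing a bulk task
        accepted += 1
    except queue.Full:       # analogous to a 429-style rejection
        rejected += 1

print(accepted, rejected)  # 200 1100
```

So under the simple model, 1,100 of those 1,300 tasks should have been rejected rather than ever showing up as queued.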

Obviously the cluster was overwhelmed with incoming logs, and we're working to address that by providing more resources, but my question is: why did this result in an OOM crash instead of graceful rejections with "429 Too Many Requests"? Is there a "queue before the queue" that we may have overwhelmed with retries, as this comment on an issue seems to suggest? (In the past we have noticed that high-load events like this sometimes result in duplicate log entries.)
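For context on the retry theory: our shippers retry rejected bulk requests, and I wonder whether immediate retries just pile more work onto the coordinators. A backoff sketch like the one below (hypothetical `send` callable standing in for a bulk request; not our actual shipper code) is roughly what I'd expect well-behaved clients to do instead:

```python
import random
import time

def send_with_backoff(send, max_retries=5, base_delay=0.05):
    """Retry a bulk request on 429, backing off exponentially with jitter.

    `send` is a hypothetical callable returning an HTTP status code.
    Naive immediate retries turn every rejection into more pressure on
    the coordinator; backing off gives the queue a chance to drain.
    """
    for attempt in range(max_retries + 1):
        status = send()
        if status != 429:
            return status
        # Exponential backoff with full jitter before the next attempt.
        time.sleep(random.uniform(0, base_delay * 2 ** attempt))
    return 429  # give up; caller should buffer or shed load instead

# Simulated server: rejects the first two attempts, then accepts.
responses = iter([429, 429, 200])
print(send_with_backoff(lambda: next(responses), base_delay=0.001))  # 200
```

If the answer really is a retry amplification problem, something along these lines on the client side might be part of the fix.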

Thanks for any insight you can provide!

