Elasticsearch Client Nodes OOM Killed by Gargantuan Query

Hi all!

I think we may have discovered a bit of an edge case or bug with Elasticsearch, and I'm just looking to confirm that this is a known issue, or potentially document a new bug. One of my analysts constructed a query consisting of ~3 million terms in a terms query within the must clause of a bool query. After submitting this gargantuan query, my client nodes almost immediately get OOM killed by their host system.

I am fully aware that this is an... interesting method of attempting to retrieve data. I have worked with this person to get a working query, but the interesting part to me is that the only sign of a problem (aside from all of my ES client node services being dead) is the OOM killer message in the syslog. I would have hoped for a message like "jesus dude what is this query, it killed me" in my node logs, or at least some other representative message. In fact, I have no logs at all related to this. I was only able to piece the "root cause" together from the OOM killer timestamps and requests to an alternate data store.
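For anyone who lands here with the same problem, the workaround we used was along these lines: split the giant term list into batches and issue several smaller terms queries instead of one enormous one. This is a minimal sketch only; the field name (`user_id`) and the batch size are made up for illustration, and the right batch size for you depends on your cluster.

```python
# Hypothetical sketch: break a ~3-million-term list into batches,
# each wrapped in its own bool/must/terms query body, rather than
# sending everything in a single request.

def build_batched_queries(field, terms, batch_size=65536):
    """Yield one query body per batch of terms."""
    for start in range(0, len(terms), batch_size):
        batch = terms[start:start + batch_size]
        yield {
            "query": {
                "bool": {
                    "must": [
                        {"terms": {field: batch}}
                    ]
                }
            }
        }

# Example: 3 million synthetic IDs become 46 manageable requests.
ids = [f"user-{i}" for i in range(3_000_000)]
queries = list(build_batched_queries("user_id", ids))
print(len(queries))  # -> 46
```

Each body can then be POSTed to `_search` separately and the results merged client-side.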

I am running Elasticsearch 6.5.1, and the user was directly connecting to ES. Maybe this has been fixed in a newer version?

Thanks for your time!

There is no way the node could log anything if it is shut down by the OOM killer. By the time you get into this state it's too late: the OS takes over and kills the process without giving it any chance to clean up or write a log message.

However, if the OOM killer got to you before the JVM reported an OutOfMemoryError then I suspect your heap size is set too high. It must be set to less than 50% of the machine's total memory, since the JVM can use up to around double the configured maximum heap size; even at 50% of total memory there is no headroom left for the OS and other processes on the same system.
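As a concrete illustration of that sizing rule, on a 16GB host the heap settings in `jvm.options` might look something like this (the 4g value is illustrative, not a recommendation; Xms and Xmx should always match):

```
## jvm.options -- heap sized well under 50% of a 16GB host,
## leaving headroom for the OS, filesystem cache, and the JVM's
## own off-heap memory use
-Xms4g
-Xmx4g
```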

Also, there have been changes in this area in Elasticsearch 7 to make it much better at pushing back on unreasonable searches.


Thanks for the information here. Upgrading to ES7 is definitely on our horizon; I will dig into the change logs. We had 8GB of our 16GB configured for heap, so maybe our problem revolves around that configuration.

Thanks for taking the time to reply.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.