[SOLVED] Whole cluster down after sorting on _id

Hi,

I just had a massive issue with my ES prod cluster. I use an ES plugin for the IntelliJ IDE that shows search results as a table, which can be sorted by clicking on a column: it then adds a sort parameter corresponding to that column.
The issue is that I misclicked on the _id column (right next to the column I wanted to sort on), and sorting on it was not disabled in the plugin.
I do know that sorting on _id is not recommended, but I expected ES to just return an error.
Instead, the whole cluster went down progressively, node by node (only one survived). The search used a wildcard that matches indices spread across my 6 nodes.

I tried the same on my dev ES (just one node, in Docker): it crashed too, but after restarting and running the same test, I got a CircuitBreakingException, which is fine: java.util.concurrent.ExecutionException: CircuitBreakingException[[fielddata] Data too large, data for [_id] would be [5313833239/4.9gb], which is larger than the limit of [5085934387/4.7gb]]

My question is: why did my ES prod cluster go down because of that? Shouldn't it have been caught and the same kind of exception returned? Could a config issue be behind this?

I'm using ES 6.7. Here is the last log I had before one node crashed: [2021-02-16T14:18:07,063][INFO ][o.e.m.j.JvmGcMonitorService] [lwg-es-1] [gc][89 - Pastebin.com
Thanks for any help.

If you upgrade to 7.6+, you can disable loading fielddata on the _id field.

With that change, it would indeed return an error instead of taking the whole cluster down.
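
For reference, here is a minimal sketch of what applying that change could look like. It assumes the setting meant here is the dynamic cluster setting `indices.id_field_data.enabled` (available from 7.6, so it does not apply to 6.7) and that it is set through the cluster settings API. The helper names and the `http://localhost:9200` address are hypothetical, and authentication/TLS are left out.

```python
import json
import urllib.request


def build_disable_id_fielddata_request(base_url):
    """Build (without sending) a PUT /_cluster/settings request.

    Assumption: the relevant setting is `indices.id_field_data.enabled`,
    which only exists in Elasticsearch 7.6 and later.
    """
    body = {"persistent": {"indices.id_field_data.enabled": False}}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def send(request):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (hypothetical address, not executed here):
#   send(build_disable_id_fielddata_request("http://localhost:9200"))
```

Once the setting is disabled, a sort on _id should be rejected up front rather than trying to load fielddata for every _id into the heap.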

Hi,

Thanks for your answer, I'll do that.
But I thought that such heavy requests would get killed if the cluster or a node was about to go down...
Preventing the issue on the _id field is a fine workaround for my specific case, but knowing that the whole cluster can go down with a request as "basic" as this is kind of scary.
Isn't there anything we can do in the config to prevent such a massive failure from happening?

Thanks.

Watertight protection against this sort of thing is basically impossible, but note that 6.7 was released almost 2 years ago and is already well past EOL. In the meantime we've added many more layers of protection against such harmful requests, such as the option I linked, so you will have a much better experience if you upgrade.

Ok, I will do that soon then, thanks for your help!