Can a heavy query cause index corruption?

alienmind · February 15, 2016, 2:53pm

Hi

I've been running some aggressive tests (on deliberately reduced hardware specs) and ran into the following issue. My cluster is:

ES 2.2.0 on CentOS 6.7
4 nodes, each with 4GB RAM (2GB for the JVM) and 8 cores.
Bulk thread pool queue size extended from 50 to 1000
20 clients using the REST API, sending bulk indexing requests with 1000 records each

Then I ran a few heavy aggregations. The idea is to force ES to retrieve the data from disk (bypassing the FS cache entirely) to measure a kind of "worst case scenario".

One of the nodes crashed with an OutOfMemory exception during one of the aggregations. **This caused corruption on 100% of the shards on that node!** Now the cluster is relocating shards, but it's painfully slow.

I find it hard to believe that a query, however heavy, can cause full corruption of a node. This would be a no-go for production in the project I'm working on.

Is there a way to ensure that the memory used by queries does not take down the whole node? I've been wondering whether lowering the three available circuit breakers so that they don't add up to more than 100% would help (https://www.elastic.co/guide/en/elasticsearch/guide/current/_limiting_memory_usage.html).

But I find this really annoying - lowering these limits "just in case" doesn't seem reasonable.

Any other ideas / tips?

The full trace can be found here: http://pastie.org/pastes/10722735/text?key=8jrwowaumantpoqjvctq

warkolm · February 15, 2016, 3:50pm

2GB of heap isn't very much at all

alienmind · February 15, 2016, 4:11pm

Yes, it is not, on purpose. But it gives an idea of what will happen in production, when I will have 16GB heaps and the query is 8x more complex (I've queried 1 month of data and I expect to be able to do the same over 1 year).

mikemccand · February 15, 2016, 4:15pm

OutOfMemoryError should never result in corruption.

Can you share the full stack traces of both the original OOME you hit and the resulting corruption exceptions?

alienmind · February 15, 2016, 4:29pm

The stack trace is on post #1 (pastie)

mikemccand · February 15, 2016, 9:44pm

> The stack trace is on post #1 (pastie)

Woops, sorry, I missed that.

OK I looked at the exceptions, and this is not actually index corruption.

Rather, ES has decided that this shard is in an unknown state ("failed engine"), no longer in sync with its peers, and therefore must hard-close it and recopy the shard to sync up again, which is what you see happening.

I think you do need to use the circuit breakers to guard against this.
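As a rough illustration of what lowering the circuit breakers could look like, the sketch below builds a request body for the cluster settings API and converts each percentage into bytes for a 2GB heap like the one in this test. The setting names are the three breaker limits described in the guide linked in the first post; the percentages are purely hypothetical values chosen for the example, not recommendations from this thread, and the helper names are made up for illustration.

```python
import json

# Hypothetical limits for illustration only; tune them for your own workload.
BREAKER_LIMITS = {
    "indices.breaker.fielddata.limit": "40%",
    "indices.breaker.request.limit": "30%",
    "indices.breaker.total.limit": "60%",
}


def build_breaker_settings(limits, persistent=False):
    """Return a JSON-serialisable body for PUT /_cluster/settings."""
    scope = "persistent" if persistent else "transient"
    return {scope: dict(limits)}


def limit_in_bytes(limit, heap_bytes):
    """Convert a percentage limit such as '40%' into a number of heap bytes."""
    if not limit.endswith("%"):
        raise ValueError("expected a percentage such as '40%', got %r" % limit)
    return int(heap_bytes * float(limit[:-1]) / 100)


if __name__ == "__main__":
    heap = 2 * 1024 ** 3  # the 2GB JVM heap used in this test cluster
    print(json.dumps(build_breaker_settings(BREAKER_LIMITS), indent=2))
    for name, limit in sorted(BREAKER_LIMITS.items()):
        print("%s = %s (%d bytes)" % (name, limit, limit_in_bytes(limit, heap)))
```

The printed JSON would be sent as the body of a `PUT /_cluster/settings` request to any node. Using `transient` keeps the change until a full cluster restart, which seems the safer choice while experimenting.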
