OOM on aggregation and a lot of timeout exceptions

Hey All,

I am running a three-node cluster with ES 2.2.0. Each node is a 32-core, 64GB instance, and ES has around 17GB of RAM allocated. I have around 350 million records. All three nodes are performing very badly and throwing all kinds of exceptions.

A few of them are below:

1. ProcessClusterEventTimeoutException[failed to process cluster event (put-mapping [myindex-2016-05-01T12]) within 30s]

2. Caused by: QueryPhaseExecutionException[Query Failed [Failed to execute main query]]; nested: OutOfMemoryError[Java heap space];
      at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:409)
      at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:113)
      at org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:364)
      at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:376)
      at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:368)
      at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:365)
      at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:350)
      at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:745)
      Caused by: java.lang.OutOfMemoryError: Java heap space
      at org.apache.lucene.util.automaton.RunAutomaton.<init>(RunAutomaton.java:144)
      at org.apache.lucene.util.automaton.ByteRunAutomaton.<init>(ByteRunAutomaton.java:32)
      at org.apache.lucene.util.automaton.CompiledAutomaton.<init>(CompiledAutomaton.java:247)
      at org.apache.lucene.util.automaton.CompiledAutomaton.<init>(CompiledAutomaton.java:133)
      at org.apache.lucene.search.FuzzyTermsEnum.initAutomata(FuzzyTermsEnum.java:175)
      at org.apache.lucene.search.FuzzyTermsEnum.getAutomatonEnum(FuzzyTermsEnum.java:151)
      at org.apache.lucene.search.FuzzyTermsEnum.maxEditDistanceChanged(FuzzyTermsEnum.java:210)
      at org.apache.lucene.search.FuzzyTermsEnum.bottomChanged(FuzzyTermsEnum.java:204)
      at org.apache.lucene.search.FuzzyTermsEnum.<init>(FuzzyTermsEnum.java:142)
      at org.apache.lucene.search.FuzzyQuery.getTermsEnum(FuzzyQuery.java:155)
      at org.apache.lucene.search.MultiTermQuery.getTermsEnum(MultiTermQuery.java:318)

3. 01-May-2016 12:25:41,293 INFO [transport] (elasticsearch[Dan Ketch][generic][T#2238]) [Dan Ketch] failed to get local cluster state for {#transport#-2}{}{}, disconnecting...: ReceiveTimeoutTransportException[[][][cluster:monitor/state] request_id [40958] timed out after [15000ms]]
      at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:645) [elasticsearch-2.2.0.jar:2.2.0]
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_72]
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_72]
      at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_72]

4. java.util.concurrent.TimeoutException: Failed to acknowledge mapping update within [30s]

5. [2016-05-01 12:29:01,806][WARN ][transport ] [Node2] Received response for a request that has timed out, sent [17789ms] ago, timed out [2788ms] ago, action [cluster:monitor/nodes/stats[n]], node [{Node0}{HQbDpWZ7RcGIOEoKslkR2Q}{}{}{master=true}], id [369038]

And I have the settings below in Elasticsearch:

---------------------------------- Cache Size --------------------------------

indices.fielddata.cache.size: 70%
indices.breaker.fielddata.limit: 75%

---------------------------------- Thread pool --------------------------------

threadpool.index.queue_size: 2000
threadpool.search.queue_size: 2000
threadpool.bulk.queue_size: 2000
bootstrap.mlockall: true
indices.store.throttle.max_bytes_per_sec: 100mb

I aggregate the last 24 hours of records from some indices every 10 minutes, since new data is added every minute.
Can someone please suggest what's going wrong here? I see the OOM is because of aggregation, but how can I avoid it?
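
A request of the shape described above might look like the following sketch. The field names `timestamp` and `status` and the helper `last_24h_agg_body` are placeholders for illustration, not taken from the actual indices. It assumes a `range` filter on a date field and `size: 0`, so only aggregation buckets are returned rather than hits.

```python
import json


def last_24h_agg_body(time_field, agg_field, bucket_count=10):
    """Build a search body that aggregates only documents from the last 24 hours.

    All field names passed in are hypothetical placeholders.
    """
    return {
        # Return no hits, only aggregation results.
        "size": 0,
        "query": {
            "bool": {
                # A filter clause does not compute scores.
                "filter": [{"range": {time_field: {"gte": "now-24h"}}}]
            }
        },
        "aggs": {
            "by_value": {"terms": {"field": agg_field, "size": bucket_count}}
        },
    }


body = last_24h_agg_body("timestamp", "status")
print(json.dumps(body, indent=2))
```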


You are giving 70% of your heap to fielddata, which is not good.

Neither are the thread pool queue sizes of 2000, given that thread pool queues are held in JVM memory.

So with the fielddata and over-inflated queues, it's no wonder you are OOMing.
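
To put rough numbers on that, here is a back-of-the-envelope sketch using the figures from the post above (about 17GB of heap, a 70% fielddata cache and a 75% breaker limit). The helper name `heap_budget` is made up for illustration; this is plain arithmetic, not an Elasticsearch API.

```python
def heap_budget(heap_gb, cache_pct, breaker_pct):
    """Return (cache_gb, breaker_gb, remaining_gb) for a given heap size.

    cache_pct   -- indices.fielddata.cache.size as a percentage of heap
    breaker_pct -- indices.breaker.fielddata.limit as a percentage of heap
    """
    cache_gb = heap_gb * cache_pct / 100.0
    breaker_gb = heap_gb * breaker_pct / 100.0
    # Whatever the fielddata cache can occupy is unavailable to queries,
    # queued requests, indexing buffers and everything else on the heap.
    remaining_gb = heap_gb - cache_gb
    return cache_gb, breaker_gb, remaining_gb


cache_gb, breaker_gb, remaining_gb = heap_budget(17, 70, 75)
print("fielddata cache may grow to about %.1f GB" % cache_gb)
print("fielddata breaker trips at about %.2f GB" % breaker_gb)
print("left for everything else: about %.1f GB" % remaining_gb)
```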

I'd suggest you need to look into doc values ASAP. In the short term you either need to reduce your query size or increase the resources available to ES, by either adding more nodes to the cluster or by adding more RAM to them.
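
As a sketch of what a doc values mapping could look like in ES 2.x: doc values apply to numeric, date and not_analyzed string fields, so a field you aggregate on would be mapped roughly as below. The type name `event`, the field names `status` and `timestamp`, and the helper itself are hypothetical, not taken from your mapping.

```python
import json


def doc_values_mapping(type_name, keyword_fields, date_fields):
    """Build an ES 2.x style mapping body with doc values on the given fields.

    All names passed in are illustrative placeholders.
    """
    properties = {}
    for name in keyword_fields:
        # Doc values are not available on analyzed strings, so the field
        # has to be not_analyzed to be aggregated from disk-backed doc values.
        properties[name] = {
            "type": "string",
            "index": "not_analyzed",
            "doc_values": True,
        }
    for name in date_fields:
        properties[name] = {"type": "date", "doc_values": True}
    return {"mappings": {type_name: {"properties": properties}}}


mapping = doc_values_mapping("event", ["status"], ["timestamp"])
print(json.dumps(mapping, indent=2))
```

An existing field generally cannot be switched in place, so a mapping like this would apply to newly created indices or after a reindex.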

What would be the ideal fielddata.cache.size? I added this setting to avoid OOM. There is a lot of aggregation happening in my case. How should I configure it?

Hitting the filter cache limit won't cause an OOM; it'll actually stop that from happening thanks to the circuit breakers.

Sure, I will take a look at circuit breakers. But I don't understand why every node is timing out. All my nodes are going crazy. A single cat request takes about a minute to respond.

Your cluster is overloaded.
How much data is that 350M docs? How many indices and shards?

Below are the index details. I aggregate only on index6, index9 and index10, and for aggregation I query only the last 24 hours.

The total data I have now covers one week.

health status index   pri rep docs.count docs.deleted store.size pri.store.size
green  open   index1    5   1      10232            0      3.1mb          1.5mb
green  open   index2    5   1       4132            0      4.9mb          2.4mb
green  open   index3    5   1    2211427            0   1021.5mb        510.6mb
green  open   index4    5   1   45930380       142328     10.8gb          5.4gb
green  open   index5    5   1       3353            0     13.9mb          6.9mb
green  open   index6    5   1  183713123            0     61.2gb         30.7gb
green  open   index7    5   1        175          163    321.6kb          158kb
green  open   index8    5   1    6387812            0      3.6gb          1.8gb
green  open   index9    5   1  294701932            0     78.6gb         39.1gb
green  open   index10   5   1   50690374            0     21.2gb           10gb
green  open   index11   5   1    3046745            0        2gb            1gb
green  open   index12   5   1     495746            0      1.9gb        993.8mb
green  open   index13   5   1      10232            0        3mb          1.5mb
green  open   index14   5   1     140825           15      171mb         85.5mb
green  open   index15   5   1        326           33    379.4kb        165.9kb
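
Counting shards from the listing above (15 indices with 5 primaries and 1 replica each, on the three nodes mentioned at the start) is simple arithmetic; the sketch below just spells it out. The function name `shard_counts` is made up for illustration.

```python
def shard_counts(indices, primaries, replicas, nodes):
    """Return (total_shards, shards_per_node) assuming an even spread."""
    # Each primary shard has `replicas` additional copies.
    total = indices * primaries * (1 + replicas)
    return total, total / float(nodes)


total, per_node = shard_counts(indices=15, primaries=5, replicas=1, nodes=3)
print("total shards: %d, about %.0f per node" % (total, per_node))
```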