OutOfMemory exception on ES 1.7.0

Hi, we are seeing an out of memory exception on the Elasticsearch JVM, as it runs out of heap space after a while.

We are using Elasticsearch 1.7.0 with Java 1.7.0_79. When we looked at the heap dump, we could see that around 80% of the retained heap was held by around 1.7 million ClusterState objects. Any pointers on why the heap is being retained by these objects and not released? How often are ClusterState objects created?

We are running a 4-node ES cluster, and each ES JVM is run with a heap of 8 GB:

ES_HEAP_SIZE=8g

Any help is much appreciated. Thanks.

Class Name                                                                                                                    | Shallow Heap | Retained Heap | Percentage
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor @ 0x750acc598                                        |           88 | 3,281,330,168 |     81.40%
|- java.util.concurrent.PriorityBlockingQueue @ 0x750b96760                                                                   |           40 | 3,281,329,448 |     81.40%
|  |- java.lang.Object[1707954] @ 0x7f670bc00                                                                                 |    6,831,832 | 3,281,329,304 |     81.40%
|  |  |- org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable @ 0x74fc721b8|           40 |       532,728 |      0.01%
|  |  |- org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable @ 0x74fc6fc38|           40 |       529,784 |      0.01%
|  |  |- org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable @ 0x74fc70eb0|           40 |       521,008 |      0.01%
|  |  |- org.elasticsearch.cluster.ClusterState @ 0x708454380                                                                 |           56 |       400,104 |      0.01%
|  |  |- org.elasticsearch.cluster.ClusterState @ 0x7e6174ff8                                                                 |           56 |       400,104 |      0.01%
|  |  |- org.elasticsearch.cluster.ClusterState @ 0x7bd933908                                                                 |           56 |       400,104 |      0.01%
|  |  |- org.elasticsearch.cluster.ClusterState @ 0x7cacb4780                                                                 |           56 |       400,104 |      0.01%
|  |  |- org.elasticsearch.cluster.ClusterState @ 0x73c460d50                                                                 |           56 |       400,104 |      0.01%
|  |  |- org.elasticsearch.cluster.ClusterState @ 0x7292082e8                                                                 |           56 |       400,104 |      0.01%
|  |  |- org.elasticsearch.cluster.ClusterState @ 0x73b610838                                                                 |           56 |       400,104 |      0.01%
|  |  |- org.elasticsearch.cluster.ClusterState @ 0x79fadbb98                                                                 |           56 |       400,104 |      0.01%
|  |  |- org.elasticsearch.cluster.ClusterState @ 0x7200bf040                                                                 |           56 |       400,104 |      0.01%
|  |  |- org.elasticsearch.cluster.ClusterState @ 0x7306d3430                                                                 |           56 |       400,104 |      0.01%
|  |  |- org.elasticsearch.cluster.ClusterState @ 0x7146e6fb0                                                                 |           56 |       400,104 |      0.01%
|  |  |- org.elasticsearch.cluster.ClusterState @ 0x72b5c88d0                                                                 |           56 |       400,104 |      0.01%    
|  |  |- org.elasticsearch.cluster.ClusterState @ 0x724eac8f8                                                                 |           56 |       400,104 |      0.01%
|  |  |- org.elasticsearch.cluster.ClusterState @ 0x71d312540                                                                 |           56 |       400,104 |      0.01%
|  |  '- Total: 25 of 1,435,266 entries; 1,435,241 more                                                                       |              |               |           
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
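For reference, that PrioritizedEsThreadPoolExecutor is the executor behind the cluster state update queue, so a backlog like this should also be visible live through the pending cluster tasks API. A small sketch of how we could check that (assuming the node is reachable on localhost:9200; adjust host/port as needed):

```python
# Sketch: count pending cluster state update tasks on an ES 1.x node.
# Assumes the node listens on http://localhost:9200 (adjust as needed).
import json
import urllib.request

resp = urllib.request.urlopen("http://localhost:9200/_cluster/pending_tasks")
tasks = json.load(resp)["tasks"]

print("pending cluster tasks:", len(tasks))
for task in tasks[:10]:  # show the ten oldest tasks, their priority and source
    print(task["priority"], task["source"], task["time_in_queue"])
```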

You don't happen to have a lot of mappings, do you?

No, just 2: the default and a custom mapping. I do, however, see an update_mapping [logs] (dynamic) INFO log whenever a new index is created; index creation is done on an hourly basis.

That's a lot of indices. Why so many?

If you have a lot of indices, your cluster state will be huge and will consume a lot of memory. Remember that in ES the cluster state is synced across nodes and every node maintains a copy of it, so keeping it light is important. If you have Marvel enabled, the large number of indices will impact that as well: the Marvel agent will keep sending index stats for all those indices and consume a lot of memory.
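One rough way to see how big the cluster state has grown is to pull it over the REST API and look at its serialized size and index count. The JSON size is only a proxy for the in-heap size, but it gives a sense of scale. A sketch, assuming the node is reachable on localhost:9200:

```python
# Sketch: gauge the size of the cluster state on an ES 1.x cluster.
# Assumes the node listens on http://localhost:9200 (adjust as needed).
import json
import urllib.request

raw = urllib.request.urlopen("http://localhost:9200/_cluster/state").read()
state = json.loads(raw.decode("utf-8"))

# Serialized JSON size is only a rough proxy for the in-heap footprint.
print("serialized cluster state: %.1f MB" % (len(raw) / 1024.0 / 1024.0))
print("indices in cluster state:", len(state["metadata"]["indices"]))
```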

We are creating indices on an hourly basis because of the volume of the logs; our hourly index sizes frequently go beyond 5 GB. We had been running this setup successfully on ES 1.2, but are now seeing this issue after upgrading to 1.7.

5 GB for an index is not big; I'd just stick with a daily index with (at least) 4 shards.

To add to that, with hourly indices and the default of 5 shards and 1 replica, you create 240 shards per day, which is a lot. Each shard is a Lucene instance that requires resources to maintain. Oversharding, which is what you are doing here, is going to be playing a part in this OOM.
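If it helps, the lower shard count can be applied automatically to the new daily indices with an index template. A sketch, assuming ES 1.x on localhost:9200 and daily indices matching a logs-* naming pattern (the pattern and template name here are illustrative, not from this thread):

```python
# Sketch: register an index template so new daily indices get 4 shards / 1 replica.
# Assumes ES 1.x on http://localhost:9200 and indices matching "logs-*" (illustrative).
import json
import urllib.request

template = {
    "template": "logs-*",        # index name pattern the template applies to
    "settings": {
        "number_of_shards": 4,   # instead of the default 5
        "number_of_replicas": 1,
    },
}

req = urllib.request.Request(
    "http://localhost:9200/_template/daily_logs",
    data=json.dumps(template).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
print(urllib.request.urlopen(req).read().decode("utf-8"))
```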

Thanks for the response @warkolm. I will try indexing on a daily basis. Also, when I said 5 GB as an index size, that was the size for one hour, so if we index on a daily basis the index size could go up to 120 GB in the worst case. Would that cause any other problems?

Nope, that's fine.

Seems reasonable to me as well