I'm having serious problems with the Java heap on my 3 Elasticsearch nodes running out of space. I had been running with the default 1 GB heap, but as that started to fill up I increased it to 4 GB (half of the available memory on the server). Since making that change I'm seeing more issues, not fewer: the cluster runs for about half a day before the Java heap fills, requests start timing out, and the nodes eventually stop altogether.
I'm running these 3 nodes on Windows Server 2012 R2 and have used the elasticsearch-service.bat manager to set the -Xms6g and -Xmx6g options, as well as changing the "Initial memory pool" and "Maximum memory pool" values to 6144 MB. I also changed the jvm.options file just for good measure (I realise the Windows service may not pick that file up and uses its own settings instead), yet my logs are still full of GC overhead messages before the node finally quits.
[2018-01-31T18:30:21,491][INFO ][o.e.m.j.JvmGcMonitorService] [server1] [gc][6178] overhead, spent [382ms] collecting in the last [1s]
[2018-01-31T18:30:22,507][WARN ][o.e.m.j.JvmGcMonitorService] [server1] [gc][6179] overhead, spent [539ms] collecting in the last [1s]
[2018-01-31T18:35:13,640][INFO ][o.e.m.j.JvmGcMonitorService] [server1] [gc][6467] overhead, spent [361ms] collecting in the last [1s]
[2018-01-31T18:35:14,656][INFO ][o.e.m.j.JvmGcMonitorService] [server1] [gc][6468] overhead, spent [503ms] collecting in the last [1s]
[2018-01-31T18:35:15,672][INFO ][o.e.m.j.JvmGcMonitorService] [server1] [gc][6469] overhead, spent [427ms] collecting in the last [1s]
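For reference, these are the only heap-related lines I have in jvm.options (assuming the file in the default config directory is even read when Elasticsearch runs as a Windows service rather than from the command line):

-Xms6g
-Xmx6g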
Can anyone point out anything that I might be missing? None of my configs have changed in any significant way; all I've done is increase the heap space for Java.
So I've used Cerebro to get a better overview of the issue. I can now watch the JVM heap usage on each node slowly climbing, sometimes reaching 90% on one of the nodes. I can see that I have 42,087,876 docs and 1,782 shards spread across the 3 nodes, with a total size of 58 GB, and each node now has a 4 GB JVM heap. Based on the above, can you confirm that the reason I am seeing all these timeouts is down to the sheer number of shards/docs?
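For completeness, I believe the same figures can be cross-checked without Cerebro using the _cat APIs (a rough sketch; I'm assuming these column names from the docs, as I've been reading the numbers from Cerebro rather than the APIs directly):

GET _cat/nodes?v&h=name,heap.percent,heap.max
GET _cat/indices?v&h=index,pri,rep,docs.count,store.size
GET _cat/shards?v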
If this is the case, what are the steps to reduce them? I only have 179 indices. Is this a config issue I got wrong during the initial build?
Let's say I wanted to just start again: wipe out all my indices, accept my losses, and let Logstash carry on throwing data at ES. How do I prevent this from happening in the future? From what I've read, I'll need to specify the shard count in a template on the Logstash side. Am I right in thinking that I should design a template for every index? I had assumed I should just use the one Logstash chooses for me based on the content I throw at it. The only specific template I use is one I cobbled together for our Palo Alto firewalls (shown below for ref). I guess I'm at a loss to understand exactly how 3 nodes can't take the small amount of data I'm pushing at them. I've even reduced the Curator cleanup to delete indices older than 7 days, so there are only 5 indices coming in and no more than 7 days of each retained.
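To make the question concrete, here's a rough sketch of what I think such a template would look like (separate from the Palo template; the template name, the logstash-* pattern and the shard/replica counts are just my guesses, and I'm assuming the 6.x index_patterns syntax, which I believe was "template" in 5.x):

PUT _template/logstash_defaults
{
  "index_patterns": ["logstash-*"],
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 1
  }
}

If I've understood the docs correctly, a single wildcard template like this would apply to every new logstash-* index, rather than needing one template per index, and would cut the per-index primary shard count down from the default of 5.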