Overhead and heap issues

Hi all,

I'm having big problems with the Java heap size and my 3 Elasticsearch nodes running out of heap space. I'd been running with the default 1 GB heap, but as that started to fill I increased it to 4 GB (half of the available memory on the server). Since making that adjustment I'm getting more issues: the cluster runs for about half a day before filling the Java heap space, eventually timing out and then stopping altogether.
I'm running these 3 nodes on Windows 2012 R2 and have used the elasticsearch-service.bat manager to adjust the -Xms6g and -Xmx6g options, as well as changing the "Initial memory pool" and the "Maximum memory pool" to 6144MB. I also changed the jvm.options file for good measure (I realise the Windows service doesn't pick up changes to that file), yet my logs are still full of overhead errors before the node finally quits.
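
For reference, the heap lines I edited in config/jvm.options look like this (the values here are just an example, and as I said, the Windows service may be ignoring this file entirely):

# heap size – minimum and maximum should be set to the same value
-Xms4g
-Xmx4g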

[2018-01-31T18:30:21,491][INFO ][o.e.m.j.JvmGcMonitorService] [server1] [gc][6178] overhead, spent [382ms] collecting in the last [1s]
[2018-01-31T18:30:22,507][WARN ][o.e.m.j.JvmGcMonitorService] [server1] [gc][6179] overhead, spent [539ms] collecting in the last [1s]
[2018-01-31T18:35:13,640][INFO ][o.e.m.j.JvmGcMonitorService] [server1] [gc][6467] overhead, spent [361ms] collecting in the last [1s]
[2018-01-31T18:35:14,656][INFO ][o.e.m.j.JvmGcMonitorService] [server1] [gc][6468] overhead, spent [503ms] collecting in the last [1s]
[2018-01-31T18:35:15,672][INFO ][o.e.m.j.JvmGcMonitorService] [server1] [gc][6469] overhead, spent [427ms] collecting in the last [1s]

Can anyone point out anything that I might be missing? None of my configs have changed in any significant way; all I've done is increase the heap space for Java.

Thanks for any help you can offer.

How many shards do you have? What version are you on?

V6.0.0
active_primary_shards: 866,
active_shards: 1732

Is the first step an upgrade?

I think you have too many shards given the size of the cluster and the heap space available. Have a read of this blog post on shards and sharding practices.
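
As a rough way to see how over-sharded you are, something along these lines with the _cat API should show the on-disk size of each individual shard and which node it sits on:

GET _cat/shards?v&h=index,shard,prirep,store,node

If most of those shards are only a few MB each, that's a lot of per-shard overhead for very little data.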

So I've used Cerebro to get a better overview of the issue. I can now watch the JVM heap on each node slowly climbing, sometimes to 90% on one of the nodes. I can see that I have 42,087,876 docs and 1,782 shards spread across the 3 nodes, with a total size of 58GB and each node now on 4GB of JVM heap. Based on the above, can you confirm that the reason I am seeing all these timeouts is down to the sheer number of shards/docs?
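
(For what it's worth, I assume I could watch the same heap numbers without Cerebro with a _cat request along these lines — the column names are my best guess:

GET _cat/nodes?v&h=name,heap.percent,heap.current,heap.max

and that matches what Cerebro is showing me.)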

If this is the case, what are the steps to reduce them? I only have 179 indices. Is it a config issue that I've screwed up on the initial build?

Thanks.

It certainly looks that way. If you have time-based data, use the _shrink API to reduce the shard counts of older indices.
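
Roughly, the shrink flow looks like this (the index names below are placeholders, and "server1" is just one of your node names): first make the index read-only and move a copy of every shard onto a single node, then shrink it into a new index with fewer primaries.

PUT /logstash-2018.01.20/_settings
{
  "settings" : {
    "index.routing.allocation.require._name" : "server1",
    "index.blocks.write" : true
  }
}

POST /logstash-2018.01.20/_shrink/logstash-2018.01.20-shrunk
{
  "settings" : {
    "index.number_of_shards" : 1,
    "index.number_of_replicas" : 1
  }
}

The target shard count has to divide evenly into the source's shard count. Once the shrink finishes you can delete the original index and remove the allocation requirement from the new one.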

Let's say I wanted to just start again: wipe out all my indices, let Logstash continue throwing data at ES, and accept my losses. How do I prevent this from happening in the future? From what I've read, I'll need to specify the number of shards in an index template from Logstash. So am I right in thinking that I should design a template for every index? I'd assumed I should just use the one that Logstash chooses for me based on the content I throw at it. The only specific template I use is one I cobbled together for our Palo firewalls (shown below for reference, with a rough sketch of what I think a shard-count template would look like after it). I guess I'm at a loss to understand exactly how 3 nodes can't take the small amount of data I'm pushing at them. I've even reduced the Curator cleanup to delete indices older than 7 days, so I only have 5 indices coming in and no more than 7 days of each of them kept.

Palo template

{
  "template" : "palo-firewall-traffic*",
  "settings" : {
    "index.refresh_interval" : "5s"
  },
  "mappings" : {
    "_default_" : {
       "_all" : {"enabled" : true},
       "dynamic_templates" : [ {
         "message_field" : {
           "match" : "message",
           "match_mapping_type" : "string",
           "mapping" : {
             "type" : "text", "norms" : false, "index" : true
           }
         }
       }, {
         "strings" : {
           "match" : "*",
           "match_mapping_type" : "string",
           "mapping" : {
             "type" : "text", "norms" : false, "index" : true,
               "fields" : {
                 "text" : {"type" : "keyword", "index" : true, "ignore_above" : 256}
               }
           }
         }
       } ],
       "properties" : {
         "@version": { "type": "keyword", "index": true},
         "geoip"  : {
           "type" : "object",
             "dynamic": true,
             "properties" : {
               "location" : { "type" : "geo_point" }
             }
         },
         "SourceGeo"  : {
           "type" : "object",
             "dynamic": true,
             "properties" : {
               "location" : { "type" : "geo_point" }
             }
         },
         "DestinationGeo"  : {
           "type" : "object",
             "dynamic": true,
             "properties" : {
               "location" : { "type" : "geo_point" }
             }
         }
       }
    }
  }
}
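
And this is roughly what I imagine the shard-count template would look like — completely untested, and the template name and index patterns are just placeholders for whatever my 5 incoming indices are actually called:

PUT _template/one_shard_default
{
  "index_patterns" : ["logstash-*", "palo-firewall-traffic*"],
  "order" : 0,
  "settings" : {
    "index.number_of_shards" : 1,
    "index.number_of_replicas" : 1
  }
}

My thinking is to give it a low order so anything more specific (like the Palo template above) can still override it — is that the right approach?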
