ES used heap % grows slowly until system becomes unresponsive

How's it going guys/gals.

I am running a production Elasticsearch ( elasticsearch-1.5.0.jar ) cluster consisting of 6 nodes.
One of these nodes is a so-called "load-balancer" node ( lb.es ). It is not a data node; all the HTTP and transport client traffic goes through it and is then routed to the data nodes ( esdata-001 through esdata-005 ). The data nodes are all master-eligible, with minimum_master_nodes set to 3.

Each data node has 31G of heap allocated to it ( 31 so that I stay under the ~32GB threshold where the JVM loses compressed object pointers ) out of a total of 128G of RAM.
ES_HEAP_SIZE=31G

The load balancer node has 14G of 31G available RAM allocated for the JVM.
ES_HEAP_SIZE=14G

On this cluster we index about 300 million documents per day. The documents are Apache log lines and we use logstash to do the indexing, rotating the index once per day at midnight. Every day we also delete the index that is older than 30 days.
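For reference, the nightly cleanup is conceptually just a DELETE of the index that has aged out - a minimal sketch of that kind of job, assuming the logstash-YYYY.MM.DD naming shown in the mapping below and going through the lb node ( this is not our actual script ):

# sketch: drop the logstash index that is now older than 30 days
OLD_INDEX="logstash-$(date -d '31 days ago' +%Y.%m.%d)"
curl -XDELETE "http://lb.es:9200/${OLD_INDEX}"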

The peak insert rate is around 20,000 documents/sec and the average is about 5,000-10,000 docs/sec.

We also run 1-5 queries per second which extract data from indexes ( usually indexes from within the last 5 days, so these are cache warmed ) and then this data is processed via scripts etc.

Anyway, the system works fine, but as I monitor the ES heap usage I see that it grows at a fairly steady rate across all the nodes ( after optimizations it's about 1-2% per day, starting from an average of around 50% across the nodes ).

When the used heap % reaches 80-90%, the number of lines in the slowlog for both indexing and search increases ( search seems to produce more lines on average at this point, probably because I reduced the number of shards per index from 100 to 30, which in turn improved indexing performance, so it's a tradeoff ).

As the slowlog lines increase we eventually get to a point where the system is more or less unresponsive. After that it's a matter of hours before the heap is completely full, the system starts thrashing and ultimately grinds to a halt.

Anyway - I currently do scheduled cluster restarts ( which take about 10 minutes ) whenever the average heap % is around 85%. That way the system only experiences about 10 minutes of downtime per week, but it's really a pain in the ass.
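( For anyone following along: the per-node heap % and the fielddata breakdown can be checked with the cat APIs, roughly like this - lb.es is just the node I happen to query:

curl -s 'http://lb.es:9200/_cat/nodes?v&h=name,heap.percent,heap.max'
curl -s 'http://lb.es:9200/_cat/fielddata?v'

The second one shows fielddata memory per field per node, which is one of the usual suspects for slow heap growth. )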

This is where I need your help ( I have already applied pretty much all the suggestions from the documentation and from most of the production-oriented guides... ).

Please let me know if you have any ideas that I can test in production ( I also have one free machine where I can run other kinds of tests, so I can set it up as a test environment if needed ).

I can attach any and all needed conf files, statistics etc - just let me know.

Here's the esdata-001 through esdata-005 elasticsearch.yml contents:

cluster.name: elasticsearch
node.name: "elasticsearch-data-master001"

# possible master + data node
node.master: true
node.data: true
node.max_local_storage_nodes: 1

index.number_of_shards: 30 # was 100, changed to 30 on 06.08.2015 
index.number_of_replicas: 0

index.merge.scheduler.max_thread_count: 1

index.merge.policy.merge_factor: 20
index.merge.policy.min_merge_docs: 1000
index.merge.policy.max_merge_docs: 100000000

path.data:  /data/elasticsearch
network.host: x.x.x.x

discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: [  "x.x.x.x", "x.x.x.x", "x.x.x.x", "x.x.x.x", "x.x.x.x", "x.x.x.x" ]
discovery.zen.minimum_master_nodes: 3

index.search.slowlog.threshold.query.warn: 10s
index.search.slowlog.threshold.query.info: 5s
index.search.slowlog.threshold.query.debug: 2s
index.search.slowlog.threshold.query.trace: 500ms

index.search.slowlog.threshold.fetch.warn: 1s
index.search.slowlog.threshold.fetch.info: 800ms
index.search.slowlog.threshold.fetch.debug: 500ms
index.search.slowlog.threshold.fetch.trace: 200ms

index.indexing.slowlog.threshold.index.warn: 10s
index.indexing.slowlog.threshold.index.info: 5s
index.indexing.slowlog.threshold.index.debug: 2s
index.indexing.slowlog.threshold.index.trace: 500ms

monitor.jvm.gc.young.warn: 1000ms
monitor.jvm.gc.young.info: 700ms
monitor.jvm.gc.young.debug: 400ms

monitor.jvm.gc.old.warn: 10s
monitor.jvm.gc.old.info: 5s
monitor.jvm.gc.old.debug: 2s

script.disable_dynamic: true

bootstrap.mlockall: true

threadpool.search.type: fixed
threadpool.search.size: 96 # 32 * 3 #http://www.elastic.co/guide/en/elasticsearch/guide/master/_don_8217_t_touch_these_settings.html
threadpool.search.queue_size: -1 # increased on 22.04.2015 ( to prevent EsRejectedExecutionException and similar )

threadpool.bulk.type: fixed
threadpool.bulk.size: 32
threadpool.bulk.queue_size: 1000

threadpool.index.type: fixed
threadpool.index.size: 32
threadpool.index.queue_size: 1000

indices.memory.index_buffer_size: 30%
indices.memory.min_shard_index_buffer_size: 12mb
indices.memory.min_index_buffer_size: 96mb

indices.fielddata.cache.size: 30% # decreased from 40% on 13.07.2015 ( see  http://www.elastic.co/guide/en/elasticsearch/guide/current/_limiting_memory_usage.html )
#indices.fielddata.cache.expire: 5m # removed this on 13.07.2015 ( see http://www.elastic.co/guide/en/elasticsearch/guide/current/_limiting_memory_usage.html )

indices.cache.filter.size: 10%
indices.cache.filter.expire: 6h

indices.breaker.fielddata.limit: 50% # added on 13.07.2015 ( default is 60% of heap ) 
indices.breaker.request.limit: 20% # added on 13.07.2015 ( default is 40% of heap )
indices.breaker.total.limit: 40% # added on 13.07.2015 ( default is 70% of heap as a combination of the previous two)

indices.store.throttle.type: "none"
index.refresh_interval: -1 # was 30s
index.translog.flush_threshold_ops: 50000 # defaults to unlimited

index.compound_format : 0 # added on 15.07.2015
index.compound_on_flush : false # added on 15.07.2015

cluster.routing.allocation.node_concurrent_recoveries: 30

indices.recovery.concurrent_streams: 10
indices.recovery.concurrent_small_file_streams: 10
indices.recovery.file_chunk_size: 512kb
indices.recovery.translog_ops: 5000
indices.recovery.translog_size: 512kb
indices.recovery.compress: true
indices.recovery.max_bytes_per_sec: 40mb

Here's the elasticsearch-lb elasticsearch.yml contents:

cluster.name: elasticsearch
node.name: elasticsearch-lb # todo -o sasho: currently only supports adding a single nodata nomaster node
node.master: false
node.data: false

index.number_of_shards: 30 # was 100, changed to 30 on 06.08.2015   
index.number_of_replicas: 0
index.merge.scheduler.max_thread_count: 1
index.merge.policy.merge_factor: 20
index.merge.policy.min_merge_docs: 1000
index.merge.policy.max_merge_docs: 100000000

path.conf: /etc/elasticsearch
bootstrap.mlockall: true
network.host: x.x.x.x

http.port: 9200
http.enabled: true

discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: [  "x.x.x.x", "x.x.x.x", "x.x.x.x", "x.x.x.x", "x.x.x.x", "x.x.x.x" ]
discovery.zen.minimum_master_nodes: 3

index.search.slowlog.threshold.query.warn: 10s
index.search.slowlog.threshold.query.info: 5s
index.search.slowlog.threshold.query.debug: 2s
index.search.slowlog.threshold.query.trace: 500ms
index.search.slowlog.threshold.fetch.warn: 1s
index.search.slowlog.threshold.fetch.info: 800ms
index.search.slowlog.threshold.fetch.debug: 500ms
index.search.slowlog.threshold.fetch.trace: 200ms
index.indexing.slowlog.threshold.index.warn: 10s
index.indexing.slowlog.threshold.index.info: 5s
index.indexing.slowlog.threshold.index.debug: 2s
index.indexing.slowlog.threshold.index.trace: 500ms

monitor.jvm.gc.young.warn: 1000ms
monitor.jvm.gc.young.info: 700ms
monitor.jvm.gc.young.debug: 400ms
monitor.jvm.gc.old.warn: 10s
monitor.jvm.gc.old.info: 5s
monitor.jvm.gc.old.debug: 2s

script.disable_dynamic: true

bootstrap.mlockall: true

threadpool.search.type: fixed
threadpool.search.size: 96 # 32 * 3 #http://www.elastic.co/guide/en/elasticsearch/guide/master/_don_8217_t_touch_these_settings.html
threadpool.search.queue_size: -1 # increased on 22.04.2015 ( to prevent EsRejectedExecutionException and similar )

threadpool.bulk.type: fixed
threadpool.bulk.size: 32
threadpool.bulk.queue_size: 1000

threadpool.index.type: fixed
threadpool.index.size: 32
threadpool.index.queue_size: 1000

indices.memory.index_buffer_size: 30%
indices.memory.min_shard_index_buffer_size: 12mb
indices.memory.min_index_buffer_size: 96mb

indices.fielddata.cache.size: 30% # decreased from 40% on 13.07.2015 ( see  http://www.elastic.co/guide/en/elasticsearch/guide/current/_limiting_memory_usage.html )
#indices.fielddata.cache.expire: 5m # removed this on 13.07.2015 ( see http://www.elastic.co/guide/en/elasticsearch/guide/current/_limiting_memory_usage.html )

indices.cache.filter.size: 10%
indices.cache.filter.expire: 6h

indices.breaker.fielddata.limit: 50% # added on 13.07.2015 ( default is 60% of heap ) 
indices.breaker.request.limit: 20% # added on 13.07.2015 ( default is 40% of heap )
indices.breaker.total.limit: 40% # added on 13.07.2015 ( default is 70% of heap as a combination of the previous two)

indices.store.throttle.type: "none"
index.refresh_interval: -1 # was 30s
index.translog.flush_threshold_ops: 50000 # defaults to unlimited

index.compound_format : 0 # added on 15.07.2015
index.compound_on_flush : 0 # added on 15.07.2015

cluster.routing.allocation.node_concurrent_recoveries: 30

indices.recovery.concurrent_streams: 10
indices.recovery.concurrent_small_file_streams: 10
indices.recovery.file_chunk_size: 512kb
indices.recovery.translog_ops: 5000
indices.recovery.translog_size: 512kb
indices.recovery.compress: true
indices.recovery.max_bytes_per_sec: 40mb

http.cors.enabled: true
http.cors.allow-origin: http://x.x.x.x

Here's the mapping for today's index:

{
  "logstash-2015.08.18" : { "mappings" : { "apache" : {
        "dynamic_templates" : [ {
          "message_field" : {
            "mapping" : { "index" : "analyzed", "omit_norms" : true, "type" : "string" },
            "match" : "message",
            "match_mapping_type" : "string"
          }
        }, {
          "string_fields" : {
            "mapping" : { "index" : "analyzed", "omit_norms" : true, "type" : "string",
              "fields" : {
                "raw" : { "ignore_above" : 256, "index" : "not_analyzed", "type" : "string" }
              }
            },
            "match" : "*",
            "match_mapping_type" : "string"
          }
        } ],
        "_all" : { "enabled" : true },
        "properties" : {
          "@timestamp" : { "type" : "date", "format" : "dateOptionalTime" },
          "@version" : { "type" : "string", "index" : "not_analyzed" },
          "agent" : {
            "type" : "string",
            "norms" : { "enabled" : false },
            "fields" : {
              "raw" : { "type" : "string", "index" : "not_analyzed", "ignore_above" : 256 }
            }
          },
          "auth" : {
            "type" : "string",
            "norms" : { "enabled" : false },
            "fields" : {
              "raw" : { "type" : "string", "index" : "not_analyzed", "ignore_above" : 256 }
            }
          },
          "bytes" : { "type" : "long" },
          "clientip" : {
            "type" : "string",
            "norms" : { "enabled" : false },
            "fields" : {
              "raw" : { "type" : "string", "index" : "not_analyzed", "ignore_above" : 256 }
            }
          },
          "clientip2" : {
            "type" : "string",
            "norms" : { "enabled" : false },
            "fields" : {
              "raw" : { "type" : "string", "index" : "not_analyzed", "ignore_above" : 256 }
            }
          },
          "geoip" : {
            "dynamic" : "true",
            "properties" : {
              "location" : { "type" : "geo_point" }
            }
          },
          "host" : {
            "type" : "string",
            "norms" : { "enabled" : false },
            "fields" : {
              "raw" : { "type" : "string", "index" : "not_analyzed", "ignore_above" : 256 }
            }
          },
          "hostname" : {
            "type" : "string",
            "norms" : { "enabled" : false },
            "fields" : {
              "raw" : { "type" : "string", "index" : "not_analyzed", "ignore_above" : 256 }
            }
          },
          "ident" : {
            "type" : "string",
            "norms" : { "enabled" : false },
            "fields" : {
              "raw" : { "type" : "string", "index" : "not_analyzed", "ignore_above" : 256 }
            }
          },
          "logsource" : {
            "type" : "string",
            "norms" : { "enabled" : false },
            "fields" : {
              "raw" : { "type" : "string", "index" : "not_analyzed", "ignore_above" : 256 }
            }
          },
          "original_timestamp" : { "type" : "date", "format" : "dateOptionalTime" },
          "rawrequest" : {
            "type" : "string",
            "norms" : { "enabled" : false },
            "fields" : {
              "raw" : { "type" : "string", "index" : "not_analyzed", "ignore_above" : 256 }
            }
          },
          "referrer" : {
            "type" : "string",
            "norms" : { "enabled" : false },
            "fields" : {
              "raw" : { "type" : "string", "index" : "not_analyzed", "ignore_above" : 256 }
            }
          },
          "response" : { "type" : "long" },
          "type" : {
            "type" : "string",
            "norms" : { "enabled" : false },
            "fields" : {
              "raw" : { "type" : "string", "index" : "not_analyzed", "ignore_above" : 256 }
            }
          }
        }
      }
    }
  }
}

And here's an example query of the type that we send:

{
    'filter': {
        'bool': {
            'should': [
                {'bool': {'must': [{'term': {'_cache': True, 'execution': 'bool', 'hostname.raw': 'FAKEHOSTNAME.COM'}}]}}, 
                {'bool': {'must': [{'term': {'_cache': True, 'execution': 'bool', 'hostname.raw': 'FAKEHOSTNAME1.COM'}}]}},
                {'bool': {'must': [{'term': {'_cache': True, 'execution': 'bool', 'hostname.raw': 'FAKEHOSTNAME2.com'}}]}}
            ]
        }
    },
    'fields': ['@timestamp', 'host', 'type', 'logsource', 'hostname', 'clientip', 'ident', 'auth', 'original_timestamp', 'rawrequest', 'response', 'bytes', 'referrer', 'agent', 'clientip2']
}
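For clarity, in plain 1.x query DSL this is essentially a single cached terms filter on hostname.raw plus a field list - a sketch of the equivalent request, not the literal body our scripts build:

curl -XGET 'http://lb.es:9200/logstash-2015.08.18/_search' -d '{
  "query": {
    "filtered": {
      "filter": {
        "terms": {
          "hostname.raw": [ "FAKEHOSTNAME.COM", "FAKEHOSTNAME1.COM", "FAKEHOSTNAME2.com" ],
          "_cache": true
        }
      }
    }
  },
  "fields": [ "@timestamp", "host", "type", "logsource", "hostname", "clientip", "ident", "auth", "original_timestamp", "rawrequest", "response", "bytes", "referrer", "agent", "clientip2" ]
}'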

If you are running either 100 or 30 shards per index on a 6 node cluster, you are running WAY too many. Reduce that to one per data node, or maybe 2 if you want room to add more nodes later on.

That's what I read ( http://cpratt.co/how-many-shards-should-elasticsearch-indexes-have/ ), but search performance dropped ( I see a lot more search slowlog lines now than before the change from 100 to 30 ) and I'm afraid that if I drop the number of shards to 7 ( for example ) search performance will degrade even further. Can you comment on this or point me to a resource?
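( Side note: since the indices rotate daily, I could also test a different shard count via an index template instead of touching elasticsearch.yml on every node - a sketch, the template name is just for illustration:

curl -XPUT 'http://lb.es:9200/_template/logstash_shards' -d '{
  "template": "logstash-*",
  "order": 1,
  "settings": {
    "index.number_of_shards": 7,
    "index.number_of_replicas": 0
  }
}'

New daily indices would then pick up the new shard count without a node restart. )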

Each shard is a Lucene instance that requires resources to maintain; having hundreds of these for a small slice of data is seriously inefficient and will be contributing to your heap issues.

I wouldn't play around with those index.merge.policy settings unless you know what you are doing; even then, the defaults are there for a reason. The same goes for the threadpools - setting the search queue to -1 is just asking for problems, as it's unbounded. Also indices.store.throttle.type - again, don't touch it unless you are really sure you know what the impact is.

If you are running into resource problems then simply increasing these settings isn't going to help, it's just delaying the inevitable. Add more resources, more nodes, more heap!

For your .raw / not_analyzed fields, move them to doc values and you will get some more heap back.
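In 1.x that is just "doc_values": true on the not_analyzed sub-field; based on the dynamic template you posted, it would look roughly like this ( a sketch, not a drop-in replacement ):

"string_fields" : {
  "match" : "*",
  "match_mapping_type" : "string",
  "mapping" : {
    "index" : "analyzed", "omit_norms" : true, "type" : "string",
    "fields" : {
      "raw" : { "type" : "string", "index" : "not_analyzed", "ignore_above" : 256, "doc_values" : true }
    }
  }
}

Fielddata for those fields then lives on disk rather than on the heap; note it only applies to newly created indices.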

I don't really have the option of adding more nodes or more heap: more nodes for financial reasons, and more heap because I've already hit the 32G limit. I'm also hitting the physical limit of the hard drives, so it's not like I can comfortably run two data nodes on the same machine... I'm not sure what "more resources" means here, can you clarify?

I'm not sure what you mean by "hundreds of these for a small slice of data". I have 30 per index, and while the documents are not large, the index as a whole is rather large. I am willing to test with 7 shards; I'm just afraid of making drastic changes on a running production cluster without a better idea of the consequences, and it's kind of hard to tell in advance.

I'll think about reverting the merge policy numbers. I arrived at them after testing that improved performance at an earlier stage, when the cluster was crashing much more violently and much more often, so I have kept them.

As for setting the throttle type to "none" - I did this to improve indexing speed, as my employer had very specific requirements for the maximum indexing rate the cluster should be able to support ( 20,000 docs/s ). At the time, while building and testing the cluster, these values seemed to make more sense than the default of "merge". Again, if you can give me a concrete reason why this setting is dangerous and how exactly it fills up the used heap space, please let me know and I'll consider changing it on the live system.

Regarding threadpool.search.queue_size: -1 - I think I can change this back as well.

A lot of that isn't just about heap, it's about sanity and understanding the impact these changes can have on your system, its stability and its usability.

I'd suggest you upgrade to 1.7.1 as you will get better performance out of that. What java version are you on? If it's not 1.8u40 or above then do that as well, as we have seen good gains from that release onwards.
Also reduce your heap to 30.5GB, not 31GB, that's a figure that's come from Oracle themselves. You can comfortably increase your heap on the client node to ~75% of total system RAM as you don't need to worry about caching on the FS level, but that won't help the heap problems on your data nodes.

How much data do you have across your indices? What's the average size of an index?

Ultimately, if you can't add more resources ( i.e. heap/nodes ) then you are limited in what you can do.

Also, what are these help sites you are referring to?

This one for example:


The one I linked earlier, and some others, such as this one:
http://edgeofsanity.net/article/2012/12/26/elasticsearch-for-logging.html

The elastic mailing list, this forum and the main documentation.

Ok, cause right here it explicitly says "don't touch these settings", for a reason :stuck_out_tongue:

I admit that my heap space explanation is very surface level. There seems to be a correlation between these numbers and when we start experiencing slow indexing and search rates - that's why I mention them, as they seem to be a good metric to monitor and a good indicator of when I need to do a restart. That said, they don't explain anything...

The Java version is already 1.8.0_45. I was considering upgrading to 1.7.1, and even though I didn't see anything in the changelog that seemed directly relevant, it's still a good idea. ( Have you done such an upgrade before? Do you have any idea whether I will need to reindex everything for the new version? If so, it's probably not something I want to do right now. ) It seems it's not that difficult, and luckily both 1.5.0 and 1.7.1 use Lucene 4, so it's a question of doing a rolling upgrade https://www.elastic.co/guide/en/elasticsearch/reference/current/setup-upgrade.html#rolling-upgrades ( which can still go horribly wrong, so I'm not in a big rush to do it before I can ensure some backups.. )
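( My reading of the rolling upgrade docs is that, per node, it boils down to roughly this - untested on my side, so just a sketch:

# 1. disable shard allocation so shards aren't shuffled around while the node is down
curl -XPUT 'http://lb.es:9200/_cluster/settings' -d '{
  "transient" : { "cluster.routing.allocation.enable" : "none" }
}'
# 2. stop the node, upgrade the elasticsearch package, start it again
# 3. re-enable allocation and wait for green before moving to the next node
curl -XPUT 'http://lb.es:9200/_cluster/settings' -d '{
  "transient" : { "cluster.routing.allocation.enable" : "all" }
}'
curl 'http://lb.es:9200/_cluster/health?wait_for_status=green&timeout=10m'

)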

The average index size is ~300,000,000 docs / ~280GB ( with replicas ).

From that same guide ( "Don't Touch These Settings!", Elasticsearch: The Definitive Guide ):

Search gets a larger threadpool, and is configured to # cores * 3.

The number of cores is 32, and times 3 that's 96 - so I think I'm actually using the default :smiley: ( although earlier I had some crazy number heh )

** update: see 4 posts down

You don't need to reindex for an upgrade, there's some docs on the upgrade process.

Again, having that many shards for such small indices and only 5 data nodes doesn't make any sense. I can guarantee that your issues would have been worse with 100 shards, because you were allocating >300% more resources just to maintain Lucene instances.

I know I have said it before, but you are basically balancing usability, performance and storage; you can't get all 3, you have to pick and choose which are most important.

Thanks for the assist so far Mark. I am considering all your suggestions ( and will likely try the ones I mentioned ) but I'll leave the thread open so that others can also give their suggestions ( n > 2 minds are better heh ). I'm currently at the end of my shift, so I'll check in on this thread tomorrow.

I checked my threadpool values again against this page: https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-threadpool.html

I think my search pool size is wrong - according to that page the default should be int((32*3)/2)+1 = 49, so I guess there are some inconsistencies between the docs...

UPDATE: I updated the cluster config so that new indices now have 7 shards. I'm monitoring the slowlog and there doesn't seem to be a spike in search slowlog lines, so what I was afraid of earlier likely isn't happening. I will monitor the cluster for another week or so to see how the used heap % progresses and whether we hit the same failure conditions as before. So far so good.

Great to hear!