Elastic Heap Size issue

Hi,

I have Elasticsearch running on 2 nodes: node 1 has 32 cores and 64 GB RAM, node 2 has 16 cores and 32 GB RAM. The heap allocated to node 1 (the 64 GB RAM node) is 24 GB.
The heap allocated to node 2 (the 32 GB RAM node) is 15 GB.

There is no segregation between data nodes and master nodes as of now.

Current data size is 500 GB (across both nodes).

Elasticsearch version: 2.3.0

Problem:

  1. Heap memory unexpectedly starts increasing on one node and never goes down until I restart that particular node.
  2. The increase in heap on that node also results in an increase of heap on the other node, but that one goes back to normal when node 1 is restarted. (A per-node breakdown check is sketched below.)
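
To narrow down which consumer is holding the heap on the affected node, the per-node stats can be broken down with something like the following (a sketch only; localhost:9200 stands in for whichever node is queried, and exact field names should be checked against the 2.3 response):

    curl -s 'localhost:9200/_nodes/stats/jvm,indices,breaker?pretty'
    # jvm.mem.heap_used_percent -> overall heap pressure per node
    # indices.fielddata         -> fielddata held on the heap
    # indices.query_cache       -> query cache held on the heap
    # indices.segments          -> Lucene segment memory (terms, norms, doc values)
    # breakers                  -> how close each circuit breaker is to its tripping limit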

Actions done so far:

  1. Tried deleting data, reducing it to half, i.e. from 1 TB to 500 GB
  2. Cleared the field cache
  3. Tried changing the below parameters (see the sketch after this list)
    indices.cache.filter.size: 15%
    index.merge.scheduler.max_thread_count: 1
    index.translog.flush_threshold_size: 1gb
    index.refresh_interval: 30s
    indices.fielddata.cache.size: 20%
    indices.breaker.request.limit: 40%
    indices.breaker.total.limit: 70%
    action.auto_create_index: true
    indices.breaker.fielddata.limit: 45%
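
For reference, a sketch of how such overrides are usually split between node-level config and dynamic settings (localhost:9200 and the index name are placeholders; the exact scope of each setting should be double-checked against the 2.3 docs):

    # node-level setting: goes in elasticsearch.yml on each node and needs a restart
    indices.fielddata.cache.size: 20%

    # circuit-breaker limits are dynamic cluster settings, changeable without a restart
    curl -XPUT 'localhost:9200/_cluster/settings' -d '{
      "transient" : { "indices.breaker.fielddata.limit" : "45%" }
    }'

    # index.* settings are per index, and many of them are dynamic as well
    curl -XPUT 'localhost:9200/my_index/_settings' -d '{
      "index.refresh_interval" : "30s"
    }'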

Below are the runtime params used when starting Elasticsearch on node 1:
-Xms24g -Xmx24g -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC -Dfile.encoding=UTF-8 -Djna.nosys=true
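
If it would help to see what the collector is doing while the heap climbs, GC logging and an explicit heap-dump path could be appended to those same params, for example (the log and dump paths here are placeholders):

    -Xloggc:/var/log/elasticsearch/gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:HeapDumpPath=/var/lib/elasticsearch/heapdump.hprof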

Below is the output of /_cluster/stats?pretty=1&clear=true&indices=true when the heap was almost full on node 1 (the 64 GB RAM node):
{
  "timestamp" : 1527512323521,
  "cluster_name" : "temp1",
  "status" : "green",
  "indices" : {
    "count" : 34,
    "shards" : {
      "total" : 148,
      "primaries" : 75,
      "replication" : 0.9733333333333334,
      "index" : {
        "shards" : {
          "min" : 2,
          "max" : 20,
          "avg" : 4.352941176470588
        },
        "primaries" : {
          "min" : 1,
          "max" : 10,
          "avg" : 2.2058823529411766
        },
        "replication" : {
          "min" : 0.0,
          "max" : 1.0,
          "avg" : 0.9705882352941176
        }
      }
    },
    "docs" : {
      "count" : 1214534951,
      "deleted" : 6776708
    },
    "store" : {
      "size_in_bytes" : 564568010389,
      "throttle_time_in_millis" : 0
    },
    "fielddata" : {
      "memory_size_in_bytes" : 0,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size_in_bytes" : 0,
      "total_count" : 5958392,
      "hit_count" : 132952,
      "miss_count" : 5825440,
      "cache_size" : 0,
      "cache_count" : 14379,
      "evictions" : 14379
    },
    "completion" : {
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 3840,
      "memory_in_bytes" : 2497665031,
      "terms_memory_in_bytes" : 2233316015,
      "stored_fields_memory_in_bytes" : 222013960,
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory_in_bytes" : 696000,
      "doc_values_memory_in_bytes" : 41639056,
      "index_writer_memory_in_bytes" : 23480688,
      "index_writer_max_memory_in_bytes" : 4599208143,
      "version_map_memory_in_bytes" : 3179664,
      "fixed_bit_set_memory_in_bytes" : 0
    },
    "percolate" : {
      "total" : 0,
      "time_in_millis" : 0,
      "current" : 0,
      "memory_size_in_bytes" : -1,
      "memory_size" : "-1b",
      "queries" : 0
    }
  },
  "nodes" : {
    "count" : {
      "total" : 2,
      "master_only" : 0,
      "data_only" : 0,
      "master_data" : 2,
      "client" : 0
    },
    "versions" : [ "2.3.0" ],
    "os" : {
      "available_processors" : 48,
      "allocated_processors" : 48,
      "mem" : {
        "total_in_bytes" : 0
      },
      "names" : [ {
        "name" : "Linux",
        "count" : 2
      } ]
    },
    "process" : {
      "cpu" : {
        "percent" : 5
      },
      "open_file_descriptors" : {
        "min" : 3141,
        "max" : 3434,
        "avg" : 3287
      }
    },
    "jvm" : {
      "max_uptime_in_millis" : 20946108,
      "versions" : [ {
        "version" : "1.8.0_91",
        "vm_name" : "Java HotSpot(TM) 64-Bit Server VM",
        "vm_version" : "25.91-b14",
        "vm_vendor" : "Oracle Corporation",
        "count" : 2
      } ],
      "mem" : {
        "heap_used_in_bytes" : 30288790016,
        "heap_max_in_bytes" : 41561948160
      },
      "threads" : 565
    },
    "fs" : {
      "total_in_bytes" : 2620321349632,
      "free_in_bytes" : 1478245064704,
      "available_in_bytes" : 1400792936448,
      "spins" : "true"
    },
    "plugins" : [ {
      "name" : "cloud-aws",
      "version" : "2.3.0",
      "description" : "The Amazon Web Service (AWS) Cloud plugin allows to use AWS API for the unicast discovery mechanism and add S3 repositories.",
      "jvm" : true,
      "classname" : "org.elasticsearch.plugin.cloud.aws.CloudAwsPlugin",
      "isolated" : true,
      "site" : false
    }, {
      "name" : "delete-by-query",
      "version" : "2.3.0",
      "description" : "The Delete By Query plugin allows to delete documents in Elasticsearch with a single query.",
      "jvm" : true,
      "classname" : "org.elasticsearch.plugin.deletebyquery.DeleteByQueryPlugin",
      "isolated" : true,
      "site" : false
    }, {
      "name" : "kopf",
      "version" : "2.0.1",
      "description" : "kopf - simple web administration tool for Elasticsearch",
      "url" : "/_plugin/kopf/",
      "jvm" : false,
      "site" : true
    } ]
  }
}

Can you upgrade?

@warkolm Upgrade is an option, but will it solve this issue?
Isn't there any other resolution for this?

Elasticsearch by default assumes all data nodes are equal so the smaller node is likely to be under more pressure.
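
A quick way to compare heap pressure on the two nodes is the cat nodes API, for example (assuming curl access to either node on localhost:9200):

    curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max,ram.percent'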

@Christian_Dahlqvist Thanks for replying. If that is the case, why does my bigger node's heap get full from time to time while the smaller node's heap stays normal?

It's not like it is always the same one of them going down; it's random.

Do you have monitoring installed so you can show how heap usage varies over time?

How full does it get (it is expected to get to about 75% full before GC kicks in)? Does it crash the node?
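
Without monitoring installed, heap usage over time can still be sampled with a simple loop, for example (localhost:9200, the interval and the output file are placeholders):

    # append a timestamped heap snapshot for every node once a minute
    while true; do
      date
      curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max'
      sleep 60
    done >> heap_samples.log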

It reaches above 75% and never goes back to normal until I restart Elasticsearch on that node. And yes, if I wait long enough to see whether it goes back to normal on its own, after some time the heap gets completely full and the node crashes. I can see GC running continuously in the logs once heap usage reaches 75%, which keeps that node in a halted state.
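
One way to see what is actually holding the heap when it gets stuck like that is a class histogram from the JDK tools, for example (the PID is a placeholder for the Elasticsearch process id):

    # top heap consumers by class; 'jmap -histo:live' would force a full GC first,
    # so the plain form is gentler on a node that is already struggling
    jmap -histo <elasticsearch-pid> | head -n 40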

No monitoring installed as of now.

I noticed that you have been overriding some of the default values. How did you arrive at these values? What happens if you stick with the defaults?

Installing monitoring will give you a better idea of what's happening, as you can leverage the Monitoring functionality in X-Pack.

I changed those values while experimenting on my own to debug this issue. The issue was occurring with the default values as well.

But X-Pack is paid.

Parts of it are, but Monitoring is totally free - https://www.elastic.co/subscriptions

Anyone?

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.