Elasticsearch operational issue due to garbage collector

We are running a 4-node Elasticsearch cluster with one client node and 3 data nodes. The data nodes are also configured as master-eligible nodes.

Elasticsearch version is 2.3.5

Lucene version is 5.5.0

Java is OpenJDK version 1.8.0_141

Below are the JAVA_OPTS values with which Elasticsearch is running:

-Xms10g -Xmx10g -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC

Whenever GC is triggered, the data nodes leave the cluster. Below are the GC logs:

[2017-09-28 11:52:52,186][WARN ][monitor.jvm ] [10.0.1.85] [gc][old][504680][614] duration [39.8m], collections [84]/[40.9m], total [39.8m]/[5h], memory [9.9gb]->[9.9gb]/[9.9gb], all_pools {[young] [532.5mb]->[532.4mb]/[532.5mb]}{[survivor] [66.4mb]->[66.4mb]/[66.5mb]}{[old] [9.3gb]->[9.3gb]/[9.3gb]}
[2017-09-28 11:52:53,186][WARN ][monitor.jvm ] [10.0.1.85] [gc][old][504681][615] duration [29.5s], collections [1]/[30.5s], total [29.5s]/[5h], memory [9.9gb]->[6.1gb]/[9.9gb], all_pools {[young] [532.4mb]->[47.1mb]/[532.5mb]}{[survivor] [66.4mb]->[66.4mb]/[66.5mb]}{[old] [9.3gb]->[6gb]/[9.3gb]}

During this time the cluster doesn't process queries, and they fail with:

[2017-09-28 12:34:49,232][DEBUG][action.admin.cluster.health] [10.0.1.85] no known master node, scheduling a retry
[2017-09-28 12:35:19,233][DEBUG][action.admin.cluster.health] [10.0.1.85] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])
[2017-09-28 12:35:19,234][WARN ][rest.suppressed ] path: /_cluster/health, params: {}
MasterNotDiscoveredException[null]
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$5.onTimeout(TransportMasterNodeAction.java:226)
at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:236)
at org.elasticsearch.cluster.service.InternalClusterService$NotifyTimeout.run(InternalClusterService.java:804)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

The data nodes are also unable to discover the master nodes; below are the logs:

[2017-09-28 12:33:48,614][WARN ][discovery.zen.ping.unicast] [10.0.1.85] failed to send ping to [{#zen_unicast_3#}{10.0.1.89}{10.0.1.89:9300}]
ReceiveTimeoutTransportException[[][10.0.1.89:9300][internal:discovery/zen/unicast] request_id [648779] timed out after [3750ms]]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:679)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

We would really appreciate suggestions on how we can optimise this.

Can you provide the output of the cluster stats API?
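Something like this should return it (the host and port here are just assumptions for your setup, adjust as needed):

curl -s 'http://localhost:9200/_cluster/stats?human&pretty'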

{
  "timestamp" : 1506621190822,
  "cluster_name" : "pediatric",
  "status" : "green",
  "indices" : {
    "count" : 17,
    "shards" : {
      "total" : 170,
      "primaries" : 85,
      "replication" : 1.0,
      "index" : {
        "shards" : {
          "min" : 10,
          "max" : 10,
          "avg" : 10.0
        },
        "primaries" : {
          "min" : 5,
          "max" : 5,
          "avg" : 5.0
        },
        "replication" : {
          "min" : 1.0,
          "max" : 1.0,
          "avg" : 1.0
        }
      }
    },
    "docs" : {
      "count" : 15497625,
      "deleted" : 0
    },
    "store" : {
      "size" : "25.2gb",
      "size_in_bytes" : 27117215234,
      "throttle_time" : "0s",
      "throttle_time_in_millis" : 0
    },
    "fielddata" : {
      "memory_size" : "0b",
      "memory_size_in_bytes" : 0,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "0b",
      "memory_size_in_bytes" : 0,
      "total_count" : 0,
      "hit_count" : 0,
      "miss_count" : 0,
      "cache_size" : 0,
      "cache_count" : 0,
      "evictions" : 0
    },
    "completion" : {
      "size" : "0b",
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 901,
      "memory" : "133.7mb",
      "memory_in_bytes" : 140246775,
      "terms_memory" : "117.4mb",
      "terms_memory_in_bytes" : 123143587,
      "stored_fields_memory" : "11.6mb",
      "stored_fields_memory_in_bytes" : 12214120,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "4.5mb",
      "norms_memory_in_bytes" : 4791360,
      "doc_values_memory" : "95.4kb",
      "doc_values_memory_in_bytes" : 97708,
      "index_writer_memory" : "0b",
      "index_writer_memory_in_bytes" : 0,
      "index_writer_max_memory" : "83mb",
      "index_writer_max_memory_in_bytes" : 87040000,
      "version_map_memory" : "0b",
      "version_map_memory_in_bytes" : 0,
      "fixed_bit_set" : "0b",
      "fixed_bit_set_memory_in_bytes" : 0
    },
    "percolate" : {
      "total" : 0,
      "time" : "0s",
      "time_in_millis" : 0,
      "current" : 0,
      "memory_size_in_bytes" : -1,
      "memory_size" : "-1b",
      "queries" : 0
    }
  },
  "nodes" : {
    "count" : {
      "total" : 4,
      "master_only" : 0,
      "data_only" : 0,
      "master_data" : 3,
      "client" : 0
    },
    "versions" : [ "2.3.5" ],
    "os" : {
      "available_processors" : 26,
      "allocated_processors" : 26,
      "mem" : {
        "total" : "30.4gb",
        "total_in_bytes" : 32718340096
      },
      "names" : [ {
        "name" : "Linux",
        "count" : 4
      } ]
    },
    "process" : {
      "cpu" : {
        "percent" : 0
      },
      "open_file_descriptors" : {
        "min" : 191,
        "max" : 845,
        "avg" : 657
      }
    },
    "jvm" : {
      "max_uptime" : "5.1h",
      "max_uptime_in_millis" : 18645205,
      "versions" : [ {
        "version" : "1.8.0_141",
        "vm_name" : "OpenJDK 64-Bit Server VM",
        "vm_version" : "25.141-b16",
        "vm_vendor" : "Oracle Corporation",
        "count" : 4
      } ],
      "mem" : {
        "heap_used" : "1.2gb",
        "heap_used_in_bytes" : 1315919824,
        "heap_max" : "32.7gb",
        "heap_max_in_bytes" : 35206856704
      },
      "threads" : 241
    },
    "fs" : {
      "total" : "1.4tb",
      "total_in_bytes" : 1584931799040,
      "free" : "572.3gb",
      "free_in_bytes" : 614515998720,
      "available" : "497.2gb",
      "available_in_bytes" : 533935030272
    },
    "plugins" : [ {
      "name" : "head",
      "version" : "master",
      "description" : "head - A web front end for an elastic search cluster",
      "url" : "/_plugin/head/",
      "jvm" : false,
      "site" : true
    }, {
      "name" : "sql",
      "version" : "2.3.5.0",
      "description" : "Query elasticsearch using SQL",
      "url" : "/_plugin/sql/",
      "jvm" : true,
      "classname" : "org.elasticsearch.plugin.nlpcn.SqlPlug",
      "isolated" : true,
      "site" : true
    } ]
  }
}

Do you have any graphs of how heap usage varies over time? It seems like something uses up a lot of heap, leading to long GC pauses. Heap usage, however, currently seems to be low, and everything looks fine as far as I can tell. I have never used the SQL plugin you have installed, though, so I have no idea how it affects heap usage.

Generally, we do not face any issues during normal operation.

We are using the Elasticsearch cluster with Hadoop to store the output of jobs that are executed on 1 or 2 TB of data. We also do not face this issue on a daily basis; it comes up once every 7 or 8 days.

I suspect the issue is due to the Java options used for the garbage collector:

-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC

OpenJDK 1.8 is used.
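To see in more detail what the collector is doing when this happens, I am planning to enable GC logging the next time we can restart the nodes, by appending something like the following to the JAVA_OPTS above (the log path is just a placeholder for our environment):

-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/elasticsearch/gc.log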

I am preparing a bash script to gather the heap usage of all nodes; a rough first draft is below. Please let me know which parameters I should consider for analysing heap usage.
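This is roughly what I have so far. It polls the nodes stats API once a minute; the node address, output file, and the use of jq for parsing are all assumptions for our environment:

#!/bin/bash
# Poll heap usage and old-gen GC stats for every node once a minute.
NODES="10.0.1.85:9200"              # placeholder: any one HTTP endpoint in the cluster
OUTFILE=/tmp/es_heap_usage.tsv      # placeholder output file

while true; do
  for node in $NODES; do
    # /_nodes/stats/jvm returns per-node heap and GC collector statistics
    curl -s "http://$node/_nodes/stats/jvm" |
      jq -r --arg ts "$(date -u +%FT%TZ)" '
        .nodes[] |
        [$ts, .name,
         .jvm.mem.heap_used_percent,
         .jvm.mem.heap_used_in_bytes,
         .jvm.gc.collectors.old.collection_count,
         .jvm.gc.collectors.old.collection_time_in_millis] | @tsv' >> "$OUTFILE"
  done
  sleep 60
done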

It would be interesting to see how it develops over time, so I would recommend installing monitoring. Based on the point-in-time stats you provided, it looks fine.
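For a 2.x cluster that would typically mean Marvel. If I remember the 2.x plugin syntax correctly, it is installed on each Elasticsearch node roughly like this (run from the Elasticsearch home directory), plus the Marvel app on the Kibana side:

bin/plugin install license
bin/plugin install marvel-agent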

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.