Elasticsearch operational issue due to garbage collector

We are running a 4-node Elasticsearch cluster with one client node and 3 data nodes. The data nodes are also configured as master-eligible nodes.

Elasticsearch version is 2.3.5

Lucene version is 5.5.0

Java is OpenJDK version 1.8.0_141

Below are the JAVA_OPTS values with which Elasticsearch is running:

-Xms10g -Xmx10g -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC

Whenever GC is triggered, the data nodes leave the cluster. Below are the GC logs:

[2017-09-28 11:52:52,186][WARN ][monitor.jvm ] [10.0.1.85] [gc][old][504680][614] duration [39.8m], collections [84]/[40.9m], total [39.8m]/[5h], memory [9.9gb]->[9.9gb]/[9.9gb], all_pools {[young] [532.5mb]->[532.4mb]/[532.5mb]}{[survivor] [66.4mb]->[66.4mb]/[66.5mb]}{[old] [9.3gb]->[9.3gb]/[9.3gb]}
[2017-09-28 11:52:53,186][WARN ][monitor.jvm ] [10.0.1.85] [gc][old][504681][615] duration [29.5s], collections [1]/[30.5s], total [29.5s]/[5h], memory [9.9gb]->[6.1gb]/[9.9gb], all_pools {[young] [532.4mb]->[47.1mb]/[532.5mb]}{[survivor] [66.4mb]->[66.4mb]/[66.5mb]}{[old] [9.3gb]->[6gb]/[9.3gb]}

During this time the cluster doesn't process queries, and they fail with:

[2017-09-28 12:34:49,232][DEBUG][action.admin.cluster.health] [10.0.1.85] no known master node, scheduling a retry
[2017-09-28 12:35:19,233][DEBUG][action.admin.cluster.health] [10.0.1.85] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])
[2017-09-28 12:35:19,234][WARN ][rest.suppressed ] path: /_cluster/health, params: {}
MasterNotDiscoveredException[null]
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$5.onTimeout(TransportMasterNodeAction.java:226)
at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:236)
at org.elasticsearch.cluster.service.InternalClusterService$NotifyTimeout.run(InternalClusterService.java:804)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

The data nodes are also unable to discover the master nodes; below are the logs:

[2017-09-28 12:33:48,614][WARN ][discovery.zen.ping.unicast] [10.0.1.85] failed to send ping to [{#zen_unicast_3#}{10.0.1.89}{10.0.1.89:9300}]
ReceiveTimeoutTransportException[[][10.0.1.89:9300][internal:discovery/zen/unicast] request_id [648779] timed out after [3750ms]]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:679)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

We would really appreciate suggestions on how we can optimise this.

Can you provide the output of the cluster stats API?
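Something like this should return it (the host and port here are just assumptions for your setup, adjust as needed):

curl -s 'http://localhost:9200/_cluster/stats?human&pretty'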

{
  "timestamp" : 1506621190822,
  "cluster_name" : "pediatric",
  "status" : "green",
  "indices" : {
    "count" : 17,
    "shards" : {
      "total" : 170,
      "primaries" : 85,
      "replication" : 1.0,
      "index" : {
        "shards" : {
          "min" : 10,
          "max" : 10,
          "avg" : 10.0
        },
        "primaries" : {
          "min" : 5,
          "max" : 5,
          "avg" : 5.0
        },
        "replication" : {
          "min" : 1.0,
          "max" : 1.0,
          "avg" : 1.0
        }
      }
    },
    "docs" : {
      "count" : 15497625,
      "deleted" : 0
    },
    "store" : {
      "size" : "25.2gb",
      "size_in_bytes" : 27117215234,
      "throttle_time" : "0s",
      "throttle_time_in_millis" : 0
    },
    "fielddata" : {
      "memory_size" : "0b",
      "memory_size_in_bytes" : 0,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "0b",
      "memory_size_in_bytes" : 0,
      "total_count" : 0,
      "hit_count" : 0,
      "miss_count" : 0,
      "cache_size" : 0,
      "cache_count" : 0,
      "evictions" : 0
    },
    "completion" : {
      "size" : "0b",
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 901,
      "memory" : "133.7mb",
      "memory_in_bytes" : 140246775,
      "terms_memory" : "117.4mb",
      "terms_memory_in_bytes" : 123143587,
      "stored_fields_memory" : "11.6mb",
      "stored_fields_memory_in_bytes" : 12214120,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "4.5mb",
      "norms_memory_in_bytes" : 4791360,
      "doc_values_memory" : "95.4kb",
      "doc_values_memory_in_bytes" : 97708,
      "index_writer_memory" : "0b",
      "index_writer_memory_in_bytes" : 0,
      "index_writer_max_memory" : "83mb",
      "index_writer_max_memory_in_bytes" : 87040000,
      "version_map_memory" : "0b",
      "version_map_memory_in_bytes" : 0,
      "fixed_bit_set" : "0b",
      "fixed_bit_set_memory_in_bytes" : 0
    },
    "percolate" : {
      "total" : 0,
      "time" : "0s",
      "time_in_millis" : 0,
      "current" : 0,
      "memory_size_in_bytes" : -1,
      "memory_size" : "-1b",
      "queries" : 0
    }
  },
  "nodes" : {
    "count" : {
      "total" : 4,
      "master_only" : 0,
      "data_only" : 0,
      "master_data" : 3,
      "client" : 0
    },
    "versions" : [ "2.3.5" ],
    "os" : {
      "available_processors" : 26,
      "allocated_processors" : 26,
      "mem" : {
        "total" : "30.4gb",
        "total_in_bytes" : 32718340096
      },
      "names" : [ {
        "name" : "Linux",
        "count" : 4
      } ]
    },
    "process" : {
      "cpu" : {
        "percent" : 0
      },
      "open_file_descriptors" : {
        "min" : 191,
        "max" : 845,
        "avg" : 657
      }
    },
    "jvm" : {
      "max_uptime" : "5.1h",
      "max_uptime_in_millis" : 18645205,
      "versions" : [ {
        "version" : "1.8.0_141",
        "vm_name" : "OpenJDK 64-Bit Server VM",
        "vm_version" : "25.141-b16",
        "vm_vendor" : "Oracle Corporation",
        "count" : 4
      } ],
      "mem" : {
        "heap_used" : "1.2gb",
        "heap_used_in_bytes" : 1315919824,
        "heap_max" : "32.7gb",
        "heap_max_in_bytes" : 35206856704
      },
      "threads" : 241
    },
    "fs" : {
      "total" : "1.4tb",
      "total_in_bytes" : 1584931799040,
      "free" : "572.3gb",
      "free_in_bytes" : 614515998720,
      "available" : "497.2gb",
      "available_in_bytes" : 533935030272
    },
    "plugins" : [ {
      "name" : "head",
      "version" : "master",
      "description" : "head - A web front end for an elastic search cluster",
      "url" : "/_plugin/head/",
      "jvm" : false,
      "site" : true
    }, {
      "name" : "sql",
      "version" : "2.3.5.0",
      "description" : "Query elasticsearch using SQL",
      "url" : "/_plugin/sql/",
      "jvm" : true,
      "classname" : "org.elasticsearch.plugin.nlpcn.SqlPlug",
      "isolated" : true,
      "site" : true
    } ]
  }
}

Do you have any graphs of how heap usage varies over time? It seems like something uses up a lot of heap, leading to long GC pauses. Heap usage, however, currently seems to be low, and everything looks fine as far as I can tell. I have never used the SQL plugin you have installed, though, so I have no idea how it affects heap usage.

Generally, we do not face any issues during normal operation.

We are using the Elasticsearch cluster with Hadoop to store the output of jobs that are executed on 1 or 2 TB of data. We also do not face this issue on a daily basis; it comes up once every 7 or 8 days.

I suspect the issue is due to the Java options used for the garbage collector:

-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC

OpenJDK 1.8 is used.
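To see in more detail what the collector is doing when this happens, I am planning to enable GC logging the next time we can restart the nodes, by appending something like the following to the JAVA_OPTS above (the log path is just a placeholder for our environment):

-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/elasticsearch/gc.log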

I am preparing a bash script to gather the heap usage of all nodes; a rough first draft is below. Please let me know which parameters I should consider for analysing heap usage.
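This is roughly what I have so far. It polls the nodes stats API once a minute; the node address, output file, and the use of jq for parsing are all assumptions for our environment:

#!/bin/bash
# Poll heap usage and old-gen GC stats for every node once a minute.
NODES="10.0.1.85:9200"              # placeholder: any one HTTP endpoint in the cluster
OUTFILE=/tmp/es_heap_usage.tsv      # placeholder output file

while true; do
  for node in $NODES; do
    # /_nodes/stats/jvm returns per-node heap and GC collector statistics
    curl -s "http://$node/_nodes/stats/jvm" |
      jq -r --arg ts "$(date -u +%FT%TZ)" '
        .nodes[] |
        [$ts, .name,
         .jvm.mem.heap_used_percent,
         .jvm.mem.heap_used_in_bytes,
         .jvm.gc.collectors.old.collection_count,
         .jvm.gc.collectors.old.collection_time_in_millis] | @tsv' >> "$OUTFILE"
  done
  sleep 60
done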

It would be interesting to see how it develops over time, so I would recommend installing monitoring. Based on the point-in-time stats you provided, it looks fine.
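For a 2.x cluster that would typically mean Marvel. If I remember the 2.x plugin syntax correctly, it is installed on each Elasticsearch node roughly like this (run from the Elasticsearch home directory), plus the Marvel app on the Kibana side:

bin/plugin install license
bin/plugin install marvel-agent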

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.