Elasticsearch operational issue due to garbage collector


(chitender kumar) #1

running a 4 node elasticsearch cluster. which has one client node and 3 data nodes. data nodes are also configured as master eligible node.

elasticsearch version is 2.3.5

lucene version is 5.5.0

JAVA is openjdk version 1.8.0_141

below are the JAVA_OPTS values with which elasticsearch is running.

-Xms10g -Xmx10g -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC

when ever GC is triggered and at that time data nodes leaves the cluster. below are the GC logs:

[2017-09-28 11:52:52,186][WARN ][monitor.jvm ] [10.0.1.85] [gc][old][504680][614] duration [39.8m], collections [84]/[40.9m], total [39.8m]/[5h], memory [9.9gb]->[9.9gb]/[9.9gb], all_pools {[young] [532.5mb]->[532.4mb]/[532.5mb]}{[survivor] [66.4mb]->[66.4mb]/[66.5mb]}{[old] [9.3gb]->[9.3gb]/[9.3gb]}
[2017-09-28 11:52:53,186][WARN ][monitor.jvm ] [10.0.1.85] [gc][old][504681][615] duration [29.5s], collections [1]/[30.5s], total [29.5s]/[5h], memory [9.9gb]->[6.1gb]/[9.9gb], all_pools {[young] [532.4mb]->[47.1mb]/[532.5mb]}{[survivor] [66.4mb]->[66.4mb]/[66.5mb]}{[old] [9.3gb]->[6gb]/[9.3gb]}

during this time cluster doesn't process the queries. and queries fails with

[2017-09-28 12:34:49,232][DEBUG][action.admin.cluster.health] [10.0.1.85] no known master node, scheduling a retry
[2017-09-28 12:35:19,233][DEBUG][action.admin.cluster.health] [10.0.1.85] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])
[2017-09-28 12:35:19,234][WARN ][rest.suppressed ] path: /_cluster/health, params: {}
MasterNotDiscoveredException[null]
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$5.onTimeout(TransportMasterNodeAction.java:226)
at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:236)
at org.elasticsearch.cluster.service.InternalClusterService$NotifyTimeout.run(InternalClusterService.java:804)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

and data nodes are unable to discover the master nodes, below are the logs

[2017-09-28 12:33:48,614][WARN ][discovery.zen.ping.unicast] [10.0.1.85] failed to send ping to [{#zen_unicast_3#}{10.0.1.89}{10.0.1.89:9300}]
ReceiveTimeoutTransportException[[][10.0.1.89:9300][internal:discovery/zen/unicast] request_id [648779] timed out after [3750ms]]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:679)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

really require suggestions on this, how we can optimise it.


(Christian Dahlqvist) #2

Can you provide the output of the cluster stats API?


(chitender kumar) #3

{
"timestamp" : 1506621190822,
"cluster_name" : "pediatric",
"status" : "green",
"indices" : {
"count" : 17,
"shards" : {
"total" : 170,
"primaries" : 85,
"replication" : 1.0,
"index" : {
"shards" : {
"min" : 10,
"max" : 10,
"avg" : 10.0
},
"primaries" : {
"min" : 5,
"max" : 5,
"avg" : 5.0
},
"replication" : {
"min" : 1.0,
"max" : 1.0,
"avg" : 1.0
}
}
},
"docs" : {
"count" : 15497625,
"deleted" : 0
},
"store" : {
"size" : "25.2gb",
"size_in_bytes" : 27117215234,
"throttle_time" : "0s",
"throttle_time_in_millis" : 0
},
"fielddata" : {
"memory_size" : "0b",
"memory_size_in_bytes" : 0,
"evictions" : 0
},
"query_cache" : {
"memory_size" : "0b",
"memory_size_in_bytes" : 0,
"total_count" : 0,
"hit_count" : 0,
"miss_count" : 0,
"cache_size" : 0,
"cache_count" : 0,
"evictions" : 0
},
"completion" : {
"size" : "0b",
"size_in_bytes" : 0
},
"segments" : {
"count" : 901,
"memory" : "133.7mb",
"memory_in_bytes" : 140246775,
"terms_memory" : "117.4mb",
"terms_memory_in_bytes" : 123143587,
"stored_fields_memory" : "11.6mb",
"stored_fields_memory_in_bytes" : 12214120,
"term_vectors_memory" : "0b",
"term_vectors_memory_in_bytes" : 0,
"norms_memory" : "4.5mb",
"norms_memory_in_bytes" : 4791360,
"doc_values_memory" : "95.4kb",
"doc_values_memory_in_bytes" : 97708,
"index_writer_memory" : "0b",
"index_writer_memory_in_bytes" : 0,
"index_writer_max_memory" : "83mb",
"index_writer_max_memory_in_bytes" : 87040000,
"version_map_memory" : "0b",
"version_map_memory_in_bytes" : 0,
"fixed_bit_set" : "0b",
"fixed_bit_set_memory_in_bytes" : 0
},
"percolate" : {
"total" : 0,
"time" : "0s",
"time_in_millis" : 0,
"current" : 0,
"memory_size_in_bytes" : -1,
"memory_size" : "-1b",
"queries" : 0
}
},
"nodes" : {
"count" : {
"total" : 4,
"master_only" : 0,
"data_only" : 0,
"master_data" : 3,
"client" : 0
},
"versions" : [ "2.3.5" ],
"os" : {
"available_processors" : 26,
"allocated_processors" : 26,
"mem" : {
"total" : "30.4gb",
"total_in_bytes" : 32718340096
},
"names" : [ {
"name" : "Linux",
"count" : 4
} ]
},
"process" : {
"cpu" : {
"percent" : 0
},
"open_file_descriptors" : {
"min" : 191,
"max" : 845,
"avg" : 657
}
},
"jvm" : {
"max_uptime" : "5.1h",
"max_uptime_in_millis" : 18645205,
"versions" : [ {
"version" : "1.8.0_141",
"vm_name" : "OpenJDK 64-Bit Server VM",
"vm_version" : "25.141-b16",
"vm_vendor" : "Oracle Corporation",
"count" : 4
} ],
"mem" : {
"heap_used" : "1.2gb",
"heap_used_in_bytes" : 1315919824,
"heap_max" : "32.7gb",
"heap_max_in_bytes" : 35206856704
},
"threads" : 241
},
"fs" : {
"total" : "1.4tb",
"total_in_bytes" : 1584931799040,
"free" : "572.3gb",
"free_in_bytes" : 614515998720,
"available" : "497.2gb",
"available_in_bytes" : 533935030272
},
"plugins" : [ {
"name" : "head",
"version" : "master",
"description" : "head - A web front end for an elastic search cluster",
"url" : "/_plugin/head/",
"jvm" : false,
"site" : true
}, {
"name" : "sql",
"version" : "2.3.5.0",
"description" : "Query elasticsearch using SQL",
"url" : "/_plugin/sql/",
"jvm" : true,
"classname" : "org.elasticsearch.plugin.nlpcn.SqlPlug",
"isolated" : true,
"site" : true
} ]
}
}


(Christian Dahlqvist) #4

Do you have any graphs of how heap usage varies over time? It seems like something uses up a lot of heap, leading to long GC. Heap usage however currently seem to be low and all seems fine as far as I can tell. I have however never used the SQL plugin you have installed so have no idea how this affects heap usage.


(chitender kumar) #5

generally, we do not face any issue during normal operation.

we are using elastic node with Hadoop clusters to store the output of jobs which were executed on 1 0r 2 TB of data. and also we do not face this issue on daily basis. it comes once in 7 or 8 days.

I suspect issue due to JAVA opts arguments for garbage collector.

-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC

openJDK 1.8 is used.


(chitender kumar) #6

I am preparing the bash script to gather the heap usage of all nodes. let me know what are parameters I should consider for analysing the heap usage.


(Christian Dahlqvist) #7

It would be interesting to see how it develops over time, so I would recommend installing monitoring. Based on the point in time stats you provided it looks fine.


(system) #8

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.