Hey y'all,
We are running Elasticsearch 7.9.3 via helm-charts/elasticsearch, and lately a pod has been going down out of the blue. When I check the cluster health with curl against _cluster/health, I get the following, which suggests all nodes are OK:
"cluster_name" : "elasticsearch",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 3,
"number_of_data_nodes" : 3,
"active_primary_shards" : 43,
"active_shards" : 86,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 100.0
I've checked the pod logs to see what happens before it dies, but this is all I get:
[gc][147189] overhead, spent [2.9s] collecting in the last [5.2s]
[gc][young][147189][871] duration [2.9s], collections [1]/[5.2s], total [2.9s]/[56.1s], memory [762.6mb]->[150.7mb]/[1gb], all_pools {[young] [612mb]->[1mb]/[0b]}{[old] [150.1mb]->[150.1mb]/[1gb]}{[survivor] [534.3kb]->[614.8kb]/[0b]}
[gc][146383] overhead, spent [669ms] collecting in the last [1.1s]
[gc][137869] overhead, spent [261ms] collecting in the last [1s]
[gc][126549] overhead, spent [287ms] collecting in the last [1s]
Any ideas? I did some searching and read that the messages above relate to JVM heap management (garbage collection), but I'm not sure whether they explain the pod going down.
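
In case it helps to put numbers on those overhead lines, here is a rough Python sketch that turns each one into the share of wall-clock time spent in GC. The regex and the helper names (`to_seconds`, `gc_overhead_fraction`) are my own assumptions based only on the lines pasted above, not anything provided by Elasticsearch, and it only handles the `ms` and `s` units that appear in my logs.

```python
import re

# Assumed pattern, derived from lines such as:
#   [gc][147189] overhead, spent [2.9s] collecting in the last [5.2s]
OVERHEAD_RE = re.compile(
    r"\[gc\]\[\d+\] overhead, spent \[([\d.]+)(ms|s)\] "
    r"collecting in the last \[([\d.]+)(ms|s)\]"
)


def to_seconds(value, unit):
    """Convert a value in 'ms' or 's' to seconds (other units are not handled)."""
    return float(value) / 1000.0 if unit == "ms" else float(value)


def gc_overhead_fraction(line):
    """Return time spent in GC divided by the reporting window, or None if no match."""
    match = OVERHEAD_RE.search(line)
    if match is None:
        return None
    spent = to_seconds(match.group(1), match.group(2))
    window = to_seconds(match.group(3), match.group(4))
    return spent / window


if __name__ == "__main__":
    sample = [
        "[gc][147189] overhead, spent [2.9s] collecting in the last [5.2s]",
        "[gc][146383] overhead, spent [669ms] collecting in the last [1.1s]",
        "[gc][137869] overhead, spent [261ms] collecting in the last [1s]",
        "[gc][126549] overhead, spent [287ms] collecting in the last [1s]",
    ]
    for line in sample:
        fraction = gc_overhead_fraction(line)
        if fraction is not None:
            print(f"{fraction:.0%} of the window in GC: {line}")
```

By this measure, the first two lines show more than half of the reporting window spent collecting, which is why I suspect the 1gb heap, but I can't tell from this alone whether it is what takes the pod down.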