One of three pods stops, but Elasticsearch thinks it's still running

Hey y'all,

We are running Elasticsearch 7.9.3 with the help of helm-charts/elasticsearch, and lately we have had a pod go down out of the blue. When I check the health of Elasticsearch with `curl _cluster/health`, I get the following, which implies all nodes are OK:

      "cluster_name" : "elasticsearch",
      "status" : "green",
      "timed_out" : false,
      "number_of_nodes" : 3,
      "number_of_data_nodes" : 3,
      "active_primary_shards" : 43,
      "active_shards" : 86,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0,
      "delayed_unassigned_shards" : 0,
      "number_of_pending_tasks" : 0,
      "number_of_in_flight_fetch" : 0,
      "task_max_waiting_in_queue_millis" : 0,
      "active_shards_percent_as_number" : 100.0

I've checked the pod logs to see what happens before it dies, but this is all I get:

    [gc][147189] overhead, spent [2.9s] collecting in the last [5.2s]
    [gc][young][147189][871] duration [2.9s], collections [1]/[5.2s], total [2.9s]/[56.1s], memory [762.6mb]->[150.7mb]/[1gb], all_pools {[young] [612mb]->[1mb]/[0b]}{[old] [150.1mb]->[150.1mb]/[1gb]}{[survivor] [534.3kb]->[614.8kb]/[0b]}
    [gc][146383] overhead, spent [669ms] collecting in the last [1.1s]
    [gc][137869] overhead, spent [261ms] collecting in the last [1s]
    [gc][126549] overhead, spent [287ms] collecting in the last [1s]

Any ideas? I did some searching and saw that the above messages were related to heap management, but I'm not so sure.

2.9 of 5.2 seconds and 0.669 of 1.1 is a lot of time during which the node will do nothing. It does "stop the world" for that time. Is it possible that it's just pausing and recovering, or do you see a full node restart in the log? How much heap do you have allocated?
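Worked out as percentages (just the arithmetic on the numbers from your log):

```shell
# GC pause time as a fraction of the reported wall-clock window:
awk 'BEGIN { printf "%.0f%%\n", 100 * 2.9 / 5.2 }'    # prints 56%
awk 'BEGIN { printf "%.0f%%\n", 100 * 0.669 / 1.1 }'  # prints 61%
```

So the node was paused for more than half of each of those windows, which is easily enough to miss a readiness probe.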

Hey thanks for the response,

I don't think it attempts a restart; it just sort of stops. When I check the pod status it says it's running, but it is no longer "Ready". When I try to exec into the pod it's unresponsive, and the exec command fails.
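In case it helps, this is roughly how I'm inspecting it (the pod name and namespace below are placeholders for ours):

```shell
kubectl get pods -n elastic                               # shows the pod Running but not Ready
kubectl describe pod elasticsearch-master-0 -n elastic    # readiness probe failures in Events
kubectl get events -n elastic --sort-by=.lastTimestamp    # OOMKills, probe timeouts, etc.
```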

As for the heap, we are using the helm-chart default, which is 1 gig.

How old is this cluster? Did it work OK for a while, but more data made it unhealthy? Limited to a 1G heap, it may be time to add nodes.
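If you do bump the heap through the chart, it's roughly this in your values override (a sketch — the `esJavaOpts` and `resources` keys are from elastic/helm-charts, but the sizes here are illustrative; keep Xms equal to Xmx, and heap at no more than about half the container memory):

```yaml
# values.yaml override (illustrative sizes):
esJavaOpts: "-Xms2g -Xmx2g"   # raise heap from the chart's 1g default
resources:
  requests:
    memory: "4Gi"             # headroom beyond heap for off-heap usage
  limits:
    memory: "4Gi"
```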

We started off on Elasticsearch 6.8.X last summer (2019) and upgraded to 7 in January. We had no real issues until our upgrade to 7.9 (from 7.4.X, I believe). On 7.9.0 we thought maybe it was the memory leak reported here: https://github.com/elastic/elasticsearch/issues/61512 . So we upgraded to 7.9.3, but the issue still seems to occur.