One of three pods stops, but Elasticsearch thinks it's still running

Hey y'all,

We are running Elasticsearch 7.9.3 with the help of helm-charts/elasticsearch, and lately we have had a pod go down out of the blue. When I check the health of Elasticsearch with `curl _cluster/health`, I get the following, which implies all nodes are OK:

      "cluster_name" : "elasticsearch",
      "status" : "green",
      "timed_out" : false,
      "number_of_nodes" : 3,
      "number_of_data_nodes" : 3,
      "active_primary_shards" : 43,
      "active_shards" : 86,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0,
      "delayed_unassigned_shards" : 0,
      "number_of_pending_tasks" : 0,
      "number_of_in_flight_fetch" : 0,
      "task_max_waiting_in_queue_millis" : 0,
      "active_shards_percent_as_number" : 100.0

I've checked the pod logs to see what happens before it dies, but this is all I get:

    [gc][147189] overhead, spent [2.9s] collecting in the last [5.2s]
    [gc][young][147189][871] duration [2.9s], collections [1]/[5.2s], total [2.9s]/[56.1s], memory [762.6mb]->[150.7mb]/[1gb], all_pools {[young] [612mb]->[1mb]/[0b]}{[old] [150.1mb]->[150.1mb]/[1gb]}{[survivor] [534.3kb]->[614.8kb]/[0b]}
    [gc][146383] overhead, spent [669ms] collecting in the last [1.1s]
    [gc][137869] overhead, spent [261ms] collecting in the last [1s]
    [gc][126549] overhead, spent [287ms] collecting in the last [1s]

Any ideas? I did some searching and saw that the above messages were related to heap management, but I'm not so sure.

2.9 of 5.2 seconds and 0.669 of 1.1 is a lot of time during which the node will do nothing. It does "stop the world" for that time. Is it possible that it's just pausing and recovering, or do you see a full node restart in the log? How much heap do you have allocated?
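Worked out as percentages (just the arithmetic on the numbers from your log):

```shell
# GC pause time as a fraction of the reported wall-clock window:
awk 'BEGIN { printf "%.0f%%\n", 100 * 2.9 / 5.2 }'    # prints 56%
awk 'BEGIN { printf "%.0f%%\n", 100 * 0.669 / 1.1 }'  # prints 61%
```

So the node was paused for more than half of each of those windows, which is easily enough to miss a readiness probe.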

Hey thanks for the response,

I don't think it attempts a restart; it just sort of stops. When I check the pod status it says it's running, but it is no longer "Ready". When I try to exec into the pod it's unresponsive, and the exec command fails.
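In case it helps, this is roughly how I'm inspecting it (the pod name and namespace below are placeholders for ours):

```shell
kubectl get pods -n elastic                               # shows the pod Running but not Ready
kubectl describe pod elasticsearch-master-0 -n elastic    # readiness probe failures in Events
kubectl get events -n elastic --sort-by=.lastTimestamp    # OOMKills, probe timeouts, etc.
```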

As for the heap, we are using the helm-chart default, which is 1 gig.

How old is this cluster? Did it work OK for a while, but more data made it unhealthy? Limited to a 1G heap, it may be time to add nodes.
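If you do bump the heap through the chart, it's roughly this in your values override (a sketch — the `esJavaOpts` and `resources` keys are from elastic/helm-charts, but the sizes here are illustrative; keep Xms equal to Xmx, and heap at no more than about half the container memory):

```yaml
# values.yaml override (illustrative sizes):
esJavaOpts: "-Xms2g -Xmx2g"   # raise heap from the chart's 1g default
resources:
  requests:
    memory: "4Gi"             # headroom beyond heap for off-heap usage
  limits:
    memory: "4Gi"
```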

We started off on Elasticsearch 6.8.X last summer (2019) and upgraded to 7 in January. We had no real issues until our upgrade to 7.9 (from 7.4.X, I believe). On 7.9.0 we thought maybe it was the memory leak reported here: https://github.com/elastic/elasticsearch/issues/61512 . So we upgraded to 7.9.3, but the issue still seems to occur.