One node in an Elasticsearch cluster gets stuck for 15 minutes and then starts working again

anandgopalratnam · June 27, 2024, 7:46am

For the last six months we have seen an issue in versions 6.4.2, 6.8.23 and 7.17.1 where a specific node gets stuck for 15 minutes, resulting in timeouts for all calls to that node. We have seen this with both the TransportClient and the High Level REST Client. If anyone has faced such an issue, it would be great if you could help.

anandgopalratnam · June 27, 2024, 2:07pm

Just adding some more info here. The only setting we found in Elasticsearch with a 15m timeout is
indices.recovery.internal_action_timeout
During the issue we checked the indices, and there were no unassigned shards and no reallocation happening. The cluster was green.
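
For anyone wanting to repeat these two checks, here is a minimal Python sketch. It assumes an unsecured node reachable at http://localhost:9200 (an assumption; add authentication and TLS as your cluster requires), and the helper names are made up for illustration. It lists every cluster setting whose value is exactly "15m" and summarizes the shard counters from the cluster health response.

```python
import json
import urllib.request

# Assumption: an unsecured node on localhost; change the URL and add auth as needed.
ES_URL = "http://localhost:9200"


def get_json(path):
    """Fetch a JSON document from the cluster (no auth, 30 s client timeout)."""
    with urllib.request.urlopen(ES_URL + path, timeout=30) as resp:
        return json.load(resp)


def settings_with_value(settings_response, value="15m"):
    """Map each flat setting whose value equals `value` to the section it came from."""
    matches = {}
    for section in ("persistent", "transient", "defaults"):
        for name, current in settings_response.get(section, {}).items():
            if current == value:
                matches[name] = section
    return matches


def shard_activity(health_response):
    """Keep only the health fields that show unassigned or moving shards."""
    keys = ("status", "unassigned_shards", "relocating_shards", "initializing_shards")
    return {key: health_response.get(key) for key in keys}


def main():
    # include_defaults also returns settings that were never explicitly set;
    # flat_settings returns dotted names instead of nested objects.
    settings = get_json("/_cluster/settings?include_defaults=true&flat_settings=true")
    print(settings_with_value(settings))
    print(shard_activity(get_json("/_cluster/health")))
```

Calling main() against a live cluster prints both results. The settings scan only matches the literal string "15m", so a timeout written as "900s" would not show up.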

DavidTurner · June 27, 2024, 7:11pm

These versions are all really old, and the 6.x ones are well past EOL, so there's no point in digging deeper there. In the 7.17 one, what does GET _nodes/hot_threads?threads=9999 say while it's stuck?

anandgopalratnam · June 27, 2024, 7:34pm

Thanks @DavidTurner. I will collect those stats the next time this happens, probably by setting up a cron job to collect them. This happens about once every 15 days in our logging cluster. We will also try simulating it in our load environment.
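
A rough sketch of what such a cron-driven collector could look like in Python. The node URL, the output directory, and the function names are all assumptions for illustration. The only Elasticsearch call it makes is the GET _nodes/hot_threads?threads=9999 request suggested above. Each run writes one timestamped text file, and a failed or timed-out request is recorded too, since that may itself mark the stuck window.

```python
import datetime
import pathlib
import urllib.request

# Assumptions: an unsecured node on localhost and a local output directory.
ES_URL = "http://localhost:9200"
OUTPUT_DIR = pathlib.Path("hot_threads")


def hot_threads_url(base_url, threads=9999):
    """Build the hot threads request URL for the given node address."""
    return f"{base_url.rstrip('/')}/_nodes/hot_threads?threads={threads}"


def snapshot_path(output_dir, now):
    """Return a timestamped file path so that runs never overwrite each other."""
    return pathlib.Path(output_dir) / f"hot_threads-{now:%Y%m%dT%H%M%S}.txt"


def collect(base_url=ES_URL, output_dir=OUTPUT_DIR, timeout=60):
    """Fetch hot threads once and save the plain-text response to disk."""
    path = snapshot_path(output_dir, datetime.datetime.now())
    path.parent.mkdir(parents=True, exist_ok=True)
    try:
        with urllib.request.urlopen(hot_threads_url(base_url), timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except OSError as exc:
        # Connection errors and timeouts are recorded instead of raised.
        body = f"request failed: {exc!r}\n"
    path.write_text(body, encoding="utf-8")
    return path
```

To use it from cron, add a final line that calls collect() and schedule the script at whatever interval suits the cluster. Because the problem shows up roughly once every 15 days, the directory will need periodic cleanup.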
