For the last six months we have seen an issue in versions 6.4.2, 6.8.23 and 7.17.1 where a specific node gets stuck for 15 minutes, resulting in timeouts for all calls to that node. We have seen this with both the TransportClient and the High Level REST Client. If anyone has faced such an issue, it would be great if you could help.
Just adding some more info here. The only setting we found in Elasticsearch with a 15m timeout is
indices.recovery.internal_action_timeout
During the issue we checked the indices: there were no unassigned shards and no reallocation was happening. The cluster was green.
These versions are all really old, and the 6.x ones are well past EOL, so there's no point in digging deeper there. In the 7.17 one, what does GET _nodes/hot_threads?threads=9999 say while it's stuck?
Thanks @DavidTurner. I will collect those stats the next time this happens, probably by setting up a cron job to capture them. This happens about once every 15 days in our logging cluster. We will also try to reproduce it in our load environment.
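For the cron idea, a minimal sketch of a collector might look like the following. It only wraps the GET _nodes/hot_threads?threads=9999 call suggested above and writes each response to a timestamped file. The function names, the 30 second timeout, the file naming scheme, and the address and directory in the usage comment are all assumptions for illustration. Authentication and TLS are left out.

```python
import datetime as dt
import pathlib
import urllib.request


def hot_threads_url(base_url, threads=9999):
    """Build the hot threads URL for a node address such as http://host:9200."""
    return f"{base_url.rstrip('/')}/_nodes/hot_threads?threads={threads}"


def fetch_text(url, timeout=30):
    """Fetch a URL and return the body as text (hot threads output is plain text).

    The 30 second timeout is an assumed value, not a recommendation.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def collect_once(base_url, out_dir, fetch=fetch_text, now=None):
    """Save one hot threads snapshot to a timestamped file and return its path."""
    now = now or dt.datetime.now(dt.timezone.utc)
    out_dir = pathlib.Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: one file per run, UTC timestamp in the name.
    path = out_dir / f"hot_threads_{now:%Y%m%dT%H%M%SZ}.txt"
    path.write_text(fetch(hot_threads_url(base_url)), encoding="utf-8")
    return path


# Example call (hypothetical address and directory):
# collect_once("http://localhost:9200", "/var/tmp/es_hot_threads")
```

Running something like this from cron once a minute would leave a series of snapshots on disk, so the ones that fall inside the 15 minute window can be picked out afterwards. Old files would need to be pruned separately.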