Elasticsearch not responding for _cat and other APIs

Hi,

Env details:
I am using ELK 7.0.1 in a Kubernetes environment. I have a 9-node cluster with 3 master pods, 3 data pods and 3 client pods.
Memory and CPU limits are configured as:

 Master pods: RAM: no limit, CPU: 1000m, JVM: -Xms1g -Xmx1g
 Data pods:   RAM: no limit, CPU: 2000m, JVM: -Xms4g -Xmx4g
 Client pods: RAM: no limit, CPU: 2000m, JVM: -Xms4g -Xmx4g

Problem -
I am not getting a response for a few REST APIs on Elasticsearch.
GET _cluster/health - works.
GET _cat/health - works

For ex. GET /_cluster/health
{"cluster_name":"elk-efkc","status":"green","timed_out":false,"number_of_nodes":9,"number_of_data_nodes":3,"active_primary_shards":136,"active_shards":273,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":4,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":4964575,"active_shards_percent_as_number":100.0}

But there is NO response for _cat/nodes, _cat/shards, _cat/indices, _nodes/stats and many other such APIs. The curl request remains stuck for hours.
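
For reference, this is roughly how I am calling the APIs (the hostname below is a placeholder for our client service; the -m flag is just curl's own client-side timeout, so the call fails fast instead of hanging for hours):

  # works - returns immediately
  curl -s -m 30 'http://elasticsearch-client:9200/_cluster/health?pretty'

  # hang - no output until curl's own timeout kicks in
  curl -s -m 30 'http://elasticsearch-client:9200/_cat/nodes?v'
  curl -s -m 30 'http://elasticsearch-client:9200/_cat/indices?v'
  curl -s -m 30 'http://elasticsearch-client:9200/_nodes/stats'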

I saw an older post with a similar problem - CAT api doesn't respond

I have shared the response of GET /_nodes/hot_threads?threads=9999 here: https://gist.github.com/aggarwalShivani/e49e3d359f06f6ec5e9a9e10067819db

Can you please help me figure out the issue with the cluster and how we can resolve it?

Thanks,
Shivani

You can add a timeout parameter to those actions; however, that will not work if the node you are connecting to is unresponsive. The first task should be to figure out whether there is only a single unresponsive node in your cluster, and if so, can you get the logs from that one?
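
For example (a rough sketch; the node addresses below are placeholders for however you reach each pod, e.g. a pod IP or kubectl port-forward):

  # the cat APIs take a master_timeout, and the nodes APIs take a per-node timeout,
  # so they give up instead of waiting forever
  curl -s 'http://<node-address>:9200/_cat/nodes?v&master_timeout=10s'
  curl -s 'http://<node-address>:9200/_nodes/stats?timeout=10s'

  # check each node's basic responsiveness directly, with a client-side timeout
  for node in es-master-0 es-master-1 es-master-2 es-data-0 es-data-1 es-data-2 es-client-0 es-client-1 es-client-2; do
    curl -s -m 10 "http://$node:9200/" > /dev/null && echo "$node: ok" || echo "$node: no response"
  done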

It looks like you are using a third-party plugin which is consuming all your management threads, preventing any responses from these APIs. I don't think there's anything we can do about this - you'll need to discuss this with the plugin supplier.
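
For example, you can pick the busy management threads out of the hot_threads dump you already collected; the pattern below just relies on the usual thread-naming convention, elasticsearch[<node>][management][T#n], and the address is a placeholder:

  curl -s 'http://<node-address>:9200/_nodes/hot_threads?threads=9999' \
    | grep -A 10 '\[management\]'

The stack frames under those threads should show which code is holding them.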

Hmm, maybe it's not the plugin. I just saw that you're using an old version, 7.0.1, which means you're running without https://github.com/elastic/elasticsearch/pull/42299. I suggest upgrading, too.

Thanks for the response. @DavidTurner, I cannot upgrade to ELK 7.1.1 immediately, as this cluster is not my dev lab. Are there any workaround steps (restart of some nodes, etc.) with which I can repair the existing cluster?
Can you also please help me understand the root cause of this issue? I am not sure what retention leases are (mentioned in the GitHub PR) or how they are affecting things in this case.

The fundamental issue is that Elasticsearch is writing a (tiny) file for each shard every 30 seconds, and your system seems to be unable to cope with this workload. This is particularly problematic on nodes holding too many shards; I've never seen this on a cluster with under 300 shards like yours, and I suspect there's something wrong with your disks if they are struggling under this kind of load.
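
If you want to see that write load for yourself on a data pod, something along these lines should do; the data path and the retention-lease file-name pattern here are from memory, so adjust them to your layout:

  # count the per-shard retention-lease state files rewritten in the last minute
  find /usr/share/elasticsearch/data -name 'retention-leases-*.st' -mmin -1 | wc -l

  # and watch whether the disks on that node are actually keeping up
  iostat -x 5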

The fix I linked was to avoid doing these updates if nothing changed. I can't think of a workaround apart from upgrading, or using faster disks.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.