Elasticsearch not responding for _cat and other APIs

Hi,

Env details:
I am using ELK 7.0.1 in a Kubernetes environment. I have a 9-node cluster with 3 master pods, 3 data pods and 3 client pods.
Memory and CPU limits are configured as:

 Master pods: RAM: no limit, CPU: 1000m, JVM: -Xms1g -Xmx1g
 Data pods:   RAM: no limit, CPU: 2000m, JVM: -Xms4g -Xmx4g
 Client pods: RAM: no limit, CPU: 2000m, JVM: -Xms4g -Xmx4g

Problem -
I am not getting a response for a few REST APIs on Elasticsearch.
GET _cluster/health - works.
GET _cat/health - works

For ex. GET /_cluster/health
{"cluster_name":"elk-efkc","status":"green","timed_out":false,"number_of_nodes":9,"number_of_data_nodes":3,"active_primary_shards":136,"active_shards":273,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":4,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":4964575,"active_shards_percent_as_number":100.0}

But there is NO response for _cat/nodes, _cat/shards, _cat/indices, _nodes/stats and many other such APIs. The curl request remains stuck for hours.
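
For reference, this is roughly how I am calling the APIs (the hostname below is a placeholder for our client service; the -m flag is just curl's own client-side timeout, so the call fails fast instead of hanging for hours):

  # works - returns immediately
  curl -s -m 30 'http://elasticsearch-client:9200/_cluster/health?pretty'

  # hang - no output until curl's own timeout kicks in
  curl -s -m 30 'http://elasticsearch-client:9200/_cat/nodes?v'
  curl -s -m 30 'http://elasticsearch-client:9200/_cat/indices?v'
  curl -s -m 30 'http://elasticsearch-client:9200/_nodes/stats'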

I saw an older post with a similar problem - CAT api doesn't respond

I have shared the response of GET /_nodes/hot_threads?threads=9999 here: https://gist.github.com/aggarwalShivani/e49e3d359f06f6ec5e9a9e10067819db

Can you please help me figure out the issue with the cluster and how we can resolve it?

Thanks,
Shivani

You can add a timeout parameter to those actions; however, that will not work if the node you are connecting to is unresponsive. The first task should be to figure out whether there is only a single unresponsive node in your cluster, and if so, can you get the logs from that one?
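
For example (a rough sketch; the node addresses below are placeholders for however you reach each pod, e.g. a pod IP or kubectl port-forward):

  # the cat APIs take a master_timeout, and the nodes APIs take a per-node timeout,
  # so they give up instead of waiting forever
  curl -s 'http://<node-address>:9200/_cat/nodes?v&master_timeout=10s'
  curl -s 'http://<node-address>:9200/_nodes/stats?timeout=10s'

  # check each node's basic responsiveness directly, with a client-side timeout
  for node in es-master-0 es-master-1 es-master-2 es-data-0 es-data-1 es-data-2 es-client-0 es-client-1 es-client-2; do
    curl -s -m 10 "http://$node:9200/" > /dev/null && echo "$node: ok" || echo "$node: no response"
  done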

It looks like you are using a third-party plugin which is consuming all your management threads, preventing any responses from these APIs. I don't think there's anything we can do about this - you'll need to discuss this with the plugin supplier.
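
For example, you can pick the busy management threads out of the hot_threads dump you already collected; the pattern below just relies on the usual thread-naming convention, elasticsearch[<node>][management][T#n], and the address is a placeholder:

  curl -s 'http://<node-address>:9200/_nodes/hot_threads?threads=9999' \
    | grep -A 10 '\[management\]'

The stack frames under those threads should show which code is holding them.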

Hmm, maybe it's not the plugin. I just saw that you're using an old version, 7.0.1, which means you're running without https://github.com/elastic/elasticsearch/pull/42299. I suggest upgrading, too.

Thanks for the response. @DavidTurner, I cannot upgrade to ELK 7.1.1 immediately, as this cluster is not my dev lab. Are there any workaround steps (restart of some nodes, etc.) with which I can repair the existing cluster?
Can you also please help me understand the root cause of this issue? I am not sure what retention leases are (mentioned in the GitHub PR) or how they are affecting things in this case.

The fundamental issue is that Elasticsearch is writing a (tiny) file for each shard every 30 seconds, and your system seems to be unable to cope with this workload. This is particularly problematic on nodes holding too many shards; I've never seen this on a cluster with under 300 shards like yours, and I suspect there's something wrong with your disks if they are struggling under this kind of load.
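
If you want to see that write load for yourself on a data pod, something along these lines should do; the data path and the retention-lease file-name pattern here are from memory, so adjust them to your layout:

  # count the per-shard retention-lease state files rewritten in the last minute
  find /usr/share/elasticsearch/data -name 'retention-leases-*.st' -mmin -1 | wc -l

  # and watch whether the disks on that node are actually keeping up
  iostat -x 5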

The fix I linked was to avoid doing these updates if nothing changed. I can't think of a workaround apart from upgrading, or using faster disks.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.