Slow cluster operations (_cat/nodes time out)

hunsw · August 28, 2019, 8:54pm

Hey all,

This is a fairly complex problem I've been facing for two days now.

SYMPTOMS

The main symptoms from the admin perspective are timeouts of the _cat APIs (e.g. _cat/nodes) and slow user logins in Kibana.

Sometimes a _cat/nodes call takes 45-90 seconds or more to complete.

Searches are fast.

SETUP

The setup is:
3 master nodes
12 data nodes (9 hot, 3 warm)
2 client nodes (Kibana)

Search Guard security

10k shards, 3k indices. Some of the indices (approx. 10%) are shrunk and frozen.

FURTHER INFO

3 data nodes are receiving events from Logstash instances. One of them is busy rejecting invalid JSON data (a JSON object cannot be put into a text field), and another has a thread pool queue full of management events.

This latter one has GC kicking in every few minutes with 300-400 ms per 1 s overhead, but Cerebro shows only approx. 40% heap usage. (It has 24 GB of heap allocated.)

A cluster restart solves the problem temporarily, then everything starts to slow down again.

The hot threads output is not very helpful to me: wherever there is high CPU usage, transport_worker threads are the culprits.

Otherwise Cerebro shows a green and working cluster.

One more thing I noticed: shard recovery after a cluster restart, especially once the cluster reached the 'yellow' state, was painfully slow.

_cat/thread_pool shows 0 everywhere except for the 2 aforementioned nodes:

datanode1.write - 1 active, 3509 rejected
datanode2.management - 1-5 active, 50-300 queued
rest of the nodes - 0 or 1 active 0 queue 0 rejected

Any ideas or advice on where to go from here? What are those queued management operations, and can I check or view them somehow?

hunsw · September 11, 2019, 10:14am

It looks like when these slowdowns occur, the management thread pool has quite a large queue on one of the data nodes:

nodename   name        active  queue  rejected
datanode2  management  5       1543   0
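
To watch for this condition over time, output in the format above can be parsed and checked against a threshold. The following is only a minimal sketch: the function names, the threshold of 100, and the second sample row (datanode3) are assumptions made for illustration, and the header uses the column names shown above, which may differ from what a given Elasticsearch version prints.

```python
from typing import Dict, List


def parse_cat_thread_pool(text: str) -> List[Dict[str, str]]:
    """Parse whitespace-separated _cat/thread_pool text that starts with a header row."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    # Pair every data row with the header names.
    return [dict(zip(header, ln.split())) for ln in lines[1:]]


def backlogged(rows: List[Dict[str, str]], max_queue: int = 100) -> List[Dict[str, str]]:
    """Return rows whose queue length exceeds max_queue (an arbitrary, assumed threshold)."""
    return [r for r in rows if int(r["queue"]) > max_queue]


# First data row is taken from the output above; the datanode3 row is illustrative only.
sample = """\
nodename name active queue rejected
datanode2 management 5 1543 0
datanode3 management 1 0 0
"""

rows = parse_cat_thread_pool(sample)
flagged = backlogged(rows)
for r in flagged:
    print(f"{r['nodename']} {r['name']}: queue={r['queue']} rejected={r['rejected']}")
```

Running this periodically against fresh _cat/thread_pool output would at least show when the management queue starts to grow, and on which node.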

Can I somehow view what tasks are waiting in the queue?
Or is there at least a doc/page that lists the tasks management threads are responsible for?
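
One possible starting point is the task management API (GET _tasks), which returns a JSON document of tasks grouped by node. It reports tasks that are currently registered, so it may not include work still waiting in a thread pool queue; treat it as a hint, not a view of the queue itself. The sketch below counts tasks per action on each node. The function name and the sample response are made up for illustration, and only the "nodes", "name", "tasks" and "action" fields of the response are assumed.

```python
from collections import Counter
from typing import Any, Dict


def count_actions_by_node(tasks_response: Dict[str, Any]) -> Dict[str, Counter]:
    """Count tasks per action name for every node in a GET _tasks response."""
    counts: Dict[str, Counter] = {}
    for node in tasks_response.get("nodes", {}).values():
        name = node.get("name", "unknown")
        per_node = counts.setdefault(name, Counter())
        for task in node.get("tasks", {}).values():
            per_node[task["action"]] += 1
    return counts


# Hypothetical response, trimmed to the fields used above; not real cluster data.
sample_response = {
    "nodes": {
        "node-id-a": {
            "name": "datanode2",
            "tasks": {
                "node-id-a:101": {"action": "cluster:monitor/nodes/stats[n]"},
                "node-id-a:102": {"action": "cluster:monitor/nodes/stats[n]"},
                "node-id-a:103": {"action": "indices:monitor/stats[n]"},
            },
        },
        "node-id-b": {"name": "datanode3", "tasks": {}},
    }
}

counts = count_actions_by_node(sample_response)
for node_name, actions in counts.items():
    for action, n in actions.most_common():
        print(f"{node_name}: {n} x {action}")
```

If one action dominates on the affected node, that narrows down which kind of request is piling up there.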

system · October 9, 2019, 10:14am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
