Slow cluster operations (_cat/nodes time out)

Hey all,

This is a fairly complex problem that I have been facing for two days now.

SYMPTOMS

From the admin perspective, the main symptoms are timeouts of the _cat APIs (e.g. _cat/nodes) and slow user logins in Kibana.

Sometimes a _cat/nodes call takes 45-90 seconds or more to complete.
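
To put numbers on the slowdown, something like the following minimal sketch could sample the latency of _cat/nodes a few times in a row. The base URL, the timeout, and the helper names (`time_call`, `fetch_cat_nodes`, `sample_latencies`) are assumptions for illustration; authentication and TLS, which a Search Guard cluster would need, are left out.

```python
import time
import urllib.request


def time_call(fetch):
    """Run fetch() once and return (elapsed_seconds, result)."""
    start = time.perf_counter()
    result = fetch()
    return time.perf_counter() - start, result


def fetch_cat_nodes(base_url="http://localhost:9200", timeout=120):
    """Fetch _cat/nodes as text. base_url is an assumed address; no auth is sent."""
    with urllib.request.urlopen(f"{base_url}/_cat/nodes?v", timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def sample_latencies(fetch, samples=5, pause=0.0):
    """Call fetch() `samples` times and return the list of elapsed seconds."""
    latencies = []
    for _ in range(samples):
        elapsed, _result = time_call(fetch)
        latencies.append(elapsed)
        time.sleep(pause)
    return latencies
```

Calling `sample_latencies(fetch_cat_nodes, samples=10, pause=5)` against a client node would give a rough latency series that can be compared before and after a restart.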

Searches are fast.

SETUP

The setup is:

  • 3 master nodes
  • 12 data nodes (9 hot, 3 warm)
  • 2 client nodes (Kibana)
  • Search Guard security

10k shards, 3k indices. Some of the indices (approx. 10%) are shrunk and frozen.

FURTHER INFO

3 data nodes are receiving events from Logstash instances. One of them is busy rejecting invalid JSON data (a JSON object cannot be put into a text field), and another has a thread pool queue full of management events.

This latter node has GC kicking in every few minutes with 300-400 ms of overhead per 1 s, but Cerebro shows only approx. 40% heap usage. (It has a 24 GB heap allocated.)

A cluster restart solves the problem temporarily, then everything starts to slow down again.

The hot threads output is not very helpful to me: where there is high CPU usage, transport_worker threads are the culprits.

Otherwise Cerebro shows a green and working cluster.

One more thing I noticed: shard recovery after the cluster restart, especially after the cluster reached the 'yellow' state, was painfully slow.

_cat/thread_pool shows 0 everywhere except for the 2 aforementioned nodes:

datanode1.write - 1 active, 3509 rejected
datanode2.management - 1-5 active, 50-300 queued
rest of the nodes - 0 or 1 active 0 queue 0 rejected

Any ideas or advice on where to go from here? What are those queued management operations, and can I check or view them somehow?

It looks like when these slowdowns occur, the management thread pool has quite a large queue on one of the data nodes:

nodename name active queue rejected
datanode2 management 5 1543 0
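
To spot this automatically, a minimal sketch like the one below could parse the text output of `_cat/thread_pool?v&h=node_name,name,active,queue,rejected` and keep only the pools with a queue or rejections. The five-column layout is assumed to match the table above; the `datanode3` row is a made-up placeholder for a quiet node, and the names `PoolRow`, `parse_thread_pool` and `busy_pools` are mine, not part of any Elasticsearch client.

```python
from typing import NamedTuple


class PoolRow(NamedTuple):
    node: str
    pool: str
    active: int
    queue: int
    rejected: int


def parse_thread_pool(text):
    """Parse whitespace-separated rows: node, pool, active, queue, rejected."""
    rows = []
    for line in text.strip().splitlines():
        parts = line.split()
        # Skip the header line and anything that is not five columns of the expected shape.
        if len(parts) != 5 or not all(p.isdigit() for p in parts[2:]):
            continue
        node, pool, active, queue, rejected = parts
        rows.append(PoolRow(node, pool, int(active), int(queue), int(rejected)))
    return rows


def busy_pools(rows, min_queue=1):
    """Return rows whose queue is at least min_queue or that have rejections."""
    return [r for r in rows if r.queue >= min_queue or r.rejected > 0]


# First data row is taken from the table above; the second is a hypothetical quiet node.
SAMPLE = """\
nodename name active queue rejected
datanode2 management 5 1543 0
datanode3 management 1 0 0
"""

flagged = busy_pools(parse_thread_pool(SAMPLE))
```

With the sample above, `flagged` holds only the datanode2 management row. Running this on a schedule would at least show when the queue starts to build up.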

Can I somehow view what tasks are waiting in the queue?
Or at least check a doc/page that lists the tasks the management threads are responsible for?
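
One thing that might help here is the task management API (`GET _tasks?detailed=true&nodes=<node>`), which lists tasks per node together with their action names. I am not sure it shows exactly what sits in the thread pool queue, so the sketch below only counts tasks per node and action from such a response. The sample payload is hypothetical (node ID, task IDs and the chosen actions are made up, not observed on this cluster), and `summarize_tasks` is an illustrative helper name.

```python
from collections import Counter


def summarize_tasks(tasks_response):
    """Count tasks per (node name, action) in a parsed _tasks JSON response."""
    counts = Counter()
    for node in tasks_response.get("nodes", {}).values():
        name = node.get("name", "unknown")
        for task in node.get("tasks", {}).values():
            counts[(name, task.get("action", "unknown"))] += 1
    return counts


# Hypothetical payload shaped like a _tasks response; values are illustrative only.
SAMPLE_TASKS = {
    "nodes": {
        "node-id-1": {
            "name": "datanode2",
            "tasks": {
                "node-id-1:101": {"action": "cluster:monitor/nodes/stats[n]"},
                "node-id-1:102": {"action": "cluster:monitor/nodes/stats[n]"},
                "node-id-1:103": {"action": "indices:monitor/stats[n]"},
            },
        }
    }
}

summary = summarize_tasks(SAMPLE_TASKS)
top_key, top_count = summary.most_common(1)[0]
```

If one action dominates the summary on the affected node during a slowdown, that would at least hint at which kind of request is piling up.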