This is a fairly complex problem I've been facing for two days now.
The main symptoms from the admin perspective are timeouts of the _cat APIs (e.g. _cat/nodes) and slow user logins in Kibana.
Sometimes a _cat/nodes call takes 45-90 seconds or more to complete.
Searches are fast.
The setup is:
- 12 data nodes (9 hot, 3 warm)
- 2 client nodes (Kibana)
- Search Guard security
- 10k shards, 3k indices. About 10% of the indices are shrunk and frozen.
Three data nodes receive events from Logstash instances. One of them is busy rejecting invalid documents (a JSON object cannot be put into a text field), and another has a thread pool queue full of management operations.
The latter has GC kicking in every few minutes with 300-400 ms of overhead per second, but Cerebro shows only about 40% heap usage. (It has a 24 GB heap allocated.)
A cluster restart solves the problem temporarily, then everything starts to slow down again.
The hot threads output is not very helpful to me: wherever CPU usage is high, the transport_worker threads are the culprits.
Otherwise Cerebro shows a green and working cluster.
One more thing I noticed: shard recovery after the cluster restart, especially once the cluster reached the 'yellow' state, was painfully slow.
_cat/thread_pool shows 0 everywhere except for the two aforementioned nodes:
datanode1.write - 1 active, 3509 rejected
datanode2.management - 1-5 active, 50-300 queued
rest of the nodes - 0 or 1 active, 0 queued, 0 rejected
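
A minimal sketch of how the busy pools could be filtered out of that output. It assumes the rows come from `_cat/thread_pool?format=json` with the default columns (node_name, name, active, queue, rejected), where the counters arrive as strings. The helper name `busy_pools` is made up for illustration, the sample values are picked from the ranges listed above, and datanode3 is a hypothetical idle node.

```python
import json


def busy_pools(rows):
    """Return (node, pool, active, queue, rejected) for pools with a
    non-zero queue or rejected count, largest queue first."""
    busy = []
    for row in rows:
        # _cat JSON output carries the counters as strings, so convert them.
        active = int(row.get("active", 0))
        queue = int(row.get("queue", 0))
        rejected = int(row.get("rejected", 0))
        if queue > 0 or rejected > 0:
            busy.append((row["node_name"], row["name"], active, queue, rejected))
    # Sort by queue size, then by rejected count, both descending.
    busy.sort(key=lambda r: (r[3], r[4]), reverse=True)
    return busy


# Illustrative sample shaped like the assumed JSON output (not real cluster data).
sample = json.loads("""
[
  {"node_name": "datanode1", "name": "write", "active": "1", "queue": "0", "rejected": "3509"},
  {"node_name": "datanode2", "name": "management", "active": "5", "queue": "300", "rejected": "0"},
  {"node_name": "datanode3", "name": "search", "active": "0", "queue": "0", "rejected": "0"}
]
""")

for node, pool, active, queue, rejected in busy_pools(sample):
    print(f"{node}.{pool} - {active} active, {queue} queued, {rejected} rejected")
```

In practice the `sample` list would be replaced by the parsed response of the real endpoint, fetched with whatever authentication the cluster requires.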
Any ideas or advice on where to go from here? What are those queued management operations, and can I check or view them somehow?