Slow Navigation in Kibana

Dear Community,

We use the following ELK configuration:
Version: 8.5.3
Number of master nodes: 1
Number of master and warm nodes: 2
Hot nodes: 2
Kibana instances: 2

Our issue is that navigation in Kibana is quite slow, and the browser often reports timeouts.

Moreover, when we send the request GET kbn:api/task_manager/_health about 20-30 times, one of the requests is not answered within the timeout limit of 30s.
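To put a number on how often the health request fails, the repeated call can be scripted. The following is only a minimal sketch using the Python standard library: the path /api/task_manager/_health is taken from the request above, while the base URL, the Authorization header value, and the helper names time_requests and make_health_fetch are hypothetical and must be adapted to the actual setup.

```python
import time
import urllib.request
from typing import Callable, List, Optional


def time_requests(fetch: Callable[[], object], attempts: int) -> List[Optional[float]]:
    """Call fetch() `attempts` times.

    Returns one entry per call: the elapsed seconds on success,
    or None if the call failed or timed out.
    """
    results: List[Optional[float]] = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            fetch()
            results.append(time.monotonic() - start)
        except OSError:
            # urllib timeouts, connection errors and HTTP errors are all OSError subclasses.
            results.append(None)
    return results


def make_health_fetch(base_url: str, auth_header: str, timeout_s: float = 30.0) -> Callable[[], bytes]:
    """Build a fetch() that requests the task manager health endpoint directly from Kibana.

    base_url and auth_header are placeholders (assumptions), not values from the post.
    """
    url = base_url.rstrip("/") + "/api/task_manager/_health"

    def fetch() -> bytes:
        request = urllib.request.Request(url, headers={"Authorization": auth_header})
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            return response.read()

    return fetch


if __name__ == "__main__":
    # Hypothetical address and credentials; replace before running.
    fetch = make_health_fetch("http://localhost:5601", "ApiKey <your-api-key>")
    timings = time_requests(fetch, attempts=30)
    failed = sum(1 for t in timings if t is None)
    slowest = max((t for t in timings if t is not None), default=None)
    print(f"failed or timed out: {failed}/{len(timings)}, slowest success: {slowest}")
```

Running this against each of the two Kibana instances separately may show whether only one of them is affected.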

The Kibana logs show the following error every 30s:
{"service":{"node":{"roles":["background_tasks","ui"]}},"ecs":{"version":"8.4.0"},"@timestamp":"2023-12-12T14:53:37.005+01:00","message":"Failed to poll for work: Error: work has timed out","log":{"level":"ERROR","logger":"plugins.taskManager"},"process":{"pid":1013143},"trace":{"id":"15b465f635d80a6082f6c3ba991f0510"},"transaction":{"id":"c85ee3a8920becae"}}

The Elasticsearch logs are clean. The cluster health is green:

ip   heap.percent  ram.percent  cpu  load_1m  load_5m  load_15m  node.role  master  name
...  18            99           1    0.17     0.15     0.10      msw        -       warm-node-02
...  20            87           0    0.03     0.03     0.00      mv         -       master-node-01
...  26            99           0    0.05     0.07     0.08      msw        *       warm-node-01
...  40            93           3    0.14     0.21     0.27      his        -       hot-node-02
...  55            99           3    0.46     0.64     0.60      his        -       hot-node-01
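For anyone who wants to compare these figures over time instead of reading them by eye, here is a minimal sketch that parses the whitespace-separated node table above into dictionaries. The column list is copied from the header row of the table; the function names parse_cat_nodes and busiest are hypothetical helpers, and the sample rows are the ones posted here (with the IP column elided as in the original).

```python
from typing import Dict, List

# Column names as they appear in the header row of the table above.
COLUMNS = [
    "ip", "heap.percent", "ram.percent", "cpu",
    "load_1m", "load_5m", "load_15m", "node.role", "master", "name",
]

SAMPLE = """
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
... 18 99 1 0.17 0.15 0.10 msw - warm-node-02
... 20 87 0 0.03 0.03 0.00 mv - master-node-01
... 26 99 0 0.05 0.07 0.08 msw * warm-node-01
... 40 93 3 0.14 0.21 0.27 his - hot-node-02
... 55 99 3 0.46 0.64 0.60 his - hot-node-01
"""


def parse_cat_nodes(text: str) -> List[Dict[str, str]]:
    """Turn the whitespace-separated table into one dict per node, skipping the header."""
    rows: List[Dict[str, str]] = []
    for line in text.strip().splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS) or fields[0] == "ip":
            continue  # header row or a line that does not match the expected layout
        rows.append(dict(zip(COLUMNS, fields)))
    return rows


def busiest(rows: List[Dict[str, str]], column: str) -> Dict[str, str]:
    """Return the row with the highest numeric value in the given column."""
    return max(rows, key=lambda row: float(row[column]))


if __name__ == "__main__":
    nodes = parse_cat_nodes(SAMPLE)
    for column in ("heap.percent", "load_1m"):
        top = busiest(nodes, column)
        print(f"highest {column}: {top['name']} ({top[column]})")
```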

The /status page of Kibana is yellow and shows 99 services as degraded.

The task manager health reports an error status and quite a high drift.
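The status and drift can also be read from the JSON body that the task manager health endpoint returns. The sketch below assumes the layout I know from the Kibana documentation, where the overall state is under "status" and the drift percentiles are under stats.runtime.value.drift; please verify this against your own 8.5.3 response. The function name summarize_health is hypothetical, and the sample values are purely illustrative, not taken from our cluster.

```python
import json
from typing import Any, Dict


def summarize_health(health: Dict[str, Any]) -> Dict[str, Any]:
    """Pull the overall status and the drift percentiles out of a health response.

    Assumes drift lives under stats.runtime.value.drift; missing keys yield None.
    """
    runtime_value = health.get("stats", {}).get("runtime", {}).get("value", {})
    drift = runtime_value.get("drift", {})
    return {
        "status": health.get("status"),
        "drift_p50": drift.get("p50"),
        "drift_p99": drift.get("p99"),
    }


if __name__ == "__main__":
    # Illustrative payload only; the numbers are made up for the example.
    sample = json.loads("""
    {
      "status": "error",
      "stats": {"runtime": {"value": {"drift": {"p50": 1200, "p90": 5000, "p95": 8000, "p99": 15000}}}}
    }
    """)
    print(summarize_health(sample))
```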

We first noticed this issue after the number of shards on the data nodes reached the limit of 1000 shards per node. At first, we worked around it by increasing the limit to 2000. In the meantime, we have reduced the number of shards to fewer than 700 per data node. However, the issue with Kibana persists.

Could anyone please help us?

What does the resource usage of the Elasticsearch and Kibana nodes look like in Monitoring? That is the best place to start looking for hints about the performance degradation.

Hi Marius,

Thank you for replying.

The monitoring page in Kibana cannot be viewed. It keeps reloading and eventually times out with the error message 'Request timeout: Check the Elasticsearch Monitoring cluster network connection or the load level of the nodes.'

Just a note, the status of Elasticsearch being "green" is a measurement of the allocation of shards, and it means all primary and replica shards are allocated. It is not a measure of querying performance. It sounds like your cluster is still overloaded.

Hi Tim,

The green state of the cluster has indeed nothing to do with the querying performance - sorry for the slip of the tongue.

However, please note the table of performance measures below the sentence 'The cluster health is green'. The heap and CPU consumption do not seem high, do they?

Could you please suggest steps for digging deeper into the diagnostics and, ultimately, confirming the overload?

Hi, it looks like there is perhaps a high number of background tasks in Kibana to manage some internal state. From what I can tell, a large amount of work is going into managing "search sessions."

I see you are on version 8.5. If you are able to upgrade to at least 8.6, you could benefit from work done to ease this internal management work.

Another option to reduce the number of these background tasks could be to disable search sessions in your cluster. See "Search sessions settings in Kibana" in the Kibana Guide [8.5].