Elasticsearch: 7.5.1
Infrastructure: Azure AKS
Storage: Standard SSD
_cat/nodes
name m role ip ramMax ramPercent ramCurrent heapMax heapPercent heapCurrent diskTotal diskUsed cpu uptime iic
elasticsearch-master-1 * dilm 10.240.9.33 62.8gb 66 41.2gb 14.9gb 50 7.5gb 2tb 1.5tb 5 1.2d 0
elasticsearch-master-0 - dilm 10.240.1.28 62.8gb 60 37.7gb 14.9gb 68 10.2gb 2tb 1.3tb 5 1.2d 0
elasticsearch-master-2 - dilm 10.240.9.12 62.8gb 67 42gb 14.9gb 46 6.9gb 2tb 1.5tb 4 1.1d 0
_cat/allocation
shards disk.indices disk.used disk.avail disk.total disk.percent host ip node
999 1.5tb 1.5tb 449.1gb 2tb 78 10.240.9.33 10.240.9.33 elasticsearch-master-1
999 1.5tb 1.5tb 499.9gb 2tb 75 10.240.9.12 10.240.9.12 elasticsearch-master-2
999 1.3tb 1.3tb 717.3gb 2tb 65 10.240.1.28 10.240.1.28 elasticsearch-master-0
Hello Team,
After a recent reboot of our ELK cluster, the node and index stats endpoints eventually become very slow on one of the three nodes (elasticsearch-master-2).
I narrowed the slowness down to the translog stats specifically; examples below:
GET /_nodes/elasticsearch-master-2/stats/indices/translog
GET /_all/_stats/translog
According to _cat/tasks, a single such stats task takes about 10-15 s to complete.
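For completeness, this is how I watched those tasks (`detailed` adds the action name and running time per task; the `actions=*stats*` wildcard is just a convenient filter and may need adjusting):
GET /_cat/tasks?v&detailed
GET /_tasks?detailed=true&actions=*stats*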
All other metrics return results almost immediately on the affected node.
We use elasticsearch_exporter to ship metrics to Prometheus; with the exporter enabled, the cluster becomes practically unusable and _cat/tasks fills up with these slow stats tasks.
As a workaround we shut the exporter down entirely; a softer option we are considering is sketched below. Is there a known bug for this behavior, or any way to find the root cause of the unexpected slowness?
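The sketch keeps the exporter running but skips the per-index stats collection. It is untested, and it assumes we are on prometheus-community/elasticsearch_exporter, where per-index stats (_all/_stats) are only collected when --es.indices is passed:
# Untested sketch: run the exporter without --es.indices so it does
# not poll _all/_stats. Note the per-node stats call is collected by
# default regardless, so we also raise --es.timeout to keep scrapes
# from failing while that call is slow. Adjust the URI for your setup.
elasticsearch_exporter \
  --es.uri=http://localhost:9200 \
  --es.timeout=30s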
Thank you.
BR
Aleksandr