Elasticsearch: 7.5.1
Infrastructure: Azure AKS
Storage: Standard SSD
_cat/nodes
name m role ip ramMax ramPercent ramCurrent heapMax heapPercent heapCurrent diskTotal diskUsed cpu uptime iic
elasticsearch-master-1 * dilm 10.240.9.33 62.8gb 66 41.2gb 14.9gb 50 7.5gb 2tb 1.5tb 5 1.2d 0
elasticsearch-master-0 - dilm 10.240.1.28 62.8gb 60 37.7gb 14.9gb 68 10.2gb 2tb 1.3tb 5 1.2d 0
elasticsearch-master-2 - dilm 10.240.9.12 62.8gb 67 42gb 14.9gb 46 6.9gb 2tb 1.5tb 4 1.1d 0
_cat/allocation
shards disk.indices disk.used disk.avail disk.total disk.percent host ip node
999 1.5tb 1.5tb 449.1gb 2tb 78 10.240.9.33 10.240.9.33 elasticsearch-master-1
999 1.5tb 1.5tb 499.9gb 2tb 75 10.240.9.12 10.240.9.12 elasticsearch-master-2
999 1.3tb 1.3tb 717.3gb 2tb 65 10.240.1.28 10.240.1.28 elasticsearch-master-0
Hello Team,
After a recent reboot of our ELK cluster, the node and index stats endpoints eventually become very slow on one of the three nodes (elasticsearch-master-2).
I narrowed the slowness down to the translog stats specifically; examples below:
GET /_nodes/elasticsearch-master-2/stats/indices/translog
GET /_all/_stats/translog
According to _cat/tasks, a single such stats task takes about 10-15 s to complete.
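For completeness, this is how I watched those tasks (`detailed` adds the action name and running time per task; the `actions=*stats*` wildcard is just a convenient filter and may need adjusting):
GET /_cat/tasks?v&detailed
GET /_tasks?detailed=true&actions=*stats*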
All other metrics return results almost immediately on the affected node.
We use elasticsearch_exporter to ship metrics to Prometheus; with the exporter enabled, the cluster becomes practically unusable and _cat/tasks fills up with these slow stats tasks.
As a workaround we shut the exporter down entirely; a softer option we are considering is sketched below. Is there a known bug for this behavior, or any way to find the root cause of the unexpected slowness?
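The sketch keeps the exporter running but skips the per-index stats collection. It is untested, and it assumes we are on prometheus-community/elasticsearch_exporter, where per-index stats (_all/_stats) are only collected when --es.indices is passed:
# Untested sketch: run the exporter without --es.indices so it does
# not poll _all/_stats. Note the per-node stats call is collected by
# default regardless, so we also raise --es.timeout to keep scrapes
# from failing while that call is slow. Adjust the URI for your setup.
elasticsearch_exporter \
  --es.uri=http://localhost:9200 \
  --es.timeout=30s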
Thank you.
BR
Aleksandr