How to monitor in detail why an Elastic data node is overloaded

Petr.Simik · April 17, 2023, 8:55am

Our Elastic cluster runs v8.6:
3 master nodes, 40 data nodes (8 cores / 32 GB / 2 TB each), 2 load balancer nodes.
We ingest 100k-900k events/sec through a few hundred different ingest pipelines. Some of them create small indices, but there are about 5-8 high-performance pumps.

We repeatedly face a performance problem where a single node becomes overloaded.
Since v8 there has been a huge improvement; however, this problem still happens on a weekly basis.

I suspect the problem is caused by the high-performance pumps combined with ILM: when the cluster allocates several hi-perf shards to the same host, that host becomes locally overloaded.

This is why I am looking for more detailed monitoring to troubleshoot this problem.

The idea is to monitor the load of each node and break that load down by shard.
As a result, I want to see which type of ingest is causing the node overload.

Is there any best practice?
I plan to use the _cat/shards API to periodically fetch the details and store them in a time series.

This method will provide the shard name, some dynamic load parameters, and the node,
so I will be able to link the load of a shard to its node.

But I do not know which parameter provides the current ingest rate over time.
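One possible approach, offered as a sketch rather than an established best practice: `_cat/shards` can return the cumulative column `indexing.index_total` per shard copy, so an ingest rate can be derived from the difference between two periodic samples. The code below assumes each sample is the parsed JSON of `GET _cat/shards?format=json&h=index,shard,prirep,node,indexing.index_total`, where the values arrive as strings. The function names and the sample rows (index names, counts) are made up for illustration; the counters restart when a shard copy relocates or a node restarts, so such intervals are skipped.

```python
from collections import defaultdict


def shard_key(row):
    """Identify one shard copy on one node."""
    return (row["index"], row["shard"], row["prirep"], row["node"])


def indexing_rates(prev_rows, curr_rows, interval_s):
    """Return {(index, shard, prirep, node): indexing ops per second}."""
    prev = {
        shard_key(r): int(r["indexing.index_total"])
        for r in prev_rows
        if r.get("indexing.index_total") is not None
    }
    rates = {}
    for r in curr_rows:
        total = r.get("indexing.index_total")
        key = shard_key(r)
        if total is None or key not in prev:
            continue  # unassigned copy, or a copy that appeared/moved between samples
        delta = int(total) - prev[key]
        if delta < 0:
            continue  # counter went backwards, e.g. after a node restart
        rates[key] = delta / interval_s
    return rates


def rates_by_node(rates):
    """Sum the per-shard rates for each node."""
    per_node = defaultdict(float)
    for (_index, _shard, _prirep, node), rate in rates.items():
        per_node[node] += rate
    return dict(per_node)


# Made-up samples taken 60 seconds apart (assumption: not real cluster data).
sample_t0 = [
    {"index": "pump-a", "shard": "0", "prirep": "p", "node": "tela11", "indexing.index_total": "1000"},
    {"index": "pump-b", "shard": "0", "prirep": "p", "node": "tela11", "indexing.index_total": "500"},
    {"index": "small-x", "shard": "0", "prirep": "p", "node": "tela12", "indexing.index_total": "100"},
]
sample_t1 = [
    {"index": "pump-a", "shard": "0", "prirep": "p", "node": "tela11", "indexing.index_total": "1600"},
    {"index": "pump-b", "shard": "0", "prirep": "p", "node": "tela11", "indexing.index_total": "1100"},
    {"index": "small-x", "shard": "0", "prirep": "p", "node": "tela12", "indexing.index_total": "130"},
]

per_shard = indexing_rates(sample_t0, sample_t1, interval_s=60)
print(rates_by_node(per_shard))
```

Storing the per-shard rates together with the node name in a time series would make it possible to see, for an overloaded node, which shards were receiving the most indexing operations at that moment.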

The picture above shows the current problem: 100% CPU load on one node.
So I want to see a breakdown (stack) of what causes this load.

Any idea how to achieve this?

Thank you

Christian_Dahlqvist · April 17, 2023, 9:13am

How many indices are you actively indexing into? How many of these are getting high volumes? How many primary and replica shards do these indices have?

What type of storage do your nodes have? Local SSDs?

What does disk I/O and iowait look like on the data nodes?

Petr.Simik · April 17, 2023, 3:06pm

@Christian_Dahlqvist thank you for asking

How many indices are you actively indexing into?
95 indices, plus about 50 that are not active

How many of these are getting high volumes?
10 indices

How many primary and replica shards do these indices have?
The primary/replica counts vary depending on performance needs:
for small indices 1/1; for larger ones we have up to 10/0.

index count	primary shards
4	8
3	10
93	1
1	5
3	2
6	6
16	3

What type of storage do your nodes have? Local SSDs?
The nodes run on VMware VMs, and the storage is connected via iSCSI (all SSD).

What does disk I/O and iowait look like on the data nodes?

Write latency on one particular node is about 3 ms on average,
but some nodes have higher latencies.

For instance, tela11, which is currently in the top row, looks like this from the Elastic perspective:

Petr.Simik · April 19, 2023, 4:25am

@Christian_Dahlqvist do you know whether I can draw a graph from some internal index that shows the node load?
I would like information detailing how the CPU load on one node breaks down into the specific tasks that node performs: ingestion into index A, B, C, or something else.
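As a rough illustration of such a breakdown (a sketch under assumptions, not a built-in Elasticsearch feature): the index stats API with `GET /_stats/indexing?level=shards` reports `indexing.index_time_in_millis` for each shard copy along with `routing.node`, which is a node ID rather than a node name. Taking two responses some interval apart and comparing the time spent indexing per index on one node gives each index's share of that node's indexing work. Indexing time is only a proxy for CPU load, since merges, searches, and ingest pipelines are not included. The function names and the sample payloads are hypothetical.

```python
from collections import defaultdict


def indexing_time_by_node(stats):
    """Sum index_time_in_millis per (node_id, index) from a level=shards response."""
    totals = defaultdict(int)
    for index_name, index_stats in stats.get("indices", {}).items():
        for copies in index_stats.get("shards", {}).values():
            for copy in copies:
                node_id = copy["routing"]["node"]
                totals[(node_id, index_name)] += copy["indexing"]["index_time_in_millis"]
    return totals


def node_breakdown(prev_stats, curr_stats, node_id):
    """Return {index: share of indexing time} on one node between two samples."""
    prev = indexing_time_by_node(prev_stats)
    curr = indexing_time_by_node(curr_stats)
    deltas = {}
    for (nid, index_name), millis in curr.items():
        if nid != node_id:
            continue
        # A missing previous value is treated as 0; negative deltas (a shard
        # moved away or a counter restarted) are ignored in this sketch.
        delta = millis - prev.get((nid, index_name), 0)
        if delta > 0:
            deltas[index_name] = delta
    total = sum(deltas.values())
    if total == 0:
        return {}
    ranked = sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)
    return {name: delta / total for name, delta in ranked}


def make_copy(node_id, millis):
    """Build one shard-copy entry with only the fields this sketch reads."""
    return {"routing": {"node": node_id}, "indexing": {"index_time_in_millis": millis}}


# Made-up payloads (assumption: trimmed to the fields used above).
stats_t0 = {"indices": {
    "pump-a": {"shards": {"0": [make_copy("node-id-1", 1000)]}},
    "pump-b": {"shards": {"0": [make_copy("node-id-1", 2000)]}},
}}
stats_t1 = {"indices": {
    "pump-a": {"shards": {"0": [make_copy("node-id-1", 4000)]}},
    "pump-b": {"shards": {"0": [make_copy("node-id-1", 3000)]}},
}}

print(node_breakdown(stats_t0, stats_t1, "node-id-1"))
```

Plotting these shares over time for the overloaded node would show whether one of the high-volume indices dominates its indexing work during the incident.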

system · May 17, 2023, 4:25am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
