I am having trouble scaling my Elasticsearch setup. The setup consists of over 2,500 Elastic Agents forwarding logs to an Elasticsearch cluster. All agents are managed by a single Fleet Server and are assigned to a single policy, which consists of the Elastic Defend integration only. I am facing rejection exceptions in Elasticsearch. I suspect this is because all agents send data to the same set of nodes and try to index into the same data stream, and the volume of logs produced by Elastic Defend is huge. Any tips on overcoming these rejection exceptions and scaling this setup are appreciated. Thanks in advance.
What is the specification of your Elasticsearch cluster, both in terms of node count and the type and size of hardware used? Which version of Elasticsearch are you using?
Elastic Stack version - 8.15.1
No. of master nodes (dedicated) - 3
No. of data nodes (dedicated) - 15
CPU - AMD EPYC 64-core processor
Data node config - 128GB RAM and 2TB NVMe SSD
Master node config - 16GB RAM and 512GB NVMe SSD
Do you have monitoring installed?
Is the load spread evenly across the data nodes?
Do you see any resource constraints, e.g. around network or disk I/O?
What indexing rate are you seeing?
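If you do not have monitoring available, something like the two requests below should at least show where the write rejections are happening and give a rough idea of the indexing rate (the logs-* pattern is just an assumption that the interesting traffic goes into logs-* data streams):
GET _cat/thread_pool/write?v&h=node_name,active,queue,rejected,completed&s=rejected:desc
# pattern assumed; narrow it to the relevant data stream if needed
GET logs-*/_stats/indexing?filter_path=_all.primaries.indexing.index_total,_all.primaries.indexing.index_time_in_millis
Sampling the second request a few minutes apart and dividing the change in index_total by the elapsed seconds gives an approximate indexing rate.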
The load is not spread equally. Most of the data produced is process events, which go into the logs-endpoint.process data stream. The indices created by this stream have 5 primary shards, so most of the data forwarded by the agents has to be indexed into those 5 shards, and most of the load falls on the nodes that hold them.
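To illustrate, the shard placement can be checked with something like this (the wildcard pattern is assumed to match the data stream):
GET _cat/shards/logs-endpoint.process-*?v&h=index,shard,prirep,node,docs,store&s=node
which shows those shards sitting on only a handful of the data nodes.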
There are no issues related to hardware.
Sadly I don't have any other monitoring in place.
Do you have any replicas configured?
What is the load like on the nodes doing most of the indexing work?
Yes. One replica is configured.
Attaching the load averages of the top 5 nodes:
17:27:31 up 215 days, 17:53, 2 users, load average: 200.15, 190.53, 192.19
17:27:31 up 215 days, 17:53, 2 users, load average: 200.15, 190.53, 192.19
17:27:32 up 215 days, 17:41, 0 user, load average: 106.91, 109.79, 110.53
17:27:33 up 215 days, 17:36, 0 user, load average: 49.30, 49.86, 51.54
17:27:35 up 215 days, 17:53, 2 users, load average: 210.23, 192.78, 192.90
Can you maybe share the data from all the nodes please, via:
/_cat/nodes?v&h=name,version,node.role,cpu,ram.percent,heap.percent,disk.used_percent,load_1m,load_5m,load_15m
name version node.role cpu ram.percent heap.percent disk.used_percent load_1m load_5m load_15m
es-05 8.15.1 dir 41 98 60 30.70 92.29 94.40 96.00
es-02 8.15.1 dir 52 96 27 5.85 141.43 158.95 166.06
es-01 8.15.1 dilrt 52 96 61 22.65 141.43 158.95 166.06
es-03 8.15.1 dir 52 96 60 19.01 141.43 158.95 166.06
es-09 8.15.1 dir 52 96 31 18.92 141.43 158.95 166.06
es-04 8.15.1 dir 23 97 60 5.93 48.42 49.19 52.31
es-07 8.15.1 dir 25 99 73 1.80 43.29 68.69 71.40
es-08 8.15.1 dir 56 98 52 5.02 92.29 94.40 96.00
es-06 8.15.1 dir 16 99 34 4.33 43.29 68.69 71.40
es-m3 8.15.1 imr 1 99 43 4.11 1.00 0.68 0.60
es-11 8.15.1 dilrt 41 98 32 10.19 92.29 94.40 96.00
es-15 8.15.1 dir 41 98 63 5.97 92.29 94.40 96.00
es-10 8.15.1 dir 23 97 35 11.95 48.42 49.19 52.31
es-13 8.15.1 dir 60 96 20 7.81 141.43 158.95 166.06
es-m1 8.15.1 imr 1 98 60 3.44 0.19 0.34 0.32
es-12 8.15.1 dir 25 99 65 6.33 43.29 68.69 71.40
es-14 8.15.1 dir 52 96 66 3.88 141.43 158.95 166.06
es-m2 8.15.1 imr 6 98 88 9.28 1.59 1.48 1.51
Besides the number of shards, did you change the index.refresh_interval for this index as well, or are you using the default value of 1s?
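You can check what the backing indices are currently using with something like this; include_defaults makes the value show up even if it was never set explicitly:
GET logs-endpoint.process-*/_settings/index.refresh_interval?include_defaults=true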
I do not use Elastic Defend, but I'm assuming that the refresh interval is the same as in other integrations, which have 1s as the default.
Thanks. I re-ordered your output and shared it back at the bottom of this post.
A couple of observations, not sure they relate to your core issue, but ...
node roles:
l 2x
t 2x
m 3x
d 15x
i 18x
r 18x
And as "groups of roles"
dir.lt 2x
dir... 13x
.irm.. 3x
Your disk.used_percent, even across just the 15 data nodes, varies from 1.8% to 30%.
There is suspicious duplication of values in the load_1m/load_5m/load_15m columns, so much so that I am not sure I trust them.
That might just be because they are so high: 6 nodes have all of load_1m/5m/15m > 100, and 10 nodes (the 5 primaries and 5 replicas?) have all of them > 90.
name version node.role cpu ram.percent heap.percent disk.used_percent load_1m load_5m load_15m
es-04 8.15.1 dir 23 97 60 5.93 48.42 49.19 52.31
es-10 8.15.1 dir 23 97 35 11.95 48.42 49.19 52.31
es-07 8.15.1 dir 25 99 73 1.80 43.29 68.69 71.40
es-06 8.15.1 dir 16 99 34 4.33 43.29 68.69 71.40
es-12 8.15.1 dir 25 99 65 6.33 43.29 68.69 71.40
es-08 8.15.1 dir 56 98 52 5.02 92.29 94.40 96.00
es-15 8.15.1 dir 41 98 63 5.97 92.29 94.40 96.00
es-05 8.15.1 dir 41 98 60 30.70 92.29 94.40 96.00
es-11 8.15.1 dilrt 41 98 32 10.19 92.29 94.40 96.00
es-01 8.15.1 dilrt 52 96 61 22.65 141.43 158.95 166.06
es-02 8.15.1 dir 52 96 27 5.85 141.43 158.95 166.06
es-03 8.15.1 dir 52 96 60 19.01 141.43 158.95 166.06
es-09 8.15.1 dir 52 96 31 18.92 141.43 158.95 166.06
es-14 8.15.1 dir 52 96 66 3.88 141.43 158.95 166.06
es-13 8.15.1 dir 60 96 20 7.81 141.43 158.95 166.06
es-m3 8.15.1 imr 1 99 43 4.11 1.00 0.68 0.60
es-m1 8.15.1 imr 1 98 60 3.44 0.19 0.34 0.32
es-m2 8.15.1 imr 6 98 88 9.28 1.59 1.48 1.51
I haven't changed the index refresh interval.
I would suggest that you increase the value of the index.refresh_interval setting for the indices with a high event rate and see if things improve.
This setting is dynamic, so you just need to go to the current backing index and edit it; try increasing it to 10s or 15s.
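For example, something along these lines; the backing index name below is only a placeholder, GET _cat/indices/logs-endpoint.process-* will show the real current write index:
# placeholder name - replace with your actual current write index
PUT .ds-logs-endpoint.process-default-2025.01.01-000001/_settings
{
  "index": {
    "refresh_interval": "15s"
  }
}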
If things improve, then you can change it in the custom component template.
Personally I've set it to at least 30s on my logs@custom component template; the default of 1s can have a huge impact on performance.
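In case it helps, this is roughly what that looks like; note that a PUT replaces the whole component template, so merge it with anything you already keep in logs@custom:
PUT _component_template/logs@custom
{
  "template": {
    "settings": {
      "index": {
        "refresh_interval": "30s"
      }
    }
  }
}
New backing indices only pick the setting up at the next rollover, so the current write index still needs the dynamic update above.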
Sure, will try this!