Scaling tips for Elastic Agent - Elasticsearch - Kibana setup

I am having trouble scaling my Elasticsearch setup. The setup consists of over 2,500 Elastic Agents forwarding logs to an Elasticsearch cluster. All agents are managed by a single Fleet Server and are assigned to a single policy, which consists of the Elastic Defend integration only. I am facing rejection exceptions in Elasticsearch, and I suspect this is because all agents send data to the same set of nodes and try to index into the same data stream. The volume of logs produced by Elastic Defend is huge. Any tips to overcome these rejection exceptions and scale this setup are appreciated. Thanks in advance.
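
For context, indexing rejections like this usually come from the write thread pool queue filling up, surfacing as es_rejected_execution_exception (HTTP 429). A quick, rough way to see which nodes are rejecting:

# Per-node write thread pool activity; "rejected" is cumulative since node start
GET _cat/thread_pool/write?v&h=node_name,name,active,queue,rejected&s=rejected:desc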

What is the specification of your Elasticsearch cluster, both in terms of node counts and type and size of hardware used? Which version of Elasticsearch are you using?


Elastic Stack version - 8.15.1
No. of master nodes (dedicated) - 3
No. of data nodes (dedicated) - 15

CPU details - AMD EPYC 64-core processor
Data node conf - 128GB RAM and 2TB NVMe SSD
Master node conf - 16GB RAM and 512GB NVMe SSD

Do you have monitoring installed?

Is the load spread evenly across the data nodes?

Do you see any resource constraints, e.g. around network or disk I/O?

What indexing rate are you seeing?
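
If stack monitoring isn't available, one rough way to gauge the indexing rate is to sample the index stats twice and divide the change in index_total by the sampling interval; a sketch against the logs-* data streams:

# Cumulative indexing counters; run twice N seconds apart and diff index_total to get docs/sec
GET logs-*/_stats/indexing?filter_path=_all.total.indexing.index_total,_all.total.indexing.index_time_in_millis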

The load is not spread equally. Most of the data produced is process events, which go into the logs-endpoint.process data stream. The indices created by this stream have 5 primary shards, so most of the data forwarded by the agents has to be indexed into those 5 shards, and most of the load lands on the nodes that hold them.

There are no issues related to hardware.

Sadly I don't have any other monitoring set up for it.
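
For reference: spreading a hot data stream like this over more primaries is usually done through its @custom component template followed by a rollover. A minimal sketch, where the template name just follows the <type>-<dataset>@custom convention and the shard count and namespace are assumptions for illustration:

PUT _component_template/logs-endpoint.process@custom
{
  "template": {
    "settings": {
      "index": {
        "number_of_shards": 15
      }
    }
  }
}

# The new shard count only applies to backing indices created after a rollover
POST logs-endpoint.process-default/_rollover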

Do you have any replicas configured?

What is the load like on the nodes doing most of the indexing work?

Yes. One replica is configured.

Attaching the load averages of the top 5 nodes:

17:27:31 up 215 days, 17:53,  2 users,  load average: 200.15, 190.53, 192.19
17:27:31 up 215 days, 17:53,  2 users,  load average: 200.15, 190.53, 192.19
17:27:32 up 215 days, 17:41,  0 user,  load average: 106.91, 109.79, 110.53
17:27:33 up 215 days, 17:36,  0 user,  load average: 49.30, 49.86, 51.54
17:27:35 up 215 days, 17:53,  2 users,  load average: 210.23, 192.78, 192.90

Can you maybe share the data from all the nodes, please, via:

/_cat/nodes?v&h=name,version,node.role,cpu,ram.percent,heap.percent,disk.used_percent,load_1m,load_5m,load_15m

name       version node.role cpu ram.percent heap.percent disk.used_percent load_1m load_5m load_15m
es-05 8.15.1  dir        41          98           60             30.70   92.29   94.40    96.00
es-02 8.15.1  dir        52          96           27              5.85  141.43  158.95   166.06
es-01 8.15.1  dilrt      52          96           61             22.65  141.43  158.95   166.06
es-03 8.15.1  dir        52          96           60             19.01  141.43  158.95   166.06
es-09 8.15.1  dir        52          96           31             18.92  141.43  158.95   166.06
es-04 8.15.1  dir        23          97           60              5.93   48.42   49.19    52.31
es-07 8.15.1  dir        25          99           73              1.80   43.29   68.69    71.40
es-08 8.15.1  dir        56          98           52              5.02   92.29   94.40    96.00
es-06 8.15.1  dir        16          99           34              4.33   43.29   68.69    71.40
es-m3 8.15.1  imr         1          99           43              4.11    1.00    0.68     0.60
es-11 8.15.1  dilrt      41          98           32             10.19   92.29   94.40    96.00
es-15 8.15.1  dir        41          98           63              5.97   92.29   94.40    96.00
es-10 8.15.1  dir        23          97           35             11.95   48.42   49.19    52.31
es-13 8.15.1  dir        60          96           20              7.81  141.43  158.95   166.06
es-m1 8.15.1  imr         1          98           60              3.44    0.19    0.34     0.32
es-12 8.15.1  dir        25          99           65              6.33   43.29   68.69    71.40
es-14 8.15.1  dir        52          96           66              3.88  141.43  158.95   166.06
es-m2 8.15.1  imr         6          98           88              9.28    1.59    1.48     1.51

Besides the number of shards, did you change the index.refresh_interval for this index as well, or are you using the default value of 1s?

I do not use Elastic Defend, but I'm assuming that the refresh interval is the same as other integrations, which have 1s as the default.
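
One way to confirm what the hot data stream is actually using (the index pattern here simply follows the data stream name mentioned above):

# include_defaults shows the effective value even if it was never set explicitly
GET logs-endpoint.process-*/_settings/index.refresh_interval?include_defaults=true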

Thanks. I re-ordered your output and shared it back at the bottom of this post.

A couple of observations; I'm not sure whether they relate to your core issue, but ...

node roles:

l 2x
t 2x
m 3x
d 15x
i 18x
r 18x

And as "groups of roles"

 dir.lt  2x
 dir... 13x
 .irm..  3x

Your disk.used_percent, even across just the 15 data nodes, varies from 1.8% to 30%.

There is suspicious duplication in the load_1m/load_5m/load_15m values, so much so that I am not sure I trust them.

This might just be because they are pretty high: 6 nodes (sic) have all of load_1m/5m/15m > 100, and 10 nodes (5 primaries and 5 replicas?) have all of load_1m/5m/15m > 90.

name       version node.role cpu ram.percent heap.percent disk.used_percent load_1m load_5m load_15m
es-04      8.15.1  dir        23          97           60              5.93   48.42   49.19    52.31
es-10      8.15.1  dir        23          97           35             11.95   48.42   49.19    52.31
es-07      8.15.1  dir        25          99           73              1.80   43.29   68.69    71.40
es-06      8.15.1  dir        16          99           34              4.33   43.29   68.69    71.40
es-12      8.15.1  dir        25          99           65              6.33   43.29   68.69    71.40
es-08      8.15.1  dir        56          98           52              5.02   92.29   94.40    96.00
es-15      8.15.1  dir        41          98           63              5.97   92.29   94.40    96.00
es-05      8.15.1  dir        41          98           60             30.70   92.29   94.40    96.00
es-11      8.15.1  dilrt      41          98           32             10.19   92.29   94.40    96.00
es-01      8.15.1  dilrt      52          96           61             22.65  141.43  158.95   166.06
es-02      8.15.1  dir        52          96           27              5.85  141.43  158.95   166.06
es-03      8.15.1  dir        52          96           60             19.01  141.43  158.95   166.06
es-09      8.15.1  dir        52          96           31             18.92  141.43  158.95   166.06
es-14      8.15.1  dir        52          96           66              3.88  141.43  158.95   166.06
es-13      8.15.1  dir        60          96           20              7.81  141.43  158.95   166.06
es-m3      8.15.1  imr         1          99           43              4.11    1.00    0.68     0.60
es-m1      8.15.1  imr         1          98           60              3.44    0.19    0.34     0.32
es-m2      8.15.1  imr         6          98           88              9.28    1.59    1.48     1.51
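
As an aside, the _cat API can do this kind of ordering server-side via the s parameter, e.g.:

GET _cat/nodes?v&h=name,node.role,cpu,heap.percent,disk.used_percent,load_1m,load_5m,load_15m&s=load_1m:desc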

I haven't changed the index refresh interval.

I would suggest that you increase the value of the index.refresh_interval setting for the indices with a high event rate and see if things improve.

This setting is dynamic, so you just need to go to the current backing index and edit it; try increasing it to 10s or 15s.
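
Something along these lines should work; targeting the data stream applies the dynamic setting to its existing backing indices (the name pattern is taken from earlier in the thread, and 15s is just a starting point):

PUT logs-endpoint.process-*/_settings
{
  "index": {
    "refresh_interval": "15s"
  }
}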

If things improve, then you can change it in the custom component template.

Personally I've set it to at least 30s in my logs@custom component template; the default of 1s can have a huge impact on performance.
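
A sketch of what that looks like as a component template (new backing indices pick it up on their next rollover):

PUT _component_template/logs@custom
{
  "template": {
    "settings": {
      "index": {
        "refresh_interval": "30s"
      }
    }
  }
}
# Note: PUT replaces the whole component template, so keep any existing customisations in the body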


Sure, will try this!