Optimizing Elastic-Agent Performance

wangsubo · May 12, 2024, 1:16pm

I am currently evaluating the use of ELK to collect Checkpoint data. I have a QRadar Event Collector and an Elastic-Agent, each running on a VM configured with 8 cores, 8 GB RAM, and an HDD.

Checkpoint logs are being sent at a rate of 5000 EPS, split between the QRadar Event Collector and the Elastic-Agent. I have observed that the QRadar Event Collector consistently receives about 5% more events than the Elastic-Agent during the same time frame.

Here is my current configuration for the elastic-agent.yml:

bulk_max_size: 10000
worker: 6
queue.mem.events: 12800
queue.mem.flush.min_events: 10000
queue.mem.flush.timeout: 50ms
compression_level: 1
connection_idle_timeout: 15s
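
A rough sanity check on how these values relate to each other may help here. The sketch below is not an official sizing formula: it assumes that events stay in the in-memory queue until the output acknowledges them, and the "one full batch per worker" rule of thumb is an assumption, not something stated in this thread.

```python
# Values copied from the elastic-agent.yml settings listed above.
bulk_max_size = 10000
worker = 6
queue_events = 12800

# How many full bulk requests the in-memory queue can hold at once.
full_batches = queue_events // bulk_max_size

# Assumed rule of thumb (not an official formula): size the queue so
# that every worker can be handed at least one full batch.
suggested_min_queue = worker * bulk_max_size

print(f"full batches that fit in the queue: {full_batches}")
print(f"workers that can hold a full batch: {min(worker, full_batches)}")
print(f"queue size for one full batch per worker: {suggested_min_queue}")
```

Under these assumptions, the queue fits only one full batch, so either a smaller bulk_max_size or a larger queue.mem.events might be worth testing.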

How can I optimize the settings so that the event volume discrepancy between Elastic-Agent and QRadar Event Collector is within 1%?

Thanks

leandrojmp · May 12, 2024, 1:30pm

Hello,

You also need to provide context about your Elasticsearch cluster: how many nodes do you have, and what are their specs? Also, what is the output configuration of your Elastic Agent? How many nodes do you have configured in the output?

Also, QRadar and Elasticsearch are different tools that work in different ways, so I am not sure it makes sense to compare the two of them here.

wangsubo · May 12, 2024, 1:50pm

Initially, the Elastic-Agent was configured with only one output node, but I found the EPS to be too low, which I suspected might be related to Elasticsearch performance.

Later, I increased this to three nodes, and there was an improvement in the event capture rate, but there was still about a 5% discrepancy in event volume compared to the QRadar Event Collector.

The Elasticsearch cluster consists of three nodes, each with a 6-core CPU, 32 GB RAM, and a 1 TB HDD.

The reason for comparing QRadar and Elasticsearch is to evaluate the possibility of replacing QRadar with Elasticsearch. Without comparing event volumes, it's difficult to determine whether a replacement is feasible.

Thanks

leandrojmp · May 12, 2024, 2:20pm

HDDs are pretty bad for Elasticsearch performance and can hurt the indexing rate; the recommendation is to use SSDs.

The performance of the Elastic Agent is also influenced by the performance of the destination cluster: if your cluster cannot write to disk fast enough, it will tell the Elastic Agent to back off a little.

Have you checked this documentation? There are probably some things that you can try to improve it, even using HDD.

If you haven't read it yet, I strongly recommend that you do.

By default, the Elastic Agent will use 1 primary shard, 1 replica, and a refresh interval of 1s. You may need to create a custom template to change the number of primary shards to 3 (the number of nodes you have) and increase the refresh interval to something like 10s or 15s.
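
As a minimal sketch of what such a custom template could contain, the snippet below only builds and prints the request body; it does not contact a cluster. The template name is an assumption (it presumes the Check Point integration writes to the logs-checkpoint.firewall dataset), and build_custom_template is a hypothetical helper, so verify the name against your own cluster before using it.

```python
import json

# Assumption: the Check Point integration uses the
# logs-checkpoint.firewall dataset, so its custom component template
# would carry this name. Check the actual name in your cluster.
TEMPLATE_NAME = "logs-checkpoint.firewall@custom"


def build_custom_template(shards: int, refresh_interval: str) -> dict:
    """Build the body for PUT _component_template/<name>."""
    return {
        "template": {
            "settings": {
                "index": {
                    "number_of_shards": shards,
                    "refresh_interval": refresh_interval,
                }
            }
        }
    }


body = build_custom_template(shards=3, refresh_interval="10s")
print(f"PUT _component_template/{TEMPLATE_NAME}")
print(json.dumps(body, indent=2))
```

The printed request can be pasted into Kibana Dev Tools. Because number_of_shards is a static index setting, it only takes effect on backing indices created after the template change, for example after the next rollover.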

I do not know QRadar. Does it index every field? Does it do any processing to parse the data? Elasticsearch will do both things.

How did you arrive at this metric? Can you provide more context and some evidence?

wangsubo · May 12, 2024, 3:07pm

I haven't read the document you provided, but I have attempted some benchmark tests. In the CheckPoint logs, there are Rule Names.

I took the top ten Rule Names by "event count" in CheckPoint (e.g., Rule Name = allow trust to untrust) and sent the events for each one, starting with the Rule Name with the fewest events, to both QRadar Event Collector and Elastic-Agent.

I then used QRadar Search and a Kibana data view to compare the event counts for the same CheckPoint time range. For Rule Names with low EPS (events per second), the event counts in QRadar Event Collector and Elastic-Agent are exactly the same. However, when the EPS for a Rule Name reaches 3000-4000, Elastic-Agent begins to show about a 5% discrepancy compared to QRadar.
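
To make the comparison explicit, the discrepancy can be expressed as the percentage by which one count falls short of the other. The sketch below treats the QRadar count as the reference; discrepancy_pct is a hypothetical helper and the counts are placeholder values, not measurements from these tests.

```python
def discrepancy_pct(reference_count: int, observed_count: int) -> float:
    """Percentage by which observed_count falls short of reference_count."""
    if reference_count <= 0:
        raise ValueError("reference_count must be positive")
    return (reference_count - observed_count) / reference_count * 100.0


# Hypothetical counts for one Rule Name over the same time range;
# these are NOT measurements from this thread.
qradar_count = 1_000_000
elastic_count = 950_000

gap = discrepancy_pct(qradar_count, elastic_count)
print(f"gap: {gap:.2f}%, within 1% target: {abs(gap) <= 1.0}")
```

Running the same calculation per Rule Name would show the EPS level at which the gap first exceeds the 1% target.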

QRadar requires regular expressions to parse the data.

I also suspect that the bottleneck might be in the HDD, and I plan to try switching to SSD.

Additionally, under default settings, Elastic Agent uses 1 primary shard and 1 replica with a refresh interval of 1 second. Why should increasing the number of primary shards to 3 involve increasing the refresh interval to about 10 or 15 seconds?

Could you provide me with reference materials on how to increase the refresh interval?

Thanks
