SIEM Rule Failures

Hello Everyone,

We are on Elastic 7.9 and mainly use it as a SIEM. Suddenly all of our SIEM rules have started to fail: not just some of them, but every one.

Hi @Ameer_Mukadam! Looking at your screenshot, it appears that these failures are due to a large amount of time passing between rule executions.

Before we dig into potential causes/solutions, I want to note two important facts:

  1. Rule failures do not necessarily mean that the rule did not execute and generate signals; in fact in your case I would expect that signals are still being generated.
  2. We only report this particular error if the gap is 4x the rule interval, so given your failure messages I am inferring that your rules run every 1-2 minutes.

That being said, the failures you're seeing are indicative of a performance issue: the amount of time it takes for these rules to query and generate signals is > 4x the expected interval.
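As a rough illustration of the 4x threshold mentioned above, the check can be sketched like this. It is not the detection engine's actual code: the function name and the `factor` parameter are hypothetical, and treating the boundary as inclusive is an assumption. Only the 4x multiplier comes from this thread.

```python
from datetime import timedelta


def gap_is_reported(gap: timedelta, interval: timedelta, factor: int = 4) -> bool:
    """Return True when the time since the last execution reaches factor x interval.

    Hypothetical helper: the inclusive comparison is an assumption.
    """
    return gap >= factor * interval


# With a 2-minute interval the threshold is 8 minutes.
print(gap_is_reported(timedelta(minutes=9), timedelta(minutes=2)))  # above threshold
print(gap_is_reported(timedelta(minutes=5), timedelta(minutes=2)))  # below threshold
```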

While we offer documentation for tuning our prebuilt rules, and many knobs for tuning your custom rules (scheduling, query optimization, number of task workers, general vertical/horizontal scaling, etc.), these are mostly manual and depend strongly on your particular environment.

If you have a specific question or more details about the circumstances of your situation I'd be happy to help further!

I think I agree with the performance part: the cluster isn't quick at all, so that might be the issue. Also, the rules run at 5-minute intervals with a 1-2 minute look-back time.

It's been a while, but I got some details from support, so I thought I would post them here:

I checked the diagnostics results. It seems your cluster was having performance issues when you ran the support diagnostics.

On node vishnu3, the refresh interval was 153.8ms and the number of search rejections was 2516. On node vishnu1, the refresh interval was 127.64ms. This indicates that your cluster is performing poorly. I suspect it is related to the error messages in the screenshots, which complain that 13 minutes passed since the last rule execution.

When looking at the indices, I found that the following indices are taking a long time to index. The cybernx-* indices do not appear to have refresh_interval configured, so they are using the default "refresh_interval: 1s". To improve indexing performance, please consider increasing refresh_interval for these indices; this will definitely improve the indexing speed. For more details, please refer to our guide.

jq '[.indices | to_entries[] | { "key": .key, "value": .value.primaries.indexing.index_time }] | from_entries' indices_stats.json | grep -v 0s | grep "h"

".kibana_task_manager_1": "1.5h",
".async-search": "9.1m",
"cybernx-ise-000016": "5.5h",
"cybernx-ise-000015": "3.2h",
"cybernx-ise-000018": "1.1h",
"cybernx-ise-000017": "5.8h",
"filebeat-elasticsearch-000001": "4.5h",
"cybernx-cnx-000017": "4.7h",
"cybernx-cnx-000014": "1.2h",
"cybernx-cnx-000016": "5.4h",
"cybernx-cnx-000015": "4.3h",
"cybernx-pcpl-000001": "20.4h",
"cybernx-bajaj-000025": "22.6h",
"cybernx-bajaj-000021": "14.3h",
"cybernx-bajaj-000022": "13.9h",
"cybernx-bajaj-000020": "3.3h",

I have another question that needs your confirmation: what kind of storage is used for the data nodes (spinning, SSD, NFS)? SSDs are known to perform better than spinning disks. Always use local storage; remote filesystems such as NFS or SMB should be avoided. Please confirm.

This left me with some questions. Our indices are set up so that they continuously receive security logs from firewalls, servers, etc.; once an index reaches 50 GB or 30 days, it rolls over. What exactly is the indexing time displayed above? Is it the time taken for a document to get indexed, or something else? At any given time only one index per company receives logs. For example, between cybernx-pcpl-000016 and cybernx-pcpl-000017, only 000017 receives logs; the older index is used only for searching. So what exactly does the time in hours for each index denote?

@Ameer_Mukadam if you're working with a support person, I would recommend continuing your conversation with them as I have limited context here and thus can only provide limited assistance.

To try and answer your question about understanding those values: it appears as though those are the results of an Index Stats API call. Descriptions of those values can be found in more detail here.
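If it helps, here is a minimal Python sketch of reading those values directly from a saved stats response. It assumes the file follows the usual Index Stats layout (`indices.<name>.primaries.indexing.index_time_in_millis`); the function name and the one-hour cutoff are illustrative choices of mine, not anything support used. Working from the numeric millisecond field avoids text matching on the human-readable strings.

```python
import json


def slow_indexing(stats: dict, min_millis: int = 3_600_000) -> list:
    """Return (index, hours) pairs whose primary indexing time is at least min_millis."""
    rows = []
    for name, body in stats.get("indices", {}).items():
        millis = body["primaries"]["indexing"]["index_time_in_millis"]
        if millis >= min_millis:
            rows.append((name, round(millis / 3_600_000, 1)))
    # Largest cumulative indexing time first.
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Usage, assuming the response was saved as indices_stats.json:
# with open("indices_stats.json") as f:
#     for name, hours in slow_indexing(json.load(f)):
#         print(f"{name}: {hours}h")
```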

The main takeaway is that those statistics are relative to the time period in which they were collected, and so "time spent in indexing" is most useful in relation to time spent elsewhere (i.e. the full stats response).
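To make that concrete: as far as I understand these stats, the indexing time is a cumulative total over all indexing operations rather than the latency of a single document, so it helps to divide it by the operation count and to compare it against search time. The sketch below assumes the `index_total`, `index_time_in_millis`, and `query_time_in_millis` fields of a `primaries` block; the function name and the output keys are hypothetical.

```python
def indexing_summary(primaries: dict) -> dict:
    """Put cumulative indexing time in context for one index's `primaries` stats."""
    indexing = primaries["indexing"]
    search = primaries["search"]
    ops = indexing["index_total"]              # number of indexing operations
    index_ms = indexing["index_time_in_millis"]  # cumulative time spent indexing
    query_ms = search["query_time_in_millis"]    # cumulative time spent in queries
    return {
        "avg_index_ms_per_op": index_ms / ops if ops else 0.0,
        "index_to_query_time_ratio": index_ms / query_ms if query_ms else None,
    }
```

An index with many hours of total indexing time can still have a small per-operation average if it has simply received a lot of documents.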

It sounds as though your support team is narrowing down the issue to those cybernx indices and their refresh_intervals so you look to be on the right track. If you have other questions specific to the SIEM app, let me know and I'd be happy to help!
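For completeness, a hedged sketch of what changing refresh_interval could look like through the update index settings API (`PUT <index>/_settings`). The base URL, the `30s` value, and the function name are assumptions for illustration; the right interval depends on how fresh your rules need the data to be, and authentication is left out entirely.

```python
import json
import urllib.request


def build_refresh_interval_request(base_url: str, index_pattern: str, interval: str):
    """Build (but do not send) a PUT <index>/_settings request for refresh_interval."""
    body = json.dumps({"index": {"refresh_interval": interval}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index_pattern}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Assumed values: a local node and a 30s interval.
request = build_refresh_interval_request("http://localhost:9200", "cybernx-*", "30s")
# To send it: urllib.request.urlopen(request)
```

This only changes existing indices; indices created by later rollovers would need the same setting in their index template.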
