Kibana SIEM app performance

Hello,

we are experiencing performance issues in the Kibana SIEM app, especially when loading the Detections tab, which makes it very frustrating to use.

The queries themselves report fast times: ~8 ms for the Signal count and ~33 ms when inspecting signals. Yet the Detections tab takes around 30 s to load and is very unresponsive; clicking Inspect on a visualization freezes the window for about 5 s before anything happens. This also seems to be worse in a custom space than in the default space.

At first I thought the detection engine was causing this, since we use 400+ rules, so I increased the workers value to 100. Same result. I then disabled all rules and nothing really changed.

I noticed in the Kibana logs that some POST requests take a long time (I have truncated the log):

"message":"POST /api/detection_engine/signals/search 200 112ms - 9.0B"
"message":"POST /api/siem/graphql 200 741ms - 9.0B"
"message":"POST /api/siem/graphql 200 122ms - 9.0B"
"message":"POST /api/ui_metric/report 200 607ms - 9.0B"
"message":"POST /api/ui_metric/report 200 640ms - 9.0B"
"message":"POST /api/siem/graphql 200 1364ms - 9.0B"
"message":"POST /api/siem/graphql 200 3091ms - 9.0B"
"message":"POST /api/ui_metric/report 200 533ms - 9.0B"
"message":"POST /api/siem/graphql 200 2366ms - 9.0B"

Sometimes calls to /api/siem/graphql take around 2000-3000 ms, and there are several of them.

Kibana version 7.7
OS: Ubuntu 16.04
It is a single-node Kibana with 4 CPUs and 8 GB RAM.

ES cluster:
13 nodes, Hot-Warm-Cold architecture
2x hot node, 4x warm node, 2x cold node, 3x master, 1x coordinator, 1x ingest
Cluster heap: 26.2 GB / 64.0 GB
Indices: 328
Shards: 1296
Documents: 3,051,043,815
Storage: 4 TB

Is there anything we can do to improve the performance of the SIEM app? I'll be happy to provide additional details if necessary.

What's your time range looking like?

Sometimes, when there are very large volumes of data, it is going to be slow unless you cut back on your time range.

Normally I keep the time range at 24h. It usually returns about 150 signals.

Sorry to jump in on this, but I also find that performance in the Kibana SIEM app is not optimal. We only use the built-in rules.

Just as a test, I recreated the Detections tab as a dashboard (using .siem-signals-[space]-*) and it loads instantly, even for large time ranges. I understand the SIEM app does a lot more in its interface; I just wanted to point out that this looks more like a SIEM app issue than a problem with our Elasticsearch cluster.
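
If anyone wants to reproduce the comparison, the sketch below is roughly the kind of query such a dashboard panel ends up running against the signals index, bypassing the SIEM app entirely. The Elasticsearch URL, credentials, and space name are placeholders, not my actual setup:

```python
# Rough sketch: count signals from the last 24h straight from the signals
# index, bypassing the SIEM app. The URL, credentials and the space name
# ("myspace") are placeholders -- adjust to your environment.
import requests

ES_URL = "http://localhost:9200"   # placeholder
AUTH = ("elastic", "changeme")     # placeholder

query = {
    "size": 0,
    "track_total_hits": True,
    "query": {
        "bool": {
            "filter": [{"range": {"@timestamp": {"gte": "now-24h"}}}]
        }
    },
}

resp = requests.post(
    f"{ES_URL}/.siem-signals-myspace-*/_search",
    json=query,
    auth=AUTH,
)
resp.raise_for_status()
print("signals in the last 24h:", resp.json()["hits"]["total"]["value"])
```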

We found that the SIEM app was very slow, but that was when accessing it through Remote Desktop.

When accessing it from a laptop with a modern i7, it is much faster.

Ján, if I understand correctly, the Elasticsearch cluster is doing well, and this looks more like an issue with the Kibana server than with the browser side, because the graphql calls are slow.

What CPU usage do you see for the Kibana process (node.js)? It's single-threaded, so I'm trying to understand whether it might be getting saturated.

At first I thought the detection engine was causing this, since we use 400+ rules, so I increased the workers value to 100.

Which workers setting are you referring to here?

Kibana process CPU, based on monitoring, is 2.31 on the 5-minute average and 2.06 on the 15-minute average. If I look directly with htop on the server, the Kibana process jumps between ~170% and ~320% (the VM has 4 CPUs).

The workers setting I mentioned is xpack.task_manager.max_workers, as discussed in GitHub issue 54697.

I have figured out where the problem is.

We use different spaces with different SIEM settings for multi-tenancy. The space where the issue occurs has siem:defaultIndex set to a custom value: it points not at the original patterns (winlogbeat-*, auditbeat-*, etc.) but at our own indices that use an ECS template, such as sophos-*, checkpoint-*, etc.

Unfortunately, one of those indices suffered a field explosion because of a Logstash parsing error, and when I refreshed the index pattern I saw that it now has around 8,000 fields.
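
If anyone wants to check their own patterns for this kind of field explosion without refreshing index patterns in the UI, something like the sketch below works. The Elasticsearch URL, credentials, and pattern list are placeholders:

```python
# Rough sketch: count mapped fields per index pattern via the field caps API.
# URL, credentials and the pattern list are placeholders, not our actual setup.
import requests

ES_URL = "http://localhost:9200"      # placeholder
AUTH = ("elastic", "changeme")        # placeholder
PATTERNS = ["sophos-*", "checkpoint-*", "winlogbeat-*"]  # placeholders

for pattern in PATTERNS:
    resp = requests.get(
        f"{ES_URL}/{pattern}/_field_caps",
        params={"fields": "*"},
        auth=AUTH,
    )
    resp.raise_for_status()
    field_count = len(resp.json().get("fields", {}))
    print(f"{pattern}: {field_count} fields")
```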

As soon as I removed this pattern from siem:defaultIndex, the SIEM app became responsive again.
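
For reference, the same change can also be scripted per space through the endpoint the Advanced Settings page itself calls; it is not an officially documented API, so treat the sketch below as just that. The Kibana URL, credentials, space id, and pattern list are placeholders:

```python
# Rough sketch: update siem:defaultIndex for a single space through the
# advanced-settings endpoint used by the Kibana UI (not an official API).
# Kibana URL, credentials, space id and the pattern list are placeholders.
import requests

KIBANA_URL = "http://localhost:5601"   # placeholder
AUTH = ("elastic", "changeme")         # placeholder
SPACE = "myspace"                      # placeholder space id

# placeholder list: the pattern with the field explosion removed
new_patterns = ["sophos-*", "checkpoint-*"]

resp = requests.post(
    f"{KIBANA_URL}/s/{SPACE}/api/kibana/settings",
    json={"changes": {"siem:defaultIndex": new_patterns}},
    headers={"kbn-xsrf": "true"},
    auth=AUTH,
)
resp.raise_for_status()
print("updated:", resp.status_code)
```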

I'm not sure why the Detections tab in particular was so badly affected by this; none of the detection rules actually used the pattern where the field explosion occurred, but it does seem to be the root cause.

The graphql timings in the logs are still the same, so they probably have nothing to do with this.

Thanks for getting back to us, and I'm glad it got better.

One question about the 400 rules: they are 400 in total, not per space, right? I'm asking because you mentioned spaces, and if you have the same rule in two spaces, it counts as two rules that are executed independently.

Given the high CPU usage of the Kibana server, it might still make sense to add multiple Kibana instances to avoid gaps in detection. The Kibana task manager, which powers the detection rules, can scale by adding more instances. Adding a second Kibana on the same 4-vCPU server might be a good way to start, because the Node.js process is otherwise single-threaded.

One question about the 400 rules: they are 400 in total, not per space, right? I'm asking because you mentioned spaces, and if you have the same rule in two spaces, it counts as two rules that are executed independently.

Yes, I'm aware that it counts as separate rules. It's 21 rules in the space that had the performance problem and another 481 rules in the default space (most of these come from the Sigma project).
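
For what it's worth, the per-space counts can be pulled from the detection engine rules API; a rough sketch below, where the Kibana URL, credentials, and space id are placeholders:

```python
# Rough sketch: count detection rules per space via the rules find API.
# Kibana URL, credentials and the custom space id are placeholders.
import requests

KIBANA_URL = "http://localhost:5601"   # placeholder
AUTH = ("elastic", "changeme")         # placeholder

def rule_count(space=None):
    prefix = f"/s/{space}" if space else ""   # default space has no prefix
    resp = requests.get(
        f"{KIBANA_URL}{prefix}/api/detection_engine/rules/_find",
        params={"per_page": 1},
        auth=AUTH,
    )
    resp.raise_for_status()
    return resp.json()["total"]

print("default space:", rule_count())
print("custom space:", rule_count("myspace"))   # placeholder space id
```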

Given the high CPU usage of the Kibana server, it might still make sense to add multiple Kibana instances to avoid gaps in detection. The Kibana task manager, which powers the detection rules, can scale by adding more instances. Adding a second Kibana on the same 4-vCPU server might be a good way to start, because the Node.js process is otherwise single-threaded.

I realized this as I was debugging the issue and have already started deploying an additional Kibana node, but I'll also look into the possibility of running multiple instances on the same node if the hardware allows it.

Thank you for your help, Tudor.
