Over 110 detections crash SIEM application and Kibana plugins

Hello,

I am having an issue that I have not been able to fix, or even to understand at the root-cause level. I am running a cluster with 6 hot data nodes (virtual, SSD, 16 CPUs, 64 GB RAM, 8 TB disks), 4 cold nodes with the same setup, 3 masters that are smaller, and 2 ML-specific nodes. We started out with 2 Kibanas (virtual, SSD, 6 CPUs, 12 GB RAM).

The issue first cropped up when we started to expand the number of detections we were using in the SIEM. Once we got over 100-110 detections we started seeing the following behavior: Kibana status would go red, and when we checked the status page we would see lots of plugin failures (most of them seem to be related to Task Manager failure). We also saw the "gap" time in the detection monitoring page increase drastically for a large number of the rules.

From the user's perspective the SIEM app would be unavailable, but the rest of Kibana seemed to work pretty well. We were told by Elastic that we were under-resourced. We disagreed, as other than the performance issues seen in Kibana there was nothing to indicate that the Kibana servers were under any strain; I had never seen either of the servers go above 10% utilization in memory or CPU usage. With no other ideas, we finally added two more Kibana servers of the same size as the existing ones, bringing us to 4 total.

Things seemed fine until we started adding additional detections. As soon as we got over 110 detections we started seeing the same issue again. At this point there is no doubt in my mind that we have enough resources, but I have run out of ideas on what could possibly be causing this issue. As we are trying to stand up a functional SOC we obviously need to bring more than 110 detections online.

Also, if anyone knows anything about the Kibana plugins, please let me know, as there is not much in the documentation to help me understand this issue any better. I would like to know if there is any way to troubleshoot what is going on. It is great that I can see plugins are failing, but if the only thing I can do is restart the process, I am an unhappy engineer. Also, any insight into Task Manager? I have talked with people at Elastic and no one can explain the output in a satisfactory manner. It seems clear that things are not working great, but direct answers are hard to come by.

Curious if anyone has run into a similar issue or has any new thoughts.

Thanks,
Alex


What errors do you see in the logs? How is your heap usage? What's your disk I/O, e.g. read MB/s?

I'm running 750 rules with no major issues on 7.16.3, but on dedicated physical nodes with local NVMe drives.

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda               3.79         0.08         0.05       1908       1320
dm-0              1.51         0.05         0.00       1147         37
dm-1              0.00         0.00         0.00          2          0
dm-2              2.23         0.03         0.05        729       1271
dm-3              0.01         0.00         0.00          3          2
dm-4              0.11         0.00         0.00          3          4
dm-5              0.01         0.00         0.00          2          0

Down to 130 active detections and everything is green again.

Does the node/user you mentioned have the correct privileges?
It's weird that there seems to be no resource issue. Maybe you could enable debug logging and try again? Hopefully someone from Elastic has a clue what could be going on here.

Sorry for the delay. We upgraded to 7.16.2, and we seemed to see an increase in performance. We started at 100 detections and stepped up by 10 detections every other day. I got to 150 detections before I started seeing issues again. It is a little different this time: the status of two of my Kibanas is yellow, because there is a large number of plugins in a degraded state. Of these degraded plugins, a large number are showing degradation in the security, task manager, alerting and triggersActionsUi services.

As far as heap usage, we have four Kibana servers and they are all chilling. CPU and RAM usage never get higher than 10% for CPU and 20% for RAM. As for the logs, the following is an example of something I am pulling out of kibana.log:

{"type":"log","@timestamp":"2022-02-16T21:12:52-07:00","tags":["info","plugins","security","audit","saved_objects_authorization_success"],"pid":12281,"eventType":"saved_objects_authorization_success","username":"1471545631","action":"update","types":["siem-detection-engine-rule-status"],"spaceIds":["default"],"args":{"type":"siem-detection-engine-rule-status","id":"11610110-897d-11ec-8186-059e05bba90f","attributes":{"statusDate":"2022-02-17T04:12:52.856Z","status":"going to run","lastFailureAt":"2022-02-09T07:51:18.697Z","lastSuccessAt":"2022-02-17T04:08:06.493Z","lastLookBackDate":null,"lastFailureMessage":"An error occurred during rule execution: message: \"[siem-detection-engine-rule-status:3e28f490-7e8e-11ec-9ee3-092b6bf9e58d]: version conflict, required seqNo [1252542], primary term [2]. current document has seqNo [1252555] and primary term [2]: version_conflict_engine_exception: [version_conflict_engine_exception] Reason: [siem-detection-engine-rule-status:3e28f490-7e8e-11ec-9ee3-092b6bf9e58d]: version conflict, required seqNo [1252542], primary term [2]. current document has seqNo [1252555] and primary term [2]\" name: \"Microsoft Build Engine Using an Alternate Name [Active]\" id: \"c490b870-7e8d-11ec-beaa-cde9595b607a\" rule id: \"4ae83156-36f0-492c-ac44-cbadfe2b26bf\" signals index: \".siem-signals-default\"","lastSuccessMessage":"succeeded"},"options":{"references":[{"id":"c490b870-7e8d-11ec-beaa-cde9595b607a","type":"alert","name":"alert_0"}]}},"message":"1471545631 authorized to [update] [siem-detection-engine-rule-status] in [default]"}
{"type":"log","@timestamp":"2022-02-16T21:13:04-07:00","tags":["info","plugins","security","audit","saved_objects_authorization_success"],"pid":12281,"eventType":"saved_objects_authorization_success","username":"1471545631","action":"update","types":["siem-detection-engine-rule-status"],"spaceIds":["default"],"args":{"type":"siem-detection-engine-rule-status","id":"c04eb870-8892-11ec-8186-059e05bba90f","attributes":{"statusDate":"2022-02-17T04:13:04.784Z","status":"going to run","lastFailureAt":"2022-02-08T03:54:00.417Z","lastSuccessAt":"2022-02-17T04:08:10.834Z","lastFailureMessage":"An error occurred during rule execution: message: \"[siem-detection-engine-rule-status:81ed0500-845a-11ec-b3f9-3d86fdb9e0a6]: version conflict, required seqNo [1030080], primary term [2]. current document has seqNo [1030083] and primary term [2]: version_conflict_engine_exception: [version_conflict_engine_exception] Reason: [siem-detection-engine-rule-status:81ed0500-845a-11ec-b3f9-3d86fdb9e0a6]: version conflict, required seqNo [1030080], primary term [2]. current document has seqNo [1030083] and primary term [2]\" name: \"PsExec Network Connection [Active]\" id: \"32e22790-791c-11ec-93b0-e73074d6e1cb\" rule id: \"a1108dc4-68f5-452d-8ce4-e7431774fef6\" signals index: \".siem-signals-default\"","lastSuccessMessage":"succeeded"},"options":{"references":[{"id":"32e22790-791c-11ec-93b0-e73074d6e1cb","type":"alert","name":"alert_0"}]}},"message":"1471545631 authorized to [update] [siem-detection-engine-rule-status] in [default]"}

There are quite a few more of these logs, but it looks like these documents have somehow gotten their sequence numbers out of whack. To date a reboot seems to get things back in order, but I am curious how this happens. As these aren't actually error logs, I am not sure if this actually has anything to do with this particular issue. Next:

{"type":"log","@timestamp":"2022-02-16T21:13:31-07:00","tags":["error","plugins","alerting"],"pid":12281,"message":"Executing Alert default:monitoring_shard_size:ab7631e0-1adc-11ec-99af-57fcc1269633 has resulted in Error: illegal_argument_exception: [illegal_argument_exception] Reason: node [xxxx] does not have the [remote_cluster_client] role, caused by: \"\""}
{"type":"log","@timestamp":"2022-02-16T21:13:31-07:00","tags":["error","plugins","alerting"],"pid":12281,"message":"Executing Alert default:monitoring_alert_nodes_changed:ab721330-1adc-11ec-99af-57fcc1269633 has resulted in Error: illegal_argument_exception: [illegal_argument_exception] Reason: node [xxxx] does not have the [remote_cluster_client] role, caused by: \"\""}
{"type":"log","@timestamp":"2022-02-16T21:21:05-07:00","tags":["error","plugins","eventLog"],"pid":3780,"message":"error setting existing \".kibana-event-log-7.13.3-000007\" index aliases - error setting existing index aliases for index .kibana-event-log-7.13.3-000007 to is_hidden: illegal_state_exception: [illegal_state_exception] Reason: alias [.kibana-event-log-7.13.3] has is_hidden set to true on indices [.kibana-event-log-7.13.3-000007] but does not have is_hidden set to true on indices [.kibana-event-log-7.13.3-000008,.kibana-event-log-7.13.3-000005,.kibana-event-log-7.13.3-000006]; alias must have the same is_hidden setting on all indices"}

These look like issues related to roles that have been granted to some different alerts, but I don't think they would have anything to do with what I have been seeing.

I am going to have to get back to you on the disk I/O information. My gut says things will be fine. All of the computation for these detections should be handled within Elasticsearch; Kibana should just handle the timing of when the queries get sent to the stack and then wait for a response. In my mind the SIEM should be fairly easy on the stack.

Sorry this got a little extensive.


Thanks again for that advice on that one host. I am also wondering if you have ever played with the number of workers or the polling interval settings? I just increased my number of workers from the default of 10 to 15. It seems like there is a little change, but I think I still have a ways to go.
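For reference, the two Task Manager knobs being discussed here live in kibana.yml; a sketch with the values from this thread (illustrative, not recommendations):

```yaml
# kibana.yml -- Task Manager tuning (values mirror this thread)
xpack.task_manager.max_workers: 15      # default is 10
xpack.task_manager.poll_interval: 3000  # milliseconds; 3000 is the default
```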

Also, out of curiosity are you ingesting a lot of data into your stack daily? If you are rocking the default settings I am wondering why exactly you are able to run so many detections while it seems to be shutting my stack down.

My theory right now is that for some reason the Kibana servers aren't playing well together, but at the end of the day I have tons of firepower, so that is strange to me. I also don't have that many tasks running; according to Task Manager my average recurring throughput per minute is 46 and my max is 300 (per node). So it feels like something else is going on here. Adding resources horizontally and vertically has made slight changes in performance, but I can't get anywhere near the 700 number you are running.


Hi @alaine! Thanks for bringing this discussion up here! After speaking with one of my team members, they suggested it might be worth checking whether you happen to have any indicator match rules running, as they can have an impact on overall performance. If so, you could troubleshoot by disabling those and seeing if there's a noticeable improvement.


@alaine , for the OCC errors I see:

  1. In 7.16.0 we partially fixed the race condition [Security Solution][Detections] Fix race condition on execution logs write by xcrzx · Pull Request #118171 · elastic/kibana · GitHub
  2. In 7.16.1 we added another fix [Security Solution][Detections] Fix 409 conflict error happening when user enables a rule by banderror · Pull Request #120088 · elastic/kibana · GitHub
  3. In 8.0.0 we reduced the number of calls required to update the status, which should further reduce the probability of an OCC error [Security Solution] Optimized rule execution log performance by xcrzx · Pull Request #118925 · elastic/kibana · GitHub
  4. In an upcoming release (soon) we finally get rid of siem-detection-engine-rule-status [Security Solution][Detections] Rule execution logging overhaul by banderror · Pull Request #121644 · elastic/kibana · GitHub

So the impact and probability of running into the OCC error should be less and less as you upgrade.
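The version-conflict messages in the logs above come from Elasticsearch's optimistic concurrency control. A minimal sketch of the mechanism, assuming a local cluster at localhost:9200 (the index name is hypothetical):

```shell
# 1. Index a document; the response includes its _seq_no and _primary_term.
curl -s -XPUT 'localhost:9200/occ-demo/_doc/1' \
  -H 'Content-Type: application/json' \
  -d '{"status":"going to run"}'

# 2. A conditional update only succeeds if the document has not changed since
#    the values given here. If another writer bumped the sequence number first,
#    Elasticsearch returns a 409 version_conflict_engine_exception -- the same
#    error the Kibana rule-status writes are hitting.
curl -s -XPUT 'localhost:9200/occ-demo/_doc/1?if_seq_no=0&if_primary_term=1' \
  -H 'Content-Type: application/json' \
  -d '{"status":"succeeded"}'
```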

For the other errors around roles, I think that has to do with node roles and whether you're using cross-cluster searches:
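If node.roles is set explicitly in elasticsearch.yml, every role the node needs has to be listed; the monitoring alerts in the error above expect remote_cluster_client. A sketch (the exact role list here is illustrative, not a recommendation):

```yaml
# elasticsearch.yml -- when node.roles is set explicitly, omitted roles are
# disabled; remote_cluster_client must be listed for alerts that require it.
node.roles: [ data_hot, data_content, remote_cluster_client ]
```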


@Michael_Olorunnisola you're welcome, and thanks for the response. I am double-checking the rules now, as I am pretty sure we aren't using any indicator match rules. We had the threat intel feed running for a brief period of time, but had to shut it down, partly I think due to the issue we are discussing, and partly because we just weren't getting great info from it.

As I look through them I am reminded of another usability feature I would really appreciate. Unfortunately, the SIEM is a little clunky to administer. I have had to go back through and add additional tags myself that seem necessary; one example would be an "active" tag for the active rules. The next one I am going to add is a tag based on rule type.

Please let me know if I am wrong, but to be completely sure none of the rules are using indicator match, I have to click into each one and check that its query type is not indicator match.

Thanks again for this response! We are working on a pipeline to bring in threat intel from a proprietary source, and I was excited about the threat indicator matches. Knowing that they are that intensive is great as we are already in a constrained environment it seems.

Thanks,
Alex


@Frank_Hassanabad I am trying to understand what is going on. I can see that the first group of errors I shared is related to the OCC page you shared with me. I have noticed that since upgrading to 7.16.2 performance has been slightly better. A question that comes to mind: if we are seeing these errors and failures in the logs, how come the SIEM rules monitoring UI isn't showing those errors as well? According to the UI everything is fine.

Also, is there anything else we are doing in our cluster that makes us more prone to these errors? @willemdh mentioned that he is running about 700 rules on a cluster that turns out to be very close in size and ingestion rate to mine. There are differences between the two clusters, but I am wondering if there is anything I am doing that makes this cluster more prone to these errors.

As I said, we have seen performance increases since the upgrade: instead of being able to run only 110 detections we can now run between 130 and 150 rules, with tweaks to the number of workers each Kibana is running. That being said, with the amount of resources each Kibana has access to and the number of workers each of them is running, we should be able to run tons of tasks without an issue.

It seems to me like coordination is an issue. Is there something we can do to give Elastic a better chance at properly coordinating across all of our servers? The next test we are going to do is remove all but one of the Kibanas and see if performance improves.

Thanks again for your help, and look forward to any additional thoughts.

Alex

A question that comes to mind: if we are seeing these errors and failures in the logs, how come the SIEM rules monitoring UI isn't showing those errors as well?

It's because the OCC errors happen when Kibana is trying to write the status messages themselves, and then it throws because it cannot write the status message. That's why you cannot see the error on the front end: the status-message writing itself is the thing having the issue. Luckily, as you upgrade it should become less probable and eventually go away.

For performance I would look at the query times of your rules in the rule monitoring UI:

You can also turn on slow logs from Elastic:

to see what the slower queries being executed are.
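Slow logs are enabled per index via dynamic settings; a sketch, assuming a local cluster (the index pattern and thresholds here are examples, not recommendations):

```shell
# Log any search query slower than the thresholds below to the index slow log.
curl -s -XPUT 'localhost:9200/winlogbeat-*/_settings' \
  -H 'Content-Type: application/json' -d '
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "5s",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}'
```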

It's incredibly variable why someone experiences performance issues, and it's typically situational. So it's tricky to give tips and advice, as each person's installation/modifications are very different from another's. The number of rules turned on isn't usually the reason; one person's 100 rules can be very different from another's.

A single rule, such as an indicator match rule, could end up doing a lot of querying, consume resources over a long period of time, or tie up an executor within Kibana non-stop, so that over time you have to reboot Kibana. That is why @Michael_Olorunnisola was asking about it above, as an example of a single rule causing issues.

The same effect could come from a single EQL, KQL, or threshold rule, etc., depending on whether wildcards are used or on the underlying data set it is trying to match against. It could also be a small set of rules running on too short an interval, or with a look-back time that is too large.

I sometimes see someone who has an index pattern such as auditbeat-* or logs-*, and when that pattern is expanded there is a huge number of indexes; they're not using ILM rules to retire them, so the query runs against a very large count of indexes. The data itself is not large, but the count of indexes is. That's another reason why things could be slowing down.

In other cases there could be a mapping explosion: a few indexes with a lot of indexed fields will slow everything down. You could also have created a scripted or runtime field and be matching against that, which can be incredibly slow and CPU- or memory-intensive.

All good things to check.


Frank, thanks for getting back to me so quickly, and thanks for clearing up the OCC error issue. It is nice that all that stuff finally makes sense to me.

As far as performance our longest running rules are running in less than 2 seconds. That seems reasonable to me, but is that too long/a possible issue?

As far as the indexes these detections run on, for the most part it is winlogbeat-*, which goes back a ways. Right now our ILM on that one is 30 days hot, an additional 90 days cold, and then they get deleted at the 120-day mark. Is that too much?
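That policy would look roughly like this in the ILM API; a sketch only, with a hypothetical policy name, assuming a local cluster:

```shell
# 30 days hot, then cold (replicas dropped, as described), deleted at 120 days.
curl -s -XPUT 'localhost:9200/_ilm/policy/winlogbeat-policy' \
  -H 'Content-Type: application/json' -d '
{
  "policy": {
    "phases": {
      "hot":    { "actions": { "rollover": { "max_age": "30d" } } },
      "cold":   { "min_age": "30d",  "actions": { "allocate": { "number_of_replicas": 0 } } },
      "delete": { "min_age": "120d", "actions": { "delete": {} } }
    }
  }
}'
```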

Things like mapping explosions and too many indices, though, I would think would show up in the monitoring section. Is that correct? There was a time a few upgrades ago when we had some rules running 30 seconds or more. Then we added an additional two Kibanas, and more recently I have added workers to all of the Kibanas. Right now I am at 30 workers per Kibana and am about to change each to 50 workers apiece on some recommendations I have gotten. At that point the throughput should be completely ridiculous, and we don't really have any other tasks Kibana has to worry about. It really seems like we have over-engineered this thing to a comical level...

The last thing I will address here is that in Task Manager it seems like all of these tasks are being considered "non-recurring", which has me confused. I was under the impression that all these detections should be considered "recurring" tasks. Am I off base here?

Thanks again,
Alex

As far as performance our longest running rules are running in less than 2 seconds. That seems reasonable to me, but is that too long/a possible issue?

Running in less than 2 seconds is a good thing if that is accurate. Your workers will get returned to the pool and continue to be re-used. Gaps in detection are usually a sign that either workers aren't being returned to the pool fast enough, or there are congestion issues due to networking and/or latency issues and/or slow-running rules/queries.

Things like mapping explosions and too many indices, though, I would think would show up in the monitoring section. Is that correct?

No, it wouldn't show up there. However, going through Stack Management and querying your mappings directly can show you whether you have large numbers of indexes or large numbers of mappings.

I don't think there is any one simple stat for mapping explosions, but there are ways of getting that information, either through the UI above or with curl, like these examples:
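One way to do the curl version is to count the "type" keys in the mapping output. A sketch; the inline JSON here is a tiny stand-in so the pipeline can run anywhere, and against a live cluster you would replace the echo with something like `curl -s 'localhost:9200/filebeat-*/_mapping'`:

```shell
# Each leaf field in a mapping carries a "type" key, so counting occurrences
# gives a rough field count for the index pattern.
echo '{"properties":{"host":{"type":"keyword"},"msg":{"type":"text"}}}' \
  | grep -o '"type"' | wc -l
# prints 2
```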

I just did a simple search for "type" above. If you see large numbers of index patterns/fields with GUIDs or auto-generated fields, that will cause slowdowns directly within the web application pages of the security application, and cause oddball NodeJS lock-ups as it is parsed as JSON in various areas of the application. An older article, but it describes mapping explosion.

If you have a huge number of index mappings or indexes in general, then certain operations can cause long "blocking times" within NodeJS, such as parsing large JSON messages. You would see CPU spikes, though. Larger messages will get pushed through your network, and latency will begin increasing. CPU spikes, however, will only show up on one core, since NodeJS only uses one core per machine, which is why having multiple Kibana instances helps. But even with multiple Kibana instances, if each is parsing large messages the poor performance will be the same. Since NodeJS uses just one CPU core, this can lead to misleading reports at the OS level: the OS might report the performance of, say, 4 CPUs together, with three idle while the one NodeJS is using is spiked at 100%.

The last thing that I will address here is that in task manager it seems like all of this tasks are being considered "non-recurring" which has me confused. I was under the impression that all these detections should be considered "recurring" tasks. Am I off base here?

Usually I don't have to go to that level of depth with Task Manager, but as far as I know the tasks are always scheduled and run per the interval on the rule given to them.

Right now our ILM on that one is 30 days hot, an additional 90 days cold, and then they get deleted at the 120-day mark. Is that too much?

It depends on data size, cluster size and other variables. I would make sure your rules aren't touching anything cold or frozen though. I don't know if you are writing cross cluster searches or what your setup is as well. Cross cluster searches in some instances can be slow.
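One way to see whether a query pattern would touch cold data is the ILM explain API, which reports the current phase of each backing index. A sketch, assuming a local cluster (the index pattern is an example):

```shell
# Shows, per index, which ILM phase (hot/warm/cold/delete) it is currently in,
# so you can tell whether a winlogbeat-* search would hit cold indices.
curl -s 'localhost:9200/winlogbeat-*/_ilm/explain?pretty'
```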

Has this issue been resolved you were seeing earlier by setting your nodes roles?

{"type":"log","@timestamp":"2022-02-16T21:13:31-07:00","tags":["error","plugins","alerting"],"pid":12281,"message":"Executing Alert default:monitoring_alert_nodes_changed:ab721330-1adc-11ec-99af-57fcc1269633 has resulted in Error: illegal_argument_exception: [illegal_argument_exception] Reason: node [xxxx] does not have the [remote_cluster_client] role, caused by: \"\""}

If that is not set correctly, or if you're seeing errors, I wonder if you just have large numbers of workers timing out trying to contact a node that cannot be contacted.

Do you have any errors within Elasticsearch logs? Did you enable slow logs and double check to ensure there aren't issues there with queries?

when we checked the status page we would see lots of plugin failures (most of them seem to be related to Task Manager failure

Do you have any other messages/logs not already mentioned in this thread you can share?

Frank,

Thanks again for getting back to me, and in such depth. Last Friday I upped the running detections to 160 and things still look good. All of the runtimes are under 1.5 seconds. We are currently at 30 workers per node and have left the poll interval at the default 3000 ms. I am going to add 10 more rules today, and I will keep doing that until we start running into issues again, to see how far 30 workers gets us. I am going to keep an eye on CPU and RAM and see if they spike at all, but at this point the servers are still seemingly unaware that anything is happening at all.

As far as the mapping explosion, this is actually an issue I unfortunately forgot about. I was researching it for another issue (which I also forget at this time) but got sidetracked. As a quick way to find the number of fields (if you are interested), you can go to Stack Management > Index Patterns > then select the related pattern, and it will show you the exact number of fields related to the alias. This works for me, as the lion's share of our data is contained in two different index types, filebeat and winlogbeat.

In our case filebeat definitely seems like it is an issue, and maybe it is the issue leading to some of our performance problems. What is weird to me is that the only place we see a problem is on the Kibana nodes. I have over 6000 fields in my filebeat alias, which is taking in over 160 GB a day. This data is coming from Zeek and Suricata. What has me a little confused is that we aren't seeing any issues on the Elasticsearch nodes in any noticeable way. The other thing I remember from when I first went over this issue is that nearly all of those fields are never populated, as we aren't sending any of that data from our endpoints to the stack.

To address the other points you made: using the "type" method I still have over 6300 occurrences. A few of these can be explained away by fields that are both type text and keyword, but it is still fairly close to the number I found using the other method mentioned above. For this next part I am definitely out of my element a bit, but I will try my best. There really aren't many occurrences of GUIDs in the mapping (strictly based on a Ctrl+F of "guid"). The auto-generated part I am a little unsure of as well. To my understanding, basically all of these fields were auto-generated by running filebeat -e setup (or something along those lines) when we were starting up Filebeat for the first time, so the ingest pipelines could be created within the stack. What I am wondering is whether there were too many modules enabled when this was done. As I look through the mapping there are so many fields that will just never apply. We have mappings for things like azure, kubernetes, google, etc. that are never going to be present in our environment, but the mapping is still there and presumably consuming a certain amount of resources in some way.

Our other big index alias is winlogbeat. We ingest even more of those every day, but the total fields are around 1200. Seems a little high, but it doesn't give me cause for concern. Again, let me know if I should be freaking out.

This leads us back to the performance picture on the Kibana nodes, though. I didn't realize (or had forgotten) that NodeJS only runs on one core, so thanks for the info there. That being said, whenever I look at the nodes for CPU and RAM utilization I am running top. I am going to speak as specifically as possible here, because sometimes I get a little turned around. I use the keyboard shortcut to show each of the individual CPUs. None of them is pegged at 100% for any significant amount of time; it does happen from time to time, but for no longer than a few seconds at once. Now, each of our CPUs has one thread and 1 core per socket, and we have 6 sockets. So I am reading this as: we have 1 core per CPU, and since none of the individual CPUs is pegged for any meaningful amount of time, it doesn't seem like NodeJS is pounding just one CPU. Finally, the 100% utilization isn't jumping from CPU to CPU either. There will occasionally be a spike to 100% for a brief period, and then it goes right back to all CPUs being at or under 15%.

Also, we currently have 400 active indices. I need to go through these and eliminate the indices that can be eliminated, but I think this number is appropriate. Please correct me if I am off base here.

"It depends on data size, cluster size and other variables. I would make sure your rules aren't touching anything cold or frozen though. I don't know if you are writing cross cluster searches or what your setup is as well. Cross cluster searches in some instances can be slow."

I will start with what I can say for sure: we aren't writing or searching across clusters. As far as touching anything cold or frozen, I am unsure how to check on that. I have a feeling the detections are running against everything, as I don't think we have configured anything to look only at hot indices. Is that something we can do? At the same time, I don't know if there is actually any difference between our hot and cold indices, as our hot and cold nodes are identical; they are all virtual machines using SSD. The only difference is that we get rid of all replicas once the index goes cold. Unless there is something under the hood that does something "different" to a cold index, it should have the same search time as our hot indices, I would imagine, but I am happy to be wrong here as well. Something I am also a little unclear on.

Has this issue been resolved you were seeing earlier by setting your nodes roles?

Yes. As another note, we realized that our time situation was a little screwy, so we had to go through and sync all the servers to the time server. It is hard to tell whether that is the reason for the reduction in the running time of all the detections, but it doesn't seem to have made things worse.

Do you have any errors within Elasticsearch logs? Did you enable slow logs and double check to ensure there aren't issues there with queries?

Just checked the Elasticsearch logs and didn't find any errors. I forgot to enable the slow logs. When I do that, what time threshold do you think I should set?

Do you have any other messages/logs not already mentioned in this thread you can share?

Nothing directly related to the plugin issue at this time. I will have to wait until/if it happens again to grab logs related specifically to that issue, but I went through the logs again and found the following entry, which I am not sure what to do with yet. I haven't had time to research it, but it seems new.

{"type":"log","@timestamp":"2022-03-01T02:11:29-07:00","tags":["error","plugins","alerting"],"pid":27859,"message":"Executing Alert default:monitoring_alert_cluster_health:ab708c90-1adc-11ec-99af-57fcc1269633 has resulted in Error: search_phase_execution_exception: [illegal_argument_exception] Reason: no mapping found for `cluster_uuid` in order to collapse on; [illegal_argument_exception] Reason: no mapping found for `cluster_uuid` in order to collapse on; [illegal_argument_exception] Reason: no mapping found for `cluster_uuid` in order to collapse on; [illegal_argument_exception] Reason: no mapping found for `cluster_uuid` in order to collapse on; [illegal_argument_exception] Reason: no mapping found for `cluster_uuid` in order to collapse on; [illegal_argument_exception] Reason: no mapping found for `cluster_uuid` in order to collapse on, caused by: \"no mapping found for `cluster_uuid` in order to collapse on,no mapping found for `cluster_uuid` in order to collapse on\""}

Thanks again for all your time!
Alex

The above link explains away the filebeat mapping explosion issue I believe.

Did you mean to post a link above there?

I don't know much about the monitoring_alert_cluster_health rule type, to be honest. My guess is that it's looking at an index for the field cluster_uuid and can't find it.
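If you want to verify that guess, the field-mapping API will show which backing indices actually map the field. Assuming the alert runs against the stock monitoring indices (`.monitoring-es-*` is an assumption on my part), something like this in Dev Tools would show whether `cluster_uuid` is mapped:

```
# Which monitoring indices have a mapping for cluster_uuid?
GET .monitoring-es-*/_mapping/field/cluster_uuid
```

Any index that comes back with an empty mapping for that field would make a collapse on `cluster_uuid` fail with exactly the illegal_argument_exception in your log.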

Last Friday I upped the running detections to 160 and things still look good. All of the run times are under 1.5 seconds. We are currently at 30 workers per node and have left the poll interval at the default 3000 ms.
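For reference, and assuming the stock setting names, the worker count and poll interval we're running correspond to these kibana.yml entries:

```yaml
# kibana.yml – Task Manager tuning as described above
xpack.task_manager.max_workers: 30      # raised from the default of 10
xpack.task_manager.poll_interval: 3000  # ms; left at the default
```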

Great!

but the total fields are around 1200.

That's not too bad. It's when field counts go beyond roughly 10k that things can start to slow way down. The bigger problem is unbounded growth: the count just keeps increasing, and you never see a "steady" number of index fields even as the indices age out.
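One quick way to keep tabs on that, assuming your detection indices match `winlogbeat-*` (swap in your own pattern), is to compare each index's mappings against its field limit in Dev Tools:

```
# Per-index field limit, if one has been set explicitly
GET winlogbeat-*/_settings/index.mapping*

# Dump the mappings; the number of leaf fields per index is a
# rough proxy for how close each index is to its limit
GET winlogbeat-*/_mapping
```

If the newest indices consistently show about the same field count as the ones aging out, you have the "steady" state described above.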

I wouldn't hand trim these from their normal mappings but you could if you wanted to.

Also, we currently have 400 active indices. I need to go through these and eliminate the indices that can be eliminated, but I think this number is appropriate. Please correct me if I am off base here.

I would say that depends on your scale and what is going on. Each of those active indices is going to consume resources; generally, an unused index is better left off. If you have rules querying something like winlogbeat-* and that pattern pulls in a lot of indices, performance could suffer. It just depends.

I forgot to enable the slow logs. When I do that what time threshold do you think I should put them at?

I think the defaults for slowlog from the docs are pretty good.
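If you do want to set thresholds explicitly, the example values from the slow log docs are a sensible starting point. The index pattern below is an assumption matching this thread; adjust it to your own:

```
PUT winlogbeat-*/_settings
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "5s",
  "index.search.slowlog.threshold.query.debug": "2s",
  "index.search.slowlog.threshold.query.trace": "500ms",
  "index.search.slowlog.threshold.fetch.warn": "1s",
  "index.search.slowlog.threshold.fetch.info": "800ms"
}
```

Anything slower than a threshold gets logged at that level, so you can start loose (warn only) and tighten as you learn what "normal" looks like for your detections.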

Haha, yes I did, but it isn't a big deal. It just explained why what I was seeing in Filebeat was not a mapping explosion.

Things are still going great. I was going to enable more rules last week, but I have run into a separate issue where administering rules in the Security app is really difficult. I am having a hard time finding the rules I haven't yet duplicated, since you can't tag rules you have duplicated or filter out tag types.

Good to know on the 1200 fields, though. It seems to me, from the documentation and the thoughts you have shared, that we are in pretty good shape as far as indices and individual fields are concerned.

Thanks for the slow logs suggestion I will enable that as soon as I start running into issues again.

Last question: is there any way to exclude specific indices from a detection's query? All of our indices are named like "winlogbeat-7-xx-mm-dd-yyyy" and we are currently using "winlogbeat-*" to perform the query. Aside from some kind of regex, I can't really see a good way to query only our hot indices.

Thanks again!

Things are still going great.

Good good, I don't want to jinx you at this point :slight_smile:

Last question: is there any way to exclude specific indices from a detection's query? All of our indices are named like "winlogbeat-7-xx-mm-dd-yyyy" and we are currently using "winlogbeat-*" to perform the query.

I don't want to overreach with best practices; sometimes, depending on scale, people have to do things a certain way that might not seem like a best practice but really is. Personally I'm not a big expert on rollovers and ILM indices. I know in some cases you can use "subtraction" to remove indices, such as -someIndex-<pattern>, and in other areas you can actually use date math within index patterns.
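To make that concrete (the index names here are hypothetical, just matching your winlogbeat-7-... pattern), the two techniques look roughly like this in Dev Tools:

```
# "Subtraction": search every winlogbeat index except one
GET winlogbeat-*,-winlogbeat-7-01-01-2021/_search

# Date math resolves at request time; something like this would
# target only today's index (the <...> name has to be URL-encoded
# when sent over HTTP):
# <winlogbeat-7-{now/d{MM-dd-yyyy}}>
```

I'm not certain the Security rule index-pattern field accepts all of this, so I would test it in a plain search before wiring it into a rule.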

...but if things are going great for you, I wouldn't want to throw a wrench into this by leading you down a bad rabbit hole or making things more complicated. Keeping things as simple as you can is probably best long term.

A lot of things within Elasticsearch already do interesting optimizations and keep getting better, such as the improvements to APIs like field caps, which make some older advice obsolete.

I would say seeing things in the slow logs and concentrating on those if you want to further optimize or keep tabs on what is happening is your best bet.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.