Sorry for the delay. We upgraded to 7.16.2 and seemed to see an increase in performance. We started at 100 detections and stepped up by 10 detections every other day. I got to 150 detections before I started seeing issues again. It is a little different this time: the status of two of my Kibana instances is yellow because a large number of plugins are in a degraded state. Many of these degraded plugins show degradation in the security, task manager, alerting, and triggerActionsUi services.
As for heap usage, we have four Kibana servers and they are all mostly idle. CPU usage never gets higher than 10% and RAM usage never gets higher than 20%. As for the logs, the following is an example of what I am pulling out of kibana.log:
```
{"type":"log","@timestamp":"2022-02-16T21:12:52-07:00","tags":["info","plugins","security","audit","saved_objects_authorization_success"],"pid":12281,"eventType":"saved_objects_authorization_success","username":"1471545631","action":"update","types":["siem-detection-engine-rule-status"],"spaceIds":["default"],"args":{"type":"siem-detection-engine-rule-status","id":"11610110-897d-11ec-8186-059e05bba90f","attributes":{"statusDate":"2022-02-17T04:12:52.856Z","status":"going to run","lastFailureAt":"2022-02-09T07:51:18.697Z","lastSuccessAt":"2022-02-17T04:08:06.493Z","lastLookBackDate":null,"lastFailureMessage":"An error occurred during rule execution: message: \"[siem-detection-engine-rule-status:3e28f490-7e8e-11ec-9ee3-092b6bf9e58d]: version conflict, required seqNo [1252542], primary term [2]. current document has seqNo [1252555] and primary term [2]: version_conflict_engine_exception: [version_conflict_engine_exception] Reason: [siem-detection-engine-rule-status:3e28f490-7e8e-11ec-9ee3-092b6bf9e58d]: version conflict, required seqNo [1252542], primary term [2]. current document has seqNo [1252555] and primary term [2]\" name: \"Microsoft Build Engine Using an Alternate Name [Active]\" id: \"c490b870-7e8d-11ec-beaa-cde9595b607a\" rule id: \"4ae83156-36f0-492c-ac44-cbadfe2b26bf\" signals index: \".siem-signals-default\"","lastSuccessMessage":"succeeded"},"options":{"references":[{"id":"c490b870-7e8d-11ec-beaa-cde9595b607a","type":"alert","name":"alert_0"}]}},"message":"1471545631 authorized to [update] [siem-detection-engine-rule-status] in [default]"}
{"type":"log","@timestamp":"2022-02-16T21:12:52-07:00","tags":["info","plugins","security","audit","saved_objects_authorization_success"],"pid":12281,"eventType":"saved_objects_authorization_success","username":"1471545631","action":"update","types":["siem-detection-engine-rule-status"],"spaceIds":["default"],"args":{"type":"siem-detection-engine-rule-status","id":"11610110-897d-11ec-8186-059e05bba90f","attributes":{"statusDate":"2022-02-17T04:12:52.856Z","status":"going to run","lastFailureAt":"2022-02-09T07:51:18.697Z","lastSuccessAt":"2022-02-17T04:08:06.493Z","lastLookBackDate":null,"lastFailureMessage":"An error occurred during rule execution: message: \"[siem-detection-engine-rule-status:3e28f490-7e8e-11ec-9ee3-092b6bf9e58d]: version conflict, required seqNo [1252542], primary term [2]. current document has seqNo [1252555] and primary term [2]: version_conflict_engine_exception: [version_conflict_engine_exception] Reason: [siem-detection-engine-rule-status:3e28f490-7e8e-11ec-9ee3-092b6bf9e58d]: version conflict, required seqNo [1252542], primary term [2]. current document has seqNo [1252555] and primary term [2]\" name: \"Microsoft Build Engine Using an Alternate Name [Active]\" id: \"c490b870-7e8d-11ec-beaa-cde9595b607a\" rule id: \"4ae83156-36f0-492c-ac44-cbadfe2b26bf\" signals index: \".siem-signals-default\"","lastSuccessMessage":"succeeded"},"options":{"references":[{"id":"c490b870-7e8d-11ec-beaa-cde9595b607a","type":"alert","name":"alert_0"}]}},"message":"1471545631 authorized to [update] [siem-detection-engine-rule-status] in [default]"}
{"type":"log","@timestamp":"2022-02-16T21:13:04-07:00","tags":["info","plugins","security","audit","saved_objects_authorization_success"],"pid":12281,"eventType":"saved_objects_authorization_success","username":"1471545631","action":"update","types":["siem-detection-engine-rule-status"],"spaceIds":["default"],"args":{"type":"siem-detection-engine-rule-status","id":"c04eb870-8892-11ec-8186-059e05bba90f","attributes":{"statusDate":"2022-02-17T04:13:04.784Z","status":"going to run","lastFailureAt":"2022-02-08T03:54:00.417Z","lastSuccessAt":"2022-02-17T04:08:10.834Z","lastFailureMessage":"An error occurred during rule execution: message: \"[siem-detection-engine-rule-status:81ed0500-845a-11ec-b3f9-3d86fdb9e0a6]: version conflict, required seqNo [1030080], primary term [2]. current document has seqNo [1030083] and primary term [2]: version_conflict_engine_exception: [version_conflict_engine_exception] Reason: [siem-detection-engine-rule-status:81ed0500-845a-11ec-b3f9-3d86fdb9e0a6]: version conflict, required seqNo [1030080], primary term [2]. current document has seqNo [1030083] and primary term [2]\" name: \"PsExec Network Connection [Active]\" id: \"32e22790-791c-11ec-93b0-e73074d6e1cb\" rule id: \"a1108dc4-68f5-452d-8ce4-e7431774fef6\" signals index: \".siem-signals-default\"","lastSuccessMessage":"succeeded"},"options":{"references":[{"id":"32e22790-791c-11ec-93b0-e73074d6e1cb","type":"alert","name":"alert_0"}]}},"message":"1471545631 authorized to [update] [siem-detection-engine-rule-status] in [default]"}
```
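Since kibana.log is one JSON object per line, entries like these can be tallied by level and plugin to see how many are real errors and how many are info-level audit records. This is only a rough sketch: it assumes the `tags` array starts with the level, followed by `"plugins"` and the plugin name, as in the entries here. The function name and the sample lines are made up for illustration.

```python
import json
from collections import Counter


def tally_by_level_and_plugin(lines):
    """Count log entries per (level, plugin) pair from JSON log lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log entry
        if not isinstance(entry, dict):
            continue
        tags = entry.get("tags", [])
        level = tags[0] if tags else "unknown"
        # Assumed layout: ["<level>", "plugins", "<plugin>", ...]
        plugin = tags[2] if len(tags) > 2 and tags[1] == "plugins" else "unknown"
        counts[(level, plugin)] += 1
    return counts


# Hypothetical, heavily abridged lines in the same shape as real entries.
sample = [
    '{"type":"log","tags":["info","plugins","security","audit"],"message":"..."}',
    '{"type":"log","tags":["error","plugins","alerting"],"message":"..."}',
    '{"type":"log","tags":["error","plugins","alerting"],"message":"..."}',
    '{"type":"log","tags":["error","plugins","eventLog"],"message":"..."}',
]
counts = tally_by_level_and_plugin(sample)
for (level, plugin), n in sorted(counts.items()):
    print(level, plugin, n)
```

To run it against a real file, pass an open file handle instead of `sample`, since the function only needs an iterable of lines.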
There are quite a few more of these logs, and it looks like the sequence numbers on these documents have somehow gotten out of whack. So far a reboot seems to get things back in order, but I am curious how this happens. As these aren't actually error logs, I am not sure they have anything to do with this particular issue. Next:
```
{"type":"log","@timestamp":"2022-02-16T21:13:31-07:00","tags":["error","plugins","alerting"],"pid":12281,"message":"Executing Alert default:monitoring_shard_size:ab7631e0-1adc-11ec-99af-57fcc1269633 has resulted in Error: illegal_argument_exception: [illegal_argument_exception] Reason: node [xxxx] does not have the [remote_cluster_client] role, caused by: \"\""}
{"type":"log","@timestamp":"2022-02-16T21:13:31-07:00","tags":["error","plugins","alerting"],"pid":12281,"message":"Executing Alert default:monitoring_alert_nodes_changed:ab721330-1adc-11ec-99af-57fcc1269633 has resulted in Error: illegal_argument_exception: [illegal_argument_exception] Reason: node [xxxx] does not have the [remote_cluster_client] role, caused by: \"\""}
{"type":"log","@timestamp":"2022-02-16T21:21:05-07:00","tags":["error","plugins","eventLog"],"pid":3780,"message":"error setting existing \".kibana-event-log-7.13.3-000007\" index aliases - error setting existing index aliases for index .kibana-event-log-7.13.3-000007 to is_hidden: illegal_state_exception: [illegal_state_exception] Reason: alias [.kibana-event-log-7.13.3] has is_hidden set to true on indices [.kibana-event-log-7.13.3-000007] but does not have is_hidden set to true on indices [.kibana-event-log-7.13.3-000008,.kibana-event-log-7.13.3-000005,.kibana-event-log-7.13.3-000006]; alias must have the same is_hidden setting on all indices"}
```
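The first two look like a node role issue hit by a couple of the monitoring alerts (the node named in the message lacks the remote_cluster_client role), and the third is a mismatched is_hidden alias setting across the event log indices. I don't think either would have anything to do with what I have been seeing.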
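As for how version conflicts like the ones in the first set of logs can come about: Elasticsearch rejects a write that names a seqNo and primary term the document has since moved past, which is its optimistic concurrency check. The following is a toy Python sketch of that check, not Kibana or Elasticsearch code. `TinyStore`, `VersionConflict`, and `demo` are made-up names, the primary term is left out for brevity, and the two-writers scenario is an assumption about the cause rather than something the logs confirm.

```python
class VersionConflict(Exception):
    """Raised when a write's expected seq_no no longer matches the stored one."""


class TinyStore:
    """Toy in-memory store mimicking seq_no-based optimistic concurrency."""

    def __init__(self):
        self._docs = {}      # doc_id -> (seq_no, body)
        self._next_seq = 0   # one counter shared by all docs, like a shard

    def index(self, doc_id, body, if_seq_no=None):
        current = self._docs.get(doc_id)
        if if_seq_no is not None:
            current_seq = current[0] if current else None
            if current_seq != if_seq_no:
                raise VersionConflict(
                    f"required seqNo [{if_seq_no}], "
                    f"current document has seqNo [{current_seq}]"
                )
        seq_no = self._next_seq
        self._next_seq += 1
        self._docs[doc_id] = (seq_no, dict(body))
        return seq_no

    def get(self, doc_id):
        seq_no, body = self._docs[doc_id]
        return seq_no, dict(body)


def demo():
    store = TinyStore()
    store.index("rule-status-1", {"status": "succeeded"})

    # Two writers (for example, two Kibana instances) read the same version.
    seq_a, doc_a = store.get("rule-status-1")
    seq_b, doc_b = store.get("rule-status-1")

    # Writer A updates first; the document's seq_no moves forward.
    doc_a["status"] = "going to run"
    store.index("rule-status-1", doc_a, if_seq_no=seq_a)

    # Writer B still holds the old seq_no, so its write is rejected.
    doc_b["status"] = "failed"
    try:
        store.index("rule-status-1", doc_b, if_seq_no=seq_b)
    except VersionConflict as exc:
        return str(exc)
    return None


print(demo())
```

The usual way out of such a conflict is for the losing writer to re-read the document and retry, so an occasional conflict is not necessarily a sign of corrupted sequence numbers.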
I am going to have to get back to you on the disk I/O information. My gut says that things are going to be fine. All of the computation for these detections should be handled within Elasticsearch. Kibana should just handle the timing of when the queries get sent to the stack and then wait for a response. In my mind the SIEM should be fairly easy on the stack.
Sorry this got a little extensive.