Detection engine scheduler stuck after upgrade

j91321 · June 22, 2020, 8:07am

Hello,

I have encountered an issue where detection engine rule scheduling got stuck after an upgrade from 7.7.1 to 7.8. I have a setup with two Kibana nodes, and there were no other issues during the upgrade, but none of the rules were rescheduled to run afterwards. They stayed stuck over the weekend; the last run was reported correctly, with a time before the upgrade.

I had to deactivate and re-activate all rules to get them unstuck. I haven't seen this issue before, and I'm not even sure how to troubleshoot it. I checked the Kibana logs, but there were no entries from the detection engine or task manager.

Is this a known bug with the detection engine? I looked through the GitHub issues, but haven't found anything relevant.

j91321 · June 22, 2020, 10:28am

It seems I was a bit premature in thinking that deactivating and reactivating the rules would help. The rules get executed once, then they are stuck again.

One thing I noticed: when I go to Stack Management/Alerts, find any rule (even a newly created one) and try to open it, I get the following:

Based on logs, the error happens when accessing /api/alert/alert-id/state. I can confirm this error using the API:

curl -k -X GET "https://127.0.0.1/api/alert/b1f7c791-cfc7-458a-8f09-0f0371b7592a/state" -H 'Content-Type: application/json' -H 'kbn-xsrf: true' -u elastic
Enter host password for user 'elastic':
{"statusCode":404,"error":"Not Found","message":"Saved object [task/49d66770-b476-11ea-a909-bd7923da034c] not found"}
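When checking several rules this way, the response body can be inspected programmatically. The following is a minimal Python sketch, assuming only the message format visible in the response above; the helper name `missing_saved_object` is hypothetical and not part of any Kibana client.

```python
import json
import re

# Matches messages shaped like the 404 above: "Saved object [type/id] not found".
_MISSING = re.compile(r"Saved object \[([^/\]]+)/([^\]]+)\] not found")


def missing_saved_object(body: str):
    """Return (type, id) if the body is a 404 for a missing saved object, else None."""
    payload = json.loads(body)
    if payload.get("statusCode") != 404:
        return None
    match = _MISSING.search(payload.get("message", ""))
    return (match.group(1), match.group(2)) if match else None


body = (
    '{"statusCode":404,"error":"Not Found",'
    '"message":"Saved object [task/49d66770-b476-11ea-a909-bd7923da034c] not found"}'
)
print(missing_saved_object(body))
```

For the response above this reports a missing object of type `task`, which points at the task manager index rather than at the rule itself.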

Frank_Hassanabad · June 23, 2020, 3:26am

We haven't seen this yet anywhere. This is an unusual one. Typically an error such as this:

Saved object [task/49d66770-b476-11ea-a909-bd7923da034c] not found

indicates that the task was picked up and run, possibly by another Kibana instance, and then removed, or that it was manually deleted by something. If you go to the rule details screen, do you see any error messages or hints there?

j91321 · June 23, 2020, 7:39am

There were no error messages in the rule details screen.

There was a third Kibana instance that was used for one of our clients as a way to get multi-tenancy, set up way back before spaces existed. It had its own .kibana-client index and its own copy of the kibana_system role, but I guess the .kibana_task_manager-* indices were not separated and that may have broken it (although this instance didn't use SIEM or any of the alerting features).

The second thing I noticed: the reserved kibana_system role doesn't have any privileges for .kibana_task_manager-*. How can it write into this system index if there are no privileges?

I tried two things: shutting down the third Kibana instance that had the custom Kibana index, and creating a role that grants the kibana user all privileges on .kibana_task_manager-*. After I deactivated and reactivated the rules, task scheduling has been working, even after several hours.

Now I have removed the custom privilege for .kibana_task_manager-* from the kibana user and it's still working. So I guess the third instance was the problem, although I had no problems with it before 7.8.

Frank_Hassanabad · June 23, 2020, 3:32pm

Yeah, this could be causing the issues if you have a third Kibana instance that was using the same .kibana_task_manager index but a different .kibana index than the other two.

The Kibana task manager uses the .kibana_task_manager index to coordinate task sharing between all the Kibana instances. So if one of your Kibana instances has a different .kibana index for saved objects, you really do want to also change its task manager index so that it is not shared, or you might run into less tested territory as well as some unwanted side effects.
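A quick way to spot this kind of mismatch is to compare the index settings of every instance. The sketch below is only an illustration: it assumes the 7.x setting names `kibana.index` and `xpack.task_manager.index` and their defaults (check the documentation for your version), and the instance names and the helper `find_index_conflicts` are hypothetical.

```python
from collections import defaultdict


def find_index_conflicts(instances):
    """Return task manager indices shared by instances with different saved-object indices.

    `instances` maps an instance name to a dict of its settings. Missing keys
    fall back to the assumed 7.x defaults: ".kibana" and ".kibana_task_manager".
    """
    by_task_index = defaultdict(set)
    for settings in instances.values():
        saved_objects = settings.get("kibana.index", ".kibana")
        task_index = settings.get("xpack.task_manager.index", ".kibana_task_manager")
        by_task_index[task_index].add(saved_objects)
    # A conflict is one task index serving more than one saved-object index.
    return {t: sorted(s) for t, s in by_task_index.items() if len(s) > 1}


instances = {
    "kibana-1": {},
    "kibana-2": {},
    "kibana-client": {"kibana.index": ".kibana-client"},  # custom saved-object index only
}
conflicts = find_index_conflicts(instances)
print(conflicts)
```

With the setup described in this thread, the shared task manager index is reported as serving two different saved-object indices. Giving the third instance its own task manager index makes the result empty.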

Basically, a rule will activate and create a new task in the task manager index (.kibana_task_manager), indicating that any Kibana instance is allowed to "steal the task". Each Kibana instance has a poll cycle that periodically checks .kibana_task_manager to see if there is an available task it can pick up. If there is, that Kibana instance will mark the document as taken, typically by deleting the task.

That Kibana instance then executes the task, which would be a rule execution. If that rule's saved object does not exist, because that Kibana instance is searching within its own .kibana saved objects index and that index is disjoint from the others, you will end up with at least a failed rule execution.
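The failure mode described above can be illustrated with a toy model. This is a deliberately simplified sketch, not how Kibana implements polling or claiming: the function `run_poll_cycle`, the task fields, and the instance names are all hypothetical.

```python
def run_poll_cycle(instance_name, saved_objects, task_index):
    """One simplified poll: claim every idle task, then try to run it."""
    results = []
    for task in task_index:
        if task["status"] != "idle":
            continue  # already taken by another instance
        task["status"] = "claimed"  # toy claim; not Kibana's real mechanism
        task["owner"] = instance_name
        if task["rule_id"] in saved_objects:
            results.append((task["rule_id"], "executed"))
        else:
            # The rule lives in a saved-object index this instance cannot see.
            results.append((task["rule_id"], "failed: rule not found"))
    return results


main_saved_objects = {"rule-1"}  # stands in for the shared .kibana index
client_saved_objects = set()     # stands in for the separate .kibana-client index
shared_tasks = [{"rule_id": "rule-1", "status": "idle", "owner": None}]

# The instance with the disjoint saved-object index happens to poll first.
print(run_poll_cycle("kibana-client", client_saved_objects, shared_tasks))
# The instance that could have run the rule finds nothing left to claim.
print(run_poll_cycle("kibana-1", main_saved_objects, shared_tasks))
```

In this model the rule never runs on an instance that can actually load it, which matches the symptom of rules silently stalling.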

More technical information about task manager is here:

https://github.com/elastic/kibana/blob/master/x-pack/plugins/task_manager/server/README.md

# Kibana task manager

The task manager is a generic system for running background tasks. It supports:

- Single-run and recurring tasks
- Scheduling tasks to run after a specified datetime
- Basic retry logic
- Recovery of stalled tasks / timeouts
- Tracking task state across multiple runs
- Configuring the run-parameters for specific tasks
- Basic coordination to prevent the same task instance from running on more than one Kibana system at a time

## Implementation details

At a high-level, the task manager works like this:

- Every `{poll_interval}` milliseconds, check the `{index}` for any tasks that need to be run:
  - `runAt` is past
  - `attempts` is less than the configured threshold
- Attempt to claim the task by using optimistic concurrency to set:

This file has been truncated.
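The README excerpt mentions claiming a task with optimistic concurrency. As a generic illustration of that technique (not the task manager's actual code; the fields it sets are cut off in the excerpt above), here is a minimal in-memory sketch. `TaskStore`, `try_claim`, and the `owner` field are hypothetical names.

```python
class VersionConflict(Exception):
    """Raised when a document changed since the caller last read it."""


class TaskStore:
    """Toy in-memory store where every document carries a version number."""

    def __init__(self):
        self._docs = {}

    def create(self, task_id, doc):
        self._docs[task_id] = {"version": 1, "doc": dict(doc)}

    def get(self, task_id):
        entry = self._docs[task_id]
        return entry["version"], dict(entry["doc"])

    def update(self, task_id, doc, expected_version):
        entry = self._docs[task_id]
        if entry["version"] != expected_version:
            raise VersionConflict(task_id)  # someone else wrote first
        entry["doc"] = dict(doc)
        entry["version"] += 1


def try_claim(store, task_id, owner, seen_version, seen_doc):
    """Attempt to claim a task; succeed only if nobody updated it since we read it."""
    try:
        store.update(task_id, {**seen_doc, "owner": owner}, seen_version)
        return True
    except VersionConflict:
        return False


store = TaskStore()
store.create("task-1", {"owner": None})
# Two instances read the same version of the task during their poll.
v_a, doc_a = store.get("task-1")
v_b, doc_b = store.get("task-1")
print(try_claim(store, "task-1", "kibana-a", v_a, doc_a))  # first writer wins
print(try_claim(store, "task-1", "kibana-b", v_b, doc_b))  # stale version is rejected
```

The point of the design is that no lock is held while polling: both instances may read the task, but only the first write against the unchanged version succeeds.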

There are a variety of reasons why you might not have had problems previously. You could have gotten lucky, with the other Kibana instance sharing the task manager never picking up the alerting tasks before; now that you upgraded, your ordering/luck has changed and it started picking them up.

I think you did the right thing by separating the instance with the different saved object index from the shared .kibana_task_manager index, if you're trying to segregate data between instances.

j91321 · June 23, 2020, 5:48pm

Thanks for the great explanation and for pointing me to the task manager README. I'll definitely read through it, as I feel it's becoming a big part of Kibana.

system · July 21, 2020, 5:48pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
