Detection engine scheduler stuck after upgrade

Hello,

I have encountered an issue where detection engine rule scheduling got stuck after an upgrade from 7.7.1 to 7.8. I have a setup with two Kibana nodes, and there were no other issues during the upgrade, but none of the rules were rescheduled to run afterwards. They stayed stuck over the weekend; the last run was reported correctly, with a time before the upgrade.

I had to deactivate and re-activate all rules to get them unstuck. I haven't noticed this issue before and am not even sure how to troubleshoot it. I checked the Kibana logs, but there were no entries from the detection engine or the task manager.

Is this a known bug with the detection engine? I looked through the GitHub issues, but haven't found anything relevant.

It seems that I was a bit premature in thinking that deactivating and reactivating the rules would help. The rules get executed once and then they are stuck again.

One thing I noticed: when I go to Stack Management/Alerts, find any rule, even a newly created one, and try to open it, I get the following:

Based on logs, the error happens when accessing /api/alert/alert-id/state. I can confirm this error using the API:

curl -k -X GET "https://127.0.0.1/api/alert/b1f7c791-cfc7-458a-8f09-0f0371b7592a/state" -H 'Content-Type: application/json' -H 'kbn-xsrf: true' -u elastic
Enter host password for user 'elastic':
{"statusCode":404,"error":"Not Found","message":"Saved object [task/49d66770-b476-11ea-a909-bd7923da034c] not found"}
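The same check can be made one level lower, directly against Elasticsearch, to see whether the task document named in the error still exists. This is only a sketch under assumptions that are not from the thread: Elasticsearch is reachable at the hypothetical address in ES_URL, the task manager index has its default name, and saved objects are stored with an _id of the form "type:id". The helper names task_doc_url and check_task are made up for illustration.

```shell
#!/usr/bin/env bash
# Assumptions (not from the thread): Elasticsearch listens at ES_URL and the
# task manager index uses its default name, .kibana_task_manager.
ES_URL="${ES_URL:-https://127.0.0.1:9200}"
TASK_INDEX="${TASK_INDEX:-.kibana_task_manager}"

# Build the document URL for a task ID. Saved objects are stored with an
# _id of "<type>:<id>", so a task document has the _id "task:<task-id>".
task_doc_url() {
  printf '%s/%s/_doc/task:%s\n' "$ES_URL" "$TASK_INDEX" "$1"
}

# Fetch the task document. If it is gone, Elasticsearch answers with
# "found": false instead of the document source.
check_task() {
  curl -k -s -u elastic "$(task_doc_url "$1")"
}
```

Calling check_task 49d66770-b476-11ea-a909-bd7923da034c would then show whether the task referenced by the alert is really missing from the index, or whether Kibana is simply looking in a different place.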

We haven't seen this anywhere yet. This is an unusual one. Typically, an error such as this:

Saved object [task/49d66770-b476-11ea-a909-bd7923da034c] not found

indicates that the task was picked up and run, possibly by another Kibana instance, and then removed, or that it was manually deleted by something. If you go to the rule details screen, do you see any error messages or hints there?

There were no error messages in the rule details screen.

There was a third Kibana instance that was used for one of our clients as a way to have multi-tenancy, way back before spaces existed. It had its own .kibana-client index and its own copy of the kibana_system role, but I guess the .kibana_task_manager-* indices were not separated, and that may have broken it (although this instance didn't use SIEM or any of the alerting features).

The second thing I noticed: the reserved kibana_system role doesn't have any privileges for .kibana_task_manager-*. How can it write into this system index if there are no privileges?

I tried two things: shutting down the third Kibana instance that had the custom Kibana index, and creating a role which grants the kibana user all privileges on .kibana_task_manager-*. After I deactivated and reactivated the rules, task scheduling has been working, even after several hours.

Now I have removed the custom privilege for .kibana_task_manager-* from the kibana user and it's still working. So I guess the third instance was the problem, although I had no problems with it before 7.8.

Yeah, this could be causing the issues if you have a third Kibana instance that was using the same .kibana_task_manager index but a different .kibana index than the other two.

The Kibana task manager uses the .kibana_task_manager index to coordinate task sharing between all the Kibana instances. So if one of your Kibana instances has a different .kibana index for saved objects, you really do want to also change its task manager index so that it is not shared, or you might run into less tested territory as well as some unwanted side effects.
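As a rough illustration of what separating the two indices could look like for the third instance in 7.x, here is a kibana.yml fragment. The saved objects index name is the one mentioned earlier in the thread; the task manager index name is a hypothetical choice made for this sketch, and the role used by that instance would presumably also need privileges on whatever name is chosen.

```yaml
# kibana.yml for the separately tenanted (third) Kibana instance only.
# Saved objects index, as described in the thread:
kibana.index: ".kibana-client"
# Hypothetical name, chosen here so that this instance no longer shares
# the default .kibana_task_manager index with the other two nodes:
xpack.task_manager.index: ".kibana_task_manager-client"
```

The two clustered Kibana nodes keep their defaults, so they continue to share both their saved objects index and their task manager index with each other.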

Basically, a rule will activate and create a new task in the task manager index (.kibana_task_manager), indicating that any Kibana instance is allowed to "steal the task". Each Kibana instance has a poll cycle that periodically checks .kibana_task_manager to see if there is an available task it can pick up. If there is, that Kibana instance will mark the document as taken, typically by deleting the task.

That Kibana instance then executes the task, which would be a rule execution. If that rule's saved object does not exist, because that Kibana is searching within its own .kibana saved objects index and that index is disjoint from the others, you will end up with at least a failed rule execution.
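To see which instance is picking up the alerting tasks, one option is to list them straight from the task manager index. The sketch below assumes things the thread does not state: a hypothetical Elasticsearch address in ES_URL, the default index name, task fields stored under a "task" object (taskType, status, ownerId, runAt, attempts), and alerting task types that start with "alerting:". The function names are invented for this example.

```shell
#!/usr/bin/env bash
# Assumptions (not from the thread): Elasticsearch listens at ES_URL and the
# task manager index uses its default name, .kibana_task_manager.
ES_URL="${ES_URL:-https://127.0.0.1:9200}"
TASK_INDEX="${TASK_INDEX:-.kibana_task_manager}"

# Print the search body: every task whose type starts with "alerting:",
# returning only the fields that show who claimed it and when it runs next.
alerting_tasks_query() {
  cat <<'EOF'
{
  "size": 100,
  "_source": ["task.taskType", "task.status", "task.ownerId", "task.runAt", "task.attempts"],
  "query": { "prefix": { "task.taskType": "alerting:" } }
}
EOF
}

# Run the search against the task manager index.
list_alerting_tasks() {
  curl -k -s -u elastic -X GET "$ES_URL/$TASK_INDEX/_search?pretty" \
    -H 'Content-Type: application/json' \
    -d "$(alerting_tasks_query)"
}
```

Running list_alerting_tasks a few times in a row should make it visible if tasks are being claimed by an owner that does not belong to the two expected nodes, or if runAt never moves forward.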

More technical information about task manager is here:

There could be a variety of reasons why you did not have problems previously. You could have gotten lucky: the other Kibana instance sharing the task manager may never have picked up the alerting tasks before, and now that you have upgraded, the ordering (or your luck) has changed and it started picking them up.

I think you did the right thing by disjoining the .kibana_task_manager index from the saved object index, since they were shared, if you're trying to segregate data between instances.

Thanks for the great explanation and for pointing me to the task manager README. I'll definitely read through it, as I feel it's becoming a big part of Kibana.