My Kibana servers have been acting up for a while now, and I was hoping someone here might have some insight. So far the only advice I have really gotten is to add more resources. As it stands I have 4 virtualized Kibana servers, each with 6 GB dedicated to the Kibana process (all Kibana and Elasticsearch nodes are on version 7.16.2). From my knowledge of Kibana this should be more than enough for my use case.
We have what I would consider a medium-to-large stack: 3 master nodes, 9 data nodes, and 2 ML nodes, and we ingest a fair amount of data on a daily basis. That being said, there really isn't much task saturation on the Kibana nodes. We were running 152 detections when the issue most recently started.
According to the status page, two of the Kibana nodes are in a yellow state, and according to Task Manager their status is warn, but it doesn't look like there are any long-running tasks on them. The two Kibana nodes that are green actually show longer-running tasks, yet their health is not affected. It almost seems like tasks aren't getting executed on the two yellow servers because their state is degraded.
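For anyone who wants to compare the nodes side by side, here is a minimal sketch that asks each Kibana node for its Task Manager health and reports which sections are not OK. It assumes the `/api/task_manager/_health` endpoint is reachable on every node and that the response has a top-level `status` plus a `stats` object whose sections (for example `workload` or `runtime`) each carry their own `status`. The function names, node URLs and credentials are placeholders I made up, not anything from my actual setup.

```python
import base64
import json
import urllib.request


def fetch_json(url, username=None, password=None, timeout=10):
    """GET a URL and decode the JSON body, optionally with basic auth."""
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    if username is not None:
        token = base64.b64encode(f"{username}:{password}".encode()).decode()
        request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def summarize_task_manager_health(health):
    """Return the overall status and every stats section that is not OK.

    Assumes the shape {"status": ..., "stats": {section: {"status": ...}}}.
    """
    flagged = {}
    for name, section in health.get("stats", {}).items():
        if isinstance(section, dict) and section.get("status") not in (None, "OK"):
            flagged[name] = section["status"]
    return {"status": health.get("status"), "flagged_sections": flagged}


def check_nodes(base_urls, username=None, password=None):
    """Summarize Task Manager health for each Kibana base URL (hypothetical hosts)."""
    results = {}
    for base_url in base_urls:
        url = base_url.rstrip("/") + "/api/task_manager/_health"
        results[base_url] = summarize_task_manager_health(
            fetch_json(url, username, password)
        )
    return results
```

Calling `check_nodes(["http://kibana-1:5601", "http://kibana-2:5601"])` with your own hosts should show whether the yellow nodes are flagged in the same sections, which at least narrows down where to look.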
I have also been looking at the status page that lists all the plugins. The first half of the page is fine: the plugins run from a to z and they are all green. Then, roughly halfway down the page, the plugins seem to run from a to z again, almost as if they are coming from a different list, and almost all of those plugins are degraded. Each one lists the degraded services it relies on, and a few services come up frequently: security, task manager, cloud, and a couple of others. So I am curious if anyone has any more info on those plugins and the services that they rely on. I have tons of clues, but the documentation does not give me any way of linking all of this together.
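To link the degraded plugins back to the services they blame, something like the following sketch might help. It assumes the status JSON (for example from `/api/status`, possibly with `?v8format=true` on 7.x) has `status.core` and `status.plugins` objects whose entries carry a `level` and a `summary`, and that a degraded plugin's summary names the services it depends on. That response shape and the helper names are assumptions on my part, and matching names inside the summary text is only a heuristic.

```python
import re
from collections import Counter


def non_available(status_response):
    """Map each core service or plugin that is not 'available' to its summary."""
    status = status_response.get("status", {})
    found = {}
    for group in ("core", "plugins"):
        for name, entry in status.get(group, {}).items():
            if entry.get("level") != "available":
                found[name] = entry.get("summary", "")
    return found


def blamed_dependencies(status_response):
    """Count how often each service name appears in other entries' summaries."""
    status = status_response.get("status", {})
    names = set(status.get("core", {})) | set(status.get("plugins", {}))
    counts = Counter()
    for plugin, summary in non_available(status_response).items():
        for name in names:
            # Whole-word match so 'security' does not match 'securitySolution'.
            if name != plugin and re.search(r"\b" + re.escape(name) + r"\b", summary):
                counts[name] += 1
    return counts
```

The names with the highest counts from `blamed_dependencies(...).most_common()` would be the first candidates for a root cause, since everything downstream of them inherits the degraded state.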
Finally, I have looked through the logs pretty exhaustively and there aren't really any clues there yet. There isn't anything egregious, at least. I haven't enabled trace or debug logs yet. The other thing that is really frustrating to me is that all of my servers are chilling: none of them ever have CPU or RAM usage above 20%.
I have seen a few open GitHub issues that dance around this subject of Task Manager being unhelpful, but I haven't seen any major change that makes these problems easier to resolve. The health reporting is nice because it lets you know there is an issue, but I don't have access to enough information to troubleshoot this in an efficient manner.