[7.14.1] Linearly increasing latency in task manager index searches

We're migrating from a 6.8 cluster to a 7.14 cluster and we're seeing some abnormal behavior with Kibana's task manager index. Sorry if this is the wrong forum.

For context, this is a 3-node ES v7.14.1 cluster that as of now we're only writing data to. We noticed that the Search Latency of some nodes was increasing linearly, despite having no clients doing any searches. After checking the indices, we pinned it down to Kibana's Task Manager index. These are the index metrics for the last week:

image

image

We also see that the index currently takes up over 200 MB on disk, and it keeps growing:
image

Using the cat indices API, we see that the index has 15 docs and more than 600k deleted docs that have not been merged away yet. I guess those deletes are the reason why the searches take progressively longer?

health status index                           uuid pri rep docs.count docs.deleted store.size pri.store.size
green  open   .kibana_task_manager_7.14.1_001 UUID   1   1         15       628919    270.3mb        135.1mb
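To put a number on how lopsided that is, here is a minimal Python sketch that parses the line above and computes the share of deleted docs. It assumes the default `_cat/indices` column order (health, status, index, uuid, pri, rep, docs.count, docs.deleted, store.size, pri.store.size); the `IndexStats` type and both function names are made up for illustration.

```python
from typing import NamedTuple


class IndexStats(NamedTuple):
    index: str
    docs_count: int
    docs_deleted: int
    store_size: str


def parse_cat_indices_line(line: str) -> IndexStats:
    # Assumes the default _cat/indices columns, split on whitespace:
    # health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
    fields = line.split()
    return IndexStats(
        index=fields[2],
        docs_count=int(fields[6]),
        docs_deleted=int(fields[7]),
        store_size=fields[8],
    )


def deleted_ratio(stats: IndexStats) -> float:
    # Fraction of all docs (live + deleted) that are deleted but not yet merged away.
    total = stats.docs_count + stats.docs_deleted
    return stats.docs_deleted / total if total else 0.0


line = (
    "green  open   .kibana_task_manager_7.14.1_001 UUID   1   1"
    "         15       628919    270.3mb        135.1mb"
)
stats = parse_cat_indices_line(line)
print(f"{stats.index}: {deleted_ratio(stats):.4%} of docs are deleted")
```

In other words, practically everything stored in the index is a deleted doc.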

We could force merge to remove the deleted docs, but it seems a bit silly that an index with 15 docs is taking up over 270 MB of disk, with search requests taking over 100ms.
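For reference, this is roughly the force merge call we had in mind, sketched in Python. It only builds the request and does not send it. The `only_expunge_deletes` parameter is part of the force merge API; the `http://localhost:9200` address, the lack of authentication, and the helper name are assumptions for the example.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_expunge_deletes_request(base_url: str, index: str) -> Request:
    # POST /<index>/_forcemerge?only_expunge_deletes=true
    # asks Elasticsearch to merge only segments that contain deleted docs.
    query = urlencode({"only_expunge_deletes": "true"})
    url = f"{base_url.rstrip('/')}/{quote(index, safe='')}/_forcemerge?{query}"
    return Request(url, method="POST")


# Hypothetical local node; a real cluster would also need auth and TLS settings.
req = build_expunge_deletes_request(
    "http://localhost:9200", ".kibana_task_manager_7.14.1_001"
)
print(req.get_method(), req.full_url)
# Sending it would be: urllib.request.urlopen(req)
```

Since Task Manager keeps updating the same documents, this would only be a temporary fix: the deleted docs would start piling up again right after the merge.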

We'd like to know a) whether this is expected behavior and b) whether there's any setting that can be adjusted to improve the performance of this index, if needed. It might not be an issue, but we haven't seen anything similar in our current ES 6.8 cluster and we want to make sure we're good to go before migrating the search clients.

Hi @spiqueras,

This does not sound like expected behavior. Can you provide the output of the Task manager health API? That will provide us with an idea of the task types that are running.

We did add a cleanup in 7.14 that, on upgrade, deletes task documents left behind by failed actions (these are related to rules and connectors, and could sometimes build up prior to 7.14). But since you are coming from 6.8, it doesn't seem like you should have any rules running. Can you confirm whether or not you are using rules and connectors?

Thanks,

Hi @ying.mao , thanks for the quick response. You'll find the response of the Task Manager Health API in the following gist.

Regarding your comment, sorry if I wasn't clear. We're not upgrading the 6.8 cluster directly; we're running both clusters in parallel on different machines so we can test for regressions. Nevertheless, I can confirm we haven't set up any rules or connectors in the new cluster.

In the meantime, we have added two more nodes to the cluster. This is how the Task Manager Index metrics look right now:

image

image

The discontinuities you see in the first chart roughly correspond to the moments when we added each of the nodes. Request time and latency go down after each addition, but later exhibit a similar increasing pattern. The index size on disk is also increasing.

Other metrics of the same index show erratic behavior:

image

image

image

Now that I think of it, we run a Kibana instance for each Elasticsearch node (right now it's 5 ES nodes and 5 Kibana instances). This setup works well in our 6.8 cluster, but I wonder if the recent changes to Task Manager are somehow interfering with our setup. It's also a bit weird that the Task Manager Health API reports that the number of observed_kibana_instances is 1, when the interface reports 5 instances as expected, so there might be something misconfigured there.

Thanks again for your time

Hi @spiqueras,

Thanks for providing the task manager health output. Everything looks fairly normal to me. Is the task manager search latency affecting the rest of your cluster? Or is your concern driven primarily by the charts?

Task manager works by claiming and running tasks at a specified poll interval (default 3 seconds). Every time it makes a claim or updates a task status (2-3 times per task per run), it updates the underlying document in the task manager index. Since an update in Elasticsearch actually creates a new document and deletes the old one, this accounts for the number of tombstoned documents you are seeing in the index. Task manager is the only feature in Kibana that uses the task manager index, and the search latency, while increasing, is still on the order of milliseconds. So as long as you are not seeing performance issues with the rest of your cluster, I don't think this is something to be concerned about.
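To illustrate the update-as-delete-plus-add behavior, here is a toy Python model. It is not how Kibana or Lucene is implemented. The `ToyIndex` class, the number of runs, and the assumption that all 15 tasks run in every round are made up for the example (real tasks each have their own schedule).

```python
class ToyIndex:
    """Toy model of an index where a doc is never modified in place."""

    def __init__(self) -> None:
        self.live = {}    # doc id -> current version number
        self.deleted = 0  # old versions marked deleted, awaiting a merge

    def index(self, doc_id: int) -> None:
        if doc_id in self.live:
            # An update writes a new version and marks the old one deleted.
            self.deleted += 1
            self.live[doc_id] += 1
        else:
            self.live[doc_id] = 1

    def merge(self) -> None:
        # A merge rewrites segments without the deleted versions.
        self.deleted = 0


TASKS = 15           # live docs reported for the index in this thread
UPDATES_PER_RUN = 3  # upper end of "2-3 times per task per run"
RUNS = 100           # hypothetical number of runs per task

idx = ToyIndex()
for task_id in range(TASKS):
    idx.index(task_id)  # initial creation, nothing is deleted yet

for _run in range(RUNS):
    for task_id in range(TASKS):
        for _update in range(UPDATES_PER_RUN):
            idx.index(task_id)

print(f"live docs: {len(idx.live)}, deleted docs: {idx.deleted}")
```

The live count never moves from 15, while the deleted count grows with every update until a merge clears it, which matches the shape of what you are seeing.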

Re: the number of observed Kibana instances. This is an experimental capacity estimation we are trying out, and it uses the server UUIDs that we see in the task manager index. Since you have very few tasks running, it seems that a single Kibana instance is picking up all the tasks, so only 1 UUID is seen in the index, leading to the observed value of 1.
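Conceptually, the estimate boils down to counting distinct server UUIDs across the task documents. A minimal Python sketch of the idea, not the actual Kibana code; the `owner_id` key and the sample documents are hypothetical:

```python
def observed_instances(task_docs: list) -> int:
    # Count the distinct server UUIDs that currently own a task.
    # Docs without an owner (unclaimed tasks) are ignored.
    return len({doc["owner_id"] for doc in task_docs if doc.get("owner_id")})


# Hypothetical snapshot: five instances are running, but one of them
# happens to have claimed every task that has an owner.
docs = [
    {"task": "a", "owner_id": "uuid-1"},
    {"task": "b", "owner_id": "uuid-1"},
    {"task": "c", "owner_id": None},
]
print(observed_instances(docs))
```

An instance that is up but has not claimed anything leaves no trace in the index, so it is not counted.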

Hope that helps!

Hi @ying.mao ,

Thanks for the explanation. It makes sense that the number of deleted documents is that high. It's a bit weird that ES/Lucene isn't reclaiming that space with more frequent merges, but we haven't tested whether this has a noticeable impact on the overall cluster performance (and we don't really have a baseline to compare against other than the old cluster, which is running on different hardware).

We'll keep monitoring the performance as we move more clients to the new cluster and let you know if there's anything off.
