KQL Value Suggestions are killing my cluster

We have a cluster of 7 nodes (3 hot / 4 warm) hosting our log data in daily indexes.
The cluster contains about 180 indexes with 3 shards each. The total document count is about 9.3 billion, with 50M documents per index on average.

We noticed that browsing logs using Kibana Discover with KQL imposes a huge load on the cluster, caused by value suggestions - even when the selected time range is very short (the last hour, for instance).

The load appears while the user is writing a KQL query, as soon as they start entering the value part of the criteria. At that moment we see the following entries in the slow query log on each node:

[2019-08-23T17:06:05,612][TRACE][i.s.s.query              ] [es03] [gas-logstash-2019.03.10][0] took[762.6ms], took_millis[762], total_hits[10000+ hits], types[], stats[], search_type[QUERY_THEN_FETCH], total_shards[579], source[{"size":0,"timeout":"1000ms","terminate_after":100000,"query":{"match_all":{"boost":1.0}},"aggregations":{"suggestions":{"terms":{"field":"level","size":10,"shard_size":10,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"execution_hint":"map","order":[{"_count":"desc"},{"_key":"asc"}],"include":".*"}}}}], id[], 
[2019-08-23T17:06:05,814][TRACE][i.s.s.query              ] [es03] [gas-logstash-2019.02.25][0] took[964.8ms], took_millis[964], total_hits[10000+ hits], types[], stats[], search_type[QUERY_THEN_FETCH], total_shards[579], source[{"size":0,"timeout":"1000ms","terminate_after":100000,"query":{"match_all":{"boost":1.0}},"aggregations":{"suggestions":{"terms":{"field":"level","size":10,"shard_size":10,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"execution_hint":"map","order":[{"_count":"desc"},{"_key":"asc"}],"include":".*"}}}}], id[], 
[2019-08-23T17:06:05,894][TRACE][i.s.s.query              ] [es03] [gas-logstash-2019.03.04][0] took[1s], took_millis[1044], total_hits[10000+ hits], types[], stats[], search_type[QUERY_THEN_FETCH], total_shards[579], source[{"size":0,"timeout":"1000ms","terminate_after":100000,"query":{"match_all":{"boost":1.0}},"aggregations":{"suggestions":{"terms":{"field":"level","size":10,"shard_size":10,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"execution_hint":"map","order":[{"_count":"desc"},{"_key":"asc"}],"include":".*"}}}}], id[], 
[2019-08-23T17:06:06,032][TRACE][i.s.s.query              ] [es03] [gas-logstash-2019.03.05][0] took[1.1s], took_millis[1183], total_hits[10000+ hits], types[], stats[], search_type[QUERY_THEN_FETCH], total_shards[579], source[{"size":0,"timeout":"1000ms","terminate_after":100000,"query":{"match_all":{"boost":1.0}},"aggregations":{"suggestions":{"terms":{"field":"level","size":10,"shard_size":10,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"execution_hint":"map","order":[{"_count":"desc"},{"_key":"asc"}],"include":".*"}}}}], id[], 
[2019-08-23T17:06:06,213][TRACE][i.s.s.query              ] [es03] [gas-logstash-2019.03.06][0] took[1.3s], took_millis[1364], total_hits[10000+ hits], types[], stats[], search_type[QUERY_THEN_FETCH], total_shards[579], source[{"size":0,"timeout":"1000ms","terminate_after":100000,"query":{"match_all":{"boost":1.0}},"aggregations":{"suggestions":{"terms":{"field":"level","size":10,"shard_size":10,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"execution_hint":"map","order":[{"_count":"desc"},{"_key":"asc"}],"include":".*"}}}}], id[], 
[2019-08-23T17:06:06,802][TRACE][i.s.s.query              ] [es03] [gas-logstash-2019.03.11][0] took[1.1s], took_millis[1187], total_hits[10000+ hits], types[], stats[], search_type[QUERY_THEN_FETCH], total_shards[579], source[{"size":0,"timeout":"1000ms","terminate_after":100000,"query":{"match_all":{"boost":1.0}},"aggregations":{"suggestions":{"terms":{"field":"level","size":10,"shard_size":10,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"execution_hint":"map","order":[{"_count":"desc"},{"_key":"asc"}],"include":".*"}}}}], id[], 
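
For readability, here is the query body from the source[...] above as a Dev Tools request. The index pattern gas-logstash-* is my assumption (the slow log only shows the individual daily indices, and total_shards is 579); the body is copied from the slow log, with a few default parameters omitted:

```
# Note: no time-range filter - it's a match_all over the whole pattern
POST /gas-logstash-*/_search
{
  "size": 0,
  "timeout": "1000ms",
  "terminate_after": 100000,
  "query": { "match_all": {} },
  "aggregations": {
    "suggestions": {
      "terms": {
        "field": "level",
        "size": 10,
        "shard_size": 10,
        "execution_hint": "map",
        "order": [ { "_count": "desc" }, { "_key": "asc" } ],
        "include": ".*"
      }
    }
  }
}
```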

We can see dozens of such entries on each node, primarily on the warm nodes as they have slower disks and host more indexes. The situation gets worse as the user continues typing the value, since a new query seems to be sent for every character typed, increasing the load on the cluster even more.

As far as I understand, Kibana doesn't take the selected time frame into account when building its value suggestions. Therefore the entire dataset (9+ billion docs) is queried every time for suggestions... Warm nodes are of course not "designed" to handle such a load :frowning:
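
For comparison, this is roughly the kind of query I would have expected when only the last hour is selected - just a sketch assuming a @timestamp field, not something Kibana actually sends:

```
POST /gas-logstash-*/_search
{
  "size": 0,
  "timeout": "1000ms",
  "terminate_after": 100000,
  "query": {
    "bool": {
      "filter": [
        { "range": { "@timestamp": { "gte": "now-1h", "lte": "now" } } }
      ]
    }
  },
  "aggregations": {
    "suggestions": {
      "terms": { "field": "level", "size": 10 }
    }
  }
}
```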

We have disabled value suggestions for the moment, until we find a better solution.

Is this expected?
Does someone else experience the same behaviour?
Should we configure/tune the cluster differently?

Thanks,
/Bertrand

Same here - killing warm nodes

Is there anybody from the Kibana team here who can give some advice?

This was addressed in a recent release - https://github.com/elastic/kibana/pull/37643

I hope this helps - I understand how annoying it is.

@mattkime The symptoms I described here are for ES 7.3.0 - AFAIK that version already includes the feature you are referring to.
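
If that PR is the one that exposed the autocomplete limits as kibana.yml settings, then lowering them is probably the only knob we have right now. A sketch, using the setting names from the 7.x docs (the defaults appear to match the timeout and terminate_after we see in our slow log; the values below are only illustrative, not a recommendation):

```
# kibana.yml - autocomplete limits (7.x setting names; values illustrative)
kibana.autocompleteTimeout: 500           # default 1000 (ms) per suggestion request
kibana.autocompleteTerminateAfter: 20000  # default 100000 docs examined per shard
```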

I can see that most requests are timing out. However, they still hit the shards before they are cancelled after one second (the timeout set by Kibana). It looks like the timeout starts counting once request execution has started, and does not take into account the time spent in the queue waiting for an executor. The request sent by Kibana is broken into multiple "sub" requests (one per shard), and it seems that Kibana must wait for all of them to succeed or time out before returning. As a result, Kibana is blocked for 30s or more in our case...
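
To see the queueing while someone types in the query bar, watching the search thread pools is enough (standard cat API, nothing specific to our setup):

```
GET _cat/thread_pool/search?v&h=node_name,active,queue,rejected
```

If the analysis above is right, the queue column should climb on the warm nodes as soon as the suggestions start firing.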

Hello @Bertrand

Using autocomplete with large amounts of data is certainly troublesome and it's difficult to find a solution that works for everyone.

You're describing query execution problems. We could examine whether there are options for improving your specific use case.

Here's a github issue debating various options and tradeoffs - https://github.com/elastic/kibana/issues/15887

Hi @mattkime

Our use case and symptoms are exactly what is described in that github issue. Timings are similar too, and we were also surprised (as reported in the issue) that the 1s timeout set on the query did not have the expected effect...

The github issue has been open since early 2018... Should we understand that there is no hope of making things better?

I'm afraid the KQL suggestion feature is not usable as it is for anybody storing log data in ES. I have seen quite a few posts recently on the forum where people complain about sudden CPU spikes reaching 100% for some time. In some cases this might be caused by the KQL value suggestions feature, which is enabled by default (as far as I remember)...

Are you aware of anything we can tune to make it better?

Discussing this with a colleague

@mattkime Did you come up with something?

Given that Kibana is often used in very large deployments with thousands of indices and shards, it would be interesting to learn how Kibana is performance tested at scale.

I have re-started the discussion here: https://github.com/elastic/kibana/issues/15887#issuecomment-528598601

