KQL Value Suggestions are killing my cluster

We have a cluster of 7 nodes (3 hot / 4 warm) hosting our log data in daily indexes.
The cluster contains about 180 indexes with 3 shards each. The total document count is about 9.3 billion, with 50M documents per index on average.

We noticed that browsing logs using Kibana Discover with KQL imposes a huge load on the cluster, caused by value suggestions - even when the selected time range is very short (the last hour, for instance).

The load appears while the user is writing a KQL query, as soon as they start entering the value part of the criteria. At that moment we see the following entries in the slow query log on each node:

[2019-08-23T17:06:05,612][TRACE][i.s.s.query              ] [es03] [gas-logstash-2019.03.10][0] took[762.6ms], took_millis[762], total_hits[10000+ hits], types[], stats[], search_type[QUERY_THEN_FETCH], total_shards[579], source[{"size":0,"timeout":"1000ms","terminate_after":100000,"query":{"match_all":{"boost":1.0}},"aggregations":{"suggestions":{"terms":{"field":"level","size":10,"shard_size":10,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"execution_hint":"map","order":[{"_count":"desc"},{"_key":"asc"}],"include":".*"}}}}], id[], 
[2019-08-23T17:06:05,814][TRACE][i.s.s.query              ] [es03] [gas-logstash-2019.02.25][0] took[964.8ms], took_millis[964], total_hits[10000+ hits], types[], stats[], search_type[QUERY_THEN_FETCH], total_shards[579], source[{"size":0,"timeout":"1000ms","terminate_after":100000,"query":{"match_all":{"boost":1.0}},"aggregations":{"suggestions":{"terms":{"field":"level","size":10,"shard_size":10,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"execution_hint":"map","order":[{"_count":"desc"},{"_key":"asc"}],"include":".*"}}}}], id[], 
[2019-08-23T17:06:05,894][TRACE][i.s.s.query              ] [es03] [gas-logstash-2019.03.04][0] took[1s], took_millis[1044], total_hits[10000+ hits], types[], stats[], search_type[QUERY_THEN_FETCH], total_shards[579], source[{"size":0,"timeout":"1000ms","terminate_after":100000,"query":{"match_all":{"boost":1.0}},"aggregations":{"suggestions":{"terms":{"field":"level","size":10,"shard_size":10,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"execution_hint":"map","order":[{"_count":"desc"},{"_key":"asc"}],"include":".*"}}}}], id[], 
[2019-08-23T17:06:06,032][TRACE][i.s.s.query              ] [es03] [gas-logstash-2019.03.05][0] took[1.1s], took_millis[1183], total_hits[10000+ hits], types[], stats[], search_type[QUERY_THEN_FETCH], total_shards[579], source[{"size":0,"timeout":"1000ms","terminate_after":100000,"query":{"match_all":{"boost":1.0}},"aggregations":{"suggestions":{"terms":{"field":"level","size":10,"shard_size":10,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"execution_hint":"map","order":[{"_count":"desc"},{"_key":"asc"}],"include":".*"}}}}], id[], 
[2019-08-23T17:06:06,213][TRACE][i.s.s.query              ] [es03] [gas-logstash-2019.03.06][0] took[1.3s], took_millis[1364], total_hits[10000+ hits], types[], stats[], search_type[QUERY_THEN_FETCH], total_shards[579], source[{"size":0,"timeout":"1000ms","terminate_after":100000,"query":{"match_all":{"boost":1.0}},"aggregations":{"suggestions":{"terms":{"field":"level","size":10,"shard_size":10,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"execution_hint":"map","order":[{"_count":"desc"},{"_key":"asc"}],"include":".*"}}}}], id[], 
[2019-08-23T17:06:06,802][TRACE][i.s.s.query              ] [es03] [gas-logstash-2019.03.11][0] took[1.1s], took_millis[1187], total_hits[10000+ hits], types[], stats[], search_type[QUERY_THEN_FETCH], total_shards[579], source[{"size":0,"timeout":"1000ms","terminate_after":100000,"query":{"match_all":{"boost":1.0}},"aggregations":{"suggestions":{"terms":{"field":"level","size":10,"shard_size":10,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"execution_hint":"map","order":[{"_count":"desc"},{"_key":"asc"}],"include":".*"}}}}], id[], 
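
For readability, here is the query body from the source[...] above as a Dev Tools request. The index pattern gas-logstash-* is my assumption (the slow log only shows the individual daily indices, and total_shards is 579); the body is copied from the slow log, with a few default parameters omitted:

```
# Note: no time-range filter - it's a match_all over the whole pattern
POST /gas-logstash-*/_search
{
  "size": 0,
  "timeout": "1000ms",
  "terminate_after": 100000,
  "query": { "match_all": {} },
  "aggregations": {
    "suggestions": {
      "terms": {
        "field": "level",
        "size": 10,
        "shard_size": 10,
        "execution_hint": "map",
        "order": [ { "_count": "desc" }, { "_key": "asc" } ],
        "include": ".*"
      }
    }
  }
}
```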

We can see dozens of such entries on each node, primarily on the warm nodes as they have slower disks and host more indexes. The situation gets worse as the user continues typing the value, since a new query seems to be sent for every character typed, increasing the load on the cluster even more.

As far as I understand, Kibana doesn't take the selected time frame into account when building its value suggestions. Therefore the entire dataset (9+ billion docs) is queried every time for suggestions... Warm nodes are of course not "designed" to handle such a load :frowning:
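
For comparison, this is roughly the kind of query I would have expected when only the last hour is selected - just a sketch assuming a @timestamp field, not something Kibana actually sends:

```
POST /gas-logstash-*/_search
{
  "size": 0,
  "timeout": "1000ms",
  "terminate_after": 100000,
  "query": {
    "bool": {
      "filter": [
        { "range": { "@timestamp": { "gte": "now-1h", "lte": "now" } } }
      ]
    }
  },
  "aggregations": {
    "suggestions": {
      "terms": { "field": "level", "size": 10 }
    }
  }
}
```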

We have disabled value suggestions for the moment, until we find a better solution.

Is this expected?
Does someone else experience the same behaviour?
Should we configure/tune the cluster differently?

Thanks,
/Bertrand

Same here - killing warm nodes

Is there anybody from the Kibana team here who can give some advice?

This was addressed in a recent release - https://github.com/elastic/kibana/pull/37643

I hope this helps - I understand how annoying it is.

@mattkime The symptoms I described here are for ES 7.3.0 - AFAIK that version already includes the feature you are referring to.
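
If that PR is the one that exposed the autocomplete limits as kibana.yml settings, then lowering them is probably the only knob we have right now. A sketch, using the setting names from the 7.x docs (the defaults appear to match the timeout and terminate_after we see in our slow log; the values below are only illustrative, not a recommendation):

```
# kibana.yml - autocomplete limits (7.x setting names; values illustrative)
kibana.autocompleteTimeout: 500           # default 1000 (ms) per suggestion request
kibana.autocompleteTerminateAfter: 20000  # default 100000 docs examined per shard
```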

I can see that most requests are timing out. However, they still hit the shards before they are cancelled after one second (the timeout set by Kibana). It looks like the timeout starts counting once request execution has started, and does not take into account the time spent in the queue waiting for an executor. The request sent by Kibana is broken into multiple "sub" requests (one per shard), and it seems that Kibana must wait for all of them to succeed or time out before returning. As a result, Kibana is blocked for 30s or more in our case...
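
To see the queueing while someone types in the query bar, watching the search thread pools is enough (standard cat API, nothing specific to our setup):

```
GET _cat/thread_pool/search?v&h=node_name,active,queue,rejected
```

If the analysis above is right, the queue column should climb on the warm nodes as soon as the suggestions start firing.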

Hello @Bertrand

Using autocomplete with large amounts of data is certainly troublesome and it's difficult to find a solution that works for everyone.

You're describing query execution problems. We could examine whether there are options for improving your specific use case.

Here's a github issue debating various options and tradeoffs - https://github.com/elastic/kibana/issues/15887

Hi @mattkime

Our use case and symptoms are exactly what is described in that github issue. Timings are similar too, and we were also surprised (as reported in the issue) that the 1s timeout set on the query did not have the expected effect...

The github issue has been open since early 2018... Should we understand that there is no hope of making things better?

I'm afraid the KQL suggestion feature is not usable as it is for anybody storing log data in ES. I have seen quite a few posts recently on the forum where people complain about sudden CPU spikes reaching 100% for some time. In some cases this might be caused by the KQL value suggestions feature, which is enabled by default (as far as I remember)...

Are you aware of anything we can tune to make it better?

Discussing this with a colleague

@mattkime Did you come up with something?

Given that Kibana is often used in very large deployments with thousands of indices and shards, it would be interesting to learn how Kibana is performance tested at scale.

I have re-started the discussion here: https://github.com/elastic/kibana/issues/15887#issuecomment-528598601

