We have a cluster of 7 nodes (3 hot / 4 warm) hosting our log data in daily indexes. The cluster contains about 180 indexes with 3 shards each. The total document count is about 9.3 billion, with roughly 50M documents per index on average.
We noticed that browsing logs in Kibana Discover with KQL puts a huge load on the cluster because of the value suggestion feature, even when the selected time range is very short (the last hour, for instance).
The load appears while the user is typing their KQL query, as soon as they start entering the value part of a criterion. At that moment we see the following entries in the slow query log on each node:
[2019-08-23T17:06:05,612][TRACE][i.s.s.query ] [es03] [gas-logstash-2019.03.10][0] took[762.6ms], took_millis[762], total_hits[10000+ hits], types[], stats[], search_type[QUERY_THEN_FETCH], total_shards[579], source[{"size":0,"timeout":"1000ms","terminate_after":100000,"query":{"match_all":{"boost":1.0}},"aggregations":{"suggestions":{"terms":{"field":"level","size":10,"shard_size":10,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"execution_hint":"map","order":[{"_count":"desc"},{"_key":"asc"}],"include":".*"}}}}], id[],
[2019-08-23T17:06:05,814][TRACE][i.s.s.query ] [es03] [gas-logstash-2019.02.25][0] took[964.8ms], took_millis[964], total_hits[10000+ hits], types[], stats[], search_type[QUERY_THEN_FETCH], total_shards[579], source[{"size":0,"timeout":"1000ms","terminate_after":100000,"query":{"match_all":{"boost":1.0}},"aggregations":{"suggestions":{"terms":{"field":"level","size":10,"shard_size":10,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"execution_hint":"map","order":[{"_count":"desc"},{"_key":"asc"}],"include":".*"}}}}], id[],
[2019-08-23T17:06:05,894][TRACE][i.s.s.query ] [es03] [gas-logstash-2019.03.04][0] took[1s], took_millis[1044], total_hits[10000+ hits], types[], stats[], search_type[QUERY_THEN_FETCH], total_shards[579], source[{"size":0,"timeout":"1000ms","terminate_after":100000,"query":{"match_all":{"boost":1.0}},"aggregations":{"suggestions":{"terms":{"field":"level","size":10,"shard_size":10,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"execution_hint":"map","order":[{"_count":"desc"},{"_key":"asc"}],"include":".*"}}}}], id[],
[2019-08-23T17:06:06,032][TRACE][i.s.s.query ] [es03] [gas-logstash-2019.03.05][0] took[1.1s], took_millis[1183], total_hits[10000+ hits], types[], stats[], search_type[QUERY_THEN_FETCH], total_shards[579], source[{"size":0,"timeout":"1000ms","terminate_after":100000,"query":{"match_all":{"boost":1.0}},"aggregations":{"suggestions":{"terms":{"field":"level","size":10,"shard_size":10,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"execution_hint":"map","order":[{"_count":"desc"},{"_key":"asc"}],"include":".*"}}}}], id[],
[2019-08-23T17:06:06,213][TRACE][i.s.s.query ] [es03] [gas-logstash-2019.03.06][0] took[1.3s], took_millis[1364], total_hits[10000+ hits], types[], stats[], search_type[QUERY_THEN_FETCH], total_shards[579], source[{"size":0,"timeout":"1000ms","terminate_after":100000,"query":{"match_all":{"boost":1.0}},"aggregations":{"suggestions":{"terms":{"field":"level","size":10,"shard_size":10,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"execution_hint":"map","order":[{"_count":"desc"},{"_key":"asc"}],"include":".*"}}}}], id[],
[2019-08-23T17:06:06,802][TRACE][i.s.s.query ] [es03] [gas-logstash-2019.03.11][0] took[1.1s], took_millis[1187], total_hits[10000+ hits], types[], stats[], search_type[QUERY_THEN_FETCH], total_shards[579], source[{"size":0,"timeout":"1000ms","terminate_after":100000,"query":{"match_all":{"boost":1.0}},"aggregations":{"suggestions":{"terms":{"field":"level","size":10,"shard_size":10,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"execution_hint":"map","order":[{"_count":"desc"},{"_key":"asc"}],"include":".*"}}}}], id[],
We can see dozens of such entries on each node, primarily on the warm nodes since they have slower disks and host more indexes. The situation gets worse as the user keeps typing the value, because new queries seem to be sent for every character typed, increasing the load on the cluster even more.
As far as I understand, Kibana doesn't take the selected time range into account when building its value suggestions. Therefore the entire dataset (9+ billion docs) is queried for suggestions every time... The warm nodes are of course not "designed" to handle such a load.
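For reference, here is the suggestion request reconstructed from the source[...] field of the slow log entries above, in a form that could be replayed from Kibana Dev Tools. The gas-logstash-* index pattern is my assumption based on the index names in the log; the request body mirrors the slow log source (minus the default boost on match_all):

```
GET gas-logstash-*/_search
{
  "size": 0,
  "timeout": "1000ms",
  "terminate_after": 100000,
  "query": { "match_all": {} },
  "aggregations": {
    "suggestions": {
      "terms": {
        "field": "level",
        "size": 10,
        "shard_size": 10,
        "min_doc_count": 1,
        "shard_min_doc_count": 0,
        "show_term_doc_count_error": false,
        "execution_hint": "map",
        "order": [{ "_count": "desc" }, { "_key": "asc" }],
        "include": ".*"
      }
    }
  }
}
```

Note the absence of any @timestamp range filter: that is why all shards of the index pattern get hit (total_shards[579] in the slow log), regardless of the time range selected in Discover.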
We disabled value suggestion for the moment until we find a better solution.
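(For reference, the switch is the filterEditor:suggestValues advanced setting in Kibana, under Management → Advanced Settings, set to false.) If I read the slow log correctly, the timeout[1000ms] and terminate_after[100000] in those queries correspond to the kibana.autocompleteTimeout and kibana.autocompleteTerminateAfter settings in kibana.yml, so a less drastic option might be to tighten those instead of disabling suggestions entirely. A sketch, with illustrative values only:

```yaml
# kibana.yml -- illustrative values, assuming Kibana 7.x autocomplete settings
kibana.autocompleteTimeout: 500           # default 1000 (ms); matches timeout[1000ms] in the slow log
kibana.autocompleteTerminateAfter: 10000  # default 100000; matches terminate_after[100000] in the slow log
```

That would only cap the work done per shard, though; the suggestion queries would still fan out to every index in the pattern.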
Is this expected?
Does someone else experience the same behaviour?
Should we configure/tune the cluster differently?
Thanks,
/Bertrand