KQL-related performance issue

When I submit a terms query in Kibana on a keyword field using KQL, my queries often time out. If I submit the exact same query with KQL turned off (using Lucene syntax), performance is fantastic and I get results in a second or two.

I inspected the requests to see what was different. Everything is identical except for this:
KQL puts the terms query in a filter clause:
"query": {
  "bool": {
    "must": [
      {
        "range": {
          "@timestamp": {
            "format": "strict_date_optional_time",
            "gte": "2019-09-13T02:20:00.000Z",
            "lte": "2019-09-13T02:35:59.999Z"
          }
        }
      }
    ],
    "filter": [
      {
        "bool": {
          "should": [
            {
              "match_phrase": {
                "aa.customer.name": "customer1"
              }
            }
          ],
          "minimum_should_match": 1
        }
      }
    ],
    "should": [],
    "must_not": []
  }
},

With Lucene, it's done as part of the query using a query_string:
"query": {
  "bool": {
    "must": [
      {
        "query_string": {
          "query": "aa.customer.name:\"customer1\"",
          "analyze_wildcard": true,
          "time_zone": "America/Chicago"
        }
      },
      {
        "range": {
          "@timestamp": {
            "format": "strict_date_optional_time",
            "gte": "2019-09-13T02:20:00.000Z",
            "lte": "2019-09-13T02:35:59.999Z"
          }
        }
      }
    ],
    "filter": [],
    "should": [],
    "must_not": []
  }
},
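For context, Elasticsearch treats these two clause types differently: filter clauses run in non-scoring filter context and are eligible for caching, while must clauses are scored on every execution. A minimal sketch (field name and value copied from the queries above) of the same match_phrase in each context:

```json
{"query": {"bool": {"filter": [{"match_phrase": {"aa.customer.name": "customer1"}}]}}}

{"query": {"bool": {"must":   [{"match_phrase": {"aa.customer.name": "customer1"}}]}}}
```

The first body runs the phrase match without scoring; the second scores it. All else being equal, the filter form should be no slower, which is what makes the behavior I'm seeing so surprising.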

If I submit the KQL query a second time, I usually get results very quickly.

When I run the query in Dev Tools with "profile": true set, I cannot reproduce the poor performance with the KQL version of the query. In fact, the KQL version typically takes less time when I run just the query with profiling enabled.
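For anyone trying to reproduce this, the profiled request is roughly this shape (the index pattern here is a placeholder):

```json
GET indexprefixname-*/_search
{
  "profile": true,
  "query": {
    "bool": {
      "must": [
        {"range": {"@timestamp": {"gte": "2019-09-13T02:20:00.000Z", "lte": "2019-09-13T02:35:59.999Z"}}}
      ],
      "filter": [
        {"match_phrase": {"aa.customer.name": "customer1"}}
      ]
    }
  }
}
```

The "profile" section of the response breaks the time down per shard and per query component.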

For more background: I had previously thought this was something Elasticsearch was doing in how it queried the indices.

I then found this thread: "Query_string performance issue", but that's really about the opposite problem. Here are the results of running the two queries through _validate/query?rewrite&explain:

KQL with filter:
"explanation" : "+@timestamp:[1568374200000 TO 1568375099999] #aa.customer.name:customer1"

Lucene with query_string:
"explanation" : "+aa.customer.name:customer1 +@timestamp:[1568340000000 TO 1568340899999]"
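For reference, those explanations came from requests of roughly this shape (index pattern is a placeholder; the body is the "query" section shown earlier):

```json
GET indexprefixname-*/_validate/query?rewrite=true&explain=true
{
  "query": {
    "bool": {
      "filter": [
        {"match_phrase": {"aa.customer.name": "customer1"}}
      ]
    }
  }
}
```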

I'm at a bit of a loss, since the performance of the profiled requests doesn't match what I see when I run the queries directly in the Discover interface. Can anyone suggest where I should look next to track this down?

@benpolzin can you reproduce the performance issue in Dev Tools when you're not using profile? If so, this is probably an ES issue and you might get better help asking in that forum. It makes sense that the initial run of a filter query would be slower than subsequent runs, since it hasn't been cached yet, but I don't know why it would ever be slower than a non-filtered query_string query.
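To rule caching in or out, you could also clear the caches between runs and watch the cache counters; a sketch (the index pattern is a placeholder):

```json
POST indexprefixname-*/_cache/clear?query=true&request=true

GET _nodes/stats/indices/query_cache,request_cache
```

If the filter query is only fast on repeat runs because of the query cache, clearing it should bring the slow first-run behavior back.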

@Bargs, thanks for your reply. This just gets more and more odd. Every query I submit via the dev console returns quickly, in a few seconds. When I submit that same query in the Discover tab with KQL, it takes >30 seconds and often times out (our timeout is set to 45 seconds). The same query with Lucene in the Discover tab is back to a couple of seconds. I do swap the term I'm searching for and the time range to avoid false results due to caching, but it doesn't seem to make any difference.

I had been submitting just the "query" portion of the request in the dev console, but I added back the aggs, docvalue_fields, highlight, etc. sections so the request matches exactly, and I still don't see the timeouts that I see in the Discover app.

Here's the error I see when requests time out in the Discover app in case this is helpful:

Error loading data
Request Timeout after 45000ms
SearchError: Request Timeout after 45000ms
    at https://kibana.domain.com/bundles/commons.bundle.js:5:973799
    at processQueue (https://kibana.domain.com/built_assets/dlls/vendors.bundle.dll.js:499:199747)
    at https://kibana.domain.com/built_assets/dlls/vendors.bundle.dll.js:499:200710
    at Scope.$digest (https://kibana.domain.com/built_assets/dlls/vendors.bundle.dll.js:499:210518)
    at https://kibana.domain.com/built_assets/dlls/vendors.bundle.dll.js:499:213229
    at completeOutstandingRequest (https://kibana.domain.com/built_assets/dlls/vendors.bundle.dll.js:499:64643)
    at https://kibana.domain.com/built_assets/dlls/vendors.bundle.dll.js:499:67485

Hmm strange. A few more questions:

  • What version of ES/Kibana are you running?
  • Do you see any errors in the ES or Kibana server logs?
  • I'm wondering if the slow down is in Kibana or ES. If you configure the Slow Log in ES do you see the slow query show up there?

  • Version 7.3.0 for both ES and Kibana
  • I don't see any obvious errors in the Kibana or Elasticsearch logs

I set the slowlog to TRACE to see every request regardless of duration and ran a couple of tests. Our ES cluster has 10 data nodes: 5 are "hot", hold the most recent 7 days of data, and have faster SSD disks; 5 are "cold" and use traditional HDD disks.
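The slowlog change was along these lines (the index pattern is a placeholder; a 0ms threshold logs every search at TRACE):

```json
PUT indexprefixname-*/_settings
{
  "index.search.slowlog.threshold.query.trace": "0ms",
  "index.search.slowlog.threshold.fetch.trace": "0ms",
  "index.search.slowlog.level": "TRACE"
}
```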

When I submit a query using KQL, CPU on the 5 cold nodes spikes to 100% and load average spikes as well. Looking at Metricbeat stats, though, the CPU time is not primarily iowait.

With that out of the way: when I look at the slowlog on one of the cold nodes, I see dozens of log lines like this (names slightly obfuscated) when running a KQL query:

[2019-09-17T11:08:46,797][TRACE][i.s.s.query ] [usfitesa32] [indexprefixname-3.0-2019.08.16-000014][0] took[57.9micros], took_millis[0], total_hits[10000+ hits], types[], stats[], search_type[QUERY_THEN_FETCH], total_shards[208], source[{"size":0,"timeout":"1000ms","terminate_after":100000,"query":{"match_all":{"boost":1.0}},"aggregations":{"suggestions":{"terms":{"field":"customer.name","size":10,"shard_size":10,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"execution_hint":"map","order":[{"_count":"desc"},{"_key":"asc"}],"include":"kqlcustomer01.*"}}}}], id[],

But I don't see ANY matching lines in the slowlog on the cold nodes for the Lucene query that I submitted right afterward with a different value of the 'customer.name' field.

When I look at slowlog on one of the hot nodes, I see slowlog TRACE lines for both the KQL and Lucene queries.

If I'm reading this correctly, the question becomes: why does Elasticsearch not query the indices on the cold nodes for the Lucene query, but does for the KQL query?

The log line you pasted is actually from the request that generates the autocomplete suggestions in KQL. The suggestions aren't bound by the time filter so they're going to hit your cold nodes. The actual query will be bound by the time filter so if you're querying less than the last 7 days of data you'll only see it hitting the hot nodes. The suggestion requests shouldn't impact the actual query for data but you could turn them off in Kibana's advanced settings just to reduce the number of moving pieces.

How long do the slowlog lines say the actual KQL data query took on the hot nodes? Does it match up with the time it takes for the browser to time out?

I ran this test again and I think these results from the hot nodes are only for the autocomplete suggestion queries.

I'm unsure why the actual queries aren't showing up there, unless I'm reading it wrong. I'm using Filebeat's ES module to ship slowlog.json to a separate monitoring ES server, and I used this query: event.dataset:"elasticsearch.slowlog" and "kqlcustomer02" and host.name:(server01 or server02 or server03 or server04 or server05) where those hostnames are the hot nodes.

But then I went into my Kibana Advanced settings and set filterEditor:suggestValues to Off. And now all of my searches using KQL return lickety-split with no timeouts.

I don't suppose there is a way to limit those auto-suggest queries to the time range of the actual query? Otherwise, we will live without auto-suggest.

@Bargs, thank you very much for your time walking through this with me. It's greatly appreciated!