Query/filter ordering performance issue

Elastic version: 6.3.2
Documents: ~120,000,000

We're noticing very odd behavior with how Elastic appears to be ordering filter/query operations. Given the following query:

POST /index/_search
{
  "from": 0,
  "size": 5,
  "query": {
"bool": {
  "must": [
    {
      "term": {
        "customer": "customer-id"
      }
    },
    {
      "query_string": {
        "fields": [
          "filename"
        ],
        "query": "*term*"
      }
    }
  ]
}
  }
}

Running this query takes ~3.5 seconds.
Removing the customer terms query/filter takes ~3.5 seconds
Removing the query_string query takes 35ms and consists of ~1,500 results (down from 120M total docs in the index)

It seems like the query_string operation is being run before the term filter since it takes 3.5 seconds with or without the term filter.

My question is, why doesn't Elastic apply the term filter first so that the query_string operation (more expensive) only runs on 1,500 documents instead of the full 120M?

Ideas I've had but haven't been able to make work yet:

  • Rearranging the query to get the term filter to run before the query_string. I've tried different variations of nesting must/should/bool.
  • Passing some type of hint to the query processor so that it knows to run the term filter first. I haven't found any way to do this.

Also, I realize wildcard prefixes cause performance issues, but I don't think that's what's happening here.

Any insight would be appreciated.

Nick

try moving the term query into the filter part of the bool query.

Also you should rethink your use of wildcards in the query string query. Maybe using shingles, ngrams or edge-ngrams at index time would yield better results at query time.

Thank you @spinscale. Though I tried this and it still doesn't seem to address the performance issue. It does cause the end result-set to be filtered correctly, but it does not limit the initial search data set so performance is still poor.

POST /index/_search
{
  "from": 0,
  "size": 5,
  "query": {
    "bool": {
      "filter": {
          "term": {
            "customer": "customer-id"
          }
      }, 
      "must": [
        {
          "query_string": {
            "fields": [
              "filename"
            ],
            "query": "*term*"
          }
        }
      ]
    }
  }
} 

We'll try the other analyzers, but even if we get that working, it seems like this filter issue would provide a massive performance improvement.

I think that the query string query is so expensive, that the filter does not play a role.

Have you tried using the profile API? See https://www.elastic.co/guide/en/elasticsearch/reference/7.1/search-profile.html and https://www.elastic.co/guide/en/kibana/7.1/xpack-profiler.html

Thank you @spinscale. I'll check out the profiler.

For the time being, we've made use of the standard analyzer, ngrams and removed the wildcard prefix and are seeing significant improvements.

1 Like