ES query performance when searching on filed with wildcard in field name

I saw significant performance difference between searching on field with exact field name and searching on field with wildcard suffix i name.

Following is my mapping:

PUT /myindex
{
     "mappings": {
      "docs": {
          "Locale": {
            "type": "text",
            "fields": {
                "lowercaseraw": {
                     "type": "text",
                     "analyzer": "lowercase_keyword"
                }
            },
            "analyzer": "standard"
          },
          "Title": {
            "type": "text",
            "analyzer": "standard"
          },
          "Title_english": {
            "type": "text",
            "analyzer": "english"
          },
          "Title_french": {
            "type": "text",
            "analyzer": "french"
          },
          "Title_finnish": {
            "type": "text",
            "analyzer": "finnish"
          }
     }
}

Note: There are totally around 50 Title_ fields each of which use a specific ES language analyzer. I just list a few as example.*

When ingesting document, I will insert the doc with TWO Title, one is for "Title", the other is for "Title_xxx", "xxx" will be determined by the language of the doc.

For example, if the doc is english, then:

POST /myindex/docs/doc_1
{
  "Title": "Elasticsearch 5.0 is released!",
  "Title_english": "Elasticsearch 5.0 is released!",
  "Locale": "en-us"
}

If the doc is french, then:

POST /myindex/docs/doc_1
{
  "Title": "Elasticsearch 5.0 is released!",
  "Title_french": "Elasticsearch 5.0 is released!",
  "Locale": "fr-fr"
}

Similar to other languages.

So, although there are 50 around Title_* fields in mapping, there is always TWO Title fields has value.

Then I saw significant query performance difference when searching english documents with following two different query DSL:

{
  "query" : {
    "bool": {
        "must": {
          "multi_match" : {
            "query" : "elasticserach 5",
            "fields" : ["Title_english"],
            "type" : "phrase",
            "slop" : 5,
            "boost" : 10
          }
        },
      "filter" : {
        "bool" : {
          "must" : 
            {
                "match" : {
                  "Locale.lowercaseraw" : "en-us"
                }
            }
        }
      }
    }
  }
}
{
  "query" : {
    "bool": {
        "must": {
          "multi_match" : {
            "query" : "elasticsearch 5",
            "fields" : ["Title_*"],
            "type" : "phrase",
            "slop" : 5,
            "boost" : 10
          }
        },
      "filter" : {
        "bool" : {
          "must" : 
            {
                "match" : {
                  "Locale.lowercaseraw" : "en-us"
                }
            }
        }
      }
    }
  }
}

The only difference is, that the #1 DSL use "Title_english", the #2 use "Title_", and please be noticed that the filter makes sure only english documents match the query, that is, only Title and Title_english in the matched documents have value, all other Title_ is null.

Per my rough perf testing using JMeter (200 QPS), the perf result is as follows:
#1:
Avg. CPU usage: 33%
Avg. Response Time: 26 ms

#2
Avg. CPU usage: 67%
Avg. Response Time: 50 ms

Although #2 use "Title_*", the ES actually search on one field, "Title_english", since the filter makes sure only "Title_english" has value.

So why so big difference? Is it bad practice using wildcard suffix in search field name?

Why? Because each doc needs to be checked to see if any of the fields begin in Title_, that takes time.
It's like a full table scan in a RDBMS.

It makes sense, but it is interesting that if Title_xxx field value is empty, the query on ["Title", "Title_*"] perf is quite good. Whereas if Title_english is not empty, query only on ["Title_*"] got very bad performance.
But in two cases, only one field is searched, and all used wildcard.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.