Performance regression in Elasticsearch 6.5 using wildcards

We have recently migrated from Elasticsearch 5.6.7 to 6.5.0, and we found a serious performance regression with searches that use a query string query with wildcards. In particular:

  1. Query string queries with leading star wildcards (e.g. `*sql*`) use a lot of CPU (100% on all 4 cores) and perform much worse than in Elasticsearch 5.6.7.
  2. Query string queries that combine highlighting with leading star wildcards use even more CPU and perform very poorly.

Exactly the same searches executed with Elasticsearch 5.6.7 on similar indices perform much better: minimal CPU usage and fast response times.

What is the reason for this performance regression?
Is there a way to restore the search performance we had with Elasticsearch 5.6.7?

Without knowing more, I would suspect this change.

How are you executing the search? Kibana or some other tool?

Could you give a more detailed example of a query?

We verified this performance issue using both Kibana and the NEST client.
Here is an example of a search without highlighting.

{
  "size": 20,
  "sort": [
    {
      "name.keyword": {
        "order": "asc"
      }
    }
  ],
  "_source": {
    "includes": [
      "key",
      "name",
      "path",
      "type",
      "typeEnum",
      "sourceId",
      "sourceType",
      "stateIndex",
      "state"
    ]
  },
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "*sql*"
          }
        }
      ],
      "should": [
        {
          "has_child": {
            "type": "esentity",
            "score_mode": "none",
            "query": {
              "bool": {
                "must": [
                  {
                    "match_all": {}
                  }
                ],
                "should": [
                  {
                    "query_string": {
                      "query": "*sql*"
                    }
                  }
                ],
                "filter": [
                  {
                    "exists": {
                      "field": "lastSyncTime"
                    }
                  }
                ]
              }
            },
            "inner_hits": {
              "_source": {
                "includes": [
                  "stateIndex"
                ]
              }
            }
          }
        }
      ],
      "filter": [
        {
          "bool": {
            "must": [
              {
                "match_all": {}
              },
              {
                "term": {
                  "joinKey": {
                    "value": "parent"
                  }
                }
              }
            ]
          }
        }
      ]
    }
  }
}

This particular index contains a join field; however, we have also created an equivalent index without the join. In that case the performance is better, but still significantly worse than on a comparable Elasticsearch 5.6.7 index.

First, I would like to point out that wildcard queries, especially ones with leading wildcards, are among the most inefficient queries you can run, and they should be avoided at all costs.

In version 6.x the _all field was removed, and a query_string query that does not specify any fields now searches all fields by default. This could have an impact on accuracy as well as performance, especially if your documents contain a large number of fields, since you are not specifying any particular field to search. You could potentially create your own equivalent of an _all field using the copy_to functionality, but I am not sure whether that would fully restore the old behaviour.
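
As a sketch, a copy_to-based replacement for _all could look like the mapping below. The index name and the set of source fields are assumptions based on the fields mentioned earlier in this thread; you would list every text field you want to be searchable.

```json
PUT /my-index
{
  "mappings": {
    "_doc": {
      "properties": {
        "all_fields": { "type": "text" },
        "name": { "type": "text", "copy_to": "all_fields" },
        "key":  { "type": "text", "copy_to": "all_fields" },
        "path": { "type": "text", "copy_to": "all_fields" }
      }
    }
  }
}
```

The query_string query would then target only `all_fields` (via `"fields": ["all_fields"]`) instead of expanding to every field in the mapping.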

Have you looked into perhaps using ngrams instead of wildcard queries?
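
For reference, a minimal sketch of an ngram-based mapping that can match substrings without any wildcard; the index, tokenizer, analyzer, and field names here are made up for illustration, and the gram sizes would need tuning for your data:

```json
PUT /my-ngram-index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "trigram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "trigram_analyzer": {
          "type": "custom",
          "tokenizer": "trigram_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "all_fields": {
          "type": "text",
          "analyzer": "trigram_analyzer"
        }
      }
    }
  }
}
```

With a mapping like this, a plain match query for `sql` on `all_fields` should find documents containing the substring anywhere in a value, trading index size for query speed.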

We know that wildcards in general, and leading wildcards in particular, are very expensive. However, we are specifically comparing their performance impact between Elasticsearch 5.6.7 and 6.5.0. What we found is that, for certain indices, the cost of using these wildcards in a query is prohibitive in 6.5.0; it wasn't in 5.6.7.

We have certain indices that could potentially contain a few thousand fields; however, we have already seen serious performance issues on indices containing 350 fields of type text.

As we provide our users the ability to search their data with a query string query, we would rather not restrict their searches by disallowing leading wildcards, if possible.
I also don't see how we could make use of the N-gram tokenizer if our users want to search for "*server" or "*sql" in real time.

Regarding your suggestion of using the copy_to functionality to create a custom _all field: why aren't you sure it will work? What possible issues could affect this solution, apart from the increase in storage usage?

I am not sure this change is what is causing the difference in behaviour, which is why I am not sure it will resolve the issue. It is however a change that was introduced, so it would be worth trying out. It will result in additional storage, and should bring your index size closer to what it was in 5.6.

We did some tests using the copy_to functionality, copying all string values into a custom "all_fields" field. Without highlighting, every search is now much faster than before.
With highlighting, however, we still see a significant performance drop for searches with leading star wildcards.
This is an example of a search using highlighting and star wildcards:

{
  "size": 20,
  "sort": [
    {
      "lastUpdated": {
        "order": "desc"
      }
    }
  ],
  "highlight": {
    "pre_tags": [
      "<hl>"
    ],
    "post_tags": [
      "</hl>"
    ],
    "number_of_fragments": 0,
    "fields": {
      "key": {
        "type": "plain",
        "highlight_query": {
          "query_string": {
            "query": "*sql*"
          }
        }
      },
      "severityIndex": {
        "type": "plain",
        "highlight_query": {
          "query_string": {
            "query": "*sql*"
          }
        }
      },
      "target": {
        "type": "plain",
        "highlight_query": {
          "query_string": {
            "query": "*sql*"
          }
        }
      },
      "name": {
        "type": "plain",
        "highlight_query": {
          "query_string": {
            "query": "*sql*"
          }
        }
      },
      "resolutionState": {
        "type": "plain",
        "highlight_query": {
          "query_string": {
            "query": "*sql*"
          }
        }
      },
      "lastUpdated": {
        "type": "plain",
        "highlight_query": {
          "query_string": {
            "query": "*sql*"
          }
        }
      },
      "sourceType": {
        "type": "plain",
        "highlight_query": {
          "query_string": {
            "query": "*sql*"
          }
        }
      },
      "source.scom.monitoringObjectPath": {
        "type": "plain",
        "highlight_query": {
          "query_string": {
            "query": "*sql*"
          }
        }
      }
    },
    "require_field_match": false
  },
  "_source": {
    "includes": [
      "key",
      "severityIndex",
      "target",
      "name",
      "resolutionState",
      "lastUpdated",
      "sourceType",
      "source.scom.monitoringObjectPath"
    ]
  },
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "*sql*",
            "fields": [
              "all_fields"
            ]
          }
        }
      ],
      "filter": [
        {
          "bool": {
            "must": [
              {
                "match_all": {}
              },
              {
                "match_all": {}
              },
              {
                "match_all": {}
              },
              {
                "match_all": {}
              }
            ]
          }
        }
      ]
    }
  }
}

As you can see, we used the plain highlighter instead of the default (unified) highlighter, as it seems to perform better in our case.
