Performance regression in Elasticsearch 6.5 using wildcards

We have recently migrated from Elasticsearch 5.6.7 to 6.5.0, and we found a serious performance regression with searches that use a query string query with wildcards. In particular:

  1. Query string queries with leading star wildcards (e.g. `*sql*`) use a lot of CPU (100% on all 4 cores) and perform much worse than in Elasticsearch 5.6.7.
  2. Query string queries that combine highlighting with leading star wildcards use even more CPU and perform very poorly.

Exactly the same searches executed with Elasticsearch 5.6.7 on similar indices perform much better: minimal CPU usage and fast response times.

What is the reason for this performance regression?
Is there a way to restore the search performance we had with Elasticsearch 5.6.7?

Without knowing more, I would suspect this change.

How are you executing the search? Kibana or some other tool?

Could you give a more detailed example of a query?

We verified this performance issue using both Kibana and the NEST client.
Here is an example of a search without highlighting.

{
  "size": 20,
  "sort": [
    {
      "name.keyword": {
        "order": "asc"
      }
    }
  ],
  "_source": {
    "includes": [
      "key",
      "name",
      "path",
      "type",
      "typeEnum",
      "sourceId",
      "sourceType",
      "stateIndex",
      "state"
    ]
  },
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "*sql*"
          }
        }
      ],
      "should": [
        {
          "has_child": {
            "type": "esentity",
            "score_mode": "none",
            "query": {
              "bool": {
                "must": [
                  {
                    "match_all": {}
                  }
                ],
                "should": [
                  {
                    "query_string": {
                      "query": "*sql*"
                    }
                  }
                ],
                "filter": [
                  {
                    "exists": {
                      "field": "lastSyncTime"
                    }
                  }
                ]
              }
            },
            "inner_hits": {
              "_source": {
                "includes": [
                  "stateIndex"
                ]
              }
            }
          }
        }
      ],
      "filter": [
        {
          "bool": {
            "must": [
              {
                "match_all": {}
              },
              {
                "term": {
                  "joinKey": {
                    "value": "parent"
                  }
                }
              }
            ]
          }
        }
      ]
    }
  }
}

This particular index contains a join field; however, we have also created an equivalent index without the join. In that case the performance is better, but still significantly worse than on a comparable Elasticsearch 5.6.7 index.

First, I would like to point out that wildcard queries, especially ones with leading wildcards, are among the most inefficient queries you can run, and they should be avoided at all costs.

In version 6.x the _all field was removed, and a query_string query that does not specify any fields now searches all fields by default. This could have an impact on accuracy as well as performance, especially if your documents contain a large number of fields, since you are not specifying any particular field to search. You could potentially create your own equivalent of an _all field using the copy_to functionality, but I am not sure whether that would fully restore the old behaviour.
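
As a sketch, a copy_to-based replacement for _all could look like the mapping below. The index name and the set of source fields are assumptions based on the fields mentioned earlier in this thread; you would list every text field you want to be searchable.

```json
PUT /my-index
{
  "mappings": {
    "_doc": {
      "properties": {
        "all_fields": { "type": "text" },
        "name": { "type": "text", "copy_to": "all_fields" },
        "key":  { "type": "text", "copy_to": "all_fields" },
        "path": { "type": "text", "copy_to": "all_fields" }
      }
    }
  }
}
```

The query_string query would then target only `all_fields` (via `"fields": ["all_fields"]`) instead of expanding to every field in the mapping.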

Have you looked into perhaps using ngrams instead of wildcard queries?
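
For reference, a minimal sketch of an ngram-based mapping that can match substrings without any wildcard; the index, tokenizer, analyzer, and field names here are made up for illustration, and the gram sizes would need tuning for your data:

```json
PUT /my-ngram-index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "trigram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "trigram_analyzer": {
          "type": "custom",
          "tokenizer": "trigram_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "all_fields": {
          "type": "text",
          "analyzer": "trigram_analyzer"
        }
      }
    }
  }
}
```

With a mapping like this, a plain match query for `sql` on `all_fields` should find documents containing the substring anywhere in a value, trading index size for query speed.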

We know that wildcards in general, and leading wildcards in particular, are very expensive. However, we are specifically comparing their performance impact between Elasticsearch 5.6.7 and 6.5.0. What we found is that, for certain indices, the cost of using these wildcards in a query is prohibitive in 6.5.0; it wasn't in 5.6.7.

We have certain indices that could potentially contain a few thousand fields; however, we have already seen serious performance issues on indices containing 350 fields of type text.

As we provide our users the ability to search their data with a query string query, we would rather not restrict their searches by disallowing leading wildcards, if possible.
I also don't see how we could make use of the N-gram tokenizer if our users want to search for "*server" or "*sql" in real time.

Regarding your suggestion of using the copy_to functionality to create a custom _all field: why aren't you sure it will work? What possible issues could affect this solution, apart from the increase in storage usage?

I am not sure this change is what is causing the difference in behaviour, which is why I am not sure it will resolve the issue. It is however a change that was introduced, so it would be worth trying out. It will result in additional storage, and should bring your index size closer to what it was in 5.6.

We did some tests using the copy_to functionality, copying all string values into a custom "all_fields" field. Without highlighting, every search is now much faster than before.
With highlighting, however, we still see a significant performance drop for searches with leading star wildcards.
This is an example of a search using highlighting and star wildcards:

{
  "size": 20,
  "sort": [
    {
      "lastUpdated": {
        "order": "desc"
      }
    }
  ],
  "highlight": {
    "pre_tags": [
      "<hl>"
    ],
    "post_tags": [
      "</hl>"
    ],
    "number_of_fragments": 0,
    "fields": {
      "key": {
        "type": "plain",
        "highlight_query": {
          "query_string": {
            "query": "*sql*"
          }
        }
      },
      "severityIndex": {
        "type": "plain",
        "highlight_query": {
          "query_string": {
            "query": "*sql*"
          }
        }
      },
      "target": {
        "type": "plain",
        "highlight_query": {
          "query_string": {
            "query": "*sql*"
          }
        }
      },
      "name": {
        "type": "plain",
        "highlight_query": {
          "query_string": {
            "query": "*sql*"
          }
        }
      },
      "resolutionState": {
        "type": "plain",
        "highlight_query": {
          "query_string": {
            "query": "*sql*"
          }
        }
      },
      "lastUpdated": {
        "type": "plain",
        "highlight_query": {
          "query_string": {
            "query": "*sql*"
          }
        }
      },
      "sourceType": {
        "type": "plain",
        "highlight_query": {
          "query_string": {
            "query": "*sql*"
          }
        }
      },
      "source.scom.monitoringObjectPath": {
        "type": "plain",
        "highlight_query": {
          "query_string": {
            "query": "*sql*"
          }
        }
      }
    },
    "require_field_match": false
  },
  "_source": {
    "includes": [
      "key",
      "severityIndex",
      "target",
      "name",
      "resolutionState",
      "lastUpdated",
      "sourceType",
      "source.scom.monitoringObjectPath"
    ]
  },
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "*sql*",
            "fields": [
              "all_fields"
            ]
          }
        }
      ],
      "filter": [
        {
          "bool": {
            "must": [
              {
                "match_all": {}
              },
              {
                "match_all": {}
              },
              {
                "match_all": {}
              },
              {
                "match_all": {}
              }
            ]
          }
        }
      ]
    }
  }
}

As you can see, we used the plain highlighter instead of the default (unified) highlighter, as it seems to perform better in our case.
