Elasticsearch Wildcard fieldtype has slow performance for wildcard queries

Chris_Doman · December 26, 2020, 1:21pm

I'm testing the new Elasticsearch wildcard field type (https://www.elastic.co/blog/find-strings-within-strings-faster-with-the-new-elasticsearch-wildcard-field), which is supposed to offer faster wildcard queries.

Strangely, it seems to be slower for wildcard queries than the default text field.

Is this expected behaviour, or am I doing something wrong?

I am searching approximately 500,000 rows of log data.

With the default text type (though it only searches over the first 256 chars due to ignore_above):

"result" : {
  "type" : "text",
  "fields" : {
    "keyword" : {
      "type" : "keyword",
      "ignore_above" : 256
    }
  }
}

time curl -X GET "localhost:9200/text-index/_count?pretty" -H "Content-Type: application/json" -d'
{
  "query": {
    "wildcard": {
      "result": "*dfdskfjdskofsdjf*"
    }
  }
}'

= 3.2 seconds

With the new wildcard field type:

"result" : {
  "type" : "wildcard"
}

time curl -X GET "localhost:9200/wildcard-index/_count?pretty" -H "Content-Type: application/json" -d'
{
  "query": {
    "wildcard": {
      "result": "*dfdskfjdskofsdjf*"
    }
  }
}'

= 24.95 seconds

Mark_Harwood · December 26, 2020, 10:24pm

What’s the cardinality of values in this field?
As the flowchart at the end of the blog post linked above advises, the wildcard field is designed for use on fields with millions of unique values.

Keyword fields get bogged down by large numbers of unique terms, while wildcard fields get bogged down by large numbers of docs that share a common term (i.e. keyword fields struggle with high cardinality, wildcard fields with low cardinality).
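
To make that trade-off concrete, here is a toy Python model, not Elasticsearch code. It assumes a keyword-style lookup that tests the pattern once per unique term, and a wildcard-field-style lookup that uses a trigram index to pick candidate docs and then verifies each candidate's full value. The function names and sample data are made up for illustration.

```python
from fnmatch import fnmatchcase

def keyword_wildcard(values, pattern):
    """Keyword-style: the pattern is tested once per unique term."""
    terms = set(values)
    matching = {t for t in terms if fnmatchcase(t, pattern)}
    hits = sum(1 for v in values if v in matching)
    return hits, len(terms)  # (matching docs, terms examined)

def trigrams(s):
    return {s[i:i + 3] for i in range(len(s) - 2)}

def wildcard_field(values, literal):
    """Wildcard-field-style: trigram index narrows to candidate docs,
    then every candidate's full value is verified."""
    index = {}
    for doc_id, value in enumerate(values):
        for gram in trigrams(value):
            index.setdefault(gram, set()).add(doc_id)
    grams = trigrams(literal)
    if grams:
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        candidates = set(range(len(values)))  # literal too short to narrow
    hits = sum(1 for d in candidates if literal in values[d])
    return hits, len(candidates)  # (matching docs, docs verified)

# Low cardinality: one term shared by every doc.
low = ["error timeout"] * 1000
# High cardinality: every doc has a unique value.
high = [f"req-{i:06d} ok" for i in range(1000)]

print(keyword_wildcard(low, "*timeout*"), wildcard_field(low, "timeout"))
print(keyword_wildcard(high, "*000999*"), wildcard_field(high, "000999"))
```

In this model the keyword approach examines one term for the low-cardinality data but every term for the high-cardinality data, while the trigram approach does the opposite: it has to verify every doc that shares the common term.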

Chris_Doman · December 28, 2020, 2:52pm

Mostly unique values, but there is repetition of words within them, which is typical of log data.
I've ended up using a default text field instead of wildcard, with an ngram tokenizer.
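
The post does not show the actual settings used, so the following is only a guess at what such a setup might look like, written as a Python dict that could be sent as an index-creation body. The names `trigram_tokenizer` and `trigram_analyzer` and the gram size of 3 are assumptions; only the field name `result` comes from the mappings above. The small `ngrams` helper mimics the tokens an ngram tokenizer with `min_gram` and `max_gram` both set to 3 would emit.

```python
import json

# Hypothetical index body; analyzer and tokenizer names are illustrative only.
index_body = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "trigram_tokenizer": {"type": "ngram", "min_gram": 3, "max_gram": 3}
            },
            "analyzer": {
                "trigram_analyzer": {
                    "type": "custom",
                    "tokenizer": "trigram_tokenizer",
                    "filter": ["lowercase"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "result": {"type": "text", "analyzer": "trigram_analyzer"}
        }
    },
}

def ngrams(text, n=3):
    """Sliding-window character ngrams over the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(json.dumps(index_body, indent=2))
print(ngrams("timeout"))
```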

Mark_Harwood · December 28, 2020, 3:14pm

Thanks. I am keen to diagnose the performance further if you can help with these 2 questions:

On the keyword field, what is the value of this aggregation?

{
  "size":0,
  "aggs":{
    "numTerms":{
      "cardinality": {
        "field": "x"
      }
    }
  }
}

How many documents matched your wildcard query?

Mark_Harwood · December 29, 2020, 9:31am

Just reviewing this topic again and a couple of things jumped out:

Keyword values longer than 256 characters are completely ignored, not truncated. They are missing from the index entirely.
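
As a rough illustration of that difference (a toy model, not Elasticsearch internals; the function names are hypothetical), compare what actually happens with what truncation would have done:

```python
def index_keyword(value, ignore_above=256):
    """What ignore_above does: skip over-long values entirely.
    Returns the indexed term, or None if nothing is indexed."""
    return value if len(value) <= ignore_above else None

def index_truncated(value, limit=256):
    """What ignore_above does NOT do: keep the first `limit` characters."""
    return value[:limit]

long_value = "needle " + "x" * 300  # 307 characters, "needle" at the start

print(index_keyword(long_value))                 # nothing is indexed
print("needle" in index_truncated(long_value))   # truncation would still find it
```

So a wildcard query against the keyword sub-field cannot match a long log line at all, even when the search string appears in its first 256 characters.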

Your example request was searching the text field and not the keyword field, which would need to be queried by the ‘result.keyword’ name.
Querying a text field may be faster (fewer unique terms in the index) but your query scope is different - you’re searching within the confines of single terms/words rather than for any possible character sequence in the original value. The blog highlights why word-based text indexes are of less use in machine generated content where there’s no common agreement between searcher and search engine as to what constitutes a word.
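
A small Python sketch of that scope difference follows. It is an approximation under stated assumptions: the regex `\w+` split plus lowercasing only roughly stands in for the standard analyzer, `fnmatchcase` stands in for wildcard matching, and the sample log line is invented.

```python
import re
from fnmatch import fnmatchcase

def text_field_match(value, pattern):
    """Text-field scope: the pattern must match within a single analyzed term."""
    terms = re.findall(r"\w+", value.lower())
    return any(fnmatchcase(term, pattern) for term in terms)

def whole_value_match(value, pattern):
    """Keyword/wildcard-field scope: the pattern is matched against the full value."""
    return fnmatchcase(value, pattern)

line = "get /api/users id=42"

# A pattern inside one word matches in both scopes.
print(text_field_match(line, "*user*"), whole_value_match(line, "*user*"))
# A pattern spanning a word boundary only matches the whole value.
print(text_field_match(line, "*api/users*"), whole_value_match(line, "*api/users*"))
```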

system · January 26, 2021, 9:31am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
