Elasticsearch wildcard field type has slow performance for wildcard queries

I'm testing the new Elasticsearch wildcard field type (https://www.elastic.co/blog/find-strings-within-strings-faster-with-the-new-elasticsearch-wildcard-field), which is supposed to offer better wildcard queries.

Strangely, it seems to be slower for wildcard queries than the default text field.

Is this expected behaviour, or am I doing something wrong?

I am searching approximately 500,000 rows of log data.

With the default text type (though it only searches over the first 256 characters due to ignore_above):

"result" : {
  "type" : "text",
  "fields" : {
    "keyword" : {
      "type" : "keyword",
      "ignore_above" : 256
    }
  }
}

time curl -X GET "localhost:9200/text-index/_count?pretty" -H "Content-Type: application/json" -d'
{
  "query": {
    "wildcard": {
      "result": "*dfdskfjdskofsdjf*"
    }
  }
}'

= 3.2 seconds

With the new wildcard field type:

"result" : {
  "type" : "wildcard"
}


time curl -X GET "localhost:9200/wildcard-index/_count?pretty" -H "Content-Type: application/json" -d'
{
  "query": {
    "wildcard": {
      "result": "*dfdskfjdskofsdjf*"
    }
  }
}'

= 24.95 seconds

What's the cardinality of values in this field?
As the flowchart at the end of the blog post advises, the wildcard field is designed for use on fields with millions of unique values.

Keyword fields get bogged down by large numbers of unique terms, while wildcard fields get bogged down by large numbers of docs that share a common term (i.e. high cardinality hurts keyword fields, while low cardinality hurts wildcard fields).

Mostly unique values, but with repetition of words within them: typical log data.
I've ended up using a default text field with an ngram tokenizer instead of wildcard.
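
For readers unfamiliar with the idea, here is a rough Python sketch of what character n-gram tokenization produces. It is an illustration only, not the Elasticsearch implementation: the helper name char_ngrams and the gram size of 3 are assumptions, since the thread does not show the actual tokenizer settings.

```python
from typing import List


def char_ngrams(text: str, n: int = 3) -> List[str]:
    """Return every contiguous substring of length n (hypothetical helper).

    The gram size of 3 is an assumption for illustration; the thread does
    not say which min_gram/max_gram values were configured.
    """
    return [text[i:i + n] for i in range(len(text) - n + 1)]


if __name__ == "__main__":
    # Each gram becomes its own indexed term, so a substring search can be
    # answered by looking up grams instead of scanning every value.
    print(char_ngrams("error"))
```

The trade-off is index size: every value is expanded into many short terms.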

Thanks. I am keen to diagnose the performance further if you can help with these 2 questions:

  1. On the keyword field, what is the value of this aggregation?
{
  "size":0,
  "aggs":{
    "numTerms":{
      "cardinality": {
        "field": "x"
      }
    }
  }
}
  2. How many documents matched your wildcard query?
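
If it helps, the two request bodies can be built like this in Python. This is a minimal sketch: the function names are hypothetical, and the field names "result.keyword" and "result" are assumed from the mapping posted earlier in the thread (the "x" above is a placeholder). Note that the cardinality aggregation returns an approximate count.

```python
import json


def cardinality_request(field: str) -> dict:
    """Body for a _search request that estimates the number of unique terms in `field`."""
    return {
        "size": 0,  # no hits needed, only the aggregation result
        "aggs": {"numTerms": {"cardinality": {"field": field}}},
    }


def wildcard_count_request(field: str, pattern: str) -> dict:
    """Body for a _count request reporting how many documents match `pattern`."""
    return {"query": {"wildcard": {field: pattern}}}


if __name__ == "__main__":
    # Field names are assumptions taken from the mapping shown in the thread.
    print(json.dumps(cardinality_request("result.keyword"), indent=2))
    print(json.dumps(wildcard_count_request("result", "*dfdskfjdskofsdjf*"), indent=2))
```

The first body goes to the index's _search endpoint and the second to _count, as in the curl commands above.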

Just reviewing this topic again and a couple of things jumped out:

Keyword values longer than 256 characters are ignored entirely, not truncated. They are completely missing from the index.
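
To make the difference concrete, here is a small Python model of that rule. It is only a sketch of the behaviour described above, not Elasticsearch code, and the function name is hypothetical.

```python
from typing import Optional

IGNORE_ABOVE = 256  # value taken from the mapping in this thread


def keyword_indexed_value(value: str, ignore_above: int = IGNORE_ABOVE) -> Optional[str]:
    """Model of what a keyword field with ignore_above puts in the index.

    Values longer than the limit are skipped entirely (None here);
    they are NOT cut down to the first `ignore_above` characters.
    """
    if len(value) > ignore_above:
        return None
    return value


if __name__ == "__main__":
    short_line = "a" * 256
    long_line = "a" * 257
    print(keyword_indexed_value(short_line) is not None)  # indexed as-is
    print(keyword_indexed_value(long_line) is None)       # skipped, not truncated
```

So a long log line cannot be found through result.keyword at all, even by a pattern that would match its first 256 characters.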

Your example request was searching the text field and not the keyword field, which would need to be queried by the 'result.keyword' name.
Querying a text field may be faster (fewer unique terms in the index) but your query scope is different: you're searching within the confines of single terms/words rather than for any possible character sequence in the original value. The blog highlights why word-based text indexes are of less use in machine-generated content, where there's no common agreement between searcher and search engine as to what constitutes a word.
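
The scope difference can be sketched in Python as follows. This is a simplified model under stated assumptions: simple_tokens is a crude stand-in for the standard analyzer (lowercase, split on anything that is not a letter or digit), fnmatchcase stands in for wildcard matching, and the log line is made up for illustration.

```python
import re
from fnmatch import fnmatchcase
from typing import List


def simple_tokens(value: str) -> List[str]:
    """Crude approximation of text analysis: lowercase, split on non-alphanumerics."""
    return [tok for tok in re.split(r"[^0-9a-z]+", value.lower()) if tok]


def matches_text_field(value: str, pattern: str) -> bool:
    """Wildcard on a text field: the pattern must match within a single term."""
    return any(fnmatchcase(tok, pattern) for tok in simple_tokens(value))


def matches_whole_value(value: str, pattern: str) -> bool:
    """Wildcard on the unanalyzed value: the pattern may span word boundaries."""
    return fnmatchcase(value, pattern)


if __name__ == "__main__":
    log_line = "GET /api/v1/users status=500"  # hypothetical log line
    pattern = "*status=5*"
    print(matches_whole_value(log_line, pattern))  # True: sequence exists in the value
    print(matches_text_field(log_line, pattern))   # False: "=" split it into two terms
```

A pattern that fits inside one word, such as "*stat*", matches in both cases, which is why the two indexes can appear to behave the same on simple tests.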
