Why is a wildcard query string matching on stemmed terms?

cphramington · November 6, 2023, 10:10pm

First, some background. I understand that the algorithmic stemmer is not perfect, e.g. "focused" is stemmed to "focus," while "focus" is stemmed to "focu," which I've validated by looking through the term vectors.
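
For anyone wanting to reproduce the stemming without digging through term vectors, the analyze API can run the same filters directly. This is only a minimal sketch: it assumes a whitespace tokenizer with the lowercase and porter_stem filters, which is the chain in the representative mapping posted later in this thread. Going by the term vectors described above, it should return the tokens "focu" and "focus".

```console
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase", "porter_stem"],
  "text": "focus focused"
}
```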

However, when I execute queries like:

GET /_search
{
  "query": {
    "query_string": {
      "query": "focus*",
      "analyze_wildcard": true,
      "allow_leading_wildcard": true
    }
  }
}

across documents that contain explicitly mapped "text" fields with instances of "focus" and "focused," I retrieve results of both instances.

I assumed the wildcard pattern would not be analyzed, so "focus*" would match only the indexed term "focus" (stemmed from "focused") and not "focu" (stemmed from "focus"). I therefore expected results for "focused" only.

The only conclusion I can draw is that the query is searching the entire source document rather than solely the inverted index, but I haven't come across any documentation confirming this.

This leaves me with the following questions:

How do query string queries containing wildcards search for matches?
If the query is doing a full sweep of the source documents, is there a way to disable this behavior and only utilize the inverted index?

Please do not comment on the inefficiency of wildcard queries; I'm here purely to understand how the operation works.

Christian_Dahlqvist · November 7, 2023, 6:25am

You did not specify a default_field, so I believe all fields in the document that support term queries are searched. In cases like this, it always helps to also provide the mapping you are using.

cphramington · November 7, 2023, 3:05pm

Yes, I want to search across all searchable fields of all documents, but doesn't that mean searching across the inverted index, NOT the source document? This distinction is what I'm trying to resolve.

This is not the exact index, but is representative of the problem:

{
  "settings": {
    "index": {
      "number_of_shards": 2,
      "number_of_replicas": 0
    },
    "analysis": {
      "analyzer": {
        "default": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "porter_stem"
          ]
        }
      }
    }
  },
  "mappings": {
    "dynamic": false,
    "properties": {
      "data": {
        "type": "text"
      }
    }
  }
}
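
One way to check what the query actually runs against the index is the validate API. A sketch, assuming the settings above were used to create an index called my-index (a hypothetical name) and that the search targets the data field: with rewrite=true, the explanation in the response shows the Lucene query that would be executed, so it is possible to see which term or prefix the wildcard ends up as.

```console
GET /my-index/_validate/query?rewrite=true
{
  "query": {
    "query_string": {
      "query": "focus*",
      "default_field": "data",
      "analyze_wildcard": true
    }
  }
}
```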

cphramington · November 7, 2023, 9:49pm

Just to be sure, I tested the search on a single field and the results were the same.

cphramington · November 8, 2023, 10:54pm

The explanation was right in front of me the whole time. "analyze_wildcard" does, in fact, analyze the wildcard.
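
For anyone landing here later, a sketch of the behavior I originally expected. As far as I can tell, with analyze_wildcard set to true the text of the pattern is passed through the analyzer, so "focus*" is stemmed to "focu*", and that prefix matches both indexed terms, "focu" and "focus". Nothing here requires reading the source document; the match happens against the inverted index. Leaving analyze_wildcard at false (the default) should keep the pattern unstemmed, so "focus*" matches only the term "focus", i.e. documents containing "focused". The index name my-index is hypothetical, and data is the field from the mapping above.

```console
GET /my-index/_search
{
  "query": {
    "query_string": {
      "query": "focus*",
      "default_field": "data",
      "analyze_wildcard": false
    }
  }
}
```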

