Why is a wildcard query string matching on stemmed terms?

First, some background. I understand that the algorithmic stemmer is not perfect, e.g. "focused" is stemmed to "focus," while "focus" is stemmed to "focu," which I've validated by looking through the term vectors.
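For anyone who wants to reproduce the stemming without digging through term vectors, here is a minimal sketch using the `_analyze` API. It assumes the same whitespace tokenizer with the `lowercase` and `porter_stem` filters as the index analyzer shown later in the thread. The sample text is just an illustration.

```json
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase", "porter_stem"],
  "text": "focus focused"
}
```

Going by the term vectors described above, the two tokens returned should be "focu" and "focus", which are the terms that actually end up in the inverted index.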

However, when I execute queries like:

GET /_search
{
  "query": {
    "query_string": {
      "query": "focus*",
      "analyze_wildcard": true,
      "allow_leading_wildcard": true
    }
  }
}

across documents with explicitly mapped "text" fields that contain "focus" and "focused," I get hits for both terms.

I assumed that since the wildcard contents wouldn't be analyzed, I would only get results for "focused," since "focu" wouldn't match in the inverted index.

The only conclusion I can draw is that the query is searching the entire source document rather than solely the inverted index, but I haven't come across any documentation confirming this.

This leaves me with the following questions:

  1. How do query string queries containing wildcards search for matches?
  2. If the query is doing a full sweep of the source documents, is there a way to disable this behavior and only utilize the inverted index?

Please do not comment on the inefficiency of wildcard queries; I'm here purely to understand how the operation works.

You did not specify a default_field, so I believe all fields in the document that support term queries are searched. In cases like this, it always helps to also provide the mapping you are using.

Yes, I want to search across all searchable fields of all documents, but doesn't that mean searching across the inverted index, NOT the source document? This distinction is what I'm trying to resolve.

This is not the exact index, but is representative of the problem:

{
  "settings": {
    "index": {
      "number_of_shards": 2,
      "number_of_replicas": 0
    },
    "analysis": {
      "analyzer": {
        "default": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "porter_stem"
          ]
        }
      }
    }
  },
  "mappings": {
    "dynamic": false,
    "properties": {
      "data": {
        "type": "text"
      }
    }
  }
}

Just to be sure, I tested the search on a single field and the results were the same.

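The explanation was right in front of me the whole time. "analyze_wildcard" does, in fact, analyze the wildcard term. With it set to true, "focus*" is run through the stemmer and becomes "focu*" before the lookup, and that prefix matches both "focu" and "focus" in the inverted index. The source document is never scanned.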
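A minimal sketch of how this could be checked, assuming the same index and analyzer as above. The first request reruns the search with "analyze_wildcard" set to false (its documented default). As far as I understand it, the stemmer is then not applied to the wildcard term, so "focus*" should only match the indexed term "focus" (from "focused") and not "focu". The second request uses the validate API with explain enabled to show what the query is turned into. I have not reproduced its exact output here.

```json
GET /_search
{
  "query": {
    "query_string": {
      "query": "focus*",
      "analyze_wildcard": false
    }
  }
}

GET /_validate/query?explain=true
{
  "query": {
    "query_string": {
      "query": "focus*",
      "analyze_wildcard": true
    }
  }
}
```

If the explanation from the second request shows a prefix on "focu" rather than "focus", that would confirm the behavior described above.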
