Keyword subfield mapping causes unexpected querying results

We have an index mapping with a lot of text fields. To be able to sort and filter on them, we added a keyword subfield with a lowercase normalizer. Here is a short part of our mapping:

{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "createdTime": {
        "type": "date",
        "format": "strict_date_optional_time||epoch_millis||basic_date"
      },
      "field1": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256,
            "normalizer": "lowercase"
          }
        }
      },
      "fileSize": {
        "type": "integer"
      }
    }
  }
}

We index four documents with these values in "field1":

  1. selection
  2. electron.jpg
  3. election
  4. ele cti on

Then we do a full-text search with this query:

POST <indexname>/_search
{
  "query": {
     "bool": {
       "must": [
         {
           "query_string": {
             "query": "ele*on"
           }
         }
       ]
    }
  }
}

But it returns incorrect results (expected are 2 and 3):

  • With a text field plus a keyword subfield, it returns documents 3 and 4
  • With only a text field (no subfield), it returns document 3
  • With only a text field that uses the built-in simple analyzer, it returns the expected results
  • With a text field that uses the built-in simple analyzer plus a keyword subfield, it returns documents 2, 3 and 4

What are we missing here? What options do we have?

Please note that we need to support sorting and filtering (which the keyword subfield provides) as well as full-text wildcard queries with an asterisk in the middle.

Welcome!

Please note that it could be a bad practice to use wildcards (Query DSL | Elasticsearch Guide [8.11] | Elastic)...

And normally, users don't enter wildcards in a search engine. I never do this in the Google search bar, for example. :wink:

Instead, you should look at the wildcard field type if you really want to use wildcards.
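As a rough sketch of what that could look like, the wildcard type can be added as another subfield and queried directly. The index name `wildcard-demo` and the subfield name `wildcard` are placeholders, not part of the mapping in the question. Keep in mind that, like a keyword, a wildcard field matches against the whole field value, so `ele*on` should match `election` and `ele cti on` but not `electron.jpg`:

```json
PUT wildcard-demo
{
  "mappings": {
    "properties": {
      "field1": {
        "type": "text",
        "fields": {
          "wildcard": { "type": "wildcard" }
        }
      }
    }
  }
}

GET wildcard-demo/_search
{
  "query": {
    "wildcard": {
      "field1.wildcard": {
        "value": "ele*on",
        "case_insensitive": true
      }
    }
  }
}
```

The `case_insensitive` flag is used here because the wildcard field stores values as-is; whether you need it depends on your data.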

"But it returns incorrect results (expected are 2 and 3):"

ele*on matches ele cti on IMO... But I understand what you mean. You want to compare full terms, right? So you want to compare ele*on with selection, electron.jpg, election, ele, cti and on, right?

So you need to find an analyzer which does exactly this. I'd use a custom analyzer and use the _analyze API to better understand how to build the right one for your use case. See Test an analyzer | Elasticsearch Guide [8.8] | Elastic.
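For instance, the two requests below compare the built-in standard and simple analyzers on two of the sample values. This is only an illustration of the _analyze API, not a recommendation for a specific analyzer:

```json
POST _analyze
{
  "analyzer": "standard",
  "text": ["electron.jpg", "ele cti on"]
}

POST _analyze
{
  "analyzer": "simple",
  "text": ["electron.jpg", "ele cti on"]
}
```

As far as I know, the standard analyzer keeps `electron.jpg` as a single token, while the simple analyzer splits on every non-letter and produces `electron` and `jpg`. That would explain why `ele*on` only matches document 2 when the simple analyzer is used.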

I'd recommend looking at ngrams instead of using wildcards.
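A minimal sketch of an ngram subfield, assuming trigrams are a reasonable size for your data. The index name `ngram-demo` and the names `trigram_tokenizer`, `trigram_analyzer` and `field1.ngram` are made up for this example:

```json
PUT ngram-demo
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "trigram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "trigram_analyzer": {
          "type": "custom",
          "tokenizer": "trigram_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "field1": {
        "type": "text",
        "fields": {
          "ngram": { "type": "text", "analyzer": "trigram_analyzer" }
        }
      }
    }
  }
}
```

`min_gram` and `max_gram` are kept equal here so the default `index.max_ngram_diff` limit is not exceeded. Note that search terms shorter than three characters (such as `on`) produce no trigrams with this setup, so the gram sizes would need tuning for real queries.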

Thanks for the quick response!

We'd read about that, but for now we decided to start this way, since we are migrating from Azure Search and use a similar approach there (Azure Search is also built on top of the Lucene engine). For other scenarios (including trailing and leading wildcard querying), everything works fine.

Regarding the wildcard field type: as far as I understand, we can't do a full-text search with this field, and we have to name a specific field in the wildcard query?

GET /_search
{
  "query": {
    "wildcard": {
      "user.id": {
        "value": "ki*y",
        "boost": 1.0,
        "rewrite": "constant_score"
      }
    }
  }
}

You got it right! As I mentioned, we tried a simple analyzer for a text field only. This works as expected (search response contains electron.jpg and election):

{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "createdTime": {
        "type": "date",
        "format": "strict_date_optional_time||epoch_millis||basic_date"
      },
      "field1": {
        "type": "text",
        "analyzer": "simple"
      },
      "fileSize": {
        "type": "integer"
      }
    }
  }
}

But as soon as we add a keyword subfield, the search response starts returning electron.jpg, election, and also ele cti on. We found this odd, since we expected the keyword subfield to be handled separately from the main text field.

Thanks for suggesting ngrams! Will we be able to support our scenarios with them (both full-text and search within a specific field)?

I think (from what I recall) that Azure Search was actually built on top of Elasticsearch. But that's another story :wink:.

"we can't do a full-text search with this field, we have to add a specific field in wildcard query?"

Indeed. So normally I recommend doing multiple searches at the same time. Combining scores between partial match and exact match is normally super helpful for the end users. See the following script as an idea:

"Will we be able to support our scenarios with them (both full-text and search within a specific field)?"

Yes I believe so with the above strategy ^^^ :slight_smile:
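On the keyword subfield behaviour described earlier in the thread: a likely explanation (worth verifying on your cluster) is that query_string without `default_field` or `fields` searches every eligible field, and that includes `field1.keyword`. In the subfield the whole value `ele cti on` is a single term, so `ele*on` matches it. Restricting the query to the analyzed field should let you keep the subfield for sorting and filtering without it affecting the wildcard search:

```json
POST <indexname>/_search
{
  "query": {
    "query_string": {
      "query": "ele*on",
      "fields": ["field1"]
    }
  },
  "sort": [
    { "field1.keyword": "asc" }
  ]
}
```

With many text fields, the `fields` list can name all of them, or the `index.query.default_field` index setting can be set to the same list so that each query does not have to repeat it.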

As far as I understand, your recommendation is to run a wildcard query against each field of our documents if we want to achieve a full-text search with wildcards. For example:

GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "wildcard": { "field1": { "value": "ele*on" } } },
        { "wildcard": { "field2": { "value": "ele*on" } } },
        { "wildcard": { "field3": { "value": "ele*on" } } }
      ]
    }
  }
}

Do I get it right?

Yeah. But I was thinking more of something like:

GET /_search
{
  "query": {
    "multi_match" : {
      "query": "ele on",
      "fields": [ "field1.keyword^3.0", "field1^2.0", "field1.ngram", "field1.phonetic" ]
    }
  }
}

But as you (really) want to use wildcards, I guess you have it right...

Thank you for your quick and detailed responses! I think it will help us.
