Case-insensitive prefix query: normalizer or analyzer? (NEST inside)

Hi,

I'm having a hard time figuring out which kind of settings I should use. My needs are basic: I have a text field (literally a phrase) that I should be able to query with a prefix ("starts with") in a case-insensitive way.

e.g.
productDescription = "12 tons of happyness"
productDescription = "12 TONS of smiles"
productDescription = "12 Tons of kiss"

Using the query

    GET myindex/_search
    {
      "query": 
      {
        "prefix": {
          "productDescription": {
            "value": "12 tons"
          }
        }
      }
    }

=> I expect to get all three records.

I managed to achieve this using a normalizer, following Case insensitive sort doesn't work (in my case using the .NET NEST library: https://stackoverflow.com/questions/59714787/elasticsearch-7-x-case-insensitive-sorting-using-normalizer), but is it the correct way?

Should I use a custom analyzer instead, and why? (https://www.elastic.co/guide/en/elasticsearch/client/net-api/current/writing-analyzers.html)

I'm a bit unsure which one to choose.

Could someone help me?
Thank you!

This is a bit of a long answer, because you're hitting on some fundamental Elasticsearch concepts. :slight_smile:

Elasticsearch has two types of string field: text and keyword. The main difference between the two is that text fields get analyzed, and as a result you can use those fields for full-text query features like case-insensitive search. keyword fields on the other hand are not analyzed. As a result, keyword fields are typically used for exact, case-sensitive searches.
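You can see what analysis does to a string by running it through the _analyze API. Just as an illustration, here is the default standard analyzer (your field may be configured with a different one):

POST _analyze
{
  "analyzer": "standard",
  "text": "12 TONS of smiles"
}

This returns the tokens 12, tons, of and smiles: the input is split into terms and lowercased, which is exactly why full-text queries on text fields can match case-insensitively.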

As often with Elasticsearch, there are multiple ways to solve your requirement. But your options depend on whether the productDescription field is a text or keyword field in your index's mapping.

If productDescription is a text field (default), you could use a custom analyzer to create a single lower-cased token, and use a prefix query on that:

PUT myindex
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  }, 
  "mappings": {
    "properties": {
      "productDescription": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

POST myindex/_doc
{
  "productDescription": "12 tons of happyness"
}

POST myindex/_doc
{
  "productDescription": "12 TONS of smiles"
}

POST myindex/_doc
{
  "productDescription": "12 Tons of kiss"
}

GET myindex/_search
{
  "query": {
    "prefix": {
      "productDescription": {
        "value": "12 tons"
      }
    }
  }
}

However, something to be aware of is that the prefix query is a term-level query, and term-level queries do not analyze their search terms. So this solution only works if you know for certain that the query will always be lower-cased. The following request fails to find the documents, for example:

GET myindex/_search
{
  "query": {
    "prefix": {
      "productDescription": {
        "value": "12 Tons"
      }
    }
  }
}

For that reason, I'd say that mapping the productDescription as a keyword field and applying a normalizer would be the better option. Now, your query will be truly case-insensitive:

PUT myindex
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "filter": ["lowercase"]
        }
      }
    }
  }, 
  "mappings": {
    "properties": {
      "productDescription": {
        "type": "keyword",
        "normalizer": "my_normalizer"
      }
    }
  }
}

POST myindex/_doc
{
  "productDescription": "12 tons of happyness"
}

POST myindex/_doc
{
  "productDescription": "12 TONS of smiles"
}

POST myindex/_doc
{
  "productDescription": "12 Tons of kiss"
}

GET myindex/_search
{
  "query": {
    "prefix": {
      "productDescription": {
        "value": "12 tons"
      }
    }
  }
}

GET myindex/_search
{
  "query": {
    "prefix": {
      "productDescription": {
        "value": "12 Tons"
      }
    }
  }
}

By the way, all of this will become much easier in the next version of Elasticsearch, 7.10. The prefix query will get a case_insensitive parameter. You will then be able to query keyword fields case-insensitively with a prefix query, without the need for a normalizer.
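For illustration, a query against a plain keyword field would then look something like this (a sketch based on the announced 7.10 API; the parameter name may differ until the release is final):

GET myindex/_search
{
  "query": {
    "prefix": {
      "productDescription": {
        "value": "12 Tons",
        "case_insensitive": true
      }
    }
  }
}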

@abdon

You're awesome! I finally ended up with the same solution (using a normalizer), and I'm glad to learn that the next version of Elasticsearch will handle that for us.

Using the normalizer led me to another "problem", for which I opened another topic: Use an analyzer and a normalizer at the same time on the same field?

If you have some time left, could you take a look?

Many thanks for your help!