Exact match with case insensitivity

animageofmine · June 15, 2017, 3:05pm

I have seen tons of articles on this, most of them for version 2.4. Since ES introduced keyword and text starting with version 5.0, I wanted to check whether there is a way to search on a keyword field case-insensitively.

Let's take an example. Following is my mapping:

PUT caseinsensitive
{
  "mappings": {
    "mytype": {
      "properties": {        
        "exact_value": {
          "type":  "keyword" 
        }
      }
    }
  }
}

Add a document to the index

PUT caseinsensitive/mytype/1
{
  "exact_value": "Quick Foxes!"  
}

Search. This will not return any results because "q" and "f" are lowercase in the query.

GET caseinsensitive/mytype/_search
{
  "query": {
    "term": {
      "exact_value": "quick foxes!" 
    }
  }
}

I have two questions:

1. Is there a way to use keyword in the mapping and make the search case insensitive? I do expect everything else to match, except the case.
2. Is there a way to search without case sensitivity and without changing the current mapping? Otherwise we would have to rebuild all the indices.

One use case is that, if I perform a terms aggregation, I don't want two different buckets for the same term with different cases (e.g. Seattle and seattle).

I found this article as one solution. Is that the best way? I believe this wouldn't use fielddata, so the performance impact is minimal. Please let me know.

Thanks so much.

dadoonet · June 15, 2017, 3:29pm

You can use a normalizer to lowercase the field.

Or you can index the same field multiple times: once as a keyword, and once as text so you can perform full-text search on it.
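
As a rough sketch of how the two suggestions could be combined (an assumption, not something confirmed in this thread): keep the field as a plain keyword so the original casing is preserved, and add a sub-field with a lowercase normalizer for case-insensitive exact matching. The index name `caseinsensitive_multi`, the normalizer name `lowercase_normalizer`, and the sub-field name `lower` are made up for illustration, and normalizers need Elasticsearch 5.2 or later.

```
PUT caseinsensitive_multi
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_normalizer": {
          "type": "custom",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "mytype": {
      "properties": {
        "exact_value": {
          "type": "keyword",
          "fields": {
            "lower": {
              "type": "keyword",
              "normalizer": "lowercase_normalizer"
            }
          }
        }
      }
    }
  }
}

GET caseinsensitive_multi/mytype/_search
{
  "query": {
    "match": {
      "exact_value.lower": "quick foxes!"
    }
  }
}
```

With a mapping like this, aggregating on `exact_value` would still return the original casing, at the cost of separate buckets for values that differ only by case.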

animageofmine · June 15, 2017, 4:21pm

Thank you. Looks like normalizer is a good option.

My understanding is that, if I use keyword with a normalizer, ES wouldn't use fielddata. Can you please confirm?

dadoonet · June 15, 2017, 4:59pm

Yes. But in any case, Elasticsearch is not going to use fielddata for text fields by default either.
I'd probably prefer having one field for aggregation/sorting and another for full-text search.

But it depends on your use case, I guess.

animageofmine · June 15, 2017, 5:44pm

Use case is exact match with case insensitivity. So, I think we can skip the text part for now.

When is fielddata used then? Only when the data is analyzed? Every time I read about fielddata, I forget it. It's definitely my problem, but I can blame it on the versioning difference (2.4 vs 5.x).

dadoonet · June 15, 2017, 8:02pm

In 5.x fielddata is only available if you explicitly set it on text fields.
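
For illustration only, opting in might look like the following in 5.x. The `description` field is a hypothetical name, not part of the mapping discussed in this thread.

```
PUT caseinsensitive/_mapping/mytype
{
  "properties": {
    "description": {
      "type": "text",
      "fielddata": true
    }
  }
}
```

Keyword fields do not need this setting, since they use doc values for aggregations and sorting.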

animageofmine · June 19, 2017, 2:22pm

Sounds good. Thank you so much for the clarification.

animageofmine · July 13, 2017, 3:19pm

@dadoonet

Reviving the thread again. Looks like lowercasing while indexing wouldn't work for us. We use the keywords for aggregations and searching.

Basically, we want to index the values as they are (no lowercasing), but search case-insensitively. Something like grep -i "Seattle". How can this be done in ES? See an example below:

WITH A LOWERCASE NORMALIZER

Index Mapping

PUT caseinsensitive
{
  "settings": {      
    "analysis": {
      "normalizer": {
        "lowercase_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": ["lowercase"]
        }
      }     
    }
  },
  "mappings": {
    "mytype": {
      "properties": {
        "city": {
          "type": "keyword",
          "doc_values": true,
          "normalizer": "lowercase_normalizer"
        }
      }
    }
  }
}

Index data

PUT caseinsensitive/mytype/1
{
    "city": "New York"
}

PUT caseinsensitive/mytype/2
{
    "city": "new York"
}

PUT caseinsensitive/mytype/3
{
    "city": "Seattle"
}

Terms Aggregation

GET caseinsensitive/_search
{
    "size": 0,
    "aggs" : {
        "cities" : {
            "terms" : { "field" : "city" }
        }
    }
}

Actual response payload below. The problem here is that everything is lowercased. If we expose this to customers, it is annoying and looks like we are not respecting their data. Imagine country names not starting with a capital letter.

{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "cities": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "new york",
               "doc_count": 2
            },
            {
               "key": "seattle",
               "doc_count": 1
            }
         ]
      }
   }
}

Expected response payload:

{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "cities": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "New York OR new York (actual casing preserved while indexing)",
               "doc_count": 2
            },
            {
               "key": "Seattle",
               "doc_count": 1
            }
         ]
      }
   }
}

dadoonet · July 25, 2017, 3:48pm

If you preserve casing when building doc values, you will end up with something like:

"New York": 1
"new York": 1

Instead of

"New York": 2

Which is probably wrong.
You need to find a way to normalize your data, one way or another.

I don't think there is any existing token filter which can do that.
Maybe you can use an ingest Painless script to transform your data at index time into something normalized, for example from new york, NEW YORK, New York to New York.
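
A minimal, untested sketch of what such an ingest pipeline could look like, assuming the `city` field from the example above. The pipeline name `normalize_city` and the title-casing rule (uppercase the first letter after each space, lowercase the rest) are illustrative assumptions. The `inline` key matches the 5.x script processor; later versions use `source` instead.

```
PUT _ingest/pipeline/normalize_city
{
  "description": "Title-case the city field before indexing (illustrative sketch)",
  "processors": [
    {
      "script": {
        "lang": "painless",
        "inline": "if (ctx.city != null) { String s = ctx.city.toLowerCase(); StringBuilder sb = new StringBuilder(); boolean startOfWord = true; for (int i = 0; i < s.length(); i++) { String ch = s.substring(i, i + 1); sb.append(startOfWord ? ch.toUpperCase() : ch); startOfWord = ch.equals(' '); } ctx.city = sb.toString(); }"
      }
    }
  ]
}

PUT caseinsensitive/mytype/2?pipeline=normalize_city
{
  "city": "new York"
}
```

With this kind of index-time normalization, the keyword field would store "New York" for both documents, so a terms aggregation should return a single bucket with readable casing. Search input would need the same normalization applied before querying.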

