Exact match with case insensitivity


(Animageofmine) #1

I have seen tons of articles on this, most of them for version 2.4. Since ES introduced the keyword and text types in version 5.0, I wanted to check whether there is a way to search on a keyword field case-insensitively.

Let's take an example. Following is my mapping:

PUT caseinsensitive
{
  "mappings": {
    "mytype": {
      "properties": {        
        "exact_value": {
          "type":  "keyword" 
        }
      }
    }
  }
}

Add a document to the index

PUT caseinsensitive/mytype/1
{
  "exact_value": "Quick Foxes!"  
}

Search. This does not return any results, because "q" and "f" are lowercase in the query but uppercase in the indexed value.

GET caseinsensitive/mytype/_search
{
  "query": {
    "term": {
      "exact_value": "quick foxes!" 
    }
  }
}

I have two questions:

  1. Is there a way to use keyword in the mapping and make the search case insensitive? I do expect everything else to match, except the case.
  2. Is there a way to search without case sensitivity and without changing the current mapping? Otherwise we would have to rebuild all the indices.

One use case: if I perform a terms aggregation, I don't want two different buckets for the same term with different casing (e.g. Seattle and seattle).

I found this article as one solution. Is that the best way? I believe this wouldn't use fielddata, so the performance impact should be minimal. Please let me know.

Thanks so much.


(David Pilato) #2

You can use a normalizer to lowercase the field.
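
A minimal sketch of the normalizer option for the exact_value field from the first post, assuming Elasticsearch 5.2 or later (where normalizers are available). The normalizer name lowercase_normalizer is only an illustrative choice. Because the normalizer is also applied at search time for a match query, the lowercase query should find the "Quick Foxes!" document:

PUT caseinsensitive
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_normalizer": {
          "type": "custom",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "mytype": {
      "properties": {
        "exact_value": {
          "type": "keyword",
          "normalizer": "lowercase_normalizer"
        }
      }
    }
  }
}

GET caseinsensitive/mytype/_search
{
  "query": {
    "match": {
      "exact_value": "quick foxes!"
    }
  }
}

As far as I know, a normalizer cannot be added to an existing field in place, so an index that already exists would have to be recreated and reindexed.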

Or you can index the same field multiple times. Once as a keyword. Another time as a text so you can perform full text search on it.
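
A rough sketch of the multi-field option, with a hypothetical sub-field name full_text. The top-level keyword keeps the exact value for aggregations and sorting, while the text sub-field is analyzed (and lowercased by the default standard analyzer) for full-text search:

PUT caseinsensitive
{
  "mappings": {
    "mytype": {
      "properties": {
        "exact_value": {
          "type": "keyword",
          "fields": {
            "full_text": {
              "type": "text"
            }
          }
        }
      }
    }
  }
}

GET caseinsensitive/mytype/_search
{
  "query": {
    "match": {
      "exact_value.full_text": "quick foxes"
    }
  }
}

Keep in mind that a match query on the text sub-field matches individual words, so this is case-insensitive full-text search rather than an exact match on the whole value.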


(Animageofmine) #3

Thank you. Looks like normalizer is a good option.

My understanding is that, if I use keyword with a normalizer, ES wouldn't use fielddata. Can you please confirm?


(David Pilato) #4

Yes. But anyway elasticsearch is not going to use fielddata for text either.
I'd probably prefer having one field for aggregation/sorting and another for full text search.

But it depends on your use case, I guess.


(Animageofmine) #5

Use case is exact match with case insensitivity. So, I think we can skip the text part for now.

When is fielddata used then? Only when the data is analyzed? Every time I read about fielddata, I forget it. It's definitely my problem, but I can blame it on the version differences (2.4 vs 5.x). :)


(David Pilato) #6

In 5.x fielddata is only available if you explicitly set it on text fields.
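
For illustration only, this is roughly what explicitly enabling it looks like on a 5.x text field. The field name description is hypothetical and not part of the mappings in this thread:

PUT caseinsensitive/_mapping/mytype
{
  "properties": {
    "description": {
      "type": "text",
      "fielddata": true
    }
  }
}

Keyword fields use doc values for aggregations and sorting instead, so none of this applies to a keyword field with a normalizer.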


(Animageofmine) #7

Sounds good. Thank you so much for the clarification.


(Animageofmine) #8

@dadoonet

Reviving the thread again. Looks like lowercasing while indexing wouldn't work for us. We use the keywords for aggregations and searching.

Basically, we want to index the values as they are (no lowercasing), but search case-insensitively. Something like grep -i "Seattle". How can this be done in ES? See an example below:

WITH A LOWERCASE NORMALIZER

Index Mapping

PUT caseinsensitive
{
  "settings": {      
    "analysis": {
      "normalizer": {
        "lowercase_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": ["lowercase"]
        }
      }     
    }
  },
  "mappings": {
    "mytype": {
      "properties": {
        "city": {
          "type": "keyword",
          "doc_values": true,
          "normalizer": "lowercase_normalizer"
        }
      }
    }
  }
}

Index data

PUT caseinsensitive/mytype/1
{
    "city": "New York"
}

PUT caseinsensitive/mytype/2
{
    "city": "new York"
}

PUT caseinsensitive/mytype/3
{
    "city": "Seattle"
}

Terms Aggregation

GET caseinsensitive/_search
{
    "size": 0,
    "aggs" : {
        "cities" : {
            "terms" : { "field" : "city" }
        }
    }
}

Actual response payload below. The problem here is that everything is lowercased. If we expose this to customers, it is annoying and looks like we are not respecting their data. Imagine country names not starting with a capital letter.

{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "cities": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "new york",
               "doc_count": 2
            },
            {
               "key": "seattle",
               "doc_count": 1
            }
         ]
      }
   }
}

Expected response payload:

{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "cities": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "New York OR new York (actual casing preserved while indexing)",
               "doc_count": 2
            },
            {
               "key": "Seattle",
               "doc_count": 1
            }
         ]
      }
   }
}

(David Pilato) #9

If you preserve casing when building doc values, you will end up with something like:

"New York": 1
"new York": 1

Instead of

"New York": 2

Which is probably wrong.
You need to find a way to normalize your data, one way or another.

I don't think there is any existing token filter which can do that.
Maybe you can use an ingest Painless script to transform your data at index time into something normalized, e.g. from new york, NEW YORK, New York to New York...
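
Another way to get there without rewriting the stored values, sketched here under assumptions rather than taken from anything tested in this thread: keep city as a plain keyword and add a normalized sub-field (hypothetically named lower), then group on the sub-field and use a nested terms aggregation on the raw field to pick one original casing for display. This assumes Elasticsearch 5.2 or later for the normalizer.

PUT caseinsensitive
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_normalizer": {
          "type": "custom",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "mytype": {
      "properties": {
        "city": {
          "type": "keyword",
          "fields": {
            "lower": {
              "type": "keyword",
              "normalizer": "lowercase_normalizer"
            }
          }
        }
      }
    }
  }
}

GET caseinsensitive/_search
{
  "size": 0,
  "aggs": {
    "cities": {
      "terms": { "field": "city.lower" },
      "aggs": {
        "original_casing": {
          "terms": { "field": "city", "size": 1 }
        }
      }
    }
  }
}

Each bucket under cities is then keyed by the lowercased value with the combined doc_count, and its original_casing sub-bucket carries the most frequent original spelling, which is what you would show to the customer. Searches can target city.lower with a match query to ignore case.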

