Case insensitive sort doesn't work

shwetaskatdare · August 6, 2018, 4:36pm

Elasticsearch version (bin/elasticsearch --version): 5.3

JVM version (java -version): java 10.0.1 2018-04-17

Description of the problem including expected versus actual behavior:
Hello,
I am trying to achieve case insensitive sort using lowercase analyzer and keyword field. However, the result is not as expected. I am hesitant to use normalizers because of it being an experimental feature. I've tried using default analyzer and also explicitly setting standard analyzer. There was a similar issue https://github.com/elastic/elasticsearch/issues/22410, but it wasn't much help. Can you please help clarify if what I expect in the results is the correct ES behavior? if not, how can I correctly achieve case insensitive sorting?

Appreciate any help.

Steps to reproduce:

Create Index

PUT http://localhost:9200/testindex
{
  "settings": {
    "analysis": {
      "analyzer": {
        "case_insensitive": {
               "tokenizer": "lowercase"
        	}
    	}
    }
  }
}

Mapping

PUT http://localhost:9200/testindex/_mapping/testmapping
{
    "properties": {
      "Id": {
        "type": "keyword"
      },
      "Name": {
        "type": "text",
        "analyzer": "case_insensitive",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
}

Documents

PUT http://localhost:9200/testindex/testmapping/1
{
	"Id": 1,
	"Name": "III-bbb"
}

PUT http://localhost:9200/testindex/testmapping/8
{
	"Id": 8,
	"Name": "III-ccc"
}

PUT http://localhost:9200/testindex/testmapping/2
{
	"Id": 2,
	"Name": "iii-aaa"
}

Search query

POST http://localhost:9200/testindex/testmapping/_search
{
	"query": {
		"match": {
			"Name": "iii"
		}
	},
	"sort": [
		{
			"Name.keyword": {
				"order": "asc"
			}
		}]
}

Actual search result

{
    "took": 3,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 3,
        "max_score": null,
        "hits": [
            {
                "_index": "testindex",
                "_type": "testmapping",
                "_id": "1",
                "_score": null,
                "_source": {
                    "Id": 1,
                    "Name": "III-bbb"
                },
                "sort": [
                    "III-bbb"
                ]
            },
            {
                "_index": "testindex",
                "_type": "testmapping",
                "_id": "8",
                "_score": null,
                "_source": {
                    "Id": 8,
                    "Name": "III-ccc"
                },
                "sort": [
                    "III-ccc"
                ]
            },
            {
                "_index": "testindex",
                "_type": "testmapping",
                "_id": "2",
                "_score": null,
                "_source": {
                    "Id": 2,
                    "Name": "iii-aaa"
                },
                "sort": [
                    "iii-aaa"
                ]
            }
        ]
    }
}

Expected result:
Order: Id: 2, Id: 1, & Id: 8

Based on the _analyze API result the text is being tokenized in lowercase as follow, so the results should be ordered alphabetically and not lexicographically

POST http://localhost:9200/testindex/_analyze
{
	"field": "testmapping.Name",
	"text": "IIII-bbb"
}

Result: 
{
    "tokens": [
        {
            "token": "iiii",
            "start_offset": 0,
            "end_offset": 4,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "bbb",
            "start_offset": 5,
            "end_offset": 8,
            "type": "<ALPHANUM>",
            "position": 1
        }
    ]
}

nourtemp · August 11, 2018, 7:37pm

Just following this question, as I have the exact problem.

klof · August 12, 2018, 6:28am

The keyword version of a field is not analyzed, meaning you will be able to search, sort and aggregate only on exact values.
Your logic implies to sort on the analyzed version of the field name (Name), but when you sort or aggregate on an analyzed field, you get an error message, that requires the field to enable fielddata :

https://www.elastic.co/guide/en/elasticsearch/reference/current/fielddata.html#_enabling_fielddata_on_literal_text_literal_fields

PUT testindex/_mapping/testmapping
{
  "properties": {
    "Id": {
      "type": "keyword"
    },
    "Name": {
      "type": "text",
      "analyzer": "case_insensitive",
      "fielddata": true
    }
  }
}

Then, your search with the sort on the analyzed field will give you the expected result :

POST testindex/testmapping/_search
{
  "query": {
    "match": {
      "Name": "iii"
    }
  },
  "sort": [
    {
      "Name": {
        "order": "asc"
      }
    }
  ]
}

However, be aware that enable fielddata can be expensive, as mentioned in the documentation, and maybe it would be better to modify your documents (create a new field "name_lowercase", with the keyword type) then sort on this field.

https://www.elastic.co/guide/en/elasticsearch/reference/current/fielddata.html#_fielddata_is_disabled_on_literal_text_literal_fields_by_default

nourtemp · August 12, 2018, 2:31pm

That's actually what I want (Assuming is the same what the question owner wants).
I want to sort based on the exact values, not the analyzed ones.
That's why I sort using "Name.keyword" not "Name" which I believe the correct way to do the non-expensive sorting.

If not, what is the optimum way to sort case-insensitively on a text field (like Person Full Name) ?

Right?

klof · August 13, 2018, 5:47am

I'm not sure, i understood he wanted to sort, the analyzed values (names in lowercase).

However, you're right, it's better to sort on keyword values (case-sensitive), but be aware that uppercase letters will be sorted before lowercase letters.
You can debug it with your programming language. if you sort an array with values :

["a", "b",  "B", "C", "c", "A"] 
the result is : 
["A", "B", "C", "a", "b", "c"]

In your case if your data are properly formatted there's no issue to sort by keyword, otherwise I think you should create a new field (name_lowercase) just for this purpose.

shwetaskatdare · August 16, 2018, 5:03pm

thanks for your input @klof and @nourtemp. I want to sort on the not analyzed values (or exact value) which is why I am sorting on Name.keyword (if you see the query). I do not want to enable the fielddata because of what's mentioned in the document. Looking for a non expensive way of sorting

abdon · August 16, 2018, 9:18pm

To sort case-insensitively on a keyword field, you can apply a normalizer to that field. You can think of normalizers as analyzers for keyword fields instead of text fields. So, instead of defining an analyzer in your settings, define a normalizer:

PUT /testindex
{
  "settings": {
    "analysis": {
      "normalizer": {
        "case_insensitive": {
          "filter": "lowercase"
        }
      }
    }
  }
}

Next, apply that normalizer to your keyword field in your mapping:

PUT /testindex/_mapping/testmapping
{
  "properties": {
    "Id": {
      "type": "keyword"
    },
    "Name": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "normalizer": "case_insensitive"
        }
      }
    }
  }
}

Now, you will be able to sort case-insensitively on Name.keyword without using fielddata.

nourtemp · September 11, 2018, 10:48am

Hello abdon
Thanks for you reply.
Actually, I'm using NEST as client as my project is .Net based.
AFAI, NEST doesn't support normalizers yet.
Correct me if I'm wrong.

abdon · September 11, 2018, 11:08am

What problem are you running into specifically? It seems NEST added support for normalizers back in February 2017. Are you on an old version?

system · October 9, 2018, 11:08am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Case insensitive sorting using normalizer Elasticsearch language-clients	4	7400	February 10, 2020
Case-insensitive sort Elasticsearch	3	1728	July 6, 2017
How to do case insensitive sort in ES 5.1+? Elasticsearch	2	2905	January 17, 2017
Sorting text fields case insensitive Elasticsearch	2	976	April 28, 2017
Case insensitive search by using query in java API Elasticsearch	7	5489	July 5, 2017

Case insensitive sort doesn't work

Related topics