Case insensitive sort doesn't work


(Shwetaskatdare) #1

Elasticsearch version (bin/elasticsearch --version): 5.3

JVM version (java -version): java 10.0.1 2018-04-17

Description of the problem including expected versus actual behavior:
Hello,
I am trying to achieve case insensitive sort using lowercase analyzer and keyword field. However, the result is not as expected. I am hesitant to use normalizers because of it being an experimental feature. I've tried using default analyzer and also explicitly setting standard analyzer. There was a similar issue https://github.com/elastic/elasticsearch/issues/22410, but it wasn't much help. Can you please help clarify if what I expect in the results is the correct ES behavior? if not, how can I correctly achieve case insensitive sorting?

Appreciate any help.

Steps to reproduce:

  1. Create Index
PUT http://localhost:9200/testindex
{
  "settings": {
    "analysis": {
      "analyzer": {
        "case_insensitive": {
               "tokenizer": "lowercase"
        	}
    	}
    }
  }
}
  1. Mapping
PUT http://localhost:9200/testindex/_mapping/testmapping
{
    "properties": {
      "Id": {
        "type": "keyword"
      },
      "Name": {
        "type": "text",
        "analyzer": "case_insensitive",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
}

  1. Documents
PUT http://localhost:9200/testindex/testmapping/1
{
	"Id": 1,
	"Name": "III-bbb"
}

PUT http://localhost:9200/testindex/testmapping/8
{
	"Id": 8,
	"Name": "III-ccc"
}

PUT http://localhost:9200/testindex/testmapping/2
{
	"Id": 2,
	"Name": "iii-aaa"
}

  1. Search query
POST http://localhost:9200/testindex/testmapping/_search
{
	"query": {
		"match": {
			"Name": "iii"
		}
	},
	"sort": [
		{
			"Name.keyword": {
				"order": "asc"
			}
		}]
}

  1. Actual search result
{
    "took": 3,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 3,
        "max_score": null,
        "hits": [
            {
                "_index": "testindex",
                "_type": "testmapping",
                "_id": "1",
                "_score": null,
                "_source": {
                    "Id": 1,
                    "Name": "III-bbb"
                },
                "sort": [
                    "III-bbb"
                ]
            },
            {
                "_index": "testindex",
                "_type": "testmapping",
                "_id": "8",
                "_score": null,
                "_source": {
                    "Id": 8,
                    "Name": "III-ccc"
                },
                "sort": [
                    "III-ccc"
                ]
            },
            {
                "_index": "testindex",
                "_type": "testmapping",
                "_id": "2",
                "_score": null,
                "_source": {
                    "Id": 2,
                    "Name": "iii-aaa"
                },
                "sort": [
                    "iii-aaa"
                ]
            }
        ]
    }
}
  1. Expected result:
    Order: Id: 2, Id: 1, & Id: 8

Based on the _analyze API result the text is being tokenized in lowercase as follow, so the results should be ordered alphabetically and not lexicographically

POST http://localhost:9200/testindex/_analyze
{
	"field": "testmapping.Name",
	"text": "IIII-bbb"
}

Result: 
{
    "tokens": [
        {
            "token": "iiii",
            "start_offset": 0,
            "end_offset": 4,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "bbb",
            "start_offset": 5,
            "end_offset": 8,
            "type": "<ALPHANUM>",
            "position": 1
        }
    ]
}


(Nour) #2

Just following this question, as I have the exact problem.


#3

The keyword version of a field is not analyzed, meaning you will be able to search, sort and aggregate only on exact values.
Your logic implies to sort on the analyzed version of the field name (Name), but when you sort or aggregate on an analyzed field, you get an error message, that requires the field to enable fielddata :

https://www.elastic.co/guide/en/elasticsearch/reference/current/fielddata.html#_enabling_fielddata_on_literal_text_literal_fields

PUT testindex/_mapping/testmapping
{
  "properties": {
    "Id": {
      "type": "keyword"
    },
    "Name": {
      "type": "text",
      "analyzer": "case_insensitive",
      "fielddata": true
    }
  }
}

Then, your search with the sort on the analyzed field will give you the expected result :

POST testindex/testmapping/_search
{
  "query": {
    "match": {
      "Name": "iii"
    }
  },
  "sort": [
    {
      "Name": {
        "order": "asc"
      }
    }
  ]
}

However, be aware that enable fielddata can be expensive, as mentioned in the documentation, and maybe it would be better to modify your documents (create a new field "name_lowercase", with the keyword type) then sort on this field.

https://www.elastic.co/guide/en/elasticsearch/reference/current/fielddata.html#_fielddata_is_disabled_on_literal_text_literal_fields_by_default


(Nour) #4

That's actually what I want (Assuming is the same what the question owner wants).
I want to sort based on the exact values, not the analyzed ones.
That's why I sort using "Name.keyword" not "Name" which I believe the correct way to do the non-expensive sorting.

If not, what is the optimum way to sort case-insensitively on a text field (like Person Full Name) ?

Right?


#5

I'm not sure, i understood he wanted to sort, the analyzed values (names in lowercase).

However, you're right, it's better to sort on keyword values (case-sensitive), but be aware that uppercase letters will be sorted before lowercase letters.
You can debug it with your programming language. if you sort an array with values :

["a", "b",  "B", "C", "c", "A"] 
the result is : 
["A", "B", "C", "a", "b", "c"]

In your case if your data are properly formatted there's no issue to sort by keyword, otherwise I think you should create a new field (name_lowercase) just for this purpose.


(Shwetaskatdare) #6

thanks for your input @klof and @nourtemp. I want to sort on the not analyzed values (or exact value) which is why I am sorting on Name.keyword (if you see the query). I do not want to enable the fielddata because of what's mentioned in the document. Looking for a non expensive way of sorting


(Abdon Pijpelink) #7

To sort case-insensitively on a keyword field, you can apply a normalizer to that field. You can think of normalizers as analyzers for keyword fields instead of text fields. So, instead of defining an analyzer in your settings, define a normalizer:

PUT /testindex
{
  "settings": {
    "analysis": {
      "normalizer": {
        "case_insensitive": {
          "filter": "lowercase"
        }
      }
    }
  }
}

Next, apply that normalizer to your keyword field in your mapping:

PUT /testindex/_mapping/testmapping
{
  "properties": {
    "Id": {
      "type": "keyword"
    },
    "Name": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "normalizer": "case_insensitive"
        }
      }
    }
  }
}

Now, you will be able to sort case-insensitively on Name.keyword without using fielddata.


(Nour) #8

Hello abdon
Thanks for you reply.
Actually, I'm using NEST as client as my project is .Net based.
AFAI, NEST doesn't support normalizers yet.
Correct me if I'm wrong.


(Abdon Pijpelink) #9

What problem are you running into specifically? It seems NEST added support for normalizers back in February 2017. Are you on an old version?


(system) #10

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.