Elasticsearch sorting issue


(Jacques Du Plessis) #1

Can someone please explain to me why I am getting such a random result when I sort my data.
I am just doing a simple query, trying to return all the documents filtered my a companyId sorting my lastName in desc order.

//Return all the employees from a specific company ordering by lastName asc | desc

GET employee-index-sorting
{
  "query": {
    "bool": {
      "filter": {
        "term": {
          "companyId": 3179
        }
      }
    }
  },
  "sort": [
    {
      "lastName.keyword": { <-- Should this be keyword? or not_analyzed
        "order": "desc"
      }
    }
  ]
}

The result looks as follows, but noticed that last names like van der Mescht and van Breda appears before Zwane and Zwezwe. Why would that be?

{
        "_index": "employee-index",
        "_type": "_doc",
        "_id": "637467",
        "_score": null,
        "_source": {
          "companyId": 3179,
          "firstName": "Name",
          "lastName": "van der Mescht",
        },
        "sort": [
          "van der Mescht"
        ]
      },
      {
        "_index": "employee-index",
        "_type": "_doc",
        "_id": "678335",
        "_score": null,
        "_source": {
          "companyId": 3179,
          "firstName": "Name3",
          "lastName": "van Breda",
        },
        "sort": [
          "van Breda"
        ]
      },
      {
        "_index": "employee-index",
        "_type": "_doc",
        "_id": "113896",
        "_score": null,
        "_source": {
          "companyId": 3179,
          "firstName": "Name2",
          "lastName": "Zwezwe",
        },
        "sort": [
          "Zwezwe"
        ]
      },
      {
        "_index": "employee-index",
        "_type": "_doc",
        "_id": "639639",
        "_score": null,
        "_source": {
          "companyId": 3179,
          "firstName": "Name1",
          "lastName": "Zwane",
        },
        "sort": [
          "Zwane"
        ]
      }

I am including my mappings as I think it might be the reason for it, but can't figure it out.

PUT employee-index-sorting
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {},
        "analyzer": {
          "keyword_analyzer": {
            "filter": [
              "lowercase",
              "asciifolding",
              "trim"
            ],
            "char_filter": [],
            "type": "custom",
            "tokenizer": "keyword"
          },
          "edge_ngram_analyzer": {
            "filter": [
              "lowercase"
            ],
            "tokenizer": "edge_ngram_tokenizer"
          },
          "edge_ngram_search_analyzer": {
            "tokenizer": "lowercase"
          }
        },
        "tokenizer": {
          "edge_ngram_tokenizer": {
            "type": "edge_ngram",
            "min_gram": 2,
            "max_gram": 5,
            "token_chars": [
              "letter"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "employeeId": {
          "type": "keyword"
        },
        "companyGroupId": {
          "type": "keyword"
        },
        "companyId": {
          "type": "keyword"
        },
        "number": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "preferredName": {
          "type": "text",
          "index": false
        },
        "firstName": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "middleName": {
          "type": "text",
          "index": false
        },
        "lastName": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "fullName": {
          "type": "text",
          "fields": {
            "keywordstring": {
              "type": "text",
              "analyzer": "keyword_analyzer"
            },
            "edgengram": {
              "type": "text",
              "analyzer": "edge_ngram_analyzer",
              "search_analyzer": "edge_ngram_search_analyzer"
            }
          },
          "analyzer": "standard"
        },
        "terminationDate": {
          "type": "date"
        },
        "companyName": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "email": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "idNumber": {
          "type": "text"
        },
        "description": {
          "type": "text",
          "index": false
        },
        "jobNumber": {
          "type": "keyword"
        },
        "frequencyId": {
          "type": "long"
        },
        "frequencyCode": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "frequencyAccess": {
          "type": "boolean"
        }
      }
    }
  }
}

I enabled "fielddata": true, on the text fields, no i think this is the wrong approch as i think it should just work fine with .keyword but anywayz i am still not getting the correct order. Seems like the Asc is working ok but the desc is not at all


#2

maybe this link will help you
https://www.elastic.co/guide/en/elasticsearch/guide/current/sorting-collations.html
the reason may be the word "van" start with a lowercase, but 'Zwezwe' is not


(Jacques Du Plessis) #3

Thanks for the response, turns out you were right. for a more in-depth answer please see SO link