Index sorting

Hi,

I have an index with the following settings, where index.sort.field is specified at index creation time to use two keyword fields, primary_id and secondary_id.

{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "analyzer": {
        "folding": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      },
      "normalizer": {
        "lowerasciinormalizer": {
          "type": "custom",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    },
    "index": {
      "sort.field": ["primary_id", "secondary_id"],
      "sort.order": ["asc", "asc"]
    }
  },
  "mappings": {
    "_doc": {
      "dynamic_templates": [
        {
          "string_as_keyword": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "keyword",
              "normalizer": "lowerasciinormalizer"
            }
          }
        }
      ],
      "properties": {
        "primary_id": {
          "type": "keyword",
          "normalizer": "lowerasciinormalizer"
        },
        "secondary_id": {
          "type": "keyword",
          "normalizer": "lowerasciinormalizer"
        }
        // other fields
      }
    }
  }
}

When I then search the index, for example with a term query and the default sort (or an explicit _doc sort), the results come back in a different order than when I sort explicitly on the primary_id field, which is the order I expected.

// This returns, for example, ABCD103921654 first, whereas sorting on primary_id returns ABCD100317418 first
{
  "from": 0,
  "size": 10,
  "sort": [
    {
      "_doc": {
        "order": "asc"
      }
    }
  ],
  "query": {
    "term": {
      "name": {
        "value": "Express"
      }
    }
  }
}

This is on a local ES instance with a single shard and no replicas.

Thanks

Hi!

Index sorting is intended to optimize the index for specific query patterns, not to provide a default sort field for queries against the index. You can find more detail about what kinds of queries index sorting can be used to optimize in this blog post.
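As a rough illustration of one query pattern that index sorting is meant to help with, the sketch below builds a top-N request whose sort matches the index sort exactly and turns off total hit counting, which, as far as I know, is what allows Elasticsearch to stop collecting early on each segment. The helper name top_n_body is made up for this example; the field names come from the settings posted above.

```python
import json


def top_n_body(term_field, term_value, size=10):
    """Build a search body whose sort matches the index sort exactly.

    top_n_body is a hypothetical helper for illustration only.
    """
    return {
        "size": size,
        # Assumption: without an exact total, each segment can stop
        # collecting once it has produced `size` documents.
        "track_total_hits": False,
        # Same fields and order as index.sort.field / index.sort.order.
        "sort": [{"primary_id": "asc"}, {"secondary_id": "asc"}],
        "query": {"term": {term_field: {"value": term_value}}},
    }


print(json.dumps(top_n_body("name", "Express"), indent=2))
```

The printed JSON can be sent as the body of a _search request. This only helps when you want the first N results; it does not make a full iteration over the index cheaper.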

The _doc sort order has no guarantees other than being the fastest order - this may be different from the index sort order for a number of reasons. If you want to guarantee that the returned results are ordered by a field (or set of fields), you'll need to specify which field to sort on in the query.
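One likely reason for the difference is that index sorting orders documents inside each segment, not across the whole shard, while _doc order simply walks the segments one after another. The toy simulation below is only a sketch of that idea, not of Elasticsearch internals: the two segments are hypothetical, and apart from the two IDs quoted in the question the values are invented.

```python
import heapq

# Hypothetical segments: each one is sorted by primary_id on its own.
segments = [
    ["ABCD103921654", "ABCD107000000"],  # older segment
    ["ABCD100317418", "ABCD105000000"],  # newer segment
]


def doc_order(segments):
    """Walk the segments one after another, like a plain doc-ID traversal."""
    return [doc for seg in segments for doc in seg]


def sorted_order(segments):
    """Merge the per-segment sorted runs into one globally sorted list."""
    return list(heapq.merge(*segments))


print(doc_order(segments)[0])     # first hit in _doc-like order
print(sorted_order(segments)[0])  # first hit when sorting on the field
```

In this sketch the _doc-like traversal starts with ABCD103921654, while the merged order starts with ABCD100317418, which matches what was observed. The merge step is the extra work an explicit sort has to do.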

Thanks Brown for your reply and the link.

I had misunderstood _doc sorting to be index sorting, i.e. the same order as defined in the "index.sort" settings when they are applied, so that sorting by _doc and sorting by the "index.sort.field" fields would give the same result. But this seems to be wrong.

The use case I'm trying to achieve is a guaranteed order by the fields defined in the index settings while being as fast as _doc sorting. This is especially needed when using the scroll API to paginate deep into a large index (~5 million documents).

Scroll requests have optimizations that make them faster when the sort order is _doc. If you want to iterate over all documents regardless of the order, this is the most efficient option.

Currently, scrolling the whole index sorted by _doc takes ~2 mins. The same scroll sorted by the primary_id index sort field takes ~4 mins. Is there any way to optimize this, or is there a better way to scroll in a given order?
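One alternative that may be worth measuring is search_after pagination with an explicit sort on the index sort fields. The sketch below is untested against a real cluster and makes no claim about being faster than scroll. The search argument is a hypothetical callable that takes a request body and returns the parsed _search response, and the code assumes that (primary_id, secondary_id) is unique per document; if it is not, a tiebreaker field would be needed in the sort.

```python
def iterate_sorted(search, page_size=1000):
    """Yield hits in (primary_id, secondary_id) order using search_after.

    `search` is a hypothetical callable: it takes a request body (dict)
    and returns the parsed JSON response of a _search call.
    """
    body = {
        "size": page_size,
        # Assumption: this pair of fields is unique per document.
        "sort": [{"primary_id": "asc"}, {"secondary_id": "asc"}],
        "query": {"match_all": {}},
    }
    while True:
        hits = search(body)["hits"]["hits"]
        if not hits:
            return
        yield from hits
        # Resume after the sort values of the last hit on this page.
        body["search_after"] = hits[-1]["sort"]
```

Unlike scroll, this does not keep a snapshot of the index between requests, so documents indexed or deleted while iterating can affect what is returned.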

I know this is not an ideal use case for ES, but this is a user requirement we need to provide in our system.

Hi,

Can you please explain how/why _doc would be different from index sort?

Currently, scrolling the whole index (~5 million documents) sorted by _doc takes ~2 mins. The same scroll sorted by the primary_id index sort field takes ~4 mins.
Is there any way to optimize this, or is there a better way to scroll in a given order?

Regards
