Terms aggregation bucket Sorting is not working when using with partition

I am trying to do TermAggregation + Sort order(by _count) + pagination(using partition). But resulting bucket partitions are not sorted properly.

Without using partitions, sorting is working properly.

Here is my elastic Query,

GET inventory/_search?filter_path=aggregations
{
 "aggs":  {
  "totalPacks": {
    "terms": {
      "field": "sellerId.keyword",
      "min_doc_count": 1,
      "shard_min_doc_count": 0,
      "show_term_doc_count_error": false,
      "include": {
         "partition": 0,
         "num_partitions": 10
      },
      "size": 10,
      "order": {
         "_count": "asc"
      }
    }
  }
}
}

this gives output ,
partition: 0

"buckets" : [
        {
          "key" : "305860",
          "doc_count" : 3
        },
        {
          "key" : "13101",
          "doc_count" : 10
        }

partition: 2

"buckets" : [
        {
          "key" : "05702",
          "doc_count" : 4
        },
        {
          "key" : "17188",
          "doc_count" : 42
        }
      ]

doc_count: 4 is coming in 1st but should come in 0th partition
doc_count: 10 is coming on 0th but should come in 1ST partition.

This is working as designed.
Sort order is valid within partitions, not across them.
It’s a coping strategy for when your data is distributed across lots of machines which makes it difficult to perform certain operations in a single search. It relies on your client application stitching together results from multiple partitions where appropriate.

it is partitioning in which order? And if we need to apply sort first and then partition the data. Is there any approach?

This comment with links may be of use

Hi @Mark_Harwood, I understood the issue. But Is there a way to sort data across all partitions?

Or can u suggest any other way using that we can achieve Aggregation + sorting across all data+ pagination?

We are wondering how other big apps that use ES, manage the sort order with pagination.

The challenge is data locality.
Unless you have all data related to a seller kept on the same shard it’s harder to reason about each seller. Time-based indices make this harder. The fact you’re sorting by lowest doc count makes this even harder.
If you can put all docs related to the same seller in one shard there’s hope. “Custom routing” or the “transform” api are 2 ways to ensure information related to each seller is available in only one shard.

If you need "_count": "asc", what will happen if using rare terms aggregation?
Do you still need pagination even with rare terms aggregation??

Hi @Mark_Harwood, I have 1 shard and 1 replica for my index then it should work right? Descending order on count is also not working even I have only one shard for my index.

Index Setting:

"number_of_shards": "1",
"number_of_replicas": "1"

In what way is it “not working”?
Can you share some outputs

I meant Descending sorting on count is not working across all data. I m looking for a way that returns descending sorted data on _count across all partitions.

As per your comment, I understood Sorting may not work across all data if data resides in different shards/machines. But here my index configuration is 1 shard. please suggest if this is possible by changing any configuration.

Output: For DESC order on _count
Partition: 0

 "buckets" : [
        {
          "key" : "34950",
          "doc_count" : 11
        },
        {
          "key" : "28747",
          "doc_count" : 10
        }
     ]

Partition:1

"buckets" : [
        {
          "key" : "29194",
          "doc_count" : 65
        },
        {
          "key" : "16518",
          "doc_count" : 36
        }
      ]

That could be done by just not using partitions, no?

But I want a Paginated bucket. result query may return more than 1 lakhs bucket but at a time it should return 15 buckets this is the reason I am using partitions.

Deep pagination on a derived value is tough because the deeper you go into the smaller values you have to hold all the other keys and derived values in memory just to figure out where you are in the global order.
Maybe the solution is turn the derived value into a concrete one using the “transform” api to create a derived index holding the totals for each key. This can be queried and sorted on the total field using search_after param. This can be implemented efficiently because it requires less memory but does not support direct indexing into the sort position (eg randomly picking a page number from the results )

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.