Aggregated sorted search with top hits and paging - items missing

My intent is to have a paged list of all categories, each represented by its cheapest product and sorted by price. I have a terms aggregation on product category, ordered by a min aggregation on price:

"aggs": {
        "count": { "cardinality": { "field": "category_id" } },
        "categories": {
            "terms": {
                "field": "category_id",
                "order": {
                    "sorting": "asc"
                },
                "include": {
                    "num_partitions": 2,
                    "partition": 0
                },
                "size": 10
            },
            "aggs": {
                "fields": {
                    "top_hits": {
                        "size": 1,
                        "sort": [
                            {
                                "price": "asc"
                            }
                        ]
                    }
                },
                "sorting": {
                    "min": {
                        "field": "price"
                    }
                }
            }
        }
    }

There are 32 categories in total. So far everything works, but when I change num_partitions to 3, some of the categories which were there are suddenly missing - namely the one which has the cheapest product and was in first place when there were only 2 partitions.

I have no idea what is going on there and why. The code itself is a variation on Terms aggregation | Reference, which seems to be basic stuff.

Can someone please tell me where the problem is and how to make it work as intended?

Thanks a lot.

@irkallacz first off welcome!

I’m not an expert on aggs, but from my rough knowledge and a quick review of the docs (Terms aggregation | Reference): when you increase num_partitions you are bucketing your categories into more partitions, but by specifying partition you are asking for only those categories in the first (0th) partition. You would need to make subsequent queries for the 1st and 2nd partitions if you set num_partitions to 3. So it makes sense to me that you see missing categories as you increase num_partitions without querying for the other partition values. Let me know if that helps at all.
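To make that concrete, here is a sketch of the relevant fragment: with num_partitions set to 3, the client would issue three requests, identical except for the partition number (0, 1, then 2), and merge the results itself. Only the include block changes between requests:

```json
{
  "aggs": {
    "categories": {
      "terms": {
        "field": "category_id",
        "include": {
          "num_partitions": 3,
          "partition": 0
        },
        "size": 10
      }
    }
  }
}
```

Repeating this with "partition": 1 and "partition": 2 covers every category exactly once across the three responses.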

Hi. Partitions should only be used when there are very large numbers of unique values (maybe millions or more). 32 unique values should work fine without partitions.

Thanks for the response, but that is not the issue. I always ask for partition 0 ("partition": 0), and I have the buckets sorted by price, so the same category should be in first place. But when I change num_partitions from 2 to 3, something else is in the first position of partition 0.

To be expected. Each category’s chosen partition number is hash-of-the-category-id modulo num_partitions. If you change num_partitions then you change

  1. how many categories are in each partition and

  2. which categories go in which partition.
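Both effects can be illustrated with a small Python sketch. This uses CRC32 purely as a stand-in deterministic hash, not Elasticsearch's actual internal hash function, so the exact bucket contents will differ from a real cluster, but the behaviour is the same: changing num_partitions reshuffles which categories land in partition 0.

```python
import zlib


def partition_of(category_id: int, num_partitions: int) -> int:
    # Hypothetical stand-in for Elasticsearch's internal hashing:
    # partition = hash(category_id) % num_partitions
    return zlib.crc32(str(category_id).encode()) % num_partitions


categories = list(range(32))

# With num_partitions=3, every category lands in exactly one of partitions 0..2
parts = {p: [c for c in categories if partition_of(c, 3) == p] for p in range(3)}
assert sorted(c for bucket in parts.values() for c in bucket) == categories

# Partition 0 under num_partitions=2 vs num_partitions=3:
# membership is generally different, because the modulus changed
p2_0 = [c for c in categories if partition_of(c, 2) == 0]
p3_0 = [c for c in categories if partition_of(c, 3) == 0]
print("partition 0 sizes:", len(p2_0), "vs", len(p3_0))
```

So the category with the cheapest product is not guaranteed to stay in partition 0 when the modulus changes; it is simply rehashed into whichever partition its id now maps to.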

As I said before, if you only have a small number of unique categories you do not need to set num_partitions - that is only required when you have millions of unique values and get errors.

Categories are sorted by the price of their cheapest product. Even with a different number of partitions, I assumed the order should stay the same, and the category with the cheapest product should be in the first partition (which in my case it is not). Or am I missing something?

I know that in my specific case the partitions are not needed; I am just trying to understand how this works.

Distributed data can be the enemy of analytics. I added partitions to elasticsearch to solve worst-case scenarios like the one I call the “Elizabeth Taylor problem”. In this scenario we want to know who has been married the most often. Elizabeth Taylor was a Hollywood actress who married 8 times but she is not easy to find if we have millions of marriage certificates in our index and these are spread across many shards/nodes. If each of Elizabeth’s marriage certificates is stored on a different server then she appears no more remarkable than the millions of other names held on each node who only ever married once. That means each node has to stream a list of every unique name (even if they have only one marriage certificate) to a co-ordinating node to add up all the values. This co-ordinating node becomes a bottleneck and can throw a “too many buckets” error trying to squeeze all this data into memory.

This is expensive, and rather than trying to exhaustively analyze all the data at once you could break it down into 26 requests - “which people with a surname beginning with ‘A’ married most?” - then repeat for B, then C, and so on.
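That “one request per letter” idea can also be expressed with the regex form of the terms aggregation's include parameter. A hedged sketch, assuming a hypothetical index of marriage certificates with one document per certificate and a surname keyword field:

```json
{
  "size": 0,
  "aggs": {
    "most_married": {
      "terms": {
        "field": "surname",
        "include": "A.*",
        "order": { "_count": "desc" },
        "size": 1
      }
    }
  }
}
```

Since each document is one certificate, ordering by _count surfaces the most-married “A” surname; the client then repeats with "B.*" through "Z.*" and compares the 26 winners.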

This is what partitioning does for you - it breaks one big job into N jobs to avoid resource constraints on heavy computations. It’s a compromise (your client may have to do the work of figuring out the final most-married answer). The network and memory constraints introduce complications the same way the small-boat constraint does in the classic “chicken, fox and grain” problem: what seems simple becomes hard due to resource constraints.