Aggregated sorted search with top hits and paging - items missing

My intent is to have a paged list of all categories, each represented by its cheapest product and sorted by price. I have a terms aggregation on product category, ordered by a min aggregation on price:

"aggs": {
        "count": { "cardinality": { "field": "category_id" } },
        "categories": {
            "terms": {
                "field": "category_id",
                "order": {
                    "sorting": "asc"
                },
                "include": {
                    "num_partitions": 2,
                    "partition": 0
                },
                "size": 10
            },
            "aggs": {
                "fields": {
                    "top_hits": {
                        "size": 1,
                        "sort": [
                            {
                                "price": "asc"
                            }
                        ]
                    }
                },
                "sorting": {
                    "min": {
                        "field": "price"
                    }
                }
            }
        }
    }

There are 32 categories in total. So far everything works, but when I change num_partitions to 3, some of the categories which were there are suddenly missing - namely the one which has the cheapest product and was in first place when there were only 2 partitions.

I have no idea what is going on there and why. The code itself is a variation on Terms aggregation | Reference, which seems to be basic stuff.

Can someone please tell me where the problem is and how to make it work as intended?

Thanks a lot.

@irkallacz first off welcome!

I’m not an expert on aggs, but from my rough knowledge and a quick review of the docs (Terms aggregation | Reference): when you increase num_partitions you are bucketing your categories into more partitions, but by specifying partition you are asking for only those categories in the first (0th) partition. You would need to make subsequent queries for the 1st and 2nd partitions if you set num_partitions to 3. So it makes sense to me that you see missing categories as you increase num_partitions without querying for the other partition values. Let me know if that helps at all.
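To make that concrete, here is a sketch of the relevant fragment: with num_partitions set to 3, the client would issue three requests, identical except for the partition number (0, 1, then 2), and merge the results itself. Only the include block changes between requests:

```json
{
  "aggs": {
    "categories": {
      "terms": {
        "field": "category_id",
        "include": {
          "num_partitions": 3,
          "partition": 0
        },
        "size": 10
      }
    }
  }
}
```

Repeating this with "partition": 1 and "partition": 2 covers every category exactly once across the three responses.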

Hi. Partitions should only be used when there are very large numbers of unique values (maybe millions or more). 32 unique values should work fine without partitions.

Thanks for the response, but that is not the issue. I always ask for partition 0 ("partition": 0), and I have the buckets sorted by price, so the same category should be in first place. But when I change num_partitions from 2 to 3, something else is in the first position of partition 0.

To be expected. Each category’s chosen partition number is hash-of-the-category-id modulo num_partitions. If you change num_partitions then you change

  1. how many categories are in each partition and

  2. which categories go in which partition.
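Both effects can be illustrated with a small Python sketch. This uses CRC32 purely as a stand-in deterministic hash, not Elasticsearch's actual internal hash function, so the exact bucket contents will differ from a real cluster, but the behaviour is the same: changing num_partitions reshuffles which categories land in partition 0.

```python
import zlib


def partition_of(category_id: int, num_partitions: int) -> int:
    # Hypothetical stand-in for Elasticsearch's internal hashing:
    # partition = hash(category_id) % num_partitions
    return zlib.crc32(str(category_id).encode()) % num_partitions


categories = list(range(32))

# With num_partitions=3, every category lands in exactly one of partitions 0..2
parts = {p: [c for c in categories if partition_of(c, 3) == p] for p in range(3)}
assert sorted(c for bucket in parts.values() for c in bucket) == categories

# Partition 0 under num_partitions=2 vs num_partitions=3:
# membership is generally different, because the modulus changed
p2_0 = [c for c in categories if partition_of(c, 2) == 0]
p3_0 = [c for c in categories if partition_of(c, 3) == 0]
print("partition 0 sizes:", len(p2_0), "vs", len(p3_0))
```

So the category with the cheapest product is not guaranteed to stay in partition 0 when the modulus changes; it is simply rehashed into whichever partition its id now maps to.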

As I said before, if you only have a small number of unique categories you do not need to set num_partitions - that is only required when you have millions of unique values and get errors.

Categories are sorted by the price of their cheapest product. Even with a different number of partitions, I assumed the order should stay the same, and the category with the cheapest product should be in the first partition (which in my case it is not). Or am I missing something?

I know that in my specific case the partitions are not needed; I am just trying to understand how this works.

Distributed data can be the enemy of analytics. I added partitions to elasticsearch to solve worst-case scenarios like the one I call the “Elizabeth Taylor problem”. In this scenario we want to know who has been married the most often. Elizabeth Taylor was a Hollywood actress who married 8 times but she is not easy to find if we have millions of marriage certificates in our index and these are spread across many shards/nodes. If each of Elizabeth’s marriage certificates is stored on a different server then she appears no more remarkable than the millions of other names held on each node who only ever married once. That means each node has to stream a list of every unique name (even if they have only one marriage certificate) to a co-ordinating node to add up all the values. This co-ordinating node becomes a bottleneck and can throw a “too many buckets” error trying to squeeze all this data into memory.

This is expensive, and rather than trying to exhaustively analyze all the data at once you could break it down into 26 requests - “which people with a surname beginning with ‘A’ married most?” - then repeat for B, then C, and so on.
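That “one request per letter” idea can also be expressed with the regex form of the terms aggregation's include parameter. A hedged sketch, assuming a hypothetical index of marriage certificates with one document per certificate and a surname keyword field:

```json
{
  "size": 0,
  "aggs": {
    "most_married": {
      "terms": {
        "field": "surname",
        "include": "A.*",
        "order": { "_count": "desc" },
        "size": 1
      }
    }
  }
}
```

Since each document is one certificate, ordering by _count surfaces the most-married “A” surname; the client then repeats with "B.*" through "Z.*" and compares the 26 winners.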

This is what partitioning does for you - it breaks one big job into N jobs to avoid resource constraints on heavy computations. It’s a compromise (your client may have to do the work of figuring out the final most-married answer). The network and memory constraints introduce complications the same way the small-boat constraint does in the classic “chicken, fox and grain” problem: what seems simple becomes hard due to resource constraints.