Unusual aggregation size behaviour

We have an aggregation as follows:

{
    "from": 0,
    "size": 0,
    "aggregations": {
        "my_aggregation": {
            "terms": {
                "field": "SOME_FIELD",
                "min_doc_count": 2
            }
        }
    }
}

We use such a query to do some processing on the matched terms. We noticed that, initially, the aggregation was not returning any results:

{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 519088,
        "max_score": 0.0,
        "hits": []
    },
    "aggregations": {
        "my_aggregation": {
            "doc_count_error_upper_bound": 5,
            "sum_other_doc_count": 16124,
            "buckets": []
        }
    }
}

Through some experimentation, we ended up toying with the size parameter and setting it to a higher value:

{
    "from": 0,
    "size": 0,
    "aggregations": {
        "my_aggregation": {
            "terms": {
                "field": "SOME_FIELD",
                "min_doc_count": 2,
                "size": 100
            }
        }
    }
}

Strangely enough, this request started returning some results:

{
    "took": 3,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 519088,
        "max_score": 0.0,
        "hits": []
    },
    "aggregations": {
        "my_aggregation": {
            "doc_count_error_upper_bound": 5,
            "sum_other_doc_count": 15449,
            "buckets": [
                {
                    "key": "<HIDDEN VALUE>",
                    "doc_count": 3
                },
                {
                    "key": "<HIDDEN VALUE>",
                    "doc_count": 2
                },
                {
                    "key": "<HIDDEN VALUE>",
                    "doc_count": 2
                },
                {
                    "key": "<HIDDEN VALUE>",
                    "doc_count": 2
                },
                {
                    "key": "<HIDDEN VALUE>",
                    "doc_count": 2
                },
                {
                    "key": "<HIDDEN VALUE>",
                    "doc_count": 2
                },
                {
                    "key": "<HIDDEN VALUE>",
                    "doc_count": 2
                },
                {
                    "key": "<HIDDEN VALUE>",
                    "doc_count": 2
                },
                {
                    "key": "<HIDDEN VALUE>",
                    "doc_count": 2
                },
                {
                    "key": "<HIDDEN VALUE>",
                    "doc_count": 2
                }
            ]
        }
    }
}

I'm struggling to understand the logic of this behaviour (we're on ES 5.4), so any help on what we're missing would be greatly appreciated.

Many thanks.
Cos

It's an unfortunate side effect of distributed computing and the unevenness of your data. If you can put all your content in one shard (and half a million docs should easily fit in one), then all should work better.

However, let's assume the worst-case scenario, where you need multiple shards distributed across multiple nodes. Worse still, it looks like your choice of field has some very low-frequency terms - so low, in fact, that no one shard has more than one document with the same field value. So when you ask for the top 10 results, each shard will actually return 25 terms under the default shard_size [1] setting (size * 1.5 + 10, so 25 for a size of 10). Given the terms are so rare, the "top" 25 values picked from each shard are actually unique across the whole dataset. It is only when you ask for a size of 100 (which implies a shard_size of 160) that you start to see terms that are not unique.
This is a pretty rare scenario - each shard sees so little of the signal (no local repetition whatsoever) that it becomes impossible to cherry-pick a selection of values that, when combined with the other shards' results, meets your criterion of min_doc_count: 2. Each shard is taking a wild guess as to which of the potentially millions of unique local values it holds are not globally unique. In these situations, try cranking up the shard_size setting before tweaking the final size setting.
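
To make that concrete, here's a sketch of what that advice might look like applied to your original request (the shard_size of 500 is an arbitrary illustrative value - the right number depends on the cardinality of your field and how much per-shard work you can afford):

{
    "from": 0,
    "size": 0,
    "aggregations": {
        "my_aggregation": {
            "terms": {
                "field": "SOME_FIELD",
                "min_doc_count": 2,
                "size": 10,
                "shard_size": 500
            }
        }
    }
}

Each shard now nominates its top 500 candidate terms while the final response still carries only the top 10 buckets, which gives rare terms a much better chance of surfacing from more than one shard before the min_doc_count filter is applied at merge time.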

[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_shard_size_3


Thanks Mark for the prompt and very clear response! I was actually reading about shard_size just after I posted, and things started to tie together. It's clear why this happens, particularly in our case, where the field can be completely missing from many documents, so it can easily happen that the top terms in one shard simply don't match those in another.

One question remains: would the sorting applied to this search's query (not pasted here) influence the order in which the top docs are picked within a shard?

Don't confuse sorting of query "hits" with agg results. Remember, the former was originally designed to produce the top 10 docs shown on a typical search results page, while aggregations had their origins in the sidebar summaries on e-commerce sites that let customers narrow by price, manufacturer, size, etc. For this reason, aggregations summarise all query matches, so the sort order selected for your top 10 hits is irrelevant.

Remember, aggs generally don't return individual docs but things like top terms, which are common to more than one matching doc.
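
To illustrate (a hypothetical request - SOME_DATE_FIELD is just a stand-in for whatever sort you applied), sorting the hits has no effect on the aggregation:

{
    "size": 10,
    "sort": [
        { "SOME_DATE_FIELD": "desc" }
    ],
    "aggregations": {
        "my_aggregation": {
            "terms": {
                "field": "SOME_FIELD",
                "min_doc_count": 2
            }
        }
    }
}

The top 10 hits come back ordered by SOME_DATE_FIELD, but the buckets in my_aggregation are computed over every matching document and would be identical with no sort at all.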

Ah yes, of course - terms, average, or sum, for example, can't take sorting into account, as they apply to the whole dataset. So it makes sense that sorting is immaterial to aggregations. Duh!
Thanks again!
