Accurate Paging

Hi All,

We're currently implementing our search with the help of the Elastic Stack, but we're having some trouble providing accurate paging for our users.

Since we are using collapsed hits/inner hits as part of our query, the total hits reported in the response do not match the number of hits that are actually returned after collapsing. Without the collapsed total, it seems impossible to implement accurate paging.

We have been trying various workarounds, but they all seem to be approximations, which would be a bitter pill to swallow for us and our customers.

As a bit of background, we are collapsing on one field (type_information.collapse_id), which always contains a GUID.
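For reference, a paged request of this kind might be built roughly as follows. This is only a sketch: the function name, the inner hits name "grouped", and the inner hits size are placeholders, not our actual values. Only the collapse field is taken from our setup.

```python
def build_collapsed_page(query, page, page_size):
    """Build a search body that collapses on the GUID field (sketch).

    `page` is 1-based. The inner_hits name and size are hypothetical.
    """
    return {
        "query": query,
        "collapse": {
            "field": "type_information.collapse_id",
            "inner_hits": {"name": "grouped", "size": 3},
        },
        # from/size page over the collapsed hits, while hits.total
        # still counts the matching documents before collapsing.
        "from": (page - 1) * page_size,
        "size": page_size,
    }


body = build_collapsed_page({"match_all": {}}, page=3, page_size=20)
```

The mismatch described above comes from the last comment: the page offsets refer to collapsed hits, while the reported total refers to uncollapsed documents.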

So far, we have tried:

  • Using a cardinality aggregation:
"collapsed_total": {
    "cardinality": {
        "field": "type_information.collapse_id"
    }
} 

However, since this is only an approximation, it was off by quite a bit on larger result sets (even with maxed-out precision).

  • Using a terms aggregation to find the IDs that occur more than once, and then subtracting the surplus documents from the total hit count:
"collapsed_total": {
    "terms": {
        "field": "type_information.collapse_id",
        "min_doc_count": 2,
        "size": 10000
    }
}

However, this required us to raise search.max_buckets to an unnecessarily high level and, if I understand it correctly, it will also start being inaccurate as soon as we move our development environment to the intended setup with multiple shards.
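To make the subtraction concrete, here is a small sketch of how the collapsed total can be derived from such a response. The function name and the sample numbers are made up for illustration. It assumes every bucket with a doc_count of 2 or more is actually returned and that the counts are exact, which is precisely what may no longer hold with more than 10000 duplicated IDs or with multiple shards.

```python
def collapsed_total(total_hits, buckets):
    """Derive the number of collapsed hits from a terms aggregation (sketch).

    A bucket with doc_count n stands for n documents that collapse into
    one hit, so it contributes n - 1 surplus documents.
    """
    surplus = sum(bucket["doc_count"] - 1 for bucket in buckets)
    return total_hits - surplus


# Hypothetical example: 10 matching documents, two IDs occur more than once.
sample_buckets = [
    {"key": "guid-a", "doc_count": 3},
    {"key": "guid-b", "doc_count": 2},
]
pages_base = collapsed_total(10, sample_buckets)  # 10 - (2 + 1) = 7
```

If the bucket list is truncated or the counts are approximate, the result overestimates or misstates the collapsed total, so paging built on it inherits that error.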

I understand many of these issues only happen with a large number of results and should not occur if a user uses our search well, but we have a wide external user base with sometimes not very good "search skills".

Are there any ideas on how we could solve this issue in a way that gives us accurate results even for searches with a very large result set?

Cheers,
Stefano

It seems to us that there is no exact solution to this issue. Hopefully, we'll have our actual multi-sharded setup up and running next week so we can see how far off the second method is in our real environment. If it's too inaccurate, we might just have to introduce a cutoff point and inform the user that there are more than X results, giving them some way to access the rest while still keeping our initial paging accurate.
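The cutoff idea could look roughly like this on the application side. This is only a sketch: the function name, the wording of the label, and the way the count is obtained are assumptions, and the cutoff X is whatever limit turns out to be reliable in our environment.

```python
import math


def paging_summary(group_count, page_size, cutoff):
    """Cap paging at `cutoff` results and build a label for the user (sketch)."""
    capped = min(group_count, cutoff)
    pages = math.ceil(capped / page_size)
    if group_count > cutoff:
        label = f"more than {cutoff} results"
    else:
        label = f"{group_count} results"
    return pages, label
```

Below the cutoff the page count is exact. Above it the user only sees the capped number of pages plus the "more than X" notice.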