Accurate Paging

NashSLX · June 5, 2019, 1:26pm

Hi All,

We're currently implementing our search with the help of the Elastic Stack, but we're having some trouble providing accurate paging for our users.

Since our query uses field collapsing with inner hits, the total hits in the response counts the individual matching documents rather than the collapsed groups, so it does not match the number of hits that are actually returned. Without an accurate count of the collapsed groups, it seems impossible to work out how many pages there are.

We have tried various workarounds, but they all turn out to be approximations, which would be a bitter pill to swallow for us and our customers.

As a bit of background, we are collapsing on one field (type_information.collapse_id), which always contains a GUID.

So far, we have tried:

Using a cardinality aggregation:

"collapsed_total": {
    "cardinality": {
        "field": "type_information.collapse_id"
    }
}

However, since this is only an approximation, it was off by quite a bit on larger result sets (even with the precision maxed out).
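For reference, this is roughly how the request looks when it is put together. It is only a sketch: the helper name, the paging parameters and the match_all query are placeholders, and I'm assuming "maxed out precision" means setting precision_threshold to 40000, the highest value the cardinality aggregation supports.

```python
import json


def build_collapsed_search(query, page, page_size, precision_threshold=40000):
    """Build a search body that collapses on collapse_id and asks for an
    approximate count of distinct collapse IDs (hypothetical helper)."""
    return {
        "query": query,
        "from": page * page_size,
        "size": page_size,
        # One hit per distinct collapse_id is returned.
        "collapse": {"field": "type_information.collapse_id"},
        "aggs": {
            "collapsed_total": {
                "cardinality": {
                    "field": "type_information.collapse_id",
                    # Assumption: 40000 is the maximum supported threshold;
                    # the count is still approximate above it.
                    "precision_threshold": precision_threshold,
                }
            }
        },
    }


body = build_collapsed_search({"match_all": {}}, page=0, page_size=20)
print(json.dumps(body, indent=2))
```

The page count would then be derived from aggregations.collapsed_total.value instead of hits.total, which is where the approximation error shows up.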

Using a terms aggregation to find the collapse IDs shared by more than one document, and then subtracting the surplus documents from the total hits:

"collapsed_total": {
    "terms": {
        "field": "type_information.collapse_id",
        "min_doc_count": 2,
        "size": 10000
    }
}

However, this required us to raise search.max_buckets to unnecessarily high levels and, if I understand it correctly, it will also start being inaccurate as soon as we move from our development environment to the intended setup with multiple shards.
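To make the subtraction explicit, here is a small sketch of the arithmetic we apply to the response. The function name and the sample response are made up for illustration; the result is only exact if every bucket with two or more documents is actually returned and the doc counts themselves are exact.

```python
def collapsed_total(response):
    """Estimate the number of collapsed groups from a search response
    that includes the terms aggregation above (hypothetical helper)."""
    total = response["hits"]["total"]
    # Elasticsearch 7.x reports hits.total as an object, 6.x as a plain integer.
    if isinstance(total, dict):
        total = total["value"]
    buckets = response["aggregations"]["collapsed_total"]["buckets"]
    # Each bucket holds doc_count documents that collapse into one hit,
    # so doc_count - 1 of them are surplus.
    surplus = sum(bucket["doc_count"] - 1 for bucket in buckets)
    return total - surplus


# Made-up response: 10 matching documents, two collapse IDs with duplicates.
sample = {
    "hits": {"total": {"value": 10, "relation": "eq"}},
    "aggregations": {
        "collapsed_total": {
            "buckets": [
                {"key": "guid-a", "doc_count": 3},
                {"key": "guid-b", "doc_count": 2},
            ]
        }
    },
}
print(collapsed_total(sample))  # 10 - (2 + 1) = 7
```

In the sample, 5 documents are unique and the other 5 fall into 2 groups, which gives 7 collapsed hits in total.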

I understand many of these issues only happen with a large number of results and should not occur if a user uses our search well, but we have a wide external user base with sometimes not very good "search skills".

Are there any ideas on how we could solve this issue in a way that leads us to accurate results even with searches that have a very large resultset?

Cheers,
Stefano

NashSLX · June 14, 2019, 7:01am

It seems to us that there is no exact solution to this issue. Hopefully, we'll have our actual multi-shard setup up and running next week so we can see how far off the second method is in our real environment. If it's too inaccurate, we might just have to introduce a cutoff point and inform the user that there are more than X results, providing them with some way to access the rest while still keeping our initial paging accurate.
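The cutoff idea could look something like the sketch below. The function name, the cap of 10000 and the label wording are all placeholders, not something we have settled on.

```python
import math


def paging_summary(group_count, cap=10000, page_size=20):
    """Return (number of pages to offer, label to show the user).

    Counts above the cap are not trusted, so paging stops at the cap and
    the label only says that there are more results (hypothetical helper).
    """
    if group_count > cap:
        return math.ceil(cap / page_size), f"more than {cap} results"
    return math.ceil(group_count / page_size), f"{group_count} results"


print(paging_summary(45))
print(paging_summary(25000))
```

Below the cap the paging stays exact as long as the count is, and above it the user is only told that more results exist.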

system · July 12, 2019, 7:06am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
