Sampler aggregation overhead

Bruno_Lage_Aguiar · January 23, 2020, 9:23pm

Hi,

I have a search use case with a catalog of over 20M products. If a search matches a lot of products, the aggregations take quite a lot of time, since they have to analyze millions of products.

Given that, we thought about using sampler aggregations, so that we could reduce the scope of the aggregations from potentially millions of products to a maximum of, for example, 1 million.

For queries that match millions of documents we do in fact see an improvement. However, for queries that match a small number of products, adding the sampler aggregation actually worsens performance, and the net effect is that the overall performance of the system gets worse.

We ran tests that confirm the shard_size parameter is the culprit: lowering it reduces the overhead. Still, for our use case we want to sample a considerable number of products...
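For readers following along, a request of the kind described above might look like the sketch below. Only the `sampler` aggregation and its `shard_size` parameter come from the thread; the helper name `build_sampled_search`, the field names (`name`, `brand`, `price`) and the query text are assumptions made up for illustration.

```python
import json


def build_sampled_search(query_text, shard_size=1_000_000):
    """Build a search body whose nested aggregations only see a per-shard sample."""
    return {
        "size": 20,
        "query": {"match": {"name": query_text}},  # hypothetical field
        "aggs": {
            "sample": {
                # shard_size caps how many top-scoring docs each shard
                # hands to the aggregations nested underneath.
                "sampler": {"shard_size": shard_size},
                "aggs": {
                    "brands": {"terms": {"field": "brand"}},    # hypothetical field
                    "max_price": {"max": {"field": "price"}},   # hypothetical field
                },
            }
        },
    }


body = build_sampled_search("running shoes")
print(json.dumps(body, indent=2))
```

The child aggregations go inside the sampler's own `aggs` block, so they only run over the sampled documents.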

Looking at the code, it seems that Elasticsearch allocates memory according to the shard_size parameter before knowing how many docs were returned. Not sure if that could be the root cause...

We are using version 6.6, but I already tested version 7.3 and saw similar behavior.

Any tips or feedback?

Best Regards,
Bruno Aguiar

Mark_Harwood · January 24, 2020, 10:26am

The sampler aggregation buffers a set of matching Lucene doc ids (integers) using a priority queue and once all collection is complete it plays back the surviving set of top-matching docs to the child aggregations nested underneath.
In terms of measuring the overhead of using the sampler, there's more compute to be done in buffering the sample set of top-matching docs, but less work to be done by child aggregations (e.g. less retrieval of doc values to compute things like max price).

Consequently, if the shard sample size is close to the number of all matching docs, it may not be faster to use the sampler (is this the behaviour you were seeing?).

The diversified_sampler does more costly and complex buffering, removing high-scoring matches if they break the diversity requirements.
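The buffer-then-replay behaviour described above can be pictured with a toy model. This is a simplified sketch of the idea and not Elasticsearch's actual code: the function name `sample_then_replay`, the `(doc_id, score)` input shape, and replaying survivors in doc id order are all assumptions.

```python
import heapq


def sample_then_replay(scored_docs, shard_size, child_collect):
    """Keep the shard_size best-scoring doc ids, then replay them to a child collector."""
    heap = []  # min-heap of (score, doc_id); the weakest kept doc is at heap[0]
    for doc_id, score in scored_docs:
        if len(heap) < shard_size:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            # The new doc beats the weakest survivor, so swap it in.
            heapq.heapreplace(heap, (score, doc_id))
    # Collection is complete: only now does the child see any documents.
    survivors = sorted(doc_id for _, doc_id in heap)
    for doc_id in survivors:
        child_collect(doc_id)
    return survivors


docs = [(0, 0.2), (1, 0.9), (2, 0.5), (3, 0.7), (4, 0.1)]
seen = []
sample_then_replay(docs, 3, seen.append)
print(seen)
```

In this model, when the number of hits is below `shard_size` every document is buffered and then replayed unchanged, so the buffering is extra work with no reduction for the child aggregations.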

Bruno_Lage_Aguiar · January 24, 2020, 1:09pm

Yes, when shard size is close to the number of matching docs or even bigger.

The thing is that I don't know before running the query whether it will match a lot of documents or not, so we are always applying sampling (e.g. sampling 1M docs from our 20M catalog). But since we also have a lot of queries that do not even return 1M docs, we end up with worse performance.

In summary, if I don't use sampling I will have some queries taking several seconds, but if I activate it most of my currently quick queries get slower.

So the use case we actually want (and we thought sampler would do that) is simply giving an upper bound to the number of docs to analyze, to avoid analyzing several millions of docs in the large queries.

Mark_Harwood · January 24, 2020, 1:52pm

Another consideration is if your non-sampled searches have size=0 for the hits and only return aggregations then the search is optimised and executed without any scoring logic (computed from term frequencies etc). However, if you add a sampler aggregation this introduces the need for scores (to get the "best" N results) and scoring logic is enabled. That could explain the speed discrepancy.

If you wrapped your query in a constant_score query, that might help avoid running the expensive scoring logic for the contained query.
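As a rough illustration of that suggestion, the sketch below wraps an existing query clause in `constant_score`. The helper name `wrap_in_constant_score` and the `name` field are assumptions for the example; `constant_score` with a `filter` clause is the standard query DSL form.

```python
def wrap_in_constant_score(query):
    """Wrap a query clause so matching docs all get the same score."""
    return {"constant_score": {"filter": query}}


original = {"match": {"name": "running shoes"}}  # hypothetical field and text
body = {
    "size": 0,
    "query": wrap_in_constant_score(original),
    "aggs": {
        "sample": {
            "sampler": {"shard_size": 100_000},
            "aggs": {"max_price": {"max": {"field": "price"}}},  # hypothetical field
        }
    },
}
print(body["query"])
```

One trade-off to keep in mind: with identical scores for every match, which documents end up in the sample is no longer driven by relevance.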

Bruno_Lage_Aguiar · January 27, 2020, 9:06am

In our case we are always asking for hits (size > 0), so the speed discrepancy is not from this...

So from what I take, the sampler aggregation is not a good fit when considering bigger sampling values. Do you see the sampler evolving to cover this use case or should we be looking at another type of solution?

Mark_Harwood · January 27, 2020, 9:35am

Can we quantify this? Could you share benchmarks for:

Non-sampled aggs with size=1 hits (to ensure scoring is run)
Sampled aggs

...on the same query and index.

"Do you see the sampler evolving to cover this use case"

I'd first like to understand more of the problem you are seeing.

Bruno_Lage_Aguiar · January 28, 2020, 3:29pm

We set up an index with 16M documents in a single shard, running Elasticsearch 7.5.

Benchmark without sampling:

20 hits: 18ms
11k hits: 45ms
200k hits: 220ms
16M hits: 9600ms

Benchmark with sampling on 1M:

20 hits: 89ms, 5x slower
11k hits: 180ms, 4x slower
200k hits: 520ms, 2x slower
16M hits: 4200ms, almost 2x faster

Benchmark with sampling on 100k:

20 hits: 28ms, 1.5x slower
11k hits: 101ms, 2x slower
200k hits: 220ms, same time
16M hits: 2600ms, almost 4x faster

So here you can see that for small hit sizes (below the shard sample size) we have considerably slower results using sampling.
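The slowdown and speedup factors quoted above can be recomputed directly from the posted timings. The snippet below is only arithmetic on the numbers in this post; the variable and function names are made up for the example.

```python
# Timings in milliseconds, keyed by number of hits, as posted above.
baseline = {"20": 18, "11k": 45, "200k": 220, "16M": 9600}
sampled_1m = {"20": 89, "11k": 180, "200k": 520, "16M": 4200}
sampled_100k = {"20": 28, "11k": 101, "200k": 220, "16M": 2600}


def ratios(sampled, reference):
    """Sampled time divided by non-sampled time; above 1.0 means sampling is slower."""
    return {hits: round(sampled[hits] / reference[hits], 2) for hits in reference}


for label, timings in (("1M sample", sampled_1m), ("100k sample", sampled_100k)):
    print(label, ratios(timings, baseline))
```

A ratio above 1.0 means the sampled request was slower than the non-sampled one, and a ratio below 1.0 means it was faster.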

Mark_Harwood · January 28, 2020, 3:56pm

These numbers make sense. Sampling is beneficial when num hits > sample size.
When num hits is less than sample size you are paying an additional compute cost to buffer a load of doc IDs in RAM, only to then release them all for collection by child aggregations.

It's like feeding an audience into a TV studio by first gathering them in a holding room - the TV producers have no idea how many will turn up on the night so they first lead attendees into a holding area and turn away the undesirables if it looks like the studio is at capacity. Obviously it's much faster to lead attendees directly into the studio and cut out the holding area but:

1. you have no idea in advance how many will turn up and
2. you want to apply some quality control if you hit capacity.

So I doubt there's much that can be done here unless you relax one of the above constraints.
As an example, if you relax constraint 2 you could use the search timeout feature (the timeout parameter on the search request) to halt collection after a period of time rather than by using a quality threshold.
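A minimal sketch of the timeout approach, assuming the search body is built as a plain dict: the request-level `timeout` parameter and the `timed_out` flag in the response are standard search API fields, while the helper names `with_timeout` and `is_partial` are invented for this example.

```python
def with_timeout(body, timeout="1s"):
    """Return a copy of the search body with a request-level timeout added."""
    capped = dict(body)
    capped["timeout"] = timeout
    return capped


def is_partial(response):
    """True when the response reports that collection stopped early."""
    return bool(response.get("timed_out", False))


body = {
    "query": {"match_all": {}},
    "aggs": {"max_price": {"max": {"field": "price"}}},  # hypothetical field
}
print(with_timeout(body, "500ms"))
```

When `is_partial` returns true, both the hits and the aggregations reflect only what was collected before the timeout.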

Bruno_Lage_Aguiar · January 28, 2020, 4:31pm

Understood...

In fact, we recently moved to using search timeouts to cap those slow queries. Our only issue is that the search timeout applies to the whole request rather than only to the aggregations, so we may be returning inaccurate search results.

Do you see Elasticsearch supporting a timeout that applies only to the aggregations?

But if this is in fact the recommended solution, we'll need to stick to it.

Thanks for the help.

Mark_Harwood · January 28, 2020, 4:38pm

I don't expect to see that feature - it's intended to provide a cap on the time of the whole request rather than parts of it.

True, the hits may not be the best but it's also true that the aggs will be some arbitrary subset of the data (e.g. "whatever we could gather in less than a second"). I'm not sure how useful that is to end users?

Bruno_Lage_Aguiar · January 28, 2020, 5:05pm

In that case I am more worried about the quality of the hits.

Anyhow, I think we will end up applying this timeout and then executing a full query in the background to cache the aggregations on the application side.
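One way the plan above could be sketched on the application side is shown below. Everything here is hypothetical: `AggregationCache`, the `run_search` callable (any function that takes a search body dict and returns a response dict) and the cache key scheme are assumptions, and a real version would also need cache expiry and error handling.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class AggregationCache:
    """Serve time-capped searches and fill in full aggregations in the background."""

    def __init__(self, run_search, max_workers=2):
        self._run_search = run_search  # hypothetical callable: body dict -> response dict
        self._cache = {}               # key -> aggregations from a full, uncapped run
        self._in_flight = set()        # keys with a background refresh already queued
        self._lock = threading.Lock()
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def search(self, key, body, timeout="1s"):
        response = self._run_search({**body, "timeout": timeout})
        if key in self._cache:
            # Prefer the cached full aggregations over possibly partial ones.
            return {**response, "aggregations": self._cache[key]}
        if response.get("timed_out"):
            with self._lock:
                start = key not in self._in_flight
                self._in_flight.add(key)
            if start:
                self._pool.submit(self._refresh, key, body)
        return response

    def _refresh(self, key, body):
        try:
            # No timeout and no hits: only the aggregations are needed here.
            full = self._run_search({**body, "size": 0})
            self._cache[key] = full.get("aggregations", {})
        finally:
            with self._lock:
                self._in_flight.discard(key)
```

The first timed-out request for a key still returns partial aggregations; later requests for the same key pick up the cached full result once the background run finishes.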

Thanks for all the support!

system · February 25, 2020, 5:05pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
