I have a search use case with a catalog of over 20M products. When a search matches a lot of products, the aggregations take quite a lot of time, since they have to analyze millions of documents.
Because of that we thought about using sampling aggregations, so that we could reduce the scope of the aggregations from potentially millions of documents to a maximum of, for example, 1 million.
For queries that match millions of documents we do in fact see an improvement. However, for queries that match a small number of products, we noticed that the sampling aggregation actually hurts performance... Overall, enabling sampling ends up degrading the performance of the system as a whole.
We have run tests confirming that the shard_size parameter is the culprit: lowering it reduces this overhead. Still, for our use case we want to sample a considerable number of products...
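For reference, here is a minimal sketch of the kind of request we're running (the index name, fields, and exact shard_size are just illustrative, using the 6.x/7.x Python client):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Hypothetical product search: the sampler caps how many top-matching
# docs *per shard* are handed to the child aggregations via shard_size.
resp = es.search(index="products", body={
    "size": 10,
    "query": {"match": {"title": "shoes"}},
    "aggs": {
        "sample": {
            "sampler": {"shard_size": 200000},  # large sample per shard
            "aggs": {
                "brands": {"terms": {"field": "brand"}}
            }
        }
    }
})
```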
Looking at the code, it seems like Elasticsearch allocates memory according to the shard_size parameter before it knows how many docs matched. Not sure if that could be the root cause...
We are using version 6.6, but I already tested version 7.3 and saw similar behavior.
The sampler aggregation buffers a set of matching Lucene doc ids (integers) using a priority queue, and once collection is complete it plays back the surviving set of top-matching docs to the child aggregations nested underneath.
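Conceptually it works something like this (a simplified Python sketch of the per-shard behaviour, not the actual Lucene implementation):

```python
import heapq

def sampler_collect(matching_docs, shard_size):
    # Buffer the top-scoring matches per shard in a bounded min-heap,
    # then replay only the survivors to the child aggregations.
    heap = []  # (score, doc_id) tuples; the worst survivor sits on top
    for doc_id, score in matching_docs:
        if len(heap) < shard_size:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            # evict the current worst match in favour of a better one
            heapq.heapreplace(heap, (score, doc_id))
    # only now are the surviving doc ids played back to child aggs
    return [doc_id for _, doc_id in heap]

# Note: if shard_size >= the number of matches, every doc is buffered
# and replayed anyway -- pure overhead versus collecting docs directly.
```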
In terms of measuring the overhead of using the sampler:

- there's more compute to be done in buffering the sample set of top-matching docs, but
- less work to be done by child aggregations (e.g. less retrieval of doc values to compute things like max price).

Consequently, if the shard sample size is close to the number of all matching docs it may not be faster to use the sampler (is this the behaviour you were seeing?)
The diversified_sampler does more costly and complex buffering - removing high-scoring matches if they break the diversity requirements.
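For reference, a diversified_sampler request looks something like this (index and field names are made up, Python client assumed):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Hypothetical: de-duplicate the sample on "brand" so no single brand
# dominates; the diversity bookkeeping makes buffering more expensive.
resp = es.search(index="products", body={
    "size": 0,
    "query": {"match": {"title": "shoes"}},
    "aggs": {
        "sample": {
            "diversified_sampler": {
                "shard_size": 100000,
                "field": "brand",
                "max_docs_per_value": 3  # diversity constraint
            },
            "aggs": {
                "max_price": {"max": {"field": "price"}}
            }
        }
    }
})
```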
Yes, that's what we see when the shard size is close to, or even bigger than, the number of matching docs.
The thing is that I don't know before running the query whether it will match a lot of documents or not, so we always apply sampling (e.g. sampling 1M docs from our 20M catalog). But since we also have a lot of queries that don't even match 1M docs, we end up with worse performance.
In summary, if I don't use sampling some queries take several seconds, but if I activate it most of my currently quick queries get slower.
So the behaviour we actually want (and what we thought the sampler would give us) is simply an upper bound on the number of docs to analyze, to avoid analyzing several million docs for the large queries.
Another consideration: if your non-sampled searches use size=0 for the hits and only return aggregations, the search is optimised and executed without any scoring logic (computed from term frequencies etc.). However, adding a sampler aggregation introduces the need for scores (to pick the "best" N results), so scoring logic is enabled. That could explain the speed discrepancy.
Wrapping your query in a constant_score query might help avoid running the expensive scoring logic for the contained query.
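Something along these lines (again with illustrative names):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# constant_score gives every match the same score, so no term-frequency
# scoring work is done for the wrapped query; the sampler's "best N"
# then becomes an arbitrary N, which is fine if you only want a cap.
resp = es.search(index="products", body={
    "size": 0,
    "query": {
        "constant_score": {
            "filter": {"match": {"title": "shoes"}}
        }
    },
    "aggs": {
        "sample": {
            "sampler": {"shard_size": 200000},
            "aggs": {"brands": {"terms": {"field": "brand"}}}
        }
    }
})
```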
In our case we are always asking for hits (size > 0), so the speed discrepancy doesn't come from this...
So from what I gather, the sampler aggregation is not a good fit for larger sample sizes. Do you see the sampler evolving to cover this use case, or should we be looking at another type of solution?
These numbers make sense. Sampling is beneficial when num hits > sample size.
When num hits is less than sample size you are paying an additional compute cost to buffer a load of doc IDs in RAM, only to then release them all for collection by child aggregations.
It's like feeding an audience into a TV studio by first gathering them in a holding room. The TV producers have no idea how many will turn up on the night, so they first lead attendees into a holding area and turn away the undesirables if it looks like the studio is at capacity. Obviously it's much faster to lead attendees directly into the studio and cut out the holding area, but:

1. you have no idea in advance how many will turn up, and
2. you want to apply some quality control if you hit capacity.
So I doubt there's much that can be done here unless you relax one of the above constraints.
As an example: if you relax constraint 2 you could use the search timeout feature to halt collection after a period of time rather than by a quality threshold.
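Something like this, using the timeout parameter on the search request (values are illustrative):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Halt collection after 500ms; both hits and aggs may be partial, and
# the response flags whether the cap was hit.
resp = es.search(index="products", body={
    "timeout": "500ms",
    "size": 10,
    "query": {"match": {"title": "shoes"}},
    "aggs": {
        "brands": {"terms": {"field": "brand"}}
    }
})
print(resp["timed_out"])  # True if collection was cut short
```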
In fact, we recently moved to using search timeouts to cap those slow queries. Our only issue is that the timeout is not applied only to the aggregations, so we may be returning possibly inaccurate search results.
Do you see Elasticsearch supporting a timeout applied only to the aggregations?
But if this is in fact the recommended solution, we'll need to stick to it.
I don't expect to see that feature; the timeout is intended to cap the time of the whole request rather than parts of it.
True, the hits may not be the best, but it's also true that the aggs would be computed over some arbitrary subset of the data (e.g. "whatever we could gather in less than a second"). I'm not sure how useful that is to end users?