I'm running filters (so I don't care about scoring or sorting at all), and I would like to get all the results as fast as possible. The number of results varies from a few to tens of thousands.
Elasticsearch does not provide a way to get all the results in a single Search API call, and setting a very large size affects the query duration.
What is the best approach (performance-wise) to deal with this problem?
Should I always use the Scroll API? Should I always use the Search API with size = 10K (for example) and then page through any remaining results? Should I first use the Count API and then decide?
Does the Scroll API incur any substantial overhead in comparison to multiple queries?
What about the cases where there are fewer than 10K results? Those will still incur the overhead of creating a scroll context. And if I'm running under a heavy query load (10K/sec), is the scrolling overhead still negligible?
I understand that the size parameter is more than just a limit: Elasticsearch/Lucene allocates memory according to it. Maybe under heavy load I should count the number of results first; I would not want to affect other queries just because I overestimated the number of results.
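For reference, the scroll loop itself is simple. Here is a minimal sketch with the Elasticsearch calls stubbed out as plain callables, so only the control flow runs; in a real application, `open_scroll` would wrap the initial search with a `scroll` timeout and `continue_scroll` would wrap the follow-up scroll request (both names are made up for this sketch):

```python
def scroll_all(open_scroll, continue_scroll, size=1000):
    """Collect every hit by pulling scroll pages until one comes back empty."""
    scroll_id, hits = open_scroll(size)
    results = list(hits)
    while hits:  # an empty page means the scroll is drained
        scroll_id, hits = continue_scroll(scroll_id)
        results.extend(hits)
    return results

# Fake backend mimicking the scroll contract: documents served page by page.
def make_fake_backend(total):
    docs = list(range(total))
    state = {}
    def open_scroll(size):
        state["pos"], state["size"] = size, size
        return "scroll-1", docs[:size]
    def continue_scroll(scroll_id):
        pos, size = state["pos"], state["size"]
        state["pos"] = pos + size
        return scroll_id, docs[pos:pos + size]
    return open_scroll, continue_scroll

open_scroll, continue_scroll = make_fake_backend(2500)
assert scroll_all(open_scroll, continue_scroll) == list(range(2500))
```

The point of the stubs is that the per-page loop (and its extra round trips) is the same regardless of client; only the two wrapped calls change.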
There is an open issue on GitHub that suggests putting a hard limit on the size parameter:
That issue even has a pull request for it! The size I proposed was 10,000,
though I'm not sure it's right.
Scroll has server-side costs. The nodes keep some metadata about the scroll,
and they keep old segment files around even after other searches no longer need them, so
they can return a consistent view for the scroll.
If you have 10,000 queries per second and a couple dozen
concurrent scrolls mixed in, you'll be fine. Not a couple dozen scrolls per
second. A couple dozen concurrently.
If you expect 10,000 queries with huge depth per second, I don't know what
to tell you. Maybe try to rejig what you are doing as an aggregation.
Though at that point you are far outside my area of expertise.
I expect most of the queries to return at most 1K results; however, I'd like to have all the results for each query.
I'm still not sure what the right approach is. Does scroll have any benefit over count + search with the exact size?
Scroll disables scoring and ordering, but I think filters don't compute scores anyway.
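For what it's worth, in recent Elasticsearch versions a pure filter goes in the `filter` clause of a `bool` query, which runs in filter context and skips scoring entirely (every hit gets a constant score). A sketch of such a request body, shown as a Python dict; the `status` field and its value are made up for illustration:

```python
# Request body for a filter-only search: the bool.filter clause runs in
# filter context, so no relevance scores are computed for these hits.
body = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"status": "published"}},  # hypothetical field
            ]
        }
    },
    "size": 1000,
}
```

This body would be sent as the payload of a normal search request (or the initial scroll request); only the `query` shape matters here.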
Scroll will let you go beyond whatever that hard limit is once it's merged.
And before that it'll take up less memory. But remember that it leaves
things open on the Elasticsearch side, so it's not safe to have tons and tons
of these. I don't know how many becomes unsafe, though.
I'd probably do a search for 1K and, if the total that comes back is larger,
retry the whole query as a scroll. But how many of these scrolls do
you expect? You might be better off searching for 1K and, if that doesn't
get you everything, telling the user they must narrow their search.
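That fallback strategy is easy to express. Here is a minimal sketch with the search and scroll calls stubbed out as callables (the exact client API depends on your version, so both stubs are assumptions):

```python
def fetch_all(search, scroll_all, size=1000):
    """One cheap bounded search first; fall back to a scroll only if needed."""
    total, hits = search(size)  # search(size) -> (total_hits, first_page)
    if total <= size:
        return hits             # common case: no scroll context created
    return scroll_all()         # rare case: rerun the whole query as a scroll

# Fake backend for illustration; a real one would wrap the client calls.
def backend(total):
    docs = list(range(total))
    search = lambda size: (total, docs[:size])
    scroll_all = lambda: docs
    return search, scroll_all

assert len(fetch_all(*backend(300))) == 300    # served by the single search
assert len(fetch_all(*backend(5000))) == 5000  # falls back to the scroll
```

The design point is that the frequent small result sets never pay the scroll-context cost; only the occasional large one does, at the price of re-running that query once.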
I'm actually running the queries for application-side joins (many-to-many relations), so a query might return a large number of results even though the joined data set is significantly smaller. If it matters, I'm using the Java API.
I feel that the basic ability to get back all the results is missing here. I think there should be an option to run a normal search with size = -1 (or something similar) and, if there are too many results, get an error back with a suggestion to use the Scroll API.