SearchSourceBuilder search = new SearchSourceBuilder()
.query(...).from(0).size(Integer.MAX_VALUE);
index.search(search).hits();
-> Exception in thread "elasticsearch[search]-pool-47-thread-1"
java.lang.OutOfMemoryError: Java heap space
    at org.elasticsearch.search.SearchService.shortcutDocIdsToLoad(SearchService.java:579)
    at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:317)
I'd be less surprised if the index didn't have just a dozen small
documents; it looks like Elasticsearch is preallocating a large array?
My Lucene internals may be out of date, but if it's the same as a while
back, the PriorityQueue used to hold the results is backed by an array of
length `size` (the requested number of hits).
If you want all results, don't 'search', 'scroll' instead.
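For context, scroll avoids the huge allocation because the server keeps a cursor and each round trip returns only one bounded page; with the Java TransportClient of that era this is roughly `prepareSearch(...).setScroll(...)` followed by repeated `prepareSearchScroll(scrollId)` calls. A minimal sketch of that pagination pattern in plain JDK Java, with a hypothetical in-memory list standing in for the index:

```java
import java.util.ArrayList;
import java.util.List;

public class ScrollSketch {
    // Stand-in for an index holding many documents.
    static List<String> documents = new ArrayList<>();

    // One scroll "page": at most pageSize docs starting at cursor.
    static List<String> scrollPage(int cursor, int pageSize) {
        int end = Math.min(cursor + pageSize, documents.size());
        return documents.subList(cursor, end);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1_000; i++) documents.add("doc-" + i);

        int pageSize = 100;  // bounded per-request allocation
        int cursor = 0;
        int seen = 0;
        List<String> page;
        while (!(page = scrollPage(cursor, pageSize)).isEmpty()) {
            seen += page.size();  // process the page here
            cursor += page.size();
        }
        System.out.println(seen); // all 1000 docs visited, 100 at a time
    }
}
```

The point is that memory use is bounded by the page size, not by the total number of hits.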
I can do that, but wouldn't it make sense to allocate no more than
Math.min(size, numberOfDocuments)?
Simple math like Math.min() can be dangerous, because you allocate the array
and then start filling it with hits from the index. While you do that, your
index can grow because fresh documents keep arriving. This might be
no problem in your case, but we index 1000 docs/s.
Lucene is optimizing for the common case of only the top X hits being
needed. It's more efficient for score sorting to do it this way, and far
less memory is used. How would you sort, say, a billion documents? Only
documents that match the query/filters are passed through the priority
queue for sorting, and for the common case of only needing the top X that
means far fewer comparisons and far less memory overall. The larger the hit
size, the more comparisons and memory.
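The top-X behaviour described above can be sketched with a plain JDK `java.util.PriorityQueue` used as a min-heap: no matter how many documents stream past, at most X entries are ever held. This is a simplified model, not Lucene's actual `org.apache.lucene.util.PriorityQueue`:

```java
import java.util.PriorityQueue;
import java.util.Random;

public class TopXSketch {
    // Keep the X highest scores seen, using a min-heap of size <= x.
    static PriorityQueue<Double> topX(double[] scores, int x) {
        PriorityQueue<Double> heap = new PriorityQueue<>(x); // natural order = min-heap
        for (double s : scores) {
            if (heap.size() < x) {
                heap.add(s);
            } else if (s > heap.peek()) {
                heap.poll();  // evict the current minimum
                heap.add(s);
            }
        }
        return heap;  // memory stays O(x), not O(total hits)
    }

    public static void main(String[] args) {
        double[] scores = new double[1_000_000];
        Random rnd = new Random(42);
        for (int i = 0; i < scores.length; i++) scores[i] = rnd.nextDouble();

        PriorityQueue<Double> best = topX(scores, 25);
        System.out.println(best.size()); // 25 entries held, not a million
    }
}
```

With `size = Integer.MAX_VALUE` the queue's backing array is sized to the request up front, which is exactly the allocation that blows the heap in the stack trace above.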
Even if you did the Math.min, for us with large indexes it would be
allocating an array of huge size when we only want the first 25 most of the
time. That's a waste.
Hadn't considered this case; I wouldn't mind if it was just new docs
that were missing, but having some new docs showing up in place of
older docs could indeed be confusing.
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.