[0.19.1] OutOfMemoryError after setting a large result size

SearchSourceBuilder search = new SearchSourceBuilder()
    .query(...).from(0).size(Integer.MAX_VALUE);
index.search(search).hits();

-> Exception in thread "elasticsearch[search]-pool-47-thread-1"
java.lang.OutOfMemoryError: Java heap space
    at org.elasticsearch.search.SearchService.shortcutDocIdsToLoad(SearchService.java:579)
    at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:317)

I'd be less surprised if the index didn't hold just a dozen small
documents; it looks like elasticsearch is preallocating a huge array?

My Lucene internals may be out of date, but if it's the same as a while
back, the PriorityQueue used to hold the results is backed by an array
of the requested size.

If you want all results, don't 'search', 'scroll' instead.
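A self-contained sketch of the bounded-queue idea (using java.util.PriorityQueue and illustrative names, not Lucene's actual org.apache.lucene.util.PriorityQueue): a min-heap of capacity k keeps only the k best scores, so memory tracks the requested size rather than the number of matching documents.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class TopKDemo {
    // Keep only the k highest scores seen so far; the queue never
    // holds more than k entries, regardless of how many docs stream by.
    public static List<Double> topK(Iterable<Double> scores, int k) {
        PriorityQueue<Double> heap = new PriorityQueue<>(k); // min-heap, <= k entries
        for (double s : scores) {
            if (heap.size() < k) {
                heap.offer(s);
            } else if (s > heap.peek()) {
                heap.poll();    // evict the current worst of the top k
                heap.offer(s);
            }
        }
        // Drain the heap into descending (best-first) order.
        List<Double> result = new ArrayList<>(heap);
        result.sort((a, b) -> Double.compare(b, a));
        return result;
    }

    public static void main(String[] args) {
        List<Double> scores = List.of(0.3, 0.9, 0.1, 0.7, 0.5);
        System.out.println(topK(scores, 2)); // [0.9, 0.7]
    }
}
```

Requesting size=Integer.MAX_VALUE defeats this scheme: the backing array is sized to the request up front, which matches the OutOfMemoryError above.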

On 28 March 2012 12:26, Eric Jain <eric.jain@gmail.com> wrote:

> SearchSourceBuilder search = new SearchSourceBuilder()
>     .query(...).from(0).size(Integer.MAX_VALUE);
> index.search(search).hits();
>
> -> Exception in thread "elasticsearch[search]-pool-47-thread-1"
> java.lang.OutOfMemoryError: Java heap space
>     at org.elasticsearch.search.SearchService.shortcutDocIdsToLoad(SearchService.java:579)
>     at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:317)
>
> I'd be less surprised if the index didn't hold just a dozen small
> documents; it looks like elasticsearch is preallocating a huge array?

On Mar 27, 7:00 pm, Paul Smith <tallpsm...@gmail.com> wrote:

> My Lucene internals may be out of date, but if it's the same as a while
> back, the PriorityQueue used to hold the results is backed by an array
> of the requested size.
>
> If you want all results, don't 'search', 'scroll' instead.

I can do that, but wouldn't it make sense to allocate no more than
Math.min(size, numberOfDocuments)?
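A sketch of the clamp Eric is suggesting (the names are illustrative; this is not how Elasticsearch actually sizes the queue):

```java
public class QueueSizing {
    // Hypothetical clamp: never allocate more queue slots than the
    // index has documents, whatever size the caller requests.
    static int queueSize(int requestedSize, int numberOfDocuments) {
        return Math.min(requestedSize, numberOfDocuments);
    }

    public static void main(String[] args) {
        // A dozen docs with size=Integer.MAX_VALUE would get 12 slots,
        // not a 2^31-1 element array.
        System.out.println(queueSize(Integer.MAX_VALUE, 12)); // 12
    }
}
```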

Hi Eric!

On Wednesday, March 28, 2012 at 08:40:09 UTC+2, Eric Jain wrote:

> On Mar 27, 7:00 pm, Paul Smith <tallpsm...@gmail.com> wrote:
>
> > My Lucene internals may be out of date, but if it's the same as a while
> > back, the PriorityQueue used to hold the results is backed by an array
> > of the requested size.
> >
> > If you want all results, don't 'search', 'scroll' instead.
>
> I can do that, but wouldn't it make sense to allocate no more than
> Math.min(size, numberOfDocuments)?

If you really need all documents in the index, have a look at the
"scan/scroll" API in the Elasticsearch reference documentation. If you
know that your index will only hold a few hundred documents, why use
MAX_VALUE?

Simple math like Math.min() can be dangerous: you allocate the array and
then start to fill it with hits from the index, and while you do that the
index can grow as fresh documents arrive. This might be no problem in your
case, but we index 1000 docs/s.
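The scroll approach retrieves hits in bounded batches instead of one giant page. A self-contained sketch of the control flow (the class and method names are made up for this demo; the real Java client drives this via the scan/scroll search type and the scroll id returned with each response, not via from/size):

```java
import java.util.ArrayList;
import java.util.List;

public class ScrollSketch {
    // Stand-in for an index; in real code each fetch would be a
    // scroll request to the Elasticsearch client.
    private final List<String> docs;

    ScrollSketch(List<String> docs) { this.docs = docs; }

    // Fetch one batch of at most `size` docs starting at `from`.
    List<String> fetch(int from, int size) {
        int to = Math.min(from + size, docs.size());
        return from >= to ? List.of() : docs.subList(from, to);
    }

    // Scroll-style loop: memory per step is bounded by the batch size,
    // never by the total number of matching documents.
    static List<String> scrollAll(ScrollSketch index, int batchSize) {
        List<String> all = new ArrayList<>();
        int from = 0;
        while (true) {
            List<String> batch = index.fetch(from, batchSize);
            if (batch.isEmpty()) break; // no more hits: scrolling is done
            all.addAll(batch);
            from += batch.size();
        }
        return all;
    }

    public static void main(String[] args) {
        ScrollSketch index = new ScrollSketch(List.of("a", "b", "c", "d", "e"));
        System.out.println(scrollAll(index, 2)); // [a, b, c, d, e]
    }
}
```

Unlike this naive from/size loop, a real scroll also holds a consistent point-in-time view of the index, so documents arriving mid-scroll don't shift the pages.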

CU
Thomas

On Wednesday, 28 March 2012, Eric Jain <eric.jain@gmail.com> wrote:

> On Mar 27, 7:00 pm, Paul Smith <tallpsm...@gmail.com> wrote:
>
> > My Lucene internals may be out of date, but if it's the same as a while
> > back, the PriorityQueue used to hold the results is backed by an array
> > of the requested size.
> >
> > If you want all results, don't 'search', 'scroll' instead.
>
> I can do that, but wouldn't it make sense to allocate no more than
> Math.min(size, numberOfDocuments)?

Lucene optimizes for the common case of only the top X hits being needed.
It's more efficient for scored sorting to do it this way, and far less
memory is used; how else would you sort, say, a billion documents? Only
documents that match the query/filters are passed through the priority
queue, and for the common case of only needing the top X that means far
fewer comparisons and far less memory overall. The larger the requested
hit size, the more comparisons and memory.

Even with the Math.min, for us with large indexes it would mean allocating
a huge array when we only want the first 25 most of the time. That's a
waste.
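Paul's point about comparison counts can be checked with a stand-alone demo (plain java.util classes, illustrative names): wrapping the comparators with a counter shows the bounded heap doing far fewer comparisons than sorting every hit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;
import java.util.concurrent.atomic.AtomicLong;

public class ComparisonCount {
    // Comparator invocations for "top k via bounded heap".
    static long countTopK(List<Integer> xs, int k) {
        AtomicLong n = new AtomicLong();
        PriorityQueue<Integer> heap = new PriorityQueue<>(k, (a, b) -> {
            n.incrementAndGet();
            return Integer.compare(a, b);
        });
        for (int x : xs) {
            if (heap.size() < k) heap.offer(x);
            else if (x > heap.peek()) { heap.poll(); heap.offer(x); }
        }
        return n.get();
    }

    // Comparator invocations for "sort everything, then take the top".
    static long countFullSort(List<Integer> xs) {
        AtomicLong n = new AtomicLong();
        List<Integer> copy = new ArrayList<>(xs);
        copy.sort((a, b) -> { n.incrementAndGet(); return Integer.compare(a, b); });
        return n.get();
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        List<Integer> xs = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) xs.add(rnd.nextInt());
        // Top-25 of 100,000: roughly N log k heap work vs N log N sort work.
        System.out.println("top-25 heap: " + countTopK(xs, 25));
        System.out.println("full sort:   " + countFullSort(xs));
    }
}
```

The gating test `x > heap.peek()` is a plain primitive compare and isn't counted here; it only makes the heap's advantage look smaller than it is.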

On Tue, Mar 27, 2012 at 23:56, Thomas Peuss <thomas.peuss@nterra.com> wrote:

> If you really need all documents in the index, have a look at the
> "scan/scroll" API in the Elasticsearch reference documentation. If you
> know that your index will only hold a few hundred documents, why use
> MAX_VALUE?
>
> Simple math like Math.min() can be dangerous: you allocate the array and
> then start to fill it with hits from the index, and while you do that the
> index can grow as fresh documents arrive. This might be no problem in your
> case, but we index 1000 docs/s.

Hadn't considered this case; I wouldn't mind if it were just new docs
that went missing, but having new docs show up in place of older docs
could indeed be confusing.
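The effect Eric describes can be made concrete with a tiny sketch (illustrative data): paging with from/size over a live, newest-first result list repeats a document once a fresh one shifts the ranking — the kind of inconsistency scroll's snapshot view avoids.

```java
import java.util.ArrayList;
import java.util.List;

public class ShiftingPages {
    // Return one from/size page of a (mutable) hit list.
    static List<String> page(List<String> hits, int from, int size) {
        return new ArrayList<>(hits.subList(from, Math.min(from + size, hits.size())));
    }

    // Two pages of size 2, with a fresh document arriving in between.
    static List<List<String>> demoPages() {
        List<String> hits = new ArrayList<>(List.of("doc4", "doc3", "doc2", "doc1"));
        List<String> page1 = page(hits, 0, 2); // [doc4, doc3]
        hits.add(0, "doc5");                   // new doc sorts to the top
        List<String> page2 = page(hits, 2, 2); // [doc3, doc2] -- doc3 again!
        return List.of(page1, page2);
    }

    public static void main(String[] args) {
        System.out.println(demoPages()); // [[doc4, doc3], [doc3, doc2]]
    }
}
```

doc3 appears on both pages, and doc5 is never seen at all.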