ArrayIndexOutOfBoundsException while searching an index when the number of returned documents is very large


(Maomao Di) #1

I am using the latest version of Elasticsearch to search an index. The query should return around 4,000,000 documents. I
don't want to use scroll because I would like to measure the time to return
the result without counting the network time of multiple scroll requests, and then compare that time with Solr.
However, when I increase size in the search to 712227, I get an
ArrayIndexOutOfBoundsException. I increased the heap size to 64G
and the max open files limit to 65535. Neither helped.
The stack trace is below:
java.lang.ArrayIndexOutOfBoundsException: -131072
    at org.elasticsearch.common.util.BigByteArray.set(BigByteArray.java:97)
    at org.elasticsearch.common.io.stream.BytesStreamOutput.writeBytes(BytesStreamOutput.java:93)
    at org.elasticsearch.common.io.stream.StreamOutput.write(StreamOutput.java:299)
    at com.fasterxml.jackson.core.json.UTF8JsonGenerator._flushBuffer(UTF8JsonGenerator.java:2014)
    at com.fasterxml.jackson.core.json.UTF8JsonGenerator.flush(UTF8JsonGenerator.java:1027)
    at org.elasticsearch.common.xcontent.json.JsonXContentGenerator.flush(JsonXContentGenerator.java:436)
    at org.elasticsearch.common.xcontent.json.JsonXContentGenerator.writeRawField(JsonXContentGenerator.java:369)
    at org.elasticsearch.common.xcontent.XContentBuilder.rawField(XContentBuilder.java:914)
    at org.elasticsearch.common.xcontent.XContentHelper.writeRawField(XContentHelper.java:378)
    at org.elasticsearch.search.internal.InternalSearchHit.toXContent(InternalSearchHit.java:476)
    at org.elasticsearch.search.internal.InternalSearchHits.toXContent(InternalSearchHits.java:184)
    at org.elasticsearch.search.internal.InternalSearchResponse.toXContent(InternalSearchResponse.java:111)
    at org.elasticsearch.action.search.SearchResponse.toXContent(SearchResponse.java:195)
    at org.elasticsearch.rest.action.support.RestStatusToXContentListener.buildResponse(RestStatusToXContentListener.java:43)
    at org.elasticsearch.rest.action.support.RestStatusToXContentListener.buildResponse(RestStatusToXContentListener.java:38)
    at org.elasticsearch.rest.action.support.RestStatusToXContentListener.buildResponse(RestStatusToXContentListener.java:30)
    at org.elasticsearch.rest.action.support.RestResponseListener.processResponse(RestResponseListener.java:43)
    at org.elasticsearch.rest.action.support.RestActionListener.onResponse(RestActionListener.java:49)
    at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:89)
    at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:85)
    at org.elasticsearch.action.search.SearchScrollQueryThenFetchAsyncAction.innerFinishHim(SearchScrollQueryThenFetchAsyncAction.java:223)
    at org.elasticsearch.action.search.SearchScrollQueryThenFetchAsyncAction.finishHim(SearchScrollQueryThenFetchAsyncAction.java:211)
    at org.elasticsearch.action.search.SearchScrollQueryThenFetchAsyncAction.access$100(SearchScrollQueryThenFetchAsyncAction.java:44)
    at org.elasticsearch.action.search.SearchScrollQueryThenFetchAsyncAction$2.onResponse(SearchScrollQueryThenFetchAsyncAction.java:191)
    at org.elasticsearch.action.search.SearchScrollQueryThenFetchAsyncAction$2.onResponse(SearchScrollQueryThenFetchAsyncAction.java:185)
    at org.elasticsearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:41)
    at org.elasticsearch.transport.TransportService$DirectResponseChannel.processResponse(TransportService.java:836)
    at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:820)
    at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:810)
    at org.elasticsearch.transport.DelegatingTransportChannel.sendResponse(DelegatingTransportChannel.java:58)
    at org.elasticsearch.transport.RequestHandlerRegistry$TransportChannelWrapper.sendResponse(RequestHandlerRegistry.java:140)
    at org.elasticsearch.search.action.SearchServiceTransportAction$FetchByIdTransportHandler.messageReceived(SearchServiceTransportAction.java:409)
    at org.elasticsearch.search.action.SearchServiceTransportAction$FetchByIdTransportHandler.messageReceived(SearchServiceTransportAction.java:405)
    at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:77)
    at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)


(David Pilato) #2

Please format your code using the </> icon. It will make your post more readable.

Here is the answer I gave at https://github.com/elastic/elasticsearch/issues/20959

So you changed the default settings in elasticsearch.yml file?

I don't want to use scroll cause I would like to record the time to return the result without counting the network time in multiple scrolls.

That's not how it works, and I don't see why you would record a time that does not reflect what you would do in production.
Use scroll. That's definitely the way to go instead of trying to fill your memory.


(Maomao Di) #3

Hi, thanks for your reply. I added ES_HEAP_SIZE=64G in the elasticsearch
file. I would like to record the time because I just want to compare
it with Solr and other search servers.


(David Pilato) #4

You can't compare apples and oranges.

Basically, you will never run that in production. So why would you care about the performance of something you won't use in production?

Please, don't do that. Use a system for what it has been built for.

Elasticsearch is not built to present 4 000 000 responses to a user within a single page.
An end user wants to have maybe 10 or 50 responses on a single page.
Maybe a user will go to the 2nd or 3rd page, but they will never try to get the least relevant information. Actually, I never do that when I'm googling for something.

If a user wants to extract data to a CSV file for example, then using scroll is the thing to do.
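
To make the scroll-based export a bit more concrete, here is a minimal sketch in Python. It is not tied to any client library: `fetch` is a hypothetical callable that I'm assuming wraps the initial search request (when called with `None`) and the follow-up scroll requests (when called with the previous scroll id), and returns a response shaped like the usual `_scroll_id` / `hits.hits` / `_source` JSON. `make_fake_fetch` is only an in-memory stand-in so the loop can be run without a cluster.

```python
import csv
import io
from typing import Callable, Dict, Iterator, List, Optional

# Hypothetical transport: called with None for the initial search, then with
# the previous scroll id. Returns one search/scroll-shaped response dict.
Fetch = Callable[[Optional[str]], Dict]


def iter_hits(fetch: Fetch) -> Iterator[Dict]:
    """Yield hits one batch at a time until an empty batch comes back."""
    scroll_id = None
    while True:
        response = fetch(scroll_id)
        hits = response["hits"]["hits"]
        if not hits:
            return
        yield from hits
        scroll_id = response["_scroll_id"]


def export_csv(fetch: Fetch, out, fields: List[str]) -> int:
    """Write the selected _source fields of every hit to out as CSV rows."""
    writer = csv.writer(out)
    writer.writerow(fields)
    count = 0
    for hit in iter_hits(fetch):
        writer.writerow([hit["_source"].get(field, "") for field in fields])
        count += 1
    return count


def make_fake_fetch(docs: List[Dict], batch_size: int) -> Fetch:
    """In-memory stand-in for a cluster, for demonstration only."""
    def fetch(scroll_id: Optional[str]) -> Dict:
        # The fake scroll id is simply the offset of the next batch.
        start = int(scroll_id) if scroll_id is not None else 0
        batch = docs[start:start + batch_size]
        return {
            "_scroll_id": str(start + len(batch)),
            "hits": {"hits": [{"_source": doc} for doc in batch]},
        }
    return fetch


# Usage: 25 fake documents exported in batches of 10.
docs = [{"id": i, "title": "doc %d" % i} for i in range(25)]
buffer = io.StringIO()
exported = export_csv(make_fake_fetch(docs, 10), buffer, ["id", "title"])
```

Only one batch is held in memory at a time, whatever the total number of hits, which is the point of using scroll for this kind of export.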

Trying to allocate 4 000 000 data structures on every single shard, then copying the results over the wire to the coordinating node and so on, does not make sense, really.

If you want to compare Solr and Elasticsearch, run end user queries on both systems when both systems have pretty much the expected activity you'll see in production, like 10 000 index operations per second and 100 search requests per second for instance.

Try to keep all elasticsearch defaults and adapt the defaults only if you know exactly what you are doing and the problem you are trying to solve.

We will be happy to help you here.
Hope this helps.

