[indices:data/read/search[phase/fetch/id]]]; nested: ElasticsearchException[Java heap space]; nested: OutOfMemoryError[Java heap space]

Elasticsearch Version 2.3.2
ES_HEAP_SIZE=2G

I have a use case that requires indexing attachments into ES. Due to certain constraints, I cannot split the docs/attachments and index them as separate docs within ES. I was able to index 14 attachments (each attachment ~130 MB); however, when I try to query, I get the issue below. When I query, I am not requesting all the fields of the documents; in particular, I am not requesting the attachment field.

Sample JSON doc:

{
  "name": "xyz",
  "title": "xx",
  "attachment": "............"
}

[2017-05-04 03:42:22,869][DEBUG][action.search ] [Doop] [17] Failed to execute fetch phase
RemoteTransportException[[Doop][slc12oxp.us.x.com/10.196.3.67:9300][indices:data/read/search[phase/fetch/id]]]; nested: ElasticsearchException[Java heap space]; nested: OutOfMemoryError[Java heap space];
Caused by: ElasticsearchException[Java heap space]; nested: OutOfMemoryError[Java heap space];
at org.elasticsearch.ExceptionsHelper.convertToRuntime(ExceptionsHelper.java:50)
at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:604)
at org.elasticsearch.search.action.SearchServiceTransportAction$FetchByIdTransportHandler.messageReceived(SearchServiceTransportAction.java:408)
at org.elasticsearch.search.action.SearchServiceTransportAction$FetchByIdTransportHandler.messageReceived(SearchServiceTransportAction.java:405)
at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Unknown Source)
at java.lang.String.<init>(Unknown Source)
at java.lang.StringBuilder.toString(Unknown Source)
at com.fasterxml.jackson.core.util.TextBuffer.contentsAsString(TextBuffer.java:356)
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishAndReturnString(UTF8StreamJsonParser.java:2412)
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:285)
at org.elasticsearch.common.xcontent.json.JsonXContentParser.text(JsonXContentParser.java:84)
at org.elasticsearch.common.xcontent.support.AbstractXContentParser.readValue(AbstractXContentParser.java:299)
at org.elasticsearch.common.xcontent.support.AbstractXContentParser.readMap(AbstractXContentParser.java:274)
at org.elasticsearch.common.xcontent.support.AbstractXContentParser.readMap(AbstractXContentParser.java:245)
at org.elasticsearch.common.xcontent.support.AbstractXContentParser.map(AbstractXContentParser.java:208)
at org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:83)
at org.elasticsearch.search.lookup.SourceLookup.sourceAsMapAndType(SourceLookup.java:88)
at org.elasticsearch.search.lookup.SourceLookup.loadSourceIfNeeded(SourceLookup.java:64)
at org.elasticsearch.search.lookup.SourceLookup.extractRawValues(SourceLookup.java:130)
at org.elasticsearch.search.fetch.FetchPhase.createSearchHit(FetchPhase.java:241)
at org.elasticsearch.search.fetch.FetchPhase.execute(FetchPhase.java:178)
at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:592)
... 9 more

So you have a very big document containing some attachments encoded in BASE64? Everything is stored in the _source field, I guess?

That might explain why it requires some memory to fetch the first 10 docs.

A suggestion is to remove unneeded fields like attachment using the remove ingest processor.
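
For reference, the remove processor is part of the ingest node feature in 5.x; a minimal pipeline sketch, with a hypothetical pipeline name and assuming the raw field is called attachment:

PUT _ingest/pipeline/drop-attachment
{
  "description": "drop the raw attachment field before indexing",
  "processors": [
    { "remove": { "field": "attachment" } }
  ]
}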

BTW, you did not say how you are actually indexing your documents. Are you using ingest-attachment?

Can you also show what your query looks like?

And finally, maybe increase the heap size. :stuck_out_tongue:

Your heap size is too low.
What is the total RAM size?
How many docs in total?

Yes. I have to store it in the _source document because if I don't, then during an update of the same document I will not be able to retain it in the updated document.
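
For context, a partial update in Elasticsearch reads the stored _source, merges the changes in, and reindexes the whole document, so anything missing from _source is silently lost on update. A minimal sketch with hypothetical index, type, and id:

POST myindex/doc/1/_update
{
  "doc": { "title": "updated title" }
}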

A suggestion is to remove unneeded fields like attachment using the remove ingest processor.

Are you talking about the ingest node? I am using ES 2.3.2; it isn't available there, is it? Can you please elaborate?

BTW, you did not say how you are actually indexing your documents. Are you using ingest-attachment?

I am using mapper-attachments, as ingest-attachment isn't available in 2.3.2.
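
For reference, the mapping side of mapper-attachments looks roughly like this (index, type, and field names are just examples):

PUT myindex
{
  "mappings": {
    "doc": {
      "properties": {
        "attachment": { "type": "attachment" }
      }
    }
  }
}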

Query

It is a simple match query with the attachment field removed from the requested fields:

{
  "fields": ["name", "title"],
  "query": {
    "query_string": {
      "query": "position"
    }
  }
}
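
Worth noting: limiting fields like this only shrinks the response; the fetch phase still loads and parses the entire _source on the data node to extract those field values (that is what the SourceLookup frames in the stack trace show), so the base64 is still read into heap. The same caveat should apply to request-level source filtering, sketched below with 2.x syntax:

{
  "_source": {
    "exclude": ["attachment"]
  },
  "query": {
    "query_string": {
      "query": "position"
    }
  }
}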

RAM is 8 GB. I have only a few documents, at most 1000, of which 14 are very big, each about 130 MB.

OK. First, be aware that mapper-attachments is removed in 6.0. So I'd suggest you upgrade to 5.4.0 and use ingest instead.

Check if this helps in that case.

I'd also store the binary document outside elasticsearch and only add the URL to the doc in elasticsearch, along with the extracted text.

I have read that mapper-attachments will be removed starting with 6.0. Upgrading to 5.4 is not smooth due to some business constraints.

The current design in our project makes heavy use of mapper-attachments for attachment processing. We don't want the original base64 document to be stored in ES. Initially we thought that if mapper-attachments supported copy_to, we could avoid storing the base64 in _source, but mapper-attachments doesn't support copy_to. The only reason we are storing the base64 content in _source is so that when updates happen to the same document, I can still retain the original extracted content in the attachment.content field.
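
For comparison, this is what copy_to does on a regular 2.x field, and what we wished the attachment sub-fields supported (field names here are hypothetical):

{
  "mappings": {
    "doc": {
      "properties": {
        "title": { "type": "string", "copy_to": "all_text" },
        "all_text": { "type": "string" }
      }
    }
  }
}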

My question: is it possible, or is there a way, to remove the base64 content, keep just the extracted content in the source, and thus make the document safe for updates too (i.e., update the doc without losing any data)?

You can exclude the BASE64 from the source with https://www.elastic.co/guide/en/elasticsearch/reference/2.4/mapping-source-field.html#include-exclude
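
A minimal sketch of that mapping on 2.x (index and type names are examples):

PUT myindex
{
  "mappings": {
    "doc": {
      "_source": {
        "excludes": ["attachment"]
      },
      "properties": {
        "attachment": { "type": "attachment" }
      }
    }
  }
}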

Also have a look at the FSCrawler project in case it helps.

If I exclude it from _source, what would happen when I update the document? I would lose it after the update, right? For an update, the contents need to be in the _source field.

Probably. That's why I'd advise upgrading to 5.4 and using ingest-attachment.
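
On 5.x the whole flow could be a single pipeline: extract the text with the attachment processor, then drop the raw base64 with the remove processor, so _source keeps only the extracted content and survives updates (pipeline, index, and field names are examples):

PUT _ingest/pipeline/attachment
{
  "description": "extract text, then drop the raw base64",
  "processors": [
    { "attachment": { "field": "data" } },
    { "remove": { "field": "data" } }
  ]
}

PUT myindex/doc/1?pipeline=attachment
{
  "data": "..."
}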
