Elasticsearch heap memory exception when trying to index a large document in chunks


(kietory) #1

Hello there,

Setup:

  • Elasticsearch version: 2.3.5
  • ES Java heap size: 4 GB
  • bootstrap.mlockall: true

Use case: we would like to index very large documents (around 1 GB) using the mapper-attachments plugin.

We know that memory is going to be an issue, so we send each 1 GB file to Elasticsearch synchronously in smaller 50 MB chunks.

To append the chunks in Elasticsearch, I used an update script as shown below:

var updateRequest = new UpdateRequest<DocumentPOCO, object>(indexName, document.GetType(), document.Id)
{
    Script = "ctx._source.documentContent += appendContent",
    Params = new Dictionary<string, object>
    {
        { "appendContent", document.DocumentContent }
    }
};

var updateRes = _elasticClient.Update(updateRequest);

We now consistently get a Java heap memory exception when trying to index more than 400 MB using the above script.

We were hoping the script would append each new chunk to the indexed document rather than load all the previous chunks and then append the next one. This "load previous chunks before appending the next chunk" behavior defeats the purpose of sending the document to the index in small chunks.

My first question is: is there anything more we could do to the above script to achieve what we want?

If not, my second question is: can anyone recommend another way of doing this, i.e. a way for Elasticsearch to index the same document chunk by chunk?

Please assist.

Regards


(Daniel Mitterdorfer) #2

Hi @kiet.tran,

a 1 GB document is way beyond what Elasticsearch is designed for. Do you really want to search within this document and then retrieve 1 GB if it matches? Here are a few options, depending on your use case:

  • Split the document into multiple smaller ones
  • Store the 1 GB file on the file system or in a blob store and extract only the data that you want (e.g. if this is a video file, it makes sense to store only the metadata in ES, not the actual movie)
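
A minimal sketch of the first option, assuming the `DocumentPOCO` type, `_elasticClient`, and `indexName` from the original snippet, and that the text has already been extracted from the file (the `chunkSize` value, `extractedText`, `originalFileId`, and the `-part-` id scheme are illustrative assumptions, not from this thread):

```csharp
// Sketch: index one large extracted text as many small, independent documents.
// Each chunk becomes its own Elasticsearch document, so no append script is needed.
const int chunkSize = 1000000; // roughly 1 MB of text per document (illustrative)

for (int offset = 0, part = 0; offset < extractedText.Length; offset += chunkSize, part++)
{
    var chunkDoc = new DocumentPOCO
    {
        // Derive a predictable id so all parts of the original file can be related.
        Id = string.Format("{0}-part-{1}", originalFileId, part),
        DocumentContent = extractedText.Substring(
            offset, Math.Min(chunkSize, extractedText.Length - offset))
    };

    _elasticClient.Index(chunkDoc, i => i.Index(indexName));
}
```

Highlighting then works per chunk, and a field holding the original file id would let you group hits back to the source file.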

I see what you want to achieve with your script, but documents in Elasticsearch are immutable. So, in fact, it will retrieve the data that you've sent so far and issue another index request for a new (version of the) document.

Daniel


(kietory) #3

Hi @danielmitterdorfer ,
Thanks for the prompt response.

We are using highlighting when searching the documents, so we expect a query to return only the matched terms with roughly 100 characters of surrounding context, not the full 1 GB.
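
For reference, a highlight request along those lines might look roughly like this with the NEST 2.x client (the field name, search term, and fragment settings are assumptions for illustration):

```csharp
// Sketch: match query with highlighting, returning ~100-character fragments
// instead of the full document content.
var searchResponse = _elasticClient.Search<DocumentPOCO>(s => s
    .Index(indexName)
    .Query(q => q
        .Match(m => m
            .Field("documentContent")
            .Query("searchTerm")))
    .Highlight(h => h
        .Fields(f => f
            .Field("documentContent")
            .FragmentSize(100)       // ~100 characters of context per fragment
            .NumberOfFragments(3))));
```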

The 1 GB files have .docx or .pdf extensions.

Will Elasticsearch eventually support indexing a document in chunks?

Cheers,

Kiet


(Daniel Mitterdorfer) #4

Hi @kiet.tran,

Those are pretty large documents. :slight_smile: I'd split each docx / pdf document accordingly, e.g. one section = one Elasticsearch document. Several MB per (Elasticsearch) document is definitely not a problem, but 1 GB is well beyond the sweet spot.

I doubt it, as this would require very significant architectural changes, and Elasticsearch would then be an entirely different product. But never say never.

Daniel


(system) #5

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.