ES OutOfMemoryError while indexing a large number of attachments

Shane_Witbeck · March 30, 2012, 4:03pm

I've been working to stabilize a process where I need to index ~16K
attachments. The following is the latest result:

https://gist.github.com/digitalsanctum/2252398

[2012-03-30 15:11:05,528][WARN ][index.engine.robin       ] [dev1] [threads][0] failed engine
java.lang.OutOfMemoryError: Java heap space
	at java.util.Arrays.copyOf(Arrays.java:2798)
	at org.elasticsearch.common.io.stream.BytesStreamOutput.writeByte(BytesStreamOutput.java:54)
	at org.elasticsearch.common.io.stream.StreamOutput.writeBoolean(StreamOutput.java:179)
	at org.elasticsearch.index.translog.Translog$Index.writeTo(Translog.java:483)
	at org.elasticsearch.index.translog.TranslogStreams.writeTranslogOperation(TranslogStreams.java:82)
	at org.elasticsearch.index.translog.fs.FsTranslog.add(FsTranslog.java:328)
	at org.elasticsearch.index.engine.robin.RobinEngine.innerIndex(RobinEngine.java:576)
	at org.elasticsearch.index.engine.robin.RobinEngine.index(RobinEngine.java:479)

(Truncated; see the gist above for the full trace.)

I'm running ES with ES_HEAP_SIZE=2g but still running into an issue of
running out of heap after indexing ~5K attachments. The attachments range
in size from a few KB to no more than 20MB.

The indexing process currently uses 2 workers to update already indexed
documents by adding the attachments via the update API. See code here:

https://gist.github.com/digitalsanctum/2252482

byte[] contentBytes;
String attachmentContent = null;
try {
    // Read the whole downloaded file into memory, then Base64-encode it
    // into a second in-memory copy for the attachment field.
    contentBytes = Streams.copyToByteArray(downloadedFile);
    attachmentContent = Base64.encodeBytes(contentBytes);
} catch (IOException e) {
    log.error("error reading downloaded file", e);
    return false;
}

(Truncated; see the gist above for the full source.)
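
For a rough sense of scale, the fragment above holds both the raw bytes and the Base64 string in memory at once. The sketch below estimates that transient footprint for the largest attachment size mentioned (20MB). It assumes padded Base64 without line breaks and 2 bytes per char in the encoded String; the class and method names are hypothetical, and the numbers describe only the client-side copies, not what ES itself allocates.

```java
public class Base64Footprint {

    /** Number of Base64 characters produced for rawBytes input bytes (padded, no line breaks). */
    static long encodedLength(long rawBytes) {
        return 4 * ((rawBytes + 2) / 3);
    }

    /** Rough transient heap cost: the raw byte[] plus the encoded String (assumed bytesPerChar). */
    static long transientHeapBytes(long rawBytes, int bytesPerChar) {
        return rawBytes + encodedLength(rawBytes) * bytesPerChar;
    }

    public static void main(String[] args) {
        long raw = 20L * 1024 * 1024; // largest attachment size mentioned in the post
        System.out.println("Base64 chars: " + encodedLength(raw));
        System.out.println("Approx. heap bytes (2 bytes/char assumed): " + transientHeapBytes(raw, 2));
    }
}
```

Base64 adds about one third to the payload, so every layer that holds the encoded form (client, request, document source) pays that overhead on top of the file size.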

I'd like suggestions on how to get this process to work without ES running
out of memory. 2GB seems plenty for what I'm trying to do and giving more
memory isn't really possible since the box only has 4GB.

Thanks,
Shane

Shane_Witbeck · March 30, 2012, 4:07pm

I forgot to mention that I also get a shard failure around the same time
the OutOfMemoryError occurs.

kimchy · March 31, 2012, 9:18pm

Is there a chance that you have a single document that you end up adding
many attachments to, resulting in the OOM failure? The update API simply
reads the current doc and then reindexes all of it.
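
To illustrate why that matters, here is a minimal sketch of how the work grows when attachments are appended to one document one update at a time. It assumes every attachment has the same encoded size and ignores the rest of the document; the class and method names are hypothetical and nothing here calls ES.

```java
public class UpdateGrowth {

    /** Size of the document source once k equally sized attachments have been added. */
    static long sourceSizeAfter(int k, long encodedAttachmentBytes) {
        return k * encodedAttachmentBytes;
    }

    /** Total bytes reindexed when attachments are appended one update at a time. */
    static long totalReindexed(int attachments, long encodedAttachmentBytes) {
        long total = 0;
        for (int k = 1; k <= attachments; k++) {
            // Each update rewrites the whole document, including all earlier attachments.
            total += sourceSizeAfter(k, encodedAttachmentBytes);
        }
        return total;
    }

    public static void main(String[] args) {
        long encoded = 27_962_028L; // Base64 size of a 20MB file (assumed worst case)
        for (int n = 1; n <= 5; n++) {
            System.out.println(n + " attachments: source=" + sourceSizeAfter(n, encoded)
                    + " bytes, total reindexed=" + totalReindexed(n, encoded) + " bytes");
        }
    }
}
```

The single-document source grows linearly with the number of attachments, while the cumulative reindexing work grows quadratically, which is consistent with a failure that only shows up after several thousand updates.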

Shane_Witbeck · March 31, 2012, 11:36pm

Yes, each document may have several attachments associated with it.

Shane_Witbeck · April 2, 2012, 9:18pm

Given the scenario I've outlined, does it make more sense to put
attachments in their own index? It seems I've hit a limitation of the
attachment plugin with the limited amount of RAM that I have and the
potential of several multi-MB attachments per document. I'm also curious
whether you think increasing the amount of RAM on the machines would help
in this case.

I have just the one index and was hoping to avoid creating another index
for attachments but if this is the way to go what would be the best way to
associate them?

Thanks,
Shane

kimchy · April 3, 2012, 2:32pm

Yeah, breaking the attachments out into their own docs probably makes more
sense.
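
One way to picture this is a small document per attachment that carries the id of the document it belongs to. The sketch below only builds the JSON source for such a document. The field names `parent_id`, `file_name`, and `file` are assumptions for illustration, not names from the thread or the attachment plugin, and the escaping is deliberately minimal.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class AttachmentDoc {

    /** Minimal JSON string escaping (backslash and quote only; control characters are not handled). */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Builds the source for one attachment document that points back to its owning document. */
    static String toJson(String parentId, String fileName, byte[] content) {
        String encoded = Base64.getEncoder().encodeToString(content);
        return "{\"parent_id\":\"" + escape(parentId) + "\","
                + "\"file_name\":\"" + escape(fileName) + "\","
                + "\"file\":\"" + encoded + "\"}";
    }

    public static void main(String[] args) {
        byte[] content = "hi".getBytes(StandardCharsets.UTF_8);
        System.out.println(toJson("42", "a.txt", content));
    }
}
```

With this layout each index request carries a single attachment, so the size of any one document is bounded by the largest file rather than by the number of files attached to a record.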

Shane_Witbeck · April 3, 2012, 2:40pm

I don't have experience (yet) with searching on more than one index at a
time. Is the multi search API what I need here, or is there some way of
associating one index with another?

Thanks

kimchy · April 3, 2012, 3:05pm

You just specify the two index names in the URI, or if you use the Java
API, specify both indices in the search API.
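
As a small sketch of the URI form, the helper below joins the index names with commas in front of `_search`. The host, the index names `documents` and `attachments`, and the class name are placeholders, and the example only builds the URI; it does not send a request.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class MultiIndexSearchUri {

    /** Builds a URI search request that targets several indices at once. */
    static String build(String host, String query, String... indices) {
        try {
            // Index names are comma-separated in the path; the query goes in the q parameter.
            return host + "/" + String.join(",", indices)
                    + "/_search?q=" + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("http://localhost:9200", "quarterly report",
                "documents", "attachments"));
    }
}
```

Hits from both indices come back in one result list, and each hit reports the index it came from, so the caller can tell records and attachments apart.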

ppearcy · April 3, 2012, 4:09pm

My hunch is that Tika is running out of memory on you. Tika moved from
using the filesystem for temp storage to an in-memory approach for
PDFs, which led to some out-of-memory issues on my side. Really, it is
PDFBox under Tika that made the change, and I believe Tika 0.9 picked
this up. You should be able to confirm by analyzing the heap dump.

In Tika 1.1, you can control this behavior by passing in a file object
instead of a stream.

Best Regards,
Paul
