ES OutOfMemoryError while indexing a large number of attachments

I've been working to stabilize a process where I need to index ~16K
attachments. The following is the latest result:

https://gist.github.com/2252398

I'm running ES with ES_HEAP_SIZE=2g but it still runs out of heap after
indexing ~5K attachments. The attachments range
in size from a few KB to no more than 20MB.

The indexing process currently uses 2 workers to update already indexed
documents by adding the attachments via the update API. See code here:

https://gist.github.com/2252482

I'd like suggestions on how to get this process to work without ES running
out of memory. 2GB seems plenty for what I'm trying to do and giving more
memory isn't really possible since the box only has 4GB.

Thanks,
Shane

I forgot to mention that I also get a shard failure around the same time
the OutOfMemoryError occurs.

Is there a chance that you have a single document that you end up adding
many attachments to, resulting in the OOM failure? The update API simply
reads the current doc and then reindexes all of it.
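
To make the concern concrete, here is a rough back-of-the-envelope sketch in Python. It assumes attachments are stored base64-encoded in the document source and that every update resends the whole document, as described above. The function names and the 10 MB sizes are illustrative and not taken from the gists.

```python
import math

def base64_size(raw_bytes: int) -> int:
    """Size in bytes of raw_bytes of data after base64 encoding (with padding)."""
    return 4 * math.ceil(raw_bytes / 3)

def reindexed_sizes(attachment_sizes):
    """Source size that has to be handled on each successive update.

    Assumes every update re-reads and reindexes the full document,
    including all attachments added so far.
    """
    total = 0
    sizes = []
    for raw in attachment_sizes:
        total += base64_size(raw)
        sizes.append(total)
    return sizes

# Hypothetical document that receives five 10 MB attachments, one per update.
MB = 1024 * 1024
for step, size in enumerate(reindexed_sizes([10 * MB] * 5), start=1):
    print(f"update {step}: ~{size / MB:.1f} MB of source to reindex")
```

The source alone grows with every update. Text extraction and concurrent workers would add to whatever is on the heap at the same time, which this sketch does not try to model.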

Yes, each document may have several attachments associated with it.

Given the scenario I've outlined, does it make more sense to put
attachments in their own index? It seems I've hit a limitation of the
attachment plugin with the limited amount of RAM that I have and the
potential of several multi-MB attachments per document. I'm also curious if
you think increasing the amount of RAM on the machines would help in this
case?

I have just the one index and was hoping to avoid creating another index
for attachments but if this is the way to go what would be the best way to
associate them?

Thanks,
Shane

Yeah, breaking the attachments out into their own docs probably makes more
sense.
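
One way to picture this, as a minimal Python sketch rather than a tested recipe: each attachment becomes its own small document carrying a plain field that points back at the original document. The field names (`parent_id`, `filename`, `file`) and the ID `doc-42` are hypothetical; `file` stands in for whatever field is mapped with the attachment type.

```python
import base64

def attachment_docs(parent_id, attachments):
    """Yield one small document per attachment instead of one large parent doc.

    parent_id and the field names below are hypothetical placeholders.
    """
    for name, content in attachments:
        yield {
            "parent_id": parent_id,  # plain field linking back to the original doc
            "filename": name,
            "file": base64.b64encode(content).decode("ascii"),
        }

docs = list(attachment_docs("doc-42", [("a.pdf", b"%PDF-1.4 ..."), ("b.txt", b"hello")]))
for d in docs:
    print(d["parent_id"], d["filename"], len(d["file"]))
```

With this shape, each index request carries only one attachment, so the size of a single request is bounded by the largest file rather than by the sum of all files on a document.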

I don't have experience (yet) with searching on more than one index at a
time. Is the multisearch API what I need here, or is there some way of
associating one index with another?

Thanks

You just specify the two index names in the URI, or if you use the Java
API, specify both indices in the search API.
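
As a small illustration of the URI form, the Python sketch below only builds the request and sends nothing. The host, the index names `documents` and `attachments`, and the query text are assumptions made up for the example.

```python
import json
from urllib.parse import quote

def multi_index_search_url(host, indices):
    """Build a search URL that targets several indices at once."""
    return f"{host}/{','.join(quote(i) for i in indices)}/_search"

# Hypothetical host and index names.
url = multi_index_search_url("http://localhost:9200", ["documents", "attachments"])
body = json.dumps({"query": {"query_string": {"query": "invoice"}}})
print(url)
print(body)
```

The index names are joined with commas in the path, and the same query body is run against both indices in a single search request.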

My hunch is that Tika is running out of memory on you. Tika moved from
using the filesystem for temp storage to an in-memory approach for
PDFs, which led to some out-of-memory issues on my side. Really, it is
PDFBox under Tika that made the change, and I believe Tika 0.9 picked
this up. You should be able to confirm by analyzing the heap dump.

In Tika 1.1, you can control this behavior by passing in a file object
instead of a stream.

Best Regards,
Paul
