I've been working to stabilize a process where I need to index ~16K
attachments. The following is the latest result:
I'm running ES with ES_HEAP_SIZE=2g but still running into an issue of
running out of heap after indexing ~5K attachments. The attachments range
in size from a few KB to no more than 20MB.
The indexing process currently uses 2 workers to update already indexed
documents by adding the attachments via the update API. See code here:
I'd like suggestions on how to get this process to work without ES running
out of memory. 2GB seems plenty for what I'm trying to do and giving more
memory isn't really possible since the box only has 4GB.
Is there a chance that you have a single document that you end up adding
many attachments to, resulting in the OOM failure? The update API simply
reads the current doc and then reindexes all of it.
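To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It assumes the attachments are stored base64-encoded in the document source and that every update rewrites the full source, as described above. The function names and the five 20MB attachments are made up for illustration, not taken from the actual indexing code.

```python
import math

def base64_size(raw_bytes: int) -> int:
    """Size in bytes of raw_bytes of data after base64 encoding (with padding)."""
    return 4 * math.ceil(raw_bytes / 3)

def reindex_cost(attachment_sizes):
    """Encoded payload size reindexed on each update when attachments are
    appended one by one to the same document."""
    costs = []
    total = 0
    for size in attachment_sizes:
        total += base64_size(size)  # the document keeps every earlier attachment
        costs.append(total)         # the whole source is reindexed on this update
    return costs

MB = 1024 * 1024
# Hypothetical worst case: five 20MB attachments added to one document.
costs = reindex_cost([20 * MB] * 5)
for n, cost in enumerate(costs, start=1):
    print(f"update {n}: ~{cost / MB:.1f} MB of encoded attachments reindexed")
```

The payload grows with every update, so a document with several large attachments costs far more per update than its newest attachment alone would suggest. With 2 workers, two such payloads can be in flight at once, before counting whatever Tika needs to parse them.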
Given the scenario I've outlined, does it make more sense to put
attachments in their own index? It seems I've hit a limitation of the
attachment plugin with the limited amount of RAM that I have and the
potential of several multi-MB attachments per document. I'm also curious if
you think increasing the amount of RAM on the machines would help in this
case?
I have just the one index and was hoping to avoid creating another index
for attachments but if this is the way to go what would be the best way to
associate them?
Thanks,
Shane
On Saturday, March 31, 2012 7:36:37 PM UTC-4, Shane Witbeck wrote:
Yes, each document may have several attachments associated with it.
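One way the separate-document idea could look is sketched below in Python: each attachment becomes its own small document that carries the ID of the document it belongs to. This is only an illustration under assumptions. The field names `parent_id`, `filename` and `file`, the ID scheme, and the sample values are hypothetical, and the `file` field is assumed to be mapped with the attachment type, which takes base64-encoded content.

```python
import base64
import json

def attachment_id(parent_id: str, filename: str) -> str:
    """Deterministic ID (hypothetical scheme) so re-running the job
    overwrites an attachment document instead of duplicating it."""
    return f"{parent_id}:{filename}"

def attachment_doc(parent_id: str, filename: str, data: bytes) -> dict:
    """Build the source for one attachment document."""
    return {
        "parent_id": parent_id,  # ID of the already indexed document
        "filename": filename,
        "file": base64.b64encode(data).decode("ascii"),  # attachment content
    }

# Hypothetical example values.
doc = attachment_doc("doc-42", "report.pdf", b"%PDF-1.4 example bytes")
doc_id = attachment_id("doc-42", "report.pdf")
body = json.dumps(doc)  # request body for a plain index call
print(doc_id, len(body))
```

With this layout every attachment is indexed exactly once with a plain index request, so no earlier attachment has to be read back and reindexed. A search that hits an attachment document can then follow `parent_id` to the main document.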
My hunch is that Tika is running out of memory on you. Tika moved from
using the filesystem for temp storage to an in-memory approach for PDFs,
which led to some out-of-memory issues on my side. Really, it is PDFBox
under Tika that made the change, and I believe Tika 0.9 picked this up.
You should be able to confirm by analyzing the heap dump.
In Tika 1.1, you can control this behavior by passing in a file object
instead of a stream.