Elasticsearch version: 5.4.3
Heap allocated: 4GB
Java version: Java 8
I am trying to index a large PDF file of 90 MB using the ingest-attachment plugin. I encode the PDF file into base64 and then index it by calling the Python client's .index() method. I get this error:
ConnectionError(('Connection aborted.', error(104, 'Connection reset by peer'))) caused by: ProtocolError(('Connection aborted.', error(104, 'Connection reset by peer')))
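For reference, the indexing code looks roughly like this (a minimal sketch; the index name, pipeline name, and field name below are placeholders, not necessarily what I actually use):

```python
import base64
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# Read the PDF and base64-encode it for the ingest-attachment processor
with open("big_file.pdf", "rb") as f:
    data = base64.b64encode(f.read()).decode("ascii")

# "attachments" pipeline and "data" field are placeholders; they must match
# the ingest pipeline created with the attachment processor
es.index(
    index="documents",
    doc_type="doc",
    id="big_file.pdf",
    pipeline="attachments",
    body={"data": data},
)
```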
I can index smaller files successfully, but the process gets terminated when it reaches this file. Any help is appreciated. Thanks.
If you intend to send very big documents like this, 4GB of heap may be too small at some point. Monitor it to make sure you are not putting too much pressure on it.
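For example, heap usage can be checked from the same Python client via the nodes stats API (a minimal sketch; watch heap_used_percent while the large document is being indexed):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# Report JVM heap usage per node
stats = es.nodes.stats(metric="jvm")
for node_id, node in stats["nodes"].items():
    mem = node["jvm"]["mem"]
    print(node["name"], "heap used:", mem["heap_used_percent"], "%")
```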
You can also use an external process to do the document extraction, such as FSCrawler, or use Tika directly.
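A rough sketch of the Tika route, assuming the tika Python package and a reachable Tika server (the index and field names are placeholders): the text is extracted outside Elasticsearch, so only plain text is indexed instead of shipping the whole base64-encoded PDF through an ingest pipeline.

```python
from tika import parser  # pip install tika; uses a local Tika server
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# Extract the text from the PDF with Tika
parsed = parser.from_file("big_file.pdf")

# Index only the extracted text, not the binary attachment
es.index(
    index="documents",
    doc_type="doc",
    id="big_file.pdf",
    body={"content": parsed["content"] or ""},
)
```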
As per point 3 of your comments, I guess the Python client is timing out. I don't see anything in the Elasticsearch logs regarding "http.max_content_length", or even the GCs kicking in, which would be the case if 4GB of heap were too small.
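If it is a client-side timeout, raising the timeout is a cheap thing to try; it is also worth double-checking http.max_content_length in elasticsearch.yml, since it defaults to 100mb and a 90 MB PDF grows to roughly 120 MB once base64-encoded. A minimal sketch (the values and names are examples only):

```python
import base64
from elasticsearch import Elasticsearch

# Longer timeouts give the cluster time to run the attachment processor
# on a large payload; 300 seconds is an example, not a recommendation.
es = Elasticsearch(["localhost:9200"], timeout=300)

with open("big_file.pdf", "rb") as f:
    data = base64.b64encode(f.read()).decode("ascii")

es.index(
    index="documents",
    doc_type="doc",
    id="big_file.pdf",
    pipeline="attachments",
    body={"data": data},
    request_timeout=300,  # per-request override, in seconds
)
```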