Indexing a large PDF file (around 90 MB) gives an exception


Elasticsearch version: 5.4.3
Heap allocated: 4GB
Java version: Java 8
I am trying to index a large PDF file of 90 MB using the ingest-attachments plugin. I encode the PDF file into base64 and then index it by calling the .index() function of the Python client. I get this error:

ConnectionError(('Connection aborted.', error(104, 'Connection reset by peer'))) caused by: ProtocolError(('Connection aborted.', error(104, 'Connection reset by peer')))

I can successfully index smaller files, but the process terminates when it reaches this file. Any help is appreciated. Thanks.

Some comments:

If you intend to send very big documents like this, 4 GB of heap may be too small at some point. Monitor it to make sure you are not putting too much pressure on it.
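One way to monitor heap pressure is the `_nodes/stats` API, which reports `jvm.mem.heap_used_percent` per node. A minimal sketch, assuming the stats response has already been fetched and parsed into a dict:

```python
def heap_used_percent(nodes_stats):
    """Extract jvm.mem.heap_used_percent for each node from a
    _nodes/stats response body (a dict parsed from JSON)."""
    return {
        node.get("name", node_id): node["jvm"]["mem"]["heap_used_percent"]
        for node_id, node in nodes_stats["nodes"].items()
    }

# With the Python client, the stats could be fetched like this:
# stats = es.nodes.stats(metric="jvm")
# print(heap_used_percent(stats))
```

If this percentage sits near the high 80s or 90s while indexing, the heap is likely too small for payloads of this size.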

You can also use an external process for the document extraction, such as FSCrawler, or use Tika directly.

My bad, it's the ingest-attachment plugin.

As per point 3 of your comments, I guess the Python client is timing out. I don't see anything in the Elasticsearch logs about "http.max_content_length", or even GCs kicking in, which I would expect if the 4 GB heap were too small.

I guess the Python client is timing out.

It might be, indeed.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.