Indexing a large PDF file (around 90 MB) gives an exception


Elasticsearch version: 5.4.3
Heap allocated: 4GB
Java version: Java 8
I am trying to index a large PDF file of 90 MB using the ingest-attachments plugin. I encode the PDF file into base64 and then index it by calling the .index() function of the Python client. I get this error:

ConnectionError(('Connection aborted.', error(104, 'Connection reset by peer'))) caused by: ProtocolError(('Connection aborted.', error(104, 'Connection reset by peer')))

I can successfully index smaller files, but the process terminates when it reaches this file. Any help is appreciated. Thanks.

Some comments:

If you intend to send very big documents like this, 4 GB of heap may be too small at some point. Monitor it to make sure you are not putting too much pressure on it.
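One way to monitor heap pressure is the `_nodes/stats` API, which reports `jvm.mem.heap_used_percent` per node. A minimal sketch, assuming the stats response has already been fetched and parsed into a dict:

```python
def heap_used_percent(nodes_stats):
    """Extract jvm.mem.heap_used_percent for each node from a
    _nodes/stats response body (a dict parsed from JSON)."""
    return {
        node.get("name", node_id): node["jvm"]["mem"]["heap_used_percent"]
        for node_id, node in nodes_stats["nodes"].items()
    }

# With the Python client, the stats could be fetched like this:
# stats = es.nodes.stats(metric="jvm")
# print(heap_used_percent(stats))
```

If this percentage sits near the high 80s or 90s while indexing, the heap is likely too small for payloads of this size.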

You can also use an external process for the document extraction, such as FSCrawler, or use Tika directly.

My bad, it's the ingest-attachment plugin.

As per point 3 of your comments, I guess the Python client is timing out. I don't see anything in the Elasticsearch logs about "http.max_content_length", or even GCs kicking in, which I would expect if the 4 GB heap were too small.

I guess the Python client is timing out.

It might be, indeed.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.