Large file chunking with Ingest-Attachment

We recently upgraded Elasticsearch from version 2 to version 7.9. We index attachments such as PDF, Word, and Excel files. We previously used the "mapper-attachment" plugin for this, but mapper-attachment is not supported in Elasticsearch versions later than 2. Its replacement is the "ingest-attachment" plugin, so we have started using that.
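For context, this is roughly how we use ingest-attachment: define a pipeline with the attachment processor pointed at our field, then index documents with the base64-encoded file in that field. A minimal sketch of the two request bodies, shown as Python dicts (the pipeline name `attachments` and index name are placeholders; only the field name `documentContent` is from our actual mapping):

```python
import base64
import json

# Pipeline body for: PUT _ingest/pipeline/attachments
# ("attachments" is a placeholder pipeline name)
pipeline = {
    "description": "Extract attachment content with Tika",
    "processors": [
        {"attachment": {"field": "documentContent"}}
    ],
}
print(json.dumps(pipeline, indent=2))

# Document body for: PUT my-index/_doc/1?pipeline=attachments
# The entire file is base64-encoded into a single field.
pdf_bytes = b"%PDF-1.7 ... (file contents) ..."  # placeholder, not a real PDF
doc = {"documentContent": base64.b64encode(pdf_bytes).decode("ascii")}
print(json.dumps(doc)[:80])
```

The chunking problem below arises because this field must contain the complete file in one request.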

When processing large PDF files (e.g. greater than 5 MB), we need to send the base64-encoded file data for indexing in chunks rather than the entire file in one request. We were able to do this with the mapper-attachment plugin, but not with the ingest-attachment plugin, which expects the entire PDF file rather than a chunk. If we send a chunk, we get the root-cause error "Error parsing document in field [documentContent]" inside a Tika exception "TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@48eae629". We assume Tika is finding the PDF data invalid or incomplete.
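Our understanding of the failure, sketched below with a stand-in byte string rather than a real PDF: even when a base64 chunk is split on a 4-character boundary so it decodes cleanly, the decoded bytes are a mid-file fragment without the PDF header or trailer, so Tika's PDFParser cannot parse it:

```python
import base64

pdf_bytes = b"%PDF-1.7\n...body...\n%%EOF"  # stand-in for a real PDF file
encoded = base64.b64encode(pdf_bytes).decode("ascii")

# Split into chunks at 4-char boundaries so each chunk is valid base64 on its own.
chunk_size = 8
chunks = [encoded[i:i + chunk_size] for i in range(0, len(encoded), chunk_size)]

# The full payload decodes back to the complete file, header included:
assert base64.b64decode(encoded).startswith(b"%PDF")

# A later chunk decodes without error, but yields a mid-file fragment
# with no PDF header, which is (we assume) what trips up Tika:
fragment = base64.b64decode(chunks[1])
print(fragment.startswith(b"%PDF"))  # a fragment lacks the header
```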

Our question:
Is there a way to get ingest-attachment to index a large PDF attachment in chunks or as a stream? Alternatively, is there another plugin or method that achieves this?

Why do you need to send it in chunks?


You can look at the FSCrawler project. It is not a plugin, but a community project.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.