Large file chunking with Ingest-Attachment

We recently upgraded Elasticsearch from version 2 to version 7.9. We index attachments such as PDF, Word, and Excel files. We previously used the "mapper-attachment" plugin for this, but mapper-attachment is not supported in Elasticsearch versions later than 2. Its replacement is the "ingest-attachment" plugin, so we have started using that.
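For context, this is roughly how we use ingest-attachment: define a pipeline with the attachment processor pointed at our field, then index documents with the base64-encoded file in that field. A minimal sketch of the two request bodies, shown as Python dicts (the pipeline name `attachments` and index name are placeholders; only the field name `documentContent` is from our actual mapping):

```python
import base64
import json

# Pipeline body for: PUT _ingest/pipeline/attachments
# ("attachments" is a placeholder pipeline name)
pipeline = {
    "description": "Extract attachment content with Tika",
    "processors": [
        {"attachment": {"field": "documentContent"}}
    ],
}
print(json.dumps(pipeline, indent=2))

# Document body for: PUT my-index/_doc/1?pipeline=attachments
# The entire file is base64-encoded into a single field.
pdf_bytes = b"%PDF-1.7 ... (file contents) ..."  # placeholder, not a real PDF
doc = {"documentContent": base64.b64encode(pdf_bytes).decode("ascii")}
print(json.dumps(doc)[:80])
```

The chunking problem below arises because this field must contain the complete file in one request.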

When processing large PDF files (e.g. greater than 5 MB), we need to send the base64-encoded file data for indexing in chunks rather than the entire file in one request. We were able to do this with the mapper-attachment plugin, but not with the ingest-attachment plugin, which expects the entire PDF file rather than a chunk. If we send a chunk, we get the root-cause error "Error parsing document in field [documentContent]" inside a Tika exception "TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@48eae629". We assume Tika is finding the PDF data invalid or incomplete.
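Our understanding of the failure, sketched below with a stand-in byte string rather than a real PDF: even when a base64 chunk is split on a 4-character boundary so it decodes cleanly, the decoded bytes are a mid-file fragment without the PDF header or trailer, so Tika's PDFParser cannot parse it:

```python
import base64

pdf_bytes = b"%PDF-1.7\n...body...\n%%EOF"  # stand-in for a real PDF file
encoded = base64.b64encode(pdf_bytes).decode("ascii")

# Split into chunks at 4-char boundaries so each chunk is valid base64 on its own.
chunk_size = 8
chunks = [encoded[i:i + chunk_size] for i in range(0, len(encoded), chunk_size)]

# The full payload decodes back to the complete file, header included:
assert base64.b64decode(encoded).startswith(b"%PDF")

# A later chunk decodes without error, but yields a mid-file fragment
# with no PDF header, which is (we assume) what trips up Tika:
fragment = base64.b64decode(chunks[1])
print(fragment.startswith(b"%PDF"))  # a fragment lacks the header
```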

Our question:
Is there a way to get ingest-attachment to index a large PDF attachment in chunks or as a stream? Alternatively, is there another plugin or method that achieves this?

Why do you need to send it in chunks?


You can look at the FSCrawler project. It is not a plugin, but a community project.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.