Is there a way to limit the size of a document after an ingest pipeline has processed it?
For example, I want to limit the size of indexed documents to 1 MB. A 4 MB PDF might be uploaded, the attachment processor extracts the text, and the original PDF data is removed; the document to be indexed should then be under the 1 MB maximum.
max_content_length seems to limit the size of the upload before the pipeline runs, not the result after it.
The system we are building ingests lots of documents of varying size. Normally the document is quite small after the pipeline has finished. Occasionally it is not, and we end up with a large document in Elasticsearch that causes issues (slowness, unresponsiveness, etc.) for our services when they come across it.
The tricky part is that we don't know the size of the document until it has been processed.
I guess we can add a script to check the size of the attachment processor result. I was just wondering what other options might exist.
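For context, the pipeline so far is roughly like this (the pipeline name and the `data` field are placeholders for our actual setup):

```
PUT _ingest/pipeline/pdf-text
{
  "description": "Extract text from the uploaded PDF, then remove the original binary",
  "processors": [
    { "attachment": { "field": "data" } },
    { "remove": { "field": "data" } }
  ]
}
```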
Is there no setting at the index level for a maximum document size?
I have a processor stage (sketched below) that drops the document if the output is over 512 KB. I guess I was really after a policy that could be applied to an index, or to multiple indices, rather than at the pipeline level.
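In case it's useful to anyone else, a minimal sketch of that stage, assuming the attachment processor writes to its default `attachment.content` field and treating the character count as a rough proxy for size:

```
{
  "drop": {
    "description": "Drop the document if the extracted text is over ~512 KB",
    "if": "ctx.attachment?.content != null && ctx.attachment.content.length() > 524288"
  }
}
```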