Limit the max document size after ingest pipeline

Is there a way to limit the size of a document after an ingest pipeline has processed it?

For example, I want to limit the size of indexed documents to 1 MB. A 4 MB PDF might be uploaded; the attachment processor extracts the text and the original PDF data is removed. At that point, the document to be indexed should be under the 1 MB maximum.

max_content_length seems to limit the upload of the document prior to the pipeline.

Welcome to our community! :smiley:

You might be able to do this with a script, but I haven't seen it done, nor do I know how to do it myself. Is there a reason you want to do this?

Thanks for the welcome :slight_smile:

The system we are building is to ingest lots of documents that might range in size. Normally, the size of the document is quite small after the pipeline has finished. Occasionally, this is not the case so we then have a large document in Elastic that causes some issues (slow, unresponsive, etc) for our services when they come across it.

The tricky part is that we don't know what the size of the document is until it has been processed.

I guess we can add a script to check the size of the attachment processor result. I was just wondering what other options might exist.

There is no setting at the index level for maximum document size?

Have a look at the indexed_chars option.

It limits the number of extracted chars when running the attachment plugin.



Thanks David, I think this might limit the size of the incoming payload, rather than the output of an ingest pipeline?

I have this processor stage, which drops the document if the output is over 512kb. I guess I was after a policy that could be applied on an index or multiple indices, rather than at the pipeline level.

{ "drop": { "if": "!ctx.containsKey('attachment') || !ctx['attachment'].containsKey('content') || ctx['attachment']['content_length'] > 512000" } }
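
For anyone reading later, here is a rough sketch of how that drop stage could sit in a complete pipeline and be attached to an index so it runs whichever pipeline the request names. It is built on assumptions that are not confirmed in this thread: the pipeline name `limit-doc-size`, the index name `my-index`, and the source field `data` are hypothetical, and using the `index.final_pipeline` setting is only one possible way to approximate an index-level policy. Note also that `content_length` appears to count extracted characters rather than bytes, so 512000 is only an approximation of 512 KB. Python is used here just to build and print the request bodies.

```python
import json

MAX_CONTENT_CHARS = 512000  # threshold taken from the drop stage above

# Painless condition copied from the drop stage above.
DROP_CONDITION = (
    "!ctx.containsKey('attachment') || "
    "!ctx['attachment'].containsKey('content') || "
    f"ctx['attachment']['content_length'] > {MAX_CONTENT_CHARS}"
)

# Body for: PUT _ingest/pipeline/limit-doc-size  (pipeline name is hypothetical)
pipeline_body = {
    "processors": [
        {"attachment": {"field": "data"}},  # "data" is an assumed field name
        {"remove": {"field": "data"}},      # discard the original binary payload
        {"drop": {"if": DROP_CONDITION}},   # skip documents that are still too large
    ]
}

# Body for: PUT my-index/_settings  (index name is hypothetical)
# Assumption: index.final_pipeline is used so the check always runs last.
settings_body = {"index": {"final_pipeline": "limit-doc-size"}}

print(json.dumps(pipeline_body, indent=2))
print(json.dumps(settings_body, indent=2))
```

The same settings body could presumably be placed in an index template to cover multiple indices, but that has not been tried here.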

Nope. That limits the number of extracted characters whatever the size of the input.
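
As a rough illustration of that option, here is a minimal sketch of an attachment processor with `indexed_chars` set. The source field name `data` and the cap of 100000 characters are assumptions chosen for the example, not values from this thread. Python is used only to build and print the pipeline body.

```python
import json

# Assumed cap on extracted characters; adjust to the size you can tolerate.
INDEXED_CHARS = 100000

# Body for an ingest pipeline definition (PUT _ingest/pipeline/<name>).
pipeline_body = {
    "description": "Extract attachment text, capped at a fixed number of characters",
    "processors": [
        {
            "attachment": {
                "field": "data",                 # hypothetical source field
                "indexed_chars": INDEXED_CHARS,  # limit on extracted characters
            }
        },
        {"remove": {"field": "data"}},           # drop the original binary payload
    ],
}

print(json.dumps(pipeline_body, indent=2))
```

Because the cap applies to the extracted text, the stored content should stay bounded regardless of how large the uploaded file was.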
