Limit the max document size after ingest pipeline

Crazedpeanut · March 2, 2021, 2:33am

Is there a way to limit the size of a document after an ingest pipeline has processed it?

For example, I want to limit the size of indexed documents to 1 MB. A 4 MB PDF might be uploaded; the attachment processor extracts the text and the original PDF data is removed. At that point, the document to be indexed should be under the 1 MB maximum.

http.max_content_length seems to limit only the size of the upload, before the pipeline runs.

warkolm · March 2, 2021, 2:54am

Welcome to our community!

You might be able to do this with a script, but I haven't seen it done and don't know how to do it myself. Is there a reason you want to do this?

Crazedpeanut · March 2, 2021, 3:26am

Thanks for the welcome

The system we are building is to ingest lots of documents that might range in size. Normally, the size of the document is quite small after the pipeline has finished. Occasionally, this is not the case so we then have a large document in Elastic that causes some issues (slow, unresponsive, etc) for our services when they come across it.

The tricky part is that we don't know what the size of the document is until it has been processed.

I guess we can add a script to check the size of the attachment processor result. I was just wondering what other options might exist.

Is there no index-level setting for maximum document size?

dadoonet · March 2, 2021, 3:54am

Have a look at the indexed_chars option.

It limits the number of extracted chars when running the attachment plugin.
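
For reference, here is a minimal sketch of where indexed_chars sits in an attachment pipeline. The source field name `data` and the description text are assumptions for illustration. 100000 is the processor's documented default, and -1 disables the limit. This would be the body of a `PUT _ingest/pipeline/<name>` request.

```json
{
  "description": "Sketch: extract at most 100000 characters from the base64 field 'data' (field name assumed)",
  "processors": [
    {
      "attachment": {
        "field": "data",
        "indexed_chars": 100000
      }
    }
  ]
}
```

Because the cap applies to the extracted text, it bounds the size of `attachment.content` regardless of how large the uploaded file was.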

HTH

Crazedpeanut · March 2, 2021, 4:13am

Thanks David, I think this might limit the size of the incoming payload, rather than the output of an ingest pipeline?

Crazedpeanut · March 2, 2021, 4:20am

I have this processor stage, which drops the document if the output is over 512kb. I guess I was after a policy that could be applied on an index or multiple indices, rather than at the pipeline level.

{ "drop": { "if": "!ctx.containsKey('attachment') || !ctx['attachment'].containsKey('content') || ctx['attachment']['content_length'] > 512000" } }
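
Put together, the stages described in this thread could look roughly like the pipeline body below. This is a sketch under stated assumptions, not a tested configuration: the field name `data` is assumed, and the `drop` condition is copied from the post above. Two things to keep in mind. First, `content_length` as reported by the attachment processor counts extracted characters, so 512000 is a character count rather than an exact byte size. Second, with the default `indexed_chars` of 100000 the extracted content is already truncated below that threshold, so the sketch sets `indexed_chars` to -1 (no limit) to make the length check meaningful.

```json
{
  "description": "Sketch: extract text, remove the binary, drop documents with missing or oversized content",
  "processors": [
    {
      "attachment": {
        "field": "data",
        "indexed_chars": -1
      }
    },
    {
      "remove": {
        "field": "data"
      }
    },
    {
      "drop": {
        "if": "!ctx.containsKey('attachment') || !ctx['attachment'].containsKey('content') || ctx['attachment']['content_length'] > 512000"
      }
    }
  ]
}
```

This still lives at the pipeline level. If the goal is to apply it to every write to one or more indices, the `index.final_pipeline` index setting (set directly or through an index template) may be worth checking against the documentation for your version.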

dadoonet · March 2, 2021, 4:32am

Nope. indexed_chars limits the number of extracted characters, whatever the size of the input.

