How to control the "_indexed_chars" value on an Ingest Attachment pipeline?

Cornoualis · March 8, 2018, 1:09pm

Hi,

I created a pipeline to ingest office/pdf files using the Ingest Attachment plugin, without defining a value for "indexed_chars" (so I assume the default of 100k chars is used).

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information. Used to parse pdf and office files",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}

Some of my users want to be able to use a per document value, as described here for the Mapper plugin.

Is it possible to do the same with the Ingest Attachment plugin?

dadoonet · March 8, 2018, 1:24pm

See https://www.elastic.co/guide/en/elasticsearch/plugins/6.2/using-ingest-attachment.html

Cornoualis · March 8, 2018, 2:30pm

Thank you David,
I read this doc already, but it doesn't answer the question about the "per document" aspect.

If "indexed_chars" can only be set at the pipeline definition level, I would have to set it to "-1" to ensure that all my users can index anything... but that risks crashing the ingest node if somebody sends an extremely large file.

That's why I wanted to know if "this" was doable with Ingest Attachment plugin or not.

Thanks in advance!

dadoonet · March 8, 2018, 2:48pm

I see.

That's indeed not doable AFAIK. Maybe it's something we could support as an option: reading this limit from the document itself, by adding a setting like field_indexed_chars.

Then we could do something like:

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information. Used to parse pdf and office files",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "field_indexed_chars" : "size"
      }
    }
  ]
}

Then index either:

PUT index/doc/1?pipeline=attachment
{
  "data": "BASE64"
}

This would use the default value (or the one defined by indexed_chars).

Or

PUT index/doc/2?pipeline=attachment
{
  "data": "BASE64",
  "size": 1000
}
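
To make the lookup order in this proposal concrete, here is a small Python sketch of how the limit could be resolved for each document. It only simulates the idea outside Elasticsearch: field_indexed_chars is the setting name proposed above, not an existing option, and the function name resolve_indexed_chars is made up for illustration.

```python
DEFAULT_INDEXED_CHARS = 100_000  # default limit mentioned at the top of this thread


def resolve_indexed_chars(processor_config, document):
    """Return the character limit that would apply to one document.

    Hypothetical simulation of the proposal: a per-document value wins,
    then the pipeline-level indexed_chars, then the default.
    """
    limit_field = processor_config.get("field_indexed_chars")  # proposed setting
    if limit_field is not None and document.get(limit_field) is not None:
        return int(document[limit_field])
    return int(processor_config.get("indexed_chars", DEFAULT_INDEXED_CHARS))


config = {"field": "data", "field_indexed_chars": "size"}
print(resolve_indexed_chars(config, {"data": "BASE64"}))                # 100000
print(resolve_indexed_chars(config, {"data": "BASE64", "size": 1000}))  # 1000
```

With this ordering, a pipeline-level indexed_chars still acts as the fallback for documents that carry no "size" field.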

Would you like to open a feature request for it?

Cornoualis · March 8, 2018, 3:15pm

If it's possible, yes, I'd like a feature request!

Many thanks!

dadoonet · March 8, 2018, 4:07pm

I opened

Let's see how it goes.

Cornoualis · March 8, 2018, 4:08pm

Thank you David!

dadoonet · March 14, 2018, 6:33pm

FYI I merged this today:

Should be available in 6.3.0 and later.
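
For anyone landing here later, a rough Python sketch of what using the merged option might look like. As far as I can tell the setting shipped as indexed_chars_field rather than the field_indexed_chars name floated earlier in this thread, so please check the plugin docs for your version. The helper names and the max_size field name are illustrative assumptions, and the snippet only builds the JSON bodies; it does not send anything to a cluster.

```python
import base64
import json


def attachment_pipeline(limit_field, default_limit=100000):
    """Build a body for PUT _ingest/pipeline/<name> (helper name is hypothetical)."""
    return {
        "description": "Extract attachment information. Used to parse pdf and office files",
        "processors": [
            {
                "attachment": {
                    "field": "data",
                    # Fallback limit for documents without a per-document value.
                    "indexed_chars": default_limit,
                    # Assumed option name: the document field holding the per-document limit.
                    "indexed_chars_field": limit_field,
                }
            }
        ],
    }


def attachment_document(raw_bytes, limit=None, limit_field="max_size"):
    """Build a document body; the limit field is only added when a limit is given."""
    doc = {"data": base64.b64encode(raw_bytes).decode("ascii")}
    if limit is not None:
        doc[limit_field] = limit
    return doc


print(json.dumps(attachment_pipeline("max_size"), indent=2))
print(json.dumps(attachment_document(b"hello world", limit=5)))
```

Documents built without a limit fall back to the pipeline-level indexed_chars, which avoids having to set it to -1 for everyone.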

Cornoualis · March 15, 2018, 11:40am

Great! Thanks a lot David!