I created a pipeline to ingest office/pdf files using the Ingest Attachment pipeline without defining a value for "indexed_chars" (so I guess that the default value of 100k chars is used).
PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information. Used to parse pdf and office files",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}
Some of my users want to be able to use a per-document value, as described here for the Mapper plugin.
Is it possible to do the same with the Ingest Attachment plugin?
Thank you David,
I read this doc already, but it doesn't answer the question about the "per document" aspect.
If "indexed_chars" can only be set at the pipeline definition level, I would have to set it to "-1" to ensure that all my users can index anything, but then there is a risk of crashing the ingest node if somebody sends an extremely big file.
That's why I wanted to know if "this" was doable with Ingest Attachment plugin or not.
That's indeed not doable AFAIK. Maybe it is something we could support as an option, for example by reading this limit value from the document itself through a new setting like field_indexed_chars.
Then we could do something like:
PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information. Used to parse pdf and office files",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "field_indexed_chars" : "size"
      }
    }
  ]
}
Then index either:
PUT index/doc/1?pipeline=attachment
{
  "data": "BASE64"
}
This will use the default value (or the one defined by indexed_chars).
Or
PUT index/doc/2?pipeline=attachment
{
  "data": "BASE64",
  "size": 1000
}
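To make the proposed fallback order concrete, here is a minimal Python sketch of how the limit could be resolved for each document. It is only an illustration of the idea above, not the plugin's actual code: field_indexed_chars is the hypothetical setting suggested in this thread, resolve_indexed_chars is a made-up helper name, and 100000 is the default mentioned in the question.

```python
from typing import Any, Mapping

# Default limit mentioned in the question (100k chars).
DEFAULT_INDEXED_CHARS = 100_000


def resolve_indexed_chars(doc: Mapping[str, Any], processor: Mapping[str, Any]) -> int:
    """Return the character limit that would apply to one document.

    Order of precedence (as proposed in this thread):
      1. the value stored in the document field named by "field_indexed_chars"
         (hypothetical setting, not an existing option as far as this thread knows)
      2. the pipeline-level "indexed_chars"
      3. the default of 100000
    """
    limit_field = processor.get("field_indexed_chars")
    if limit_field is not None and doc.get(limit_field) is not None:
        return int(doc[limit_field])
    return int(processor.get("indexed_chars", DEFAULT_INDEXED_CHARS))


# Processor definition mirroring the proposed pipeline.
processor = {"field": "data", "field_indexed_chars": "size"}

# Document 1 has no "size" field, so the default applies.
print(resolve_indexed_chars({"data": "BASE64"}, processor))
# Document 2 carries its own limit.
print(resolve_indexed_chars({"data": "BASE64", "size": 1000}, processor))
```

With this order, a pipeline can keep a safe limit for most documents while letting a specific document ask for more or less, instead of setting "-1" for everybody.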