Hi there.
We run a simple Elasticsearch cluster: one shard, one node, v5.6. It contains a single index with a single mapping that has many different fields and field types. On every index operation, we send some attachment data in base64 format, and our ingest pipeline processes it so that the extracted contents are analyzed and stored in a separate field. We recently ran into heavy disk space usage and decided to investigate.
Based on what we found, is it possible to set up an attachment processor pipeline, defined like the one below,
```python
ES_RESUME_PIPELINE_MAPPING = {
    "description": "Extract resume information",
    "processors": [{
        "attachment": {
            "field": "resume_data",
            "target_field": "resume",
            "indexed_chars": -1,
            "ignore_missing": True
        }
    }]
}
```
but without keeping `resume_data` in the index? The reason for this is that `resume_data` takes up quite a bit of disk space. Additionally, it's used only once, by the pipeline during the indexing of an object. In one example, we found that a single object took 123 KB in the index with `resume_data`, and around 16 KB without. That's a huge difference.
We tried setting `index: false` and `store: false` on the field (the latter is the default anyway, so obviously that alone wouldn't help), but those options don't do what we're looking for.
We'd really like to know if something like this is possible. If it isn't, our only other option is to empty the `resume_data` field somehow after it's been used, without triggering the pipeline on the emptying request.
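For concreteness, the kind of thing we're hoping for would look something like the sketch below: the same pipeline definition with a `remove` processor appended, in the hope that the raw source field is dropped once the attachment processor has consumed it. We haven't verified on 5.6 that this actually prevents `resume_data` from being stored, so treat it as an assumption.

```python
# Sketch only: our pipeline with a hypothetical second step that
# removes the base64 source field after extraction has run.
ES_RESUME_PIPELINE_MAPPING = {
    "description": "Extract resume information, then drop the raw data",
    "processors": [
        {
            "attachment": {
                "field": "resume_data",
                "target_field": "resume",
                "indexed_chars": -1,
                "ignore_missing": True,
            }
        },
        # Assumed behaviour: drop resume_data before the document
        # is stored, so it never takes up space in the index.
        {"remove": {"field": "resume_data"}},
    ],
}
```

If chaining a `remove` processor this way is the supported answer, that would be ideal, since it keeps everything inside the pipeline and needs no follow-up request.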
Thanks!