Ingest Attachment processor pipeline, but without storing base64 data

pritishc · February 8, 2018, 11:19am

Hi there.

We run a simple ES cluster - one shard, one node, v5.6. It contains a single index, with a single mapping containing many different fields/field types. On every object index operation, we send some attachment data in base64 format, and our Ingest pipeline processes it before its contents are analyzed and stored in a separate field. We recently ran into heavy disk space usage issues, and decided to investigate.

Based on what we found, is it possible to set up an attachment processor pipeline, defined as below:

ES_RESUME_PIPELINE_MAPPING = {
    "description": "Extract resume information",
    "processors": [{
        "attachment": {
            "field": "resume_data",
            "target_field": "resume",
            "indexed_chars": -1,
            "ignore_missing": True
        }
    }]
}

but without keeping resume_data in the index? The reason is that resume_data takes up quite a bit of disk space, and it's used only once, by the pipeline while an object is being indexed. In one example, a single object took 123 KB on the index with resume_data and around 16 KB without. That's a huge difference.
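To put that difference in numbers, here is a quick back-of-the-envelope calculation. It uses only the two figures from the single example above (the 16 KB one is approximate), so treat the result as a rough indication rather than a measurement across the whole index.

```python
# Sizes from the one example document above (the second is approximate).
with_raw_kb = 123      # document size including resume_data
without_raw_kb = 16    # document size with resume_data dropped

saved_kb = with_raw_kb - without_raw_kb
saved_pct = 100 * saved_kb / with_raw_kb

print(f"saved about {saved_kb} KB per document ({saved_pct:.0f}% of its size)")
```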

We tried setting index=False and store=False on the field (store is already false by default, so that part obviously changes nothing), but those options don't do what we're looking for.

We'd really like to know if something like this is possible. If it isn't, our only other option is to empty the resume_data field somehow after it has been used, but without triggering the pipeline on the emptying request.

Thanks!

pritishc · February 8, 2018, 12:01pm

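Thanks to Xyalakant on the freenode channel, I learned that I could just remove the field at the end of the pipeline with a remove processor. So our new processor pipeline becomes:

ES_RESUME_PIPELINE_MAPPING = {
    "description": "Extract resume information",
    "processors": [
        # Extract text and metadata from the base64 payload into "resume".
        {
            "attachment": {
                "field": "resume_data",
                "target_field": "resume",
                "indexed_chars": -1,
                "ignore_missing": True
            }
        },
        # Then drop the raw base64 field so it is not kept in the document.
        # Each processor is its own entry in the list, so the order is explicit.
        {
            "remove": {
                "field": "resume_data"
            }
        }
    ]
}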

This seems to solve the issue. The field goes away after the index operation.
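For anyone who wants to check this before reindexing, here is a minimal sketch that builds a request body for the ingest simulate endpoint (POST /_ingest/pipeline/_simulate) and sends it with the standard library. The host URL and the helper names (build_resume_pipeline, build_simulate_body, simulate) are assumptions made for illustration, not something from the thread.

```python
import base64
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: a local, unsecured node


def build_resume_pipeline():
    """Pipeline with one processor per list entry: extract, then remove."""
    return {
        "description": "Extract resume information",
        "processors": [
            {
                "attachment": {
                    "field": "resume_data",
                    "target_field": "resume",
                    "indexed_chars": -1,
                    "ignore_missing": True,
                }
            },
            {"remove": {"field": "resume_data"}},
        ],
    }


def build_simulate_body(raw_bytes):
    """Wrap the pipeline and one test document for the _simulate endpoint."""
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return {
        "pipeline": build_resume_pipeline(),
        "docs": [{"_source": {"resume_data": encoded}}],
    }


def simulate(raw_bytes, es_url=ES_URL):
    """Send the simulate request and return the parsed JSON response."""
    request = urllib.request.Request(
        es_url + "/_ingest/pipeline/_simulate",
        data=json.dumps(build_simulate_body(raw_bytes)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling simulate() with the bytes of a sample resume against a node that has the ingest-attachment plugin installed should return each document as it would look after the pipeline has run, so you can confirm that resume is populated and resume_data is gone from _source. It may also be worth checking how the remove step behaves on your version for documents that have no resume_data at all, since ignore_missing here is set only on the attachment processor.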

