Ingest Attachment processor pipeline, but without storing base64 data


(Pritish C) #1

Hi there.

We run a simple ES cluster - one shard, one node, v5.6. It contains a single index, with a single mapping containing many different fields/field types. On every object index operation, we send some attachment data in base64 format, and our Ingest pipeline processes it before its contents are analyzed and stored in a separate field. We recently ran into heavy disk space usage issues, and decided to investigate.

Based on what we found, is it possible to set up an attachment processor pipeline, defined like below:

ES_RESUME_PIPELINE_MAPPING = {
    "description": "Extract resume information",
    "processors": [{
        "attachment": {
            "field": "resume_data",
            "target_field": "resume",
            "indexed_chars": -1,
            "ignore_missing": True
        }
    }]
}

but without keeping resume_data in the index? The reason is that resume_data takes up quite a bit of disk space and is used only once, by the pipeline while an object is being indexed. In one example, a single object took up 123 KB in the index with resume_data and around 16 KB without it. That's a huge difference.

We tried setting index=False and store=False on the field (store=False is already the default, so unsurprisingly that changed nothing), but those options don't do what we're looking for.

We'd really like to know if something like this is possible. If it isn't, our only other option is to somehow empty the resume_data field after it has been used, without the emptying request triggering the pipeline again.

Thanks!


(Pritish C) #2

Thanks to Xyalakant on the freenode channel, I learned that I could just remove the field at the end of the pipeline. Our new processor pipeline becomes:

ES_RESUME_PIPELINE_MAPPING = {
    "description": "Extract resume information",
    "processors": [{
        "attachment": {
            "field": "resume_data",
            "target_field": "resume",
            "indexed_chars": -1,
            "ignore_missing": True
        },
        "remove": {
            "field": "resume_data"
        }
    }]
}

This seems to solve the issue. The field goes away after the index operation.
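
For anyone adapting this, here is a minimal sketch of the same pipeline with each processor written as its own entry in the processors list. That is the form the ingest documentation uses, and it makes the attachment-then-remove order explicit instead of relying on key order inside a single object. The function name build_resume_pipeline and the pipeline id resume_pipeline are hypothetical; the script only builds and prints the JSON body, and sending it to the put-pipeline endpoint is left to whichever client you use.

```python
import json


def build_resume_pipeline():
    """Return the ingest pipeline body as a plain dict.

    Each processor is a separate single-key dict, so the list order
    is the execution order: attachment first, then remove.
    """
    return {
        "description": "Extract resume information",
        "processors": [
            {
                "attachment": {
                    "field": "resume_data",
                    "target_field": "resume",
                    "indexed_chars": -1,
                    "ignore_missing": True,
                }
            },
            {
                # Drop the base64 source once the text has been extracted.
                "remove": {"field": "resume_data"}
            },
        ],
    }


if __name__ == "__main__":
    # Hypothetical pipeline id; the body would be sent to
    # PUT _ingest/pipeline/resume_pipeline
    print(json.dumps(build_resume_pipeline(), indent=2))
```

One thing worth checking on your own cluster: the attachment processor above has ignore_missing set, but as far as I know the remove processor in 5.x has no equivalent option, so a document indexed without resume_data may fail at the remove step.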

