Consequences of excluding fields from _source

The documentation describes excluding fields from _source as an 'expert-only feature' and makes it rather clear that it should be avoided. My case is the following: I use the ingest attachment plugin to index and analyze the contents of PDF files. Here is my pipeline definition:

{
  "description": "Extract attachment information",
  "processors":[
    {
      "attachment":{
        "field": "data",
        "target_field": "resume",
        "indexed_chars":-1
      }
    }
  ]
}

However, the above also stores the file's byte stream under the field attachment, together with the actual contents, which I don't want. To save some storage capacity, I decided not to store it at all. Here is my mapping:

{  
   "_source":{  
      "excludes":[  
         "attachment"
      ]
   },
   "properties":{  
      "attachment":{  
         "type":"text",
         "fields":{  
            "keyword":{  
               "type":"keyword",
               "ignore_above":256
            }
         }
      },
      "resume":{  
         "properties":{  
            "content":{  
               "type":"text",
               "fields":{  
                  "keyword":{  
                     "type":"keyword",
                     "ignore_above":256
                  }
               },
               "analyzer":"some_language"
            }
         }
      }
   }
}

This solves my problem (the byte stream is no longer returned with my searches); however, I'm concerned about the following warning from the ES docs, on which I'm hoping for some elaboration:

'Removing fields from the _source has similar downsides to disabling _source, especially the fact that you cannot reindex documents from one Elasticsearch index to another'

How will excluding the attachment field from my _source affect my ability to reindex if needed, and why? Is there another way to do it? Also, will the result of the analysis of the resume.content field be transferred to the new index?

I hope I'm being specific enough. Thank you in advance!

It's better IMO to remove the field with an ingest remove processor, so the Base64-encoded binary is no longer part of the source.
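
A minimal sketch of that approach, extending the pipeline from the question. It assumes the Base64 payload sits in the `data` field that the attachment processor reads from; if it actually arrives under a different name (such as `attachment`), substitute that name in the remove processor:

```json
{
  "description": "Extract attachment information, then drop the raw Base64 input (sketch; assumes the payload is in 'data')",
  "processors": [
    {
      "attachment": {
        "field": "data",
        "target_field": "resume",
        "indexed_chars": -1
      }
    },
    {
      "remove": {
        "field": "data"
      }
    }
  ]
}
```

Processors run in order, so the text is extracted into resume before the raw field is dropped. With this in place, the `_source` excludes entry in the mapping should no longer be needed.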

Dear David,

Thank you for your reply. I will look into it, but just so I understand the full idea, why do you think that this would be better? Does it eliminate the problem of re-indexing?

Yes, as it does not alter the _source field.
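
For illustration, a reindex request body might then look like the sketch below. The index names `resumes_v1` and `resumes_v2` are hypothetical and do not come from this thread:

```json
{
  "source": {
    "index": "resumes_v1"
  },
  "dest": {
    "index": "resumes_v2"
  }
}
```

Reindex reads each document's _source and indexes it again in the destination, so resume.content is re-analyzed according to the destination index's mapping rather than copied over in analyzed form. Since the extracted text is already in _source, the destination should not need to run the attachment pipeline again.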
