Storing binary files in Elastic

David,

Unfortunately, I am on ES 2.1.1 and have to use the mapper-attachments plugin. Suppose I have a mapping like this:

PUT /test/person/_mapping
{
  "person": {
    "properties": {
      "file": {
        "type": "attachment"
      }
    }
  }
}

Add a single document:
(Plain Text file - Content is: "God Save the Queen" (alternatively "God Save the King")

PUT /test/person/1?refresh=true
{
  "file": "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="
}
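
In case it helps anyone reproduce this, here is a minimal sketch of how a value like the one above can be produced and checked. Python is only my choice for illustration, and the variable names are made up for this example. Note that the encoded value decodes to text without a closing parenthesis.

```python
import base64

# Plain-text content of the test file. There is no closing parenthesis,
# which matches what the encoded value in the request decodes to.
text = '"God Save the Queen" (alternatively "God Save the King"'

# Encode the raw bytes to base64, as expected by the attachment field.
encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")
print(encoded)

# Decode again to confirm the round trip gives back the original text.
decoded = base64.b64decode(encoded).decode("utf-8")
print(decoded)
```

The first printed line should be the same string used for "file" in the request above.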

And then run this query:

GET /test/person/_search
{
  "query": {
    "match": {
      "_all": "save"
    }
  },
  "fields": [ "_source", "file.content_type", "file.content" ]
}

This is my result:

  "hits": {
    "total": 1,
    "max_score": 0.10848885,
    "hits": [
      {
        "_index": "test",
        "_type": "person",
        "_id": "1",
        "_score": 0.10848885,
        "_source": {
          "file": "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="
        },
        "fields": {
          "file.content_type": [
            "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="
          ],
          "file.content": [
            "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="
          ]
        }
      }
    ]
  }

Three questions/comments:

  1. I see the base64-encoded value is stored in both the _source and the content fields. Is there a way in 2.1.1 to prevent it from being stored twice? Above you linked to a pipeline remove processor, but that is only available in ES 5.
  2. Why does content_type return the base64-encoded value as well? I noticed that if I add content_type to my mapping and set "store": "yes", then the correct content type is auto-discovered. The same applies to other fields such as content_length.
  3. In my mapping for the content field, if I set "store": "yes", then that field only has the content of the document (in English). Is this likely what I want to do? Doing so at least makes the content field much shorter (especially for larger, more complex Word docs).
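
To make questions 2 and 3 concrete, this is roughly the mapping variant I mean, written as a small Python sketch that only builds and prints the request body (it does not talk to the cluster). The "fields" layout is my understanding of how the mapper-attachments plugin expects sub-fields to be declared. I have only tried it loosely on 2.1.1, and I am assuming a fresh index, since changing "store" on an existing field may be rejected.

```python
import json

# Sub-fields of the attachment that should be stored, so they can be
# returned via "fields" in a search request (assumed layout, see above).
stored_subfields = ["content", "content_type", "content_length"]

mapping = {
    "person": {
        "properties": {
            "file": {
                "type": "attachment",
                "fields": {name: {"store": "yes"} for name in stored_subfields},
            }
        }
    }
}

# Print the request in console form so it can be pasted into Sense.
body = json.dumps(mapping, indent=2)
print("PUT /test/person/_mapping")
print(body)
```

With a mapping along these lines, file.content_type and file.content come back with the detected type and the extracted text rather than the base64 value, which is the behaviour I describe in points 2 and 3.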

Thanks again!
Steve