David,
Unfortunately, I am on ES 2.1.1 and have to use the mapper-attachments plugin. Suppose I have a mapping like this:
PUT /test/person/_mapping
{
  "person": {
    "properties": {
      "file": {
        "type": "attachment"
      }
    }
  }
}
Add a single document:
(Plain Text file - Content is: "God Save the Queen" (alternatively "God Save the King")
PUT /test/person/1?refresh=true
{
  "file": "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="
}
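As a sanity check on the value above, here is a minimal Python sketch that decodes the base64 string and re-encodes it. It only uses the standard library; the variable names are mine and not part of the plugin.

```python
import base64

# The exact value sent in the "file" field of the document above.
encoded = "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="

# Decode to see the plain text that the attachment mapper receives.
decoded = base64.b64decode(encoded).decode("utf-8")
print(decoded)

# Re-encoding the text gives back the same string, so the round trip is lossless.
reencoded = base64.b64encode(decoded.encode("utf-8")).decode("ascii")
print(reencoded == encoded)
```

The decoded text is the file content quoted above, without a closing parenthesis at the end.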
And then run this query:
GET /test/person/_search
{
  "query": {
    "match": {
      "_all": "save"
    }
  },
  "fields": ["_source", "file.content_type", "file.content"]
}
This is my result:
"hits": {
  "total": 1,
  "max_score": 0.10848885,
  "hits": [
    {
      "_index": "test",
      "_type": "person",
      "_id": "1",
      "_score": 0.10848885,
      "_source": {
        "file": "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="
      },
      "fields": {
        "file.content_type": [
          "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="
        ],
        "file.content": [
          "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="
        ]
      }
    }
  ]
}
Three questions/comments:
- I see the base64-encoded value is stored in both _source and the content fields. Is there a way in 2.1.1 to prevent it from being stored twice? Above you linked to a pipeline remove processor, but that is ES 5 only.
- How come content_type returns the base64-encoded value as well? I noticed that if I add content_type to my mapping and say "store: yes", then the correct content type is auto-detected. The same applies to other fields such as content_length.
- In my mapping for the content field, if I say "store: yes", then that field contains only the text of the document (in English). Is this likely what I want to do? Doing so at least makes the content field much shorter (especially for larger, more complex Word docs).
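
To make the last two points concrete, here is a minimal Python sketch that builds the mapping variant I am describing and prints it as JSON for PUT /test/person/_mapping. The helper name build_attachment_mapping is hypothetical, and placing the sub-field options under "fields" is my assumption based on the mapper-attachments documentation, so treat the layout as unverified for any particular plugin version.

```python
import json


def build_attachment_mapping(stored_subfields):
    """Return a mapping body for type "person" with stored attachment sub-fields."""
    return {
        "person": {
            "properties": {
                "file": {
                    "type": "attachment",
                    # Assumption: per-sub-field options live under "fields";
                    # "store": "yes" mirrors the setting mentioned in the post.
                    "fields": {name: {"store": "yes"} for name in stored_subfields},
                }
            }
        }
    }


mapping = build_attachment_mapping(["content", "content_type", "content_length"])
print(json.dumps(mapping, indent=2))
```

With a mapping like this, the same search request asking for file.content_type and file.content is the one that gave me the detected content type and the text instead of the base64 value.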
Thanks again!
Steve