I use Elasticsearch 2.1.1 and have to handle large files which usually contain Base64-encoded images. I want to completely skip this part of the documents when indexing them into Elasticsearch, i.e. NEITHER index them NOR add them to _source.
The part of my JSONs that I want to omit looks like this:

"screenshotdata": {
    "interesting": {
        "data": "iVBORwfdrdfd........ (very long string)"
    },
    ... more elements
}
I have been battling with
"_source":{"excludes":["screenshotdata"]}
as well as with
"dynamic_templates":{"skipscreenshots":{"path_match":"screenshotdata.*", "mapping":{"store":"no","index":"no"}}}
but cannot get either of them to work.
This is really an issue for me because I definitely do not want the Base64 strings to uselessly inflate my DB.
Here is some more info on what I have tried in the meantime (sorry that my examples use the Python API and not the classic curl way). This attempt simply excludes a single field from the _source:
from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': '127.0.0.1', 'port': 9200}])
body = {
    "settings": {
        "index.mapping.ignore_malformed": "true"
    },
    "mappings": {
        "reports": {
            "_all": {"enabled": "false"},
            "_source": {"excludes": ["data"]}
            # ... (other mappings)
        }
    }
}
es.indices.create(index='test', body=body)
Besides that, I think I have posted all necessary code.
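One way I plan to check whether the exclusion actually takes effect (a sketch with placeholder content, reusing the es client and the test index from above):

doc = {
    "name": "example",
    "screenshotdata": {
        "interesting": {
            "data": "iVBORw..."  # stand-in for the very long Base64 string
        }
    }
}
es.index(index='test', doc_type='reports', id=1, body=doc)
stored = es.get(index='test', doc_type='reports', id=1)
# If the excludes setting worked, "data" should be gone from what comes back:
print(stored['_source'])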
Another note: I also have some strings enclosed in an array, i.e. several Base64-encoded screenshots, so an element that looks like this:
"shots": ["long string", "other long string"]
What I want to achieve is to exclude the complete branch that contains these elements, instead of performing matches on single elements (so it wouldn't matter whether these elements are strings, objects, or arrays of strings).
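The closest thing I can see in the ES 2.x documentation for cutting off a whole branch is combining "enabled": false on the object (so the subtree is never parsed or indexed) with a _source exclude on the same path (so it is not stored either). A sketch of what I plan to try next, not yet verified on my side:

body = {
    "mappings": {
        "reports": {
            "_all": {"enabled": False},
            "properties": {
                # "enabled": false should make ES skip parsing/indexing
                # everything below this object:
                "screenshotdata": {
                    "type": "object",
                    "enabled": False
                }
            },
            # ... and this should keep the branch out of the stored _source.
            # As far as I understand, the pattern is matched against the full
            # dotted path, which might be why a bare "data" matched nothing:
            "_source": {"excludes": ["screenshotdata"]}
        }
    }
}
es.indices.create(index='test2', body=body)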
@jprante, I see that this extra step is an option, but I really want to use the most efficient solution, which is letting Elasticsearch skip parts of the JSON. I think this should be possible according to the documentation.
You mentioned large binary/Base64 fields which you want to remove, so the most efficient way is to not transport them over the wire at all, instead of letting ES consume memory receiving them only to trash the fields anyway.
From what I can read in the documentation, dynamic templates work on field-name pattern matching but cannot deal with whole "subtrees" of fields.
OK, so I misunderstood what "path_match" does - it only specifies the path to one specific field.
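If that is the case, then a leaf-level template that matches on the bare field name should still be possible; a sketch based on my reading of the ES 2.x docs (untested - and note the docs show dynamic_templates as an array, which is not how I wrote it in my first attempt above):

body = {
    "mappings": {
        "reports": {
            "dynamic_templates": [
                {
                    "skip_base64_leaves": {
                        "match": "data",  # bare field name, at any depth
                        "match_mapping_type": "string",
                        "mapping": {"type": "string", "index": "no"}
                    }
                }
            ]
        }
    }
}
es.indices.create(index='test_dyn', body=body)

This would only stop the field from being indexed, though; the Base64 string would still end up in _source.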
But still, what I tried above in my example, i.e. excluding "data" from _source, did not work. Any idea why this is the case?
As for sending "trash data" over the wire: this may be an issue when data is sent over the internet, but if all my data flows e.g. within a LAN with 1 Gigabit bandwidth, then it is not really an issue. On the other hand, if I have to pre-parse every JSON before sending it to Elasticsearch, this also costs CPU, RAM, and additional read+write access to the disk on the client, so I would rather send it over the wire even if it will be trashed. (Think of thousands of documents being processed every day.)
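That said, for completeness, the client-side stripping jprante suggests would look roughly like this in Python (a sketch; the function and field names are just my placeholders):

def strip_screenshots(doc):
    # Remove the Base64-heavy branches before the document goes over the wire.
    doc = dict(doc)  # shallow copy so the caller's dict stays untouched
    doc.pop('screenshotdata', None)
    doc.pop('shots', None)  # the array variant with several screenshots
    return doc

es.index(index='test', doc_type='reports', body=strip_screenshots(doc))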
Parsing JSON and building compact JSON takes some microseconds, while network transport of the unprocessed data takes at least 1000x as long, at least 5 ms. But you are right, it is your decision whether you want the client to do the work or the server, which has more important things to do, let's say searching/indexing.
@jprante Thanks for pointing this out.
But still, I want to at least get this working on the server once, even if I decide to go with the client option.