Exclude complete path from indexing and _source

Hi there,

I use Elasticsearch 2.1.1 and have to handle large files which usually contain Base64-encoded images. I want to completely skip this part of the documents when indexing them into Elasticsearch, i.e. NEITHER index it NOR store it in _source.

The part of my JSONs that I want to omit looks like this:

screenshotdata": 
    {"interesting": 
         {"data": "iVBORwfdrdfd........ (very long string) } ,
    more elements}

I have been battling with

"_source": {"excludes": ["screenshotdata"]}

as well as with

"dynamic_templates": [{"skipscreenshots": {"path_match": "screenshotdata.*", "mapping": {"store": "no", "index": "no"}}}]

but cannot get it to work.
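For context, this is roughly where I put those snippets in the create-index body (a minimal sketch with both attempts combined; the index name 'myindex' is just a placeholder for my real one):

from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': '127.0.0.1', 'port': 9200}])

# Sketch: both attempts in one create-index body.
# "myindex"/"reports" are placeholders for my real names.
body = {
    "mappings": {
        "reports": {
            # attempt 1: drop the branch from the stored _source
            "_source": {"excludes": ["screenshotdata"]},
            # attempt 2: dynamic template that should stop indexing it
            "dynamic_templates": [
                {"skipscreenshots": {
                    "path_match": "screenshotdata.*",
                    "mapping": {"store": "no", "index": "no"}
                }}
            ]
        }
    }
}

es.indices.create(index='myindex', body=body)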

This is really an issue for me because I definitely do not want the Base64 strings to uselessly inflate my DB.

Here is some more info on further stuff I have tried in the meantime (sorry that my examples use the Python API and not the classic curl way). This attempt just excludes a single field from _source:

from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': '127.0.0.1', 'port': 9200}])

body = {"settings":
            {"index.mapping.ignore_malformed": "true"},
        "mappings":
            {"reports": {"_all": {"enabled": "false"},
                         "_source": {"excludes": ["data"]},
                         # ... (other mappings)
                         }}}

es.indices.create(index='test', body=body)
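To verify that the excludes actually made it into the index, I also read the mapping back (a quick check, reusing the client object from above):

import json

# Print the mapping Elasticsearch actually stored for the new index,
# to verify that the _source excludes were applied.
print(json.dumps(es.indices.get_mapping(index='test'), indent=4, sort_keys=True))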

Then I index a document into "reports":

report = open("somepath/report.json", 'rb').read()
print(es.index(index='test', doc_type="reports", body=report))

which gives me some ID which I use to check the index and source:

import json

print(json.dumps(es.get(index='test', id="ID"), indent=4, sort_keys=True))
print(json.dumps(es.get_source(index='test', doc_type="reports", id="ID"), indent=4, sort_keys=True))

Now when I look at the output of get_source, the "data" part is still there.
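One additional sanity check is source filtering at request time, just to see whether the field path itself matches (a sketch, reusing the client from above; as far as I know the 2.x Python client exposes this as _source_exclude):

# Request-time source filtering, only to test whether the path matches.
# NOTE: _source_exclude is, I believe, the 2.x client's name for the
# REST _source_exclude query parameter.
print(json.dumps(
    es.get(index='test', doc_type="reports", id="ID",
           _source_exclude=["screenshotdata.interesting.data"]),
    indent=4, sort_keys=True))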

Can you post the mappings? Make sure they're code formatted so they maintain their structure :slight_smile:

Hi Mark,

Here is the complete body from my second post:

body = {"settings": 
    {"index.mapping.ignore_malformed": "true"}, 
    "mappings":
    {"reports":{"_all":{"enabled": "false"}, 
               # "_source":{"excludes":["data"]},
                "dynamic_templates": [{"entropy": {"match":"entropy","mapping": {"type": "double"}}},
                          {"offset": {"match":"offset","mapping": {"type": "string"}}},
                         #  {"skipscreenshots":{"match":"data", "mapping":{"type":"string","store":"no","index":"no"}}}
                          ]      
                }}}

Besides that, I think I have posted all necessary code.

Another note: I also have some strings enclosed in an array, i.e. several Base64-encoded screenshots, so an element that looks like this:

"shots": ["long string", "other long string"]

What I want to achieve is to exclude the complete branch that contains these elements instead of matching on single elements (so it wouldn't matter whether these elements are strings, objects or arrays of strings).

Hello!!! Is anybody out there who knows this situation and can help me?

The usual method is to process the JSON on the client side and submit only the wanted data to ES over the wire.
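For example, something along these lines (a sketch; the file path and field names are taken from the examples above):

import json
from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': '127.0.0.1', 'port': 9200}])

# Load the report, drop the unwanted branches, and send only the rest.
with open("somepath/report.json") as f:
    doc = json.load(f)

doc.pop("screenshotdata", None)  # remove the whole Base64 subtree
doc.pop("shots", None)           # remove the array of screenshots, too

es.index(index='test', doc_type="reports", body=doc)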

@jprante, I see that this extra step is an option, but I really want to use the most efficient solution, which is letting Elasticsearch skip parts of the JSON. I think this should be possible according to the documentation.

The most efficient way is what @jprante suggested, IMO.

Why would you send useless data to a system?

You mentioned large binary/Base64 fields which you want to remove, so the most efficient approach is to not transport them over the wire at all, rather than letting ES consume memory receiving them only to trash the fields anyway.

From what I can read in the documentation, dynamic templates work on field-name pattern matching but cannot deal with whole "subtrees" of fields.

https://www.elastic.co/guide/en/elasticsearch/guide/current/custom-dynamic-mapping.html

OK, so I misunderstood what "path_match" does - it only specifies the path to one specific field.
But still, what I tried above in my example, i.e. excluding "data" from _source, did not work. Any idea why that is the case?

As for sending "trash data" over the wire: this may be an issue when data is sent over the internet, but if all my data flows within, say, a 1-Gigabit LAN, then it is not really an issue. On the other hand, if I have to pre-parse every JSON before sending it to Elasticsearch, that also costs CPU, RAM and additional read/write access to the disk on the client, so I would rather send it over the wire even if it gets trashed. (Think of thousands of documents being processed every day.)

Parsing JSON and building compact JSON takes some microseconds, while network transport of the unprocessed data takes at least 1000x as much, at least 5 ms :slight_smile: But you are right, it's your decision whether you want the client to do the work or the server, which has more important things to do - let's say searching/indexing.
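If you want to put a number on it, time the strip-and-reserialize step on one of your own reports (a rough sketch; the file path is taken from your example above):

import json
import timeit

# Rough check of the client-side cost: parse once, then time how long
# stripping the branch and re-serializing takes per document.
doc = json.loads(open("somepath/report.json").read())

def strip_and_dump():
    d = dict(doc)                  # shallow copy
    d.pop("screenshotdata", None)  # drop the Base64 branch
    return json.dumps(d)

print(timeit.timeit(strip_and_dump, number=1000), "seconds for 1000 docs")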

@jprante Thanks for pointing this out. :slight_smile:
But still, I want to at least get this working on the server once, even if I decide to go with the client option.