Exclude complete path from indexing and _source

Benjamin_Gathmann · December 23, 2015, 12:14pm

Hi there,

I use Elasticsearch 2.1.1 and have to handle large files which usually contain Base64-encoded images. I want to completely skip this part of the documents when indexing them into Elasticsearch, i.e. NEITHER index it NOR keep it in _source.

The part of my JSONs that I want to omit looks like this:

"screenshotdata": 
    {"interesting": 
         {"data": "iVBORwfdrdfd........ (very long string)"},
    more elements}

I have been battling with
"_source":{"excludes":["screenshotdata"]}
as well as with
"dynamic_templates":{"skipscreenshots":{"path_match":"screenshotdata.*", "mapping":{"store":"no","index":"no"}}}
but cannot get it to work.

This is really an issue for me because I definitely do not want the Base64 strings to uselessly inflate my DB.

Benjamin_Gathmann · December 23, 2015, 2:19pm

Here is some more info on what I have tried in the meantime (sorry that my examples use the Python API and not the classic curl way). This attempt just excludes a single field from the _source:

from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': '127.0.0.1', 'port': 9200}])

body = {"settings": 
        {"index.mapping.ignore_malformed": "true"}, 
        "mappings":
        {"reports":{"_all":{"enabled": "false"}, 
                    "_source":{"excludes":["data"]}, ... (other mappings)

es.indices.create(index='test', body=body)

Then I index a document to "reports"

report = open("somepath/report.json",'rb').read()
print es.index(index='test', doc_type="reports", body=report)

which gives me some ID which I use to check the index and source:

import json
print json.dumps(es.get('test',id="ID"),indent=4, sort_keys=True)
print json.dumps(es.get_source('test',doc_type="reports",id="ID"),indent=4, sort_keys=True)

Now if I look at the output of the get_source, the "data" part is still there.

warkolm · December 24, 2015, 12:01am

Can you post the mappings? Please make sure they're code formatted so they keep their structure.

Benjamin_Gathmann · December 25, 2015, 8:22am

Hi Mark,

Here is the complete body from my second post:

body = {"settings": 
    {"index.mapping.ignore_malformed": "true"}, 
    "mappings":
    {"reports":{"_all":{"enabled": "false"}, 
               # "_source":{"excludes":["data"]},
                "dynamic_templates": [{"entropy": {"match":"entropy","mapping": {"type": "double"}}},
                          {"offset": {"match":"offset","mapping": {"type": "string"}}},
                         #  {"skipscreenshots":{"match":"data", "mapping":{"type":"string","store":"no","index":"no"}}}
                          ]      
                }}}

Besides that, I think I have posted all necessary code.

Another note: I also have some strings enclosed in an array, i.e. several Base64 encoded screenshots, so an element that looks like this:

shots:["long string","other long string"]

What I want to achieve is to exclude the complete branch which includes these elements instead of performing matches on single elements (so it wouldn't matter if these elements are strings, objects or arrays of strings).

Benjamin_Gathmann · January 2, 2016, 9:12am

Hello!!! Anybody there who knows this situation and can help me?

jprante · January 2, 2016, 10:24am

The usual method is to process the JSON on client side and submit only the data for ES over the wire.
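
A minimal sketch of that client-side step, assuming the report is ordinary JSON and the unwanted branch can be named by a dotted path. The helper name `drop_paths` and the sample document are hypothetical, made up for illustration; only the `screenshotdata` layout comes from the question above.

```python
import json


def drop_paths(doc, paths):
    """Return a copy of doc with each dotted path removed, whatever lies beneath it."""
    cleaned = json.loads(json.dumps(doc))  # simple deep copy for JSON-compatible data
    for path in paths:
        keys = path.split(".")
        node = cleaned
        # Walk down to the parent of the key that should be removed.
        for key in keys[:-1]:
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break  # path does not exist in this document; nothing to remove
        else:
            if isinstance(node, dict):
                # Removes strings, objects and arrays alike.
                node.pop(keys[-1], None)
    return cleaned


# Hypothetical sample shaped like the JSON in the question.
report = {
    "name": "sample",
    "screenshotdata": {
        "interesting": {"data": "iVBORw..."},
        "shots": ["long string", "other long string"],
    },
}

slim = drop_paths(report, ["screenshotdata"])
```

The result could then be passed to the same call used earlier in the thread, e.g. `es.index(index='test', doc_type="reports", body=slim)`, so the Base64 strings never leave the client.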

Benjamin_Gathmann · January 2, 2016, 11:03am

@jprante, I see that this extra step is an option, but I really want to use the most efficient solution which is letting Elasticsearch skip parts of the JSON. I think this should be possible according to the documentation.

dadoonet · January 2, 2016, 11:19am

The most efficient way is what @jprante suggested IMO.

Why would you send useless data to a system?

jprante · January 2, 2016, 11:23am

You mentioned large binary/base64 fields which you want to remove, so the most efficient way is to not transport them over the wire just to let ES consume memory to receive them and then trash the fields anyway.

From what I can read in the documentation, dynamic templates work on field name pattern matching but cannot deal with whole "subtrees" of fields.

https://www.elastic.co/guide/en/elasticsearch/guide/current/custom-dynamic-mapping.html

Benjamin_Gathmann · January 2, 2016, 11:55am

OK, so I misunderstood what "path_match" does - it only specifies the path to one specific field.
But still, what I tried above in my example, i.e. excluding "data" from _source, did not work. Any idea why this is the case?

As for sending "trash data" over the wire, this may be an issue when data is sent over the internet, but if all my data flow is e.g. in a LAN with 1 Gigabit bandwidth, then this is not really an issue. On the other hand, if I have to pre-parse every JSON before sending it to Elasticsearch, this also requires CPU, RAM and additional Read+Write access to the HD on the client, so I would rather send it over the wire even if it will be trashed. (Think of thousands of documents being processed every day)

jprante · January 2, 2016, 12:02pm

Parsing JSON and building compact JSON takes some microseconds, while network transport of unprocessed data takes at least 1000x as much, at least 5ms. But you are right, it's your decision whether you want the client to do the work or the server, which has more important things to do, let's assume searching/indexing.

Benjamin_Gathmann · January 2, 2016, 12:27pm

@jprante Thanks for pointing this out.
But still, I want to at least get this working on the server once, even if I decide to go with the client option.
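
For the server-side route, here is a hedged sketch of a mapping that might achieve this. It has not been tested against 2.1.1. It relies on two documented mapping features: `_source` excludes, which accept wildcard paths, and `"enabled": false` on an object field, which tells Elasticsearch to skip parsing that object's contents. One possible reason the earlier `"excludes":["data"]` attempt had no effect is that exclude patterns appear to match the full dotted path (e.g. `screenshotdata.interesting.data`) rather than a bare leaf name; that is an assumption, not something confirmed in this thread. The helper name `build_index_body` is hypothetical.

```python
def build_index_body(skip_object):
    """Build an index-creation body that neither indexes nor stores `skip_object`."""
    return {
        "mappings": {
            "reports": {
                "_all": {"enabled": False},
                # Drop the object itself and everything below it from the stored _source.
                "_source": {"excludes": [skip_object, skip_object + ".*"]},
                "properties": {
                    # enabled: false asks Elasticsearch not to parse or index
                    # anything inside this object.
                    skip_object: {"type": "object", "enabled": False},
                },
            }
        }
    }


body = build_index_body("screenshotdata")
```

This `body` would be passed to `es.indices.create(index='test', body=body)` as in the earlier posts. Note that the full document still travels over the wire; only what Elasticsearch keeps is reduced.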
