Retrieve _parent field with spark


(Paweł Chabierski) #1

Hi,

I have a problem retrieving the _parent field from Elasticsearch using PySpark. The RDD does not contain the _parent field even though I specify it in fields. My code:

import json
from pyspark import SparkContext

# Ask Elasticsearch to return _parent along with the document source.
es_query = {
    "fields": ["_parent", "_source"]
}

es_read_conf = {
    "es.nodes": "localhost",
    "es.resource": "crm/event",
    "es.nodes.wan.only": "true",
    "es.query": json.dumps(es_query),
    "es.read.metadata": "true"
}

es_rdd = SparkContext().newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_read_conf)

print(es_rdd.first())

The first element of the RDD looks like this:

(u'wn4U76ggQKKEKjBNEqiVyg', {u'isp': None, u'tags': None, u'url': None, u'ip': None, u'website_id': 4,      u'_metadata': {u'_score': 0.0, u'_type': u'event', u'_id': u'wn4U76ggQKKEKjBNEqiVyg', u'_index': u'crm'}, u'create_timestamp': 1462300665000, u'fields': {}, u'type': u'met_scenario_condition', u'additional_data': {u'block_id': 261, u'scenario_id': 13, u'block_name': u'Warunek wej\u015bciowy', u'action_type': u'mail', u'scenario_name': u'SP Ostatnio przegl\u0105dane', u'action_id': 261}})

When I run this query in elasticsearch-hammer, the _parent field is present in the response.

I've tried every version, including the alpha version. What's wrong?


(Paweł Chabierski) #2

I found that, by default, fields not present in the mapping are removed from the result. This behaviour can be disabled by setting es.read.unmapped.fields.ignore to false.
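
For anyone hitting the same issue, here is a minimal sketch of the read configuration with that setting applied. It reuses the values from the first post (localhost, crm/event); the helper names build_es_read_conf and read_first are only illustrative and not part of any library, and I have not checked this against every connector version.

```python
import json


def build_es_read_conf(nodes="localhost", resource="crm/event"):
    """Build a read configuration that keeps fields absent from the mapping.

    Hypothetical helper for illustration; the option names come from the
    posts above.
    """
    es_query = {"fields": ["_parent", "_source"]}
    return {
        "es.nodes": nodes,
        "es.resource": resource,
        "es.nodes.wan.only": "true",
        "es.query": json.dumps(es_query),
        "es.read.metadata": "true",
        # Do not drop fields that are missing from the index mapping.
        "es.read.unmapped.fields.ignore": "false",
    }


def read_first(sc, conf):
    """Read the index through EsInputFormat and return the first element.

    `sc` is an existing SparkContext; the elasticsearch-hadoop jar is
    assumed to be on the Spark classpath.
    """
    rdd = sc.newAPIHadoopRDD(
        inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf=conf)
    return rdd.first()
```

With a running SparkContext this would be called as read_first(sc, build_es_read_conf()). All values are passed as strings, the same way as in the original configuration.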

