In PySpark, the only way I can get data back from Elasticsearch is by leaving es.query at its default. Why is this?
es_query = {"match" : {"key" : "value"}}
es_conf = {"es.nodes" : "localhost", "es.resource" : "index/type", "es.query" : json.dumps(es_query)}
rdd = sc.newAPIHadoopRDD(inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",keyClass="org.apache.hadoop.io.NullWritable",valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable", conf=es_conf)
rdd.count()
...
0
rdd.first()
ValueError: RDD is empty
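In case the environment matters: this is a plain PySpark shell with the elasticsearch-hadoop connector jar on the classpath, so sc above is the shell's SparkContext. A minimal sketch of the equivalent standalone setup (the jar path, version, and app name are placeholders, not my exact values):

# Submitted roughly like:
#   spark-submit --jars /path/to/elasticsearch-hadoop-<version>.jar script.py
from pyspark import SparkContext

sc = SparkContext(appName="es-query-test")  # hypothetical app name; the pyspark shell provides sc itself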
Yet when I switch to a match_all query (re-dumping it into es_conf and recreating the RDD the same way, as spelled out below), I do get data back:
es_query = {"match_all" : {}}
rdd.first()
(u'2017-09-01 01:02:03', ...)
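Spelled out, the working run differs from the failing one only in the query string; a sketch, using the same placeholder index/type:

es_query = {"match_all": {}}
es_conf["es.query"] = json.dumps(es_query)  # overwrite the query in the same conf dict
rdd = sc.newAPIHadoopRDD(inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
                         keyClass="org.apache.hadoop.io.NullWritable",
                         valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
                         conf=es_conf)

So the connector reaches the cluster and reads documents without trouble; it is only a non-default es.query that comes back empty.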
*I have tested the queries by querying Elasticsearch directly and they work, so something seems to be wrong on the Spark/es-hadoop side.
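For reference, the direct test was along these lines (a sketch in Python 2 to match the u'' reprs above, with the same placeholder index/type and field; note that the REST _search body wraps the query in a top-level "query" object):

import json
import urllib2  # Python 2, matching the u'' output above

# Query the Elasticsearch REST search API directly; the _search body
# requires the top-level "query" wrapper around the match clause.
body = json.dumps({"query": {"match": {"key": "value"}}})
req = urllib2.Request("http://localhost:9200/index/type/_search",
                      data=body,
                      headers={"Content-Type": "application/json"})
print(urllib2.urlopen(req).read())  # prints the matching documents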