Query filter not working with SparkSql


#1

I'm trying to pull data from Elasticsearch using the two commands below. Both return the same record count, but sql_rdd returns every field in Elasticsearch, while es_rdd returns only the @timestamp, host and message fields specified in the query. The query strings are the same. Am I using SparkSql correctly? How can I make the field selection work for SparkSql? Thanks a lot!

# The JSON query is wrapped in single quotes so its inner double quotes do not end the Python string
sql_rdd = (sqlContext.read.format("org.elasticsearch.spark.sql")
    .option("es.nodes", "serverA")
    .option("es.query", '{"fields": ["@timestamp", "host", "message"], "query": { "filtered": { "query": {"match_all": {}}, "filter": {"range": { "@timestamp": { "gte": 1485050400000, "lt": 1485050430000} } } } } }')
    .load("logstash-2017.01.22"))

# Same query string, passed through the Hadoop InputFormat configuration
es_rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf={"es.nodes": "serverA", "es.resource": "logstash-2017.01.22", "es.query": '{"fields": ["@timestamp", "host", "message"], "query": { "filtered": { "query": {"match_all": {}}, "filter": {"range": { "@timestamp": { "gte": 1485050400000, "lt": 1485050430000} } } } } }'})


(James Baiera) #2

@zpp Could you include the versions of the technologies used?


#3

I'm using Elasticsearch 2.4.0, elasticsearch-hadoop 2.4.0, and Spark 1.6.2.


#4

Can anyone help? Thanks a lot!


(James Baiera) #5

@zpp Use of anything other than the "query" element in the es.query configuration is not officially supported. In 5.0, the logic for handling the es.query option was unified across all integrations. In 5.2.0 there will be an officially supported avenue for specifying the desired source fields from the request.


#6

Thanks a lot for the information, James.
I assume you're referring to Elasticsearch-Hadoop 5.2.0, which was just released on Jan 31. I did a quick test on this version without upgrading the Elasticsearch cluster (not sure whether this is OK, but there were no errors). Using the same commands as above, however, both returned all fields rather than the selected ones, even the newAPIHadoopRDD command, which was working with version 2.4.


(James Baiera) #7

@zpp You can enable field selection with the es.read.source.filter setting, as detailed in the docs for 5.2.0.

ES-Hadoop should be backwards compatible with ES 2.4.0, so no sweat :slight_smile:
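
As a rough sketch of the suggestion above for the newAPIHadoopRDD command: the "fields" element is dropped from es.query, leaving only the "query" element, and the field list moves into es.read.source.filter as a comma-delimited string. The helper names build_es_conf and read_es_rdd are hypothetical, introduced only for illustration. The server and index names are the ones from the original post. This assumes ES-Hadoop 5.2.0 or later and an existing SparkContext named sc, and it has not been run against a live cluster.

```python
import json

# Only the "query" element goes into es.query; the "fields" element is dropped.
query = {
    "query": {
        "filtered": {
            "query": {"match_all": {}},
            "filter": {
                "range": {
                    "@timestamp": {"gte": 1485050400000, "lt": 1485050430000}
                }
            },
        }
    }
}


def build_es_conf(nodes, resource, query, fields):
    """Hypothetical helper: assemble the connector settings as a plain dict."""
    return {
        "es.nodes": nodes,
        "es.resource": resource,
        "es.query": json.dumps(query),
        # Comma-delimited list of source fields to return (ES-Hadoop 5.2.0+).
        "es.read.source.filter": ",".join(fields),
    }


es_conf = build_es_conf(
    "serverA", "logstash-2017.01.22", query, ["@timestamp", "host", "message"]
)


def read_es_rdd(sc, conf):
    """Hypothetical helper: read from Elasticsearch with an existing SparkContext."""
    return sc.newAPIHadoopRDD(
        inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf=conf,
    )
```

In a PySpark shell this would be called as es_rdd = read_es_rdd(sc, es_conf). For the DataFrame reader, calling .select("@timestamp", "host", "message") on the loaded DataFrame is probably the more natural route, since the connector can derive the requested fields from the projection. This sketch does not cover how that interacts with es.read.source.filter.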

