Elasticsearch Geo Distance query in PySpark DataFrame

I am using the elasticsearch-spark connector to query data from Elasticsearch into a PySpark DataFrame. From materials online and a few Stack Overflow answers, I understand that filters are pushed down at run time to the Elasticsearch Query DSL. But I am not sure how to specify a Geo Distance query to filter the data that ends up in the PySpark DataFrame.

My query currently looks like this:

    q = """{
      "query": {
        "bool": {
          "must": {
            "match_all": {}
          },
          "filter": {
            "geo_distance": {
              "distance": "2.5km",
              "location": {
                "lat": 30.825,
                "lon": -86.99
              }
            }
          }
        }
      }
    }"""
I am currently using it to create an RDD with the following code:

    es_rdd = sc.newAPIHadoopRDD(
        inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf=es_conf)

(Note: I specified the query in the es_conf dictionary.)
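For completeness, the es_conf I'm assuming looks roughly like this; the node address and the index name are placeholders, and the query string shown here is a stand-in for the Geo Distance query above:

```python
# Hypothetical es_conf; "localhost" and "index/type" are placeholders,
# and q stands in for the Geo Distance query string defined above.
q = '{"query": {"match_all": {}}}'

es_conf = {
    "es.nodes": "localhost",        # Elasticsearch host(s)
    "es.port": "9200",              # REST port
    "es.resource": "index/type",    # placeholder index/type to read from
    "es.query": q,                  # full Query DSL pushed down to Elasticsearch
}
```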

I would like to replicate the Geo Distance query as a PySpark DataFrame operation. Any help would be appreciated. Thank you.

For Geo Distance queries, passing the raw Query DSL is the only way to push them down at the moment; Spark does not really offer any mechanism for translating geo operations in its SQL dialect.
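You can still attach that raw Query DSL to the DataFrame reader so the geo filter runs server-side. A rough sketch, assuming the elasticsearch-spark connector is on the classpath; the `read_es_dataframe` helper and the `index/type` resource name are placeholders for illustration:

```python
import json

# The same Geo Distance filter as above, built as a Python dict and
# serialized for the es.query option.
geo_query = {
    "query": {
        "bool": {
            "must": {"match_all": {}},
            "filter": {
                "geo_distance": {
                    "distance": "2.5km",
                    "location": {"lat": 30.825, "lon": -86.99},
                }
            },
        }
    }
}

def read_es_dataframe(spark, resource="index/type"):
    # 'resource' is a placeholder index/type. es.query pushes the full
    # Query DSL down to Elasticsearch, so only matching documents are
    # shipped back to Spark.
    return (spark.read
            .format("org.elasticsearch.spark.sql")
            .option("es.query", json.dumps(geo_query))
            .load(resource))
```

Any further filtering you express in Spark SQL on the resulting DataFrame happens on the Spark side; only what fits in es.query is evaluated by Elasticsearch.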

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.