Elasticsearch Geo Distance query in PySpark DataFrame

Pramod_Sripada · February 13, 2018, 2:27pm

I am using elasticsearch-spark connector to query data from Elasticsearch and get it into a PySpark DataFrame. Going through materials online and few Stackoverflow answers, I understood that filters will be pushed during run-time to Elasticsearch QueryDSL. But I am not sure how to specify a Geo Distance Query to filter data into PySpark DataFrame.

My query currently looks like this

    q ="""{
      "query": {
            "bool" : {
                "must" : {
                    "match_all" : {}
                },
               "filter" : {
                    "geo_distance" : {
                        "distance" : "2.5km",
                        "location" : {
                            "lat" : 30.825,
                            "lon" : -86.99
                        }
                    }
                }
            }
        }
    }"

And I am currently using it to create an RDD with the following code

es_rdd = sc.newAPIHadoopRDD(
        inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf=es_conf)

** I specified the query in the es_conf dictionary

I would like to replicate the Geo Distance Query in PySpark DataFrame operation. Any help would be appreciated. Thank you.

james.baiera · March 1, 2018, 9:36pm

For Geo Distance queries this is the only way to push them down at the moment. Spark does not really offer any mechanisms for translating Geo operations in their SQL dialect.

system · March 29, 2018, 9:36pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.