I have a DataFrame containing a list of cities, as follows:
val cities = sc.parallelize(Seq("New York")).toDF()
Now, for each city, I would like to query Elasticsearch and build a result set, along the lines of:
val cities = sc.parallelize(Seq("New York")).toDF()
cities.foreach(r => {
  val city = r.getString(0)
  val dfs = sqlContext.esDF("cities/docs", "?q=" + city) // returns a DataFrame, which triggers the exception
})
The problem is that Spark does not allow nested operations that return DataFrames. What options do I have to iterate over a DataFrame and collect the results?
This seems like something better suited to using a regular Java REST client to perform the search from within the foreach function. If you decide to do that, I would suggest using foreachPartition instead of foreach, so that you can batch the queries for a whole partition into a single request to Elasticsearch. It also lets you tear down the client once all the data in the partition has been consumed.
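A minimal sketch of that idea, using plain `HttpURLConnection` as the client and Elasticsearch's `_msearch` endpoint to batch one query per city. The `esHost` value and the `"city"` field name are assumptions for illustration; the `cities` index matches the question. The body-building helper is a pure function, kept separate so it can be tested without a cluster:

```scala
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets
import scala.io.Source

// Pure helper: turn a batch of city names into an _msearch request body
// (one empty header line plus one query line per city, newline-terminated).
def msearchBody(cities: Seq[String]): String =
  cities.map { c =>
    "{}" + "\n" + s"""{"query":{"match":{"city":"$c"}}}"""
  }.mkString("", "\n", "\n")

// Assumed host of an Elasticsearch node reachable from the executors.
val esHost = "localhost"

// One request per partition instead of one per row; the connection is
// torn down after the partition's data has been consumed.
cities.foreachPartition { rows =>
  val batch = rows.map(_.getString(0)).toSeq
  if (batch.nonEmpty) {
    val conn = new URL(s"http://$esHost:9200/cities/_msearch")
      .openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setDoOutput(true)
    conn.setRequestProperty("Content-Type", "application/x-ndjson")
    conn.getOutputStream.write(msearchBody(batch).getBytes(StandardCharsets.UTF_8))
    val responses = Source.fromInputStream(conn.getInputStream).mkString
    // parse `responses` here: one result set per city, in batch order
    conn.disconnect()
  }
}
```

Since the partition's iterator is drained into a local `Seq` before the request, very large partitions may need to be chunked further, but the shape above keeps the per-request overhead off the row-by-row path.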