How do I build results dinamically from a Dataframe? (Apache Spark)

I have a Dataframe containing a list of cities as following :

val cities    = sc.parallelize(Seq("New York")).toDF()

Now , for each city , I would like to query Elastic and build a set of results similar to the following logic :

val cities    = sc.parallelize(Seq("New York")).toDF()
   cities.foreach(r => {
    val city = r.getString(0)
    val dfs = sqlContext.esDF("cities/docs", "?q=" + city) //returns a DataFrame which triggers the exception 

Problem is that Spark does not allow nested operations that return dataframes. What options do I have to iterate a dataframe and get the results?

Another option , is it possible to get a regular data structure that is not a dataframe using this connector?

This seems like something better suited to using the regular java rest client to perform the search from within the foreach function. If you decide to do that, I would suggest doing foreachPartition instead of foreach so that you can batch up the query to send to Elasticsearch. Also, that would allow you to tear down the client after all the data in the partition is consumed.

1 Like

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.