Querying nested objects using DataFrames in Spark


(Cyril Scetbon) #1

Hi,

I have the following mapping :

"mappings": {
    "locations": {
      "properties": {
        "addresses": {
          "type": "nested",
          "properties": {
            "id":    { "type": "string"  },
            "type":    { "type": "string"  }
          }
        }
      }
    }
  }

I already have a DataFrame containing address ids, and for each of them I'd like to get the corresponding document ids by matching against the "id" field inside addresses.
I've come up with the following solution:
df.select(explode(df("addresses.id")).as("aid"), df("id"))
  .join(df_aids, $"aid" === df_aids("id"))
  .select(df("id"), df_aids("id"))

I'm concerned about performance. Is this the best way to find the documents in df whose "addresses.id" values appear in df_aids?
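For context, here is a self-contained sketch of the explode-and-join approach above (column and DataFrame names follow the original snippet; it assumes a running SparkSession with df read from Elasticsearch via elasticsearch-hadoop and df_aids holding the address ids to look up):

```scala
import org.apache.spark.sql.functions.{col, explode}

// df:      ES-backed DataFrame with columns "id" and "addresses" (array of structs)
// df_aids: lookup DataFrame with a single column "id" (the address ids of interest)

// Flatten the nested addresses array so each row carries one address id,
// then join against the lookup DataFrame on that id.
val matched = df
  .select(col("id").as("doc_id"), explode(col("addresses.id")).as("aid"))
  .join(df_aids, col("aid") === df_aids("id"))
  .select(col("doc_id"), col("aid"))
```

Note that the join itself runs Spark-side: Elasticsearch still returns all documents, and the filtering happens after the shuffle.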

Thanks


(Costin Leau) #2

Try enabling logging and look at the resulting ES query DSL. Typically, joins are handled by Spark alone, without pushdown. The 'in' filter is supported, so a query that uses it should perform well.
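A hedged sketch of the alternative Costin is hinting at: if df_aids is small enough to collect to the driver, an isin filter on a column can be pushed down to Elasticsearch as a terms filter instead of being resolved by a Spark-side join (names follow the thread above; this assumes the id list fits comfortably in driver memory):

```scala
import org.apache.spark.sql.functions.{col, explode}

// Collect the address ids to the driver (only viable for a modest list).
val aids: Array[String] = df_aids.select("id").collect().map(_.getString(0))

// An isin predicate is a candidate for pushdown to the source,
// avoiding a full shuffle join.
val matched = df
  .select(col("id").as("doc_id"), explode(col("addresses.id")).as("aid"))
  .where(col("aid").isin(aids: _*))
```

Whether the predicate actually reaches Elasticsearch depends on where it sits in the plan (a filter applied after explode may not be pushed into the scan), so it is worth checking the generated query DSL in the logs as suggested.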


(Cyril Scetbon) #3

But how would I get the (id, aid) pairs using an in operation? It seems isin works with a list, not a DataFrame, so it looks like more work is needed, no? It would be great if you had a sample based on my code.
Also keep in mind that addresses.id is an array.


(Costin Leau) #4

That's more of a question for Spark itself: how a given query gets translated into a range/in query. In the end, it's the planner that decides how to translate a query (DataFrame methods included) into basic operations...
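One way to see what the planner decided is to print the query plan and look for pushed filters on the scan node (a minimal sketch, assuming df is the ES-backed DataFrame from the original post and the literal ids are hypothetical):

```scala
// Print the logical and physical plans; in the scan node, a
// "PushedFilters" entry such as In(id, [a1,a2]) shows the filter
// reached the Elasticsearch source rather than running Spark-side.
df.where(df("id").isin("a1", "a2")).explain(true)
```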


(system) #5