Querying nested objects using DataFrames in Spark


(Cyril Scetbon) #1

Hi,

I have the following mapping :

"mappings": {
    "locations": {
      "properties": {
        "addresses": {
          "type": "nested",
          "properties": {
            "id":    { "type": "string"  },
            "type":    { "type": "string"  }
          }
        }
      }
    }
  }

I already have a DataFrame containing address ids, and for each of them I'd like to get the corresponding document ids by matching against the "id" field inside addresses.
I've come up with the following solution:
df.select(explode(df("addresses.id")).as("aid"), df("id"))
  .join(df_aids, $"aid" === df_aids("id"))
  .select(df("id"), df_aids("id"))

I'm concerned about performance. Is this the best way to find the documents in df whose "addresses.id" values appear in df_aids?
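For context, here is a self-contained sketch of the explode-and-join approach above (column and DataFrame names follow the original snippet; it assumes a running SparkSession with df read from Elasticsearch via elasticsearch-hadoop and df_aids holding the address ids to look up):

```scala
import org.apache.spark.sql.functions.{col, explode}

// df:      ES-backed DataFrame with columns "id" and "addresses" (array of structs)
// df_aids: lookup DataFrame with a single column "id" (the address ids of interest)

// Flatten the nested addresses array so each row carries one address id,
// then join against the lookup DataFrame on that id.
val matched = df
  .select(col("id").as("doc_id"), explode(col("addresses.id")).as("aid"))
  .join(df_aids, col("aid") === df_aids("id"))
  .select(col("doc_id"), col("aid"))
```

Note that the join itself runs Spark-side: Elasticsearch still returns all documents, and the filtering happens after the shuffle.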

Thanks


(Costin Leau) #2

Try enabling logging and look at the resulting ES query DSL. Typically, joins are handled by Spark alone, without pushdown. The 'in' filter is supported, so a query that uses it should perform well.
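A hedged sketch of the alternative Costin is hinting at: if df_aids is small enough to collect to the driver, an isin filter on a column can be pushed down to Elasticsearch as a terms filter instead of being resolved by a Spark-side join (names follow the thread above; this assumes the id list fits comfortably in driver memory):

```scala
import org.apache.spark.sql.functions.{col, explode}

// Collect the address ids to the driver (only viable for a modest list).
val aids: Array[String] = df_aids.select("id").collect().map(_.getString(0))

// An isin predicate is a candidate for pushdown to the source,
// avoiding a full shuffle join.
val matched = df
  .select(col("id").as("doc_id"), explode(col("addresses.id")).as("aid"))
  .where(col("aid").isin(aids: _*))
```

Whether the predicate actually reaches Elasticsearch depends on where it sits in the plan (a filter applied after explode may not be pushed into the scan), so it is worth checking the generated query DSL in the logs as suggested.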


(Cyril Scetbon) #3

But how would I get the (id, aid) pairs using an in operation? It seems isin works with a list, not a DataFrame, so it looks like more work is needed, no? It would be great if you had a sample based on my code.
Also keep in mind that addresses.id is an array.


(Costin Leau) #4

That's more of a question for Spark itself: how a given query gets translated into a range/in query. In the end, it's the planner that decides how to translate a query (DataFrame methods included) into basic operations...
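One way to see what the planner decided is to print the query plan and look for pushed filters on the scan node (a minimal sketch, assuming df is the ES-backed DataFrame from the original post and the literal ids are hypothetical):

```scala
// Print the logical and physical plans; in the scan node, a
// "PushedFilters" entry such as In(id, [a1,a2]) shows the filter
// reached the Elasticsearch source rather than running Spark-side.
df.where(df("id").isin("a1", "a2")).explain(true)
```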


(system) #5