Sub-queries with Spark and ES-Hadoop


Hey, first time here, so apologies if it is a duplicate.

I use ES-Hadoop to run over all the data in an index, but I need to perform sub-queries inside my mapPartitions.
I query the entire index and then iterate over it with mapPartitions, but for each item I need to perform another query against a different index.

For example, I have an index of dogs and another index of dog relations (their band). The first index (dogs) is the one I read with ES-Hadoop as an RDD, and I need to query the second index for each dog separately.

How can I solve this without performing a query per dog?
I want to keep the number of queries as low as possible, but I don't know how to merge these RDDs in a smart way.

BTW - I opened a StackOverflow question

Any idea will help,

(James Baiera) #2

@shaimr you could always try a join operation in Spark using RDDs sourced from both indices.
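The idea is to key both RDDs by the dog id and let Spark's join bring matching records together, instead of issuing one query per dog. A minimal sketch of that keyed join, using plain Python lists as stand-ins for the two RDDs (the index contents and field names here are made up for illustration; in a real job both sides would come from ES-Hadoop, e.g. `sc.esRDD("dogs")` and `sc.esRDD("dogs-relations")`, each mapped to `(dogId, doc)` pairs):

```python
from collections import defaultdict

# Hypothetical stand-ins for the two indices, already keyed by dog id.
dogs = [("dog1", {"name": "Rex"}), ("dog2", {"name": "Fido"})]
relations = [("dog1", {"band": "A"}), ("dog2", {"band": "B"}),
             ("dog1", {"band": "C"})]

def keyed_join(left, right):
    """Inner join of two (key, value) collections, the same pairing
    that RDD.join performs after shuffling records by key."""
    by_key = defaultdict(list)
    for k, v in right:
        by_key[k].append(v)
    # One output pair per matching (left, right) combination.
    return [(k, (lv, rv)) for k, lv in left for rv in by_key.get(k, [])]

joined = keyed_join(dogs, relations)
# Each dog is now paired with every relation document that shares its id.
```

In Spark this would simply be `dogsRDD.join(relationsRDD)`; the join runs as a distributed shuffle across executors rather than on a single machine.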


(shaimr) #3
The first index contains 50M documents and the relation index contains 300M documents.
Isn't that too much data for an in-memory join?

(system) #4

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.