Sub-queries with Spark and ES-Hadoop


Hey, first time here, so apologies if it is a duplicate.

I use ES-Hadoop to run over all the data in an index, but I need to perform sub-queries inside my mapPartitions.
I query the entire index and then iterate over it with mapPartitions, but for each item I need to perform another query against a different index.

For example, I have an index of dogs and another index of dog relations (their band). The first index (dogs) is the one I read with ES-Hadoop as an RDD, and I need to query the second index for each dog separately.

How can I solve this without performing a query per dog?
I want to keep the number of queries as low as possible, but I don't know how to merge these RDDs in a smart way.

BTW - I opened a StackOverflow question

Any idea will help,

(James Baiera) #2

@shaimr you could always try a join operation in Spark using RDDs sourced from both indices.
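The idea is to key both RDDs by the dog id and let Spark's join bring matching records together, instead of issuing one query per dog. A minimal sketch of that keyed join, using plain Python lists as stand-ins for the two RDDs (the index contents and field names here are made up for illustration; in a real job both sides would come from ES-Hadoop, e.g. `sc.esRDD("dogs")` and `sc.esRDD("dogs-relations")`, each mapped to `(dogId, doc)` pairs):

```python
from collections import defaultdict

# Hypothetical stand-ins for the two indices, already keyed by dog id.
dogs = [("dog1", {"name": "Rex"}), ("dog2", {"name": "Fido"})]
relations = [("dog1", {"band": "A"}), ("dog2", {"band": "B"}),
             ("dog1", {"band": "C"})]

def keyed_join(left, right):
    """Inner join of two (key, value) collections, the same pairing
    that RDD.join performs after shuffling records by key."""
    by_key = defaultdict(list)
    for k, v in right:
        by_key[k].append(v)
    # One output pair per matching (left, right) combination.
    return [(k, (lv, rv)) for k, lv in left for rv in by_key.get(k, [])]

joined = keyed_join(dogs, relations)
# Each dog is now paired with every relation document that shares its id.
```

In Spark this would simply be `dogsRDD.join(relationsRDD)`; the join runs as a distributed shuffle across executors rather than on a single machine.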


(shaimr) #3
The first index contains 50M documents and the relation index contains 300M documents.
Isn't that too much data for an in-memory join?

(system) #4

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.