I have a Java Spark application that is querying 15K _ids. The query goes against an alias backed by 15 or more indexes, each with ~9 shards. The query takes far longer than necessary because, given the query, it will only return one record per index per _id. If I run the same query directly through the Elasticsearch High Level REST API it comes back very fast. The query time represents almost 50% of the total processing time.
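For reference, this is roughly what the fast direct query looks like (the host, alias name, and ids below are placeholders, and with 15K ids the request would need to be batched to stay under the default 10,000-hit search size):

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class DirectIdsQuery {
    public static void main(String[] args) throws Exception {
        // Placeholder host; in the real application this points at the cluster.
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")));

        // One ids query against the alias resolves all the requested _ids in a
        // single request, instead of one query per shard like the connector does.
        SearchSourceBuilder source = new SearchSourceBuilder()
                .query(QueryBuilders.idsQuery().addIds("id-1", "id-2", "id-3"))
                .size(3);

        SearchRequest request = new SearchRequest("my-alias").source(source);
        SearchResponse response = client.search(request, RequestOptions.DEFAULT);
        System.out.println("hits: " + response.getHits().getTotalHits().value);

        client.close();
    }
}
```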
I know the problem is due to the way the connector creates one query per shard, as seen in https://github.com/elastic/elasticsearch-hadoop/blob/169be30e4243763efdc227f896e78f2bf3cf6930/mr/src/main/java/org/elasticsearch/hadoop/rest/RestService.java#L272
I see no way around this. Ideally I'd want to do a single query that produces a single partition; at a minimum I'd want one partition per index.
So my question is: is there any way around this problem? I've started looking into writing my own "connector" that uses the High Level REST API and loads the results into a Dataset, but I ran into some trouble and put it on the back burner. A rough sketch of what I had in mind is below.
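This is only a sketch of the idea, not working code: the host ("es-host"), alias ("my-alias"), and partition count (8) are placeholders, and it assumes each batch of ids fits in a single search response:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.http.HttpHost;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class IdsToDataset {

    public static Dataset<Row> fetchByIds(SparkSession spark, List<String> ids) {
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // Spread the 15K ids over a small, fixed number of partitions instead of
        // one partition per shard; each partition issues a single ids query.
        JavaRDD<String> jsonRdd = jsc.parallelize(ids, 8).mapPartitions(batch -> {
            List<String> idBatch = new ArrayList<>();
            batch.forEachRemaining(idBatch::add);

            // The client is created on the executor, so nothing non-serializable
            // is captured by the closure.
            RestHighLevelClient client = new RestHighLevelClient(
                    RestClient.builder(new HttpHost("es-host", 9200, "http")));
            try {
                SearchRequest request = new SearchRequest("my-alias").source(
                        new SearchSourceBuilder()
                                .query(QueryBuilders.idsQuery()
                                        .addIds(idBatch.toArray(new String[0])))
                                .size(idBatch.size()));
                SearchResponse response = client.search(request, RequestOptions.DEFAULT);

                // Collect the raw _source JSON of each hit.
                List<String> sources = new ArrayList<>();
                for (SearchHit hit : response.getHits()) {
                    sources.add(hit.getSourceAsString());
                }
                return sources.iterator();
            } finally {
                client.close();
            }
        });

        // Parse the _source JSON strings into a Dataset with an inferred schema.
        return spark.read().json(spark.createDataset(jsonRdd.rdd(), Encoders.STRING()));
    }
}
```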
I am using ES 7.9.2 and Spark 2.4.7.