I have a Java Spark application that is querying 15K _ids. The query goes against an alias backed by 15 or more indexes, each with ~9 shards. The query takes far longer than necessary because, given the query, it will only return one record per index per _id. If I run the same query directly through the Elasticsearch High Level REST API it comes back very fast. The query time represents almost 50% of the total processing time.
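For reference, this is roughly what the fast direct query looks like (the host, alias name, and ids below are placeholders, and with 15K ids the request would need to be batched to stay under the default 10,000-hit search size):

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class DirectIdsQuery {
    public static void main(String[] args) throws Exception {
        // Placeholder host; in the real application this points at the cluster.
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")));

        // One ids query against the alias resolves all the requested _ids in a
        // single request, instead of one query per shard like the connector does.
        SearchSourceBuilder source = new SearchSourceBuilder()
                .query(QueryBuilders.idsQuery().addIds("id-1", "id-2", "id-3"))
                .size(3);

        SearchRequest request = new SearchRequest("my-alias").source(source);
        SearchResponse response = client.search(request, RequestOptions.DEFAULT);
        System.out.println("hits: " + response.getHits().getTotalHits().value);

        client.close();
    }
}
```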
I know the problem is due to the way the connector creates one query per shard, as seen in https://github.com/elastic/elasticsearch-hadoop/blob/169be30e4243763efdc227f896e78f2bf3cf6930/mr/src/main/java/org/elasticsearch/hadoop/rest/RestService.java#L272
I see no way around this. Ideally I'd want to do a single query that produces a single partition; at a minimum I'd want one partition per index.
So my question is: is there any way around this problem? I've started looking into writing my own "connector" that uses the High Level REST API and loads the results into a Dataset, but I ran into some trouble and put it on the back burner. A rough sketch of what I had in mind is below.
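This is only a sketch of the idea, not working code: the host ("es-host"), alias ("my-alias"), and partition count (8) are placeholders, and it assumes each batch of ids fits in a single search response:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.http.HttpHost;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class IdsToDataset {

    public static Dataset<Row> fetchByIds(SparkSession spark, List<String> ids) {
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // Spread the 15K ids over a small, fixed number of partitions instead of
        // one partition per shard; each partition issues a single ids query.
        JavaRDD<String> jsonRdd = jsc.parallelize(ids, 8).mapPartitions(batch -> {
            List<String> idBatch = new ArrayList<>();
            batch.forEachRemaining(idBatch::add);

            // The client is created on the executor, so nothing non-serializable
            // is captured by the closure.
            RestHighLevelClient client = new RestHighLevelClient(
                    RestClient.builder(new HttpHost("es-host", 9200, "http")));
            try {
                SearchRequest request = new SearchRequest("my-alias").source(
                        new SearchSourceBuilder()
                                .query(QueryBuilders.idsQuery()
                                        .addIds(idBatch.toArray(new String[0])))
                                .size(idBatch.size()));
                SearchResponse response = client.search(request, RequestOptions.DEFAULT);

                // Collect the raw _source JSON of each hit.
                List<String> sources = new ArrayList<>();
                for (SearchHit hit : response.getHits()) {
                    sources.add(hit.getSourceAsString());
                }
                return sources.iterator();
            } finally {
                client.close();
            }
        });

        // Parse the _source JSON strings into a Dataset with an inferred schema.
        return spark.read().json(spark.createDataset(jsonRdd.rdd(), Encoders.STRING()));
    }
}
```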
I am using ES 7.9.2 and Spark 2.4.7.