I am using the ES Spark plugin to read multiple indices:
sparkContext.esJsonRDD("index1, index2")
gives me
WARN RestRepository: Read resource [index1,index2/] includes multiple indices or/and aliases; to avoid duplicate results (caused by shard overlapping), parallelism is reduced from 2 to 1
The application runs without any problem, but what exactly does the warning mean?
It is only a warning, not an error. It comes from how each index's shards are spread across the nodes and how ES can select data. Currently ES does not allow a query to refer to multiple indices while selecting only one shard of a given index (rather than all of them). ES-Hadoop therefore tries the various combinations behind the scenes and, if none of them works, falls back to reduced parallelism.
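If the reduced parallelism matters, one possible workaround is to read each index on its own and union the results, so that no single read spans several indices. This is only a sketch, not something confirmed in this thread: the `es.nodes` address, the app name, and the object name are assumptions, and whether it actually gives you more partitions depends on your ES-Hadoop version and shard layout.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.elasticsearch.spark._ // adds esJsonRDD to SparkContext

object MultiIndexRead {
  def main(args: Array[String]): Unit = {
    // Assumption: Elasticsearch is reachable at localhost:9200 and the
    // Spark master is supplied by spark-submit.
    val conf = new SparkConf()
      .setAppName("multi-index-read")
      .set("es.nodes", "localhost:9200")
    val sc = new SparkContext(conf)

    // Index names taken from the question.
    val indices = Seq("index1", "index2")

    // One RDD per index, so each read targets a single index.
    val perIndex: Seq[RDD[(String, String)]] =
      indices.map(name => sc.esJsonRDD(name))

    // Union keeps the partitions of every input RDD.
    val combined: RDD[(String, String)] = sc.union(perIndex)

    // Compare this number with the one reported in the warning.
    println(s"partitions: ${combined.partitions.length}")

    sc.stop()
  }
}
```

Each element of `combined` is a (document id, JSON source) pair, the same shape the single multi-index call returns, so the rest of the job should not need to change.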