In Elasticsearch for Hadoop, on page 137, it says:
Here, we can see the correlation between the units of parallelism of two systems.
The preceding diagram shows that there are two discrete distributed systems on the two sides: Hadoop and Elasticsearch. The Hadoop cluster can have several nodes, but only three nodes with Hadoop tasks are highlighted for simplicity. On the other side, Elasticsearch also has three nodes in the cluster, with three shards and one replica each. Each node or task on one side of the diagram can talk directly to any node or task on the other side. Thus, it provides a peer-to-peer architecture.
ES-Hadoop respects the analogy of splits and shards to enable dynamic parallelism. This means that when you import data from Hadoop to Elasticsearch, ES-Hadoop doesn't just blindly dump the data into any one Elasticsearch node; it takes into account the number of shards and divides the indexing requests among the available shards. The same holds true when we read data from Elasticsearch and write it to HDFS using ES-Hadoop.
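To make the "units of parallelism" idea concrete, here is a purely illustrative sketch (not the connector's actual source): if writer tasks are spread round-robin over the primary shards, each task ends up with a well-defined target shard instead of all tasks hammering one node. The function name and the round-robin rule are my assumptions for illustration only.

```python
# Illustrative only: a round-robin mapping from writer tasks to primary
# shards, to show how parallelism on the Hadoop/Spark side can line up
# with parallelism on the Elasticsearch side.
def assign_tasks_to_shards(num_tasks, num_shards):
    """Return a dict mapping each writer task to the shard it indexes into."""
    return {task: task % num_shards for task in range(num_tasks)}

# Six writer tasks over three primary shards: each shard gets two writers.
assignment = assign_tasks_to_shards(num_tasks=6, num_shards=3)
print(assignment)  # {0: 0, 1: 1, 2: 2, 3: 0, 4: 1, 5: 2}
```

The point of the sketch: the write load divides evenly only when the task count is a sensible multiple of the shard count, which is why the task/shard ratio matters when sizing a job.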
I'm trying to get a better understanding of the code behind this, because my Spark job is overloading ES and I'm seeing the error: Could not write all entries [27/324416] (maybe ES was overloaded?).
My ES cluster consists of 15x c3.8xlarge instances, and I have the indexes and settings set up for bulk loading. On the other side, I have a 13-node Spark cluster. My job has two RDDs, one with 200 partitions and the other with 600. Passing these directly to EsSpark.saveWithMetadata overloads ES very quickly.
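One knob worth checking besides the partition count: each writer task buffers documents and flushes them as bulk requests, sized by the connector's `es.batch.*` settings. These setting names are from the ES-Hadoop configuration reference; the values below are just examples of tuning them down (the stated defaults are my recollection, so verify against the docs for your version):

```properties
# ES-Hadoop connector settings, applied per writer task
es.batch.size.bytes = 1mb        # flush a bulk request at this size
es.batch.size.entries = 500      # ...or at this many documents (default 1000)
es.batch.write.retry.count = 6   # retries when ES rejects part of a bulk (default 3)
es.batch.write.retry.wait = 30s  # wait between retries, lets ES drain its queues (default 10s)
```

Smaller bulks with longer retry waits trade throughput for backpressure, which can stop the job from tipping an already-busy cluster over.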
I attempted using rdd.repartition on these RDDs to scale them down to match the number of ES nodes. That was a big improvement, but after an hour of loading it still overloads ES.
My ES cluster topology consists of 2x master and client nodes with 13 data nodes.
Do you have any advice for debugging, or for getting better log output, so I can better manage the number of Spark tasks?