I have an Elasticsearch cluster with 5 nodes: 4 data nodes and one node that acts as a load balancer. There are 7 indices, each with 8 shards and 1 replica.
I also have a Spark cluster with 3 nodes running 8 executors, which indexes documents into ES with the ES-Hadoop library.
I use Spark 1.6, ES 2.2.0.
When I run the Spark job I set "es.nodes" to my four ES data nodes, although I have also tested with three and with two nodes.
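For context, the job configuration looks roughly like this. This is a minimal sketch using the standard RDD-based ES-Hadoop API; the hostnames, app name, and index name are placeholders, not my real values:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

val conf = new SparkConf()
  .setAppName("es-indexing") // placeholder app name
  // Comma-separated list of the four ES data nodes (placeholder hostnames)
  .set("es.nodes", "es-data1:9200,es-data2:9200,es-data3:9200,es-data4:9200")
  // Node discovery is on by default: the connector asks the cluster for
  // the full node list at startup and routes bulk writes per shard
  .set("es.nodes.discovery", "true")
  // Retry settings that come into play when a node drops out
  .set("es.batch.write.retry.count", "3")
  .set("es.batch.write.retry.wait", "10s")

val sc = new SparkContext(conf)
sc.makeRDD(Seq(Map("field" -> "value"))).saveToEs("myindex/mytype") // placeholder index/type
```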
If I lose one node, things start going pretty badly. Once, Spark was unable to index any more documents at all; other times it keeps running but doesn't work right. It usually indexes about 25K–35K documents per second, but after a node goes missing it drops to about 500 documents per second.
In Marvel, the chart during normal operation is roughly a flat line between a 25K and 30K index rate. When a node is missing, it is a line at 0, with occasional peaks of 15K–20K.
I couldn't find anything relevant in the Spark or ES logs.
I did another test as well. I stopped Spark and one ES node, deleted all the indices in ES (my own and the Marvel ones), and started Spark again: it worked fine. A couple of minutes later I restarted the stopped ES node and things started going bad again. What previously took about 3 seconds to index now takes 1.2 minutes. I thought it might be shard replication, but even after the shards were fully allocated the behavior stayed the same. In any case, since I had just deleted all the indices, they didn't hold much data.
Any clue about what could be happening?
My theory from that last test is that when Spark starts, it sees three nodes in the ES cluster and caches which node holds each shard of each index.
So when I bring the other node back up, Spark keeps going to the old node even for shards that have been relocated to the restarted one. If that were right, though, I would expect to see a lot of failures in the logs, and I don't see anything.
The general question is: how is Spark affected when an ES node goes down?