I have an Elasticsearch cluster with 5 nodes: 4 data nodes and one node that acts as a load balancer. There are 7 indices, each with 8 shards and 1 replica.
I also have a Spark cluster with 3 nodes running 8 executors, which indexes documents into ES with the ES-Hadoop library.
I use Spark 1.6, ES 2.2.0.
When I run the Spark job I set "es.nodes" to my four ES data nodes, although I have also tested with three and with two nodes.
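For context, the job configuration looks roughly like this. This is a minimal sketch using the standard RDD-based ES-Hadoop API; the hostnames, app name, and index name are placeholders, not my real values:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

val conf = new SparkConf()
  .setAppName("es-indexing") // placeholder app name
  // Comma-separated list of the four ES data nodes (placeholder hostnames)
  .set("es.nodes", "es-data1:9200,es-data2:9200,es-data3:9200,es-data4:9200")
  // Node discovery is on by default: the connector asks the cluster for
  // the full node list at startup and routes bulk writes per shard
  .set("es.nodes.discovery", "true")
  // Retry settings that come into play when a node drops out
  .set("es.batch.write.retry.count", "3")
  .set("es.batch.write.retry.wait", "10s")

val sc = new SparkContext(conf)
sc.makeRDD(Seq(Map("field" -> "value"))).saveToEs("myindex/mytype") // placeholder index/type
```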
If I lose one node, things start going pretty badly. Once, Spark was unable to index any more documents at all; other times it keeps running but doesn't work right. It usually indexes about 25K–35K documents per second, but after a node goes missing it drops to about 500 documents per second.
In Marvel, the chart during normal operation is roughly a flat line between a 25K and 30K index rate. When a node is missing, it is a line at 0, with occasional peaks of 15K–20K.
I couldn't find anything relevant in the Spark or ES logs.
I did another test as well. I stopped Spark and one ES node, deleted all the indices in ES (my own and the Marvel ones), and started Spark again: it worked fine. A couple of minutes later I restarted the stopped ES node and things started going bad again. What previously took about 3 seconds to index now takes 1.2 minutes. I thought it might be shard replication, but even after the shards were fully allocated the behavior stayed the same. In any case, since I had just deleted all the indices, they didn't hold much data.
Any clue about what could be happening?
My theory from that last test is that when Spark starts, it sees three nodes in the ES cluster and caches which node holds each shard of each index.
So when I bring the other node back up, Spark keeps going to the old node even for shards that have been relocated to the restarted one. If that were right, though, I would expect to see a lot of failures in the logs, and I don't see anything.
The general question is: how is Spark affected when an ES node goes down?