Rebalancing and Spark Ingest Slow Down

nathann · June 10, 2016, 1:04am

Hi, I have a decently sized cluster, and I notice that while it is re-balancing, my Spark ingest job hammers only the nodes with lower shard counts. It appears that new shards are allocated only to the nodes that are using less disk than the rest of the cluster. This leads to a couple of questions:

1. Is this type of behavior expected? It definitely makes it more challenging to add nodes to an existing cluster without impacting ongoing ingest.

2. Are there settings that can be tweaked to make better use of all the nodes during ingest, even when some nodes do not yet hold roughly the same amount of data as the others?

NOTE: The ingest jobs ran great before the addition of nodes as well as after shards were re-balanced to the new nodes.

versions:
ES 2.2.0
Spark 1.5.2
ES Hadoop 2.2.0 (using Map/Reduce layer with PySpark)

-Nathan

nathann · July 20, 2016, 8:11pm

UPDATE: The following was a solution to the issue:
https://www.elastic.co/guide/en/elasticsearch/reference/current/allocation-total-shards.html#allocation-total-shards
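
For anyone landing here later, a rough sketch of what applying that page looks like. It assumes the index-level setting `index.routing.allocation.total_shards_per_node` from the linked docs is the one being used; the helper names, the shard and node counts, and the `headroom` parameter are made up for illustration and are not from my actual job. The idea is to cap how many shards of a single index can sit on one node, so a new index cannot pile all of its shards onto the emptiest nodes.

```python
import json
import math


def shards_per_node_cap(primaries, replicas, data_nodes, headroom=0):
    """Smallest per-node cap that still lets every shard copy be allocated.

    headroom is a hypothetical knob: extra slots per node so that shards
    can still be placed if a node drops out of the cluster.
    """
    total_copies = primaries * (1 + replicas)
    return math.ceil(total_copies / data_nodes) + headroom


def allocation_settings_body(cap):
    """JSON body for PUT /<index>/_settings (index name left to the caller)."""
    return json.dumps({"index.routing.allocation.total_shards_per_node": cap})


if __name__ == "__main__":
    # Illustrative numbers only: 10 primaries, 1 replica each, 6 data nodes.
    cap = shards_per_node_cap(primaries=10, replicas=1, data_nodes=6)
    print(allocation_settings_body(cap))
```

Note that this is a hard limit: if the cap is too tight (for example after losing a node), some shards can stay unassigned, so a little headroom is probably worth it.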
