I have a cluster with over 20k total shards (each index has 30 shards plus 1 replica and is about 300GB) on 18 data nodes with ~24 cores on each node. We are also indexing 10K messages per second all day long (about 1TB of data a day).
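For scale, here is a rough back-of-the-envelope calculation from the figures above. It is only a sketch: it assumes "1TB" means 10**12 bytes, that the 20k figure counts replicas, and that shards are spread evenly across the 18 data nodes. The function names are made up for illustration.

```python
# Figures quoted in the post. "1TB" is taken as 10**12 bytes (assumption).
TOTAL_SHARDS = 20_000
DATA_NODES = 18
MESSAGES_PER_SECOND = 10_000
BYTES_PER_DAY = 10**12
SECONDS_PER_DAY = 24 * 60 * 60


def shards_per_node(total_shards, nodes):
    """Average shard count per data node, assuming an even spread."""
    return total_shards / nodes


def average_message_bytes(bytes_per_day, messages_per_second):
    """Average size of one indexed message implied by the daily volume."""
    return bytes_per_day / (messages_per_second * SECONDS_PER_DAY)


if __name__ == "__main__":
    print("shards per node:", round(shards_per_node(TOTAL_SHARDS, DATA_NODES)))
    print("avg message bytes:",
          round(average_message_bytes(BYTES_PER_DAY, MESSAGES_PER_SECOND)))
```

Under those assumptions each data node holds on the order of a thousand shards, and the average message is a little over 1KB.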
When I _open a couple of indexes at a time, the cluster rebalances for a while.
When doing maintenance on a node, it takes forever for the cluster to rebalance.
While the rerouting/rebalancing/recovery is happening, my indexing slows way down.
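One commonly used way to limit shard movement around node maintenance is to restrict allocation before stopping the node and restore it afterwards. The sketch below only builds the JSON bodies for a PUT to /_cluster/settings; it does not contact a cluster. `cluster.routing.allocation.enable` is a real Elasticsearch setting, but whether it is available depends on the version in use, the choice of "transient" scope is an assumption, and the helper name is hypothetical.

```python
import json

# Values accepted by cluster.routing.allocation.enable.
ALLOWED_MODES = {"all", "primaries", "new_primaries", "none"}


def allocation_enable_body(mode):
    """Build a cluster-settings body that sets the allocation mode.

    Hypothetical helper: it only constructs the request body that would be
    sent with PUT /_cluster/settings.
    """
    if mode not in ALLOWED_MODES:
        raise ValueError("unknown allocation mode: %r" % (mode,))
    return {"transient": {"cluster.routing.allocation.enable": mode}}


if __name__ == "__main__":
    # Before taking the node down: stop the cluster from moving shards.
    print(json.dumps(allocation_enable_body("none")))
    # After the node has rejoined: allow allocation again.
    print(json.dumps(allocation_enable_body("all")))
```

With allocation restricted, the shards of the stopped node stay unassigned instead of being rebuilt elsewhere, so there should be less to recover when the node comes back.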
So here are my questions:
I know there are heuristics that determine when the cluster chooses to rebalance, but I don't understand the meaning of the numbers, so I am afraid to touch them. Are there any resources that describe them better (or should I be looking somewhere else)?
I have looked at the thread queues but don't see any threads being maxed out during the rebalancing. I have also played with the concurrent rebalancing settings at the cluster level, but doing it slow (concurrent rebalancing of 2 or 3) or fast (30+) seems to have the same impact.
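For reference, here is a sketch of the kind of cluster-level settings update described above. The three setting names are real Elasticsearch settings, but which of them were actually changed is an assumption, the values are purely illustrative, and the helper name is hypothetical. The code only builds the body for a PUT to /_cluster/settings.

```python
import json


def rebalance_throttle_body(concurrent_rebalance, node_concurrent_recoveries,
                            max_bytes_per_sec):
    """Build a cluster-settings body for the rebalance and recovery throttles.

    Hypothetical helper; the scope ("transient") is an assumption.
    """
    return {
        "transient": {
            # How many shard rebalances may run at once across the cluster.
            "cluster.routing.allocation.cluster_concurrent_rebalance":
                concurrent_rebalance,
            # How many concurrent shard recoveries a single node may take part in.
            "cluster.routing.allocation.node_concurrent_recoveries":
                node_concurrent_recoveries,
            # Per-node bandwidth cap for recovery traffic, e.g. "40mb".
            "indices.recovery.max_bytes_per_sec": max_bytes_per_sec,
        }
    }


if __name__ == "__main__":
    # "Slow" profile, similar to the 2-3 concurrent rebalances mentioned above.
    print(json.dumps(rebalance_throttle_body(2, 2, "40mb"), indent=2))
```

If changing the concurrency makes no visible difference, that may be a hint that something else, such as the recovery bandwidth cap or the disks themselves, is the limiting factor.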
Well, I did find one issue: I was CPU/IO bound. I added 4 more LUNs and went from 50Mbps to 500Mbps, so that should help once I finish splitting the 15 nodes.