I am running a 16 node cluster, broken down as follows:
3 master nodes
3 search nodes
10 data nodes
~17k shards
Elasticsearch version 2.0.0
Over the last few weeks, we have been noticing something odd. Whichever node has been elected master would not garbage collect when its heap hit 75%, as expected. Eventually the heap would rise until it hit 100%, and then the node would time out, causing cluster instability. Restarting the node at any point fixes this (including restarting it preemptively), but, from my understanding, it should be GCing itself during that time. The other master nodes are doing very little (their heap percentage is around 2-4%), and the data nodes are GCing perfectly fine.
You should have something in the log about GCs running. The JVM gets pretty desperate about running them. The next time this happens you can monitor it from the outside with jstat -gcutil <pid> 1s 100. Post that here. You will probably see that it is trying to GC over and over and over again and not freeing up any memory.
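For example (a minimal sketch; substitute the actual Elasticsearch process id, and the exact columns vary a little between JVM versions):

    # sample heap utilization and GC counters every second, 100 times
    jstat -gcutil <pid> 1s 100

If the old generation column (O) stays pinned near 100 while the full GC count and time (FGC / FGCT) keep climbing, the collector is running constantly and reclaiming almost nothing.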
~17k shards is quite a few. I don't have hard and fast figures on how much space that is likely to take up. How many fields are in each index? Can you just dump the mapping and get its size? It is stored in memory as Java objects, so the size as JSON is only indirectly related.
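A rough way to get at that (assuming the node listens on the default localhost:9200; the JSON size is only a proxy for the in-heap footprint):

    # pull every mapping in the cluster and measure the JSON payload in bytes
    curl -s 'localhost:9200/_mapping' | wc -c

    # very crude field count: occurrences of "type" in the mappings
    curl -s 'localhost:9200/_mapping' | grep -o '"type"' | wc -l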
You can also take a heap snapshot when the node is busted and crack it open with MAT and see what is in there. Analyzing those is "fun" but possible.
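If you want to grab a dump the next time it wedges (assuming the JDK tools are on the box and you have disk space roughly the size of the heap):

    # write a binary heap dump that MAT can open
    jmap -dump:format=b,file=/tmp/es-heap.hprof <pid>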
What about pending tasks? Do you have lots of those when things go bad?
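Something like this (again assuming localhost:9200) will show whether they are piling up:

    # pending cluster-level tasks, human readable
    curl -s 'localhost:9200/_cat/pending_tasks?v'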