How's it going guys/gals.
I am running a production Elasticsearch (elasticsearch-1.5.0.jar) cluster consisting of 6 nodes.
One node is a so-called "load balancer" node (lb.es): it is not a data node, and all HTTP and transport client traffic goes through it before being routed to the data nodes (esdata-001 through esdata-005). The data nodes are all master-eligible, with minimum_master_nodes set to 3.
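For reference, this topology maps onto a handful of elasticsearch.yml settings in 1.x. The node names are mine and "five master-eligible data nodes" is an assumption based on the cluster size; a sketch:

```yaml
# elasticsearch.yml on lb.es -- a client-only ("load balancer") node:
# it holds no data and is never elected master, but still joins the
# cluster and routes requests to the data nodes.
node.data: false
node.master: false

# elasticsearch.yml on each esdata-00x -- data node, master-eligible:
node.data: true
node.master: true

# with five master-eligible nodes, a quorum of 3 avoids split-brain:
discovery.zen.minimum_master_nodes: 3
```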
Each data node has 31 GB of heap allocated (31 GB rather than 32 GB, so the JVM stays under the compressed-oops pointer-size threshold) out of 128 GB of physical RAM.
The load-balancer node has 14 GB of its 31 GB of available RAM allocated to the JVM.
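Since heap growth is the core symptom here, I watch per-node heap via the nodes-stats API. A minimal sketch of what I poll, assuming the client node answers HTTP on the default port (the hostname is mine):

```python
import json
from urllib.request import urlopen

def heap_used_percent(node_stats):
    """Map node name -> JVM heap_used_percent from a _nodes/stats/jvm response."""
    return {
        info["name"]: info["jvm"]["mem"]["heap_used_percent"]
        for info in node_stats["nodes"].values()
    }

if __name__ == "__main__":
    # pull JVM stats for all nodes through the load-balancer node
    stats = json.load(urlopen("http://lb.es:9200/_nodes/stats/jvm"))
    for name, pct in sorted(heap_used_percent(stats).items()):
        print("%-12s heap %3d%%" % (name, pct))
```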
On this cluster we index about 300 million documents per day. The documents are Apache log lines, indexed via Logstash, and we rotate the index daily at midnight. We also delete indices older than 30 days, once per day.
The peak insert rate is around 20,000 documents/sec; the average is about 5,000-10,000 docs/sec.
We also run 1-5 queries per second that extract data from recent indices (usually from within the last 5 days, so the caches are warm), and that data is then processed by scripts etc.
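The daily cleanup amounts to deleting the one index that falls off the retention window. A sketch of that step, assuming the standard Logstash `logstash-YYYY.MM.dd` naming and my client-node hostname:

```python
import datetime as dt
from urllib.request import Request, urlopen

def expired_index(today, retention_days=30, prefix="logstash-"):
    """Name of the daily index that falls off the retention window."""
    cutoff = today - dt.timedelta(days=retention_days)
    return prefix + cutoff.strftime("%Y.%m.%d")

if __name__ == "__main__":
    index = expired_index(dt.date.today())
    # issue a DELETE for the expired daily index via the client node
    urlopen(Request("http://lb.es:9200/%s" % index, method="DELETE"))
```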
Anyway, the system works fine, but as I monitor the ES heap I see it grow steadily across all nodes (after optimizations, about 1-2% per day, starting from an average of around 50% across nodes).
When used heap reaches 80-90%, the number of slowlog entries for both indexing and search increases (search produces more entries on average at that point, probably because I reduced the number of shards per index from 100 to 30, which improved indexing performance, so it's a tradeoff).
As the slowlog entries pile up, the system eventually becomes more or less unresponsive. After that it's a matter of hours before the heap fills completely and the system starts thrashing and ultimately stops responding.
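Part of why the shard count matters for heap: every shard copy is an open Lucene index with its own fixed overhead. A back-of-envelope count of open shard copies before and after my change, under the assumption of 1 replica and 30 retained daily indices:

```python
def total_shards(shards_per_index, indices_retained, replicas=1):
    """Total shard copies (primaries + replicas) the cluster keeps open."""
    return shards_per_index * indices_retained * (1 + replicas)

# assumed: 1 replica, 30 daily indices retained
before = total_shards(100, 30)  # 100 primaries/index -> 6000 copies
after = total_shards(30, 30)    # 30 primaries/index  -> 1800 copies
```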
Anyway, for now I do scheduled cluster restarts (which take about 10 minutes) whenever average heap usage reaches about 85%. That limits downtime to roughly 10 minutes per week, but it's really a pain in the ass.
This is where I need your help, as I have already applied pretty much all the suggestions from the documentation and most of the production-tuning guides...
Please let me know if you have any ideas I can test in production (I also have one spare machine I can set up as a test environment, if needed).
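For what it's worth, I bracket the restart by disabling shard allocation so shards aren't pointlessly rebalanced while nodes bounce. A sketch of the settings calls I use (the hostname is mine, and the actual node restarts happen out of band):

```python
import json
from urllib.request import Request, urlopen

ES = "http://lb.es:9200"  # assumed client-node address

def cluster_setting(key, value):
    """Request body for a transient cluster-settings update."""
    return json.dumps({"transient": {key: value}}).encode()

def put_settings(body):
    return urlopen(Request(ES + "/_cluster/settings", data=body, method="PUT"))

if __name__ == "__main__":
    # stop shard shuffling while nodes bounce...
    put_settings(cluster_setting("cluster.routing.allocation.enable", "none"))
    # ... restart the nodes one by one here, then re-enable allocation:
    put_settings(cluster_setting("cluster.routing.allocation.enable", "all"))
```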