I currently run an 11-node cluster (5 of which are data nodes) with about 58,000 shards totaling roughly 5 TB of data. We are running Elasticsearch 1.7.1 at the moment. We are seeing long garbage collection pauses (40-60 seconds) on a lot of our data nodes. This causes certain API functions, like searching, to time out, and puts the cluster into a funky state. We ran into this a few months back and solved it by scaling from 3 data nodes to 5. Right now, adding more data nodes is a slow process (it takes a while to get the hardware). So I have a few questions:
- Is there a way to decrease GC time other than adding data nodes?
- Is there a way to capture something like 'Last GC Time Per Node' from Elasticsearch's API? Right now we push everything from /_nodes/stats into Grafana, but the only GC stats are 'Total time spent GC' and 'GC Count', neither of which tells us how long the most recent pause lasted. It would be nice to have an API call to check it, so we can be alerted when it crosses a threshold. The only place I can find it is the log files.
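One workaround I've been sketching, since the node stats only expose cumulative counters: poll /_nodes/stats/jvm twice and divide the deltas, which gives the average pause per collection over the interval (close enough to "last GC time" for alerting). This is a rough sketch; the field names follow the 1.x node-stats format (`jvm.gc.collectors.old.collection_count` / `collection_time_in_millis`), so verify them against your cluster's actual output:

```python
# Sketch: derive average old-gen GC pause from two samples of the
# cumulative counters in /_nodes/stats/jvm. The snapshot dicts below
# mimic the per-node 'jvm' object; in practice you'd fetch them over HTTP.

def avg_old_gen_pause_ms(prev, curr):
    """Average old-gen pause (ms) between two node-stats snapshots.

    Returns None if no old-gen collections happened in the interval.
    """
    p = prev["gc"]["collectors"]["old"]
    c = curr["gc"]["collectors"]["old"]
    count = c["collection_count"] - p["collection_count"]
    if count <= 0:
        return None
    time_ms = c["collection_time_in_millis"] - p["collection_time_in_millis"]
    return time_ms / count

# Made-up numbers: 2 old-gen GCs took 90s total between the two polls.
prev = {"gc": {"collectors": {"old": {"collection_count": 10,
                                      "collection_time_in_millis": 120000}}}}
curr = {"gc": {"collectors": {"old": {"collection_count": 12,
                                      "collection_time_in_millis": 210000}}}}
print(avg_old_gen_pause_ms(prev, curr))  # → 45000.0  (45s average pause)
```

Since we already have the raw counters in Grafana, the same thing should be achievable there with a rate/derivative of 'Total time spent GC' divided by the rate of 'GC Count', but I'd still prefer a first-class metric if one exists.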