Last week we had to restart three of the data nodes in our cluster. These three nodes hold our main index (~1 billion documents, ~450GB across 10 shards with 1 replica).
Since the restart, all queries to the cluster have been sluggish, especially those against the main index. Even a simple search either times out or takes between 100 and 300 seconds.
I added 2 more data nodes into the mix, but they don't seem to help.
After the restart, one node ended up with all the primary shards of the main index. I'm not sure if that is the cause of the performance degradation.
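For reference, this is how I'm checking the shard distribution (assuming the cluster is reachable on localhost:9200, and using `main-index` as a stand-in for our real index name):

```shell
# List every shard of the index with its role and node.
# prirep shows "p" for primary and "r" for replica; in my case
# all the "p" rows point at the same node.
curl -s 'localhost:9200/_cat/shards/main-index?v&h=index,shard,prirep,state,docs,node'
```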
Another curiosity is that Kibana now fails to load. It errors out while fetching its settings, even though the same _mget query succeeds when I run it manually. Not sure if it's related.
Using Elasticsearch 1.7.1, and Kibana 4.1.2.
The settings we arrived at through trial and error (circuit breaker limits) have persisted across the restart, so I'm not sure what is causing this.
There are 4 other data nodes in the cluster, but they don't hold the problematic indices and they weren't restarted either.
Can anyone tell me how I can diagnose why queries are so slow? I have slow logging enabled on all the nodes.
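In case it matters, these are the slow log thresholds I have set (applied as index settings; the index name and threshold values here are just examples of what we use):

```shell
# Search slow log thresholds on the main index ("main-index" is a placeholder).
# Queries slower than these values get written to the slow log on each node.
curl -s -XPUT 'localhost:9200/main-index/_settings' -d '{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "2s",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}'
```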
We have allocated more than 50% of the RAM to the ES heap (12G out of 15.25G). This was done incrementally over a period of a few months (from 7G to 12G). I'm guessing this could be a problem, but we hadn't had any query problems before last week, so I'm not sure how it would suddenly become a factor.
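This is what I'm using to watch heap usage per node (again assuming localhost:9200; if the heap is pinned near the limit on the restarted nodes, I'd expect heavy GC to explain some of the slowness, but I haven't confirmed that yet):

```shell
# Per-node heap and RAM usage; a heap.percent that stays in the
# high 90s usually means the node is spending its time in GC.
curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max,ram.percent'
```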
When a query is running, the load average spikes, but CPU usage remains flat.
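Since load can climb without CPU doing the work, I've been watching iowait on the data nodes while a query runs (plain OS tooling, nothing ES-specific):

```shell
# Extended disk stats, one sample per second for five seconds.
# High %iowait with flat %user would point at disk rather than CPU.
iostat -x 1 5
```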
Our use case is mostly indexing-heavy. Most of the heavy queries/aggregations run as scripts once a day that write aggregated data into other indices, which are then queried (using either Kibana or Grafana).