I set up a new small cluster:
2 x r3.large boxes (2 cores, 15GB RAM each). One is master+data, the second is data-only. Heap is 8GB on both.
I reindexed a smallish (1.5GB) index from a cluster we're moving away from, as a starting point to test with.
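For context, the reindex was done with a request along these lines (a sketch only; host and index names here are placeholders, not the real ones):

```
POST _reindex
{
  "source": {
    "remote": {
      "host": "http://old-cluster.example.com:9200"
    },
    "index": "my-index"
  },
  "dest": {
    "index": "my-index"
  }
}
```

The reindex itself completed without errors.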
In Kibana (installed on a 3rd server in the same cluster), if I open Discover and set the time picker to "This week" (with an index pattern that matches only this single index):
a) The query takes 7-10 minutes to run
b) The Monitoring and Management pages become unresponsive (in the same or a different browser tab, until the query has completed) and dump a series of "timed out" error messages after 30s
c) Only one node (it varies between the two) seems to handle the query, and emits a GC-overhead warning every second for the duration; e.g.:
[2017-01-27T21:29:48,477][WARN ][o.e.m.j.JvmGcMonitorService] [node_001] [gc] overhead, spent [622ms] collecting in the last [1s]
[2017-01-27T21:29:49,498][WARN ][o.e.m.j.JvmGcMonitorService] [node_001] [gc] overhead, spent [601ms] collecting in the last [1s]
[2017-01-27T21:29:50,520][WARN ][o.e.m.j.JvmGcMonitorService] [node_001] [gc] overhead, spent [607ms] collecting in the last [1s]
d) CPU on the node doing the work goes to 100% for the duration
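While the query is running, I can grab a hot-threads dump from the busy node; something like this, if it would help diagnose where the CPU time is going (happy to post the output):

```
GET _nodes/hot_threads?threads=5&interval=1s
```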
Looking at the monitoring page in retrospect, there's no heap pressure at all (around 2GB used of the 8GB heap, perhaps expected given the index size).
What's going on with this cluster, and how do I debug it further? Something is obviously misconfigured; there should be plenty of headroom.
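One thing I'm considering as a next step is enabling the search slow log on the index, to see whether the time is spent in the query or the fetch phase. The thresholds below are just a guess, and "my-index" is a placeholder for the real index name:

```
PUT /my-index/_settings
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}
```

Is that the right avenue, or is there something more obvious to check first?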