For a few days now I have been seeing a lot of these messages in the log file:
[2018-03-20T14:35:04,269][INFO ][o.e.m.j.JvmGcMonitorService] [7xHqegG] [gc][3661] overhead, spent [373ms] collecting in the last [1.1s]
[2018-03-20T14:36:03,578][INFO ][o.e.m.j.JvmGcMonitorService] [7xHqegG] [gc][3720] overhead, spent [286ms] collecting in the last [1s]
[2018-03-20T14:36:05,608][INFO ][o.e.m.j.JvmGcMonitorService] [7xHqegG] [gc][3722] overhead, spent [297ms] collecting in the last [1s]
These eventually turn into errors like these:
[2018-03-20T13:06:20,875][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [7xHqegG] collector [cluster-stats-collector] timed out when collecting data
[2018-03-20T13:08:21,702][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [7xHqegG] collector [cluster-stats-collector] timed out when collecting data
[2018-03-20T13:09:18,651][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [7xHqegG] collector [cluster-stats-collector] timed out when collecting data
[2018-03-20T13:10:30,348][ERROR][o.e.x.m.c.i.IndexStatsCollector] [7xHqegG] collector [index-stats-collector] timed out when collecting data
[2018-03-20T13:13:32,115][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [7xHqegG] collector [cluster-stats-collector] timed out when collecting data
[2018-03-20T13:31:10,813][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [7xHqegG] collector [cluster-stats-collector] timed out when collecting data
[2018-03-20T13:51:23,536][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [7xHqegG] collector [cluster-stats-collector] timed out when collecting data
I read a suggestion on the forum that increasing the heap size would solve this, so I increased it from 30GB to 45GB in total (15GB per server).
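In case it helps with the diagnosis, here is a minimal sketch of how you could watch heap and GC pressure via the nodes stats API. It is written in Python with the `requests` library and assumes an unsecured cluster reachable on `localhost:9200`; the equivalent `GET _nodes/stats/jvm` request from curl or Kibana Dev Tools works just as well.

```python
import requests

# Nodes stats API: per-node JVM heap usage and garbage-collection counters.
# Assumption: Elasticsearch listens on localhost:9200 without authentication.
resp = requests.get("http://localhost:9200/_nodes/stats/jvm")
resp.raise_for_status()

for node_id, node in resp.json()["nodes"].items():
    heap_pct = node["jvm"]["mem"]["heap_used_percent"]
    old_gc = node["jvm"]["gc"]["collectors"]["old"]
    print(f'{node["name"]}: heap {heap_pct}% used, '
          f'old GC runs={old_gc["collection_count"]}, '
          f'old GC time={old_gc["collection_time_in_millis"]}ms')
```

If the heap stays near its limit and old-generation GC time keeps climbing between runs, the GC overhead log lines above are exactly what you would expect to see.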
A 30GB heap is already incredibly large. The ~1300 shards jump out at me first. That is a big number, especially for so little data. I would try to reduce the number of indices and the number of shards per index by 10x and see if your problem goes away.
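As a starting point, something like the following sketch (same Python/`requests` and `localhost:9200` assumptions as above) pulls the shard counts from the cluster health API and the per-index sizes from `_cat/indices`, and works out the average primary shard size:

```python
import requests

ES = "http://localhost:9200"  # assumption: local, unsecured cluster

health = requests.get(f"{ES}/_cluster/health").json()
print(f'{health["active_shards"]} active shards '
      f'({health["active_primary_shards"]} primaries) '
      f'across {health["number_of_data_nodes"]} data nodes')

# Primary shard count and primary store size per index, in bytes, as JSON rows.
indices = requests.get(
    f"{ES}/_cat/indices",
    params={"format": "json", "bytes": "b", "h": "index,pri,pri.store.size"},
).json()

total_pri_bytes = sum(int(i["pri.store.size"] or 0) for i in indices)
total_pri_shards = sum(int(i["pri"]) for i in indices)
avg_gb = total_pri_bytes / max(total_pri_shards, 1) / 1024 ** 3
print(f"average primary shard size: {avg_gb:.2f} GB")
```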
Have you looked at using size-based Rollover Indices instead of time-based indices? Dividing so little data across so many indices and shards is like cutting birthday cake into slices: the more cuts you make, the more cake ends up on the knife.
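For reference, a rough sketch of the rollover setup (Python with `requests`; the index name `logs-000001`, the alias `logs_write`, and the cluster address are all just example assumptions): you create the first index in the series with a write alias, index only through the alias, and later let the rollover API create `logs-000002` and so on when a condition is met.

```python
import requests

ES = "http://localhost:9200"  # assumption: local, unsecured cluster

# Create the first index in the series and point a write alias at it.
# Rollover will later create logs-000002, logs-000003, ... automatically.
resp = requests.put(
    f"{ES}/logs-000001",
    json={"aliases": {"logs_write": {}}},
)
resp.raise_for_status()

# Applications should index through the alias (logs_write),
# never through the concrete index name.
```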
In most discussions about appropriate index and shard sizing, it is the shard size that matters most. An average shard size of 12GB sounds quite reasonable, but based on the data you provided, you seem to have an average shard size of less than 500MB, which is small.
If you are using the rollover index API and target a certain shard size, you naturally need to take into account how many primary shards the new index will have when you make your calculation.
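To make that concrete, here is a hedged sketch under the same Python/`requests` assumptions as above (alias name `logs_write` is hypothetical, and the size-based `max_size` condition requires Elasticsearch 6.1 or later): if the new index gets 3 primary shards and you aim for roughly 12GB per shard, you would roll over at about 3 × 12GB = 36GB of primary store size.

```python
import requests

ES = "http://localhost:9200"          # assumption: local, unsecured cluster
PRIMARY_SHARDS = 3                    # shards the rolled-over index will have
TARGET_SHARD_SIZE_GB = 12             # desired size per primary shard

max_size = f"{PRIMARY_SHARDS * TARGET_SHARD_SIZE_GB}gb"   # "36gb" in this example

# Roll the alias over once the current index is big enough (or old enough,
# as a safety net). max_size refers to the total primary store size of the
# index, hence number_of_shards * target shard size.
resp = requests.post(
    f"{ES}/logs_write/_rollover",
    json={
        "conditions": {
            "max_size": max_size,
            "max_age": "30d",
        },
        "settings": {"index.number_of_shards": PRIMARY_SHARDS},
    },
)
print(resp.json())
```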
That depends on the workload. Once you reduce the number of shards, you should be able to install monitoring and see whether the heap is still under too much pressure by looking at heap usage over time.