Garbage collector question

Nodes: 3
Indices: 142
Memory: 17GB / 45GB
Total Shards: 1284
Documents: 442,160,568
Data: 578GB
Version: 5.3.0

For a few days now I have been seeing a lot of these messages in the log file:

[2018-03-20T14:35:04,269][INFO ][o.e.m.j.JvmGcMonitorService] [7xHqegG] [gc][3661] overhead, spent [373ms] collecting in the last [1.1s]
[2018-03-20T14:36:03,578][INFO ][o.e.m.j.JvmGcMonitorService] [7xHqegG] [gc][3720] overhead, spent [286ms] collecting in the last [1s]
[2018-03-20T14:36:05,608][INFO ][o.e.m.j.JvmGcMonitorService] [7xHqegG] [gc][3722] overhead, spent [297ms] collecting in the last [1s]

These are eventually followed by errors like these:
[2018-03-20T13:06:20,875][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [7xHqegG] collector [cluster-stats-collector] timed out when collecting data
[2018-03-20T13:08:21,702][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [7xHqegG] collector [cluster-stats-collector] timed out when collecting data
[2018-03-20T13:09:18,651][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [7xHqegG] collector [cluster-stats-collector] timed out when collecting data
[2018-03-20T13:10:30,348][ERROR][o.e.x.m.c.i.IndexStatsCollector] [7xHqegG] collector [index-stats-collector] timed out when collecting data
[2018-03-20T13:13:32,115][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [7xHqegG] collector [cluster-stats-collector] timed out when collecting data
[2018-03-20T13:31:10,813][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [7xHqegG] collector [cluster-stats-collector] timed out when collecting data
[2018-03-20T13:51:23,536][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [7xHqegG] collector [cluster-stats-collector] timed out when collecting data

I read a suggestion on the forum that increasing the heap size would solve this, which I did: from 30GB to 45GB in total (15GB per server).
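
For reference, the heap on each node is set the usual way in config/jvm.options; 15GB per server currently looks like this:

# min and max heap, kept equal
-Xms15g
-Xmx15g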

But I am still seeing the errors occur.

Any ideas on how to debug/resolve this?

thanks,

I have also tried reducing the heap size, but I am still seeing the GC overhead messages.

A 30GB heap is already incredibly large. The ~1300 shards jump out at me first; that is a big number, especially for so little data. I would try to reduce the number of indices and the number of shards per index by 10x and see if your problem goes away.
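
A quick way to see how the data is actually spread over those shards is the cat shards API, e.g. sorted by store size:

GET /_cat/shards?v&h=index,shard,prirep,store,node&s=store:desc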

Thanks for the reply.

The indices are just daily logstash and metricbeat indices, so I don't think I can reduce the number of indices.

But I guess I can reduce shards.

I've set the number of shards to 2 for the logstash indices for now.

(Average logstash index is about 12GB)
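
For new daily indices that kind of change is typically applied through an index template that overrides the logstash defaults; roughly something like this (the template name here is just an example):

PUT /_template/logstash-2-shards
{
  "template": "logstash-*",
  "order": 1,
  "settings": {
    "index.number_of_shards": 2,
    "index.number_of_replicas": 1
  }
}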

Have you looked at using size-based Rollover Indices instead of time-based indices? Dividing so little data across so many indices and shards is like cutting birthday cake into slices: the more cuts you make, the more cake ends up on the knife.
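
As a minimal sketch of the pattern (index and alias names are just examples): you bootstrap one index behind a write alias, then call _rollover periodically with your conditions. Note that on 5.x the built-in conditions are max_age and max_docs; a true size-based max_size condition only arrived in a later release, so on 5.3 document count is the closest proxy for size.

PUT /logs-000001
{
  "aliases": { "logs_write": {} }
}

POST /logs_write/_rollover
{
  "conditions": {
    "max_age": "7d",
    "max_docs": 50000000
  }
}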

I hadn't, but I am now 🙂

However, I thought ~12GB indices were decently sized? The rollover API docs also reference 5GB as a turnover point. That might just be a bad example, though.

Thanks

In most discussions about appropriate index and shard sizes, it is the shard size that matters most. An average shard size of 12GB sounds quite reasonable, but based on the data you provided, you seem to have an average shard size of less than 500MB, which is small.

If you are using the rollover index API and target a certain shard size, you naturally need to take into account how many primary shards the index will have when you do the calculation.
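
For example, if each index gets 2 primary shards and you aim for ~6GB per shard, you would roll over at roughly 2 x 6GB = 12GB of primary data; with 5 primary shards the same per-shard target would mean rolling over at around 30GB.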

Indeed. I've cleaned up the indices.

Ah, you are talking about shard sizes, while I was talking about index size.

So the average logstash index is about 12GB, and with the modifications I made yesterday, each index gets 2 shards and 1 replica.

Does that sound reasonable?

If you had an average shard size of 6GB, you would have around 100 shards instead of 1300, which sounds much more reasonable for that data volume.
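
(Roughly: 578GB / 6GB ≈ 96 shards, i.e. on the order of 100.)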

Do you think 10GB heap per node is too much?

Because I am still seeing the "[INFO ][o.e.m.j.JvmGcMonitorService] [7xHqegG] [gc][3722] overhead, spent [297ms] collecting in the last [1s]" messages.

That depends on the workload. Once you have reduced the number of shards, you should be able to install monitoring and see whether the heap is too large by looking at heap usage over time.
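
Even without the monitoring UI you can spot-check heap pressure by polling the cat nodes API now and then, for example:

GET /_cat/nodes?v&h=name,heap.percent,heap.current,heap.max,ram.percent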

I already have monitoring installed.

The only noticeable thing is the regular peaks in heap usage:
