For a few days now I have been seeing a lot of these messages in the log file:
[2018-03-20T14:35:04,269][INFO ][o.e.m.j.JvmGcMonitorService] [7xHqegG] [gc][3661] overhead, spent [373ms] collecting in the last [1.1s]
[2018-03-20T14:36:03,578][INFO ][o.e.m.j.JvmGcMonitorService] [7xHqegG] [gc][3720] overhead, spent [286ms] collecting in the last [1s]
[2018-03-20T14:36:05,608][INFO ][o.e.m.j.JvmGcMonitorService] [7xHqegG] [gc][3722] overhead, spent [297ms] collecting in the last [1s]
These eventually turn into errors like these:
[2018-03-20T13:06:20,875][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [7xHqegG] collector [cluster-stats-collector] timed out when collecting data
[2018-03-20T13:08:21,702][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [7xHqegG] collector [cluster-stats-collector] timed out when collecting data
[2018-03-20T13:09:18,651][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [7xHqegG] collector [cluster-stats-collector] timed out when collecting data
[2018-03-20T13:10:30,348][ERROR][o.e.x.m.c.i.IndexStatsCollector] [7xHqegG] collector [index-stats-collector] timed out when collecting data
[2018-03-20T13:13:32,115][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [7xHqegG] collector [cluster-stats-collector] timed out when collecting data
[2018-03-20T13:31:10,813][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [7xHqegG] collector [cluster-stats-collector] timed out when collecting data
[2018-03-20T13:51:23,536][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [7xHqegG] collector [cluster-stats-collector] timed out when collecting data
I read a suggestion on the forum that increasing the heap size would solve this, so I increased it from 30GB to 45GB in total (15GB per server).
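In case it helps with the diagnosis, here is a minimal sketch of how you could watch heap and GC pressure via the nodes stats API. It is written in Python with the `requests` library and assumes an unsecured cluster reachable on `localhost:9200`; the equivalent `GET _nodes/stats/jvm` request from curl or Kibana Dev Tools works just as well.

```python
import requests

# Nodes stats API: per-node JVM heap usage and garbage-collection counters.
# Assumption: Elasticsearch listens on localhost:9200 without authentication.
resp = requests.get("http://localhost:9200/_nodes/stats/jvm")
resp.raise_for_status()

for node_id, node in resp.json()["nodes"].items():
    heap_pct = node["jvm"]["mem"]["heap_used_percent"]
    old_gc = node["jvm"]["gc"]["collectors"]["old"]
    print(f'{node["name"]}: heap {heap_pct}% used, '
          f'old GC runs={old_gc["collection_count"]}, '
          f'old GC time={old_gc["collection_time_in_millis"]}ms')
```

If the heap stays near its limit and old-generation GC time keeps climbing between runs, the GC overhead log lines above are exactly what you would expect to see.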
A 30GB heap is already incredibly large. The ~1300 shards jump out at me first. That is a big number, especially for so little data. I would try to reduce the number of indices and the number of shards per index by 10x and see if your problem goes away.
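As a starting point, something like the following sketch (same Python/`requests` and `localhost:9200` assumptions as above) pulls the shard counts from the cluster health API and the per-index sizes from `_cat/indices`, and works out the average primary shard size:

```python
import requests

ES = "http://localhost:9200"  # assumption: local, unsecured cluster

health = requests.get(f"{ES}/_cluster/health").json()
print(f'{health["active_shards"]} active shards '
      f'({health["active_primary_shards"]} primaries) '
      f'across {health["number_of_data_nodes"]} data nodes')

# Primary shard count and primary store size per index, in bytes, as JSON rows.
indices = requests.get(
    f"{ES}/_cat/indices",
    params={"format": "json", "bytes": "b", "h": "index,pri,pri.store.size"},
).json()

total_pri_bytes = sum(int(i["pri.store.size"] or 0) for i in indices)
total_pri_shards = sum(int(i["pri"]) for i in indices)
avg_gb = total_pri_bytes / max(total_pri_shards, 1) / 1024 ** 3
print(f"average primary shard size: {avg_gb:.2f} GB")
```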
Have you looked at using size-based Rollover Indices instead of time-based indices? Dividing so little data across so many indices and shards is like cutting birthday cake into slices: the more cuts you make, the more cake ends up on the knife.
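For reference, a rough sketch of the rollover setup (Python with `requests`; the index name `logs-000001`, the alias `logs_write`, and the cluster address are all just example assumptions): you create the first index in the series with a write alias, index only through the alias, and later let the rollover API create `logs-000002` and so on when a condition is met.

```python
import requests

ES = "http://localhost:9200"  # assumption: local, unsecured cluster

# Create the first index in the series and point a write alias at it.
# Rollover will later create logs-000002, logs-000003, ... automatically.
resp = requests.put(
    f"{ES}/logs-000001",
    json={"aliases": {"logs_write": {}}},
)
resp.raise_for_status()

# Applications should index through the alias (logs_write),
# never through the concrete index name.
```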
In most discussions about appropriate index and shard sizing, it is the shard size that matters most. An average shard size of 12GB sounds quite reasonable, but based on the data you provided, you seem to have an average shard size of less than 500MB, which is small.
If you are using the rollover index API and target a certain shard size, you naturally need to take into account how many primary shards the new index will have when you make your calculation.
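To make that concrete, here is a hedged sketch under the same Python/`requests` assumptions as above (alias name `logs_write` is hypothetical, and the size-based `max_size` condition requires Elasticsearch 6.1 or later): if the new index gets 3 primary shards and you aim for roughly 12GB per shard, you would roll over at about 3 × 12GB = 36GB of primary store size.

```python
import requests

ES = "http://localhost:9200"          # assumption: local, unsecured cluster
PRIMARY_SHARDS = 3                    # shards the rolled-over index will have
TARGET_SHARD_SIZE_GB = 12             # desired size per primary shard

max_size = f"{PRIMARY_SHARDS * TARGET_SHARD_SIZE_GB}gb"   # "36gb" in this example

# Roll the alias over once the current index is big enough (or old enough,
# as a safety net). max_size refers to the total primary store size of the
# index, hence number_of_shards * target shard size.
resp = requests.post(
    f"{ES}/logs_write/_rollover",
    json={
        "conditions": {
            "max_size": max_size,
            "max_age": "30d",
        },
        "settings": {"index.number_of_shards": PRIMARY_SHARDS},
    },
)
print(resp.json())
```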
That depends on the workload. Once you reduce the number of shards, you should be able to install monitoring and see whether the heap is still under too much pressure by looking at heap usage over time.