Frequent old GC on some nodes in cluster

shjdwxy · May 9, 2019, 6:07am

Hi,
During the busy hours of the day, heap usage on some nodes in the ES cluster begins to rise and old GC becomes more frequent. Full GC is also triggered and lasts about 20-40 seconds.

old gc times:

Heap usage only returns to normal if I reduce the indexing rate. I took a heap dump and used MAT to run the Leak Suspects report.

ES cluster info:
ES version 5.4.3
3 * master node, 26 hot node, 52 cold node.
Only one hot node and 3 cold nodes hit the frequent GC problem at the same time.

Does the Leak Suspects report show any anomaly? I will provide more info if needed.

Thanks.

dadoonet · May 9, 2019, 6:30am

Hey.

Not sure it will solve your current problem but what about upgrading to the latest 5.x version which contains a lot of bug fixes?
Even better, upgrade to 6.x or better than better, upgrade to 7.0?

What is the output of:

GET /_cat/health?v
GET /_cat/indices?v
GET /_cat/shards?v

shjdwxy · May 9, 2019, 6:48am

Thanks @dadoonet
During the problem period, the cluster is in green state and there is no shard relocation.

Upgrading Elasticsearch is a huge task for us at the moment. We have 6 ES clusters and use a tribe node as a proxy, so we would have to upgrade all of the clusters.

Only one cluster has hit the GC problem recently, and the indexing load on this cluster is not very high in my opinion.

Do you think some bug in ES 5.4.3 caused this GC problem?

dadoonet · May 9, 2019, 7:23am

But could you answer the questions I asked?

shjdwxy · May 9, 2019, 7:28am

I will post the output of these requests the next time the GC problem happens.

Christian_Dahlqvist · May 9, 2019, 7:37am

Is this issue related to the same cluster discussed in this thread? If so, how much data do you have in the cluster? Have you followed the guidelines laid out in this webinar? It also seems like you have a quite high index and shard count, which could be contributing to heap pressure. Please see this blog post for some practical guidelines.

dadoonet · May 9, 2019, 7:41am

Please do it now. No need to wait.

shjdwxy · May 9, 2019, 7:48am

This issue is not related to this thread.
The ES cluster info:
ES version 5.4.3
3 * master node, 26 * hot node, 52 * cold node.
Each node has a 31GB heap.
2,910 indices, 6,518 shards

dadoonet · May 9, 2019, 7:49am

Is it different than your initial question?

ES cluster info:
ES version 5.4.3
3 * master node, 20 hot node, 40 cold node.

Also, on which node is the GC happening?
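
One way to answer the "which node" question is to compare the old-generation GC counters that the node stats API reports under `jvm.gc.collectors.old`. Below is a minimal Python sketch, assuming the JSON from `GET /_nodes/stats/jvm` has already been fetched and parsed into a dict. The function name, the node names, and all sample values are invented for illustration and are not taken from this cluster.

```python
from typing import Any, Dict, List, Tuple


def rank_nodes_by_old_gc(nodes_stats: Dict[str, Any]) -> List[Tuple[str, int, int, int]]:
    """Return (name, old GC count, old GC millis, heap used %) per node, worst first."""
    rows = []
    for node in nodes_stats.get("nodes", {}).values():
        jvm = node.get("jvm", {})
        old = jvm.get("gc", {}).get("collectors", {}).get("old", {})
        rows.append((
            node.get("name", "?"),
            old.get("collection_count", 0),
            old.get("collection_time_in_millis", 0),
            jvm.get("mem", {}).get("heap_used_percent", 0),
        ))
    # Sort by cumulative time spent in old GC, largest first.
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical input shaped like a nodes stats response (values invented).
sample = {
    "nodes": {
        "id-1": {
            "name": "node-a",
            "jvm": {
                "mem": {"heap_used_percent": 48},
                "gc": {"collectors": {"old": {
                    "collection_count": 3,
                    "collection_time_in_millis": 900,
                }}},
            },
        },
        "id-2": {
            "name": "node-b",
            "jvm": {
                "mem": {"heap_used_percent": 91},
                "gc": {"collectors": {"old": {
                    "collection_count": 120,
                    "collection_time_in_millis": 2400000,
                }}},
            },
        },
    }
}

ranked = rank_nodes_by_old_gc(sample)
for name, count, millis, heap_pct in ranked:
    print(f"{name}: old GC count={count}, time={millis} ms, heap used={heap_pct}%")
```

The counters are cumulative since each node started, so comparing two snapshots taken a few minutes apart during busy hours is likely more telling than a single reading.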

shjdwxy · May 9, 2019, 7:52am

GET /_cat/health?v

https://gist.github.com/wangxiangyu/32a3ac9de2496923782a31bf982d984f

epoch      timestamp cluster                status node.total node.data shards  pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1557388215 15:50:15  billions-online-jssz03 green          74        69   6518 4314    0    0        0             0                  -                100.0%

GET /_cat/indices?v

https://gist.github.com/wangxiangyu/58d309b3ae0e7e106ab35ba2e240b341

health status index                                                                    uuid                   pri rep  docs.count docs.deleted store.size pri.store.size
       close  billions-bplus-access-dynamic_bstore_comm_biz-@2019.04.28-jssz03-1       3hsq2QVVSdGfSjyHOmbcDA                                                           
green  open   billions-main.app-svr.resource-service-@2019.05.09-jssz03-1              HrNR7TFiShCjHlHe3wGaVw   2   0    29037170            0       49gb           49gb
       close  billions-ops.billions.fake_flow.000161-@2019.04.26-jssz03-1              PfovucqXTraoOycP9wfyuA                                                           
green  open   billions-ops.apm.cdn-qos-report-@2019.01.10                              3Wo_N9kCQpWlymFgrMIsqw   1   1       10000            0     12.9mb          6.4mb
green  open   billions-open-reconciliation-@2019.04.19-jssz03-1                        Y0HRckhaTNWaZlj72K2AaA   2   1      119668            0     43.6mb         21.8mb
green  open   billions-sli-@2019.05.08-jssz03-1                                        Qh0XFZtlS-iy4LpyNU1zsQ  20   1 29693136793            0      3.4tb          1.7tb
green  open   billions-mall-hawkeye-@2019.05.03-jssz03-1                               N67nS3klRIyLXZX_gpjqtw   2   1       18005            0      2.7mb          1.3mb
green  open   billions-mall-eureka-server-@2019.05.08-jssz03-1                         Q2XBh32-Rw2QjQ8_OMPnNg   2   1       44523            0      5.5mb          2.7mb
       close  billions-main.app-svr.app-view-@2019.05.03-jssz03-1                      0X83HU9wQECMB2jceT2NfA

(Output truncated; see the gist for the full listing.)

GET /_cat/shards?v

https://gist.github.com/wangxiangyu/c05df65e762b86095c5f11e176bf8a41

index                                                                    shard prirep state         docs    store ip           node
billions-sjptb-lancer-gateway-logstream-@2019.05.08-jssz03-1             1     p      STARTED   17071457      1gb 10.69.67.11  jssz-billions-es-16-datanode_stale
billions-sjptb-lancer-gateway-logstream-@2019.05.08-jssz03-1             1     r      STARTED   17071457      1gb 10.69.175.19 jssz-billions-es-48-datanode_stale
billions-sjptb-lancer-gateway-logstream-@2019.05.08-jssz03-1             0     r      STARTED   17053877      1gb 10.69.175.20 jssz-billions-es-49-datanode_stale
billions-sjptb-lancer-gateway-logstream-@2019.05.08-jssz03-1             0     p      STARTED   17053877      1gb 10.69.175.19 jssz-billions-es-48-datanode_stale01
billions-openplatform-gobase-@2019.05.08-jssz03-1                        1     r      STARTED     317903   52.3mb 10.69.175.19 jssz-billions-es-48-datanode_stale01
billions-openplatform-gobase-@2019.05.08-jssz03-1                        1     p      STARTED     317903   52.3mb 10.69.175.31 jssz-billions-es-55-datanode_stale01
billions-openplatform-gobase-@2019.05.08-jssz03-1                        0     p      STARTED     318170   52.3mb 10.69.34.16  jssz-billions-es-39-datanode_stale01
billions-openplatform-gobase-@2019.05.08-jssz03-1                        0     r      STARTED     318170   52.3mb 10.69.67.23  jssz-billions-es-21-datanode_stale
billions-open.ticket.open-settle-@2018.12.20-jssz03-1                    1     p      STARTED          1    5.2kb 10.69.67.24  jssz-billions-es-22-datanode_stale

(Output truncated; see the gist for the full listing.)

shjdwxy · May 9, 2019, 7:58am

I made a mistake in the initial question.

NOTE: Data is only indexed to the hot nodes. It is moved from the hot nodes to the cold nodes daily.

The nodes with gc problems are:
jssz-billions-es-40-datanode_hot
jssz-billions-es-22-datanode_stale
jssz-billions-es-39-datanode_stale01
jssz-billions-es-48-datanode_stale
jssz-billions-es-26-datanode_stale01
jssz-billions-es-24-datanode_stale

dadoonet · May 9, 2019, 8:32am

Thanks for sharing.
It looks like you have plenty of small shards. That may be something you should look at.
Some shards are overloaded IMO. For example:

billions-video.vod.playurl-@2019.05.09-jssz03-1	1	p	STARTED	89054241	131.2gb	10.69.175.31	jssz-billions-es-55-datanode_hot
billions-video.vod.playurl-@2019.05.09-jssz03-1	0	p	STARTED	89083881	131.2gb	10.69.175.32	jssz-billions-es-56-datanode_hot
billions-video.vod.playurl-@2019.05.09-jssz03-1	5	p	STARTED	88789888	131.3gb	10.69.67.14	jssz-billions-es-19-datanode_hot
billions-video.vod.playurl-@2019.05.09-jssz03-1	2	p	STARTED	88941625	133.4gb	10.69.34.17	jssz-billions-es-40-datanode_hot
billions-video.vod.playurl-@2019.05.09-jssz03-1	9	p	STARTED	88851385	133.5gb	10.69.67.20	jssz-billions-es-28-datanode_hot
billions-video.vod.playurl-@2019.05.09-jssz03-1	8	p	STARTED	89018956	133.8gb	10.69.67.18	jssz-billions-es-26-datanode_hot

We recommend no more than 50gb per shard.
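
To check the whole listing against that 50gb guideline, the `_cat/shards` text can be scanned with a short script. This is a rough Python sketch, assuming the default column order shown in the output above (index, shard, prirep, state, docs, store, ip, node). The helper names `parse_size` and `oversized_shards` are made up for this example.

```python
import re
from typing import List, Tuple

# Multipliers for the size suffixes that the _cat APIs print.
UNITS = {"b": 1, "kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3, "tb": 1024 ** 4}


def parse_size(text: str) -> float:
    """Convert a _cat style size such as '131.2gb' or '5.2kb' to bytes."""
    match = re.fullmatch(r"([\d.]+)([a-z]+)", text.strip().lower())
    if not match or match.group(2) not in UNITS:
        raise ValueError(f"unrecognised size: {text!r}")
    return float(match.group(1)) * UNITS[match.group(2)]


def oversized_shards(cat_shards: str, limit: str = "50gb") -> List[Tuple[str, int, str, str, str]]:
    """Return (index, shard, prirep, store, node) for started shards above the limit."""
    limit_bytes = parse_size(limit)
    hits = []
    for line in cat_shards.splitlines():
        cols = line.split()
        # Skip the header, blank lines, and shards without a full row (e.g. unassigned).
        if len(cols) < 8 or cols[0] == "index" or cols[3] != "STARTED":
            continue
        if parse_size(cols[5]) > limit_bytes:
            hits.append((cols[0], int(cols[1]), cols[2], cols[5], cols[7]))
    return hits


# Two rows copied from the listings in this thread.
listing = """index shard prirep state docs store ip node
billions-video.vod.playurl-@2019.05.09-jssz03-1 1 p STARTED 89054241 131.2gb 10.69.175.31 jssz-billions-es-55-datanode_hot
billions-openplatform-gobase-@2019.05.08-jssz03-1 1 r STARTED 317903 52.3mb 10.69.175.19 jssz-billions-es-48-datanode_stale01
"""

hits = oversized_shards(listing)
for hit in hits:
    print(hit)
```

Only the playurl shard is reported for these two rows; the 52.3mb shard is well under the limit.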

Not sure if you are using the rollover API, but I'd use it in your case to reduce the number of shards and keep each one around 50gb.
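
As a rough illustration of sizing a rollover, the Python sketch below estimates a `max_docs` condition from one of the shards listed above (89054241 docs in 131.2gb) and builds a request body for `POST /<alias>/_rollover`. Several things here are assumptions: the primary shard count of 10, the one-day `max_age`, and the idea of approximating a size target with a document count. As far as I know, 5.x rollover conditions cover age and document count, with a size condition arriving only in later releases, so check the documentation for your version.

```python
import json

GB = 1024 ** 3


def rollover_max_docs(sample_docs: int, sample_store_gb: float,
                      primaries: int, target_gb: float = 50.0) -> int:
    """Estimate the index-wide doc count at which each primary reaches target_gb."""
    bytes_per_doc = sample_store_gb * GB / sample_docs
    docs_per_shard = int(target_gb * GB / bytes_per_doc)
    # Assumes documents are spread evenly across the primaries.
    return docs_per_shard * primaries


# Shard figures taken from the listing above; primaries=10 is an assumption.
max_docs = rollover_max_docs(89054241, 131.2, primaries=10)

# Body for a rollover request; the alias it is sent to is up to you.
body = {"conditions": {"max_age": "1d", "max_docs": max_docs}}
print(json.dumps(body, indent=2))
```

With these inputs the estimate is on the order of 34 million documents per primary, which shows how far above the 50gb guideline the current daily index grows.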

The total number of shards/indices in your cluster also results, I think, in a very big cluster state. Those big "objects" need to be GC'ed at some point. Because you have a very big heap (31gb), the old GC can sadly take several minutes.

My opinion is that you should consider upgrading Elasticsearch and your JVM at some point. In 7.x you will have a more recent JVM with different GC algorithms.

But I'll be happy to hear other thoughts.

shjdwxy · May 9, 2019, 9:19am

Thanks for your reply.
I think a "big cluster state" may not be the cause of the GC problem. I have another cluster (B for short) with the same hardware size as this cluster (A for short), which has the GC problem. Cluster B has 13,508 indices and 21,615 shards, more than twice as many as cluster A. The indexing load of cluster B is also larger than that of cluster A. But cluster B has never hit the GC problem.

According to the Leak Suspects report, this suspect looks very suspicious.

What's your opinion?

Christian_Dahlqvist · May 9, 2019, 11:28am

What is the output of the cluster stats API?

shjdwxy · May 10, 2019, 8:24am

https://gist.github.com/wangxiangyu/ceae49818d29d6ae67fcd11f3aa550f7

{
  "_nodes": {
    "total": 74,
    "successful": 74,
    "failed": 0
  },
  "cluster_name": "billions-online-jssz03",
  "timestamp": 1557468503796,
  "status": "green",
  "indices": {

(Output truncated; see the gist for the full listing.)

Christian_Dahlqvist · May 10, 2019, 8:53am

It looks like you have a 3rd party SQL plugin installed. Do all environments have this? Is usage of this plugin consistent across the environments?

shjdwxy · May 10, 2019, 10:02am

I will try removing the SQL plugin; it is of no use to us now.

system · June 7, 2019, 10:02am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
