One node is stuck in GC

Hi

I am using Elasticsearch 2.4. The setup has 3 nodes, each with 36 CPUs and a 32GB ES heap plus 32GB left for the Lucene/filesystem cache.

The setup ran fine for 7 days under heavy, continuous traffic. After 7 days, one node started performing lots of young-gen GCs, each taking a long time (around 40s). Logs and hot threads are linked below. Any leads would help a lot.

CPU utilisation on that node is very high, and I am wondering what it is doing compared to the other nodes. Initially we had the replication factor set to 2; after seeing a lot of delay caused by ES, we reduced it to 1.

https://drive.google.com/drive/folders/10DVHNOq1yDxUUL3eas6W9PyYE-HC9jIF?usp=sharing

One of the nodes seems to be using swap, which could cause performance problems. It is recommended to disable swap on Elasticsearch nodes.
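A quick way to check and address this (a minimal sketch, assuming the cluster is reachable on localhost:9200; bootstrap.mlockall is the 2.x name of the memory-lock setting):

# Check whether each node has the heap locked in memory (reported as "mlockall")
curl -s 'localhost:9200/_nodes/process?pretty' | grep mlockall

# If swap cannot be removed entirely, lock the heap instead:
# add to elasticsearch.yml on each node and restart
#   bootstrap.mlockall: true

# Or disable swap at the OS level on the Elasticsearch hosts
sudo swapoff -a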

Hi

Swap has been disabled on all machines from the beginning. I have attached the configs, logs and hot threads; the merge thread seems to be very busy. We use spinning disks, not SSDs, and the merge thread count is set to 1.
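For reference, this is roughly how that merge setting can be verified and applied (a sketch: localhost:9200 and the template name metrics_merge are assumptions, and metrics-2019.02.13 is just one of the daily indices used as an example):

# Show the per-index settings, including index.merge.scheduler.max_thread_count
curl -s 'localhost:9200/metrics-2019.02.13/_settings?pretty'

# On spinning disks the usual recommendation is max_thread_count: 1,
# applied via an index template so newly created daily indices pick it up
curl -XPUT 'localhost:9200/_template/metrics_merge' -d '{
  "template": "metrics-*",
  "settings": {
    "index.merge.scheduler.max_thread_count": 1
  }
}'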

Why do we then see this in the node stats for node xCgbhMTmThK28AZG1nx7AA?

"os": {
  "timestamp": 1548077656890,
  "cpu_percent": 28,
  "load_average": 11.84,
  "mem": {
    "total_in_bytes": 126568382464,
    "free_in_bytes": 3714973696,
    "used_in_bytes": 122853408768,
    "free_percent": 3,
    "used_percent": 97
  },
  "swap": {
    "total_in_bytes": 8589930496,
    "free_in_bytes": 5121765376,
    "used_in_bytes": 3468165120
  }
}
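(That is the "os" block from the nodes stats API; it can be pulled for that single node like this, assuming the cluster is reachable on localhost:9200.)

curl -s 'localhost:9200/_nodes/xCgbhMTmThK28AZG1nx7AA/stats/os?pretty'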

Each machine also runs Kubernetes containers, so the swap usage comes from other processes. Node xCgbhMTmThK28AZG1nx7AA doesn't have any issue and is running properly. The issue is with node XNo15jgrSQupkQYFkpwU1Q.

Hi, I restarted the Elasticsearch node that was having the issue, and the issue does not seem to be appearing now. Can you give me some guidelines or ideas on what I should suspect or look into?

I see that the issue repeats after roughly 18 hours, and only on that particular node. Any help in figuring out what the cause could be would be appreciated.

Lots of GC is happening on only one node, and it is always the same node. All nodes are configured identically, so I am wondering what the cause could be.

[2019-01-26 11:49:09,963][WARN ][monitor.jvm ] [metrics-datastore-2] [gc][young][152972][3298] duration [11.6s], collections [1]/[12.5s], total [11.6s]/[14m], memory [17.8gb]->[6.9gb]/[29.4gb], all_pools {[young] [12.2gb]->[40.2mb]/[12.4gb]}{[survivor] [1.5gb]->[1.5gb]/[1.5gb]}{[old] [4gb]->[5.3gb]/[15.5gb]}

[2019-01-26 11:49:46,400][WARN ][monitor.jvm ] [metrics-datastore-2] [gc][young][153003][3299] duration [5.4s], collections [1]/[6.3s], total [5.4s]/[14.1m], memory [18.2gb]->[8.3gb]/[29.4gb], all_pools {[young] [11.3gb]->[52.6mb]/[12.4gb]}{[survivor] [1.5gb]->[1.5gb]/[1.5gb]}{[old] [5.3gb]->[6.7gb]/[15.5gb]}

[2019-01-26 11:50:17,880][WARN ][monitor.jvm ] [metrics-datastore-2] [gc][young][153034][3300] duration [1s], collections [1]/[1.4s], total [1s]/[14.1m], memory [20.4gb]->[8.4gb]/[29.4gb], all_pools {[young] [12.1gb]->[29.1mb]/[12.4gb]}{[survivor] [1.5gb]->[1.5gb]/[1.5gb]}{[old] [6.7gb]->[6.8gb]/[15.5gb]}

[2019-01-26 11:50:44,650][WARN ][monitor.jvm ] [metrics-datastore-2] [gc][young][153058][3301] duration [2.7s], collections [1]/[3.7s], total [2.7s]/[14.1m], memory [20.4gb]->[9.5gb]/[29.4gb], all_pools {[young] [11.9gb]->[17.9mb]/[12.4gb]}{[survivor] [1.5gb]->[1.5gb]/[1.5gb]}{[old] [6.8gb]->[7.9gb]/[15.5gb]}

[2019-01-26 11:52:11,776][WARN ][monitor.jvm ] [metrics-datastore-2] [gc][young][153130][3302] duration [3.5s], collections [1]/[3.8s], total [3.5s]/[14.2m], memory [19.2gb]->[7.4gb]/[29.4gb], all_pools {[young] [12.3gb]->[6.5mb]/[12.4gb]}{[survivor] [1.5gb]->[1.5gb]/[1.5gb]}{[old] [5.3gb]->[5.9gb]/[15.5gb]}

[2019-01-26 11:54:05,202][WARN ][monitor.jvm ] [metrics-datastore-2] [gc][young][153202][3303] duration [22.8s], collections [1]/[23.2s], total [22.8s]/[14.6m], memory [17.3gb]->[6.6gb]/[29.4gb], all_pools {[young] [12.2gb]->[94.9mb]/[12.4gb]}{[survivor] [1.5gb]->[1.5gb]/[1.5gb]}{[old] [3.5gb]->[5gb]/[15.5gb]}

[2019-01-26 11:57:03,648][WARN ][monitor.jvm ] [metrics-datastore-2] [gc][young][153294][3305] duration [43.7s], collections [2]/[44.2s], total [43.7s]/[15.3m], memory [18.9gb]->[8.7gb]/[29.4gb], all_pools {[young] [12.3gb]->[271.3mb]/[12.4gb]}{[survivor] [1.5gb]->[378.6mb]/[1.5gb]}{[old] [5gb]->[8.1gb]/[15.5gb]}

[2019-01-26 12:00:03,996][WARN ][monitor.jvm ] [metrics-datastore-2] [gc][young][153406][3306] duration [14.4s], collections [1]/[14.9s], total [14.4s]/[15.5m], memory [20.4gb]->[9.7gb]/[29.4gb], all_pools {[young] [12.3gb]->[24.4mb]/[12.4gb]}{[survivor] [378.6mb]->[1.5gb]/[1.5gb]}{[old] [7.7gb]->[8.1gb]/[15.5gb]}
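For reference, the same heap pool and GC counters shown in these log lines can be polled from the nodes stats API (a minimal sketch; the node name is taken from the log lines above and localhost:9200 is assumed):

# Heap pool sizes and GC counts/times for the problem node, can be run periodically
curl -s 'localhost:9200/_nodes/metrics-datastore-2/stats/jvm?pretty'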

Hi

Can anyone please help me with this? I am continuously hitting this issue and am blocked. Any guidance on how to proceed would help.

Is the amount of data shown below too much for 3 ES data nodes, each with a 31GB heap plus 32GB left for the Lucene/filesystem cache?

curl localhost:9200/_cat/indices
green open metrics-2019.02.13   3 2   5799316 0  10.1gb   3.3gb 
green open metrics-2019.02.06   3 2   4400282 0   9.2gb     3gb 
green open metrics-2019.02.07   3 2  10657640 0  22.3gb   7.4gb 
green open metrics-2019.02.08   3 2  10652504 0  22.4gb   7.4gb 
green open metrics-2019.02.09   3 2  10652500 0  22.6gb   7.5gb 
green open metrics-2019.02.07-1 3 2  73252166 0  40.2gb  13.4gb 
green open logs-2019.02.13.13   3 2  26727277 0  24.5gb   8.1gb 
green open metrics-2019.02.06-1 3 2  30291630 0  16.4gb   5.4gb 
green open logs-2019.02.13.14   3 2  20068073 0  18.4gb   6.2gb 
green open metrics-2019.02.09-1 3 2  74120995 0  40.8gb  13.5gb 
green open logs-2019.02.13.15   3 2    278152 0 201.3mb 104.3mb 
green open metrics-2019.02.08-1 3 2  74500069 0  40.7gb  13.6gb 
green open metrics-2019.02-10   3 2  64485524 0  78.1gb  26.1gb 
green open metrics-2019.02.10   3 2  10652206 0  22.7gb   7.5gb 
green open logs-2019.02.13.10   3 2  26697205 0  24.5gb   8.1gb 
green open logs-2019.02.13.11   3 2  22676059 0  20.8gb   6.9gb 
green open metrics-2019.02.11   3 2  10525632 0  22.2gb   7.4gb 
green open metrics-2019.02.12   3 2   9606808 0  17.9gb   5.9gb 
green open logs-2019.02.13.12   3 2  22353715 0  22.3gb   6.8gb 
green open metrics-2019.02.10-1 3 2 102968475 0  57.2gb    19gb 
green open metrics-2019.02.12-1 3 2  71503558 0  38.7gb  12.9gb 
green open metrics-2019.02.11-1 3 2  88552753 0  48.9gb  16.2gb 
green open metrics-2019.02.13-1 3 2  67064281 0  34.7gb  11.8gb

We have also set index.codec.bloom.load=false. Is this going to have any impact?
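To put these numbers in per-node terms, the shard and disk distribution can also be compared across the three nodes (a quick sketch, again assuming localhost:9200):

# Disk usage and shard count per node
curl -s 'localhost:9200/_cat/allocation?v'

# Per-shard placement, to check whether the busy node holds more of the hot indices
curl -s 'localhost:9200/_cat/shards?v'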

Hi, we also delete one index every hour. Does Elasticsearch keep any tombstone-like state for deleted indices? If so, what is the safest way to delete an index every hour? Is this something I should worry about?
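(For reference, a delete of one hourly index via the delete index API would look like the following; the index name is just an example taken from the listing above.)

curl -XDELETE 'localhost:9200/logs-2019.02.13.10'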

Elasticsearch 2.4 was released in August 2016, saw about a year of maintenance and bug fixes, and reached the end of its supported life nearly a year ago. I don't think I have a development environment in which I can even build it any more, let alone dig into this kind of memory issue. The best path forward is to upgrade to a more recent, supported, version since these incorporate many improvements in resilience.


We are definitely going to upgrade ES to the latest release, but right now we are at a stage where we cannot upgrade immediately. If you can confirm that these kinds of issues exist in ES 2.4, that would also be helpful to us. Any workaround or guidance would be great.

Hi

Is this issue related to my case?

https://issues.apache.org/jira/browse/LUCENE-7647

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.